Authors: Farnoosh Abbas Aghababazadeh, Kewei Ni, Nasim Bondar Sahebi
Contact: farnoosh.abbasaghababazadeh@uhn.ca, kewei.ni@uhn.ca, nasim.bondarsahebi@uhn.ca
Description: A distributed framework for multivariable predictive modeling of Immuno-Oncology (IO) response, enabling parallelized model training across multiple datasets using Apache Spark, with strict adherence to data privacy.
This repository implements a distributed Spark-based pipeline for multivariable analysis of immune-related RNA signatures and their predictive power for IO therapy response. Key features include:
- Center-specific training of XGBoost models with no data sharing
- Tree-based model aggregation to build a global model
- Independent model validation using public and private cohorts
- Reproducible and scalable deployment using Pixi, Python, Apache Spark and R/SparkR
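The center-specific training and tree-based aggregation features above can be illustrated with a minimal, library-free Python sketch. It is not the logic of Aggregate_model.py. It assumes each center contributes a list of one-split trees, and that the global model pools every tree and scales leaf values by one over the number of centers. The names Stump, predict_margin, aggregate, and predict_proba, and the feature names sig_1 and sig_2, are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stump:
    """A one-split tree: a toy stand-in for a boosted tree (assumption)."""
    feature: str
    threshold: float
    left_value: float   # margin contribution when x[feature] < threshold
    right_value: float  # margin contribution otherwise

    def margin(self, x: Dict[str, float]) -> float:
        return self.left_value if x[self.feature] < self.threshold else self.right_value


def predict_margin(trees: List[Stump], x: Dict[str, float]) -> float:
    """Sum the raw margins of all trees, as in additive boosting."""
    return sum(t.margin(x) for t in trees)


def aggregate(local_models: List[List[Stump]]) -> List[Stump]:
    """Pool trees from every center, scaling leaves by 1 / number of centers.

    Only tree structures are pooled; no sample-level data is exchanged.
    The averaging rule is an assumption made for this sketch.
    """
    n = len(local_models)
    if n == 0:
        raise ValueError("need at least one local model")
    return [
        Stump(t.feature, t.threshold, t.left_value / n, t.right_value / n)
        for model in local_models
        for t in model
    ]


def predict_proba(trees: List[Stump], x: Dict[str, float]) -> float:
    """Map the summed margin to a response probability with a logistic link."""
    return 1.0 / (1.0 + math.exp(-predict_margin(trees, x)))


# Two hypothetical centers, each holding its own locally trained trees.
center_a = [Stump("sig_1", 0.5, -1.0, 1.0)]
center_b = [Stump("sig_1", 0.3, -0.5, 1.5), Stump("sig_2", 0.0, -0.2, 0.2)]
global_model = aggregate([center_a, center_b])
sample = {"sig_1": 0.4, "sig_2": 0.1}
```

With this rule, the global margin for a sample equals the mean of the per-center margins, so no single center dominates only because it trained more trees.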
Apache Spark is required for distributed model training.
- Install Spark: download and extract Spark 3.2.1 with Hadoop 3.2:
  https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
- Set the Spark environment in your R script (Train_Distributed_XGBoost.r):
      Sys.setenv(SPARK_HOME = "/your/local/path/spark-3.2.1-bin-hadoop3.2")
      .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
- Install required R packages:
      install.packages("SparkR")
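Before launching training, it can help to confirm that SPARK_HOME points at an extracted Spark distribution. The Python sketch below is a hypothetical pre-flight check and is not part of this repository. It assumes the standard layout of the Spark binary distribution, with launcher scripts under bin and the bundled R libraries under R/lib. The function name missing_spark_paths is an assumption.

```python
import os
from pathlib import Path
from typing import List, Optional


def missing_spark_paths(spark_home: Optional[str]) -> List[str]:
    """Return the expected paths that are absent under spark_home.

    An empty list means the layout looks like an extracted Spark distribution.
    """
    if not spark_home:
        return ["SPARK_HOME is not set"]
    root = Path(spark_home)
    # Assumed layout of the binary distribution: bin/spark-submit and R/lib.
    expected = [root / "bin" / "spark-submit", root / "R" / "lib"]
    return [str(p) for p in expected if not p.exists()]


if __name__ == "__main__":
    problems = missing_spark_paths(os.environ.get("SPARK_HOME"))
    if problems:
        print("Spark setup looks incomplete:")
        for p in problems:
            print("  missing:", p)
    else:
        print("SPARK_HOME looks valid")
```

The check only inspects the file layout. It does not start Spark or verify the installed version.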
Distributed_XGBoost/
├── config/                  # Optional YAML configs (not required for MV)
├── data/                    # Raw data, processed objects, results folders
│   ├── rawdata/
│   ├── procdata/
│   └── results/
│       ├── local/
│       ├── global/
│       └── validation/
├── workflow/scripts/        # R and Python scripts for modeling
│   ├── Compute_GeneSigScore.r
│   ├── Create_train_set.r
│   ├── Train_Distributed_XGBoost.r
│   ├── Aggregate_model.py
│   └── Validate_global_model.r
├── docs/                    # Markdown-based documentation
│   └── README.md            # Project overview and setup instructions
└── pixi.toml                # Pixi environment specification
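The data/ folders in the layout above can be created up front. A short Python sketch follows. The helper name make_data_dirs is hypothetical, and the scripts in workflow/scripts/ may already create these folders themselves.

```python
from pathlib import Path
from typing import List

# Sub-folders of data/ taken from the directory layout above.
DATA_SUBDIRS = [
    "rawdata",
    "procdata",
    "results/local",
    "results/global",
    "results/validation",
]


def make_data_dirs(project_root: str = ".") -> List[Path]:
    """Create the data/ sub-folders under project_root and return their paths.

    Existing folders are left untouched, so the call is safe to repeat.
    """
    created = []
    for sub in DATA_SUBDIRS:
        path = Path(project_root) / "data" / sub
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling make_data_dirs() from the repository root leaves existing content in place and only adds folders that are missing.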
Pixi is required to run this project. If you haven't installed it yet, follow the official Pixi installation instructions, then clone the repository:
git clone https://github.com/bhklab/PredictIO-MV-Dist.git
cd PredictIO-MV-Dist
Full documentation will be available in the docs/ folder or via published GitHub Pages.
Start by downloading and organizing the raw input datasets as described in data/rawdata/README.md.
For data download and processing, please refer to the univariable repository:
🔗 https://github.com/bhklab/PredictIO-MV-Dist