Distributed Multivariable Predictive Modelling for Immuno-Oncology Response

Authors: Farnoosh Abbas Aghababazadeh, Kewei Ni, Nasim Bondar Sahebi

Contact: farnoosh.abbasaghababazadeh@uhn.ca, kewei.ni@uhn.ca, nasim.bondarsahebi@uhn.ca

Description: A distributed framework for multivariable predictive modeling of Immuno-Oncology (IO) response, enabling parallelized model training across multiple datasets using Apache Spark, with strict adherence to data privacy.

Project Overview

This repository implements a distributed Spark-based pipeline for multivariable analysis of immune-related RNA signatures and their predictive power for IO therapy response. Key features include:

Center-specific training of XGBoost models with no data sharing
Tree-based model aggregation to build a global model
Independent model validation using public and private cohorts
Reproducible and scalable deployment using Pixi, Python, Apache Spark and R/SparkR
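
To make the idea of tree-based aggregation concrete, here is a minimal, dependency-free sketch. It is an illustration only and is not the logic of Aggregate_model.py: the stump dictionary format, the function names, and the rule of averaging local margins are all assumptions chosen for the example, and the leaf values are made up.

```python
import math

# Hypothetical representation: each tree is a decision stump stored as a dict.
def predict_tree(tree, row):
    """Return the leaf value one stump assigns to a feature row."""
    if row[tree["feature"]] < tree["threshold"]:
        return tree["left"]
    return tree["right"]

def predict_margin(trees, row):
    """Sum leaf values over all trees (the raw boosted score)."""
    return sum(predict_tree(t, row) for t in trees)

def aggregate(local_models):
    """Pool the trees from every center, scaling leaves by 1 / n_centers.

    With this scaling, the pooled model's margin equals the mean of the
    local margins. This is one possible rule, assumed for illustration.
    """
    n = len(local_models)
    pooled = []
    for trees in local_models:
        for t in trees:
            pooled.append({**t, "left": t["left"] / n, "right": t["right"] / n})
    return pooled

def predict_proba(trees, row):
    """Map the margin to a response probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-predict_margin(trees, row)))

# Two illustrative centers; only the trees are shared, never the data.
center_a = [{"feature": 0, "threshold": 0.5, "left": -0.4, "right": 0.6}]
center_b = [
    {"feature": 0, "threshold": 0.3, "left": -0.2, "right": 0.2},
    {"feature": 1, "threshold": 1.0, "left": 0.1, "right": -0.3},
]
global_model = aggregate([center_a, center_b])
```

The point of the sketch is that each center contributes only fitted trees, so a global model can be assembled without moving patient-level data between sites.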

Spark Environment Setup

Apache Spark is required for distributed model training.

Install Spark:

Download and extract Spark 3.2.1 with Hadoop 3.2:

https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz

Set the Spark environment at the top of your R script (Train_Distributed_XGBoost.r), replacing the placeholder path with the directory where you extracted Spark:

Sys.setenv(SPARK_HOME = "/your/local/path/spark-3.2.1-bin-hadoop3.2")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

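Install required R packages. SparkR ships with the Spark distribution under R/lib, so the .libPaths() call above is normally enough to load it with library(SparkR). If you install it separately instead, use a SparkR version that matches your Spark version:
```
install.packages("SparkR")
```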
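
Before launching a training run, it can help to confirm that the Spark directory has the expected layout. The following is a small sketch, assuming the Spark 3.2.1 / Hadoop 3.2 build named above; the helper names (spark_dist_name, spark_archive_url, sparkr_lib_dir, check_spark_home) are hypothetical and not part of this repository.

```python
import os

def spark_dist_name(spark_version="3.2.1", hadoop_profile="3.2"):
    """Directory and archive base name used by Spark binary releases."""
    return f"spark-{spark_version}-bin-hadoop{hadoop_profile}"

def spark_archive_url(spark_version="3.2.1", hadoop_profile="3.2"):
    """Build the Apache archive URL for the given release."""
    name = spark_dist_name(spark_version, hadoop_profile)
    return f"https://archive.apache.org/dist/spark/spark-{spark_version}/{name}.tgz"

def sparkr_lib_dir(spark_home):
    """Mirror of file.path(SPARK_HOME, "R", "lib") from the R snippet."""
    return os.path.join(spark_home, "R", "lib")

def check_spark_home(spark_home):
    """Return a list of problems; an empty list means the layout looks usable."""
    problems = []
    if not os.path.isdir(spark_home):
        problems.append(f"SPARK_HOME does not exist: {spark_home}")
    elif not os.path.isdir(os.path.join(sparkr_lib_dir(spark_home), "SparkR")):
        problems.append(f"SparkR not found under {sparkr_lib_dir(spark_home)}")
    return problems
```

For example, `check_spark_home("/your/local/path/spark-3.2.1-bin-hadoop3.2")` returns an empty list when the extracted directory contains the bundled SparkR package.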

Repository Structure

Distributed_XGBoost/
├── config/              # Optional YAML configs (not required for MV)
├── data/                # Raw data, processed objects, results folders
│   ├── rawdata/
│   ├── procdata/
│   └── results/
│       ├── local/
│       ├── global/
│       └── validation/
├── workflow/scripts/    # R and Python scripts for modeling
│   ├── Compute_GeneSigScore.r
│   ├── Create_train_set.r
│   ├── Train_Distributed_XGBoost.r
│   ├── Aggregate_model.py
│   └── Validate_global_model.r
├── docs/                # Markdown-based documentation
│   └── README.md        # Project overview and setup instructions 
└── pixi.toml            # Pixi environment specification
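
If the data folders are not present after cloning, they can be created to match the tree above. This is a minimal sketch using only the standard library; the function name create_data_layout is hypothetical, and the subfolder names are taken directly from the structure shown.

```python
from pathlib import Path

# Subfolders of data/ as listed in the repository structure.
DATA_SUBDIRS = [
    "rawdata",
    "procdata",
    "results/local",
    "results/global",
    "results/validation",
]

def create_data_layout(root):
    """Create data/ and its subfolders under root; return the paths created.

    Existing folders are left untouched, so the call is safe to repeat.
    """
    data_dir = Path(root) / "data"
    created = []
    for sub in DATA_SUBDIRS:
        path = data_dir / sub
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_data_layout(".")` from the repository root would set up the folders in place.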

Set Up

Prerequisites

Pixi is required to run this project. If you haven't installed it yet, follow the official Pixi installation instructions.

Getting Started

Clone and Run

git clone https://github.com/bhklab/PredictIO-MV-Dist.git
cd PredictIO-MV-Dist

Documentation

Full documentation will be available in the docs/ folder or via published GitHub Pages.

Start by downloading and organizing the raw input datasets as described in data/rawdata/README.md.

For data download and processing, please refer to the univariable repository:
🔗 https://github.com/bhklab/PredictIO-MV-Dist
