ONT-MHC-genotyper

A Nextflow pipeline for MHC genotyping from Oxford Nanopore sequencing data using Fluidigm barcode-based demultiplexing. Originally developed by DHO in experiment 31570.

Overview

This pipeline processes MiSeq amplicon data sequenced on Oxford Nanopore Technologies (ONT) platforms to determine MHC alleles. It performs:

Barcode-based demultiplexing using Fluidigm tags
Read orientation and quality control
Primer filtering and trimming
Reference-based allele calling
Generation of allele count matrices

Features

Automated demultiplexing: Uses Fluidigm barcode sequences for sample identification
Primer-aware processing: Filters and trims reads based on MHC-specific primers
Full-span alignment: Ensures only complete amplicon sequences are counted
Flexible output: Generates both detailed per-sample results and aggregated pivot tables
Multiple environment support: Run locally, on HPC clusters, or in containers
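
The full-span alignment feature can be pictured as a simple coordinate check. The sketch below is an illustration only, not the pipeline's own code: the function name `is_full_span` and the `tolerance` argument are hypothetical, and the coordinates are assumed to be 0-based and half-open (the convention pysam uses for `reference_start` and `reference_end`).

```python
def is_full_span(ref_start, ref_end, ref_length, tolerance=0):
    """Return True if the aligned interval [ref_start, ref_end) covers the
    reference to within `tolerance` bases at each end.

    Hypothetical helper: the pipeline's actual full-span rule may differ.
    """
    if not 0 <= ref_start < ref_end <= ref_length:
        raise ValueError("alignment coordinates fall outside the reference")
    return ref_start <= tolerance and ref_length - ref_end <= tolerance


# An alignment covering all 500 reference bases, then one missing the first 12.
print(is_full_span(0, 500, 500))
print(is_full_span(12, 500, 500))
```

With `tolerance=0` only alignments that reach both ends of the reference allele pass, which is the strictest reading of "complete amplicon".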

Requirements

System Requirements

Unix-like operating system (Linux, macOS)
Nextflow (>=21.04.0)
Python (>=3.9)
8GB RAM minimum (more recommended for large datasets)
4 CPU cores recommended

Software Dependencies

The pipeline requires the following tools, which can be installed via conda/mamba or pixi:

vsearch (>=2.21.0)
bbmap (>=39.00)
seqkit (>=2.10.0)
samtools (>=1.17)
pigz
Python packages: pandas, pysam, xlsxwriter, biopython, openpyxl, numpy

Installation

Option 1: Using Pixi (Recommended)

# Clone the repository
git clone https://github.com/dhoconno/ONT-MHC-genotyper.git
cd ONT-MHC-genotyper

# Install pixi if not already installed
curl -fsSL https://pixi.sh/install.sh | bash

# Install dependencies
pixi install

Option 2: Using Conda/Mamba

# Clone the repository
git clone https://github.com/dhoconno/ONT-MHC-genotyper.git
cd ONT-MHC-genotyper

# Create conda environment
conda env create -f environment.yml

# Activate environment
conda activate mhc-genotyping

Option 3: Using Docker

# Pull the Docker image (once available)
docker pull dhoconno/ont-mhc-genotyper:latest

Quick Start

Prepare your input files:
- SUP basecalled and demultiplexed FASTQ files (one per Fluidigm barcode)
- Sample mapping CSV file
- Reference FASTA file
- (Optional) Custom primer sequences

Create a sample mapping file (sample_mapping.csv):

tag,GS ID
FLD0041,Sample_001
FLD0042,Sample_002
FLD0043,Sample_003

Run the pipeline:

nextflow run workflow/mhc_genotyping.nf \
  --barcode_dir /path/to/barcode/files \
  --reference /path/to/reference.fasta \
  --sample_sheet sample_mapping.csv \
  --outdir results

Pipeline Parameters

Required Parameters

--barcode_dir: Directory containing SUP basecalled and demultiplexed FASTQ files (*.fastq.gz), one per Fluidigm barcode
--reference: Path to reference FASTA file containing MHC allele sequences
--sample_sheet: CSV file mapping Fluidigm barcode tags to sample names

Optional Parameters

--outdir: Output directory (default: results)
--primers: FASTA file with primer sequences (default: ref/mhc_specific_primers.fa)
--fluidigm_barcodes: Fluidigm barcode file (default: ref/fluidigm.txt)
--min_reads: Minimum reads for allele calling (default: 10)
--mismatch: Maximum mismatches allowed in primer matching (default: 2)
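
To make the effect of --mismatch concrete, here is a minimal sketch of substitution-tolerant primer matching and read orientation. It is an assumption about the general technique (a sliding-window Hamming distance), not the pipeline's implementation, which may handle degenerate bases or indels differently. The function names and the example primer are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_primer(read, primer, max_mismatches=2):
    """Return the first 0-based offset where `primer` matches `read` with at
    most `max_mismatches` substitutions, or None if there is no such offset."""
    read, primer = read.upper(), primer.upper()
    for start in range(len(read) - len(primer) + 1):
        window = read[start:start + len(primer)]
        mismatches = sum(a != b for a, b in zip(window, primer))
        if mismatches <= max_mismatches:
            return start
    return None


def orient_read(read, forward_primer, max_mismatches=2):
    """Return the read in forward-primer orientation, or None if the primer
    is found on neither strand."""
    if find_primer(read, forward_primer, max_mismatches) is not None:
        return read
    flipped = reverse_complement(read)
    if find_primer(flipped, forward_primer, max_mismatches) is not None:
        return flipped
    return None


primer = "GATTACAGGC"               # made-up primer, for illustration only
read = "TTGATTTCAGGCAAAA"           # carries the primer with one substitution
print(find_primer(read, primer, max_mismatches=1))
print(find_primer(read, primer, max_mismatches=0))
```

Lowering the mismatch limit rejects reads whose primer region carries sequencing errors, while raising it keeps more reads at the cost of specificity.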

Resource Parameters

--max_memory: Maximum memory (default: '8.GB')
--max_cpus: Maximum CPUs (default: 4)
--max_time: Maximum time (default: '12.h')

Input File Formats

Barcode FASTQ Files

Standard FASTQ format (can be gzipped)
Must be SUP basecalled and demultiplexed
Named by Fluidigm barcode (e.g., FLD0041.fastq.gz)
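
Because files are matched to samples by name, it can help to confirm before a run that every barcode tag has a FASTQ file. The sketch below is a hypothetical pre-flight check, not part of the pipeline; it assumes files are named `<tag>.fastq.gz` or `<tag>.fastq`.

```python
from pathlib import Path


def find_missing_fastqs(tags, filenames):
    """Return the sorted tags that have no `<tag>.fastq.gz` or `<tag>.fastq`
    file among `filenames` (hypothetical helper; naming rule is assumed)."""
    present = set()
    for name in filenames:
        base = Path(name).name
        for suffix in (".fastq.gz", ".fastq"):
            if base.endswith(suffix):
                present.add(base[: -len(suffix)])
                break
    return sorted(set(tags) - present)


tags = ["FLD0041", "FLD0042", "FLD0043"]
files = ["FLD0041.fastq.gz", "FLD0043.fastq.gz", "notes.txt"]
print(find_missing_fastqs(tags, files))
```

For a real directory, the file names could come from `(p.name for p in Path("/path/to/barcode/files").iterdir())`.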

Sample Mapping CSV

tag,GS ID
FLD0041,Sample_001
FLD0042,Sample_002
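
A sample sheet with a misspelled header or a repeated tag is easy to miss by eye. The following sketch validates the two-column layout shown above using only the Python standard library. The function name `load_sample_mapping` is hypothetical and the checks are assumptions about what a well-formed sheet looks like, not the pipeline's own validation.

```python
import csv


def load_sample_mapping(lines):
    """Parse sample-sheet lines into {tag: sample_name}.

    `lines` is any iterable of text lines, for example an open file handle.
    Assumes the header columns `tag` and `GS ID` shown above.
    """
    reader = csv.DictReader(lines)
    missing = {"tag", "GS ID"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing column(s): {sorted(missing)}")
    mapping = {}
    for row in reader:
        tag = (row["tag"] or "").strip()
        sample = (row["GS ID"] or "").strip()
        if not tag or not sample:
            raise ValueError(f"empty field on line {reader.line_num}")
        if tag in mapping:
            raise ValueError(f"duplicate tag: {tag}")
        mapping[tag] = sample
    return mapping


example = ["tag,GS ID\n", "FLD0041,Sample_001\n", "FLD0042,Sample_002\n"]
print(load_sample_mapping(example))
```

To check a file on disk, pass an open handle: `with open("sample_mapping.csv", newline="") as fh: load_sample_mapping(fh)`.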

Reference FASTA

Standard FASTA format with MHC allele sequences:

>Mamu-A1*001:01
ATGCGGGTCACGGCGCCCCGAACCCTCCTCCTGCTGCTCTCGGCGGCCCTGGCCCTGACCGAGACCTGGGCCGGCTCCCACTCCATGAGGTATTTCTCCACATCCGTGTCCCGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCCGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGGCCGGAGTATTGGGACCGGGAGACACGGAACGCCAAGGGCCACGCACAGACTGACCGAGAGAACCTGCGGATCGCGCTCCGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCCTCCAGAGGATGTACGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCCGCGGGCATGACCAGTCCGCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCCGCGGACACGGCGGCTCAGATCACCCAGCGCAAGTTGGAGGCGGCCCGTGCGGCGGAGCAGCTGAGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAGGAGACGCTGCAGCGCGCGGAACACCCAAAGACACACGTGACCCACCACCCCCTCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTGCGGAGATCACACTGACCTGGCAGCGGGATGGCGAGGACCAAACTCAGGACACCGAGCTTGTGGAGACCAGGCCAGCAGGAGATGGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGATACACGTGCCATGTGCAGCACGAGGGGCTGCCGGAGCCCCTCACCCTGAGATGGGAGCCGTCTTCCCAGTCCACCGTCCCCATCGTGGGCATTGTTGCTGGCCTGGCTGTCCTAGCAGTTGTGGTCATCGGAGCTGTGGTCGCTGCTGTGATGTGTAGGAGGAAGAGCTCAGG
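
Duplicate or empty records in a reference file can distort allele counts, so a quick check before running may be worthwhile. This sketch reads FASTA records with the standard library only. The function names are hypothetical, and treating the text up to the first whitespace as the allele name is an assumption.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines.
    The name is the header text up to the first whitespace."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def check_reference(lines):
    """Return a list of problems: duplicate names and empty sequences."""
    problems, seen = [], set()
    for name, seq in read_fasta(lines):
        if name in seen:
            problems.append(f"duplicate name: {name}")
        seen.add(name)
        if not seq:
            problems.append(f"empty sequence: {name}")
    return problems


example = [">Mamu-A1*001:01\n", "ATGCGG\n", "GTCACG\n", ">Mamu-A1*001:01\n", "ATGC\n"]
print(check_reference(example))
```

An empty list means neither problem was found; it does not confirm that the names follow any particular nomenclature.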

Fluidigm Barcode File

Tab-separated file with Fluidigm barcode names and sequences:

Mid Name	Mid Sequence
FLD0041	GTATGAGCAC
FLD0042	CGAGTGCTGT
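
As an illustration of the layout above, this sketch loads the tab-separated barcode table into a dictionary and rejects repeated names or sequences, either of which would make samples indistinguishable. The function name is hypothetical and the uniqueness checks are assumptions, not documented pipeline behavior.

```python
import csv


def load_barcodes(lines):
    """Parse the tab-separated barcode table into {name: sequence}.

    `lines` is any iterable of text lines with the header
    `Mid Name<TAB>Mid Sequence` shown above.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    barcodes = {}
    for row in reader:
        name = (row.get("Mid Name") or "").strip()
        seq = (row.get("Mid Sequence") or "").strip().upper()
        if not name or not seq:
            raise ValueError(f"incomplete row on line {reader.line_num}")
        if name in barcodes:
            raise ValueError(f"duplicate barcode name: {name}")
        if seq in barcodes.values():
            raise ValueError(f"duplicate barcode sequence: {seq}")
        barcodes[name] = seq
    return barcodes


example = ["Mid Name\tMid Sequence\n", "FLD0041\tGTATGAGCAC\n", "FLD0042\tCGAGTGCTGT\n"]
print(load_barcodes(example))
```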

Output Files

The pipeline generates organized output in the following structure:

results/
├── 11_aggregated/
│   ├── all_samples_counts.csv    # Aggregated allele counts
│   └── sample_summary.txt         # Summary statistics
├── 12_pivot_table/
│   ├── allele_counts_pivot.xlsx  # Excel pivot table
│   └── pivot_summary.txt          # Pivot table summary
└── pipeline_info/
    ├── execution_report.html      # Detailed execution report
    ├── execution_timeline.html    # Visual timeline
    └── execution_trace.txt        # Resource usage trace

Key Output Files

all_samples_counts.csv: CSV file with allele counts per sample

sample,allele,count
Sample_001,Mamu-A1*001:01,1523
Sample_001,Mamu-B*001:01,892
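
The long-format rows above can be reshaped into an alleles-by-samples matrix like the one in the Excel pivot table. The sketch below does this with the standard library. It is not the pipeline's pivot step, and `min_count` is a hypothetical post-hoc threshold rather than the --min_reads parameter itself.

```python
import csv
from collections import defaultdict


def pivot_counts(lines, min_count=0):
    """Return (samples, table) with table[allele][sample] = count.
    Rows with count below `min_count` are dropped, but their sample
    still appears as a column."""
    samples, table = [], defaultdict(dict)
    for row in csv.DictReader(lines):
        if row["sample"] not in samples:
            samples.append(row["sample"])
        count = int(row["count"])
        if count >= min_count:
            table[row["allele"]][row["sample"]] = count
    return samples, dict(table)


def to_rows(samples, table):
    """Return a header row plus one row per allele, using 0 where an allele
    was not counted in a sample."""
    rows = [["allele", *samples]]
    for allele in sorted(table):
        rows.append([allele, *(table[allele].get(s, 0) for s in samples)])
    return rows


example = [
    "sample,allele,count\n",
    "Sample_001,Mamu-A1*001:01,1523\n",
    "Sample_001,Mamu-B*001:01,892\n",
    "Sample_002,Mamu-A1*001:01,7\n",
]
for row in to_rows(*pivot_counts(example, min_count=10)):
    print(row)
```

Keeping every sample as a column, even when all of its rows fall below the threshold, makes samples with no calls visible instead of silently absent.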

allele_counts_pivot.xlsx: Excel spreadsheet with samples as columns and alleles as rows
execution_report.html: Interactive HTML report with pipeline statistics

Advanced Usage

Running with Different Profiles

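# Run with Docker
nextflow run workflow/mhc_genotyping.nf -profile docker \
  --barcode_dir input/ \
  --reference /path/to/reference.fasta \
  --sample_sheet sample_mapping.csv

# Run on SLURM cluster
nextflow run workflow/mhc_genotyping.nf -profile slurm \
  --barcode_dir input/ \
  --reference /path/to/reference.fasta \
  --sample_sheet sample_mapping.csv

# Debug mode with detailed logging
nextflow run workflow/mhc_genotyping.nf -profile debug \
  --barcode_dir input/ \
  --reference /path/to/reference.fasta \
  --sample_sheet sample_mapping.csv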

Custom Reference Files

To use your own reference sequences:

Create a FASTA file with your MHC allele sequences
Ensure sequence names follow standard MHC nomenclature (e.g., Mamu-A1*001:01)
Pass the file to the pipeline with the --reference parameter

Adjusting Filtering Parameters

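# Stricter primer matching (1 mismatch)
nextflow run workflow/mhc_genotyping.nf \
  --barcode_dir input/ \
  --reference /path/to/reference.fasta \
  --sample_sheet sample_mapping.csv \
  --mismatch 1

# Higher minimum read threshold
nextflow run workflow/mhc_genotyping.nf \
  --barcode_dir input/ \
  --reference /path/to/reference.fasta \
  --sample_sheet sample_mapping.csv \
  --min_reads 50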

Troubleshooting

Common Issues

No reads passing filter
- Check that the primer sequences match your amplicons
- Verify read orientation
- Try increasing the --mismatch parameter

Memory errors
- Increase the --max_memory parameter
- Use -profile slurm for cluster execution

Missing alleles
- Verify that the reference FASTA contains the expected sequences
- Check the minimum read threshold (--min_reads)

Getting Help

Check execution_report.html (in the pipeline_info/ output folder) for detailed error messages
Review execution_trace.txt (same folder) for resource usage
Open an issue on GitHub with the error message and trace file

Citation

If you use this pipeline in your research, please cite:

ONT-MHC-genotyper: A Nextflow pipeline for MHC genotyping from Oxford Nanopore sequencing data
[Citation details to be added upon publication]

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Contact

For questions or support, please open an issue on GitHub or contact the DHO Lab.
