A Nextflow pipeline for MHC genotyping from Oxford Nanopore sequencing data using Fluidigm barcode-based demultiplexing. Originally developed by DHO in experiment 31570.
This pipeline processes MHC amplicon data sequenced on Oxford Nanopore Technologies (ONT) platforms to determine MHC alleles. It performs:
- Barcode-based demultiplexing using Fluidigm tags
- Read orientation and quality control
- Primer filtering and trimming
- Reference-based allele calling
- Generation of allele count matrices
- Automated demultiplexing: Uses Fluidigm barcode sequences for sample identification
- Primer-aware processing: Filters and trims reads based on MHC-specific primers
- Full-span alignment: Ensures only complete amplicon sequences are counted
- Flexible output: Generates both detailed per-sample results and aggregated pivot tables
- Multiple environment support: Run locally, on HPC clusters, or in containers
- Unix-like operating system (Linux, macOS)
- Nextflow (>=21.04.0)
- Python (>=3.9)
- 8 GB RAM minimum (more recommended for large datasets)
- 4 CPU cores recommended
The pipeline requires the following tools, which can be installed via conda/mamba or pixi:
- vsearch (>=2.21.0)
- bbmap (>=39.00)
- seqkit (>=2.10.0)
- samtools (>=1.17)
- pigz
- Python packages: pandas, pysam, xlsxwriter, biopython, openpyxl, numpy
# Clone the repository
git clone https://github.com/dhoconno/ONT-MHC-genotyper.git
cd ONT-MHC-genotyper
# Install pixi if not already installed
curl -fsSL https://pixi.sh/install.sh | bash
# Install dependencies
pixi install
# Clone the repository
git clone https://github.com/dhoconno/ONT-MHC-genotyper.git
cd ONT-MHC-genotyper
# Create conda environment
conda env create -f environment.yml
# Activate environment
conda activate mhc-genotyping
# Pull the Docker image (once available)
docker pull dhoconno/ont-mhc-genotyper:latest
- Prepare your input files:
  - SUP basecalled and demultiplexed FASTQ files (one per Fluidigm barcode)
  - Sample mapping CSV file
  - Reference FASTA file
  - (Optional) Custom primer sequences
- Create a sample mapping file (sample_mapping.csv):
  tag,GS ID
  FLD0041,Sample_001
  FLD0042,Sample_002
  FLD0043,Sample_003
- Run the pipeline:
  nextflow run workflow/mhc_genotyping.nf \
  --barcode_dir /path/to/barcode/files \
  --reference /path/to/reference.fasta \
  --sample_sheet sample_mapping.csv \
  --outdir results
- --barcode_dir: Directory containing SUP basecalled and demultiplexed FASTQ files (*.fastq.gz), one per Fluidigm barcode
- --reference: Path to reference FASTA file containing MHC allele sequences
- --sample_sheet: CSV file mapping Fluidigm barcode tags to sample names

- --outdir: Output directory (default: results)
- --primers: FASTA file with primer sequences (default: ref/mhc_specific_primers.fa)
- --fluidigm_barcodes: Fluidigm barcode file (default: ref/fluidigm.txt)
- --min_reads: Minimum reads for allele calling (default: 10)
- --mismatch: Maximum mismatches allowed in primer matching (default: 2)

- --max_memory: Maximum memory (default: '8.GB')
- --max_cpus: Maximum CPUs (default: 4)
- --max_time: Maximum time (default: '12.h')
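To make the `--mismatch` budget concrete, here is a minimal Python sketch of substitution-tolerant prefix matching. It only illustrates what "maximum mismatches" means; it is not the pipeline's implementation, which may rely on a dedicated tool and may also handle indels or reverse-complemented reads. The function names and the sequences are hypothetical placeholders.

```python
def count_mismatches(read_prefix, primer):
    """Count positions where two equal-length sequences differ (Hamming distance)."""
    return sum(1 for r, p in zip(read_prefix, primer) if r != p)


def starts_with_primer(read, primer, max_mismatch=2):
    """Return True if the read begins with the primer within the mismatch budget."""
    if len(read) < len(primer):
        return False
    return count_mismatches(read[: len(primer)], primer) <= max_mismatch


# Placeholder sequences for illustration only; these are not real MHC primers.
primer = "ACGTACGTAC"
two_off = "ACGTTCGTTCGGG"  # 2 substitutions relative to the primer

for budget in (1, 2):
    print(budget, starts_with_primer(two_off, primer, budget))
```

Under these assumptions, a read with two substitutions in the primer region is kept at the default budget of 2 and rejected at 1.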
- Standard FASTQ format (can be gzipped)
- Must be SUP basecalled and demultiplexed
- Named by Fluidigm barcode (e.g., FLD0041.fastq.gz)
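Because files are matched to samples by barcode name, a quick cross-check of file names against sample sheet tags can catch mismatches before a run. The sketch below is not part of the pipeline; the helper names and the file listing are hypothetical, and it assumes only the `.fastq.gz` and `.fastq` suffixes.

```python
from pathlib import Path


def barcode_from_filename(path):
    """Return the barcode name for `FLD0041.fastq.gz`-style names, else None."""
    name = Path(path).name
    for suffix in (".fastq.gz", ".fastq"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return None


def compare_barcodes(fastq_paths, sheet_tags):
    """Return (tags with no FASTQ file, FASTQ barcodes missing from the sheet)."""
    found = {b for b in map(barcode_from_filename, fastq_paths) if b}
    tags = set(sheet_tags)
    return sorted(tags - found), sorted(found - tags)


# Hypothetical listing; in practice, something like Path(barcode_dir).glob("*.fastq.gz").
files = ["input/FLD0041.fastq.gz", "input/FLD0042.fastq.gz", "input/notes.txt"]
missing, unmapped = compare_barcodes(files, ["FLD0041", "FLD0043"])
print("tags without a FASTQ file:", missing)
print("FASTQ files without a tag:", unmapped)
```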
tag,GS ID
FLD0041,Sample_001
FLD0042,Sample_002
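A malformed sample sheet is an easy mistake to make, so it may be worth validating the file before launching. The following is a minimal sketch using only the Python standard library. It assumes the header is exactly `tag,GS ID` as in the example above; `read_sample_sheet` is a hypothetical helper, not something the pipeline provides.

```python
import csv
import io


def read_sample_sheet(handle):
    """Return {tag: sample_name}, raising ValueError on malformed input."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != ["tag", "GS ID"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    mapping = {}
    for row in reader:
        # DictReader fills missing trailing fields with None, so guard before strip().
        tag = (row["tag"] or "").strip()
        sample = (row["GS ID"] or "").strip()
        if not tag or not sample:
            raise ValueError(f"empty field on line {reader.line_num}")
        if tag in mapping:
            raise ValueError(f"duplicate tag: {tag}")
        mapping[tag] = sample
    return mapping


example = io.StringIO("tag,GS ID\nFLD0041,Sample_001\nFLD0042,Sample_002\n")
print(read_sample_sheet(example))
```

For a file on disk, the same function can be called as `read_sample_sheet(open("sample_mapping.csv", newline=""))`.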
Standard FASTA format with MHC allele sequences:
>Mamu-A1*001:01
ATGCGGGTCACGGCGCCCCGAACCCTCCTCCTGCTGCTCTCGGCGGCCCTGGCCCTGACCGAGACCTGGGCCGGCTCCCACTCCATGAGGTATTTCTCCACATCCGTGTCCCGGCCCGGCCGCGGGGAGCCCCGCTTCATCGCCGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCCGCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGGCCGGAGTATTGGGACCGGGAGACACGGAACGCCAAGGGCCACGCACAGACTGACCGAGAGAACCTGCGGATCGCGCTCCGCTACTACAACCAGAGCGAGGCCGGGTCTCACACCCTCCAGAGGATGTACGGCTGCGACGTGGGGCCGGACGGGCGCCTCCTCCGCGGGCATGACCAGTCCGCCTACGACGGCAAGGATTACATCGCCCTGAACGAGGACCTGCGCTCCTGGACCGCCGCGGACACGGCGGCTCAGATCACCCAGCGCAAGTTGGAGGCGGCCCGTGCGGCGGAGCAGCTGAGAGCCTACCTGGAGGGCACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAGGAGACGCTGCAGCGCGCGGAACACCCAAAGACACACGTGACCCACCACCCCCTCTCTGACCATGAGGCCACCCTGAGGTGCTGGGCCCTGGGCTTCTACCCTGCGGAGATCACACTGACCTGGCAGCGGGATGGCGAGGACCAAACTCAGGACACCGAGCTTGTGGAGACCAGGCCAGCAGGAGATGGAACCTTCCAGAAGTGGGCAGCTGTGGTGGTGCCTTCTGGAGAAGAGCAGAGATACACGTGCCATGTGCAGCACGAGGGGCTGCCGGAGCCCCTCACCCTGAGATGGGAGCCGTCTTCCCAGTCCACCGTCCCCATCGTGGGCATTGTTGCTGGCCTGGCTGTCCTAGCAGTTGTGGTCATCGGAGCTGTGGTCGCTGCTGTGATGTGTAGGAGGAAGAGCTCAGG
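Duplicate or empty sequence names in the reference would make allele counts hard to interpret, so a quick check of a custom FASTA may be useful. This is a minimal standard-library sketch, not the pipeline's own parser. It takes the first word of each header as the name, and the short sequences in the demo are placeholders rather than real alleles.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into {name: sequence}, rejecting duplicate names."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line without a name")
            name = fields[0]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before the first header")
        else:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


# Placeholder sequences for illustration only.
demo = io.StringIO(">Mamu-A1*001:01\nATGC\nGGTC\n>Mamu-B*001:01\nACGT\n")
for allele, seq in read_fasta(demo).items():
    print(allele, len(seq))
```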
Tab-separated file with Fluidigm barcode names and sequences:
Mid Name Mid Sequence
FLD0041 GTATGAGCAC
FLD0042 CGAGTGCTGT
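If you supply your own barcode file through `--fluidigm_barcodes`, a small check of its layout can save a failed run. The sketch below assumes the two tab-separated columns and the header text shown above; `read_barcodes` is a hypothetical helper, and the pipeline may be more or less strict than this.

```python
import csv
import io


def read_barcodes(handle):
    """Return {barcode_name: sequence} from a tab-separated barcode file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header != ["Mid Name", "Mid Sequence"]:
        raise ValueError(f"unexpected header: {header}")
    barcodes = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 columns on line {reader.line_num}")
        name, seq = row[0].strip(), row[1].strip().upper()
        if not seq or set(seq) - set("ACGT"):
            raise ValueError(f"non-ACGT barcode for {name}: {seq!r}")
        barcodes[name] = seq
    return barcodes


demo = io.StringIO("Mid Name\tMid Sequence\nFLD0041\tGTATGAGCAC\nFLD0042\tCGAGTGCTGT\n")
print(read_barcodes(demo))
```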
The pipeline generates organized output in the following structure:
results/
├── 11_aggregated/
│ ├── all_samples_counts.csv # Aggregated allele counts
│ └── sample_summary.txt # Summary statistics
├── 12_pivot_table/
│ ├── allele_counts_pivot.xlsx # Excel pivot table
│ └── pivot_summary.txt # Pivot table summary
└── pipeline_info/
├── execution_report.html # Detailed execution report
├── execution_timeline.html # Visual timeline
└── execution_trace.txt # Resource usage trace
- all_samples_counts.csv: CSV file with allele counts per sample
  sample,allele,count
  Sample_001,Mamu-A1*001:01,1523
  Sample_001,Mamu-B*001:01,892
- allele_counts_pivot.xlsx: Excel spreadsheet with samples as columns and alleles as rows
- execution_report.html: Interactive HTML report with pipeline statistics
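The relationship between the long-format CSV and the pivot table (alleles as rows, samples as columns) can be sketched in a few lines of standard-library Python. This is only an illustration of the reshaping, assuming the `sample,allele,count` columns shown above; it is not the code the pipeline uses to build the Excel file, and the Sample_002 row in the demo is made up.

```python
import csv
import io


def pivot_counts(handle):
    """Return (samples, table) with table[allele][sample] = summed count."""
    samples, table = [], {}
    for row in csv.DictReader(handle):
        sample, allele = row["sample"], row["allele"]
        if sample not in samples:
            samples.append(sample)
        counts = table.setdefault(allele, {})
        counts[sample] = counts.get(sample, 0) + int(row["count"])
    return samples, table


# The first two rows come from the example above; the Sample_002 row is made up.
demo = io.StringIO(
    "sample,allele,count\n"
    "Sample_001,Mamu-A1*001:01,1523\n"
    "Sample_001,Mamu-B*001:01,892\n"
    "Sample_002,Mamu-A1*001:01,640\n"
)
samples, table = pivot_counts(demo)
print("\t".join(["allele"] + samples))
for allele in sorted(table):
    # Alleles not observed in a sample are shown as 0.
    print("\t".join([allele] + [str(table[allele].get(s, 0)) for s in samples]))
```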
# Run with Docker
nextflow run workflow/mhc_genotyping.nf -profile docker --barcode_dir input/
# Run on SLURM cluster
nextflow run workflow/mhc_genotyping.nf -profile slurm --barcode_dir input/
# Debug mode with detailed logging
nextflow run workflow/mhc_genotyping.nf -profile debug --barcode_dir input/
To use your own reference sequences:
- Create a FASTA file with your MHC allele sequences
- Ensure sequence names follow standard nomenclature
- Specify with the --reference parameter
# Stricter primer matching (1 mismatch)
nextflow run workflow/mhc_genotyping.nf \
--barcode_dir input/ \
--mismatch 1
# Higher minimum read threshold
nextflow run workflow/mhc_genotyping.nf \
--barcode_dir input/ \
--min_reads 50
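To preview how a higher threshold might thin the results without rerunning the pipeline, the aggregated counts can be filtered after the fact. This sketch assumes the `sample,allele,count` columns of all_samples_counts.csv and assumes the threshold applies to each sample/allele count. The pipeline may apply `--min_reads` at a different stage, so treat the result as an approximation; the low-count row in the demo is made up.

```python
import csv
import io


def filter_counts(handle, min_reads=10):
    """Keep rows of the aggregated counts CSV whose count is >= min_reads."""
    return [row for row in csv.DictReader(handle) if int(row["count"]) >= min_reads]


# The 12-read row is a made-up example of a low-support call.
demo = io.StringIO(
    "sample,allele,count\n"
    "Sample_001,Mamu-A1*001:01,1523\n"
    "Sample_001,Mamu-B*001:01,12\n"
)
for row in filter_counts(demo, min_reads=50):
    print(row["sample"], row["allele"], row["count"])
```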
- No reads passing filter
  - Check primer sequences match your amplicons
  - Verify read orientation
  - Try increasing the --mismatch parameter
- Memory errors
  - Increase the --max_memory parameter
  - Use -profile slurm for cluster execution
- Missing alleles
  - Verify reference FASTA contains expected sequences
  - Check minimum read threshold
- Check execution_report.html for detailed error messages
- Review execution_trace.txt for resource usage
- Open an issue on GitHub with the error message and trace file
If you use this pipeline in your research, please cite:
ONT-MHC-genotyper: A Nextflow pipeline for MHC genotyping from Oxford Nanopore sequencing data
[Citation details to be added upon publication]
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
For questions or support, please open an issue on GitHub or contact the DHO Lab.