NVD2 is a Nextflow pipeline focused primarily, though not exclusively, on finding human viruses in metagenomic samples. It leverages the battle-tested NCBI toolchain, including STAT and BLAST, to identify human viruses while minimizing false positives and false negatives. For more general-purpose classification, NVD2 implements a GOTTCHA2 subworkflow, which uses a carefully curated database of taxonomically diagnostic reference sequences spanning well-characterized taxa across the tree of life.
NVD2 was designed from the ground up to handle enormous datasets and performs particularly well with complex Illumina deep sequencing datasets like those from wastewater sewersheds. To perform well with these kinds of datasets, it must:
- Handle highly fragmented genome recovery.
- Be resilient to wild fluctuations in depth-of-coverage.
- Resolve ambiguities between closely related organisms with high sequence identity.
Many pipelines for classifying mixtures of organisms exist, but none met these criteria well enough for human viruses; hence, NVD2 was born!
In addition to wastewater sequenced with Illumina, NVD2 also supports datasets sequenced on Oxford Nanopore instruments and has been tested on Nanopore libraries generated from air samples.
NVD2 set-up is a multi-phase process, including dependency setup, reference database setup, sample data setup, and run command construction.
A minimal NVD2 setup requires that Nextflow and Docker are installed. With both set up, the remainder of the pipeline's dependencies will be provided by a Docker image.
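You can confirm both prerequisites are available from the command line before proceeding:

```bash
# confirm the two hard requirements are installed and on your PATH
nextflow -version
docker --version
```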
The pipeline also ships with a `pyproject.toml` and `pixi.lock` that can be used to instantiate a reproducible environment with the Pixi environment manager, a fast, modern system for managing Conda dependencies. Command-line users can `cd` into the repo project root directory and enter an environment with `pixi shell --frozen`. More information about using Pixi environments is available in the Pixi docs.
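As a concrete sketch (assuming the repository is hosted on GitHub under the `dhoconno/nvd` handle used in the run commands below):

```bash
# clone the repository and enter the project root
git clone https://github.com/dhoconno/nvd.git
cd nvd

# enter the locked Pixi environment (requires Pixi to be installed)
pixi shell --frozen
```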
> [!WARNING]
> The provided Pixi environment will not include NCBI's STAT tool, as it's not distributed through Conda or PyPI and must instead be built from source. As such, to use the `nvd` subworkflow that implements STAT with two phases of BLAST verification, users must use the pre-built Docker or Apptainer containers for this project. Alternatively, the bundled `Containerfile` can be built with Docker or Podman to provide all dependencies, including STAT as well as those managed by Pixi.
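For example, a local image could be built like so (the `nvd2:local` tag is arbitrary; substitute `podman` for `docker` if preferred):

```bash
# build all pipeline dependencies, including STAT, into a local image
docker build -f Containerfile -t nvd2:local .
```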
The hardest/slowest part of NVD2 setup is getting the necessary reference datasets organized, though ideally, you'll only need to do it once.
Currently, NVD2 uses three datasets:
- A BLAST database built from the NCBI core_nt database. This database is used to triage putative taxonomic classifications as they make their way through the pipeline.
- An NCBI taxonomic classification database used by STAT.
- A GOTTCHA2 curated database.
Rather than bundling all three together, we've made each available individually so that users can choose which subworkflows they'd like to run. To run the STAT+BLAST subworkflow (named `nvd` after NVD2's predecessor), users will need databases 1 and 2 above. To run GOTTCHA2, they will need the third.
All three are publicly available via `wget` or `curl` from the O'Connor Laboratory's LabKey server, like so:

```bash
# download the STAT database
wget https://dholk.primate.wisc.edu/_webdav/dho/projects/lungfish/InfinitePath/public/%40files/release-v2.0.0/stat_db.tar.gz

# download the BLAST database
wget https://dholk.primate.wisc.edu/_webdav/dho/projects/lungfish/InfinitePath/public/%40files/release-v2.0.0/blast_db.tar.gz

# download the GOTTCHA2 database
wget https://dholk.primate.wisc.edu/_webdav/dho/projects/lungfish/InfinitePath/public/%40files/release-v2.0.0/gottcha2.tar.gz
```

(`curl -fsSL -O` can be substituted for `wget` in the above commands if desired; the `-O` flag tells `curl` to keep the remote filename.)
> [!IMPORTANT]
> We strongly recommend users verify their database downloads against the MD5 hashes available in `checksum.txt`, which can be downloaded with
> `wget https://dholk.primate.wisc.edu/_webdav/dho/projects/lungfish/InfinitePath/public/%40files/release-v2.0.0/checksum.txt`.
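Assuming `checksum.txt` lists its hashes in the standard `md5sum` format (an assumption about the file's layout), verification is a one-liner from the download directory:

```bash
# compare each downloaded archive against its published MD5 hash
md5sum -c checksum.txt
```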
Updates to the reference databases will also be reflected in `CHANGELOG.md`, available at the same endpoint as the databases and checksum text file.
Also available at that endpoint, if desired, is a pre-built Apptainer image file for use on HPC clusters and other Linux environments:
```bash
wget https://dholk.primate.wisc.edu/_webdav/dho/projects/lungfish/InfinitePath/public/%40files/release-v2.0.0/nvd2.sif
```
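The image can then be handed to Nextflow at runtime; a minimal sketch, assuming your Nextflow version supports the `-with-apptainer` run option:

```bash
# execute all pipeline processes inside the downloaded image;
# append the samplesheet and database flags shown in the run command below
nextflow run dhoconno/nvd -with-apptainer nvd2.sif --tools all
```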
All TAR-archived reference databases must be extracted into directories with `tar xvf` before use with the NVD2 pipeline.
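For example, for the three archives downloaded above (assuming each archive unpacks into its own top-level directory):

```bash
# extract each reference database
tar xvf stat_db.tar.gz
tar xvf blast_db.tar.gz
tar xvf gottcha2.tar.gz
```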
With the environment and source code set up, you'll next need to organize the file paths to your input FASTQ-formatted sequencing read files into a simple CSV-formatted samplesheet. It must look like this:

```csv
sample_id,srr,platform,fastq1,fastq2
nanopore_test,,ont,nanopore.fastq.gz,
illumina_test,,illumina,illumina_R1.fastq.gz,illumina_R2.fastq.gz
sra_test,SRR33296246,,,
```
Note that this example samplesheet is provided in the repo's assets directory for convenience.
With that, you're ready to run the pipeline!
Before you construct your run command, first answer the following questions:
- Are you interested only in human viruses? If so, include `--tools nvd` in your `nextflow run` command (more on this below).
- Are you interested in whatever's in your sample and not in human viruses in particular? If so, use `--tools gottcha`.
- Are you interested in both? If so, use `--tools nvd,gottcha`.
- Do you want the kitchen sink? We use `--tools all` for that.
Beyond the answers to these questions, most of a typical NVD2 run command is simply devoted to configuring paths to the required reference database files. Note that we plan to use presets to simplify common use-case run commands, but for now, an example run command using all subworkflows looks like:
```bash
nextflow run dhoconno/nvd \
--tools all \
--samplesheet $YOUR_SAMPLESHEET \
--gottcha2_db $YOUR_REFERENCE_PATH/gottcha2/gottcha_db.species.fna \
--blast_db $YOUR_REFERENCE_PATH/blast_db \
--blast_db_prefix core_nt \
--stat_index $YOUR_REFERENCE_PATH/STAT_db/tree_index.dbs \
--stat_dbss $YOUR_REFERENCE_PATH/STAT_db/tree_filter.dbss \
--stat_annotation $YOUR_REFERENCE_PATH/STAT_db/tree_filter.dbss.annotation \
--human_virus_taxlist $YOUR_REFERENCE_PATH/STAT_db/human_viruses_taxlist.txt \
--experiment_id github_readme_test
```
(This command assumes you have set `YOUR_SAMPLESHEET` and `YOUR_REFERENCE_PATH` to the paths to your samplesheet CSV and the parent directory of your extracted reference databases, respectively; you can replace them with whatever valid paths you have used for these files.)
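As a sketch of a narrower invocation that answers the first question above (assuming the GOTTCHA2 database flag can be omitted when the `gottcha` subworkflow isn't selected; verify against the pipeline's parameter validation):

```bash
# human-virus detection only: the nvd subworkflow with its STAT and BLAST databases
nextflow run dhoconno/nvd \
--tools nvd \
--samplesheet $YOUR_SAMPLESHEET \
--blast_db $YOUR_REFERENCE_PATH/blast_db \
--blast_db_prefix core_nt \
--stat_index $YOUR_REFERENCE_PATH/STAT_db/tree_index.dbs \
--stat_dbss $YOUR_REFERENCE_PATH/STAT_db/tree_filter.dbss \
--stat_annotation $YOUR_REFERENCE_PATH/STAT_db/tree_filter.dbss.annotation \
--human_virus_taxlist $YOUR_REFERENCE_PATH/STAT_db/human_viruses_taxlist.txt \
--experiment_id human_virus_only_example
```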
Coming soon! For now, see our example run commands docs for some of the ways you might configure NVD2 in the command line and our contributor guide for how we recommend you work on the NVD2 codebase.
- Multi-platform sequencing support: Seamlessly processes both Illumina and Oxford Nanopore data with platform-specific optimizations
- Dual classification engines: Combines NCBI STAT+BLAST for human virus detection with GOTTCHA2 for comprehensive taxonomic profiling
- Industrial-scale data processing: Built from the ground up to handle massive wastewater datasets with complex read mixtures and variable coverage depths
- Smart contig assembly: Automatically assembles reads with SPAdes and filters contigs for optimal classification accuracy
- Two-phase BLAST verification: Uses both megablast and blastn with intelligent filtering to minimize false positives
- Advanced taxonomic filtering: Sophisticated lineage-based filtering with adjustable stringency for precise organism identification
- Human read scrubbing: Built-in capability to remove human sequences for privacy-compliant public data sharing
- Automated data deduplication: CLUMPIFY workflow removes PCR duplicates and optical duplicates to improve analysis accuracy and reduce disk usage for big datasets
- Enterprise data integration: Native LabKey LIMS integration with WebDAV file uploads and structured metadata management
- Comprehensive quality control: Read counting, contig metrics, and BLAST hit validation throughout the pipeline
- Flexible workflow orchestration: Mix-and-match subworkflows (nvd, gottcha, clumpify) based on research needs
- Production-ready deployment: Docker/Apptainer containerization with Pixi environment management for reproducible execution
- Intelligent error handling: Robust retry logic and graceful failure modes for reliable high-throughput processing
- SRA integration: Direct processing of NCBI SRA datasets alongside local FASTQ files
- Real-time validation: Pre-flight checks for database integrity, API connectivity, and experiment ID uniqueness
- Multi-format output: Generates taxonomic reports, FASTA sequences, and structured CSV files for downstream analysis
Coming soon!
- Marc Johnson and Shelby O'Connor, our partners in innovative pathogen monitoring from environmental samples.
- Kenneth Katz, NCBI, for developing NCBI STAT, maintaining pre-built databases for STAT, and helpful discussions
- C. Titus Brown, for helpful discussions of using k-mer classifiers as part of metagenomic workflows
- Development funded by Inkfish
See `LICENSE` for more information. NVD2's predecessor used the copyleft GPLv3 license, which means we have to as well. You're welcome to use NVD2, share it, and modify it, as long as any changes you distribute are also shared under the same license.
By contributing to this project, you agree that your code will be released under GPLv3. This means:
- Share alike – if you share modified versions of this project, they must also be under GPLv3.
- Freedom to use – anyone can use the code for personal, educational, or commercial purposes, as long as they respect the license.
- Source availability – if you distribute the software (original or modified), you must also make the source code available under GPLv3.
- Community contributions – your pull requests and patches automatically become part of the GPLv3-licensed project.
- Commercial use - You’re free to use NVD2 code in commercial contexts, but you can’t combine it with proprietary software without open-sourcing that software under GPLv3 too.
- No Warranty and Liability - The software is provided as-is, without any warranty or liability.