Relative abundance calculations when using multiple reference databases #61

jamesck2 · 2025-07-24T17:35:19Z

Hi Sylph team,

Really nice tool! I'm interested in the feature that allows taxonomic profiling using >1 metagenome and >1 database, as described here. I was able to do this successfully with 66 soil metagenome samples and three reference databases (GlobDB, Fungal RefSeq, and IMG/VR v4). These were the commands that I used:

metagenome profiling:

sylph profile \
    -t 20 \
    databases/sylph/globdb_r226_sylph_c200.syldb \
    databases/sylph/fungi-refseq-2024-07-25-c200-v0.3.syldb \
    databases/sylph/imgvr_c200_v0.3.0.syldb \
    -1 reads/links/*_R1.fastq.gz -2 reads/links/*_R2.fastq.gz \
    -o sylph_result.all_samples_glob_fungi_imgvr.tsv

taxonomic profiling (with host information for viruses):

sylph-tax taxprof \
    sylph_result.all_samples_glob_fungi_imgvr.tsv \
    -o sylph_tax.all_samples_glob_fungi_imgvr. \
    -a \
    -t databases/sylph/sylph-tax/globdb_r226_sylph_tax.tsv.gz \
        databases/sylph/sylph-tax/fungi_refseq_2024-07-25_metadata.tsv.gz \
        databases/sylph/sylph-tax/IMGVR_4.1_metadata.tsv.gz

merge taxonomy profiles:

sylph-tax merge \
    --column relative_abundance \
    -o sylph_tax.all_samples_glob_fungi_imgvr.merged.relative_abundance.tsv \
    sylph_tax.all_samples_glob_fungi_imgvr.*

I also generated pavian-formatted outputs and used the pavian shiny app. This is a sankey diagram for one of the samples' relative abundance:

You may notice that the relative abundance of profiled viruses from IMG/VR is quite high: 47%! This is an exceptional case; in my other samples, viral relative abundance ranges from around 15% to 35%.

While such a high relative abundance of viral OTUs is not improbable given what I know about this dataset, I wanted to be sure that it's reasonable and fair to compare the viral OTUs to the prokaryotic OTUs in this way. In other words, could the high abundance of viruses be attributable to how the relative abundances were calculated for each database? And/or could it be attributable to some sort of database bias?

Thank you in advance for your help! Much appreciated.

Warm regards,
James

bluenote-1577 · 2025-07-24T18:12:21Z

Maintainer

Hi James,

Your commands look reasonable.

One caveat: there are two types of abundance, sequence abundance and taxonomic abundance. The pavian output uses taxonomic abundance, which normalizes by genome length. Because viral genomes are much shorter than prokaryotic ones, this makes viruses appear more abundant.
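
As a rough illustration of why length normalization favors small genomes, here is a minimal sketch with made-up numbers. The genome names, lengths, and base shares are hypothetical and not taken from this dataset, and `taxonomic_from_sequence` is an illustrative helper, not part of sylph. It assumes taxonomic abundance is proportional to coverage, i.e. assigned bases divided by genome length:

```shell
# Input rows (tab-separated): name, genome length (bp), share of assigned bases (%).
# Output rows: name, length-normalized ("taxonomic") abundance in percent.
taxonomic_from_sequence() {
    awk -F'\t' '
        # Coverage is taken as proportional to assigned bases / genome length.
        { name[NR] = $1; cov[NR] = $3 / $2; total += cov[NR] }
        END {
            for (i = 1; i <= NR; i++)
                printf "%s\t%.2f\n", name[i], 100 * cov[i] / total
        }'
}

# Hypothetical community: one 5 Mbp bacterium and one 50 kbp virus.
printf '%s\t%s\t%s\n' \
    bacterium 5000000 95 \
    virus       50000  5 |
taxonomic_from_sequence
```

With these numbers the virus accounts for 5% of the assigned bases but about 84% of the length-normalized abundance, so a large viral share in a taxonomic-abundance plot does not by itself imply that most reads are viral.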

P.S. You may want to lower --min-number-kmers when profiling against viruses; the default is quite strict. See the cookbook.

4 replies

bluenote-1577 Jul 24, 2025
Maintainer

Also, consider the -u option, since a lot is often unclassified in soil samples.

jamesck2 Jul 24, 2025
Author

I see. Thank you for the help. I'll adjust --min-number-kmers and add -u for viruses. If I were to plot the abundances on my own (viruses + proks.), without pavian, would you recommend that I use the sequence abundances in each sample and calculate relative abundances from those?

bluenote-1577 Jul 29, 2025
Maintainer

@jamesck2 I would use sequence or relative abundance depending on the biological question; I don't think there's a single right answer here. For example, use relative abundance if you care about the number of copies of viruses, or sequence abundance if you care about the total sequence length of viruses.
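
One way to see how much the choice matters for a given sample is to total both measures per group and compare them side by side. The sketch below assumes a hypothetical three-column table (group label, relative abundance, sequence abundance) that you would have to build yourself from the profile output. The rows shown are invented, and `summarize_by_group` is an illustrative helper, not a sylph or sylph-tax command:

```shell
# Input rows (tab-separated): group, relative abundance (%), sequence abundance (%).
# Output rows: group, summed relative abundance, summed sequence abundance.
summarize_by_group() {
    awk -F'\t' '
        { rel[$1] += $2; sq[$1] += $3 }
        END { for (g in rel) printf "%s\t%.1f\t%.1f\n", g, rel[g], sq[g] }
    ' | sort   # awk iterates groups in unspecified order, so sort for stable output
}

# Hypothetical per-genome rows for one sample.
printf '%s\t%s\t%s\n' \
    prokaryote 30.0 70.0 \
    prokaryote 23.0 25.0 \
    virus      40.0  4.0 \
    virus       7.0  1.0 |
summarize_by_group
```

In this invented sample, viruses total 47.0% by relative abundance but only 5.0% by sequence abundance, which is the kind of gap to expect when one group has much shorter genomes than the other.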

jamesck2 Jul 30, 2025
Author

Thank you!
