Relative abundance calculations when using multiple reference databases #61
Hi Sylph team,
Really nice tool! I'm interested in the feature that allows taxonomic profiling with more than one metagenome and more than one database, as described here. I was able to do this successfully with 66 soil metagenome samples and three reference databases (GlobDB, Fungal RefSeq, and IMG/VR v4). These were the commands that I used:
metagenome profiling:
taxonomic profiling (with host information for viruses):
merge taxonomy profiles:
I also generated pavian-formatted outputs and used the pavian shiny app. This is a Sankey diagram of the relative abundance for one of the samples:
![]()
You may notice that the relative abundance of profiled viruses from IMG/VR is quite high: 47%! This is an exceptional case; in my other samples the range is around 15-35%. While such a high relative abundance of viral OTUs is not improbable in this dataset, given what I know about it, I just wanted to be sure that it's reasonable/fair to compare the viral OTUs to the prokaryotic OTUs in this way. In other words, is the high abundance of viruses likely due to how the relative abundances were calculated for each database? And/or is it likely due to some sort of database bias?
Thank you in advance for your help! Much appreciated.
Warm regards,
Replies: 1 comment 4 replies
Hi James,
Your commands look reasonable.
One caveat: there are two types of abundance, sequence abundance and taxonomic abundance. The pavian output uses taxonomic abundance, which normalizes by genome length. Because viral genomes are short, this makes viruses look more abundant than they would under sequence abundance.
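To make the difference between the two abundance types concrete, here is a minimal Python sketch on a toy community. It is not Sylph's actual implementation: the function name `abundances`, the genome lengths, and the coverage values are all hypothetical, and it assumes that the sequenced bases attributable to a genome scale as genome length times coverage.

```python
def abundances(genomes):
    """Return {name: (taxonomic_pct, sequence_pct)} for a toy community.

    genomes maps a name to (genome_length_bp, mean_coverage).
    Assumption (not taken from Sylph's code): taxonomic abundance is
    proportional to coverage alone, while sequence abundance is
    proportional to genome_length * coverage.
    """
    total_cov = sum(cov for _, cov in genomes.values())
    total_bases = sum(length * cov for length, cov in genomes.values())
    result = {}
    for name, (length, cov) in genomes.items():
        taxonomic = 100.0 * cov / total_cov              # length-normalized
        sequence = 100.0 * length * cov / total_bases    # share of sequenced bases
        result[name] = (taxonomic, sequence)
    return result


# Hypothetical values: a 5 Mbp bacterium and a 50 kbp virus at equal coverage.
toy = {"bacterium": (5_000_000, 10.0), "virus": (50_000, 10.0)}
for name, (tax, seq) in abundances(toy).items():
    print(f"{name}: taxonomic {tax:.1f}%, sequence {seq:.1f}%")
```

In this toy case the virus accounts for half of the taxonomic abundance but only about 1% of the sequence abundance, which is the kind of gap that can make viral fractions look large in a taxonomic-abundance plot.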
PS: you may want to lower --min-number-kmers when profiling against viruses, since the default is quite strict. See the cookbook.