Skip to content

Instantly share code, notes, and snippets.

@gregcaporaso
Last active June 23, 2022 15:57
Show Gist options
  • Save gregcaporaso/f3c042e5eb806349fa18 to your computer and use it in GitHub Desktop.
Save gregcaporaso/f3c042e5eb806349fa18 to your computer and use it in GitHub Desktop.
vsearch-based sequence dereplication through generation of a biom table

This depends on biom version >= 2.1.5, < 2.2.0 and vsearch >= 1.7.0.

Please note that all of this is highly experimental. I'm keeping these notes as I work with this approach. For published work, I still recommend using standard pipelines, such as those in QIIME 1.9.1.

$ biom --version
biom, version 2.1.5

$ vsearch --version
vsearch v1.7.0_osx_x86_64, 16.0GB RAM, 8 cores
https://github.com/torognes/vsearch

First, dereplicate your sequences with vsearch. The --relabel_sha1 --relabel_keep parameters are option, but map your original sequence identifiers onto sequence sha1 values. We'll come back to that in a minute. This takes about a minute on a MiSeq run.

$ vsearch --derep_fulllength seqs --output rep-set.fna --uc vsearch-derep.uc --relabel_sha1 --relabel_keep
vsearch v1.7.0_osx_x86_64, 16.0GB RAM, 8 cores
https://github.com/torognes/vsearch

Reading file seqs 100%
26469 nt in 200 seqs, min 116, max 152, avg 132
Dereplicating 100%
Sorting 100%
43 unique sequences, avg cluster 4.7, median 2, max 67
Writing output file 100%
Writing uc file, first part 100%
Writing uc file, second part 100%

Next, run biom from-uc to generate a biom file.

$ biom from-uc -i vsearch-derep.uc -o vsearch-derep.biom --rep-set-fp rep-set.fna

Then, we can biom head the table to get a quick look at the data.

$ biom head -i vsearch-derep.biom
# Constructed from biom file
#OTU ID	f2	f1	f3	f4	p1
e84fcf85a6a4065231dcf343bb862f1cb32abae6	13.0	16.0	16.0	22.0	0.0
5525fb6dab7b6577960147574465990c6df070ad	0.0	0.0	0.0	0.0	19.0
eb3564a35320b53cef22a77288838c7446357327	0.0	0.0	0.0	0.0	0.0
418f1d469f08c99976b313028cf6d3f18f61dd55	0.0	0.0	0.0	0.0	0.0
2e3b2c075901640c4de739473f9246385430b1ed	0.0	3.0	3.0	0.0	0.0

Notice that the OTU identifiers (first field in each row after the first one) are different than what you're probably used to seeing. These are sha1 hashs of the representative sequences for each OTU. Because this is dereplication, all sequences in the OTU are 100% identical, so have the same sha1. By naming your OTUs this way, your OTU identifiers will be consistent across runs of vesearch --derep_fulllength, so you can safely merge tables (provided that your sample identifiers don't overlap unless they're supposed to.

You now have a biom file vsearch-derep.biom and the corresponding represenative sequences file rep-set.fna. These can be plugged into QIIME, or other programs that support biom. From here, we generally also want a tree for phylogenetic diversity calculations and taxonomic assignments for each dereplicated sequence.

Filtering OTUs that are present in only a single sample, and then dropping those representative sequences

This is important for the efficiency of the next steps. In my few experiences, this will reduce the number of OTUs (and therefore representative sequences) by 1-2 orders of magnitude. So, that obviously results in much faster alignment, tree building, and taxonomy assignment.

I do these steps with QIIME 1.9.1 as follows (I use ms2 as an abbreviation for minimum samples = 2):

filter_otus_from_otu_table.py -i vsearch-derep.biom -o vsearch-derep.ms2.biom -s 2
filter_fasta.py -f rep-set.fna -o rep-set.ms2.fna -b vsearch-derep.ms2.biom

Tree building for phylogenetic diversity calculations

To build a tree, I've been aligning sequences and masking the alignment with ssu-align, and then building a tree from the resulting bacterial sequences with FastTreeMP. Here are my steps:

ssu-align rep-set.ms2.fna ssu-align-ms2/

# mask the alignment and output as fasta
# as far as I can tell, ssu-mask can only write to current directory, hence my
# pair of cd commands here to keep results of ssu-* contained
cd ssu-align-ms2 ; ssu-mask --afa -a ssu-align-ms2.bacteria.stk ; cd ..

# build the tree
export OMP_NUM_THREADS=29; FastTreeMP -nosupport -fastest -nt ssu-align-ms2/ssu-align-ms2.bacteria.mask.afa > rep-set.ms2.tre

With about 700,000 sequences in rep-set.ms2.fna, the ssu-align step took about 16 hours, and the ssu-mask about 30 seconds. FastTreeMP with 32 threads took 2.5 hours.

I then use the resulting rep-set.ms2.tre for diversity analyses.

Taxonomy assignment

I use QIIME's RDP Classifier wrapper for this right now, training against the 99% Greengenes OTUs, and then add those assignments to my biom file.

parallel_assign_taxonomy_rdp.py -r /home/gregcaporaso/gg_13_8_otus/rep_set/99_otus.fasta -t /home/gregcaporaso/gg_13_8_otus/taxonomy/99_otu_taxonomy.txt -i rep-set.ms2.fna -o rdp-assigned-tax-99/ -O 20
biom add-metadata -i vsearch-derep.ms2.biom -o vsearch-derep.ms2.rdp-tax-99.biom --observation-metadata-fp rdp-assigned-tax-99/rep-set.ms2_tax_assignments.txt --observation-header ID,taxonomy --sc-separated taxonomy

I then use the resulting vsearch-derep.ms2.rdp-tax-99.biom for diversity analyses.

@marasodi
Copy link

Hi
I tried to convert uc to biom from the below command
biom from-uc -i vsearch-derep.uc -o vsearch-derep.biom --rep-set-fp rep-set.fna
and got this error ValueError: A query sequence was encountered that does not have an underscore. An underscore is required in all query sequence identifiers to indicate the sample identifier.
Any idea as to how to go about it/solve? Thanks for your assistance

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment