DiscoVir

written by Lauren Krausfeldt & Poorani Subramanian - [email protected]

Description

This is a pipeline for exploring viruses (ssDNA, dsDNA phage, and giant DNA viruses) and viral diversity in metagenomes. It can be run in the cloud application Nephele (under Explore) or on HPC. More details here.

The pipeline accepts metagenomic assembly sequences (.fasta) and binary alignment map (.bam) files of the reads mapped back to the assemblies as input. (These files can be produced by the WGSA2 pipeline in Nephele¹.) The output of the pipeline includes the viral genomes found in the metagenome assembly, their taxonomy and level of completeness, viral functional genes and their abundances, and vOTU abundances and their host taxonomy.

The pipeline first searches for viral genomes using geNomad², which also provides the taxonomy and functional classification of each viral genome. The viral genomes are also functionally classified with DRAM-v³ and (optionally) DIAMOND⁴ using the nr database. Gene abundances per sample are produced from these outputs using VERSE⁵. From here, the user has the option to filter the resulting sequences based on completeness using CheckV⁶. Either the output of geNomad or the output of CheckV is used to cluster viral genomes with BBTools dedupe⁷ and mmseqs⁸ to produce vOTUs⁹. Finally, abundances and host taxonomy of the vOTUs are produced.

Files

- Snakefile: pipeline script (reads in configs, commands for each pipeline step/rule)
  - cluster_setup.smk: helper script for reading in cluster config file
- project_config.yaml: for snakemake --configfile option. Config file with details for a specific project: working/input/output directories, path to scripts and other configs, sample names, options for specific rules in the pipeline, etc.
- locus.cluster_config.yaml: cluster configuration file for snakemake --cluster-config option. Specifically for NIAID Locus HPC, which uses UGE. (Sets parameters for the qsub command for each rule's job, and which environment modules to use.)
- locus_submit_vp.sh: batch job submit script for running the pipeline on Locus
- scripts: see scripts README
- docs/README_for_DiscoVir_outputs.md: explanation of outputs of the pipeline

Running the Pipeline

Inputs

The inputs to the pipeline are assembled contigs/scaffolds (one fasta file per sample) and bam files of reads aligned to the assemblies (one bam per sample). They should be located in (or symlinked to) a single directory, and each filename should start with a unique per-sample name.
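
The naming convention above can be checked before submitting a job. The following is a minimal sketch, not part of the pipeline: the function name `find_sample_inputs` is hypothetical, and it assumes the assemblies end in `.fasta` and the alignments end in `.bam`. It pairs one fasta and one bam with each sample name by filename prefix and fails if a file is missing or if a prefix matches more than one file.

```python
from pathlib import Path


def find_sample_inputs(input_dir, samples):
    """Map each sample name to its (fasta, bam) pair found in input_dir.

    Hypothetical helper for illustration only. Assumes filenames start with
    the sample name and end in .fasta or .bam.
    """
    # is_file() follows symlinks, so symlinked inputs are accepted too
    files = [p for p in Path(input_dir).iterdir() if p.is_file()]
    paired = {}
    for sample in samples:
        matches = [p for p in files if p.name.startswith(sample)]
        fastas = sorted(p for p in matches if p.name.endswith(".fasta"))
        bams = sorted(p for p in matches if p.name.endswith(".bam"))
        # Exactly one file of each type is expected per sample
        if len(fastas) != 1 or len(bams) != 1:
            raise ValueError(
                f"sample {sample!r}: expected 1 fasta and 1 bam, "
                f"found {len(fastas)} fasta and {len(bams)} bam"
            )
        paired[sample] = (fastas[0], bams[0])
    return paired
```

Because matching is by prefix, a sample named `S1` would also match files belonging to `S10`; the check reports this as an error, which is one reason the per-sample names need to be unique as prefixes.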

To run on Locus

Clone this repo locally:

git clone https://github.com/niaid/virome-pipeline

Copy the project config file project_config.yaml and the submit script locus_submit_vp.sh to your project working directory, and edit both with the details for your specific project.
- for the submit script, the main items to edit are:
  1. path to the project config file, email address
  2. the arguments for the snakemake command at the bottom of the script (see comments in the script)
- for the project config, the main items to edit are:
  - paths to the input, output, and working directories, and email address
  - pipeline options detailed in the comments of the config file

Submit the job script:

qsub ./locus_submit_vp.sh

Success?

Notes

The pipeline has been tested on NIAID's Locus HPC. It should be adaptable to another HPC that uses environment modules: make your own cluster config file (with the correct module names and job parameters) and your own job submit script (in particular, modify $clustercmd for whatever job scheduler your HPC uses).
In the future, we plan to make the pipeline more general (perhaps using conda or a containerized workflow instead of environment modules).
We also plan to add steps for specialized analyses and to make the pipeline more flexible.
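
As a rough illustration of the scheduler adaptation mentioned above, the sketch below returns a job submission template for either UGE or Slurm. It is not taken from locus_submit_vp.sh, and the real $clustercmd may look different. The function name `cluster_command`, the cluster config keys `threads` and `mem`, and the parallel environment name `threaded` are all assumptions for illustration; the `{cluster.<key>}` placeholders follow the convention of Snakemake's --cluster-config option.

```python
def cluster_command(scheduler):
    """Return a submission command template for the given scheduler.

    Hypothetical sketch: the keys 'threads' and 'mem' and the parallel
    environment name 'threaded' are assumptions, not values from this repo.
    """
    templates = {
        # UGE/SGE: -pe requests slots in a (site-specific) parallel environment
        "uge": "qsub -cwd -pe threaded {cluster.threads} -l h_vmem={cluster.mem}",
        # Slurm: the equivalent CPU and memory requests use sbatch options
        "slurm": "sbatch --cpus-per-task={cluster.threads} --mem={cluster.mem}",
    }
    try:
        return templates[scheduler.lower()]
    except KeyError:
        raise ValueError(f"unsupported scheduler: {scheduler!r}") from None
```

Memory requests are interpreted differently by different schedulers, so the values in a new cluster config file should be checked against the target HPC's documentation rather than copied from the Locus config.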

References

1. https://www.protocols.io/view/wgsa2-workflow-a-tutorial-n92ldm98xl5b/v1
2. Camargo, A. P., Roux, S., Schulz, F., Babinski, M., Xu, Y., Hu, B., ... & Kyrpides, N. C. (2023). Identification of mobile genetic elements with geNomad. Nature Biotechnology, 1-10. doi: 10.1038/s41587-023-01953-y.
3. Shaffer, M., Borton, M. A., McGivern, B. B., Zayed, A. A., La Rosa, S. L., Solden, L. M., ... & Wrighton, K. C. (2020). DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic Acids Research, 48(16), 8883-8900. doi: 10.1093/nar/gkaa621.
4. Buchfink, B., Reuter, K., & Drost, H. G. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods, 18(4), 366-368. doi: 10.1038/s41592-021-01101-x.
5. Zhu, Q., Fisher, S. A., Shallcross, J., & Kim, J. (2016). VERSE: a versatile and efficient RNA-Seq read counting tool. bioRxiv, 053306. doi: 10.1101/053306.
6. Nayfach, S., Camargo, A. P., Schulz, F., Eloe-Fadrosh, E., Roux, S., & Kyrpides, N. C. (2021). CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nature Biotechnology, 39(5), 578-585. doi: 10.1038/s41587-020-00774-7.
7. https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/dedupe-guide/
8. Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028. doi: 10.1038/nbt.3988.
9. Roux, S., Adriaenssens, E. M., Dutilh, B. E., Koonin, E. V., Kropinski, A. M., Krupovic, M., ... & Eloe-Fadrosh, E. A. (2019). Minimum information about an uncultivated virus genome (MIUViG). Nature Biotechnology, 37(1), 29-37. doi: 10.1038/nbt.4306.
10. Shumate, A., & Salzberg, S. L. (2021). Liftoff: Accurate mapping of gene annotations. Bioinformatics, 37(12), 1639-1643. doi: 10.1093/bioinformatics/btaa1016.
