ContamFilter: Genome Contamination Screening Pipeline Implemented in CWL for External Users

Introduction

Both technical and biological contaminants may appear in raw, assembled, and annotated sequence data. Contaminants may include ‘technical’ sequences such as vector and adapter sequence, or ‘biological’ sequences derived from commensal or parasitic DNA isolated with the parent samples; contamination may also result from mislabeling of read data. Biological contaminants may include mitochondrial sequence and rRNA, and can be of viral, microbial, or eukaryotic origin. Legitimate (non-contaminating) foreign sequences may also be present in genomic read data or assemblies, representing horizontal gene transfer events.

NCBI implements a contamination-screening pipeline that screens genomic data for technical and biological contamination (Fig. 1). The NCBI internal implementation, called FSCR (currently at version 3.0), uses the GPipe Workflow Framework and is implemented mainly in C++, using binary data structures (ASN.1) for improved performance.

The goal of this project is to implement an externalized version of the pipeline and provide functionality for searches using user-defined contamination files and taxonomic subsets. This pipeline uses the Common Workflow Language (CWL) with Docker, and aims to incorporate the original algorithms in a more accessible runtime environment.

Planned Pipeline

Currently Implemented Pipeline

The above image was generated from the workflow in this GitHub project, via the Common Workflow Language Viewer.

How to run

Installing Docker

Follow the official instructions to install Docker CE on your system.

# Get the ContamFilter Repo
git clone https://github.com/NCBI-Hackathons/ContamFilter.git

# Go to the docker directory
cd ContamFilter/docker/blast

# Build the Docker image
docker build -t blast .

# See that the Docker image is available
docker images

# Run an individual command to see if the image is working
docker run blast echo "hello!"

# Print a list of applications loaded on this image
docker run blast ls /home/ubuntu/bin/

Running CWL

# Create a Python environment with a CWL runner.
virtualenv venv
. venv/bin/activate
pip install cwlref-runner
git clone https://github.com/NCBI-Hackathons/ContamFilter.git

# Edit the input parameters to provide the path to your sequence data
vi ContamFilter/cwl/workflow/contam_euk.yml

# Run the pipeline
cwl-runner ContamFilter/cwl/workflow/contam_euk.cwl ContamFilter/cwl/workflow/contam_euk.yml
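
Before launching the workflow it can help to confirm that both `docker` and `cwl-runner` are on the `PATH`. The following is a minimal sketch of such a pre-flight check, not part of the ContamFilter repository: the helper names `missing_tools` and `build_command` are hypothetical, and the paths are the ones used in the commands above.

```python
import shutil
import subprocess
import sys


def missing_tools(tools=("docker", "cwl-runner")):
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def build_command(workflow, job):
    """Assemble the cwl-runner invocation as an argument list."""
    return ["cwl-runner", workflow, job]


if __name__ == "__main__":
    # Paths taken from the shell commands above.
    workflow = "ContamFilter/cwl/workflow/contam_euk.cwl"
    job = "ContamFilter/cwl/workflow/contam_euk.yml"

    absent = missing_tools()
    if absent:
        sys.exit("Missing required tools: " + ", ".join(absent))

    # Run the workflow and propagate its exit status.
    sys.exit(subprocess.run(build_command(workflow, job)).returncode)
```

Passing the command as a list avoids shell quoting problems if the workflow or job file path contains spaces.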

Results

combined_hits.gff

##gff-version 3
Contig117697.1  .   region  1   37  .   .   .   Note=vecscreen_adaptors
Contig240124.2  .   region  37660   37707   .   .   .   Note=vecscreen_adaptors
Contig267453.2  .   region  5380    5428    .   .   .   Note=vecscreen_adaptors
Contig283602.1  .   region  110935  110984  .   .   .   Note=vecscreen_adaptors
Contig291655.1  .   region  47763   47805   .   .   .   Note=vecscreen_adaptors
Contig296353.2  .   region  21931   21963   .   .   .   Note=vecscreen_adaptors
Contig308584.2  .   region  4404    4441    .   .   .   Note=vecscreen_adaptors
Contig319333.1  .   region  19503   19535   .   .   .   Note=vecscreen_adaptors
Contig320691.1  .   region  9558    9590    .   .   .   Note=vecscreen_adaptors
Contig322210.1  .   region  19475   19522   .   .   .   Note=vecscreen_adaptors
Contig324115.2  .   region  10214   10246   .   .   .   Note=vecscreen_adaptors
Contig324115.2  .   region  10525   10557   .   .   .   Note=vecscreen_adaptors
Contig324115.2  .   region  63877   63921   .   .   .   Note=vecscreen_adaptors
Contig328006.2  .   region  20001   20036   .   .   .   Note=vecscreen_adaptors
Contig332010.1  .   region  37268   37303   .   .   .   Note=vecscreen_adaptors
Contig338096.1  .   region  52109   52156   .   .   .   Note=vecscreen_adaptors
Contig339890.2  .   region  2540    2575    .   .   .   Note=vecscreen_adaptors
Contig341351.1  .   region  10544   10602   100.000 .   .   Note=common_contam
Contig341351.1  .   region  10545   10602   .   .   .   Note=vecscreen_adaptors
Contig344819.2  .   region  24014   24057   .   .   .   Note=vecscreen_adaptors
Contig346119.1  .   region  7802    7859    .   .   .   Note=vecscreen_adaptors
Contig346119.1  .   region  7802    7860    100.000 .   .   Note=common_contam
Contig348759.1  .   region  51447   51496   .   .   .   Note=vecscreen_adaptors
Contig349176.2  .   region  58275   58322   .   .   .   Note=vecscreen_adaptors
Contig89794.1   .   region  21267   21299   .   .   .   Note=vecscreen_adaptors
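
Some contigs in this output are flagged more than once for nearly the same span (for example, Contig341351.1 is reported by both `common_contam` and `vecscreen_adaptors`). As an illustration of how the file might be post-processed, the sketch below parses the GFF lines and merges overlapping regions per contig. It is an assumption about downstream use rather than part of the pipeline; the function names `parse_gff_regions` and `merge_regions` are hypothetical, and fields are split on any whitespace so that both tab- and space-separated copies of the file are accepted.

```python
import sys
from collections import defaultdict


def parse_gff_regions(lines):
    """Yield (seqid, start, end, note) for each feature line of a GFF3 file."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and directives such as ##gff-version
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue  # not a complete nine-column GFF record
        note = ""
        for attribute in fields[8].split(";"):
            key, _, value = attribute.partition("=")
            if key == "Note":
                note = value
        yield fields[0], int(fields[3]), int(fields[4]), note


def merge_regions(records):
    """Merge overlapping regions per contig, collecting the notes of each."""
    by_contig = defaultdict(list)
    for seqid, start, end, note in records:
        by_contig[seqid].append((start, end, note))

    merged = {}
    for seqid, spans in by_contig.items():
        spans.sort()
        result = []
        for start, end, note in spans:
            # GFF coordinates are 1-based and inclusive, so two regions
            # overlap when the next start is <= the previous end.
            if result and start <= result[-1][1]:
                prev_start, prev_end, notes = result[-1]
                result[-1] = (prev_start, max(prev_end, end), notes | {note})
            else:
                result.append((start, end, {note}))
        merged[seqid] = result
    return merged


if __name__ == "__main__":
    # Usage: python merge_hits.py combined_hits.gff
    with open(sys.argv[1]) as handle:
        for seqid, spans in sorted(merge_regions(parse_gff_regions(handle)).items()):
            for start, end, notes in spans:
                print(seqid, start, end, ",".join(sorted(notes)), sep="\t")
```

Regions that are merely adjacent (one ends at position N, the next starts at N+1) are kept separate here; whether to join them is a design choice that depends on how the merged spans will be used.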

Key tasks, current roadblocks.

Current status: The CWL workflow runs BLAST on a test genomic sequence dataset against reference contaminant sequence files and produces output listing putative contaminant sequence hits. This represents the minimum functionality of the NCBI internal contamination workflow.

Contamination sequence files include:

a common contaminants database that contains vector sequences, bacterial insertion sequences, E. coli and phage genomes
a database of adaptors, linkers, and primers
a database of mitochondrial genomes
the chromosomes of unrelated organisms

Roadblocks: The NCBI internal pipeline's C++ code and toolkit are not accessible externally; this was mitigated by obtaining the original shell scripts. The contamination sequence databases are not documented; we are creating DOIs for the contaminant database contents. Input data such as chromosome sequences are very large; we are currently using a subset of the contaminant database. The Docker implementation is ongoing.

Motivation

Implement a sequence contamination screening pipeline using Common Workflow Language (CWL)

Why CWL:

An emerging standard with no clear alternative winner for workflow languages at this time.
Has the most backend support: 13 products aim to support CWL, and at least 3 good candidates are currently available.
CWL will provide increased accessibility for the external community.
A simple model of computation with files on input, files on output, command line apps, and some notions of resources (memory, CPU) for job scheduling. This model has value on its own merits.
Existence of tooling such as graphical editors (very new) and repositories for publishing reusable workflow components.

Caveats for CWL:

Inadequate documentation: the reference documentation is formal to the point of being obtuse, and the user guide is incomplete.
The simple model of computation does not match well to complex orchestrated processes such as collaborating microservices.
Significant constraints on inputs/outputs and difficulty of interacting with databases. Example: difficulty dealing with BLAST databases.
Existing repositories of workflows are anemic in content; for example, BLAST is not among the tools/workflows in the repositories.

Why Docker:

Natively supported by numerous tools, unlike alternatives such as Singularity. This may change in the future, but there is a cost to delaying the benefits.
Fine-grained isolation, so that we know the dependencies of each of our processes. Compare this with attempts to run GPipe pipelines in the QA environment, where it took months to identify which resources are needed, but NOT where they are used.
Reproducibility, both for reasons of scientific research and software testability.

Caveats for Docker:

Security concerns are valid.
Alternatives such as Singularity may gain traction and support in tooling, but that day is not today.

Why Contamination Screening as a tool for the external community:

Long-standing requests from submitters.
Faster feedback, and hence faster turnaround, for submissions.
Lack of transparency: Documentation for the existing production process lacks full details and may have diverged from what is actually performed.
Savings on computational costs: Could use a signature to sign submissions which have been pre-screened.

Main impediments during this Hackathon:

The production screening pipeline is C++/ASN.1 and even basic tools are not available publicly.
Data objects, such as screening databases, are not available in public repositories: sequences have not been submitted to ID/GenBank, and FASTA files are on private FTP rather than in public repositories (e.g. Zenodo).
CWL documentation is inadequate.

Take home lessons:

Need to make our internal tools more accessible. Python tools are available via simple 'pip install'. The Core Toolkit is hard to install, and tools like alignment filtering/sorting/formatting (align_filter, align_sort, align_format) are not public.
Importance of understanding dependencies, including infrastructure dependencies.
Focused effort on one task. Creating a screening tool for public consumption has been a long-requested task but was never implemented.

References for the existing contamination process: https://www.ncbi.nlm.nih.gov/tools/vecscreen/contam/
