16S rRNA microbiome data analysis workflow using DADA2 and R on a high performance cluster.
This repository mainly contains wrapper functions around DADA2 functions to streamline the workflow for cluster computing.
This package is meant to serve two purposes: to be an R package and to give
structure to an analysis. The project follows an R package structure, so it can
be downloaded and installed as such. Additionally, users are expected to
download this repository and run make and Slurm commands to run the scripts.
Table of Contents
install.packages("devtools")
devtools::install_github("erictleung/dada2HPCPipe")
This DADA2 workflow stems from the DADA2 tutorial and big data tutorial. You can find more information about the DADA2 package from its publication or from GitHub.
clean        Remove data from test_data/, download/, and refs/
condar       Install R and essential packages
dl-ref-dbs   Download 16S reference databases (SILVA,RDP,GG)
help         Help page for Makefile
install      Install and update dada2HPCPipe package in R
setup        Setup development environment with Conda
test         Run DADA2 workflow with Mothur MiSeq test data
Here are instructions on how to get started on ExaCloud and set up the development environment needed to run the DADA2 workflow.
Interactive Session
To run an interactive session, run the following:
srun --pty /usr/bin/bash
This will allow you to test your code and workflow without worrying about stressing out the head coordinating node.
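As a minimal sketch of how to confirm you are actually inside an allocation before running heavy work, the snippet below checks the SLURM_JOB_ID environment variable, which Slurm sets for processes it launches. The function name in_slurm_job is hypothetical and is not part of this repository.

```shell
# Hypothetical helper (not part of dada2HPCPipe): succeed only when the
# current shell is running inside a Slurm job or interactive allocation.
in_slurm_job() {
    # Slurm exports SLURM_JOB_ID to the processes it launches.
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_slurm_job; then
    echo "Inside Slurm job ${SLURM_JOB_ID}; safe to run heavy work."
else
    echo "Not inside a Slurm job; start one with: srun --pty /usr/bin/bash"
fi
```

A check like this could sit at the top of a long-running script so that it refuses to start on the head node.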
Setup
Follow the instructions listed in this document to set up a modern development environment on the cluster. This isn't necessary if your development environment is on a cluster where you have root access or if you're implementing this workflow locally.
Briefly, following the instructions linked above will give you the following:
- Miniconda, Python package and virtual environment management
- Linuxbrew, non-root package management on Linux systems
For this R workflow, you'll only really need to install Miniconda and R
essentials. Anaconda provides an r-essentials package that bundles R with
the most commonly used R packages for data science.
Linuxbrew is useful to supplement commands and other software tools you might want under package management control.
In summary, to set up the dependencies for DADA2, run the following.
make setup
make condar
The make setup target runs the following:
# Download and install Miniconda
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
# Say yes to adding Miniconda to .bash_profile
# Remove install file
rm Miniconda2-latest-Linux-x86_64.sh
To see changes, you'll first need to exit the cluster and log back in.
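After logging back in, one way to confirm that Miniconda ended up on your PATH is a small check like the one below. This is only a sketch; the helper name have_cmd is hypothetical, and the hint about ~/.bash_profile assumes you answered yes to the installer prompt mentioned above.

```shell
# Hypothetical helper: report whether a command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd conda; then
    echo "conda found at: $(command -v conda)"
else
    echo "conda not found; log out and back in, or check ~/.bash_profile"
fi
```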
The make condar target installs R and other essential R packages using the
following commands:
# Install R and relevant packages
conda install r-essentials
# For maintenance and update of all packages
conda update r-essentials
# For updating a single particular R package, replace XXXX
conda update r-XXXX
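To illustrate the r-XXXX naming used above, here is a small sketch that turns an R package name into its likely conda package name. It assumes the convention of an r- prefix followed by the lowercased package name; the function conda_r_name is hypothetical, and a given package may not exist in your conda channels.

```shell
# Hypothetical helper: map an R package name to its likely conda name
# by adding the "r-" prefix and lowercasing (an assumed convention).
conda_r_name() {
    printf 'r-%s\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}

conda_r_name XML       # prints r-xml
conda_r_name ggplot2   # prints r-ggplot2
```

You could then pass the result to conda, for example conda update "$(conda_r_name XML)".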
Slurm is the resource manager that I'll focus on for this workflow. Slurm stands for "Simple Linux Utility for Resource Management."
An example script might look like this.
$ cat first_script.sh
#!/bin/bash
# Template for simple SLURM script
# Directives must start with "#SBATCH" (no space after "#") to be recognized
#SBATCH --job-name="Job Name"
#SBATCH --partition=exacloud
srun hostname
srun pwd
srun hostinfo
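The quick answer on sbatch vs. srun can be found here. In short, sbatch submits a script to the queue for later execution, while srun launches a command right away, either interactively or as a job step inside a batch script.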
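Building on the example script, the sketch below generates a submit script from a job name and partition. Note that Slurm only recognizes directives written as #SBATCH with no space after the #. The helper name write_submit_script, the output file pattern, and the choice of make test as the workload are all assumptions made for illustration.

```shell
# Hypothetical helper: write a minimal Slurm submit script to a file.
# Arguments: job name, partition, output path. All values are illustrative.
write_submit_script() {
    job_name=$1
    partition=$2
    out_file=$3
    cat > "$out_file" <<EOF
#!/bin/bash
#SBATCH --job-name=${job_name}
#SBATCH --partition=${partition}
#SBATCH --output=${job_name}-%j.out

# Assumed workload: the Makefile's test target from this repository.
srun make test
EOF
}

write_submit_script dada2_test exacloud submit_dada2.sh
cat submit_dada2.sh
```

The generated file could then be submitted with sbatch submit_dada2.sh. In the --output pattern, %j is replaced by the job ID.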
Below are some useful commands to use within Slurm using the script above.
# Submit your script, first_script.sh
sbatch first_script.sh
# Look at jobs in the queue
squeue
squeue -u $USER # Take a look at your specific jobs
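When chaining these commands, it can help to capture the job ID that sbatch reports. On success, sbatch prints a line of the form "Submitted batch job <id>"; the sketch below pulls the ID out of that message. The helper name parse_job_id is hypothetical, and the demonstration uses a sample message rather than a real submission.

```shell
# Hypothetical helper: extract the numeric job ID from sbatch's
# "Submitted batch job <id>" message read on standard input.
parse_job_id() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Demonstration with a sample message instead of a real submission.
job_id=$(echo "Submitted batch job 12345" | parse_job_id)
echo "Job ID: ${job_id}"
```

On the cluster, this might be used as job_id=$(sbatch first_script.sh | parse_job_id) followed by squeue -j "$job_id". The sbatch --parsable option is another way to get the ID directly.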
You can use this website to help generate Slurm scripts. It is designed for another cluster, but it should at least help with the initial drafting of a submit script you'll want to use.
For more general resources on using Slurm, check out here, here, and here.
Source: http://www.cism.ucl.ac.be/Services/Formations/slurm/2016/slurm.pdf
Installing this package says it has trouble installing Bioconductor packages
There are two solutions for this. From within R, you can run the following
setRepositories(ind=1:2)
which will tell R to also include Bioconductor packages in its package search. See https://stackoverflow.com/a/34617938/6873133 for more information.
Alternatively, you can install Bioconductor manually by running the following within R
# Try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite()
and then using biocLite() to install the missing packages. See
http://bioconductor.org/install/.
How do I update my packages?
For regular R packages (i.e., non-Bioconductor packages), use conda from the
terminal.
# XXX is the package name
conda install r-XXX
# For example, installing XML
conda install r-xml
But for Bioconductor packages, use biocLite() from within R.
# Try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
# E.g. installing DESeq2
biocLite("DESeq2")