Nextflow pipeline for scaffolding genome assemblies with Hi-C reads
This pipeline requires the following inputs:
- A fasta file containing assembled contigs (--contigs)
- Hi-C reads in paired-end fastq(.gz) format (--r1Reads and --r2Reads)
It then performs the following tasks:
- Aligns the Hi-C reads to the contigs using chromap
- Scaffolds the contigs using yahs
- Prepares all the files you need to do manual curation in Juicebox
and produces the following outputs:
- Alignments in bam format (out/chromap/aligned.bam)
- A scaffolded assembly in both agp and fasta formats (out/scaffolds/yahs.out_scaffolds_final.[agp,fa])
- .hic and .assembly files for loading in Juicebox Assembly Tools (out/juicebox_input/out_JBAT.[hic,assembly])
If you're running this on the Lewis cluster, I've already got a profile set up with everything you need, so just add -profile lewis to the command and you're good to go.
This pipeline has the following dependencies:
Nextflow must be in your path. You can get Nextflow to make a conda environment containing chromap and yahs for you with -profile conda (note one dash!).
JuicerTools is distributed as a jar file, so you need to tell the pipeline where it is by adding the argument --juicerToolsJar /path/to/jar (note two dashes!). You can also add this stuff to a config file called nextflow.config in the directory from which you're running it (see the Nextflow documentation).
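As a minimal sketch of the config-file route, the snippet below writes a nextflow.config into the current directory. It relies on the general Nextflow behavior that a double-dash option such as --juicerToolsJar sets a parameter of the same name in the params scope. The jar path is a placeholder, not a real location.

```shell
# Write a nextflow.config in the directory you launch the pipeline from.
# NOTE: this overwrites any nextflow.config already present there.
# Entries in the params scope correspond to double-dash command-line options.
# /path/to/jar is a placeholder; substitute the location of your JuicerTools jar.
cat > nextflow.config <<'EOF'
params {
    juicerToolsJar = '/path/to/jar'
}
EOF
```

With this file in place, the --juicerToolsJar argument can be left off the command line.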
nextflow run WarrenLab/hic-scaffolding-nf \
--contigs contigs.fa \
--r1Reads hic_reads_R1.fastq.gz \
--r2Reads hic_reads_R2.fastq.gz
N.B. WarrenLab/hic-scaffolding-nf is the name of this GitHub repository, not a local path on your machine. You do not need to download any file in this repository; just tell Nextflow to run WarrenLab/hic-scaffolding-nf and it will take care of downloading the pipeline for you.
You'll need to add a couple of options depending on your configuration (see the dependencies above).
If you want to specify an enzyme to YAHS, you can add, e.g.,
--extra-yahs-args "-e GATC"
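Putting the options above together, a full invocation might look like the following sketch. It assumes the conda profile and a placeholder jar path, and the input file names are the same example names used earlier. The command is only assembled and printed here so it can be checked before launching; the quotes around "-e GATC" keep it as a single argument.

```shell
# Assemble the full command as positional parameters (this replaces any
# positional parameters already set in the current shell or script).
# One dash for Nextflow's own options (-profile), two for pipeline options.
# /path/to/jar and the input file names are placeholders.
set -- nextflow run WarrenLab/hic-scaffolding-nf \
    -profile conda \
    --contigs contigs.fa \
    --r1Reads hic_reads_R1.fastq.gz \
    --r2Reads hic_reads_R2.fastq.gz \
    --juicerToolsJar /path/to/jar \
    --extra-yahs-args "-e GATC"

# Dry run: show what would be executed.
echo "Would run: $*"

# Uncomment the next line to actually launch the pipeline.
# "$@"
```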
In addition to the scaffolded assembly, this pipeline creates files you can use to manually curate the assembly in Juicebox Assembly Tools. When you are done with the curation, follow these instructions from the YAHS documentation:
Once completed editing, there should be a file named something like out_JBAT.review.assembly generated by Juicebox, which can be fed into juicer post command to generate AGP and FASTA files for the final genome assembly. You also need the out_JBAT.liftover.agp coordinate file previously generated with juicer pre command.
juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp contigs.fa
This will end up with two files out_JBAT.FINAL.agp and out_JBAT.FINAL.fa. Together with hic-to-contigs.bin or the original BED/BAM file, you can regenerate a HiC contact map for the final assembly as described in the previous section.
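Since juicer post needs three input files at once, it can help to confirm they are all present before running it. The sketch below uses a hypothetical helper, check_juicer_post_inputs, which is not part of YAHS or this pipeline. The file names are the ones from the quoted instructions and are assumed to be in the current directory; adjust the paths to wherever your files actually live.

```shell
# Hypothetical helper: return 0 only if every named file exists,
# and report each missing file on stderr.
check_juicer_post_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# contigs.fa stands for the original contig fasta given to the pipeline.
if check_juicer_post_inputs out_JBAT.review.assembly out_JBAT.liftover.agp contigs.fa; then
    juicer post -o out_JBAT out_JBAT.review.assembly out_JBAT.liftover.agp contigs.fa
fi
```

If any file is missing, the juicer post command is skipped and the missing names are listed.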