SpLitteR is a tool that uses synthetic long reads (SLRs) to improve the contiguity of HiFi assemblies. Given an SLR library and a HiFi assembly graph in the GFA format, SpLitteR resolves repeats in the assembly graph using linked reads and generates a simplified (more contiguous) assembly graph with corresponding scaffolds.
- g++ (version 5.3.1 or higher)
- cmake (version 3.12 or higher)
- zlib
- libbz2
cd spades/assembler/
mkdir build && cd build && cmake ../src
make splitter
To run SpLitteR, move to the assembler/ folder and execute
build/bin/splitter
The tool requires
- Assembly graph file in GFA 1.0 format, with scaffolds included as path lines.
- SLR library in YAML format. The tool supports SLR libraries produced using 10X Genomics Chromium and UST TELL-Seq technologies. Other SLR technologies, such as stLFR or LoopSeq, can potentially be used as input if converted to 10X or TELL-Seq format.
SpLitteR supports LJA and Flye assembly graphs out of the box. Other assembly graphs should preferably be converted into blunt format, e.g. with the GetBlunted utility.
A TELL-Seq library should include barcodes, left reads, and right reads as three separate FASTQ files.
For example, if you have a TELL-Seq library
tellseq_reads_I1.fastq.gz
tellseq_reads_R1.fastq.gz
tellseq_reads_R2.fastq.gz
the YAML file should look like this:
[
  {
    orientation: "fr",
    type: "tell-seq",
    right reads: [
      "/FULL_PATH_TO_DATASET/tellseq_reads_R2.fastq.gz"
    ],
    left reads: [
      "/FULL_PATH_TO_DATASET/tellseq_reads_R1.fastq.gz"
    ],
    aux: [
      "/FULL_PATH_TO_DATASET/tellseq_reads_I1.fastq.gz"
    ]
  }
]
A 10X library should be in FASTQ format with barcodes attached as BC:Z or BX:Z tags:
@COOPER:77:HCYNTBBXX:1:1216:22343:0 BX:Z:AAAAAAAAAACATAGT
CCAGGTAGGATTATGGAATTGGTATAAGCGATCAAACTCAATATTTTTGGTGCGGTGACAGACGCCTTCTGGCAGATGATGGGCTTGTCGTAAGTGTGGT
+
GGAGGGAAGGGGIGIIAGAGAGGGGGIAGGGGGGGAGGGGGGGGGGGGAAAGGAGGGGGIGIGGGGGGGAGGAGGIGAIAGGIGGGGIGGGGGGGGGGGG
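Before running SpLitteR on a 10X library, it can be useful to confirm that the read headers actually carry a barcode tag. The Python sketch below extracts the BX:Z or BC:Z value from a FASTQ header line. The function name `extract_barcode` is hypothetical and not part of SpLitteR, and the sketch assumes that tags appear as whitespace-separated fields after the read name, as in the record above.

```python
from typing import Optional

# Tag prefixes named in this README; other tags are ignored.
BARCODE_TAGS = ("BX:Z:", "BC:Z:")


def extract_barcode(header: str) -> Optional[str]:
    """Return the barcode from a FASTQ header line, or None if no tag is found."""
    fields = header.rstrip("\n").split()
    # fields[0] is the read name; the remaining fields may hold tags.
    for field in fields[1:]:
        for tag in BARCODE_TAGS:
            if field.startswith(tag):
                return field[len(tag):]
    return None


header = "@COOPER:77:HCYNTBBXX:1:1216:22343:0 BX:Z:AAAAAAAAAACATAGT"
print(extract_barcode(header))  # AAAAAAAAAACATAGT
```

Reads for which the function returns None carry neither tag in this layout.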
For example, if you have an SLR library
lib_slr_1.fastq.gz
lib_slr_2.fastq.gz
the YAML file should look like this:
[
  {
    orientation: "fr",
    type: "clouds10x",
    right reads: [
      "/FULL_PATH_TO_DATASET/lib_slr_2.fastq.gz"
    ],
    left reads: [
      "/FULL_PATH_TO_DATASET/lib_slr_1.fastq.gz"
    ]
  }
]
Synopsis: splitter <graph (in binary or GFA)> <SLR library description (in YAML)> <path to output directory> [OPTION...]
Main options:
  -t                          Number of threads to use (default: 1/2 of available threads)
  --mapping-k                 k-mer length for read mapping (default: 31)
  -Gmdbg|-Gblunt              Assembly graph type: mDBG (LJA) or blunted (Flye)
  -Mdiploid|-Mmeta            Repeat resolution mode (diploid or meta)
  --assembly-info             Path to metaFlye assembly_info.txt file (meta mode, metaFlye graphs only)
Barcode index construction:
  --count-threshold           Minimum number of reads for barcode index
  --frame-size                Resolution of the barcode index
  --length-threshold          Minimum scaffold graph edge length (meta mode option)
  --linkage-distance          Reads are assigned to the same fragment on long edges based on the linkage distance
  --min-read-threshold        Minimum number of reads for path cluster extraction
  --relative-score-threshold  Relative score threshold for path cluster extraction
  --sampling-factor           Downsample input SLR reads by this factor
Repeat resolution:
  --score                     Score threshold for link index
  --tail-threshold            Barcodes are assigned to the first and last <tail_threshold> nucleotides of the edge
  --scaffold-links            Use scaffold links in addition to graph links for repeat resolution
Developer options:
  --ref                       Reference path for repeat resolution evaluation
  --bin-load                  Load read-to-graph alignment
  --debug                     Produce lots of debug data, save read-to-graph alignment
  --tmp-dir                   Scratch directory to use
  -h, --help                  Print help message
Example command lines:
- Assembly produced by LJA from a HiFi diploid human dataset, with a 10X SLR library (HPC compressed)
splitter lja_output/mdbg/mdbg.hpc.gfa 10x_dataset.yaml output -Mdiploid -Gmdbg
- Assembly produced by metaFlye from a metagenomic dataset, with a TELL-Seq SLR library
splitter metaflye_output/assembly_graph.gfa tellseq_dataset.yaml output --assembly-info metaflye_output/assembly_info.txt -Mmeta -Gblunt
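When SpLitteR is launched from a pipeline script, the command lines above can be assembled programmatically. The following Python sketch builds the argument list using only the flags shown in the examples (-M, -G, and --assembly-info). The helper `build_splitter_command` is a hypothetical name, not part of SpLitteR, and the sketch assumes the `splitter` binary is reachable under the name passed as `binary`.

```python
import shlex
from typing import List, Optional


def build_splitter_command(
    graph: str,
    library_yaml: str,
    output_dir: str,
    mode: str,
    graph_type: str,
    assembly_info: Optional[str] = None,
    binary: str = "splitter",  # assumption: binary is on PATH under this name
) -> List[str]:
    """Assemble a splitter argument list from the documented options."""
    if mode not in ("diploid", "meta"):
        raise ValueError(f"unknown repeat resolution mode: {mode}")
    if graph_type not in ("mdbg", "blunt"):
        raise ValueError(f"unknown assembly graph type: {graph_type}")
    cmd = [binary, graph, library_yaml, output_dir]
    if assembly_info is not None:
        # Only meaningful in meta mode with metaFlye graphs.
        cmd += ["--assembly-info", assembly_info]
    cmd += [f"-M{mode}", f"-G{graph_type}"]
    return cmd


cmd = build_splitter_command(
    "lja_output/mdbg/mdbg.hpc.gfa", "10x_dataset.yaml", "output",
    mode="diploid", graph_type="mdbg",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to start the run.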
SpLitteR stores all output files in the output directory <output_dir>, which is set by the user.
- <output_dir>/assembly_graph.gfa: input assembly graph in mDBG encoding
- <output_dir>/resolved_graph.gfa: output assembly graph after repeat resolution
- <output_dir>/contigs.fasta: output scaffolds
In addition
- <output_dir>/edge_transform.tsv: map from input graph edges to resolved graph edges
- <output_dir>/vertex_stats.tsv: statistics for complex vertices
- <output_dir>/resolved_graph.fasta: sequences of the resolved graph edges
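After a run, a quick sanity check is to confirm that the files listed above are present in the output directory. The sketch below does this in Python. The function name `check_outputs` is hypothetical, and the file list is copied from this README rather than queried from SpLitteR itself.

```python
from pathlib import Path
from typing import Dict

# File names as listed in this README.
EXPECTED_OUTPUTS = (
    "assembly_graph.gfa",
    "resolved_graph.gfa",
    "contigs.fasta",
    "edge_transform.tsv",
    "vertex_stats.tsv",
    "resolved_graph.fasta",
)


def check_outputs(output_dir: str) -> Dict[str, bool]:
    """Map each expected output file name to whether it exists in output_dir."""
    root = Path(output_dir)
    return {name: (root / name).is_file() for name in EXPECTED_OUTPUTS}


# Report any missing files for a run written to ./output
missing = [name for name, present in check_outputs("output").items() if not present]
print("missing files:", missing)
```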