Shengfeng Huang, Mingjing Kang, Anlong Xu, HaploMerger2: rebuilding both haploid sub-assemblies from high-heterozygosity diploid genome assembly, Bioinformatics, Volume 33, Issue 16, August 2017, Pages 2577–2579, https://doi.org/10.1093/bioinformatics/btx220
Abstract
De novo assembly is a difficult issue for heterozygous diploid genomes. The advent of high-throughput short-read and long-read sequencing technologies provides both new challenges and potential solutions to the issue. Here, we present HaploMerger2 (HM2), an automated pipeline for rebuilding both haploid sub-assemblies from the polymorphic diploid genome assembly. It is designed to work on pre-existing diploid assemblies, which are typically created by using de novo assemblers. HM2 can process any diploid assemblies, but it is especially suitable for diploid assemblies with high heterozygosity (≥3%), which can be difficult for other tools. This pipeline also implements flexible and sensitive assembly error detection, a hierarchical scaffolding procedure and a reliable gap-closing method for haploid sub-assemblies. Using HM2, we demonstrate that two haploid sub-assemblies reconstructed from a real, highly-polymorphic diploid assembly show greatly improved continuity.
Source code, executables and the testing dataset are freely available at https://github.com/mapleforest/HaploMerger2/releases/.
Supplementary data are available at Bioinformatics online.
1 Introduction
There is an increasing demand for sequencing of heterozygous diploid genomes. However, since the era of Sanger sequencing, de novo assembly of heterozygous diploid genomes has been a difficult issue (Vinson et al., 2005). It becomes more challenging when using massive, short-read sequencing technologies (Zhang et al., 2012). Though several de novo assembly methods and post-assembly methods have been designed to improve heterozygous short-read assemblies (Gnerre et al., 2011; Huang et al., 2012; Kajitani et al., 2014; Pryszcz and Gabaldon, 2016; Safonova et al., 2015), these assemblies hardly reach the same level of quality as non-heterozygous assemblies. The latest high-throughput long-read sequencing technologies provide a promising approach to polymorphic assembly (Berlin et al., 2015; Chin et al., 2013; Koren et al., 2017; Xiao et al., 2016), especially when combined with heterozygosity-aware assembly algorithms (Chin et al., 2016). However, in long-read diploid assemblies, there are still assembly errors, and, more importantly, allelic relations between scaffolds might not be fully resolved. This is because, as the heterozygosity increases, alleles from the same locus are more likely to be mistaken as sequences from different loci.
Previously, we developed HaploMerger (HM), an automated pipeline to resolve allelic relations in a polymorphic diploid assembly and output the reference haploid assembly (Huang et al., 2012). Since HM works on pre-existing diploid assemblies, it can be easily incorporated into any assembly pipeline. Thus far, HM has been used to create over ten published reference assemblies, including large draft genomes for amphioxus (∼450 Mb with ∼4% heterozygosity) and hookworms (∼330 Mb with <1% heterozygosity) (Huang et al., 2014; Schwarz et al., 2015).
We perceive that, because alleles often functionally complement each other in a highly polymorphic genome, both haploid sub-assemblies are necessary to represent the complete genomic landscape. Moreover, the presence of both haploid sub-assemblies enables the study of widespread heterozygosities in a highly polymorphic genome, including single nucleotide polymorphisms, indels, copy-number variation, structural variation and recent transposition.
Here, we provide HaploMerger2 (HM2), a major upgrade over the old pipeline, which we redesigned to reconstruct both haploid sub-assemblies from short-read and long-read diploid assemblies. HM2 can work with both heterozygosity-aware and -unaware genome assemblers and process both low and high heterozygosity assemblies. However, it is especially suitable for difficult tasks in which the diploid assemblies have high heterozygosity (≥3%). Compared with the old pipeline, HM2 also implements more flexible assembly error detection, a hierarchical scaffolding procedure and a reliable gap-closing method on haploid assemblies (Fig. 1). In this applications note, we describe the features and applications of HM2.
2 Software description
2.1 Preparation and requirements
An initial diploid assembly should first be generated by using de novo assemblers. To include as many alleles as possible in the diploid assembly, the de novo assembler should be run with stringent parameters (e.g. low error rates) or in the heterozygosity-aware mode, which forces alleles from the same locus to be assembled and output separately. To avoid false alignments, repetitive sequences in the diploid assembly, including simple repeats, transposable elements and highly duplicated coding sequences, should be soft-masked using WindowMasker and/or RepeatMasker (Morgulis et al., 2006; Tarailo-Graovac and Chen, 2009). To achieve optimal specificity and sensitivity, knowledge of the allelic polymorphism rate and mutational biases is very important. For example, if the heterozygosity is 1% and the alignment threshold is set to 10%, many sequences will be falsely removed as alleles. On the other hand, if the heterozygosity is 10% and the alignment threshold is set to 1%, many true allele pairs will go undetected and remain in the haploid sub-assemblies. HM2 provides tools to infer proper parameters to handle these situations. Finally, owing to algorithmic limitations, HM2 is better suited to diploid assemblies with an initial scaffold N50 size >100 Kb.
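The two input requirements above (soft-masked repeats and a scaffold N50 above 100 Kb) can be checked before running the pipeline. The following Python sketch is not part of HM2; the function names `scaffold_n50`, `soft_masked_fraction` and `read_fasta` are hypothetical helpers introduced here for illustration, and the sketch assumes a plain or gzip-compressed FASTA file in which soft-masked bases are written in lower case.

```python
import gzip


def read_fasta(path):
    """Yield (name, sequence) pairs from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    name, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def scaffold_n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def soft_masked_fraction(sequence):
    """Fraction of A/C/G/T bases that are lower case (soft-masked)."""
    bases = [c for c in sequence if c.upper() in "ACGT"]
    if not bases:
        return 0.0
    return sum(c.islower() for c in bases) / len(bases)
```

For example, `scaffold_n50(len(seq) for _, seq in read_fasta("bbv18wm.fa.gz"))` would report the scaffold N50 of the amphioxus test assembly, which can then be compared against the 100 Kb guideline.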
2.2 Detection and break-up of potential mis-joins
Mis-joins of unrelated genomic portions can be detected by examining the alignments between allelic scaffolds. In the old HM pipeline, mis-join processing and haploid assembly rebuilding were inseparable. In HM2, mis-join processing is redesigned as the first independent module, which allows for choosing optimal parameters and running iterations of the module to maximize error detection. It is worth noting that false detection of mis-joins due to repetitive sequences is suppressed by the initial repeat-masking procedure. In a pair of allelic scaffolds involved in a mis-join, it is difficult to determine which one has the error. Additionally, it is hard to discriminate mis-joins from natural inversions and translocations. Therefore, HM2 breaks up both scaffolds involved in a potential mis-join. The correct connection can be restored later by the scaffolding module.
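The "break both scaffolds" policy can be illustrated with a minimal Python sketch. This is not HM2's implementation: the function `break_scaffolds`, the `(name, position)` breakpoint representation and the `.partN` naming of the resulting pieces are all assumptions made for illustration. The sketch only shows that, once a potential mis-join has been located on a pair of allelic scaffolds, each member of the pair is cut at its own breakpoint and unaffected scaffolds pass through unchanged.

```python
def break_scaffolds(scaffolds, breakpoints):
    """Split scaffolds at suspected mis-join positions.

    scaffolds:   dict mapping scaffold name -> sequence string
    breakpoints: iterable of (name, position) pairs; position is the
                 0-based index at which a new piece starts (hypothetical format)
    """
    cuts = {}
    for name, position in breakpoints:
        cuts.setdefault(name, set()).add(position)

    result = {}
    for name, seq in scaffolds.items():
        # Ignore cut positions that would produce an empty piece.
        positions = sorted(p for p in cuts.get(name, ()) if 0 < p < len(seq))
        if not positions:
            result[name] = seq  # scaffold not involved in any mis-join
            continue
        edges = [0] + positions + [len(seq)]
        for i in range(len(edges) - 1):
            result[f"{name}.part{i + 1}"] = seq[edges[i]:edges[i + 1]]
    return result


# Both allelic scaffolds A and B are cut; C is left untouched.
pieces = break_scaffolds(
    {"A": "AAAACCCC", "B": "GGGGTTTT", "C": "ACGT"},
    [("A", 4), ("B", 4)],
)
```

Cutting both partners is conservative: it may break a correct join, but, as described above, the scaffolding module can restore that connection later.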
2.3 Reconstruction of two separated haploid sub-assemblies
The old version of HM reconstructs allelic relations based on the best reciprocal, mirrored whole-genome alignments of the diploid assembly. Then, a heuristic method is employed to elect the best allele into the reference haploid assembly, whereas the other allele is used to fill the N-gaps or is simply discarded. This procedure might cause the loss of the alternative alleles and excessive switches between the two haplotypes. In HM2, we revised the algorithm to reconstruct both haploid sub-assemblies: the reference sub-assembly and the alternative sub-assembly. Specifically, if two alleles are available for a locus, HM2 separates them into two different sub-assemblies, with the better-quality allele placed in the reference sub-assembly. If only one allele is available for a locus (often because the haplotypes were collapsed or the other allele was simply discarded by the de novo assembler), HM2 puts this allele into both sub-assemblies. In the sub-assemblies, the allelic scaffolds are given the same scaffold name. Finally, because there are switches between haplotypes in the rebuilt haploid sub-assemblies, the sub-assemblies are not haplotype phased.
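The allele assignment rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not HM2's code: the locus records, the `score` field used to rank alleles and the function name `split_alleles` are hypothetical, since the quality criterion HM2 actually applies is not specified here.

```python
def split_alleles(loci):
    """Assign alleles to a reference and an alternative sub-assembly.

    loci: list of dicts of the form
          {"name": str, "alleles": [{"seq": str, "score": float}, ...]}
          with one or two alleles per locus (hypothetical format).
    Returns two lists of (name, sequence) pairs sharing the same names.
    """
    reference, alternative = [], []
    for locus in loci:
        alleles = locus["alleles"]
        if len(alleles) == 2:
            # The better-quality allele goes to the reference sub-assembly.
            better, other = sorted(alleles, key=lambda a: a["score"], reverse=True)
        elif len(alleles) == 1:
            # A single (collapsed or surviving) allele is placed in both.
            better = other = alleles[0]
        else:
            raise ValueError(f"locus {locus['name']} must have 1 or 2 alleles")
        reference.append((locus["name"], better["seq"]))
        alternative.append((locus["name"], other["seq"]))
    return reference, alternative


ref, alt = split_alleles([
    {"name": "sc1", "alleles": [{"seq": "ACGT", "score": 0.8},
                                {"seq": "ACGA", "score": 0.9}]},
    {"name": "sc2", "alleles": [{"seq": "TTTT", "score": 0.5}]},
])
```

Because the choice is made locus by locus, consecutive loci in either output may come from different parental haplotypes, which is why the resulting sub-assemblies are not haplotype phased.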
2.4 Hierarchical scaffolding of the haploid sub-assemblies
In polymorphic diploid assembly, scaffolding with mate-pairs is ineffective because reads of the same pair are often aligned to different haplotypes. This is the major factor that causes heavy fragmentation and excessive assembly errors in polymorphic assembly (Huang et al., 2012; Kajitani et al., 2014; Safonova et al., 2015; Vinson et al., 2005). In HM2, we implement hierarchical scaffolding in the haploid sub-assemblies. Without interference of the different alleles, this re-scaffolding procedure can dramatically improve the sequence continuity. Currently, HM2 invokes a third-party program, SSPACE v3.0, to implement scaffolding (Boetzer et al., 2011). In our experience, SSPACE v3.0 implements a fast, straightforward greedy scaffolding algorithm. However, HM2 also supports other scaffolders, as long as the scaffolders do not remove or add sequences to the sub-assembly. Continuity could be further improved by invoking multiple rounds of the scaffolding module. Additionally, only the reference sub-assembly needs re-scaffolding because HM2 will update the alternative sub-assembly according to the new scaffolding layout of the reference sub-assembly.
2.5 Detection and removal of potential tandem assembly errors
HM2 uses an updated module with several bug fixes and new configurable options. For example, the module can now scan tandems as small as 100 bp and detect tandems of unequal length (option ‘XvY’). In addition, multiple rounds of tandem removal can be performed, usually with decreasing tandem sizes and increasing sensitivity. The removed tandem-assembled sequences are now collected in an output file rather than discarded, as was done before. Finally, users should be careful with this module because it is the only module in HM2 that may lose genomic information.
2.6 N-gap closing
HM2 invokes a third-party program, GapCloser (Luo et al., 2012), to implement N-gap closing. Since gap-filling sequences generated by GapCloser are not always reliable, HM2 re-examines all the gap-filling sequences and retains only the reliable ones. Because this examination is specific to the GapCloser output, HM2 currently does not support other gap-filling software. It is possible to run multiple rounds of gap-filling with different datasets. All gap-filling sequences are annotated in an AGP-formatted file (v.1.1).
3 Sample applications
We provide three examples for testing HM2. The first two examples use an artificial diploid assembly (∼100 Kb) to test if HM2 is installed successfully and functions properly. Both examples can be finished in a minute.
The third example uses a real, highly-polymorphic diploid assembly for a wild-type amphioxus. This assembly was created from a mixture of 454 and Illumina reads (∼60X) using the Celera assembler CABOG v6.1 (Miller et al., 2008). A copy of this assembly can be downloaded from GenBank (accession: AYSR00000000.1), or from our HM2 release website (named ‘bbv18wm.fa.gz’). It has been soft-masked and is ready to use. This assembly contains ∼708 Mb and has a scaffold/contig N50 size of 264 Kb/30 Kb, exhibiting an average rate of allelic polymorphism of ∼4%. After a single round of HM2, we obtain two separated haploid sub-assemblies of ∼406 Mb with a scaffold/contig N50 size of 2.2 Mb/40 Kb. This takes <3 hours on a machine with 12 cores and 64 GB of memory. The results and performance are highly reproducible. A full description of this application is provided in the Supplementary information.
4 Discussion
HM2 works in the post-assembly stage, so its performance is bound by the quality of the initial diploid assembly. For example, if one of the haplotypes is largely missing in the diploid assembly, HM2 cannot recover it. However, a reference haploid sub-assembly is always guaranteed. Owing to algorithmic limitations, HM2 also offers little help if the diploid assembly is too fragmented (i.e. scaffold N50 <100 Kb).
In essence, HM2 is a tool kit comprising a set of executables of independent function, as well as wrappers for winMasker, Lastz, chainNet, SSPACE and GapCloser. The intermediate information and running messages are tracked and documented for each step and function. The pipeline presented here is a special organization of a selection of tools from this kit. Therefore, HM2 can be used for other applications in post-assembly analysis. For example, it can be used to create self-versus-self whole-genome alignments or pairwise alignments between two genome assemblies to detect tandem duplication, further scaffold an assembly and close some N-gaps.
Funding
This work was supported by the 973 Project [grant number 2013CB835305], the National Nature Science Fund [grant number 31171193] and by the National Supercomputer Center in Guangzhou and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund.
Conflict of Interest: none declared.
References