Comparative Study
BMC Bioinformatics. 2010 Jan 15;11:33.
doi: 10.1186/1471-2105-11-33.

Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction

Lance E Palmer et al. BMC Bioinformatics. 2010.

Abstract

Background: With the rapid expansion of DNA sequencing databases, it is now feasible to identify relevant information from prior sequencing projects and completed genomes and apply it to de novo sequencing of new organisms. As an example, this paper demonstrates how such extra information can be used to improve de novo assemblies by augmenting the overlapping step. Finding all pairs of overlapping reads is a key task in many genome assemblers, and to this end, highly efficient algorithms have been developed to find alignments in large collections of sequences. It is well known that, because of repeated sequences, many aligned pairs of reads do not actually overlap in the genome. However, no overlapping algorithm to date takes a rigorous approach to separating aligned but non-overlapping read pairs from true overlaps.

Results: We present an approach that extends the Minimus assembler with a data-driven step that classifies overlaps as true or false prior to contig construction. We trained several different classification models within the Weka framework using various statistics derived from overlaps of reads available from prior sequencing projects. These statistics included percent mismatch and k-mer frequencies within the overlaps, as well as a comparative genomics score derived from mapping reads to multiple reference genomes. We show that in real whole-genome sequencing data from the E. coli and S. aureus genomes, by providing a curated set of overlaps to the contigging phase of the assembler, we nearly doubled the N50 contig length (a length-weighted median) without sacrificing coverage of the genome or increasing the number of mis-assemblies.
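The non-comparative overlap statistics named above (percent mismatch and k-mer frequency quartiles) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of k, the gap-free alignment, and the median-of-halves quartile convention are all assumptions made for illustration, and the comparative genomics score is omitted because the abstract does not define it.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every k-mer (substring of length k) of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def percent_mismatch(aligned_a, aligned_b):
    """Percentage of differing positions in two gap-free aligned segments."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("segments must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(aligned_a, aligned_b))
    return 100.0 * mismatches / len(aligned_a)


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower, upper = ordered[:half], ordered[n - half:]
    if not lower:  # a single value is its own quartile summary
        return ordered[0], ordered[0], ordered[0]
    return median(lower), median(ordered), median(upper)


def overlap_features(overlap_a, overlap_b, kmer_counts, k):
    """Hypothetical feature vector for one candidate overlap.

    overlap_a and overlap_b are the aligned overlapping segments of the
    two reads; k-mer frequencies are looked up for the k-mers of overlap_a.
    """
    freqs = [kmer_counts[km] for km in kmers(overlap_a, k)]
    if not freqs:
        raise ValueError("overlap is shorter than k")
    q1, q2, q3 = quartiles(freqs)
    return {
        "percent_mismatch": percent_mismatch(overlap_a, overlap_b),
        "kmer_q1": q1,
        "kmer_median": q2,
        "kmer_q3": q3,
    }


# Toy example: the suffix of the first read aligns to the prefix of the second.
reads = ["ACGTACGT", "TACGTTTT"]
counts = count_kmers(reads, k=3)
features = overlap_features("TACGT", "TACGT", counts, k=3)
```

A feature dictionary like this, computed for overlaps whose true or false status is known from a finished genome, is the kind of labeled input a classifier could be trained on. High k-mer frequencies inside an overlap suggest repeated sequence, which is where aligned but non-overlapping read pairs are expected to arise.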

Conclusions: Machine learning methods that use comparative and non-comparative features to classify overlaps as true or false can be used to improve the quality of a sequence assembly.

Figures

Figure 1
Overlap statistics for E. coli MG1655 reads. The percent mismatch of the alignment between the two reads (a, b), the first quartile k-mer frequency of k-mers within the overlap (c), the median k-mer frequency (d), the third quartile k-mer frequency (e), and the comparative overlap score (f) are plotted for both true and false overlaps. The results are normalized for percentages of total overlaps for each of the true and false overlaps (a, c, d, e, f) or by overall count (b). The number of total true overlaps with 0 mismatches is 5,209,686.
Figure 2
Assembly of E. coli strain MG1655. Statistics from overlaps derived from S. typhi training reads were used to train a J48 Weka model. Overlaps from the MG1655 test data were classified based on this model and any overlaps predicted to be false were removed. The remaining overlaps were used in the assembly of MG1655. The N50 contig length of the final assembly as well as the percentage of the reference MG1655 genome matched by the contigs are plotted. The percent cutoff for strains to be analyzed with the comparative score is shown in parentheses. For the 'One related genome' and 'Two related genomes' data, only one related genome (ATCC8739 for the test set) and two related genomes (ATCC8739 and E24377A for the test set), respectively, were used for the training and test sets.
Figure 3
Visualizing alignments of contigs. Contigs of assembled E. coli MG1655 reads from the uncorrected (top) and J48 corrected (bottom) overlaps were mapped to the MG1655 genome using SNAPPER and ordered by their position. MAUVE was used to visualize alignments of assembled contigs to the reference genome. Each segment represents a matching alignment between the contigs and the reference genome. There may be more than one segment per contig. Red vertical lines represent contig boundaries. Colored blocks represent regions where contigs are in the correct order. Colored lines connect corresponding blocks between the reference genome and the assembled contigs. White spaces inside the blocks of the reference genome indicate regions not represented within a contig.
Figure 4
Assembly of S. aureus strain JH1. Statistics from overlaps derived from the JH9 training reads were used to train a J48 Weka model. Overlaps from the JH1 test data were classified based on this model and any overlaps predicted to be false were removed. The remaining overlaps were used in the assembly of JH1. The N50 contig length of the final assembly as well as the percentage of the reference JH1 genome matched by the contigs are plotted. The percent cutoff for strains to be analyzed with the comparative score is shown in parentheses.
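The N50 contig length plotted in Figures 2 and 4 can be computed with a short Python sketch. It uses the common definition of N50 (the largest length L such that contigs of length L or longer contain at least half of all assembled bases); the function name is an assumption for illustration, and the exact convention the authors' tooling used is not stated in the abstract.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# One long contig dominates the assembly, so N50 is far above the plain median.
lengths = [1, 1, 1, 1, 10]
assembly_n50 = n50(lengths)
```

Because N50 weights each contig by its length, it rewards assemblies in which most bases sit in long contigs, which is why it is reported alongside the percentage of the reference genome matched rather than in place of it.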
