Improving de novo sequence assembly using machine learning and comparative genomics for overlap correction
- PMID: 20078885
- PMCID: PMC2824677
- DOI: 10.1186/1471-2105-11-33
Abstract
Background: With the rapid expansion of DNA sequencing databases, it is now feasible to identify relevant information from prior sequencing projects and completed genomes and apply it to de novo sequencing of new organisms. As an example, this paper demonstrates how such extra information can be used to improve de novo assemblies by augmenting the overlapping step. Finding all pairs of overlapping reads is a key task in many genome assemblers, and to this end, highly efficient algorithms have been developed to find alignments in large collections of sequences. It is well known that, because of repeated sequences, many pairs of reads that align do not actually overlap in the genome. However, no overlapping algorithm to date takes a rigorous approach to separating aligned but non-overlapping read pairs from true overlaps.
Results: We present an approach that extends the Minimus assembler with a data-driven step that classifies overlaps as true or false prior to contig construction. We trained several different classification models within the Weka framework using various statistics derived from overlaps of reads available from prior sequencing projects. These statistics included percent mismatch and k-mer frequencies within the overlaps, as well as a comparative genomics score derived from mapping reads to multiple reference genomes. We show that in real whole-genome sequencing data from the E. coli and S. aureus genomes, providing a curated set of overlaps to the contigging phase of the assembler nearly doubled the N50 contig length (a length-weighted median) without sacrificing coverage of the genome or increasing the number of mis-assemblies.
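The quantities named in the Results can be illustrated with a minimal Python sketch. This is not the authors' implementation (which extends Minimus and trains models in Weka); the function names `percent_mismatch`, `kmer_counts`, and `n50` are hypothetical, the overlap is assumed to be given as two gap-free aligned strings of equal length, and N50 follows the common definition: the largest length L such that contigs of length at least L contain at least half of the assembled bases.

```python
from collections import Counter


def percent_mismatch(a: str, b: str) -> float:
    """Percentage of differing positions in a gap-free aligned overlap.

    Assumption: `a` and `b` are the overlapping regions of two reads,
    already aligned to equal length with no gaps.
    """
    if not a or len(a) != len(b):
        raise ValueError("aligned overlap strings must be non-empty and of equal length")
    mismatches = sum(x != y for x, y in zip(a, b))
    return 100.0 * mismatches / len(a)


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of `seq`."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def n50(contig_lengths: list[int]) -> int:
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    # Illustrative inputs only; these are not data from the paper.
    print(percent_mismatch("ACGTACGT", "ACGTACGA"))
    print(kmer_counts("ACGACG", 3))
    print(n50([2, 3, 4, 5, 6]))
```

In a pipeline of the kind the abstract describes, values such as these would be computed per candidate overlap and passed, together with the comparative genomics score, to a classifier. How the paper combines or normalizes the features is not stated in the abstract, so that step is left out here.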
Conclusions: Machine learning methods that use comparative and non-comparative features to classify overlaps as true or false can be used to improve the quality of a sequence assembly.