Abstract
Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10â25âkilobases), accurate âHiFiâ reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformerâencoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9âmegabases (Mb) to 17.2âMb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained to the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 /Â 30Â days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
$209.00 per year
only $17.42 per issue
Buy this article
- Purchase on SpringerLink
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Data availability
Sequencing data, predictions and analysis files are available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/deepconsensus/publication.
Code availability
Code and pretrained models are available at https://github.com/google/deepconsensus. Sequencing data are available from the following sources:
â Sequel II data from Novogene42 at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing
â 15-kb HG002 and 24-kb HG002 reads from PacBio at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/deepconsensus/publication/sequencing
â Sequel II data from PacBio at https://downloads.pacbcloud.com/public/dataset/HG002_SV_and_SNV_CCS/
â HG002 diploid assembly at https://obj.umiacs.umd.edu/marbl_publications/hicanu/hg002_hifi_hicanu_combined.fasta.gz
References
Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53â59 (2008).
Travers, K. J., Chin, C.-S., Rank, D. R., Eid, J. S. & Turner, S. W. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res. 38, e159 (2010).
Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods 19, 687â695 (2022).
Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science. 376, eabl4178 (2022).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155â1162 (2019).
Olson, N. D. et al. precisionFDA Truth Challenge V2: calling variants from short- and long-reads in difficult-to-map regions. Cell Genom. 2, 100129 (2022).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44 (2022).
Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597â614 (2020).
Chin, C.-S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563â569 (2013).
Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050â1054 (2016).
Vaser, R., SoviÄ, I., Nagarajan, N. & Å ikiÄ, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737â746 (2017).
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).
Shafin, K. et al. Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks. Nat. Methods 18, 1322â1332 (2021).
Vaswani, A. et al. Attention is all you need. Preprint at https://doi.org/10.48550/arXiv.1706.03762 (2017).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2018).
Dosovitskiy, A. et al. An image is worth 16âÃâ16 words: transformers for image recognition at scale. Preprint at https://doi.org/10.48550/arXiv.2010.11929 (2020).
Rao, R. et al. MSA transformer. Preprint at bioRxiv https://doi.org/10.1101/2021.02.12.430858 (2021).
The AlphaFold team. AlphaFold: a solution to a 50-year-old grand challenge in biology. DeepMind https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology
Mensch, A. & Blondel, M. Differentiable dynamic programming for structured prediction and attention. Proc. 35th International Conference on Machine Learning 80, 3462â3471 (2018).
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291â1305 (2020).
Lal, A. et al. Improving long-read consensus sequencing accuracy with deep learning. Preprint at bioRxiv https://doi.org/10.1101/2021.06.28.450238 (2021).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170â175 (2021).
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072â1075 (2013).
Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595â597 (2018).
Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2020).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094â3100 (2018).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983â987 (2018).
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555â560 (2019).
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253â2255 (2020).
Warren, R. L. et al. ntEdit: scalable genome sequence polishing. Bioinformatics 35, 4430â4432 (2019).
Morisse, P., Marchet, C., Limasset, A., Lecroq, T. & Lefebvre, A. Scalable long read self-correction and assembly polishing with multiple sequence alignment. Sci. Rep. 11, 761 (2021).
Islam, S. et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163â166 (2014).
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044â1053 (2020).
Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196â1203 (2021).
Huang, Z. et al. CCNet: criss-cross attention for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 603â612 (2020).
Choromanski, K. et al. Rethinking attention with performers. Preprint at https://doi.org/10.48550/arXiv.2009.14794 (2020).
Wang, S., Li, B. Z., Khabsa, M., Fang, H. & Ma, H. Linformer: self-attention with linear complexity. Preprint at https://doi.org/10.48550/arXiv.2006.04768 (2020).
Katharopoulos, A., Vyas, A., Pappas, N. & Fleuret, F. Transformers are RNNs: fast autoregressive transformers with linear attention. Preprint at https://doi.org/10.48550/arXiv.2006.16236 (2020).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://doi.org/10.48550/arXiv.1412.6980 (2014).
Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).
Acknowledgements
We thank F. Liu of the Google TensorFlow Model Garden team for improving our use of open-source implementation of the transformer architecture.
Author information
Authors and Affiliations
Contributions
G.B., P.-C.C. and A.C. conceived the study. G.B. and D.E.C. wrote DeepConsensus and trained models. G.B., D.E.C., K.S., T.Y., M.N. and A.B. performed experiments with DeepConsensus reads and made figures and documentation. F.L.-L., Q.B. and J.-P.V. conceived and implemented the alignment loss strategy, which D.E.C. integrated into DeepConsensus. A.M.W., W.J.R. and A.T. provided insight into PacBio data, identified areas for improvement, suggested informative features and provided code for preprocessing and evaluation. W.A. experimented with embedding strategies. A.K. and A.T. contributed to efficient processing of PacBio reads. H.Y. coordinated data acquisition and research agreements. J.-P.V., A.V., C.Y.M., M.N., P.-C.C. and A.C. provided guidance on experimental design, architecture and code review. G.B., D.E.C., K.S., T.Y., F.L.-L., Q.B., A.M.W., W.J.R., M.N., J.-P.V., A.V., C.Y.M., P.-C.C. and A.C. wrote the paper.
Corresponding author
Ethics declarations
Competing interests
G.B., D.E.C., K.S., T.Y., F.L.-L., Q.B., A.B., M.N., H.Y., A.K., W.A., J.-P.V., A.V., C.Y.M., P.-C.C. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. A.M.W., A.T. and W.J.R. are full-time employees and shareholders of Pacific Biosciences. This study was funded by Google LLC.
Peer review
Peer review information
Nature Biotechnology thanks Justin Zook, Andrey Bzikadze and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisherâs note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 DeepConsensus with longer reads improves genome assembly contiguity.
(a) HG002 read length distribution for 15kb and 24kb DeepConsensus reads from two SMRT Cells. (b) Contiguity of the HG002 hifiasm assembly with 15kb and 24kb DeepConsensus reads from two SMRT Cells. (c) HG002 variant calling performance for 15kb and 24kb reads from DeepConsensus for two SMRT Cells.
Supplementary information
Supplementary Information
Supplementary Figs. 1â11, Supplementary Tables 1â29 and documentation of software commands used.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Baid, G., Cook, D.E., Shafin, K. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat Biotechnol 41, 232â238 (2023). https://doi.org/10.1038/s41587-022-01435-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41587-022-01435-7
This article is cited by
-
Chromosome-level genome assembly of the sacoglossan sea slug Elysia timida (Risso, 1818)
BMC Genomics (2024)
-
Genomic resources, opportunities, and prospects for accelerated improvement of millets
Theoretical and Applied Genetics (2024)
-
Pangenome graph construction from genome alignments with Minigraph-Cactus
Nature Biotechnology (2024)
-
Mabs, a suite of tools for gene-informed genome assembly
BMC Bioinformatics (2023)
-
Comparing methods for constructing and representing human pangenome graphs
Genome Biology (2023)