DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer

Baid, Gunjan; Cook, Daniel E.; Shafin, Kishwar; Yun, Taedong; Llinares-López, Felipe; Berthet, Quentin; Belyaeva, Anastasiya; Töpfer, Armin; Wenger, Aaron M.; Rowell, William J.; Yang, Howard; Kolesnikov, Alexey; Ammar, Waleed; Vert, Jean-Philippe; Vaswani, Ashish; McLean, Cory Y.; Nattestad, Maria; Chang, Pi-Chuan; Carroll, Andrew

doi:10.1038/s41587-022-01435-7

Article
Published: 01 September 2022

Gunjan Baid¹,
Daniel E. Cook¹,
Kishwar Shafin¹ (ORCID: orcid.org/0000-0001-5252-3434),
Taedong Yun¹ (ORCID: orcid.org/0000-0002-6242-5536),
Felipe Llinares-López¹,
Quentin Berthet¹,
Anastasiya Belyaeva¹,
Armin Töpfer² (ORCID: orcid.org/0000-0003-1637-1466),
Aaron M. Wenger²,
William J. Rowell² (ORCID: orcid.org/0000-0002-7422-1194),
Howard Yang¹,
Alexey Kolesnikov¹,
Waleed Ammar¹,
Jean-Philippe Vert¹ (ORCID: orcid.org/0000-0001-9510-8441),
Ashish Vaswani¹,
Cory Y. McLean¹ (ORCID: orcid.org/0000-0001-9928-8216),
Maria Nattestad¹,
Pi-Chuan Chang¹ (ORCID: orcid.org/0000-0003-3021-6446) &
Andrew Carroll¹ (ORCID: orcid.org/0000-0002-4824-6689)

Nature Biotechnology volume 41, pages 232–238 (2023)

Abstract

Circular consensus sequencing with Pacific Biosciences (PacBio) technology generates long (10–25 kilobases), accurate ‘HiFi’ reads by combining serial observations of a DNA molecule into a consensus sequence. The standard approach to consensus generation, pbccs, uses a hidden Markov model. We introduce DeepConsensus, which uses an alignment-based loss to train a gap-aware transformer–encoder for sequence correction. Compared to pbccs, DeepConsensus reduces read errors by 42%. This increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27% and at Q40 by 90%. With two SMRT Cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9 megabases (Mb) to 17.2 Mb), increase gene completeness (94% to 97%), reduce the false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (Q43 to Q45) and reduce variant-calling errors by 24%. DeepConsensus models could be trained to the general problem of analyzing the alignment of other types of sequences, such as unique molecular identifiers or genome assemblies.
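The Q values quoted in the abstract (Q20, Q30, Q40, and Q43 to Q45 for assembly base accuracy) are Phred-scaled qualities, where Q = −10·log10(p) and p is the error probability. The sketch below is illustrative only: the function names are hypothetical, and averaging per-base error probabilities to obtain a read-level quality is an assumption made for the example, not a description of how the paper or pbccs computes read quality.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred-scaled quality Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred-scaled quality for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def mean_read_quality(base_quals):
    """Read-level quality from per-base qualities.

    Assumption for illustration: average the error probabilities
    (not the Q values), then convert the mean back to the Phred scale.
    """
    probs = [phred_to_error_prob(q) for q in base_quals]
    return error_prob_to_phred(sum(probs) / len(probs))


if __name__ == "__main__":
    # The thresholds named in the abstract, expressed as error rates.
    for q in (20, 30, 40):
        p = phred_to_error_prob(q)
        print(f"Q{q}: about 1 error per {round(1 / p):,} bases")
```

Because the scale is logarithmic, each step of 10 in Q corresponds to a tenfold lower error rate, so the move from Q43 to Q45 reported for assembly base accuracy is a reduction in error rate of roughly 37% (a factor of 10^0.2).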

**Fig. 2: DeepConsensus improves the accuracy of CCS reads.**

**Fig. 3: DeepConsensus improves the contiguity and quality of the genome assemblies generated with hifiasm.**

**Fig. 4: DeepConsensus improves variant-calling performance of DeepVariant.**

Data availability

Sequencing data, predictions and analysis files are available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/deepconsensus/publication.

Code availability

Code and pretrained models are available at https://github.com/google/deepconsensus. Sequencing data are available from the following sources:

∙ Sequel II data from Novogene⁴² at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing

∙ 15-kb HG002 and 24-kb HG002 reads from PacBio at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/deepconsensus/publication/sequencing

∙ Sequel II data from PacBio at https://downloads.pacbcloud.com/public/dataset/HG002_SV_and_SNV_CCS/

∙ HG002 diploid assembly at https://obj.umiacs.umd.edu/marbl_publications/hicanu/hg002_hifi_hicanu_combined.fasta.gz

References

1. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
2. Travers, K. J., Chin, C.-S., Rank, D. R., Eid, J. S. & Turner, S. W. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res. 38, e159 (2010).
3. Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods 19, 687–695 (2022).
4. Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 (2022).
5. Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
6. Olson, N. D. et al. precisionFDA Truth Challenge V2: calling variants from short- and long-reads in difficult-to-map regions. Cell Genom. 2, 100129 (2022).
7. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44 (2022).
8. Logsdon, G. A., Vollger, M. R. & Eichler, E. E. Long-read human genome sequencing and its applications. Nat. Rev. Genet. 21, 597–614 (2020).
9. Chin, C.-S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
10. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
11. Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
12. Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).
13. Shafin, K. et al. Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks. Nat. Methods 18, 1322–1332 (2021).
14. Vaswani, A. et al. Attention is all you need. Preprint at https://doi.org/10.48550/arXiv.1706.03762 (2017).
15. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://doi.org/10.48550/arXiv.1810.04805 (2018).
16. Dosovitskiy, A. et al. An image is worth 16 × 16 words: transformers for image recognition at scale. Preprint at https://doi.org/10.48550/arXiv.2010.11929 (2020).
17. Rao, R. et al. MSA transformer. Preprint at bioRxiv https://doi.org/10.1101/2021.02.12.430858 (2021).
18. The AlphaFold team. AlphaFold: a solution to a 50-year-old grand challenge in biology. DeepMind https://deepmind.com/blog/article/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology
19. Mensch, A. & Blondel, M. Differentiable dynamic programming for structured prediction and attention. Proc. 35th International Conference on Machine Learning 80, 3462–3471 (2018).
20. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
21. Lal, A. et al. Improving long-read consensus sequencing accuracy with deep learning. Preprint at bioRxiv https://doi.org/10.1101/2021.06.28.450238 (2021).
22. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
23. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
24. Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
25. Wagner, J. et al. Benchmarking challenging small variants with linked and long reads. Cell Genom. 2, 100128 (2020).
26. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
27. Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
28. Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
29. Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).
30. Warren, R. L. et al. ntEdit: scalable genome sequence polishing. Bioinformatics 35, 4430–4432 (2019).
31. Morisse, P., Marchet, C., Limasset, A., Lecroq, T. & Lefebvre, A. Scalable long read self-correction and assembly polishing with multiple sequence alignment. Sci. Rep. 11, 761 (2021).
32. Islam, S. et al. Quantitative single-cell RNA-seq with unique molecular identifiers. Nat. Methods 11, 163–166 (2014).
33. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
34. Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
35. Huang, Z. et al. CCNet: criss-cross attention for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 603–612 (2020).
36. Choromanski, K. et al. Rethinking attention with performers. Preprint at https://doi.org/10.48550/arXiv.2009.14794 (2020).
37. Wang, S., Li, B. Z., Khabsa, M., Fang, H. & Ma, H. Linformer: self-attention with linear complexity. Preprint at https://doi.org/10.48550/arXiv.2006.04768 (2020).
38. Katharopoulos, A., Vyas, A., Pappas, N. & Fleuret, F. Transformers are RNNs: fast autoregressive transformers with linear attention. Preprint at https://doi.org/10.48550/arXiv.2006.16236 (2020).
39. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
40. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://doi.org/10.48550/arXiv.1412.6980 (2014).
41. Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
42. Baid, G. et al. An extensive sequence dataset of gold-standard samples for benchmarking and development. Preprint at bioRxiv https://doi.org/10.1101/2020.12.11.422022 (2020).

Acknowledgements

We thank F. Liu of the Google TensorFlow Model Garden team for improving our use of the open-source implementation of the transformer architecture.

Author information

These authors contributed equally: Gunjan Baid, Daniel E. Cook, Maria Nattestad, Pi-Chuan Chang, Andrew Carroll.

Authors and Affiliations

Google LLC, Mountain View, CA, USA
Gunjan Baid, Daniel E. Cook, Kishwar Shafin, Taedong Yun, Felipe Llinares-López, Quentin Berthet, Anastasiya Belyaeva, Howard Yang, Alexey Kolesnikov, Waleed Ammar, Jean-Philippe Vert, Ashish Vaswani, Cory Y. McLean, Maria Nattestad, Pi-Chuan Chang & Andrew Carroll
Pacific Biosciences, Menlo Park, CA, USA
Armin Töpfer, Aaron M. Wenger & William J. Rowell

Contributions

G.B., P.-C.C. and A.C. conceived the study. G.B. and D.E.C. wrote DeepConsensus and trained models. G.B., D.E.C., K.S., T.Y., M.N. and A.B. performed experiments with DeepConsensus reads and made figures and documentation. F.L.-L., Q.B. and J.-P.V. conceived and implemented the alignment loss strategy, which D.E.C. integrated into DeepConsensus. A.M.W., W.J.R. and A.T. provided insight into PacBio data, identified areas for improvement, suggested informative features and provided code for preprocessing and evaluation. W.A. experimented with embedding strategies. A.K. and A.T. contributed to efficient processing of PacBio reads. H.Y. coordinated data acquisition and research agreements. J.-P.V., A.V., C.Y.M., M.N., P.-C.C. and A.C. provided guidance on experimental design, architecture and code review. G.B., D.E.C., K.S., T.Y., F.L.-L., Q.B., A.M.W., W.J.R., M.N., J.-P.V., A.V., C.Y.M., P.-C.C. and A.C. wrote the paper.

Corresponding author

Correspondence to Andrew Carroll.

Ethics declarations

Competing interests

G.B., D.E.C., K.S., T.Y., F.L.-L., Q.B., A.B., M.N., H.Y., A.K., W.A., J.-P.V., A.V., C.Y.M., P.-C.C. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. A.M.W., A.T. and W.J.R. are full-time employees and shareholders of Pacific Biosciences. This study was funded by Google LLC.

Peer review

Peer review information

Nature Biotechnology thanks Justin Zook, Andrey Bzikadze and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 DeepConsensus with longer reads improves genome assembly contiguity.

(a) HG002 read length distribution for 15-kb and 24-kb DeepConsensus reads from two SMRT Cells. (b) Contiguity of the HG002 hifiasm assembly with 15-kb and 24-kb DeepConsensus reads from two SMRT Cells. (c) HG002 variant-calling performance for 15-kb and 24-kb reads from DeepConsensus for two SMRT Cells.
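Assembly contiguity is commonly summarized by NG50, the metric quoted in the abstract: the length of the contig at which the contigs, taken from longest to shortest, first cover half of the estimated genome size. The sketch below shows that definition on toy numbers. The function name and the toy contig lengths are hypothetical, and the code is not taken from the paper's evaluation pipeline.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly, or None if it is undefined.

    Contigs are visited from longest to shortest; NG50 is the length of the
    contig at which the running total first reaches half of genome_size.
    Passing the total assembly length as genome_size gives N50 instead.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    # The assembly covers less than half of the genome size.
    return None


if __name__ == "__main__":
    toy_contigs = [40, 30, 20, 10, 5]  # hypothetical lengths, arbitrary units
    print("NG50 for a genome of size 200:", ng50(toy_contigs, 200))
    print("N50 (genome size = assembly size):", ng50(toy_contigs, sum(toy_contigs)))
```

Using the estimated genome size rather than the assembly's own total length keeps the metric comparable between assemblies of the same genome that differ in total size.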

Supplementary information

Supplementary Information

Supplementary Figs. 1–11, Supplementary Tables 1–29 and documentation of software commands used.

Reporting Summary

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Baid, G., Cook, D.E., Shafin, K. et al. DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer. Nat Biotechnol 41, 232–238 (2023). https://doi.org/10.1038/s41587-022-01435-7

Received: 28 October 2021
Accepted: 15 July 2022
Published: 01 September 2022
Issue Date: February 2023
DOI: https://doi.org/10.1038/s41587-022-01435-7
