Personalized pangenome references

Sirén, Jouni; Eskandar, Parsa; Ungaro, Matteo Tommaso; Hickey, Glenn; Eizenga, Jordan M.; Novak, Adam M.; Chang, Xian; Chang, Pi-Chuan; Kolmogorov, Mikhail; Carroll, Andrew; Monlong, Jean; Paten, Benedict

doi:10.1038/s41592-024-02407-2

Article
Published: 11 September 2024

Nature Methods volume 21, pages 2017–2023 (2024)

Abstract

Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when comparing a sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with by filtering rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit (https://github.com/vgteam/vg) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. This reduces small variant genotyping errors by four times relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.
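The idea of sampling local haplotypes according to k-mer counts in the reads can be illustrated with a toy sketch. This is not the vg implementation: the function names, the simple +1/-1 scoring rule and the `min_count` threshold are assumptions made purely for illustration. For one block of the pangenome, each candidate haplotype is scored by how many of its k-mers are supported by the reads, and the best-scoring haplotypes are kept.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def count_read_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_haplotypes(block_haplotypes, read_counts, k, n, min_count=2):
    """Pick the n haplotypes in one block that best match the reads.

    Hypothetical scoring: +1 for each haplotype k-mer seen at least
    min_count times in the reads, -1 for each k-mer seen less often.
    Ties are broken by haplotype name so the result is deterministic.
    """
    scores = {}
    for name, seq in block_haplotypes.items():
        score = 0
        for kmer in kmers(seq, k):
            score += 1 if read_counts[kmer] >= min_count else -1
        scores[name] = score
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:n]


# Toy block with three candidate haplotypes; the reads come from hapA.
block = {
    "hapA": "ACGTACGGTA",
    "hapB": "ACGTTCGGTA",
    "hapC": "ACGAACGGTA",
}
reads = ["ACGTACGG", "CGTACGGTA", "ACGTACGGTA"]
counts = count_read_kmers(reads, k=4)
selected = sample_haplotypes(block, counts, k=4, n=1)
print(selected)
```

A real implementation would also have to handle sequencing errors, heterozygous sites and the stitching of haplotypes across adjacent blocks, none of which this sketch attempts.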

**Fig. 1: Illustrating haplotype sampling at adjacent blocks in the pangenome.**

**Fig. 2: Mapping 30× NovaSeq reads for HG002 to GRCh38 (with BWA-MEM) and to HPRC graphs (with Giraffe).**

**Fig. 3: Small variants evaluation across samples HG001 to HG005.**

Data availability

This work was done using publicly available data. HPRC v.1.1 graphs and VCF files for the variants included in them are available at https://github.com/human-pangenomics/hpp_pangenome_resources. The underlying assemblies, including GRCh38, can be found at https://github.com/human-pangenomics/HPP_Year1_Assemblies. We used Illumina and Element short reads for HG001, HG002, HG003, HG004 and HG005, available at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/sequencing/fastq/novaseq/wgs_pcr_free and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/element/cloudbreak_wgs, respectively. The GIAB small variant benchmark sets for the same samples can be found at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/. The GIAB and challenging medically relevant gene SV sets for HG002 are available at the same location. The T2T assembly of HG002 is available at https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v0.9.fasta.gz. See Supplementary Section 1 for further details.

Code availability

The haplotype sampling approach described in this article is part of the vg toolkit, available under the MIT license at https://github.com/vgteam/vg. An example dataset is provided in the directory test/haplotype-sampling. Documentation can be found at https://github.com/vgteam/vg/wiki/Haplotype-Sampling. See Supplementary Sections 4 and 5 for details on other software used.

Acknowledgements

This work was supported in part by the National Human Genome Research Institute and the National Institutes of Health (NIH). B.P. was partly supported by NIH grant nos. R01HG010485, U24HG010262, U24HG011853, OT3HL142481, U01HG010961 and OT2OD033761. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Authors and Affiliations

UC Santa Cruz Genomics Institute, University of California, Santa Cruz, CA, USA
Jouni Sirén, Parsa Eskandar, Matteo Tommaso Ungaro, Glenn Hickey, Jordan M. Eizenga, Adam M. Novak, Xian Chang, Jean Monlong & Benedict Paten
University of Ferrara, Ferrara, Italy
Matteo Tommaso Ungaro
Google LLC, Mountain View, CA, USA
Pi-Chuan Chang & Andrew Carroll
Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
Mikhail Kolmogorov
Institut de Recherche en Santé Digestive, Université de Toulouse, INSERM, INRA, ENVT, UPS, Toulouse, France
Jean Monlong

Contributions

J.S. and B.P. conceived the method for haplotype sampling, and J.S. developed and implemented it. J.S., P.E., M.T.U. and M.K. performed the analyses shown in the paper. J.S., G.H., J.M.E., A.M.N., X.C. and J.M. contributed to the vg software on which the method is based and helped modify it for this work. P.-C.C. and A.C. trained and provided support on using DeepVariant for the paper. J.S., P.E., M.T.U. and B.P. wrote the paper. All authors reviewed and edited the draft.

Corresponding authors

Correspondence to Jouni Sirén or Benedict Paten.

Ethics declarations

Competing interests

P.-C.C. and A.C. are employees of Google LLC and own Alphabet stock as part of the standard compensation package. The other authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Rayan Chikhi and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Lei Tang, in collaboration with the Nature Methods team.

Additional information

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Sections 1–5, Tables 1–4 and Figs. 1–4.

Reporting Summary

Peer Review File

Supplementary Tables 5–13

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sirén, J., Eskandar, P., Ungaro, M.T. et al. Personalized pangenome references. Nat Methods 21, 2017–2023 (2024). https://doi.org/10.1038/s41592-024-02407-2

Received: 18 December 2023
Accepted: 06 August 2024
Published: 11 September 2024
Issue Date: November 2024
DOI: https://doi.org/10.1038/s41592-024-02407-2

This article is cited by

Constructing and personalizing population pangenome graphs
- Rayan Chikhi
- Yoann Dufresne
- Paul Medvedev
Nature Methods (2024)
