Abstract
Endogenous viral elements (EVEs), viruses that have integrated their genomes into those of their hosts, are prevalent in eukaryotes and have an important role in genome evolution1,2. The vast majority of EVEs that have been identified to date are small genomic regions comprising a few genes2, but recent evidence suggests that some large double-stranded DNA viruses may also endogenize into the genome of the host1. Nucleocytoplasmic large DNA viruses (NCLDVs) have recently become of great interest owing to their large genomes and complex evolutionary origins3,4,5,6, but it is not yet known whether they are a prominent component of eukaryotic EVEs. Here we report the widespread endogenization of NCLDVs in diverse green algae; these giant EVEs reached sizes greater than 1 million base pairs and contained up to approximately 10% of the total open reading frames in some genomes, substantially increasing the scale of known viral genes in eukaryotic genomes. These endogenized elements often shared genes with host genomic loci and contained numerous spliceosomal introns and large duplications, suggesting tight assimilation into host genomes. NCLDVs contain large and mosaic genomes with genes derived from multiple sources, and their endogenization represents an underappreciated conduit of new genetic material into eukaryotic lineages that can substantially impact genome composition.
Data availability
Nucleotide and protein sequences specific to each of the GEVEs, hallmark gene set used for phylogenetic analyses, alignments for all phylogenies presented, HMM profiles of the core genes and NCVOG families, and other data products are available at: https://zenodo.org/record/3975964#.XzFj0hl7mfZ.
Code availability
A custom bioinformatic pipeline (ViralRecall) was developed in Python 3.5 for this study. The code is publicly available on the Aylward lab GitHub: https://github.com/faylward/viralrecall. For NCLDV marker gene detection, we also used a custom Python script available on GitHub: https://github.com/faylward/ncldv_markersearch. Other bioinformatic analyses in this study were performed using publicly available bioinformatic tools and are described in the Methods.
References
Feschotte, C. & Gilbert, C. Endogenous viruses: insights into viral evolution and impact on host biology. Nat. Rev. Genet. 13, 283–296 (2012).
Holmes, E. C. The evolution of endogenous viral elements. Cell Host Microbe 10, 368–377 (2011).
Fischer, M. G. Giant viruses come of age. Curr. Opin. Microbiol. 31, 50–57 (2016).
Wilhelm, S. W. et al. A student’s guide to giant viruses infecting small eukaryotes: from Acanthamoeba to zooxanthellae. Viruses 9, 46 (2017).
Abergel, C., Legendre, M. & Claverie, J.-M. The rapidly expanding universe of giant viruses: Mimivirus, Pandoravirus, Pithovirus and Mollivirus. FEMS Microbiol. Rev. 39, 779–796 (2015).
Weynberg, K. D., Allen, M. J. & Wilson, W. H. Marine prasinoviruses and their tiny plankton hosts: a review. Viruses 9, 43 (2017).
Bhattacharya, D. & Medlin, A. L. Algal phylogeny and the origin of land plants. Plant Physiol. 116, 9–15 (1998).
Jeanniard, A. et al. Towards defining the chloroviruses: a genomic journey through a genus of large DNA viruses. BMC Genomics 14, 158 (2013).
Moniruzzaman, M., Martinez-Gutierrez, C. A., Weinheimer, A. R. & Aylward, F. O. Dynamic genome evolution and complex virocell metabolism of globally-distributed giant viruses. Nat. Commun. 11, 1710 (2020).
Filée, J. Genomic comparison of closely related giant viruses supports an accordion-like model of evolution. Front. Microbiol. 6, 593 (2015).
Van Etten, J. L. et al. Chloroviruses have a sweet tooth. Viruses 9, 88 (2017).
Schvarcz, C. R. & Steward, G. F. A giant virus infecting green algae encodes key fermentation genes. Virology 518, 423–433 (2018).
Sun, C., Feschotte, C., Wu, Z. & Mueller, R. L. DNA transposons have colonized the genome of the giant virus Pandoravirus salinus. BMC Biol. 13, 38 (2015).
Marcet-Houben, M. & Gabaldón, T. Acquisition of prokaryotic genes by fungal genomes. Trends Genet. 26, 5–8 (2010).
Rossoni, A. W. et al. The genomes of polyextremophilic cyanidiales contain 1% horizontally transferred genes with diverse adaptive functions. eLife 8, e45017 (2019).
Filée, J. Multiple occurrences of giant virus core genes acquired by eukaryotic genomes: the visible part of the iceberg? Virology 466–467, 53–59 (2014).
Maumus, F. & Blanc, G. Study of gene trafficking between Acanthamoeba and giant viruses suggests an undiscovered family of amoeba-infecting viruses. Genome Biol. Evol. 8, 3351–3363 (2016).
Gallot-Lavallée, L. & Blanc, G. A glimpse of nucleo-cytoplasmic large DNA virus biodiversity through the eukaryotic genomics window. Viruses 9, 17 (2017).
Maumus, F., Epert, A., Nogué, F. & Blanc, G. Plant genomes enclose footprints of past infections by giant virus relatives. Nat. Commun. 5, 4268 (2014).
Guglielmini, J., Woo, A. C., Krupovic, M., Forterre, P. & Gaia, M. Diversification of giant and large eukaryotic dsDNA viruses predated the origin of modern eukaryotes. Proc. Natl Acad. Sci. USA 116, 19585–19592 (2019).
Forterre, P. & Gaïa, M. Giant viruses and the origin of modern eukaryotes. Curr. Opin. Microbiol. 31, 44–49 (2016).
Piacente, F., Gaglianone, M., Laugieri, M. E. & Tonetti, M. G. The autonomous glycosylation of large DNA viruses. Int. J. Mol. Sci. 16, 29315–29328 (2015).
Schulz, F. et al. Giant virus diversity and host interactions through global metagenomics. Nature 578, 432–436 (2020).
Abrahão, J. et al. Tailed giant Tupanvirus possesses the most complete translational apparatus of the known virosphere. Nat. Commun. 9, 749 (2018).
Wilson, W. H. et al. Complete genome sequence and lytic phase transcription profile of a Coccolithovirus. Science 309, 1090–1092 (2005).
Roux, S. et al. Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature 537, 689–693 (2016).
Koonin, E. V. & Krupovic, M. The depths of virus exaptation. Curr. Opin. Virol. 31, 1–8 (2018).
Ochman, H., Lawrence, J. G. & Groisman, E. A. Lateral gene transfer and the nature of bacterial innovation. Nature 405, 299–304 (2000).
Groisman, E. A. & Ochman, H. Pathogenicity islands: bacterial evolution in quantum leaps. Cell 87, 791–794 (1996).
Martin, W. F. Too much eukaryote LGT. BioEssays 39, 1700115 (2017).
Keeling, P. J. & Palmer, J. D. Horizontal gene transfer in eukaryotic evolution. Nat. Rev. Genet. 9, 605–618 (2008).
Cock, J. M. et al. The Ectocarpus genome and the independent evolution of multicellularity in brown algae. Nature 465, 617–621 (2010).
Delaroque, N., Maier, I., Knippers, R. & Müller, D. G. Persistent virus integration into the genome of its algal host, Ectocarpus siliculosus (Phaeophyceae). J. Gen. Virol. 80, 1367–1370 (1999).
Delaroque, N. & Boland, W. The genome of the brown alga Ectocarpus siliculosus contains a series of viral DNA pieces, suggesting an ancient association with large dsDNA viruses. BMC Evol. Biol. 8, 110 (2008).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432 (2019).
Yutin, N., Wolf, Y. I., Raoult, D. & Koonin, E. V. Eukaryotic large nucleo-cytoplasmic DNA viruses: clusters of orthologous genes and reconstruction of viral genome evolution. Virol. J. 6, 223 (2009).
Filée, J., Siguier, P. & Chandler, M. I am what I eat and I eat what I am: acquisition of bacterial genes by giant viruses. Trends Genet. 23, 10–15 (2007).
Filée, J., Pouget, N. & Chandler, M. Phylogenetic evidence for extensive lateral acquisition of cellular genes by nucleocytoplasmic large DNA viruses. BMC Evol. Biol. 8, 320 (2008).
Hoff, K. J. & Stanke, M. Predicting genes in single genomes with AUGUSTUS. Curr. Protoc. Bioinformatics 65, e57 (2019).
Stanke, M. & Morgenstern, B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 33, W465–W467 (2005).
Gu, Z., Gu, L., Eils, R., Schlesner, M. & Brors, B. circlize implements and enhances circular visualization in R. Bioinformatics 30, 2811–2812 (2014).
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
Kiełbasa, S. M., Wan, R., Sato, K., Horton, P. & Frith, M. C. Adaptive seeds tame genomic sequence comparison. Genome Res. 21, 487–493 (2011).
Federhen, S. The NCBI Taxonomy database. Nucleic Acids Res. 40, D136–D143 (2012).
Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33, 1635–1638 (2016).
Pagès, H., Aboyoun, P., Gentleman, R. & DebRoy, S. Biostrings: efficient manipulation of biological strings. R package version 2.56.0 https://bioconductor.org/packages/Biostrings (2020).
Bao, Z. & Eddy, S. R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12, 1269–1276 (2002).
Delcher, A. L., Phillippy, A., Carlton, J. & Salzberg, S. L. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30, 2478–2483 (2002).
Tatusov, R. L., Galperin, M. Y., Natale, D. A. & Koonin, E. V. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28, 33–36 (2000).
Haft, D. H. et al. TIGRFAMs: a protein family resource for the functional identification of proteins. Nucleic Acids Res. 29, 41–43 (2001).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
Moniruzzaman, M. et al. Virus–host relationships of marine single-celled eukaryotes resolved from metatranscriptomics. Nat. Commun. 8, 16054 (2017).
Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010).
Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).
Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Res. 47, W256–W259 (2019).
Lechner, M. et al. Proteinortho: detection of (co-)orthologs in large-scale analysis. BMC Bioinformatics 12, 124 (2011).
Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal Complex Systems 1695, 1–9 (2006).
Burns, J. A., Paasch, A., Narechania, A. & Kim, E. Comparative genomics of a bacterivorous green alga reveals evolutionary causalities and consequences of phago-mixotrophic mode of nutrition. Genome Biol. Evol. 7, 3047–3061 (2015).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
Martinez-Gutierrez, C. A. & Aylward, F. O. Strong purifying selection is associated with genome streamlining in epipelagic Marinimicrobia. Genome Biol. Evol. 11, 2887–2894 (2019).
Huerta-Cepas, J. et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44, D286–D293 (2016).
Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).
Acknowledgements
We thank J. Burns from the Bigelow Laboratory of Ocean Sciences and E. Kim from the American Museum of Natural History for providing access to the RNA sequencing data of C. tetramitiformis. We acknowledge use of the Virginia Tech Advanced Research Computing Center for bioinformatic analyses performed in this study. This work was supported by a Simons Early Career Investigator Award in Marine Microbial Ecology and Evolution (grant no. 620443) and NSF grant IIBR-1918271 to F.O.A.
Author information
Contributions
F.O.A. and M.M. designed the project and wrote the paper. M.M. curated GEVEs, performed gene annotations and phylogenetic analysis. A.R.W. performed the GEVE protein annotations. C.A.M.-G. performed the dN/dS analysis.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature thanks Chantal Abergel, Matthew Sullivan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Workflow for GEVE detection.
Overview of the initial steps to identify virus-like regions in chlorophyte genomes and subsequent steps to curate Giant Endogenous Viral Elements (GEVEs). Steps in the grey box are implemented in the ViralRecall tool; steps outside this box represent additional analyses we performed to validate our findings and further analyse the GEVEs.
Extended Data Fig. 2 General features of additional GEVEs.
Circular genome plots of six additional GEVEs (apart from those shown in Fig. 1b) showing NCVOG HMM hits, spliceosomal intron locations, and best LAST hit matches. Black dots atop the outermost track mark the locations of the core genes, while the blue links inside the circles represent duplicated regions. For Chlorella and Tetradesmus obliquus, the grey shading demarcates the location of the integrated GEVE as determined by ViralRecall.
Extended Data Fig. 3 GEVEs have coding potential similar to known giant viruses.
a, Principal component analysis (PCA) of the coding potential of the GEVE genomes, corresponding host genomes and reference giant viruses based on the presence/absence of nucleocytoplasmic virus orthologous group (NCVOG)-specific proteins in these genomes. The plot demonstrates the similarity in coding content of GEVEs and reference giant viruses, whereas the eukaryotic hosts are distinct in terms of coding potential. Nonviral chlorophyte host chromosomes have a much more scattered distribution owing to the sporadic occurrence and low abundance of some NCVOGs in these genomes (for example, ankyrin repeat proteins and transposons are represented in NCVOGs and are present in the nonviral portion of host chromosomes). Eukaryotic-specific proteins are not included in NCVOGs, so the host chlorophyte genomes do not show tight clustering, because this aspect of their genomic repertoires is not captured by NCVOGs. The prcomp() function in R was used to calculate the values. b, Bipartite network of 18 GEVEs and 126 reference giant viruses based on shared gene content. The network is constructed by profiling the presence of NCVOGs across all the virus and GEVE genomes represented. Large nodes represent NCLDV or GEVE genomes, smaller nodes represent NCVOG protein families and edges denote gene families represented in different genomes.
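The bipartite network in panel b can be illustrated with a minimal Python sketch. It assumes the input is a mapping from each genome to the set of NCVOG families detected in it; the genome names and family identifiers below are hypothetical placeholders, and this is not the code used in the study (the reference list cites the igraph package).

```python
from collections import defaultdict

def build_bipartite_edges(presence):
    """Return (genome, family) edges from a genome -> set-of-families mapping."""
    edges = []
    for genome, families in presence.items():
        for family in sorted(families):
            edges.append((genome, family))
    return edges

def shared_families(presence, min_genomes=2):
    """Map each family to the genomes encoding it, keeping only families
    present in at least min_genomes genomes (an assumed cutoff)."""
    by_family = defaultdict(set)
    for genome, families in presence.items():
        for family in families:
            by_family[family].add(genome)
    return {fam: gs for fam, gs in by_family.items() if len(gs) >= min_genomes}

# Hypothetical toy input: names are placeholders, not real identifiers.
presence = {
    "GEVE_A": {"NCVOG_x", "NCVOG_y"},
    "GEVE_B": {"NCVOG_y"},
    "virus_ref_1": {"NCVOG_x", "NCVOG_y", "NCVOG_z"},
}
edges = build_bipartite_edges(presence)
shared = shared_families(presence)
```

Each edge links one genome node to one family node, so two genomes are connected only indirectly, through the families they both encode.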
Extended Data Fig. 4 Example of gene prediction approach within the GEVEs.
Genes predicted by AUGUSTUS (outer ring, brown) and non-overlapping Prodigal-predicted genes (middle ring, green) in the GEVEs within Chlamydomonas eustigma and Tetrabaena socialis are shown as examples. In most cases, Prodigal predicted many genes that were not detected by eukaryotic gene prediction algorithms. Many of the Prodigal-predicted genes originally missed by AUGUSTUS have hits to NCVOGs (innermost ring, purple), including NCLDV core genes.
Extended Data Fig. 5 Level of duplications and core gene copy numbers in GEVE genomes versus reference giant virus genomes.
The left panel shows duplication level (repeated genomic regions at >90% nucleotide similarity) as estimated using RECON 1.08. The right panel shows copy numbers of NCLDV core genes in each of the GEVEs and reference genomes (see Methods for details).
Extended Data Fig. 6 Signature of relaxed selection in the GEVEs compared to free viruses.
Violin plot representing median dN/dS values of endogenized and free reference giant viruses. Statistical significance of the difference between the dN/dS values of the two groups, according to a non-paired, one-sided Mann–Whitney Wilcoxon test, is denoted by ***P < 0.0001. ‘W’ denotes the Wilcoxon test statistic. The test compared 79 GEVE–GEVE dN/dS values with 775 values from comparisons between free viruses. The IDs of the reference genomes used for calculating the dN/dS values are provided in Supplementary Data 6.
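For readers unfamiliar with the test named in this caption, the following Python sketch shows how a non-paired, one-sided rank-sum test can be computed. It is an illustration under stated assumptions, not the authors' analysis: the alternative hypothesis is assumed to be that the first group tends to have larger values, the P value uses a normal approximation with continuity correction and no tie correction, and the function names and input values are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_greater(x, y):
    """One-sided test of H1: values in x tend to be larger than values in y.

    Returns (U statistic for x, approximate upper-tail P value).
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u_x = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u_x - mean_u - 0.5) / sd_u                 # continuity correction
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return u_x, p

# Hypothetical dN/dS values, for illustration only.
group_a = [0.30, 0.45, 0.50, 0.62]
group_b = [0.05, 0.08, 0.10, 0.12, 0.20]
u_stat, p_value = mann_whitney_greater(group_a, group_b)
```

With sample sizes as large as those reported in the caption, the normal approximation is usually considered adequate; for small samples an exact test would be preferable.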
Extended Data Fig. 7 Expression profiles of GEVE genes.
Selected set of expressed genes in six of the GEVEs. For each GEVE, up to 15 genes with the highest expression are shown, with the exception of Tetrabaena socialis GEVE_1, for which all genes having >1 expression coverage are presented. For a particular gene, expression is measured as the average read mapping coverage of the CDS(s) in that gene. Genes having putative functions (based on Pfam or COG annotations) are shown in red, while mobile elements are shown in blue.
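The expression measure described above (average read mapping coverage of the CDS(s) in a gene) can be sketched in Python as follows. This is a minimal illustration that assumes per-base read depth is already available as a list and that the average is taken over all CDS bases of the gene; the function name, coordinates and depth values are hypothetical.

```python
def mean_cds_coverage(depth, cds_intervals):
    """Average per-base read depth across all CDS intervals of one gene.

    depth: per-base read depths for the contig (index 0 = position 1).
    cds_intervals: list of (start, end) pairs, 1-based and inclusive.
    """
    total = 0
    length = 0
    for start, end in cds_intervals:
        segment = depth[start - 1:end]
        total += sum(segment)
        length += len(segment)
    return total / length if length else 0.0

# Hypothetical contig depth and a two-exon gene.
depth = [0, 0, 4, 4, 6, 6, 0, 0, 2, 2]
gene_cds = [(3, 6), (9, 10)]
coverage = mean_cds_coverage(depth, gene_cds)
is_expressed = coverage > 1  # threshold mentioned in the caption
```

Averaging over bases weights longer CDS segments more heavily than shorter ones; averaging the per-segment means instead would be a different, equally plausible reading of the caption.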
Extended Data Fig. 8 Functional potential coded by the GEVEs.
Functional profiles (EggNOG) of the GEVEs normalized across all the NOG functional categories except category S (Function unknown). No gene was found to be in category R (General function prediction only). The numbers of genes having no hits or assigned to category S (Function unknown) are shown in the table on the right.
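One plausible reading of the normalization described above is that, for each GEVE, gene counts per functional category are converted to fractions after category S is set aside. The Python sketch below illustrates that reading only; the function name and the counts are hypothetical, and the exact procedure used for the figure may differ.

```python
def normalized_profile(category_counts, excluded=("S",)):
    """Convert per-category gene counts to fractions, ignoring excluded categories."""
    kept = {c: n for c, n in category_counts.items() if c not in excluded}
    total = sum(kept.values())
    if total == 0:
        return {c: 0.0 for c in kept}
    return {c: n / total for c, n in kept.items()}

# Hypothetical counts for a single GEVE, keyed by functional category letter.
counts = {"L": 30, "K": 10, "O": 10, "S": 50}
profile = normalized_profile(counts)
```

Because category S is dropped before dividing, the remaining fractions sum to 1 for each GEVE, which makes profiles comparable across elements with very different numbers of unannotated genes.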
Supplementary information
Supplementary Information
This file contains the following: a) Supplementary results and discussion with references. b) Supplementary figures with captions describing each figure. c) Supplementary tables with captions describing each table.
Supplementary Data
Supplementary Data 1: Information on the genomes analysed in this study. FTP download links are provided for each of the genomes.
Supplementary Data 2: Summary statistics for individual contigs in each of the viral elements (GEVEs) analysed.
Supplementary Data 3: Average amino acid identities (AAI) between each pair of GEVEs.
Supplementary Data 4: Functional annotation for each of the GEVEs obtained using a number of protein family databases. Databases used are: COG, Pfam, EggNOG, VOG, TIGR and EggVOG. See ‘Methods’ for references for all these databases.
Supplementary Data 5: Annotation and expression values of the expressed genes in six of the GEVEs. Annotations are only provided for the genes which had hits to different databases (as specified in Supplementary Data 4).
Supplementary Data 6: Genome IDs of the reference NCLDVs that were used to calculate dN/dS values in the Phycodnaviridae and Mimiviridae group. The reference genomes can be accessed from the study cited in the ‘Calculation of dN/dS ratios’ sub-section in the ‘Methods’.
About this article
Cite this article
Moniruzzaman, M., Weinheimer, A.R., Martinez-Gutierrez, C.A. et al. Widespread endogenization of giant viruses shapes genomes of green algae. Nature 588, 141–145 (2020). https://doi.org/10.1038/s41586-020-2924-2
DOI: https://doi.org/10.1038/s41586-020-2924-2