Galperin 2010 Understanding The Genome
Author Manuscript
Trends Biotechnol. Author manuscript; available in PMC 2011 March 29.
Published in final edited form as:
NIH-PA Author Manuscript
Abstract
The rapidly accumulating genome sequence data allow researchers to address fundamental
biological questions that were not even asked just a few years ago. A major problem in genomics
is the widening gap between the rapid progress in genome sequencing and the comparatively slow
progress in the functional characterization of sequenced genomes. Here we discuss two key
questions of genome biology: whether we need more genomes, and how deep is our understanding
of biology based on genomic analysis. We argue that overly specific annotations of gene functions
are often less useful than the more generic, but also more robust, functional assignments based on
protein family classification. We also discuss problems in understanding the functions of the
remaining “conserved hypothetical” genes.
Introduction
The year 2010 marks the 15th anniversary of the publication of the 1,830,138-base genome
of the bacterium Haemophilus influenzae Rd Kw20 - the first cellular life form to have its
entire genome sequenced [1]. Aided by the tremendous progress in sequencing technology,
genome sequencing is advancing at an ever-increasing pace. By the end of 2009, 1052
genomes representing 720 individual species (636 bacteria, 61 archaea, and 23 eukaryotes)
were completely sequenced, deposited in the public nucleotide sequence databases
(GenBank/EMBL/DDBJ) and made freely available over the internet. Many more genomes
were at various stages of sequencing and assembly, including almost 100 eukaryotic
genomes whose preliminary descriptions have been published [2]. Thanks to the advent of
the new generation of sequencing technologies, the costs of genome sequencing have
dropped so much that the projects to sequence the entire human microbiome
(http://nihroadmap.nih.gov/hmp/, [3]) and to generate ~5,000 reference genomes for every
major prokaryotic lineage (the Genomic Encyclopedia of Bacteria and Archaea:
http://www.jgi.doe.gov/programs/GEBA/, [4]) have become realistic. Given these
remarkable advances, it seems timely to address two lingering questions: ‘How many more
genomes do we need?’ and ‘How deep is our understanding of biology derived from genome
analysis?’
[5,6]. There seems to be some substance to this claim; for example, it is unlikely that we
will ever see a single bacterial chromosome that is much longer than 13,033,779 nucleotides (as
in the myxobacterium Sorangium cellulosum). On the other end of the spectrum,
Nevertheless, genome sequencing is here to stay, and there are several compelling reasons
for that. First of all, the value of the sequence information is in the eye of the beholder.
Many biologists still passionately argue for sequencing their own favorite organism, strain
or isolate, no matter how many close relatives already have been sequenced. Indeed, not
having a genome sequence for an experimental model is increasingly - and for good reasons
- perceived as being stuck in the "dark ages". The availability of the genome sequence
allows researchers to easily clone and express any gene, create microarrays to analyze gene
expression, and reconstruct the metabolic and signaling networks. Having genomic
sequences from closely related organisms opens the door to the quantitative study of
mutational patterns, selective regimes, adaptations to ecological factors and, in the case of
microbial pathogens, virulence determinants. Potentially even more important is the
possibility to identify genes and traits that are not present in the given genome - a task that
clearly requires a complete genome sequence.
Secondly, the available genome collection, despite its rapid expansion, still barely scratches
the surface of the real biological diversity. The availability of genomic data has already led to a
revolution in systematics, especially with regard to bacteria and archaea, putting this
field on a solid evolutionary footing and giving rise to the new discipline of phylogenomics
[10,11]. Still, judging from the metagenomic data, as many as 90% of the microbial species
on Earth remain uncultivated [12,13], which complicates reconstruction of the global carbon
and nitrogen cycles. Genome analysis has already led to several important advances in these
areas. Thus, the genome of the marine α-proteobacterium SAR11 (now renamed Candidatus
Pelagibacter ubique), apparently the most abundant organism on this planet, opened our
eyes to a peculiar role of bacteriorhodopsin-mediated photosynthesis as an auxiliary energy
source in the extremely streamlined metabolism of this bacterium [14]. The genome
Thirdly, hidden sampling biases in genome sequencing are becoming apparent. For example,
starting with Mycoplasma genitalium in 1995, more than 20 mollicute genomes have been
sequenced, none of which encoded a single environmental sensor [17]. However, the
perception that mollicutes have no signal transduction systems was shattered upon the
completion of the (slightly larger) genome of the soil mollicute Acholeplasma laidlawii,
which encodes two sensory histidine kinases, three response regulators, an adenylate
Fourthly, although obtaining complete genome sequences from every major lineage [4]
would certainly be a dramatic step forward, a single representative genome is by no means
sufficient to assess the true biological diversity of a taxon. As a case in point, the sequencing
of several genomes from the cyanobacterium Prochlorococcus marinus - a widespread
inhabitant of ocean surface waters - was originally aimed at establishing the principal
differences between “high-light” and “low-light” ecotypes [18]. However, different strains
of P. marinus proved to have vastly different gene repertoires, indicative of high rates of
gene acquisition and loss by these organisms. These findings have shown that: (i) the core
set of genes shared by all P. marinus isolates is very limited – and shrinking; and (ii) the P.
marinus pan-genome, that is the sum total of genes represented in at least one P. marinus
strain, is extremely large – and expanding [19]. This crucial yet unexpected development
puts into question the very rationale for assigning organisms with dramatically different
genome contents – but (nearly) identical 16S rRNA sequences – to the same “species” (such
as P. marinus or Escherichia coli) and puts the study of pan-genomes to the forefront of
genomic research.
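The core-genome and pan-genome notions used above correspond to set intersection and set union over the gene repertoires of individual strains. The following Python fragment is a minimal sketch of that correspondence; the strain labels and gene identifiers are invented for illustration and are not taken from any P. marinus genome.

```python
def core_genome(strains):
    """Genes shared by every strain: the intersection of all gene sets."""
    gene_sets = [set(genes) for genes in strains.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


def pan_genome(strains):
    """Genes present in at least one strain: the union of all gene sets."""
    return set().union(*strains.values())


# Hypothetical gene repertoires (illustrative identifiers only).
strains = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g5"},
    "strain_C": {"g1", "g2", "g6", "g7"},
}

core_before = core_genome(strains)   # {"g1", "g2"}
pan_before = pan_genome(strains)     # 7 distinct genes

# Adding one more (hypothetical) strain can only shrink or preserve the core
# and can only grow or preserve the pan-genome.
strains["strain_D"] = {"g1", "g8"}
core_after = core_genome(strains)    # {"g1"}
pan_after = pan_genome(strains)      # 8 distinct genes
```

By construction, each newly sequenced strain can leave the core unchanged or reduce it, and can leave the pan-genome unchanged or enlarge it, which is the "shrinking core, expanding pan-genome" pattern described in the text.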
Finally, there remains the crucial issue of using genome sequencing to improve human
health. For obvious reasons, the first sequenced genomes were mostly those of common
bacterial pathogens. Then the human genome and representative genomes from popular
model organisms emerged. As sequencing costs continue to decrease, the use of genomic
data for fighting disease becomes more and more attractive. For many bacterial pathogens,
multiple strains have been sequenced, often providing clues to the virulence factors, host
specificity and drug resistance. Some biologists advocate developing a system of constant
genome-based monitoring of various points on the globe, hoping to catch new emerging
pathogens before they cause a new epidemic. Such an effort is already well underway for
influenza viruses [20,21]. The human cancer genome projects aim at sequencing thousands
of tissue samples from various tumors, in hopes of delineating the whole spectrum of
mutations that could contribute to cancer [22]. Although this approach has been criticized
[6], the prospect of obtaining the full list of potentially oncogenic mutations – thereby
achieving a “complete understanding” of the causes of cancer – is certainly too attractive to
pass up.
exact meaning of the word “understanding” (as well as “function”). Modern dictionaries
associate “understanding” with such terms as “appreciation”, “comprehension”,
“explanation”, “insight”, “interpretation”, “knowledge”, and “mastery”. Accordingly,
understanding a genome starts from the “knowledge” of the nucleotide sequence and the
sequences of encoded proteins and RNAs, and includes “interpretation” of their functions,
“insight” into their complex interactions, and “explanation” of the evolutionary history that
shaped each particular genome. This leads to the “comprehension” of the potential activity
of each component of the cell, which must be tempered by the “appreciation” that proteins
often have additional (e.g. moonlighting [23,24]) functions. Finally, this understanding can
be extended into “mastery” – the ability to modify the genome for certain (e.g.
biotechnological) applications. Therefore, the problem of understanding the genome can be
rephrased as follows: how good is the “parts list” that is compiled for each genome in the
form of functional annotation of the predicted protein-coding and RNA-coding genes?
Obviously, this list is never complete. Almost 10 years ago, Peer Bork described the “70%
hurdle”: on average, for approximately one-third of the genes in any given genome, the
functions could not be predicted through traditional methods of genome analysis; perhaps
even worse, the accuracy of functional prediction was only ~70% for the remaining genes
[25]. Bork warned that hopes to cross this 70% barrier and achieve a better understanding of
the functional content of genomes with the help of high-throughput analytical methods
would be tempered by the fact that these methods themselves have high error rates and are
most effective when used in concert [25]. Looking back, Bork’s sobering prediction was
right on target. High-throughput analyses of gene and protein expression, protein-protein
interactions, and ligand binding led to a dramatic increase in the amount of data pertaining
to any given gene in model genomes [26]. However, as illustrated in Box 1, accumulation of
such data does not necessarily translate into clarity regarding gene function, at least not
immediately, and not without much work.
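Taken at face value, the two figures attributed to Bork compound. The short Python calculation below is a back-of-the-envelope reading of the paragraph above, not a number reported in [25]; it assumes that roughly two-thirds of genes receive a prediction and that ~70% of those predictions are correct.

```python
def correctly_annotated_fraction(predicted_fraction, accuracy):
    """Fraction of all genes that carry a correct functional prediction.

    predicted_fraction: share of genes for which any function is predicted
    accuracy: share of those predictions that are correct
    """
    return predicted_fraction * accuracy


# Assumed inputs, read off the text: about one-third of genes have no
# prediction, and predictions for the remainder are ~70% accurate.
predicted_fraction = 2 / 3
accuracy = 0.70

overall = correctly_annotated_fraction(predicted_fraction, accuracy)
print(f"Genes with a correct prediction: {overall:.0%}")
```

Under these assumptions, fewer than half of the genes in an average genome would carry a correct functional prediction, which illustrates why the "hurdle" is harder than either figure suggests on its own.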
Owing to the paucity of experimental data, this information is rarely available in its entirety,
and functional assignments for the majority of the genes are based solely on the sequence
similarity of their products to experimentally characterized proteins in a handful of model
(something that we generally know how to do) for specific protein annotation, which except
possibly for a handful of obvious cases, will remain questionable until each protein is
experimentally characterized, even when predictions appear entirely plausible and supported
It is important to note that family assignment is only the first step towards understanding,
which, as discussed above, requires knowledge of both the biochemical activity of the
protein and the cellular process in which the protein is involved. As an example, the
sequence-based prediction that the conserved bacterial protein Era is a GTPase was a good
first step in its characterization, and recognition of its involvement in translation was another
step forward. However, “true understanding” of the role of this GTPase in the translation
process – and its proper functional annotation – came only after an experimental study that
revealed the participation of Era in processing and maturation of 16S rRNA [35].
surprising in the case of lineage-specific genes that are found, for example, only in Vibrio or
Burkholderia - bacterial lineages that are extensively sampled by genome sequencing, but do
not include well-characterized model organisms. However, some genes that are widespread
among bacteria, archaea and/or eukaryotes still remain without functional annotation [39].
The protein products of these genes have been variously referred to as “hypothetical”,
“conserved hypothetical”, “uncharacterized” or even “putative uncharacterized” (as of May
1, 2010, 3,118,564 proteins in UniProt were annotated this way [40]). Several lists of
“conserved hypothetical” proteins have been compiled, including Domains of Unknown
Function (DUFs) in Pfam, R- and S-COGs in the COG database, and Uncharacterized
Protein Families (UPFs) in UniProtKB/Swiss-Prot [29,33,40]. These lists have been
extensively used to guide structural genomics efforts, which resulted in structural (albeit
usually not functional) characterization of many such proteins [41,42].
To highlight the distinction between the “hypothetical” genes whose functions remained
completely unknown and those that could be assigned a general biochemical function (e.g. a
methyltransferase, an oxidoreductase, a transcriptional regulator or a membrane transporter),
we denoted the former category of genes “unknown unknown” and the latter category
“known unknown” [39]. The “known unknown” category also includes genes of unknown
biochemical function that have (partially) known cellular function, such as a “cell division
protein” or a “stress response protein”. In purely operational terms, there are more or less
clear ways of establishing function for “known unknown” genes, but not for “unknown
unknowns”.
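The operational distinction drawn above can be written down as a small decision rule. The Python sketch below is one possible encoding, not a procedure from [39]; the "characterized" category for genes with both kinds of function known is an assumption added here for completeness, and all names are hypothetical.

```python
from enum import Enum


class AnnotationStatus(Enum):
    CHARACTERIZED = "characterized"        # assumed extra category, not from [39]
    KNOWN_UNKNOWN = "known unknown"
    UNKNOWN_UNKNOWN = "unknown unknown"


def classify(biochemical_function_known, cellular_function_known):
    """Classify a gene by which aspects of its function are (at least partly) known."""
    if biochemical_function_known and cellular_function_known:
        return AnnotationStatus.CHARACTERIZED
    if biochemical_function_known or cellular_function_known:
        # e.g. a predicted methyltransferase with no known cellular role,
        # or a "cell division protein" of unknown biochemical activity
        return AnnotationStatus.KNOWN_UNKNOWN
    return AnnotationStatus.UNKNOWN_UNKNOWN


status = classify(biochemical_function_known=True, cellular_function_known=False)
```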
Six years ago we analyzed widely conserved “hypothetical” genes and compiled the “top
10” lists of “known unknown” and “unknown unknown” genes [39]. A re-examination of
these lists shows that, despite mounting observations, nearly half of those genes still remain
without an assigned function (Tables 1 and 2). Some of the genes in the two lists have been
experimentally characterized, and in a few cases the function has been established [43]. In
eukaryotes, products of some, albeit not all, of these widely conserved genes appear to be
targeted to mitochondria [44–48]. In two instances, mutations in these genes were linked to
mitochondrial diseases, such as hereditary paraganglioma [44] and the late-onset Leigh
syndrome [48]. In other cases, however, experimental results were contradictory (Box 1).
Apparently, the problem was not in the lack of effort to characterize these genes, but in the
pleiotropic phenotypes of their mutations, which made it difficult to pinpoint the primary
function.
Less common “hypothetical” genes are far more abundant in the genomes of free-living
organisms than in the relatively streamlined genomes of parasites, symbionts and
saprophytes [53]. Based on the observation that the fraction of metabolic and particularly
regulatory genes increases with the genome size [17,54,55], sophisticated regulation of gene
expression and complex (secondary) metabolism, including various post-transcriptional and
Recent studies have highlighted an additional class of functions that might account for the
abundance of uncharacterized genes in free-living organisms, namely, detoxification
(usually hydrolysis) of potentially hazardous side-products of various metabolic reactions
[56]. These activities, commonly referred to as “house-cleaning”, are particularly important
for aerobic organisms that have to cope with spontaneous oxidation of nucleotides, amino
acids, lipids, and other cellular components. For example, the recently characterized
“conserved hypothetical” gene yebR (renamed msrC) has been shown to encode an enzyme
that hydrolyzes methionine-(R)-sulfoxide, a product of methionine oxidation [57]. Other
cellular reactions that might require house-cleaning include methylation, acetylation and
adenylation, among potentially many others. It is probably no coincidence that many poorly
characterized proteins appear to function as hydrolases [27,28].
Finally, it has to be kept in mind that a considerable fraction of genes in many genomes
might not have definable cellular functions, but rather originate from viruses and mobile
elements and only transiently pass through microbial genomes. Genomes are highly
dynamic entities, and each sequence is a temporal snapshot that is likely to include many
short-lived elements that are not maintained by selection. The very notion of annotation for
such “selfish” genes is different from that applied to “regular” genes with distinct cellular
functions [9].
Concluding remarks
In conclusion, it might be worthwhile to make several basic generalizations regarding
genomes and the understanding of gene functions:
• Functions of many widespread genes are known; most universal genes are involved in
translation [9]
• Widespread genes with unknown functions remain uncharacterized for a reason:
they often affect multiple processes and their mutations typically are pleiotropic
(Box 1)
• The functions of a substantial fraction of genes in each sequenced genome remain
unknown
• Not every experiment on an unknown gene yields useful clues regarding function
• Structural characterization of a protein rarely gives direct clues to its function
[41,42,58]
• Analysis of gene expression rarely gives direct clues to gene functions
• Delineation of a protein interaction network involving the gene of interest rarely
gives direct clues to its function [26,59,60]
• Functional assignments for previously uncharacterized, widely conserved genes are
just like any biological discoveries: they require a lot of hard work and a bit of luck
So far there is no single high-throughput approach that would finally reveal the functions of
all “hypothetical” genes encoded in the sequenced genomes. This goal may be reachable
only through sustained efforts of numerous experimental, computational and structural
biologists [61]. At the end of 2009, NIH awarded a grant to the COMputational BRidge to
EXperiments (COMBREX, http://www.combrex.org/, formerly SciBay) consortium project
that aims to coordinate collaborative efforts of various research groups towards
computational identification of the most interesting families of “conserved hypothetical”
proteins and their experimental characterization
The E. coli ygjD (gcp) gene has orthologs in almost every bacterial, archaeal and
eukaryotic genome. In many eukaryotes it is found in two paralogous copies, such as
QRI7 and Kae1 in yeast, At4g22720 and At2g45270 in Arabidopsis thaliana, and
OSGEP and OSGEPL1 in human. In addition, there is a family of more distant bacterial
paralogs, represented by E. coli YeaZ and B. subtilis YdiC. We have previously
discussed the potential functions of this protein family (which contains an actin/HSP70
superfamily ATPase domain), and expressed doubts about its annotation as "O-
sialoglycoprotease", which was based on a single experimental observation, and further
suggested an association of this protein with translation (e.g. co-translational degradation
of misfolded proteins) [39]. In the past several years, proteins of this family have been
studied in several model organisms, and the crystal structures of several family members
have been solved [46,59]. An archaeal YgjD family member showed no protease activity,
but has been reported to bind DNA and possess an apurinic endonuclease activity [62]. In
yeast, Kae1 is a subunit of the KEOPS complex, which regulates transcription, telomere
uncapping and telomere length, and is required for cell growth; its paralog Qri7 is targeted to
mitochondria and appears to be essential for mitochondrial genome maintenance [46]. Despite all these
observations, the actual function of the YgjD family proteins remains enigmatic [46,60].
A recent study suggested their involvement in biosynthesis of
threonylcarbamoyladenosine (t6A), a universal tRNA base modification occurring at
position 37 in a subset of tRNAs decoding the ANN codons [63]. If so, translational
defects resulting from impaired t6A biosynthesis could explain at least some properties of
translation of the full-length COX1 polypeptide were considered: (i) securing an accurate
start of translation, (ii) stabilizing the elongating polypeptide, and (iii) interacting with
the peptide release factor [48]. While involvement in translation appears very likely for
such a widespread protein family, its apparent capacity to bind DNA remains to be
confirmed and/or explained.
YjgF/YabJ/YER057c/UK114 family
The E. coli yjgF gene has highly conserved homologs in bacteria, archaea and
eukaryotes, often with multiple paralogs in the same genome. Representatives of the
YjgF protein family are known as "purine regulatory protein YabJ" in B. subtilis and as
"tumour-associated antigen UK114" in human and other mammals. Members of this
family have been reported to possess ribonuclease activity, to function as a molecular
chaperone, calpain activator, transcriptional regulator, and translational inhibitor, and
also to affect photosynthesis, isoleucine biosynthesis and mitochondrial genome
the cellular functions of the members of the YjgF family remain unclear.
Acknowledgments
This study was supported by the Intramural Research Program of the National Library of Medicine at the U.S.
National Institutes of Health.
References
1. Fleischmann RD, et al. Whole-genome random sequencing and assembly of Haemophilus
influenzae Rd. Science. 1995; 269:496–512. [PubMed: 7542800]
2. Liolios K, et al. The Genomes On Line Database (GOLD) in 2009: status of genomic and
metagenomic projects and their associated metadata. Nucleic Acids Res. 2010; 38:D346–D354.
[PubMed: 19914934]
3. Ley RE, et al. Worlds within worlds: evolution of the vertebrate gut microbiota. Nat. Rev.
Microbiol. 2008; 6:776–788. [PubMed: 18794915]
4. Wu D, et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature. 2009;
462:1056–1060. [PubMed: 20033048]
5. Whitworth DE. Genomes and knowledge - a questionable relationship? Trends Microbiol. 2008;
16:512–519. [PubMed: 18819801]
6. Kaiser J. A skeptic questions cancer genome projects. ScienceInsider. 23 April 2010.
(http://news.sciencemag.org/scienceinsider/2010/04/a-skeptic-questions-cancer-genom.html)
7. McCutcheon JP, et al. Origin of an alternative genetic code in the extremely small and GC-rich
genome of a bacterial symbiont. PLoS Genet. 2009; 5:e1000565.
8. Galperin MY, Kolker E. New metrics for comparative genomics. Curr. Opin. Biotechnol. 2006;
17:440–447. [PubMed: 16978854]
9. Koonin EV, Wolf YI. Genomics of bacteria and archaea: the emerging dynamic view of the
prokaryotic world. Nucleic Acids Res. 2008; 36:6688–6719. [PubMed: 18948295]
10. Eisen JA, Fraser CM. Phylogenomics: intersection of evolution and genomics. Science. 2003;
300:1706–1707. [PubMed: 12805538]
11. Koonin EV. The origin and early evolution of eukaryotes in the light of phylogenomics. Genome
Biol. 2010; 11:209. [PubMed: 20441612]
12. Hugenholtz P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 2002;
3:REVIEWS0003.
13. DeLong EF. The microbial ocean from genomes to biomes. Nature. 2009; 459:200–206. [PubMed:
19444206]
14. Giovannoni SJ, et al. Genome streamlining in a cosmopolitan oceanic bacterium. Science. 2005;
309:1242–1245. [PubMed: 16109880]
15. Hou S, et al. Genome sequence of the deep-sea gamma-proteobacterium Idiomarina loihiensis
reveals amino acid fermentation as a source of carbon and energy. Proc. Natl. Acad. Sci. USA.
2004; 101:18036–18041. [PubMed: 15596722]
16. Klotz MG, Stein LY. Nitrifier genomics and evolution of the nitrogen cycle. FEMS Microbiol.
Lett. 2008; 278:146–156. [PubMed: 18031536]
17. Galperin MY. A census of membrane-bound and intracellular signal transduction proteins in
bacteria: bacterial IQ, extroverts and introverts. BMC Microbiol. 2005; 5:35. [PubMed: 15955239]
18. Rocap G, et al. Genome divergence in two Prochlorococcus ecotypes reflects oceanic niche
differentiation. Nature. 2003; 424:1042–1047. [PubMed: 12917642]
19. Scanlan DJ, et al. Ecological genomics of marine picocyanobacteria. Microbiol. Mol. Biol. Rev.
2009; 73:249–299. [PubMed: 19487728]
20. McHardy AC, Adams B. The role of genomics in tracking the evolution of influenza A virus. PLoS
Pathog. 2009; 5:e1000566.
21. Lee CW, et al. Large-scale evolutionary surveillance of the 2009 H1N1 influenza A virus using
46. Oberto J, et al. Qri7/OSGEPL, the mitochondrial version of the universal Kae1/YgjD protein, is
essential for mitochondrial genome maintenance. Nucleic Acids Res. 2009; 37:5343–5352.
[PubMed: 19578062]
47. Rudolph C, et al. ApoA-I-binding protein (AI-BP) and its homologues hYjeF_N2 and hYjeF_N3
comprise the YjeF_N domain protein family in humans with a role in spermiogenesis and
oogenesis. Horm. Metab. Res. 2007; 39:322–335. [PubMed: 17533573]
48. Weraarpachai W, et al. Mutation in TACO1, encoding a translational activator of COX I, results in
cytochrome c oxidase deficiency and late-onset Leigh syndrome. Nat. Genet. 2009; 41:833–837.
[PubMed: 19503089]
49. Phillips G, et al. Discovery and characterization of an amidotransferase involved in the
modification of archaeal tRNA. J. Biol. Chem. 2010; 285:12706–12713. [PubMed: 20129918]
50. Pouliot Y, Karp PD. A survey of orphan enzyme activities. BMC Bioinformatics. 2007; 8:244.
[PubMed: 17623104]
51. Osterman A, Overbeek R. Missing genes in metabolic pathways: a comparative genomics
approach. Curr. Opin. Chem. Biol. 2003; 7:238–251. [PubMed: 12714058]
52. Hanson AD, et al. 'Unknown' proteins and 'orphan' enzymes: the missing half of the engineering
parts list--and how to find it. Biochem. J. 2010; 425:1–11. [PubMed: 20001958]
53. Kolker E, et al. Global profiling of Shewanella oneidensis MR-1: Expression of hypothetical genes
and improved functional annotations. Proc. Natl. Acad. Sci. USA. 2005; 102:2099–2104.
[PubMed: 15684069]
54. van Nimwegen E. Scaling laws in the functional content of genomes. Trends Genet. 2003; 19:479–
68. Sayers EW, et al. Database resources of the National Center for Biotechnology Information.
Nucleic Acids Res. 2009; 37:D5–D15. [PubMed: 18940862]
69. Koller-Eichhorn R, et al. Human OLA1 defines an ATPase subfamily in the Obg family of GTP-
subunit biogenesis, cell growth, and midgut precursor cell maintenance. Mol. Biol. Cell. 2009;
20:4424–4434. [PubMed: 19710426]
77. Jiang M, et al. The Escherichia coli GTPase CgtAE is involved in late steps of large ribosome
assembly. J. Bacteriol. 2006; 188:6757–6770. [PubMed: 16980477]
78. Pereira CM, et al. IMPACT, a protein preferentially expressed in the mouse brain, binds GCN1
and inhibits GCN2 activation. J. Biol. Chem. 2005; 280:28316–28323. [PubMed: 15937339]
79. de Hoog CL, et al. RNA and RNA binding proteins participate in early stages of cell spreading
through spreading initiation centers. Cell. 2004; 117:649–662. [PubMed: 15163412]
80. Balaji S, Aravind L. The RAGNYA fold: a novel fold with multiple topological variants found in
functionally diverse nucleic acid, nucleotide and peptide-binding proteins. Nucleic Acids Res.
2007; 35:5658–5671. [PubMed: 17715145]
Figure 1.
Accumulation of protein sequences of unknown function in the genome databases. Open
symbols indicate the total number of protein sequences encoded in prokaryotic (blue) and
eukaryotic (red) genomes; filled symbols indicate the number of “hypothetical” or
“uncharacterized” proteins. The data are taken from the NCBI’s RefSeq database [68]; the
numbers for 2010 are extrapolated from the first 4 months.
Table 1
Updated “top 10” list of widespread “known unknown” genes
Columns: gene name (E. coli / yeast / human); protein family (Pfam; COG); PDB entry; initial predictions (2004); updated functional annotation, reference.

ygjD / QRI7 / OSGEP
  Protein family: PF00814; COG0533
  PDB entry: 2VWB
  Initial predictions (2004): Putative metal- and ATP-dependent protease. Fused to a Ser/Thr protein kinase domain in some archaea. Gene neighborhoods suggest association with translation
  Updated functional annotation, reference: DNA binding protein with apurinic endonuclease activity [62]; threonylcarbamoyladenosine biosynthesis in tRNA [63]

ychF / YBR025c / PTD004
  Protein family: PF06071; COG0012
  PDB entry: 1JAL
  Initial predictions (2004): Predicted GTPase; binds double-stranded RNA; coexpressed with peptidyl-tRNA hydrolase; predicted to be a translation factor
  Updated functional annotation, reference: An ATPase in the GTPase family [69]

yrdC / SUA5 / YRDC
  Protein family: PF01300, PF03481; COG0009
  PDB entry: 1HRU
  Initial predictions (2004): Double-stranded RNA binding protein, predicted translation initiation factor; induced by ischemia in humans
  Updated functional annotation, reference: Ribosome maturation factor RimN [70]; threonylcarbamoyladenosine biosynthesis in tRNA [43]

ybeM / NIT2 / NIT1
  Protein family: PF00795; COG0388
  PDB entry: 1EMS
  Initial predictions (2004): A member of nitrilase superfamily, predicted amidase. Some members might function as glutaminase
  Updated functional annotation, reference: Omega-amidodicarboxylate amidohydrolase activity [71]

yfcE / VPS29 / PEP11
  Protein family: PF00149; COG0622
  PDB entry: 1SU1
  Initial predictions (2004): A phosphoesterase of the calcineurin-like superfamily; vacuolar sorting protein in yeast. Gene neighborhood is compatible with a role in RNA metabolism
  Updated functional annotation, reference: A phosphodiesterase with variable activity against 2',3'-cAMP [74,75]

- / NUG1 / GNL3
  Protein family: PF01926; COG1161
  PDB entry: 1PUJ
  Initial predictions (2004): Predicted GTPase; genome context suggests possible involvement in translation. In yeast, required for nuclear export of 60S pre-ribosomal particles. In humans, nucleolar protein, important for cell proliferation
  Updated functional annotation, reference: No news [76]

yhcM / AFG1 / LACE1
  Protein family: PF03969; COG1485
  PDB entry: n/a
  Initial predictions (2004): Predicted ATPase, in eukaryotes localized to the mitochondria
  Updated functional annotation, reference: Promotes degradation of cytochrome c oxidase mitochondrially encoded subunits [45]

Modified from Table 2 from Ref [39] with permission from Oxford University Press. Additional information on the listed gene products is
available from the respective online resources: for Pfam [29], in the http://pfam.sanger.ac.uk/family?PF00814 format; for COGs [33], in the
http://www.ncbi.nlm.nih.gov/COG/grace/wiew.cgi?COG0533 format; for Protein DataBank (PDB), in the
http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2VWB format.
Abbreviations: n/a, not available; COG, Clusters of Orthologous Groups of proteins database.
Table 2
Updated “top 10” list of widespread “unknown unknown” genes
Columns: gene name (E. coli / yeast / human); protein family (Pfam; COG); PDB entry; initial predictions (2004); updated annotation.

yebC / YGR021w / PRO0477
  Protein family: PF01709; COG0217
  PDB entry: 1KON
  Initial predictions (2004): Often encoded in the same operon with Holliday junction resolvasome (RuvABC) subunits. However, also found in eukaryotes (mitochondrial protein) whose resolvases are unrelated to RuvABC. Potential role in DNA repair and/or recombination
  Updated annotation: DNA-binding transcriptional regulator [65]; translational activator of COX I; mutation causes cytochrome c oxidase deficiency and late-onset Leigh syndrome [48]

ybgI / NIF3 / NIF3L1
  Protein family: PF01784; COG0327
  PDB entry: 1NMO
  Initial predictions (2004): In yeast, interacts with transcriptional coactivator NGG1p. Could be a transcriptional regulator
  Updated annotation: Mitochondrial localization

ybeB / - / C7orf30
  Protein family: PF02410; COG0799
  PDB entry: 2ID1
  Initial predictions (2004): Homologs of plant protein Iojap, required for normal function of chloroplast ribosomes. In most bacteria, adjacent to the gene for nicotinic acid mononucleotide adenylyltransferase, suggesting a role in NAD metabolism and/or bacterial cell division
  Updated annotation: Co-migrates with the 50S subunit [77]; NAD-dependent nucleic acid AMP ligase, releases NMN from NAD (V. de Crecy-Lagard, pers. commun.).

yjeF / YNL200c / AIBP
  Protein family: PF03853; COG0062
  PDB entry: 1JZT
  Initial predictions (2004): In many prokaryotes, fused to a sugar
  Updated annotation: Mitochondrial localization,

Modified from Table 3 from Ref [39] with permission from Oxford University Press. Additional information on the listed gene products is
available from the respective online resources: for Pfam [29], in the http://pfam.sanger.ac.uk/family?PF00814 format; for COGs [33], in the
http://www.ncbi.nlm.nih.gov/COG/grace/wiew.cgi?COG0533 format; for Protein DataBank (PDB), in the
http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=2VWB format.
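
The URL formats given in the table footnotes can be assembled mechanically from the identifiers listed in the tables. The Python helper below is a small sketch of that; the function name and argument names are hypothetical, and the URL patterns are copied from the footnotes as published in 2010, so they may no longer resolve.

```python
def resource_urls(pfam_id, cog_number, pdb_id=None):
    """Build the footnote-style lookup URLs for one table row.

    pfam_id: Pfam accession as listed in the table, e.g. "PF00814"
    cog_number: four-digit COG number as listed in the table, e.g. "0533"
    pdb_id: PDB entry code, or None when the table gives "n/a"
    """
    urls = {
        "Pfam": f"http://pfam.sanger.ac.uk/family?{pfam_id}",
        "COG": f"http://www.ncbi.nlm.nih.gov/COG/grace/wiew.cgi?COG{cog_number}",
    }
    if pdb_id is not None:
        urls["PDB"] = f"http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId={pdb_id}"
    return urls


# Example: the ygjD row of Table 1.
ygjd_urls = resource_urls("PF00814", "0533", "2VWB")
for resource, url in ygjd_urls.items():
    print(resource, url)
```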