
MICROBIOLOGY/GENETICS 607

LECTURE NOTES

2011

copyright 2011, G. Roberts

TABLE OF CONTENTS
Preface

1 DNA, GENES, AND THE CODE
DNA structure 4
Nature of genes 11
Homology 13
The genetic code 16
Translation 20
Transcription 24
Homologous recombination 26

2 MUTAGENESIS IN VIVO
Inherent mutation frequency 28
Spontaneous mutations 29
Repair of all mutation types 33
Mutators and Anti-mutators 36
Chemical Mutagenesis 37
Effects of mutations 41
Numerology 43
Mutant detection 43

3 REGULATION
Themes in regulation 47
Regulation at the level of DNA, RNA, translation and protein 52
Fusions 60
Arrays 62

4 GENETIC ENGINEERING 63
Generation of mutations in vitro 66
In vivo analysis 68

5 DELETIONS 71

6 DUPLICATIONS AND INVERSIONS
Tandem duplications 73
Non-tandem duplications 76
Amplifications 77
Inversions 78

7 MOBILE GENETIC ELEMENTS 79
General classes of elements 79
Distribution and evolution of ISs 84
Effects of IS/Tn 85
Uses of Tns 86

8 SELECTIONS, SCREENS AND ENRICHMENTS
Selections 87
Screens 90
Enrichments 92

9 PLASMIDS AND CHROMOSOMES 93
Types of replicons 93
Replication functions 96
Replication in yeast 98
Partitioning in prokaryotes 99
Conjugation 101

10 TRANSFORMATION 104

11 VIRUSES AND OTHER INFECTIOUS ELEMENTS
Infectious elements 106
Molecular biology of phage 107
Uses of phage 111
RNAi and CRISPR 113

12 COMPLEMENTATION
The normal case 114
Interspecific complementation 117

13 GENETIC MAPPING IN PROKARYOTES
Objective 118
Two-factor crosses 120
Deletion mapping 122

14 YEAST GENETICS
Yeast forms/cell cycle 123
Mating types 124
Analysis of chromosomal segregation 125

15 SUPPRESSION
Introduction 127
Informational suppressors 128
Non-informational suppressors 129

16 EVOLUTION 131

17 SCIENCE AND SOCIETY 141

INDEX 144

"Science is a way to teach how something gets to be known, what is not known, to what extent
things are known (for nothing is known absolutely), how to handle doubt and uncertainty, what
the rules of evidence are, how to think about things so that judgments can be made, how to
distinguish truth from fraud, and from show." Richard Feynman (as quoted in Genius, James
Gleick, p. 285)

ORGANIZATION OF THE TEXT


A number of abbreviations will be used throughout the text, mostly with reference to journals. These
will include: ASM, with reference to the ASM books on "Escherichia coli and Salmonella typhimurium" by
Neidhardt et al., 1986 edition; ASM2 refers to the 1996 edition of those books (where a chapter is referred
to, the first page of that chapter will be given [eg ASM2,1234], but sometimes a specific page will be
referred to [eg ASM2,p.1236]). Other journals and series are as follows: AAC, Antimicrobial Agents and
Chemotherapy; ARB, Annual Review of Biochemistry; ARBBS, Annual Review of Biophysics and
Biomolecular Structure; ARG, Annual Review of Genetics; ARM, Annual Review of Microbiology; Bioc,
Biochemistry; CBC, Chembiochem; Cell, Cell; EMBOJ, European Molecular Biology Organization J.;
Gene, Gene; G&D, Genes and Development; Genet, Genetics; FEBS, Federation of European Biological
Societies; FEMS, Federation of European Microbiology Societies; JBact, Journal of Bacteriology; JBC, J.
of Biological Chemistry; JGM, J. of General Microbiology; JMB, J. of Molecular Biology; MGG, Molecular
and General Genetics; MicroRev, Microbiological Reviews; MMBR, Microbiology and Molecular Biology
Reviews; MolMicro, Molecular Microbiology; NAR, Nucleic Acids Research; Nat, Nature (London); PNAS,
Proceedings of the National Academy of Science (USA); Sci, Science; TICB, Trends in Cell Biology; TIG,
Trends in Genetics; TIBS, Trends in Biochemical Sciences.
Other common abbreviations will be: Ec, Escherichia coli; Bs, Bacillus subtilis; St, Salmonella enterica
serovar Typhimurium; LT, lecture topic; Sc, Saccharomyces cerevisiae.
PREFACE
Why do genetics? For the purposes of this text, genetics will be defined as the generation, identification
and analysis of mutants. What sort of information do you get through this approach? The discipline of
biochemistry allows the precise analysis of biological phenomena, but it is typically limited to analyses in
vitro. Genetic analyses are less precise and direct, but they provide an understanding of the system in
vivo. The following is not intended to be an all-inclusive list of uses of genetics, but rather to provide some
idea of the range of possibilities:
(i) The generation and characterization of mutants provide insight into the number of genes involved,
their relative location, and their transcriptional organization.
(ii) Biochemical characterization of mutants that have been genetically characterized and assigned to
a single gene provides an indication of the various roles that gene product plays in vivo. The correlation of
a gene and its protein product confirms the identity of the actual protein performing a given function in
vivo. For example, purifying a protein capable of donating electrons to enzyme X in vitro does not prove
that it is the in vivo electron donor. The isolation of a mutant that fails to reduce enzyme X in vivo and the
subsequent identification of a protein lacking or altered in that strain would provide a strong argument that
the missing protein was the in vivo donor. Analysis of metabolites accumulated in mutants provides an
indication of metabolic pathways and correlates the mutation (and therefore the affected gene) with a
biochemical step. Similarly, analysis of which externally supplied pathway intermediates bypass the
genetic block confirms pathway characterization and gene-function assignments.
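Since the logic of intermediate feeding comes up repeatedly, here is a minimal sketch of it in code (Python; the pathway, mutant names, and rescue data are all hypothetical, invented purely for illustration). A supplement rescues a mutant only if it enters the pathway downstream of the mutant's block, so the pattern of rescue orders the blocks:

    # Hypothetical linear pathway A -> B -> C -> product; mutant names and
    # rescue data are invented for illustration, not taken from the text.
    PATHWAY = ["A", "B", "C", "product"]

    # Which externally supplied intermediates restore growth of each mutant:
    rescue = {
        "mut1": {"B", "C", "product"},
        "mut2": {"C", "product"},
        "mut3": {"product"},
    }

    # A supplement works only if it enters downstream of the block, so the
    # earliest rescuing intermediate sits immediately after the blocked step.
    for mutant in sorted(rescue, key=lambda m: -len(rescue[m])):
        earliest = min(rescue[mutant], key=PATHWAY.index)
        print(mutant, "is blocked in the step that makes", earliest)

Real data are noisier (leaky blocks, impermeable intermediates), but this is the inference being made when feeding experiments are used to order a pathway.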
A nice historical treatment of the beginnings of biochemical genetics, which essentially began with
the 1936 Beadle and Ephrussi publication on the transplantation of imaginal discs in Drosophila
(Genet21:225[36]), appeared in Genet143:1[96].
(iii) The biochemical characterization of a number of altered variants of a gene product provides insight
into the relationship of the structure to the function of the protein. This is particularly true for proteins
whose structure has been determined by X-ray crystallography: the sequence position of the mutation
within the gene allows the correlation of the resulting biochemical defect with a position on the 3-D protein
structure.
(iv) There are obvious biotechnological values in the generation of mutants with desirable properties
like overproduction of valuable metabolites.
The specific power of microbial genetics derives from our ability to analyze very large numbers of
events (because bacteria are very small) and to perform selections on large populations.
The virtues of studying microbes are twofold. First, they perform a large number of functions that are
important in and of themselves. They are both destructive pathogens as well as necessary symbionts;
they produce a large number of useful compounds (like antibiotics); and they perform a substantial
amount of the metabolism that creates the environment in which we all live (e.g. the production and
degradation of all the atmospheric gases). The second reason for the interest has only become obvious
from sequence analyses. At the level of protein structure and function, there are substantial similarities
between bacteria and lower eukaryotes and the more complex eukaryotes. The technical advantages of
microbes, which include powerful methods of genetic analysis, as well as rapid growth rate and structural

simplicity, make them the model systems of choice for addressing these most fundamental issues of
biology.
Lower eukaryotes, such as the standard lab yeast Saccharomyces cerevisiae, are very distinct from
prokaryotes in terms of cellular organization and many of the details of cellular biology. Merely the
presence of nuclei is sufficient to fundamentally change a variety of important molecular aspects of cells,
for example. However, this text will make fairly frequent reference to this yeast (and by implication to other
lower eukaryotes) because the technical features of their genetic analysis makes them much more like
bacteria than they are like plants or animals. Their small size, unicellular nature, rapid growth rates and,
most especially, our ability to readily make specific changes in their genomes and analyze the effects, are
the features that make prokaryotes and lower eukaryotes worth discussing at the same time.
But you might well ask, hasn't super-cheap sequencing made all this irrelevant? I think the answer is
that most of the methods we used to identify mutations have now been rendered obsolete by sequencing, so
this text spends little time on these, though they were effectively the heart of microbial genetics only 10
years ago. But the nature of spontaneous mutations is still relevant, as is the consideration of the
biochemical effects of mutations, and these topics occupy about half this text. Where outdated methods
are described, it is because there remains a biological lesson to be gained.
Definitions and discussion of traditional genetics. Since a critical issue in any field is an
understanding of both the denotations and connotations of the terms used, it is appropriate to start off with
a number of definitions (and some controversy). Traditionally, a mutant is a strain that has an altered
growth property (termed a mutant phenotype) relative to the phenotype of an arbitrarily chosen
benchmark strain (termed the wild type). (And remember that every strain has a phenotype and avoid
saying "we didn't see a phenotype.") In this description, a mutation is the change in the DNA sequence
(the genotype) that causes that altered phenotype. I would like to propose instead the following definition:
a mutation is a change in the sequence of DNA from what is found in the wild type irrespective of the
resulting phenotype. A strain carrying such a change is termed a mutant.
The two sets of definitions may seem identical at first blush, but they are different in an important way.
In the first version, the definitions hinge on an altered behavior (phenotype). This makes sense since, until
the time when sequencing became technically easy, that is all that could be readily examined. The
second set of definitions suggests that the genotype is the crux of the matter whether or not an altered
phenotype is obvious. These latter definitions are becoming increasingly relevant because we now often
examine the genotype directly, through sequence analysis.
The latter definitions stress two points: (i) Just because a strain has a wild-type phenotype does not
mean that it has a wild-type genotype. LT15 discusses suppressor strains in which one mutation
compensates for another to give a pseudo-wild-type phenotype, yet with a decidedly mutant genotype; and
(ii) these definitions emphasize the arbitrariness of a wild-type phenotype (remember that the wild type
was simply a randomly chosen isolate). When one says a strain behaves like the wild type, one is strictly
saying "under the arbitrary conditions of the analyses employed, the strain in question is not
distinguishable from wild-type." If a strain truly has a non-wild-type genotype, there may well be conditions
(growth, media, temperature, etc.) where it will display a non-wild-type phenotype even if the scientist has
not yet found that condition.
I want to make two more points of clarification on these definitions. The first is to reinforce the point
that the definition of the wild type (organism), and therefore of the wild-type phenotype, is arbitrary; it
simply refers to the isolate that happened to be chosen. Therefore the wild type does not necessarily have
"better" or more robust growth properties than other strains, indeed, it might have specific defects
because of random mutations, compared with other possible isolates, and therefore with the genetic
potential of the organism. The second point is a semantic one, but refers to an error that we all make once
in a while, when we say "the mutant doesn't have a phenotype." Obviously, EVERY strain has a
phenotype, since a phenotype is whatever behavior it displays. What we mean, of course, is that the
mutant doesn't display a mutant phenotype, which is another way of saying that it looks like wild type, at
least for this condition.
When you analyze the phenotype of a particular mutant, you learn the phenotype caused by the
mutation in that strain. However, when you analyze the phenotypes of a number of mutants affected in the
same gene, you begin to learn the range of functions that gene product is involved in. More importantly,
the analysis tells you which mutants are typical (perhaps representative of a complete loss of gene
product function, for example) and which are atypical (which may mean interesting or may mean weird)
for subsequent biochemical analysis. This last point brings another prejudice that will be noted throughout
the text, namely, that the best use of genetics is in conjunction with biochemical analyses of the various
mutant strains.
A change in genotype refers to any known alterations in the DNA sequence from that found in wild

type. The genotype is not easily interpreted to predict growth behavior of the organism but rather is an
indication of what gene or genes might be altered in it relative to wild type. For our purposes, we will
define a gene as a region of DNA that encodes a product (either RNA or protein).
You should also become aware of the proper usage of genetic terminology and this will be sprinkled
throughout the text. Different organisms or groups of organisms have - unfortunately - been given different
standard terminology. Happily, this is standard for all prokaryotes, but rather different from that of yeast,
which is different again from other eukaryotes.
For prokaryotes, if a strain is altered in a gene whose product is itself involved in histidine
biosynthesis, then the first mutation isolated in such a set of genes would be called his-1 (note italicization,
though underlining can be used instead). If that mutation subsequently is determined to be in a gene
called hisA, then the mutation would now be called hisA1. Different versions of a gene are called alleles of
one another and the "1" in hisA1 is the allele number, which is used to name that particular mutant version
of hisA. There should not be another mutation termed hisB1 for that organism, since that his allele
number has already been used. The phenotype of an organism is noted by a different nomenclature. For
example, a strain requiring histidine for optimal growth (presumably like the strain containing the hisA1
mutation) is termed His-, while one not requiring histidine is called His+. The take-home lesson is that the
genotype is noted by three small letters followed by either a capital letter, if a gene has been designated,
or by a dash if it is not, followed again by an allele number. All of these symbols are italicized. A
phenotype, or altered growth behavior of the cell, is designated by a capital letter followed by two small
letters with either a plus or minus superscript and such a designation is not italicized. When writing the
genotype of a strain, only mutant loci are named (all others are assumed to be wild-type). This is in
contrast with the case of plasmids, where only replication functions can be assumed and any other
encoded genes should be listed. Typically plasmids are assumed to have only those genes that are
noted. Finally, the convention, often abused, is that a superscript "+" refers to the wild-type genotype. The
particular rules for allele designations are laid out in Demerec et al. (Genet54:64[66]) and are largely
followed by journals. The rules for naming plasmids (BactRev40:168[76]), transposons, and "temperature-sensitive" mutations are available (cf http://jb.asm.org/misc/jbitoa.pdf, pp.16-19). Gene products are often
referred to as a non-italicized gene name with the first letter capitalized, such as HisD or TrpC, but a few
journals use HISD and TRPC.
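Because these conventions are quite mechanical, they can be captured in a few lines of code. The following is a minimal sketch (Python; the variable names are mine, and italics obviously cannot be represented in plain text) of the genotype and phenotype patterns just described:

    import re

    # Genotype: three lowercase letters, a capital gene letter (or a dash if
    # no gene has been designated), then an allele number, e.g. hisA1 or his-1.
    GENOTYPE = re.compile(r"^[a-z]{3}([A-Z]|-)[0-9]+$")

    # Phenotype: a capital letter, two lowercase letters, and a plus or minus
    # (a superscript in print), e.g. His+ or His-.
    PHENOTYPE = re.compile(r"^[A-Z][a-z]{2}[+-]$")

    for symbol in ["hisA1", "his-1", "His-", "His+", "HisA1"]:
        kind = ("genotype" if GENOTYPE.match(symbol)
                else "phenotype" if PHENOTYPE.match(symbol)
                else "neither")
        print(symbol, "->", kind)

    # Note: the rule that an allele number is used only once per gene set
    # (no hisB1 once his-1/hisA1 exists) is bookkeeping across a collection
    # of strains and cannot be checked from a single symbol.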
For yeast, such as Saccharomyces cerevisiae (Sc), genes are given a name of three italicized letters
and an italicized Arabic number: for example ade5 or cdc28. Unfortunately these look like prokaryotic
allele numbers, but they refer to the wild-type genes. More unfortunately, genes first identified by a
dominant mutation (or because the wild-type was first identified through cloning) are the same but all
capitalized: CUP1, for example. Allele numbers are reminiscent of the bacterial case, ade5-12 for
example, but there is still the goofy difference in naming of dominant alleles. A dominant mutant allele of
ade5 might be termed ADE5-27, for example (how a mutant allele can be dominant to the wild type is
discussed in LT12). The obvious problem here is that alleles can be dominant for some phenotypes and
recessive for others, so the usage is unnecessarily complicated. There is special terminology for the
genes involved in mating that will not be discussed here. Phenotypes are described as in prokaryotes and
gene products are sometimes given as a non-italicized gene name with the first letter capitalized, followed
by a p: e.g. Ade5p.
Mutants can either be tight or leaky. A tight mutant displays its non-wild-type phenotype distinctly and
clearly while a leaky mutant displays a phenotype that is not very distinct from that of wild type. Leakiness
can be due to a range of things, as discussed later. Mutants can also be stable or unstable and this is an
indication of the frequency with which they revert to an apparently wild-type phenotype (reversion is the
return of a mutant to a phenotype rather like that of the wild type).
A conditional mutant is one that is known to display its mutant phenotype only under certain
conditions. A particular condition where the mutant phenotype is evident is termed non-permissive, while
the wild-type phenotype is observed when conditions are permissive. The typical examples are either
cold-sensitive or temperature-sensitive mutants. These are mutants that display their mutant phenotype
only at low or high temperatures, respectively. A strain that can derive all carbon requirements from the
principal carbon source is termed a prototroph. If one or more other organic growth factors (like amino
acids, nucleotides, or vitamins) are required by the strain, it is termed an auxotroph. Occasionally
reference will be made to conditionally lethal mutations, where the strain carrying such a mutation dies
under the non-permissive conditions regardless of the growth medium. Such mutations are typically found
in genes whose products perform DNA replication, RNA transcription, protein synthesis or other functions
essential on any media. For comparison, while a strain carrying a temperature-sensitive his mutation does
not grow at the non-permissive temperature without histidine, it also does not die rapidly under such conditions.
Another term that one occasionally comes across is that of a synthetic phenotype. This refers to a
phenotype created by the combination of a pair of mutations that is not present in strains with either single

mutation. A trivial example might be two homologous genes that both encode proteins capable of doing
an essential function: In mutants defective in either gene, the product of the other might suffice and a wild-type phenotype would be seen, but a strain lacking both genes would be dead.

A final point on the arbitrariness of phenotypes: when a strain is referred to as His-, you may assume
it requires the addition of histidine to the media for good growth. You should not assume that it fails to
make any of its own histidine, simply that it does not make enough for optimal growth under the conditions
tested. Similarly, a strain might make rather less histidine than a wild-type strain, but if it is capable of
making sufficient histidine so that it grows normally without supplementation, it is designated His+. Again,
the assignment of either a His+ or His- phenotype is purely arbitrary according to the actual conditions of
growth used in the experiment.
Most of the above text is relevant to all organisms and not merely prokaryotes (though the phrase
"merely prokaryotes" is pretty goofy from an evolutionary standpoint). However, the nomenclature rules
for some prokaryotes and virtually all eukaryotes are somewhat different. So too, some organisms (like
humans) are necessarily auxotrophic for some number of compounds, so don't assume that every wild-type organism has similar properties to the E. coli paradigm.
There is another interesting quirk about phenotypes in prokaryotes and eukaryotes that is rarely
mentioned, but is important. As emphasized above, mutant phenotypes are only those that we see and it
happens that we see things differently in simple and complex organisms. For microbes, there are not a lot
of readily observable traits. We primarily look at growth rate and the appearance of colonies, but of course
one pile of cells on a petri plate looks a lot like another, except for the size and occasional pigment
differences. Multi-cellular eukaryotes are very different in that they have a slew of readily observable
properties and many of those properties coincidentally can serve as indicators of rather subtle metabolic
properties. Multi-cellular eukaryotes also have complex developmental processes in which very subtle
problems show up as gross morphological changes or even death. For example, very slight changes in
purine biosynthesis affect a variety of features in Drosophila, so such mutants are readily detected,
though such subtle perturbations would never be seen in bacteria because the effect on growth, the
readily observable phenotype, is too small. Similarly, there are many types of mutations that will be lethal,
perhaps for developmental reasons, in multi-cellular eukaryotes that would hardly perturb the behavior of
bacteria. As a consequence, the detectability of mutations is often profoundly different in different
organisms.
An important suggestion. To the extent that the experience of previous students is instructive, it
appears to be important that you develop mental pictures of the biochemical processes that we are
analyzing genetically. For example, the frequency of occurrence of temperature-sensitive mutations,
relative to that of loss-of-function mutations, will seem arbitrary until you picture a protein and imagine
the sort of alterations necessary to make that gene product fail at elevated temperature while it continues
to function at lower temperatures. Such an effect on the protein clearly implies an alteration of the gene
product, not its destruction. It should also seem reasonable that there are not too many different
alterations (mutations resulting in amino acid changes) that will have this property and therefore mutants
with such phenotypes should be rather rare compared to alterations that destroy activity at all
temperatures, since there will be many more alterations that will have that result. In a similar vein, it
should seem reasonable that restoration of normal function to the affected gene products should result
from similarly rare, very particular changes in the amino acid sequence (resulting from changes in the
genotype). Without such mental pictures of the biochemistry underlying the phenotypes, genetic analysis
will seem to be an arbitrary and formal game. With such images, the genetic result will make sense
because it is consistent with a mechanistic picture of cellular processes.
When thinking about genetic results, it is useful to consider three levels: (i) The genotype, which
directly affects the production and function of gene products; the functionality of these gene products can
be viewed as a (ii) "biochemical phenotype" which then determines the (iii) growth phenotype of the
organism. Considering the middle level should make it easier to see the connection between the genotype
and the growth behavior of the bacterium.

Lecture Topic 1........DNA, GENES, AND THE CODE


DNA structure in bacteria. (By structure, I refer to the relative organization of the two DNA strands
with respect to each other, the so-called secondary, tertiary and quaternary structures, rather than to
the nature of bases and base pairs.) There are several reasons for any molecular biologist to be
interested in the issue of DNA structure: (i) DNA is relatively featureless (or thought to be) beyond primary
structure, so it makes sense that organisms would utilize any deviation from this "plainness" to identify

particular regions easily. The structures below have different degrees of experimental support for their in
vivo existence, but it is very likely that whenever formed, they will be used as informative features. (ii) The
higher order organization of DNA is a critical issue for the replication and transcription of any DNA
molecules in the cell. (iii) Supercoiling, which is loosely the tension in the double helix of a given DNA
relative to that of a completely relaxed state, influences and is influenced by other activities such
as transcription and replication. This means that supercoiling can be the medium for all sorts of
biologically important effects to occur, many of them quite indirect (i.e. did the repressor cause the
phenotype directly by its binding to the DNA or indirectly by reducing transcription, thereby affecting
supercoiling, which was actually the cause of the observed effect?).
Supercoiling. Bacterial chromatin is negatively supercoiled in vivo, which means that the double helix
has a slight tension that loosens or unwinds it. It makes sense that this is the typical state because that
slight loosening is what allows RNA polymerase to open the DNA strands as a precursor to transcription
initiation. This action, termed the formation of the open complex, involves the unwinding of approximately
13 bp. If supercoiling were positive, this strand separation would be energetically very difficult. This
biologically typical state of being a bit under-wound is termed "negative supercoiling." Indeed, many
publications use the term supercoiling to mean negative supercoiling, but this is careless because DNA
can certainly be positively supercoiled both in vivo and in vitro.
Although it is true that bacterial DNA tends to be negatively
supercoiled, this does not mean that any replicon is homogeneous in this
respect. In fact, supercoiling is a local phenomenon, with some regions
being negatively supercoiled while other regions are positively
supercoiled, even if these two regions are only separated by a few
hundred bases. This localization is important because the effects are
necessarily local. RNA polymerase is better able to separate the DNA
strands of a given promoter if the immediate region is negatively
supercoiled - it really does not matter to RNAP at this promoter if another
region is positively supercoiled or not. Such localization of supercoiling
exists in part because it can be created or removed in a localized way (see
below), but also because there can be barriers to the diffusion of
supercoiling. If there were no barriers, then the tension in adjacent
negative and positive supercoiled regions would diffuse together and
rapidly cancel each other out. These barriers to diffusion can be all sorts of
things, even severe bends in DNA, but the simplest to understand might be
the following. Consider a region in a replicon that encodes a protein that is
exported from the cell during translation. The replicon would certainly be
attached to an RNAP transcribing this gene, which in turn is attached to the
mRNA, which is attached to the translating ribosomes, which are producing
the nascent peptide, which is bound to the membrane as it is being
exported. Because of this chain of links, this region of DNA simply cannot
be allowed to spin as would be necessary for simple diffusion of supercoiling through that site.
[Figure 1-1. RNA polymerase is depicted as the dark oval on the lower left and there is a block to supercoiling dissipation shown at the top. (PNAS 99:9139[02])]
If supercoiling is so local, how can you detect the supercoiling state of a given region? Not very well
is the answer, because there is no easy way of looking precisely at a specific region within a living cell. So
instead, the following assumptions have been made: (i) If one cannot determine the supercoiling state of a
region absolutely, then perhaps one can determine it relatively; that is, is it more or less supercoiled in
this mutant than in the WT? (ii) If a mutant is altered in its ability to create or dissipate supercoiling, then
that alteration should be seen in the supercoiling state of any replicon in the cell. Thus the procedure has
been to isolate a given plasmid from two different strains and ask in vitro if they differ in their supercoiling.
If there are differences then one concludes that the different proteins in each strain have differentially
affected the creation or dissipation of supercoiling. Obviously one is not analyzing local supercoiling here,
but some approximation of the overall supercoiling of the replicons, when all proteins and diffusion
barriers have been removed. One then makes the final assumption that there should be some correlation
between the measured general effect and the unmeasured state of any given small region.
Now there could be pages of digression here, but that would not be helpful for the level of
understanding we are after. However, a few things need to be clarified. It might bother you to read that the
analysis is done in vitro, since this must change supercoiling. Indeed it does, as does the presence of
proteins and the specific salt concentration of the solution. But since these replicons are covalently closed
circles, the relative tension cannot disappear, though it does manifest itself in a different way. What one
actually is detecting are supertwists in the replicon (I'll try to demonstrate in class), which are an indirect
measure of the average supercoiling that was in the replicon before its removal from the cell. These

supertwists are more precisely described by a term called the linking number, but that rapidly becomes
complicated as well. For a fairly lucid description of this, as well as the enzymes involved, see
NAR2009:1[09]. This article also describes the roles of supercoiling and related enzymes in allowing the
separation of daughter replicons immediately after replication.
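For reference, the standard bookkeeping behind those supertwists (this is textbook DNA topology, not something specific to the NAR review) is:

$$\mathrm{Lk} = \mathrm{Tw} + \mathrm{Wr}, \qquad \sigma = \frac{\mathrm{Lk} - \mathrm{Lk}_0}{\mathrm{Lk}_0}$$

where Lk, the linking number, counts the number of times the two strands wrap around each other; Tw (twist) is the local winding of the strands about the helical axis; Wr (writhe) is the supertwisting of the axis about itself; and Lk0 is the value for the fully relaxed circle. Negative superhelical density (sigma < 0) corresponds to the under-wound state typical of bacterial DNA. Because Lk is invariant in a covalently closed circle, removing proteins and changing salt in vitro can only trade twist for writhe; the total deficit is conserved, which is why the supertwists remain an indirect record of the average supercoiling, as noted above.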
I should cite one more curiosity concerning the generality of supercoiling issues. The NAR review cited
immediately above makes the very surprising claim in its abstract that in extreme thermophiles, DNA is
positively supercoiled, which protects it from thermal denaturation (NAR2009:1[09]). Part of the surprise is
that this is not discussed anywhere else in the article (!?!?). Another part is that it is my sense that the
DNA duplex is quite stable above 100 °C at physiological salt conditions, so I am unsure that stabilization
by positive supercoiling is necessary. Finally, it then becomes unclear how RNAP can ever get started,
because there would presumably be a substantial energetic barrier to overcome in trying to open
positively supercoiled DNA. I think that it is true, however, that almost all examined thermophiles have a
novel reverse gyrase that has the net effect of introducing positive supercoils
(BiochemSocTrans37:69[09]), but I am unsure of what to conclude from that.
Causes of supercoiling. If DNA were simply a DNA duplex floating in the cytoplasm, there would be no
such thing as supercoiling, because, as alluded to above, any supercoiling tension would simply dissipate
into the surrounding regions. This dissipation often does happen, but the fact that supercoiling can exist
locally implies that such dissipation sometimes fails to occur. How can this be? First we need to start with
the mechanism by which supercoiling is induced, however transiently.
Almost certainly the most important source of supercoiling is transcription, so let's consider
case of an RNA polymerase (RNAP) transcribing a region of DNA. Obviously if the RNAP tracked along
the outside of the DNA duplex, essentially rotating around the helix, there would be no strain on the DNA
duplex, since the DNA itself would not move. But this simply cannot be the case for the following reason.
This would require not only RNAP to circle the DNA helix (and because transcription is at roughly 50
nt/sec, this means five rotations of the DNA per second), but it needs to trail the newly synthesized mRNA
and, most importantly, any ribosomes that are already translating that newly synthesized mRNA. This is
because prokaryotes, lacking a nucleus, typically begin translation of an mRNA as soon as the
appropriate region of the mRNA emerges from the transcription complex. It should be obvious that a
string of ribosomes bound to an mRNA cannot be hauled around the DNA five times per second.
Sometimes it is worse than that: some proteins that are transported out of the cytoplasm are inserted into
the membranes as they are being translated, which means that the mRNA itself is fixed to the membrane
and cannot rotate with respect to the DNA at all.
If RNAP cannot move around the DNA, yet it must move with respect to the DNA (again, since the
DNA is itself a helix), then the DNA must move around the RNA polymerase. In other words, as the DNA
feeds past the transcription complex, there must be continual loosening and tightening of the helix on
either side. The net effect of this is that there is an accumulation of positive supercoiling 5' (in front of) of a
transcription complex and negative supercoiling 3' (behind). Roughly the same thing is true for a
replication complex: in some ways it's worse because replication is even faster at about 1,000 nt/sec, but
it is also true that replication through a given region is much less common than is transcription, so the
latter has a greater total effect. On all of the above, see mini-rev Cell56:521[89] & Nat337:206[89] and
ASM2,764[96].
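As a quick check on the rotation rates quoted above (assuming the standard B-DNA helical repeat of about 10.5 bp per turn, a number not given in the text):

$$\frac{50\ \mathrm{nt/s}}{10.5\ \mathrm{bp/turn}} \approx 5\ \mathrm{rotations/s}\ (\text{transcription}), \qquad \frac{1000\ \mathrm{nt/s}}{10.5\ \mathrm{bp/turn}} \approx 95\ \mathrm{rotations/s}\ (\text{replication})$$

so the "five rotations per second" figure follows directly from the elongation rate and the helical repeat, and a replication fork is roughly twenty times worse.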
So this is how supercoiling is locally created, but why doesn't it simply dissipate (Fig. 1-1)? We
will consider three examples. First, consider a region of DNA where there are two regions that are
transcribed toward each other (two convergently transcribed operons with their termination regions near
to each other). In each case, the act of transcription creates positive supercoils in front of the
transcription, but the region of the termination sites is actually in front of both sets of polymerases. Thus
positive supercoiling here simply cannot dissipate in either direction. Second, consider the case mentioned
above where a protein is being cotranslationally inserted into the membrane. Transcription of that region
simply cannot allow any type of supercoiling tension to diffuse through that region because the DNA is
fixed to the membrane through the RNAP-mRNA-ribosome (JBact171:2181[89]). Finally, imagine simply
that there is a protein that causes a severe bend in the DNA. This bend only affects a couple of base pairs
directly, but imagine what it looks like from a short distance (a couple hundred nm). Rather than the
normal fairly straight rigid DNA duplex, which certainly would be free to rotate along its axis if there were
not transcription and membrane association, there is a V-shaped section of DNA. If that region were to
rotate to dissipate tension, it is not a simple rotation about the axis but a dramatic movement of the V
through the surrounding cytoplasm and other DNA. This is much harder and such sites have been shown
to serve as blocks to supercoiling dissipation. Note that such blockages need not be total. That is, as long
as the rate of dissipation is slower than the rate of supercoiling creation, there will be a net accumulation
of supercoiling.

So how does the cell deal with these challenges? The answer is with the use of topoisomerases
that directly create or dissipate supercoiling. For supercoiling, the most important class are termed type II
DNA topoisomerases. These make a double-stranded cut in DNA, then pass another ds DNA region
through that cut and reseal it. The known examples are gyrase and Topo IV. Though it is not obvious from
this description, there is directionality to this process such that they introduce negative supercoils into the
target DNA in an ATP-dependent manner. A major role of gyrase is to remove the positive supercoils
ahead of RNAP.
Bacterial cells typically also have two type I topoisomerases, termed Topo I and III, though only
the former seems very important for supercoiling. Topo I removes negative supercoils and indeed is
activated by high negative supercoiling of a region. It cuts only a single strand of the duplex and allows
the two strands to spin relative to each other, which dissipates the negative supercoiling.
Both gyrase mutants (gyrA) and Topo I mutants (topA) are extremely sick but mutations in either
are suppressed by compensatory mutations in the other. This might seem paradoxical - how does the cell
improve its lot by eliminating another important enzymatic activity? The answer seems to be that this
suppressor mutation restores a bit of balance to the net supercoiling, though such cells have growth
problems in comparison to the wild type. Because the major role of gyrase is removing the positive
supercoils ahead of transcription complexes, mutations or conditions that slow transcription provide some
relief from the need for gyrase. For example, either a Rif-r mutation (which causes a slower RNAP) or uracil
deficiency (ditto) suppresses the growth defects of gyr. This is explained by the fact that accumulation of
supercoiling is the problem and that this depends on the relative rates of creation and removal. If a cell is
slower to create supercoiling because of slower transcription, then the level of natural dissipation can do a
better job of keeping up.
Effects of supercoiling. The most critical effect of high levels of supercoiling is to interfere with
transcription. However, there are a number of other effects as well. For the general effects of supercoiling
on global gene expression in E. coli, see ARG36:175[02]. Supercoiling definitely potentiates the existence
of most of the unusual DNA structures described in the following sections, with the possible exception of
bent DNA (Cell54:403[88], Sci240:300[88], PNAS87:8373[90]). It is quite possible that supercoiling also
aids the compaction of DNA into nucleosomes and this could have an indirect role in recombination (see
JBC265:6441[90] for effects on site-specific recombination) and gene regulation. Lastly, supercoiling
changes the physical spacing between nucleotides, with potential effects on protein:DNA interactions
(JMB213:931[90]). This is because protein:DNA interactions involve the precise interaction of specific
regions of protein with specific atoms on the DNA. Supercoiling necessarily changes the positions of
those DNA atoms in any given base pair with respect to those on either side, simply because
repositioning of nearest neighbor base pairs in the DNA is the only place for this tension to go. As a
consequence, a specific region of DNA has a different structure at the atomic level (as seen by proteins)
depending on its supercoiling.
It has been suggested that supercoiling is a mechanism to regulate anaerobic gene expression
specifically, because addition of gyrase inhibitors blocks transcription of at least some anaerobically
regulated genes (PNAS82:2077[85] and tons of similar reports). However, it has been argued that the
observed effect reflects the fact that the studied genes are heavily expressed, and not that they are
anaerobically regulated (JBact171:4836[89]). The connection would be that high expression leads to lots
of supercoiling, which would need to be relieved by high gyrase activity.
It has also been shown that certain growth shifts (anaerobiosis, osmotic shock) can perturb
supercoiling. This might be related to a transient effect on the ATP/ADP ratio, thereby affecting gyrase
activity, but caution should be used in interpreting these results. For example, while it's true that osmotic
shock affects supercoiling, supercoiling can affect gene expression, and osmotic shock induces proU
expression, direct measurements have shown that the last is not an effect of supercoiling of the proU
region (JBact173:879[91]). Similarly heat shock causes a transient relaxation of negative supercoiling and
it occurs in an rpoH mutant, so it is not a result of heat shock proteins. Interestingly heat shock proteins
seem to be encoded by some of the few genes whose transcription is elevated under conditions of
reduced supercoiling (MGG238:1[93]).
Among the genes whose expression is affected by supercoiling is gyrA (which encodes gyrase),
apparently by a mechanism of regulated anti-termination (PNAS86:8882[89]). This gives the
anomalous result that inhibitors of gyrase can actually lead to an eventual increase in supercoiling,
because inhibiting gyrase leads to overexpression of gyrase.
Summary of supercoiling. Supercoiling affects and is affected by almost everything that happens to DNA.
This is not to demean its importance, but to emphasize the difficulty of a reductionist approach. There is
almost nothing you can change without potentially affecting the supercoiling of the region of interest. One

can even imagine that moving a gene to a different replicon, or adding or deleting adjacent regions of
DNA, could affect gene expression indirectly through the effects of those changes on supercoiling, simply
because the regions that are now adjacent to the gene have different effects on supercoiling. A most
obvious, but critical, lesson is that correlation does not prove causation.
Left-handed or Z-DNA. The easiest way to picture
such DNA is to imagine that a number of contiguous
base pairs are rotated 180° with respect to the helical
axis (Fig. 1-2). The hydrogen bonding between the
bases is unaffected, but the bonds that attach the
bases to the sugars have been the sites of rotation.
The face of a given base that used to be oriented
toward one end of the helix is now oriented toward the
other. It so happens, however, that this single rotation
of a couple chemical bonds has profound
consequences for the structure of that DNA region.
Because of its effects on the DNA helix, Z-DNA
formation is stimulated by negative supercoiling (see
below).
[Figure 1-2. Depiction of a section of Z-DNA (from Zubay, Biochemistry (1984)).]
Z-DNA is potentiated by a variety of factors that either differentially stabilize it or destabilize normal
B-DNA: high salt; low MgCl2, CoCl2 or glycerol; negative supercoiling. DNA sequences with 5-MeC
residues potentiate it because the methyl groups are hydrophobic and are in a more favorable
hydrophobic pocket in Z-DNA, while they are in contact with H2O in the major groove of B-DNA.
One probes for Z-DNA by a variety of tools. These include nucleases that cut one or the other
DNA form, restriction enzymes/methylases that typically only cut B form DNA, antibodies to Z-DNA, and
certain chemical probes that happen to react differentially with the two DNA forms. Z-DNA has been
recognized in vivo by the ability of methylases and endonucleases to interact with B but not Z-DNA
(Sci238:773-777[87]), but such assays are technically difficult. Nevertheless it is certainly clear that some
population of Z-DNA must exist in cells. The impacts of Z-DNA in vivo are a bit less clear. Regions that
are Z-prone are also prone to deletions, but Z-DNA can bind recA product and it might potentiate
recombination and deletion formation. It is unclear, however, if the Z-DNA caused the deletions or if
there was a selection for deletions (JMB207:513[89]). Some of the effects assigned to Z-DNA may in fact reflect its
ability to potentiate hinged DNA and cruciforms.
Cruciforms. Cruciforms are stem-and-loop structures that can
form in single-stranded nucleic acids or in the single strands of a
separated double-stranded DNA duplex (Fig. 1-3). Cruciforms are
distinguished from simple palindromes in both structure and
function: cruciforms have a non-palindromic region in the center,
to form their loop, while palindromes do not have this central
region and therefore cannot loop out.
[Figure 1-3. Cruciform (JBC263:1095[88]).]
The typical reason for symmetry in DNA sequences is that they serve as binding
sites for proteins. The simple palindromes recognized by
restriction/modification systems are typically symmetrical. More
complex palindromes, where short inverted sequences are separated by a few base pairs, are more
typically the binding sites of symmetric protein dimers. The beauty of this is that the cell gains the
specificity inherent in a binding site of n bases, but each protein subunit must only evolve to bind n/2
bases. Certainly there are other inverted repeats, separated by other DNA sequence, that have nothing to
do with protein binding, but can still form cruciforms.
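The arithmetic behind that economy is worth making explicit (assuming equal base frequencies and independence between positions, which real genomes only approximate): a specific n-bp site occurs in random sequence at a frequency of about

$$4^{-n} = \left(4^{-n/2}\right)^{2},$$

so a symmetric dimer whose identical subunits each read an n/2-bp half-site of an inverted repeat achieves the full specificity of an n-bp site, while each subunit only has to evolve recognition of n/2 bases.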
In normal B-DNA, the looping out of DNA to form cruciforms is energetically unfavorable, because
it involves the loss of the stacking energy of the bases in the loops. However, cruciforms can be induced
by negative supercoiling, since a cruciform relieves a lot of supercoiling tension. Certain DNA
modifications also affect cruciforms: meA increases cruciform formation 4-fold, while meC decreases it 2-3-fold (JMB205:593[89]). Finally, they are affected by salt concentration (JBC263:7370[88]). Cruciforms
have been detected in vivo by induction of single-strand nucleases with subsequent in vitro analysis for
the presence of nicks (JBC262:11364[87]). Cruciforms might have a role in recombination (JMB202:35-43[88]).

Bent DNA or curved DNA. (see the review on sequence-directed curvature of DNA in ARB59:755[90])
All DNA sequences have a natural curve because each type of adjacent base pair causes a very slight
distortion in the helix from a perfect symmetry around the axis. For most sequences, however, the
distribution of base pairs is such that these small factors average out and most regions are roughly
straight. However, regions of non-random base choice (such as dA-dT tracts) have a decided bend,
though the curvature of DNA is not static but a "net time-averaged deflection of the helical axis from
linearity." Bends can also result from protein:DNA interaction. Several remarkable experiments have
shown that, in at least some cases, it is the existence of a large bend, and not the mechanism of its
formation, that is necessary for exerting a biological effect. For example, IHF (integration host factor)
bends the DNA for the function of the phage lambda intasome, but CRP (the cyclic AMP binding protein),
which is a completely unrelated transcriptional factor, can replace IHF simply because it can also bend
DNA (Nat341:251[89]). Similarly, curved DNA can replace a CRP-binding site for the gal promoter. These
results indicate clear roles for bending in transcription and recombination, largely through allowing the
formation of the proper multi-protein complexes. Bent DNA has also been implicated in recognition of the
origin of transfer of conjugative plasmids (JBC265:10637[90]).
In at least the case of the nucleoid protein H-NS, bent DNA can affect transcriptional control by
binding to a naturally curved promoter region (indeed, H-NS favors curved DNA binding in vitro) and
apparently scaffolds this DNA to the appropriate structure for regulation (Cell71:255[92]). The nature of
the bacterial nucleoid is addressed in ASM2,158 & 1662[96].
Bent DNA is detected by its retardation of DNA fragments on gels, and the degree of retardation
may reflect the activation energy necessary to straighten out the DNA for passage through the gel pores
(Bioc29:9269[90]).
Related issues in bacterial DNA structure.
Hinged or triply stranded DNA: This is a non-B, right-handed helix, with an anti-parallel third strand, where
the last term refers to two strands of similar sequence in the opposite orientation (ARB64:65[95]). This
structure is potentiated by stretches of purines on one strand (e.g. GGA8, AG12, GGAA6, G19, GAA8,
GAAA6, but not A20) and negative supercoiling. A certain percent of G's is necessary. Hoogsteen "triple"
base pairs seem to be involved (Hoogsteen pairs are base:base H-bonding interactions at positions
completely different from those described by Watson and Crick and clearly have biological significance in
some situations.) One finds appropriate sequences in eukaryotes at greater than random frequencies,
often 5' to regions involved in recombination (Sci241:1791,1800[88]). The presence or absence of such
structures is affected by length, sequence, pH, metal ions and polycations like spermine
(JBC265:10652[90]). A different form of triple-stranded DNA, involving interactions of parallel double-stranded and single-stranded DNA molecules, is probably the initial complex in homologous
recombination. It is termed paranemic (not truly interwound) (JBC265:16898[90]).
DNA bulges/single-base loops. The addition or deletion of one base on one strand of a DNA duplex
relative to its complementary strand imparts a dramatic 21° bend (Bioc28:4512[89]) in DNA. Such
situations arise through errors in replication. These effects are more dramatic than multiple mismatches
(NAR17:6821[89]) and are recognized and repaired by the mismatch repair system.
Funny base pairs. There are actually A:A and C:C base pairs in hairpins at telomeres (Nat339:634[89]). It
is also apparent that a whole range of non-Watson-Crick bp (such as the Hoogsteen base pairs
mentioned above) may be of relevance to both odd DNA structures and to non-standard bps. It's been
argued that both syn and anti bases (different base pair geometries) may be involved in A:G bps, with
possible implications for repair systems. Some A:G pairs are known to be important in certain tRNA and
rRNA structures and these are often dependent on the nearest-neighbor context (see NAR23:306[95]).
It is also unclear how important thymine is to DNA since the Bs phage PBS2 contains uracil
instead. An appropriately mutated Ec strain has also been shown to complete an entire round of
replication using uracil instead of thymine, but then dies for unknown reasons (JBact174:4450[92]).
Repetitive DNA. It has certainly been known for some time that eukaryotic genomes contain massive
amounts of highly repetitive DNA, some of which seems to serve a structural function at centromeres. In
bacteria, there have been a number of repeated elements noted over the years and given a variety of
names. Generally speaking these elements are found in transcription units, but not in obvious coding
regions. They were thought to perhaps play roles in mRNA processing or in supercoiling of large domains
in the chromosome. (MolMicro.12:61[94], ASM2,2012[96], JBact172:2755[90], NAR20:3479[92]).
In the past few years, it has become clear that at least some of these elements have a completely

different role in microbial physiology: they are part of a system that is effectively a primitive immune
system, albeit one based on DNA sequence rather than molecular structure. The system is so involved
that it is difficult to even summarize, but it involves protein factors that create small DNA fragments of
invasive DNA, viruses or plasmids, which are then stored between repeated sequences. This library of
sequences is transcribed and used by other proteins to target incoming DNAs containing the identical
sequences (Microbe4:224[09], PLoS ONE 4:e4169[09]).
RNA structure. There are clearly unusual structures, termed pseudo-knots, that can form in RNA (cf
JMB214:437[90]) and provide a biological function (Cell58:9[90]). For example, they have a role as
nucleation sites for T4 gene product 32 binding to its own mRNA (JMB201:517[88]) and are involved in
self-cleavage structures in an animal virus (Nat350:434[91]). Also, at least some of the translational
frameshifting sites occur immediately upstream of pseudoknots, suggesting that they slow translation
(Cell57:537[89]). G-rich regions can form 4-stranded helices that may be relevant in vivo
(Nat351:331[91]). Importantly, it remains difficult to predict RNA structures as they are often not based on
normal Watson-Crick base pairs (PNAS91:4160[94]). It seems likely, for example, that A:G pairs are
important for rRNA structure and that their conformation depends on flanking sequences (JMB242:1[94]).
Ancient DNA. PCR has allowed the amplification and analysis of very old DNA, but it has been difficult to
know for certain that the amplified DNA was not a modern contaminant. It has been shown that
racemization of amino acids occurs at approximately the same rate as DNA degradation and is therefore
an excellent internal standard (Science272:810&864[96]). It is highly unlikely that claims of amplified DNA
much more than 10^6 years old are correct, because base loss would be very high over that time-frame.
Chromatin structure. The analysis of the structure of the bacterial nucleoid has a long history, but has
largely been indirect, which is not surprising given the complexity and technical difficulty of the problem.
Simply put, nucleoids are too small to tackle in a macroscopic way, but too large for most molecular
analyses. The majority of our understanding has come from electron microscopy, but that has been
supplemented by analysis of supercoiling in vivo and in vitro. The current view of genomes like that of Ec
is that there is some sort of core organizing unit with approximately 100 looped domains. The nature of
the organizing core is completely unclear although it might conceivably involve REP sequences and
possibly the interaction with gyrase itself. This general model of the structure is supported by the
observation that it takes approximately 160 single-strand breaks (by X-ray) to sufficiently relax the
chromatin so that it no longer binds psoralen (ASM2:20[96]).
There is no doubt that the bacterial nucleoid is tightly packed and that histone-like proteins are
involved in this compaction, including HU, IHF, and H-NS. It further seems likely that transcription tends
to take place on the surface (exposed loops) of the nucleoid, but strong support for this is lacking. The
general topics of nucleoid structure are reviewed in ASM2,158 & 1662[96].
Epigenetics. Epigenetics refers to situations in which there is a change in phenotype that is not
correlated with a change in the DNA sequence. Yet these phenotypes can be inherited by progeny in
what can have the approximate appearance of genotypic changes. There are at least several known
possibilities: (i) DNA modification. Imagine a scenario in which there is a palindromic DNA sequence that
can be methylated on both strands, but where the methylase has a dramatically higher activity for a hemi-methylated site than for an unmethylated one (see the sketch after this list). Should that region ever become methylated, then it will
remain so, because the hemi-methylated versions produced during replication will be excellent substrates
for the methylase. Otherwise identical sequences in the genome would typically remain unmethylated. So
if this methylation had a phenotypic effect, by modulating affinity of a transcriptional factor, for example,
then there would be a heritable change with phenotypic consequences, but without a change in DNA
sequence. Obviously there are variations on this theme in which DNA modification (with phenotypic
consequences) is modulated by either effects on the accessibility of the DNA or effects on the activity of
the modification enzymes. (ii) Histones: Binding of histones (see below) and other proteins to DNA can
obviously affect DNA structure and gene expression. Again, the ability of histones to bind to some regions
might be modulated by the presence or absence of other proteins or by the post-translational modification
of the histones themselves. Depending on the situation, this might mimic the site-selectivity in the above
example or be more general. (iii) RNA processing: To the extent that mRNAs are differentially processed
or differentially translated, then modulation of the selectivity of those steps would be a non-genetic
process that affected phenotype. To the extent that these steps can be cytoplasmically inherited, they
would have the general appearance of genetic inheritance in some cases. (See the last section of LT11,
on CRISPR and RNAi). (iv) Prions: Prions are discussed in LT11 with the viruses, but in brief, they are
proteins that can propagate themselves by converting normal cellular proteins to the prion form. When this form alters cell phenotype, it has the appearance of a genetic mutation.
I should emphasize that these epigenetic cases really do not look exactly like genetic alterations; in yeast, for example, prions do not segregate 2:2 like chromosomal genes. Rather, the cases have the
approximate appearance of genetic changes, depending on the analysis and how closely one examines
them. I assume that site-specific (or gene-specific) epigenetic phenomena are relatively rare, given the
challenges for a mechanism to provide that specificity. But more global effects on gene expression
through any of the above mechanisms seem like they are highly likely in many cells under some
conditions.
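For the DNA-modification case in (i) above, the heritability follows from nothing more than the rate difference between the hemi-methylated and unmethylated substrates. Here is a minimal simulation in Python; the two rate constants are invented for illustration, not measured values.

import random

# Hypothetical per-replication probabilities: the maintenance methylase is
# assumed to be far more active on a hemi-methylated site than on an
# unmethylated one, as in case (i) above.
P_HEMI = 0.999    # hemi-methylated site -> re-methylated before next round
P_NAKED = 1e-6    # unmethylated site -> methylated de novo

def follow_site(methylated, generations):
    """Follow one palindromic site through successive replications.
    Replication of a fully methylated site yields hemi-methylated daughters,
    which are almost always re-methylated, so the state is heritable with
    no change in DNA sequence."""
    for _ in range(generations):
        methylated = random.random() < (P_HEMI if methylated else P_NAKED)
    return methylated

trials = 1000
print(sum(follow_site(True, 100) for _ in range(trials)) / trials)   # ~0.90
print(sum(follow_site(False, 100) for _ in range(trials)) / trials)  # ~0.00

The methylated and unmethylated versions of otherwise identical sites are thus both stable for many generations, which is the heritable, non-genotypic behavior described above.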
DNA structure in lower eukaryotes. The central difference in DNA structure between prokaryotes and
eukaryotes is the existence of a much more elaborate and defined structure to chromatin, which is termed
the nucleosome. This is a complex of eight histone proteins and about 200 bp of DNA that are wrapped to
give discrete structures. Prokaryotes have histone-like proteins associated with their DNA, but the
arrangement of the complex is much less precise, and therefore has fewer functional implications, than
does the nucleosome. In many cases, the positioning of the nucleosome relative to the coding regions
affects the expression of the gene. Nucleosomes are also subject to higher-order organization, which in
turn can be affected by local supercoiling, though it appears that supercoiling is somewhat less of a factor in gene function in eukaryotes, especially higher eukaryotes, though of course that might simply mean that we have not studied it.
The degree of accumulated supercoiling in any region will of course be a function of its creation
and dissipation. The creation of supercoiling by transcription might seem to be the same in prokaryotes
and eukaryotes, but I am not certain that this is correct. One might imagine, for example, that the
absence of coupled translation in eukaryotes means that the newly synthesized mRNA can swivel around
the DNA so that less supercoiling is even produced. In terms of dissipation, it is hard to know the impact
of dissociation and reassociation of histones before and after transcription, as well as their relative impact
on supercoiling dissipating into surrounding regions. Still, at least for those eukaryotes that have massive
amounts of junk DNA between the coding regions, this junk does provide a potential region to absorb high levels of supercoiling. Presumably this challenging topic will receive more attention in the future.
The nature of genes in prokaryotes. For our purposes, a gene is a region known to encode a product,
while an ORF is a region that appears to encode a protein product because it is a reasonably sized
stretch of sense codons. The absence of nonsense stop codons, which is another way of stating the
definition of an ORF, is less valuable an indicator in GC-rich organisms, however. This is because the
G/C richness of the third position of the codon and the A/T richness of the first two positions of the
nonsense codons means that +1 frameshifts rarely cause stops and -1 frameshifts never do. Codon
choice statistics, which test if the frequency of use of different codons in a region matches that of known
genes in the organism, are more reliable for identifying biologically meaningful coding regions. As you will
read below, there are things like introns and translational frameshifts that make it difficult to know with
absolute confidence where the coding region even is, based simply on sequence. It is even difficult to
know with confidence where translation starts, without some biochemical analysis of the product. As a
consequence, while sequence information provides a vast amount of information for testable hypotheses, it
cannot support strong conclusions about function. Finally, remember that not all gene products are
proteins. There is a growing literature on the importance of large and small RNAs to the physiology of all
organisms (ARB74:199[05], Sci314:1601[06]).
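To make the ORF definition above concrete, here is a minimal scan of one strand in Python. The start codon, stop set, and length cutoff are the standard textbook choices, but, as the text stresses, a real gene call would also need codon-choice statistics, start-site evidence, and database matches.

STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=100):
    """Yield (frame, start, end) for ATG-to-stop runs of sense codons."""
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        start = None
        for idx, codon in enumerate(codons):
            if codon == "ATG" and start is None:
                start = idx                      # remember first start codon
            elif codon in STOPS and start is not None:
                if idx - start >= min_codons:    # long enough to call an ORF
                    yield frame, frame + 3 * start, frame + 3 * (idx + 1)
                start = None

The same scan would be run on the reverse complement, and note that in a GC-rich genome stop codons are rare enough that this test alone produces many spurious "ORFs", as discussed above.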
I simply do not have the expertise to weigh in on the matter of the quality of gene identification
and of annotation in data banks, but it is safe to say that you should not assume that annotation by
anyone is necessarily correct. For example, there has been a claim that the average ORF in archaea is
shorter than that in bacteria (TIG16:107[00]), but a counterclaim argued that this is fictitious and results
from counting a number of ORFs that simply do not correlate with a likely gene product (TIG17:425[01]).
The latter paper posits that the ORF is likely to be spurious unless there is something similar to it in the
data base. Obviously that is not a error-free assumption, but probably not bad due to the number of
sequences currently available. Similarly, there have been several comparisons in the nature of annotation
of genes that suggest significant error. One compared the annotations of three independent groups for the
same genome, Mycoplasma genitalium, and showed that 8% of those were in serious disagreement. A
statistical comparison suggested that as many as 30% of annotations might be incorrect in some
important way (TIG17:429[01]).
Naming of genes.
What should you think about in naming a newly found gene? Let's assume that you have sequenced a region in your bug of interest (or, more likely, seen the region appear on the TIGR website when they sequenced your bug's genome) and there is an ORF with sequence similarity to something in the databank, say gene newX of E. coli. So, do you call your gene newX as well? The point is that giving it the same name implies that it does the same biochemical function, so you ought to use that only when there is a pretty fair chance that it's correct. On the other hand, giving your gene a new (i.e. previously unused) name implies that it's different and, importantly, would not clue the reader to the fact that it really IS similar to something else. Now if the two genes are 99% identical, then we can probably agree to use the same name, but what if it's 80% identical? How about 20% (which is about the
limit of what we can detect above noise)?
Now there are certainly cases of two proteins with only modest sequence similarity that appear to
have identical biochemical functions. However, there are also cases of regulatory proteins in two different
organisms that both perform central nitrogen regulation, but do it by interacting with different, but
overlapping, sets of proteins. Are these two homologs doing the same biochemical function?
The answer is that there is no simple answer. If you have some good biochemical data on both
gene products, then you can be on fairly firm ground, but getting that sort of information is orders-of-magnitude slower than genome sequencing, so we rarely have that option. The TIGR web site has a bit of
text on this, but they sort of dodge the critical issues
(http://www.tigr.org/CMR2/db_assignmentextver2.shtml). As noted above, either an old name or a new
one carries a connotation that one typically does not want to imply, yet some name has to be given. It is a
dilemma, but the implication is that you cannot take the names of genes too seriously in terms of
assuming precise function.
Indeed, what does "identical biochemical functions" even mean? If two proteins perform the same
catalytic reaction, but with a different Km, are they identical? What if the reactions are similar, but they are
affected differentially by inhibitors?
A related question to assigning a gene name is when should a gene be renamed to reflect a more
biochemically concrete understanding of its role? Some cases are clearly justified such as referring to
tRNA suppressors as alleles of the tRNA gene (e.g. tyrT1 rather than supX). In other cases, repeated
name changes (glnF was changed to ntrA when it became clear it was not glutamine-specific, and then to rpoN
when it turned out to be a sigma factor) do provide a more appropriate name, but at the cost of
complicating the literature. I would argue that the point of the name is to keep the literature straight, not to
precisely describe the function of the product. Be sure to read the related section on Homology below.
Overlapping genes in the same reading frame. There are a number of cases of a pair of proteins sharing
a large part of their coding sequence, but where they use different translational start or stop sites. It
should be obvious that it is easier to imagine two different start sites than two different stop sites, since
the latter implies that ribosomes must translate through one of them some of the time. Some of these
slightly different gene products have been shown to antagonize the activity of their partner protein (Tn5
transposase, JBact170:3008[88]), and in other cases the larger product has an activity in addition to that
possessed by the shorter (sog gene products in conjugation, EMBOJ5:3007[86]). More complicated is the
case in IS1, where one region, insA (encoding a DNA binding protein), is occasionally fused by an internal
frameshift to another reading frame, insB (encoding transposition functions), thus producing two different
and competing proteins (to modulate transposition). Properly there are two gene products: InsA and
InsA'B, rather than two overlapping genes. This seemed to be further complicated by a third gene product
whose coding region starts within insA (JMB240:52[94]), and indeed as many as eight ORFs, overlapping
all over, have been proposed to encode products. In the final analysis, however, only insA and insB seem
to make useful products (JBact178:2420[96]). Similarly, there are two start sites for the P22 lysis gene
(JBact172:204[90]). The role of regulatory mechanisms yielding such protein pairs is discussed in LT3.
Differential termination involves either a leaky stop codon (one where termination is inefficient and where
an amino acid is occasionally inserted) or the very odd case of novel amino acids inserted at very specific
codons that is discussed in LT3.
Overlapping genes in a different reading frame. When two genes overlap in different reading frames,
there is no shared function for the two gene products (not surprising since they share no amino acid
sequence). The best cases of this situation are with some of the single-stranded RNA phage (MS2, GA),
where they have a lysis function encoded in a region that also encodes a replicase. Other examples of
extensive overlap are rare: some par and kil functions in plasmid RK2 are apparently encoded by
overlapping genes (NAR14:4453[86]); the int and xis genes overlap (PNAS77:2482[80]) and a similar
case exists for a Myxococcus phage (JBact181:406 [99]). Many people have the sense that overlapping
genes are common. For example, one writer stated that "It (overlapping coding regions) is frequently
present in viruses, bacteria and yeast" (TIG12:168[96]), but I think it is actually exceedingly rare outside of
viruses and by no means common among them. This rarity makes sense because it should be exceedingly difficult to evolve a single DNA sequence that makes functional sense in two reading frames,
since any mutational change would likely alter the product of both reading frames.
Related issues. For similar reasons, it is hard to believe the poorly documented proposals of cases of
proteins encoded on "antisense" DNA strands. Unusually sized "anti-ORFs" (ORFs in any reading frame
on the strand complementary to the known sense strand) have been noted in rpoB and C, but functions
are not obvious (JBC264:15074[89]). An odd but related idea is that "anti-sense protein" (encoded by the
same reading frame but from the nonsense strand!) has a non-random affinity for its "sense" analog. Such
interactions are (at least) complicated and of no known biological function (Bioc28:8804[89] and
references therein).
Two antisense ORFs in yeast have been shown to be non-functional. The authors suggest that
these might have arisen because of codon choice on the sense strand. In both cases, the antisense
ORFs are in the same register as the true gene such that the failure to use codons "complementary" to
stop codons in the sense frame yields large ORFs (MGG243:363[94]).
Identification of gene products. While there have been a number of methods for confirming the identity of
a gene product, they were generally rather difficult, at least for low abundance proteins. However, it is
now possible to elute a protein from a 2-D PAGE and subject it to mass spectrometry that, in conjunction
with genome information, allows its absolute correlation with a specific gene. What I do not know is how
clever the programs are in detecting some of the gene oddities (odd starts, elongation and stops)
mentioned in this LT. Of course, not all gene products are proteins. Besides the well-characterized RNA
products such as tRNA and rRNA, there is a growing list of genes (mostly, but not exclusively, in
eukaryotes) that appear to encode "riboregulators" that may act by binding to mRNA or affecting RNA
metabolism in some other way (EMBOJ13:5099[94] & Cell109:141[02]).
Homology. Two biological objects are said to be homologous if it is evident that they are evolutionarily
related. As such, they can either be homologous or not, not "highly", "slightly", "very" or "really really"
homologous. Nucleic acid sequences can vary with respect to their degree of similarity or identity, but not
in their degree of homology. However, through both misuse and the complications of biology, our terms
have started to break down in this area and the issue is discussed in TIGS16:227, 437, 439[00].
Perhaps the most important term in the above definition of homology is the term "evident." If
things are very similar, that provides strong evidence that they are homologous, because it is unlikely that
two similar sequences appeared independently. However, the lack of similarity might simply mean that the
relatedness is too ancient for us to recognize. Certainly the last vestige of homology will be reflected in
the 3-D structure of the product, but this can only be identified through structural analysis or at least
through good computer simulations. There is certainly crystallographic data suggesting that many proteins
with no obvious functional or sequence similarities nonetheless contain domains that are structurally identical. In other words, there are a large number of variations in amino acid sequence that give an
identical structure. But, you ask, isn't the point of looking at homology to gain insights into function?
Therefore if it's really true that proteins with at least some structurally identical domains have wildly
dissimilar functions, does homology tell us anything?
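For concreteness, the "percent identity" numbers being debated here are just counts over aligned positions; a minimal Python sketch, assuming a gap-free alignment:

def percent_identity(a, b):
    """Fraction of identical residues over aligned positions (no gaps).
    This is the number behind statements like '80% identical'; note that
    it measures similarity, while homology remains a yes/no inference
    about shared ancestry drawn from such similarity."""
    assert len(a) == len(b), "sequences must already be aligned"
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

print(percent_identity("MKTAYIAKQR", "MKSAYIAKQR"))  # 90.0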
I'd argue that the conserved units of proteins are domains, and that their evolutionary relatedness
will be evident from sequence (if they have evolved recently) or structure. How a given protein uses those
different domains for function depends on the other domains and the particular amino acid side chains.
For this reason, the biochemical function of the entire protein may not be conserved, even if two proteins
are clearly homologous. To the extent that proteins mix and match domains, the very notion of
homologous proteins, as opposed to homologous domains, blurs a bit. In other words, a protein created
from the fusion of portions of two other proteins will have identity to each over certain regions, but the net
biochemical role of the various proteins are almost certain to be rather different.
It is also apparently true that nature has reinvented certain local structures through convergent
evolution. One example is the Ser/His/Asp catalytic triad (TIBS23:347[98]), which is found in different
protein folds. Some other simple protein folds, such as four-helical bundles, might also be the result of
convergent evolution. However, other folds such as β-trefoils, previously thought to be examples of convergence, are probably simply distantly related homologs (see full discussion in ARBBS31:45-71[02]).
Homologous proteins can even have different structures. Ivan Rayment has noted that there are
several proteins associated with myosin that are homologous to each other and to calmodulin. While local
aspects of structure are conserved, in each case there is a severe change in the tertiary structure,
perhaps only reflecting rotation around one or two bonds. The topologies are, however, similar. Each of
these proteins is involved in protein-protein interactions with other proteins and that might account for the willingness to assume rather different structures. In any event, it is at least a cautionary tale concerning
homology and global, if not local, structure.
Some other examples of surprises in the homology game: The acetohydroxy acid synthases,
which are the first step in ile/val biosynthesis, are all flavoproteins, even though the reaction they catalyze
does not involve redox, nor do their flavins undergo redox during enzyme turnover. It turns out that they
are homologous to pyruvate oxidase, which does perform a redox reaction using its flavin. Apparently the
synthases evolved from something like the oxidases, developing a new function, but not seeing the
mechanism/need to eliminate the flavin, probably since it had become intrinsic to the protein structure
(JBact170:3937[88]). Another example is the "pseudo-domains" in phage-encoded DNA
methyltransferases: these are conserved regions without function in a functional gene product, but these
domains can be made functional by a single site-directed mutation (Nat352:645[91]). It appears that one
face of the protein, possessing a certain target specificity, was killed by mutation, but with the rest of the
protein's functions intact. By homology, one would have identified the (damaged) region as functional.
Yet another example is in the two-component regulatory systems: the paradigm is a sensor that
autophosphorylates a histidyl residue and then transfers that PO4 to the regulator. NifL from A. vinelandii
has the conserved histidine, but that residue is irrelevant to function and there is no evidence for
phosphorylation of NifL or its partner in the two-component system (MolMicro13:619[94]).
Conservation of sequence tends to imply (to us) that there is an obvious selection at work. In the
case of thymidylate synthetase, however, there is a conserved asn residue at the same active-site position in the first 17 sequenced examples. It turns out that this residue is uninvolved in catalysis (i.e.
aa substitutions show normal activity), but its real role is to exclude dCMP from the active site in favor of
dUMP (PNAS90:8604[93]).
Data base searches for homologous sequences are obviously a hugely powerful approach,
though the particular algorithms and concerns depend in part on the sorts of sequences being analyzed.
Genes that encode proteins have a different set of constraints than do those that encode structural RNAs,
and searches for regions like promoters and terminators remain surprisingly challenging. Perhaps most
importantly, "there is no method which permits one to go from a statistically significant observation to a
biologically significant interpretation" (quote and a great deal of background on this topic are in
ASM2,2047[96]; see also ASM2,2627, 2638 & 2649). Because of introns etc., identifying ORFs in
eukaryotes is even trickier than in prokaryotes.
Introns in prokaryotes. (See NAR29:3757[01]; JBact182:5281[00], & PNAS95:14003[98]) There are
cases, however, where real bona fide bacterial genes are complicated by the presence of extraneous
material in the middle of them, which is actually processed out of the eventual mRNA from the gene. Such
situations typically involve introns. An intron is a region of DNA residing within a coding region that is
removed after transcription and before translation. It is therefore an insertion that (typically) does not
cause a mutant phenotype. Quite possibly the role of introns is primarily that of selfish DNA. Introns exist
because they can exist and they impart little harm to their hosts. In terms of their removal from the mRNA,
introns fall into three categories: self-splicing, protein-spliced, and spliceosome-spliced, with the first two
found in at least some prokaryotes. While Group I introns are capable of self-splicing in vitro, in yeast, at least, they often encode a maturase that performs the splicing. At least one of these maturases is a bifunctional DNA endonuclease/RNA maturase. These introns are therefore capable of performing both splicing and transposition reactions.
Remarkably, it remains unclear if introns are truly ancient and were therefore lost by most
prokaryotes and archaea, or are more recent and have spread by horizontal transfer. For example, an
intron in the thymidylate synthetase gene of B. subtilis phage β22 is very similar to that in the same gene
(td) of E. coli phage T4. However, the two genes themselves are very different, the introns are at different
sites within the genes and the intron ORFs are also very different, suggesting each element of the system
has a very different evolutionary history (PNAS91:11669[94]).
T4 contains at least 3 group I introns capable of autocatalytic splicing in vitro. One of these
encodes a site-specific endonuclease that cleaves an intronless version of the region, allowing intron
mobility (rev in MolMicro4:867[90]), a feature that is not intrinsic to introns, but commonly found.
Group I introns are defined by certain potential base-paired regions, a terminal U & G, and internal guide sequences (IGS). Their mobility is site-specific, non-reciprocal, and efficient (all intron- become intron+; so why don't all potential sites have introns?). Movement requires a double-stranded
DNA endonuclease, typically encoded by the intron, and its targets are >8 bp and without dyad symmetry.
Eukaryotic introns related to those in T4 are capable of reversal of RNA self-splicing, resulting in insertion
of the intron at the original site, or at a new site if the IGS is present. Reverse transcription can therefore
allow an intron to hop. Several different introns in the 23S rRNA genes of thermophilic archaea have been
described, with some having coding regions for homing-type endonucleases (JMB243:846[94]). The endonucleases of the T4 Group I introns are mechanistically different from the eukaryotic types in that
they cut at a distance from the intron insertion site, have lower site-specificity, and lack a particular
conserved motif.
Group II introns are also self-splicing in vitro and are excised from primary transcripts as
branched molecules. (Remember that "self-splicing in vitro" does not necessarily imply "self-splicing in
vivo": enzymes merely speed up reactions that occur spontaneously.) They differ from group I introns in
possessing a rather different secondary structure within the RNA and slightly different transesterification
reactions. They have been detected in cyanobacteria and Azotobacter (not too far removed from Ec) and
at least one of the bacterial versions has been shown to self-splice in vitro. The current model is that these introns entered eukaryotes with the bacterial ancestors of mitochondria (satisfying the observation that they are only found in eukaryotes that contain mitochondria), and then spread to the nucleus.
Exons often correlate with functional domains of proteins (TIG2:223[86] & PNAS 85:2944 [88]).
This has the implication that exchange of the exons might be a pathway for protein evolution. In at least
one case, such exon shuffling has been demonstrated in the T4 td gene, generating a hybrid exon and a
hybrid intron still capable of self-splicing (Nat340:574[89]). The role of T4 introns is completely obscure,
since the very closely related phage T2 lacks them. The origin of the introns is unknown, although they do
have T4-like codon usage.
Introns have been detected primarily by sequencing, since they do not perturb the phenotype.
Group I introns have also been detected by adding 32P-labeled GTP to deproteinized RNA, and monitoring the
appearance of very small linear or circular labeled forms (Nat357:173[92]). By this approach they have
found introns in tRNA genes of Agrobacterium and Azoarcus. Others have been found in cyanobacteria
(EMBOJ13:4629[94]), archaeal genes for tRNAs and DNA polymerase, Bs phage SP01
(PNAS90:5379[93]), and recA of a eubacterium.
Detection of reverse transcriptase in bacteria (reviewed in ProgNucAcidResMolBiol 67:65-91 [01]). A
reverse transcriptase (RT), apparently necessary for the synthesis of a molecule (msdRNA) consisting of
single-stranded DNA covalently linked to RNA, has been detected in both myxobacteria (Cell48:47 & 55[87]) and Ec B but not Ec K12 (Cell56:891[89]). The gene encoding the RT lies next to the region
encoding the structural RNA that is the substrate for the enzyme, to produce the msdRNA. The functions
of the msdRNA and RT are unclear, but they do not seem to be essential. The authors suggest that they
may reflect a retroelement possibly capable of transposition (JBact174:2419[92]). A large random set of
E. coli strains have been examined for msDNA. Thirteen per cent were found to have these structures,
but this group did not correlate with any other taxonomic determinants, suggesting the msDNA might have
been recently acquired (JBact172:6175[90]). In contrast, msDNA is apparently ubiquitous in Myxococcus
xanthus (JBact173:5363[91]).
To further blur the distinctions between different types of genome parasites: a phage P4-like
cryptic prophage has been found (in an Ec hospital isolate) that encodes an msDNA and RT that are of
similar structure, though of different sequence. This region can apparently be transduced naturally from
one host to another through a helper phage. As above, there is no obvious function for this parasite on the
genome of the bacteria. Quite possibly it's a recent invention, in terms of Ec, and is spreading as a
harmless disease.
The variety of different reverse transcriptases found through nature, primarily in higher
eukaryotes, has prompted models for their evolution and proliferation (GenomeRes13:1975[03]). An
attractive model posits that they are a remnant of a more RNA-focused world that has been unable to
make much headway in bacterial genomes, in part because of the inherent instability of mRNAs in these
organisms. The model posits that the caps and tails of eukaryotic mRNAs created an ideal environment
for the proliferation of retrotransposons in particular, which explains their distribution: Only about 1/3 of
eubacterial species have a gene for reverse transcriptase, almost no archaea do, yet they are abundant in eukaryotic genomes, with copy numbers from 20 to ~10^6 in humans.
Protein introns. (See ARM56:263[02]) There are a number of cases in yeast, archaea, and bacteria of
protein splicing and these have been termed "inteins." In other words, these are bits of selfish DNA that
remove their coded region from the protein product and not from the mRNA. Hundreds of these have
been found in various organisms in all three domains of life and a phylogenetic relationship has been
suggested (TIG17:465[01]). Inteins undergo self-splicing, where the inner, excised protein is a site-specific "homing" endonuclease, causing unidirectional movement of the intein-coding sequence to an empty site
(reminiscent in this sense of some of the Group I introns above). A mini-rev on the possible significance
and evolution of inteins argues that they are ancient and originally might have served a purpose, but have
been selected against so that they tend to be retained only when they are difficult to remove, particularly
when found in critical genes (TIG17:465[01]). Self-splicing proteins have some conservation at the splice

15

points and a mechanism for this splicing has been proposed (PNAS91:11084[94]).
In yeast there is an example of a single 119-kd peptide that undergoes a protein splicing reaction,
fusing the N- and C-terminal portions to form a 69-kd ATPase subunit, and liberating the central 50-kd
peptide which is a sequence-specific DNA endonuclease (Nat357:301[92] & PNAS90:5379[93]).
The nature of genes in yeast. In many ways, yeast is not that different from prokaryotes in terms of gene
structure, since the significant majority of yeast genes lack introns. As a consequence, an ORF has a
similar meaning in that it is a sequence with a defined start and stop codon and a continuous reading
frame in between. One striking difference in yeast is with the nature of translation initiation, where a
complex recognition sequence like the Shine-Dalgarno is not employed. Instead, with rare exception,
yeast initiates translation at the first AUG starting from the 5' end of the mRNA.
Yeast, like other eukaryotes, does not have operons but rather each single gene product is
expressed from a single mRNA, though of course differential processing of mRNA has the possibility of
yielding different protein products from the same gene. Perhaps related to this, there are no cases (that I
am aware of) of overlapping genes in yeast. Though not a gene issue per se, the presence of a nucleus has an
important effect on the expression of genetic information. The requirement for newly synthesized mRNA
to move from the nucleus to the cytoplasm before translation implies a separation of transcription and
translation, with a variety of implications noted periodically in the text.
Yeasts certainly do have reverse transcriptase, and this is the basis for the proliferation of the
mobile genetic elements in these organisms, as discussed in LT7.
The genetic code. We are so comfortable with the nature of the genetic code that it is difficult to appreciate the challenge to understand its nature. It was a significant challenge to know its spacing (doublet? triplet?) and organization (was it overlapping, such that each nucleotide is read as a part of three different triplet codons?). These features had to be deciphered even as the actual correlation between specific codons and amino acids was being determined. See, for example, Crick (Nat192:1227[62]), or JMB38:367[68] for a typically thoughtful retrospective by Crick.
[Figure 1-4. The genetic code, where "CT" stands for chain termination, which refers to nonsense or stop codons. UAG is often referred to as "amber" and UAA as "ochre".]
Three methods were employed for breaking the genetic code: incorporation of radiolabeled amino acids upon the addition of redundant RNA oligos;
binding of tRNA; and sequence analysis of proteins from mutants and their revertants (PNAS47: 1588[61]
& PNAS48:441[62]). The first method looked for the ability of redundant oligonucleotides to stimulate the
synthesis of peptides in crude extracts, detected by the ability of TCA to precipitate the peptide polymers.
An AUAUAUAU RNA oligo should lead to incorporation of the amino acids encoded only by AUA and UAU, for example, which meant that only labeled Ile and Tyr would be precipitated. The second method - binding of
tRNAs to ribosomes - asked which three-base RNA trimers allowed specific tRNAs to bind stably to
ribosomes. The binding (and identity) of the tRNAs was detected by the fact that only one set of isoaccepting tRNAs had been charged in crude extract with a specific and appropriate amino acid. In such
an analysis, the UUU trimer allowed the binding of only labeled phenylalanine (attached to tRNA) to the
ribosomes. Finally, the reversion analysis started with known stop codons in a protein whose normal
activity was known and whose peptide sequence had been determined. The mutations were defined as
stop codons by the fact that they produced truncated peptides. These mutants were reverted to an
approximately wild-type phenotype and then the protein product from each revertant was purified and
sequenced. The logic was that any amino acid at the position of the former stop codon must be encoded
by a codon only a single base different from that stop codon. Obviously the same game could be
played with missense mutations as well.
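The logic of the first method is easy to restate in a few lines of Python; the repeat units here are chosen only for illustration:

def triplets_in_repeat(unit, copies=60):
    """All triplets a ribosome could read, in any frame, from a repeating
    oligo; only the corresponding amino acids can appear in the product."""
    message = unit * copies
    return {message[i:i + 3] for i in range(len(message) - 2)}

print(triplets_in_repeat("AU"))   # {'AUA', 'UAU'} -> only Ile and Tyr label
print(triplets_in_repeat("AAG"))  # {'AAG', 'AGA', 'GAA'} -> frame-dependent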
General rules of the genetic code and the exceptions. (MicroRev53:273[89] & Cell74:591[93]). ASM2 has a
chapter on limitations in translational accuracy (p.979), which covers many aspects of translation, and one
on tRNA synthetases (p.887). In general, ATG, TTG and GTG are start codons; TAA, TAG, and TGA are
stop codons, though a number of organisms use TGA as a sense codon, including Mycoplasma and
Spiroplasma sp. (JBact174:6471[92]).
Wobble rules and tRNAs. Wobble refers to the ability of certain bases in the 5' end of the anticodon to pair
with more than one different base in the 3' end of the codon. It depends in part on the rest of the tRNA,
since two tRNAs that read UCC, and possess identical anticodons, are not identical in their recognition of
other codons. In some cases, a hypermodified base next to the anticodon affects wobbling.
There is no defined set of tRNAs common to all organisms, though all organisms certainly have a
set appropriate to decode their own mRNAs. While Ec has 78 tRNAs with 48 different anticodons, the gram-positive Mycoplasma capricolum has only 28 anticodons, apparently relying heavily on wobble. Even
the more typical gram-positive Bs has only about 60 tRNAs with about 28 anticodons. The implications of
tRNA distribution for the non-randomness of codon choice are covered in the following section.
Yeast generally follows a similar pattern but with two exceptions. First, there are multiple copies of
the genes for virtually every tRNA species in the cytoplasm, whereas in prokaryotes, certain tRNAs are
represented by only a single gene. Second, the mitochondrial tRNAs are a bit odd. Remember that
mitochondria are the remnants of bacterial cells that became organelles in a cell with an archaeal-derived nucleus in the
evolution of the original eukaryotic cell. The mitochondria continue to have their own set of tRNAs and
tRNA synthetases, though only the tRNAs are actually encoded by the mitochondrial genome. Perhaps
because this genome has evolved to be rather small (75 kb), it has become conservative in its tRNAs,
using only 24 to encode all amino acids. It makes this work by using a rather more generous wobble rule in
which the last nucleotide of the codon is often irrelevant. These more casual tRNAs also lack one of the
typical arms of the standard cloverleaf structure.
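Crick's original wobble pairings can be written out explicitly. This Python sketch ignores the base-modification effects just mentioned, but it shows, for example, why a single Phe tRNA (anticodon GAA) reads both UUU and UUC:

# 5' anticodon base -> codon third-position bases it can pair with
WOBBLE = {"C": {"G"}, "A": {"U"}, "U": {"A", "G"},
          "G": {"C", "U"}, "I": {"U", "C", "A"}}  # I = inosine
COMP = {"A": "U", "U": "A", "G": "C", "C": "G"}

def codons_read(anticodon):
    """Codons decoded by an anticodon (given 5'->3'): positions 2 and 3
    pair strictly; only position 1 wobbles against codon position 3."""
    first = COMP[anticodon[2]]    # codon position 1
    second = COMP[anticodon[1]]   # codon position 2
    return {first + second + third for third in WOBBLE[anticodon[0]]}

print(sorted(codons_read("GAA")))  # ['UUC', 'UUU']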
Codon choice. (MicroRev54:198[90]) Even though all non-stop codons make sense, they are not
translated with equal speed and efficiency. Because different tRNAs are present at different levels in
each organism, those codons that match more abundant tRNAs are translated more rapidly, because
fewer tRNAs need to be tested at that codon to find a match. Not surprisingly, the genes within each
organism use codons to reflect this tRNA abundance (or vice versa), at least those genes for which
translation must be efficient. This has a variety of implications for both the organism and the
experimenter. It makes sense, for example, that highly translated genes should have optimal codon
choice, and this presumably means rapidly translated codons. This simplistic view has been supported by
the demonstration that codon usage determines translation rate in vivo, with common codons being more
rapidly translated. Similarly, changing from the wild-type sequence to a more optimal codon choice
yielded an increase in the accumulation of a cloned gene product from 4-14% of total cell protein
(NAR17:10191[89]). On the other hand, while it seems to be true that highly translated genes have optimal
codon usage, the majority of genes, for which proper expression can be achieved by a combination of
decent translation and proper transcription, can be far from optimal (Genet149:37[98]).
Finally, physiology plays a role as well, since relative levels of specific tRNAs fluctuate with
growth rate, with the implication that there will be different optimal codon usages under different growth
conditions. Because we almost religiously examine log-phase cells, most of our analyses are skewed and
at least some of the contradictions noted above may be explained by this (ASM2,2053[96]).
The precise positioning of rare codons is used regulatorily in the process of attenuation
(Bioessay24:700[02]). It is even possible that it might be important to slow translation at certain positions
in a protein's translation in order to allow time for proper folding of the intermediate. Optimal codon choice
in this case would certainly not be obvious until we had a very clear understanding of the entire process of
protein synthesis and maturation.
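The codon-usage tables underlying all of these arguments are simple tallies over in-frame coding sequence; a minimal sketch in Python:

from collections import Counter

def codon_usage(cds):
    """Per-codon frequencies for an in-frame coding sequence."""
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

Comparing such a table for a candidate ORF against the pooled table for known highly expressed genes is the codon-choice test for gene identification mentioned earlier in this LT.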
Codon context effects. Context effects exist where a given tRNA decodes two identical codons at different
rates depending on the surrounding bases in the mRNA. Unfortunately some of these analyses showing
context effects beg the question of whether or not the context causes poor translation or is merely not
selected against by a requirement for very good translation; as above, we must be careful in assuming
what is "optimal" for a given situation. At least some context effects may be due to tRNA-tRNA
interactions on the ribosome (PNAS86:3699 & 4397[89]), since the surrounding codons also dictate the
adjacent tRNAs.
The context effect can be sufficiently strong (as much as 10-fold) that a selection was devised
that demanded better translation of a given codon and at least some of the resulting mutants had base substitutions in an immediately adjacent base (Nat286:123[80]).


Effects of sequences on other coding properties. There is clearly some non-randomness in the use of
certain DNA sequences that are distinct from the GC content and codon choice, though it can be a bit
challenging to verify if the specific oligo sequence is under-represented or is actually embedded within a
larger under-represented oligo (see ASM2, p2060[96] for a discussion). At least some tetranucleotides
are rare in enterics because they lead to mutations. An example involves 5-MeC deamination at a dcm
site and its associated vsr repair system in Ec (Nat355:596[92]). Dcm methylates the second C in
CC(A/T)GG, but spontaneous deamination would make a T-G pair (which would confuse mismatch
repair). The cell uses the vsr system at several tetranucleotide sequences, and this system assumes that
T-G should be C-G. However, this assumption causes a problem whenever replication errors make TA
base pairs into TG. To avoid these problems, cells with this methylation system avoid this set of
tetranucleotides (ASM2,p2062[96]).
Large-scale organization of bacterial genomes. As more prokaryotic genomes have been sequenced, a
number of fairly remarkable asymmetries have been seen (Micro150:1609[04]). These include a
preference for the coding strand of a gene to be the one copied as the leading strand during replication (in
other words, the direction of transcription matches the direction of replication for the gene); the
preferential placement of pyrimidines, especially cytosine, on the lagging strand; the organization of
genes for rRNA near the origin and co-oriented with replication; and a general A+T richness near
the terminus of replication. These will now be briefly discussed in turn.
It has long been recognized that genes for rRNA are oriented away from the origin of replication
(i.e. the RNAP transcribing these runs in the same direction as the DNA polymerase complex) and
therefore use the leading strand as coding. It is generally assumed that this is because these are
extremely strong promoters in all cells, especially during periods of rapid growth, and that such orientation
would decrease the likelihood of DNA polymerase and RNA polymerase colliding. (It appears that such
collision causes transcription to terminate and replication to be seriously arrested. In contrast, the collision
between a rapidly moving DNA polymerase and a much slower RNA polymerase moving in the same
direction only slows replication and the transcribing polymerase apparently completes its product.) This
hypothesis would predict that highly expressed genes should typically be on the leading strand and poorly
expressed genes on the lagging strand, but the distribution, except for those for rRNA, is not as
compelling as this hypothesis would predict. Rocha (the author of the cited Micro article) suggests the
surprising alternative: perhaps the problem with colliding polymerases is not one for replication, but rather
because the prematurely terminated proteins are a problem. He notes that genes encoding essential
proteins are more biased in this regard than are other genes, irrespective of expression level. I confess to
a certain skepticism that there could be much accumulation of such aberrant proteins, but I don't have a
better hypothesis.
The issue of the preferential presence of pyrimidines on the lagging strand is explained by the fact that cytosine deamination is much higher in ss DNA than in ds DNA and that the leading strand is more exposed in the ss state than is the lagging strand, because of the nature of synthesis at the growing fork.
Interestingly, eukaryotes use Okazaki fragments that are 10 times smaller than in E. coli (which will
therefore reduce the amount of time the leading strand exists as ss) and also show much less replication
strand bias.
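This strand asymmetry is commonly visualized as GC skew, (G - C)/(G + C), computed in sliding windows along the genome; the sign typically flips at the origin and terminus of replication. A minimal sketch in Python, with arbitrary window and step sizes:

def gc_skew(genome, window=10000, step=1000):
    """(position, skew) pairs along one strand; skew = (G - C)/(G + C).
    A sign change, or the minimum of the cumulative skew, is commonly
    used to locate the replication origin in bacterial genomes."""
    out = []
    for i in range(0, len(genome) - window + 1, step):
        w = genome[i:i + window]
        g, c = w.count("G"), w.count("C")
        out.append((i, (g - c) / (g + c) if g + c else 0.0))
    return out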
On the issue of gene distribution with respect to the origin, there are several factors at play. One
is that in rapidly growing cells, there can be multiple copies of the genome near the origin, though typically
only one copy of the terminus region, since the cell divides when that region is replicated. One finds highly
expressed genes are statistically more common near the origin, but that might be because many of these
genes are critical for rapid growth, such as genes for rRNA and ribosomal proteins.
Finally, there are several explanations for the observation that many terminal regions of
chromosomes are AT-rich, relative to the rest of the genome. Some of these involve horizontal gene
transfer (and the argument is complicated and apparently not completely consistent with the data - see
the article if interested) and recombination near the terminus, but an interesting notion is the following.
ATP, and TTP to a lesser extent, is relatively abundant in the cell, so perhaps near the end of replication,
cell resources are low and it is easier to complete replication if you use the more abundant nucleotides.
This view suggests that replication is a separate and particularly resource-consuming process in the cell,
rather than something that cells are constantly doing, which is how we typically think of it in logarithmically
growing E. coli. In fact this is reasonable. Most prokaryotic cells in the real world are surviving, not
growing rapidly. They must make a conscious decision to begin replication when they feel they have the
resources to complete the process. The process does not seem to be as formalized as it is in the cell
cycle of yeast, but there must be some process occurring. By this logic, it is reasonable that resources might be a bit strained at the end of the process, consistent with the starting hypothesis here. (One wonders, however, why cells just don't interconvert the two purines and two pyrimidines, since they have enzymes for doing so, but maybe I am missing something here.)
One final note: there are apparently some prokaryotes that show little in the way of the strand
biases noted above, but at least some of these appear to replicate from more than one origin
(Cell116:25[04]; PNAS101:7046[04]), consistent with replication being a central driving force for the
biases.
Amino acid usage varies with GC content. I always assumed that GC content would affect specific codon
choice, but not amino acid choice. That is, a high GC-content organism would still use the same number
of, say, lysines, but would use AAG and not AAA to encode them. The point is simply that using
exclusively G/C or A/T at the third position of codons can dramatically affect GC content without altering
amino acid choice. I then looked at the data (http://www.kazusa.or.jp/codon/) and I was wrong: high GC
organisms are higher in GC content in all three codon positions (though it's true that the third position is
the most skewed with respect to GC content). A GC difference in the first two positions means there must
be a difference in amino acid choice, so I looked at about 20 organisms, mostly bacteria, with a range of
GC content from 31 to 72%. I haven't done serious statistics, but when one sums the frequency of usage
of different amino acids (from the codon usage tables from the portions that have been sequenced), there
are several fairly striking trends. Here are the most affected amino acids with the ratio of the percentage
use in high GC organisms over the percentage in low GC organisms: proline (6/3), arg (8/3), gly (9/5), ala
(13/6), phe (2.5/5), tyr (2/4), asn (2/7), ile (3/8), and lys (3/9). Amusingly, lys+arg is roughly constant and
neither asp nor glu show a great deal of variance. Note that some of these amino acids, notably gly and
pro, are of great structural importance to proteins.
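The comparison described above amounts to collapsing per-organism codon frequencies into amino acid frequencies and then lining those up against GC content. A sketch in Python; the code table is truncated here for space, and the input would be tables like those at the Kazusa site:

# {codon: amino acid}, truncated for illustration; a real table has 61 entries
GENCODE = {"GCA": "Ala", "GCC": "Ala", "GCG": "Ala", "GCU": "Ala",
           "AAA": "Lys", "AAG": "Lys", "CCG": "Pro", "AUA": "Ile"}

def aa_usage(codon_freqs):
    """Collapse a {codon: frequency} table into {amino acid: frequency}."""
    usage = {}
    for codon, freq in codon_freqs.items():
        aa = GENCODE.get(codon)
        if aa is not None:
            usage[aa] = usage.get(aa, 0.0) + freq
    return usage

Summing, e.g., ala+arg+gly+pro versus ile+lys+asn+phe+tyr across organisms ranked by GC% is the comparison that gives the ratios quoted above.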
Interestingly, this correlation was noted about 35 years ago by Noboru Sueoka, then at U. Illinois
(PNAS 47:1141[61]). This is before the code was even broken, so what he did was examine the amino
acid content of total cells and correlate that with GC%. He found a positive correlation between GC
content and ala, arg, gly and pro and a negative correlation with ile, lys, asp+asn, glu+gln (in their
analysis, these were indistinguishable), tyr and phe. (For those scoring at home, my positive was with the
same four and my negative was ile, lys, asn, tyr and phe). In 1961, however, this was a "code" question
and not a "protein" question. Note also that the two analyses have totally different biases: the codon
analysis does not consider whether or not ORFs are even expressed, while the Sueoka/protein analysis
only examines total protein accumulation - the fact that they arrive at a similar conclusion is remarkable
and persuasive that something important is going on here. It turns out that this correlation had been noted
before (see Genet153:339[99], which also has references to more recent Sueoka papers;
J.Mol.Evol36:201[93], MicroRev 56:229-264[92]), but remains broadly unrecognized by molecular
biologists. Also the emphasis has been on "how we got different GC%" and not on what the impacts are on proteins. Sueoka has continued to publish in this area, though typically in evolution journals, so it
has escaped the notice of most molecular biologists (J.Mol.Evol.34:95[92] & 37:137[93]). This effect was
noted, but not commented on (in terms of effects on proteins or a discussion of what the driving force
actually is) in MicroRev.56:229[92] by Osawa and Muto (see especially Table 4 and p. 238).
While this has interesting, but very unclear, evolutionary implications, it also has implications for
amino acid comparisons among organisms: I believe that one should weight the use of an amino acid if it is rare for its genome's GC content. In other words, it would seem likely that an Arg residue used by an AT-rich organism would be more likely to have functional significance than would a Lys. In a related matter,
people have found that thermophiles tend to be richer in Cys residues, presumably for stabilizing proteins
(TIG18:278[02]). The bottom line seems to be that the effect has not been seriously considered from a
protein standpoint. So I think the issue remains, can one rationalize a condition whereby it would be
advantageous to have proteins whose compositions are skewed as noted above?
An argument has been made based on the observation that obligate symbionts and parasites
tend to be AT-rich (TIG18:291[02]). The authors note that there is a higher metabolic cost to making GTP
and CTP relative to that for ATP and TTP and they argue that parasites are therefore selected to be a
lower cost to the host. I simply do not know if this is correct, but, as the authors themselves note, it hardly
explains why any microbe would be GC-rich. For example, free-living microbes would also have the same
energy cost for having excess G/Cs and it would seem that there should be a comparable selective
pressure there as well.
Changing the code. It should be enormously difficult to evolve a change in the amino acid specified by a
given codon, although it certainly has happened, because there are some organisms and organelles that
have exceptions to the universal code. The difficulty in such a change is that one must simultaneously
change the decoding system and the genes themselves. There are two general models for how this might be done: disuse, whereby an organism gradually stops using a codon for the old sense and then starts
using it for something else, and the ambiguity model, where a codon is transiently used for two different
amino acids for a while (Genet150:543[98]). To my mind, the former is rather more plausible, since the
latter requires that both amino acids are functional at all sites of the ambiguity.
The huge increase in genome sequencing data has suggested more cases of changes in the
code, especially in the genomes of eukaryotic organelles. Some organisms that appear to use altered
codes are the yeast genus Candida (in which CUG is Ser, not Leu), Mycoplasma spp. (where UGA is Trp,
not termination and CGG is termination, not Arg), Micrococcus spp. (where AGA is termination, not Arg
and AUA is termination, not Ile), Euplotes spp. (where UGA is Cys, not termination) and a set of lower euks (where UAA/G is Gln, not termination) (TIG17:20[01]).
Translation (MMBR66:460[02]).
First, while it is a bit odd to address translation before transcription, I will do it here because it
flows more naturally from the genetic code. I will treat transcription a bit later in this LT. Second, I can
hardly do justice to the full range of knowledge on the process of translation here, so I will simply
emphasize some important points.
Translation in prokaryotes is rather similar to that of eukaryotes except for the process of
initiation. In eukaryotes, the translation machinery tends to start at the first AUG that appears in the
message when reading from the 5' end (Bioessays25:1201[03]). Moreover, there are few cases where
ribosomes restart translation once it has terminated. The cases that exist are used for regulation in
specific mRNAs (in yeast, at least). In prokaryotes, the situation is rather more complicated, as many
mRNAs are translated into multiple peptides, so there needs to be a recognition system to identify those
different initiation regions. That recognition system involves two regions: the vicinity of the first codon, and
a sequence 5' of that (termed the Shine-Dalgarno sequence). The latter makes contact with a
complementary sequence of the ribosome, so that mRNAs with sequences closer to the consensus
complementary sequence tend to be translated better. The start codon sequence is important, though the
codon used can be AUG, GUG or UUG, even though they are all read by the initiator fMet tRNA
(apparently CUG is also used, at least in Bacillus - P. Babitzke).
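One can sketch the kind of scoring this implies: scan a short window 5' of a candidate start codon for a match to the anti-Shine-Dalgarno complement. The AGGAGG consensus and the spacing window below are the commonly quoted values, used here as assumptions rather than as a validated gene-finding rule:

SD = "AGGAGG"  # consensus complement of the 3' end of 16S rRNA

def sd_score(mrna, start_pos, spacing=(5, 13)):
    """Best SD-like match (0-6) in the window upstream of a start codon."""
    best = 0
    for gap in range(*spacing):
        lo = start_pos - gap - len(SD)
        if lo < 0:
            continue
        site = mrna[lo:lo + len(SD)]
        best = max(best, sum(a == b for a, b in zip(site, SD)))
    return best

mrna = "GGCUAACAGGAGGAAUUACCAUGGCUAAA"
print(mrna[20:23], sd_score(mrna, 20))  # AUG 6 -> a strong candidate start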
Once translation has begun, it proceeds along the mRNA by steps of one tRNA. You are
probably used to hearing that the ribosome proceeds by steps of three bases. However, because there
are tRNAs that shift the reading frame (either the translational frameshifting described in this LT or
frameshift suppressors mentioned in LT 15), the former definition is more correct. The nascent
polypeptide chain folds while it is still on the tRNAs and there are cases where that nascent peptide
possesses activities that affect translation. Eventually the nascent peptide protrudes from the ribosome.
Presumably, the nascent peptide is constantly undergoing conformational changes to move toward the
lowest free energy, but these movements are also constrained by kinetic parameters (not all
conformations can possibly be tested in the time-frame of translation). Ultimately many proteins are
assisted in achieving their active form by the action of chaperones, which help destabilize inappropriately
folded proteins, so they can properly refold.
Translation is remarkably error-prone (perhaps 10^-3 in terms of incorrect amino acids, and then
there are failures to properly elongate), such that few large proteins have the precise sequence encoded
by the gene. However, the nature of the genetic code, as well as the tolerance of proteins for similar
amino acids, means that the vast majority of these still have roughly normal function. In other words
mistranslation typically involves a tRNA whose anticodon is only a single base different from the proper
one and examination of the code shows that such substitutions often are of similar amino acids
(ASM2:979[96]).
Though the following factoid is obvious, it is nevertheless amazing and rarely commented on. It is
something of a wonder that translation is as fast as it is, since there are potentially a large number of
aminoacylated tRNAs that must be tested randomly to see which is appropriate for the next available
codon. It seems remarkable that these relatively bulky molecules can move in and out of the ribosome
with sufficient speed to allow proper incorporation at a rate of about 15 amino acids per second.
Recognize that 20 or more tRNAs typically need to be tested to find an appropriate one. A similar problem
exists with DNA polymerization and RNA transcription, of course, but in these cases there are only four
choices, not a minimum of about thirty as in the case of tRNAs. Here are some additional comments on
aspects of translation.
mRNA stability. Although translation is substantially similar in prokaryotes and eukaryotes, the substrate
of translation - the mRNA - has very different properties. The topic is covered in more detail in LT3 but
here is the brief summary. mRNAs in eukaryotes have substantial structure, including a 5' cap and a 3'
poly-A tail, that is critical for its stability, though there are important differences that depend on other properties as well. Human cells have an average mRNA half-life of 16 h; with a generation time of 12 h,
that yields a ratio of about 1.3. In contrast, Sc has an mRNA half-life of 23 min and a generation time of 90 min, for a ratio of 0.26. In the best-studied archaeon, the slow-growing Sulfolobus solfataricus, the half-life is 54 min and the generation time is 360 min, for a ratio of 0.15. Lastly, Ec has an average half-life of only 1 min and a generation time of 30 min, for a ratio of 0.03 (see Genome Res13:1975[03]). In a curious role reversal, 3' poly-A tails actually signal mRNA degradation in bacteria, whereas such tails stabilize eukaryotic mRNAs.
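The ratios quoted here are just half-life divided by generation time; for the record, in Python:

for name, half_life, gen_time in [("human", 16 * 60, 12 * 60), ("Sc", 23, 90),
                                  ("S. solfataricus", 54, 360), ("Ec", 1, 30)]:
    print(name, round(half_life / gen_time, 2))  # both values in minutes
# human 1.33, Sc 0.26, S. solfataricus 0.15, Ec 0.03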
Although organisms therefore differ in both the absolute and relative magnitude of their mRNA
stabilities, as well as in the specific mechanisms of mRNA processing, it remains true that mRNA stability
can be biologically important for the function of any gene. However, the very short average mRNA half-life in Ec correlates with the relative paucity of regulation at this level in enteric bacteria (if all mRNAs are pretty unstable, there's not much to be gained from modulating it a bit). Not surprisingly, the very long mRNA
half-lives in higher eukaryotes makes modulation of mRNA stability and regulated translation itself a very
common and powerful regulatory mechanism.
Second codon effects. The identity of the second codon of a gene (and sometimes its encoded aa) can
have a dramatic effect on accumulation of the protein product. In one example of a cloned gene where the
second codon was varied, there were effects on mRNA stability, translational efficiency, and product
stability (Gene98:217[91]). I believe it is possible that the first two effects are functionally linked: an mRNA
on which translation initiation is efficient tends to be more stable because endonucleases have less
opportunity to attack it. These effects are context-specific (that is, they depend on the surrounding mRNA
sequence), since changes in the second codon of lacZ showed a different pattern (EMBOJ6:2489[87]).
Translational coupling. (ASM2:906[96]) Translational coupling is defined as a situation in which a
downstream gene is poorly translated unless the immediately upstream gene in the same mRNA is
translated. Two general classes for the mechanism of translational coupling exist: in one, the initiation codon is hidden by 2° structure involving upstream sequences, so that by translating the upstream gene, the 2° structure is disrupted and exposes the initiation codon. The other general class is where a poor
initiation signal is poorly utilized unless presented with an already translating ribosome. The mechanism
here is that initiation factor IF3 functions to discourage initiation at poor start signals, but this protein falls
off of ribosomes once elongation has begun. As a consequence, when a ribosome has just finished
translating a gene, there is a brief window before it dissociates into subunits (when it will again require IF3
to initiate) when it lacks IF3 and is more forgiving about initiating at poor starts (MGG218:137[89]). In
many cases, at least with the first mechanism, there is an overlap of one or two bases in the two genes.
Thus the ribosome that just finished the previous gene will have an excellent chance of initiating at the
start of the second gene before falling off, because it will necessarily be in the immediate vicinity.
Unfortunately, such an overlap in sequence (or, worse, apparent sequence, because we often don't
know exactly where translation starts) has come to be considered de facto evidence for coupling in the
absence of experimental analysis.
Aberrant translational initiation. When start codons like UUG and GUG are used, the Shine-Dalgarno
sequence becomes particularly critical (JMB204:1045[88], MGG218:137[89]). As in many cases of
complex interactions, there is an optimal consensus of different factors that is rarely matched exactly; the
strength of a given start region is a function of how well the overall set of factors matches. However, some
mRNAs start at the AUG, i.e., without an S-D sequence or any other leader, and these mRNAs are often well translated! In these cases, it appears that the first start codon is used preferentially. In fact, leaderless mRNAs apparently bind 70S ribosomes or 30S subunits without IF3, thus bypassing the normal translation initiation pathway (JBact184:6730[02]). Possibly the role of the 5' information in an mRNA is more to prevent aberrant translation than to serve as a prerequisite for proper translation. Such leaderless mRNAs are particularly common in archaea, especially in the first genes of operons (JMB309:347[01]; EnvMic7:47[05]).
Aberrant translational elongation. (ASM2,979[96]) Even correct elongation is not the rule: under normal
circumstances, only about 3/4 of the ribosomes complete the translation of lacZ and about 1/3 of the
failures are due to premature RNA polymerase termination (JMB215:511[90]). Translational accuracy is
only gradually being appreciated as a critical feature in biology. At issue is the fact that translation errors
are much higher than those of transcription and replication (not surprising, given the relative impact of
errors at each level), such that an average protein of 500 aa has a 22% chance of having an error!
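The 22% figure implies a per-residue error rate that is easy to back-calculate (a sketch assuming errors are independent across codons, which is a simplification):

    # If an average 500-aa protein has a 22% chance of containing at least one
    # error, and errors are independent, the per-residue rate eps satisfies
    # 1 - (1 - eps)**500 = 0.22.
    p_error, length = 0.22, 500
    eps = 1 - (1 - p_error) ** (1 / length)
    print(f"implied per-residue error rate ~ {eps:.1e}")  # ~5.0e-04

That rate is several orders of magnitude above typical replication error rates, consistent with the relative-impact argument above.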
(For a recent book on all sorts of examples of non-standard translation, including those in the
following paragraphs, see Recoding: Expansion of Decoding Rules Enriches Gene Expression, Nucleic
Acids and Molecular Biology, eds Atkins and Gesteland.)


Natural frameshifting: (rev in ARG21,p82[87]; ASM2,909 & 916[96]) (This does not refer to a frameshift
mutation, but rather the change in reading frame by a translating ribosome.) There are certain tRNAs that
are naturally "shifty", or prone to slide over a base, causing shifts in the reading frame (remember that the
ribosome reading frame is set by tRNAs). This frameshifting can take place in either the + or - direction,
depending on the specific context and the precise rules for this are being sorted out (PNAS90:5409[93]).
As a general rule, frameshifting occurs when ribosomes pause, which can be the result of a number of
circumstances. The length of the pause is often critical: increasing the dosage of the right tRNA (for the codon the ribosome is sitting at when it actually shifts) decreases frameshifting. Conversely, Str^r ribosomes, which are hyperaccurate and slower, increase the degree of frameshifting, presumably because of longer pauses (PNAS90:2315[93]).
Most translational frameshifting is clearly deleterious, since it puts the ribosome into an incorrect
reading frame and the normal protein product cannot be made. However, in some cases the cell takes
advantage of frameshifting for regulatory purposes and the first example concerns RF2: Remember that
there are no tRNAs that recognize stop signals in normal cells. These codons are actually recognized by
proteins: RF1 recognizes and terminates at UAG and UAA signals, while RF2 terminates at UGA and
UAA signals. RF2 is encoded by prfB and its regulation involves natural frameshifting. There is a UGA
codon in the normal reading frame at codon 26, so that a translating ribosome that reaches that codon
terminates translation if there are sufficient levels of RF2 to detect the stop codon. As a consequence no
additional RF2 is made if there is already a sufficient level. However, when the level of RF2 falls, then the
UGA codon is not rapidly recognized and the ribosome dawdles a bit. The immediately preceding codon
is one at which translational frameshifting can take place, so this dawdling leads to a +1 frameshifting and
the ribosome now reads the rest of the gene in this new reading frame, which actually encodes the normal
RF2 protein. In other words, frameshifting is required for proper translation and, quite appropriately, the
regulation is based on the levels of RF2 activity in the cell as judged by its ability to terminate at UGA
(JMB203:75[88]).
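The logic of this feedback is easy to caricature in code (a toy model only: the functional form and rate constants below are invented for illustration, not measured values):

    # Toy model of RF2 autoregulation at the internal UGA of prfB.
    def frameshift_fraction(rf2_level, k=1.0):
        """Fraction of ribosomes that shift +1 and so make full-length RF2.
        High RF2 -> rapid termination at the internal UGA -> little frameshifting."""
        return 1.0 / (1.0 + k * rf2_level)

    rf2 = 0.1  # start with little RF2
    for _ in range(50):
        synthesis = frameshift_fraction(rf2)  # new RF2 made per time step
        rf2 += 0.5 * (synthesis - 0.3 * rf2)  # production minus dilution/decay
    print(f"RF2 settles near {rf2:.2f}")

The point is simply that the termination-versus-frameshift competition yields a stable steady state: any excess RF2 shuts off its own synthesis.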
In another example, in the MGE IS1 there is a translational frameshift at AAAAAC that is
necessary in order to complete translation of the functional transposase (insA-insB) (PNAS86: 4609[89]),
and this "general" case exists in other MGEs. InsA (the non-elongated product) is an inhibitor of
transposition. Just to complicate things, there is yet a third product of the insA-insB region that starts at a
start codon within insA. This protein is apparently yet another inhibitor of transposition (JMB240:52[94]).
These are apparently the only two IS1 genes involved in transposition (JBact178:2420[96]). Apparently
this is an effort by the MGE to limit the level of functional transposase to extremely low levels. It is difficult
to do this by simply maintaining very low transcription, so the element is in a sense regulating (lowering)
transposase activity by automatically producing an inhibitor. I do not believe that the decision to frameshift
(or not) is itself regulated.
In coronaviruses, frameshifting takes place upstream of a pseudoknot in the RNA, which leads to pausing by the ribosome (Cell57:537[89] & 55:447[88]). In other cases, frameshifting occurs immediately upstream of rare codons: a pair of rare AGA codons causes frameshifts 50% of the time in an overexpressed gene (NAR18:5031[90]), and frameshifting in yeast Ty occurs upstream of the rare AGG codon. A curious case exists in rhizobia, where two ORFs, each of which appears to be capable of
producing a functionally independent gene product, are fused by frameshifting and are involved in
formation of a small molecule (PNAS90:2641[93]).
Translational jumping: There are a few cases where ribosomes seem to skip over some number of
codons and then continue translation in the same reading frame (though there is no reason a priori that it
need be in the same reading frame). The most remarkable example of jumping ribosomes is in T4 (see
ARB69:343[00] & ARG30:507[96]). In this case, ribosomes skip exactly 20 codons in the mRNA, and
there is an identical codon at either end of this movement, suggesting that the ribosome moves between
identical sites (JBact172:630[89]). Importantly, the nascent peptide translated 5' of the hop site is critical
for the effect, implying an activity for this nascent peptide when bound to the ribosome (this example, and
other cases of cis-acting peptides are discussed in MicroRev60:366[96]). Similar, albeit less dramatic, cases of jumping have been reported for Ec carA, where 4 codons are skipped, and for a trpR::lacZ hybrid; these, and related items, are discussed in ASM2, 917[96]. As noted previously, under introns (LT1), it has been
suggested that these might be cases of old protein introns that have been substantially deleted and are
actually spliced out by an unknown factor in trans.
Aberrant translational termination. UGA signals, at least in the enteric bacteria, have a general problem
with leakiness. It seems likely that this is the result of wild-type tRNA that reads the codon with low


efficiency at some sites. As an aside, recognize that most UGA suppressors will probably wobble to read
UGG as well. Since UGG is the codon for tryptophan, that means that a UGA suppressor that inserts an
amino acid other than tryptophan will cause some low-level errors (i.e. "something else in place of trp") at
most UGG codons.
There are also two remarkable stories about what have been termed the 21st and 22nd amino
acids. The first of these refers to selenocysteine insertion at very specific UGA codons (ARG21:p79[87]).
Selenocysteine has been shown to be cotranslationally incorporated at very particular UGA codons
anaerobically in Ec. This involves not only an odd "serine" tRNA, selC (NAR17:7159[89]), but also an
elongation factor that is specific for this translation event (Nat324:453[89]). This factor, SelB, binds to the
loop region of a hairpin 3' to the UGA, forming a complex with selenocysteine-tRNA and allowing
translation of this codon in this surprising way (PNAS90:4181[93]). The conversion of serine to
selenocysteine is performed by the selA gene product after the former is charged to the appropriate tRNA
(JBC266:6318[91] & MolMicro5:515[91] for a review). A similar phenomenon has been seen with certain
UAG codons in archaea and possibly some gram-positive bacteria, at which pyrrolysine is inserted
(Science296:1409[02]; MolMic48:631[03]), though many of the specific details governing insertion of
pyrrolysine are different from those for selenocysteine. Another example of a stop codon with novel properties involves the inherent leakiness of many UGA codons. The phage Qβ is a small RNA phage that has three proteins in its virion and a replicase factor that does not end up in the virion. The problem is that these proteins require more coding capacity than the small Qβ genome appears to contain. It so happens that the stop signal at the end of the gene for the coat protein ends in UGA, but the ribosomes translate through it (a wild-type tRNA misreads it occasionally) about 10% of the time until the next translational stop codon is reached. This longer protein is also used in the virion and is actually critical for phage function. Essentially the phage is taking advantage of UGA leakiness to create two related but different proteins from the same coding region (ARG21:80[87]). UGA readthrough is addressed in ASM2, 912[96].
Ec has a non-random preference for stop signals: UAA (68%), UGA[U/A] (20%), UGA[G/C] (8%),
and UAG (8%). UGAC tends to show up at "funny" stops like that in prfB and at selenocys signals; UAGC
shows up in the T4 hops (see above). Nonsense signals (mutationally caused translational stops) are not
identical to stop signals (i.e. the signals at the ends of normal genes) as evidenced by the fact that
efficient amber suppressors are not lethal (though ochre suppressors are always inefficient). It's been
noted that RF1 and 2 show context effects at UAA signals, perhaps making them better at distinguishing
correct termination codons from nonsense codons created by mutation (JBact170:4714[88] &
Bioc31:2443[92]). This actually explains why nonsense suppressor tRNAs are not lethal: at nonsense
mutations created by mutation, the context is poor for RF1 and 2, so the tRNA suppressors win the
competition with them some of the time. However, at normal termination signals, the context is very
attractive for the proteins, so they out-compete the tRNA suppressors very efficiently.
The actual site(s) of translation. The actual translation of the language of nucleic acid into that of amino
acid takes place at the tRNA:synthetase interaction, because this is the interaction that recognizes a
specific nucleic acid sequence and identifies the amino acid counterpart. This recognition of tRNAs is
surprisingly complicated. Because the coding properties of tRNAs are obviously reflected in the
anticodons, one would naturally assume that the tRNA synthetases recognize this portion in their
discrimination, but that is not the case. An additional complication is that while all tRNAs have roughly
similar shapes, they all have rather different sequences and, more significantly, the set of tRNAs that
insert the same amino acid, and are therefore recognized by the identical tRNA synthetase, are hardly
more similar to each other than they are to other tRNAs. In fact, synthetases inserting different amino
acids actually examine different parts of the cognate tRNAs, which accounts for the fact that the
recognition sites are not so obvious (see mini-rev in PNAS90:8763[93]). It appears that a portion of the
specificity in the tRNA-synthetase interaction involves the ability of the latter to perturb the structure of the
former in a precise way. (ARB69:617[00], Sci286:1893[99]).
Termination in the absence of a stop codon. Imagine a translating ribosome that reaches the end of an
mRNA without hitting a stop codon. This might happen because of errors in synthesizing the mRNA or,
more likely, because of errors in translation itself, such as ending up in an incorrect reading frame.
Apparently, this is not a rare event and it has been estimated that 13,000 such events occur per
generation in E. coli (MolMicro58:456[05]). The ribosome does not terminate because it cannot call in the
normal translation termination factors. Instead, the stalled ribosome is rescued by a remarkable RNA-protein complex found in all studied bacteria: tmRNA (for transfer-messenger RNA) and the SmpB protein
(CurrOpMicro10:169[07]). This abundant RNA looks like an alanine tRNA and binds only to such stalled
ribosomes. The binding is stabilized by SmpB, since normal codon-anticodon interactions are absent. The
nascent peptide is transferred to the tmRNA and translation continues using the 3' end of the tmRNA to a


stop codon, which produces a signal tag at the C terminus of the protein. The tagged protein is then a
target for proteolysis.
Transcription.
As with translation, this is a vast area of research that is covered in many excellent molecular
biology texts and I will only make a few general comments, followed by some specific details below. There
is a bit more relevant text in LT3 on the regulation of transcription.
The recognition of transcription initiation sites in prokaryotes is relatively simple compared to that
in eukaryotes, since the former involves either RNA polymerase by itself or with the assistance of a single
other protein (though of course there are some more complicated exceptions). In contrast, eukaryotes
often require several protein factors in addition to RNA polymerase to identify a promoter. Eukaryotes also
modulate histone binding in such a way that promoter regions are accessible relative to coding regions. I
assume that the greater complexity in eukaryotes is because they contain rather more DNA than do most
prokaryotes and therefore the modest sequences used in prokaryotes might not be sufficient to provide
the desired degree of specificity (though, in truth, the yeast genome is little larger than some prokaryotic
ones). Alternatively, this might reflect the problems caused by the greater structure of eukaryotic
chromatin, including the presence of so many more histone proteins.
In the very simplest case, a promoter sequence is recognized by the σ subunit of RNA polymerase, which binds to two closely spaced regions on the DNA. For most forms of σ, these are termed the -10 and -35 regions, which reflects their position relative to the actual start site of transcription. Initially RNA polymerase forms a closed complex with the DNA, but then separates the DNA strands (a form called the open complex, and the transition between the two is called isomerization) and begins transcription. This transcription elongation is not steady, but involves occasional pauses, which can have importance in the process. After a bit of transcription, the σ falls off and becomes associated with another RNA polymerase. Depending on the organism, there are either a few or several dozen different σ factors. Within a single cell, these different σ factors presumably have different DNA-binding sequences, but the recognition sequence criteria of many σ subunits are not yet apparent. A few σ factors are present in cells at fairly stable concentrations, but others are regulated in terms of level or function by post-translational interactions.
The regulation of σ function provides some degree of transcriptional regulation, but this is typically not very specific regulation, since there are many genes and relatively few σ factors. Instead, expression from many promoters is regulated by additional proteins that have their own binding sequences in the general vicinity. As a simplified example, there are repressors that interfere with RNA polymerase binding under some conditions. In such cases, the promoter itself has reasonably good affinity for RNA polymerase/σ and so regulation is effected by the presence or absence of the repressor. The repressor's DNA-binding activity is itself regulated either positively or negatively through allostery. Alternatively, there are many cases of promoters with a poor affinity for RNA polymerase, but in which additional proteins bind near the promoter and stimulate transcription through protein-protein interaction with RNA polymerase. In a sense, these proteins provide the necessary energetics to compensate for the relatively poor affinity for the specific DNA sequence. As with repressors, the ability of activators to bind
DNA can be positively or negatively regulated by allostery. In many of these protein:DNA interactions,
there are additional DNA-binding proteins, such as FIS and IHF, that have no more than modest
sequence specificity, yet can interact with other proteins to perturb DNA and DNA:protein structure. The
following are some other transcription issues of particular relevance to genetic analysis.
Proofreading by RNA polymerase has been shown both by the isolation of an error-prone RNA
polymerase (rpoB) that is faster in transcription (EMBOJ8:3153[89]) and by direct in vitro evidence of
reversibility of nucleotide incorporation by normal RNA polymerase (Cell93:627[98], PNAS93:13677[96]).
Transcription vs. replication. What happens when a replication fork collides with a transcribing RNAP?
Remarkably, some reports with purified systems argued that the replication complex can move through a
transcription complex (moving in the same direction; replication is about 10x faster) without displacing it,
thus allowing transcription to continue (Nat366:33[93] & PNAS91:10660[94]). But other data argued that
the replisome dissociates (MGG252:398[96]). More recent in vitro analyses argue that the replisome stalls
and then resumes elongation after displacing RNAP (Sci327:590[10]). This report also identifies the Mfd
protein as an important part of replisome restart. Realize that most, but not all, abundantly expressed
genes in prokaryotes are transcribed in the same direction as replication, presumably to reduce the
frequency of polymerase interactions.


Polarity. Historically, polarity was defined as any reduction of expression of a downstream gene caused by a mutation in an upstream gene (ASM2,792 & 822[96]). Obviously polarity of this type has no significance in eukaryotes, where operons probably do not exist. The effect of polarity sometimes results from mutations, typically insertions, that introduce transcription termination signals and therefore stop transcription directly. Alternatively, mutations that cause translation termination signals can lead indirectly to reduced transcription of downstream regions by the mechanisms described below. The effect of insertions can be complicated because many have their own promoters that cause transcription within the element that can enter the surrounding chromosomal region. This transcription may or may not be higher than the natural transcription of that region, but it certainly will not be identically regulated. As a further complication, the detection of such transcription depends on the specific site of integration in the genome for the following reason. As you will read immediately below, Rho tends to cause transcription termination when an mRNA is not being translated. So for transcription from an element into the adjacent region, sometimes it will happen to have a site where ribosomes bind and translate (and therefore protect the transcribing RNA polymerase), but often no such translation site is present and this transcription is terminated. There have also been cases of anti-polarity, where the reduction in expression of one gene affects the expression of upstream genes (due to mRNA stability or regulation?).

Figure 1-5. The top panel shows the organization of a two-gene operon, with the positions of relevant signal sites noted. The bottom panel shows a model of how the termination of translation, caused by a premature nonsense signal, can allow Rho to bind the mRNA and have a possibility of catching up with the RNAP before it reaches the next translational start site.
Transcriptional effects of translational stops. Nonsense signals display a gradient of polar effects that
increase with increasing distance to the next translational start signal. Such polarity requires the presence
of a functional Rho protein, which is normally involved in transcription termination at the end of operons. A
model explaining the mechanism by which stop signals created by mutation cause polarity is the following
(Fig. 1-5): Rho factor causes RNA polymerase to terminate transcription downstream from a site of
translation termination if the following conditions exist: (i) A region of the mRNA, lacking ribosomes
because of the upstream nonsense signal, contains a site that activates the ATPase activity of Rho,
probably by binding Rho. (ii) At some distance downstream (deliberately vague, but probably not
important) from this site, the activated Rho catches up with RNA polymerase and causes it to terminate
transcription.
The biochemistry of Rho action is explained fairly well in the ASM2 (pp832ff and 1278ff) and also
see Cell114:135 & 157[03]. Rho is an RNA helicase, though it can also act as an RNA-DNA helicase, and
both of these activities are ATP-dependent. While not completely clear, it would appear that a substantial
portion of Rho's role in termination is by the destabilization of the RNA-DNA hybrid at the transcription
complex.
There are clearly sites that stimulate Rho action when they are transcribed, but not translated.
These sites (termed "rut" for Rho utilization) have C residues spaced at ~13-nt intervals, intriguing


because biochemical characterization suggests that Rho complexes with about 13 nucleotides per protein
subunit of the hexamer. The 3' end of the terminated RNA tends to have a region that resembles the
consensus site (boxA) for the NusA factor, as well as regions of secondary structure (JBact171:4472[89])
(but is this where the termination occurred or where the RNA was processed back by exonucleases?).
The degree of polarity of translational stops depends on the number and effectiveness of these Rho sites,
as well as the presence of a functional rho allele.
Rho does not disrupt transcription of structural RNAs, in part because of the presence of structure
within these, but there is at least one additional important feature. rRNA operons have sequences near
the start site that establish some sort of anti-termination mechanism. While the precise mechanism is not
clear, there are certainly other described cases of anti-termination known (see ASM2, p834). An
additional possible mechanism to avoid Rho binding (and therefore termination) in transcribing these
RNAs is that these RNAs bind rRNA proteins and this might affect Rho function.
Rho-mediated polarity requires that there be no translation initiation, proper or improper, between the rut site and the next translation initiation signal. If there is, then ribosomes will bind there and prevent Rho from catching the RNA polymerase - for this property, it really doesn't matter if the ribosomes make a product or not, because it is their presence on the mRNA that has the effect. It seems that rut sites occur fairly frequently, but still may be the limiting factors in determining where polarity occurs.
The effect of polarity due to this mechanism can range from very strong to negligible (0-90%
reduction in downstream mRNA synthesis). The strength of the effect is a function of a set of variables:
the length and potential structure of open mRNA between the rut site and the next translation initiation signal, and the quality of both of these sequences; but to a first approximation, nonsense signals near the
5' end of a gene are strongly polar, while those near the 3' end display very low polarity.
Having told you the party line on this, some very recent results (2008) from the Landick lab
suggest that Rho might actually be associated with RNAP a large fraction of the time, which would call
into question some key aspects of the above model if the observation is confirmed.
Natural polarity is the phenomenon whereby the downstream portions of an operon are transcribed less
than the upstream portions, apparently because RNA polymerase terminates. This is actually surprisingly
common, happening up to half the time in longer transcripts, though the frequency of termination within a
specific mRNA depends on several factors such as the presence of rut sites (see above) and the
frequency of translation. That is, if an mRNA is translated frequently, then the rut sites are not available
for Rho, but anything that decreases translation (such as translational repression) causes an increased
likelihood of Rho-mediated transcription termination. However, the exact mechanism by which this
operates remains unclear because it seems that translation must be dramatically reduced for a strong
polarity effect to be seen (JMB385:733[09]).
Homologous recombination. I will do little more than summarize the main points and note some
curiosities. See ARG35:243[01] & ARG29:509[95] for reviews. It is now generally recognized that the
homologous recombination system's primary role in the cell is that of a DNA repair system. This would go
some way toward explaining the very high energy costs (~100 ATPs/base pair of heteroduplex formed) of
the ATP-hydrolyzing RecA:DNA filaments.
As will be explained again in LT15, the term recombination is used somewhat differently in
eukaryotic genetics. It refers to reassortment of genes, which might well be on different chromosomes,
from the organization that existed in either parent. In other words, the recombination in this sense might
mean simply that a progeny cell has the copy of chromosome 1 from one parent and the copy of
chromosome 2 from the other. Importantly, this use of recombination does not necessarily have anything
to do with breakage and rejoining of DNA as is the exclusive sense of the term in prokaryotes. Instead
the term crossing over is often used with eukaryotes to refer to the action of breakage and rejoining of
DNA.
Mechanism of homologous recombination. There are three main steps in homologous recombination:
Presynapsis, synapsis and post-synapsis. In presynapsis, the 3' single-strand tail preferred by RecA must
be generated. This end might appear naturally, as in the case of the distal end of conjugationally
transferred DNA, or through the action of one of three exonucleases (RecBCD, RecJ or RecE) acting on a
double-stranded duplex end (though it is less clear to me what the source of such breaks would be,
ASM2:2248). A variety of proteins appear to be involved in making the single-stranded DNA available for
RecA, and the formation of the RecA-single-stranded DNA filament is the product of this step.
Synapsis is the generation of a four-stranded Holliday junction based on sequence homology.
The precise mechanism by which RecA-bound DNA is tested for sequence similarity is unclear, but it
must involve non-Watson-Crick base pairing because the helix itself is not disrupted. Finally, during post-synapsis, RuvAB serves as a unidirectional driver of branch migration. RuvA binds to the DNA junction
and the asymmetric association of RuvB to two strands of the duplex at the junction provides the
directionality. The junction is resolved by the RuvC endonuclease, though RecG appears to be an
alternate pathway to RuvABC for resolving such junctions.
While RecA is required for the majority of homologous recombination events, there is still a
detectable level of homology-based recombination in recA strains in specific situations. Typically, this
recombination takes place at shorter regions of sequence identity that are positioned relatively closely
together on the chromosome (< 1 kb)(see below). It is likely that the mechanism involves slippage of
either the template or product strands during replication.
It used to be felt that there were two or perhaps three major pathways of homologous
recombination in the cell (all dependent on RecA, but differing in the steps following synapsis). One
pathway was termed the RecBC pathway and the other was termed the RecF. It is now clear that there
are more than 30 proteins that have some effect on the frequency and nature of homologous
recombination, and it is less clear that the relatively simple view of two or three paths is correct.
Certainly RecBCD is an element of the major pathway, apparently being involved in about 80% of the homologous recombination seen in the cell. Its primary role seems to be to generate the 3' single-stranded DNA: it apparently tracks along the DNA, unwinding and degrading the 3'-ended strand. At Chi sites, a specific 8-bp sequence that occurs on average about every 5 kb in Ec, the protein can terminate its nuclease activity, possibly by release of the RecD subunit. RecBC then continues to unwind, but not degrade, DNA and this new 3' end serves as the substrate for RecA to compare with double-stranded DNA regions. It has been shown in vitro that Chi sites can protect linear DNA from ExoV (RecBC) degradation. Chi sites are highly asymmetric in their orientation, with the significant majority oriented with their 3' end toward the origin of replication, and it has been argued that a major reason for the system is to take care of the damage that occurs when the replication fork runs into a gapped region, generating a double-stranded break. Models of Chi action are in Genet141:805[95] and ASM2:767[96]. RecG represents an independent pathway for branch migration, but in the reverse direction of the RecA-mediated strand exchange. One result of this is that it allows gene conversion without crossing-over.
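Since Ec Chi is a defined 8-bp sequence (5'-GCTGGTGG-3'), its over-representation is easy to appreciate: a random 8-mer should occur only about once per 4^8 = 65,536 bp, yet Chi appears about every 5 kb. A minimal Python scan for Chi on both strands might look like this (the example sequence is invented):

    # Locate the Ec Chi sequence, and its orientation, in a DNA string.
    CHI = "GCTGGTGG"
    COMP = str.maketrans("ACGT", "TGCA")

    def chi_sites(seq):
        """Return (position, strand) for each Chi occurrence in seq."""
        rc = seq.translate(COMP)[::-1]  # reverse complement
        hits = [(i, "+") for i in range(len(seq) - 7) if seq[i:i+8] == CHI]
        hits += [(len(seq) - 8 - i, "-") for i in range(len(rc) - 7) if rc[i:i+8] == CHI]
        return sorted(hits)

    print(chi_sites("AAGCTGGTGGTTTTCCACCAGCTT"))  # [(2, '+'), (14, '-')]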
Other aspects of homologous recombination. Homologous recombination in the chromosome is also not random: there appear to be recombination hot spots near the ter region (terminus of replication) of the Ec chromosome, as if recombination was a requirement in that process. The effects are seen over about 10% of the chromosome (JBact173:5096[91]). Finally, it is also possible that potential secondary structures in DNA, like cruciforms, might affect recombination directly (Genet134:409[93]).
An interesting question is the substrate requirement for RecA function: how similar do regions
need to be in order to serve as a substrate for homologous recombination? The frequency of homologous
recombination is directly proportional to size from (at least) 164-kb down to several kb. RecA-dependent
recombination becomes very inefficient below 20 bp. Interestingly, the spacing between two identical
regions is also important: if two 20-bp repeats are more than 300 bp apart, then all recombination that is
seen will be RecA-dependent. In contrast, if these regions are closer than 300 bp, then there will also be
RecA-independent recombination, presumably the result of slippage at the growing fork of replication (see
ASM2,2012[96]).
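The length and spacing rules just quoted can be condensed into a rough decision rule (a sketch of the ~20-bp and ~300-bp thresholds above, not a mechanistic model):

    # Crude classification of recombination between two identical repeats.
    def likely_recombination(repeat_bp, separation_bp):
        if repeat_bp < 20:
            return "RecA-dependent recombination very inefficient"
        if separation_bp > 300:
            return "RecA-dependent only"
        return "RecA-dependent plus RecA-independent (replication slippage)"

    print(likely_recombination(25, 1000))  # RecA-dependent only
    print(likely_recombination(25, 100))   # both pathways expected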
There are a number of indications in the literature that recombination, and particularly
homologous recombination leading to deletion formation, is dramatically enhanced by replication
(Genet154:971[00] and refs therein). In the published case, the authors caused an approximately 5-fold increase in the replication (θ-type) of a large plasmid with a number of repeated regions, and this led to a 10^3-fold increase in deletion formation between these repeats. While the basis for the effect is unclear,
there are apparently some similarities with hyper-recombination in the vicinity of the replication terminus
of E. coli.
Non-homologous recombination. RecA is certainly central to the great majority of homologous
recombination events. As a consequence, one often sees RecA-dependent recombination used
interchangeably with homology-based recombination. This is not quite correct, as there appear to be
other pathways that perform recombination based on poor identity. Still, these are much less common than the Rec-based recombination and, for most purposes, homologous recombination can be considered almost non-existent in recA strains. However, it is also true that there are some recombinational events
that clearly do not involve homology, based on the fact that the actual sequences that participate in the
event have been sequenced and show no similarity to each other at all. This is termed non-homologous,
illegitimate or (sometimes, but not quite correctly) RecA-independent recombination. This has been
known for many years, but the molecular basis remains obscure, largely because it is very difficult to
identify an activity that is, by definition, at a very low level and that does not display obvious patterns of


action. Also remember that transposition is itself a specific form of non-homologous recombination (LT7).
Recombination in yeast. Though there are molecular distinctions (for example, there is not an obvious
eukaryotic counterpart to the RecA of bacteria), the process of recombination is apparently rather similar
in its implications. Perhaps the most striking biological difference is that recombination in prokaryotes is
almost exclusively a repair mechanism, where is eukaryotes it is also central to the genetic exchange that
is critical for most species and useful for all. It is also important that Sc happens to be highly
recombinogenic, as measured by the frequency of recombination events per unit length of DNA and that
this recombination is almost strictly homology-based. In contrast, some other eukaryotes, including some
other yeasts, perform homologous recombination at a much lower frequency, and so the recombination
they perform is often not based on sequence similarity. This has important implications for genetic
mapping and strain constructions as well, as discussed later.

607 Lecture Topic 2........ MUTAGENESIS IN VIVO


"I never met a mutant I didn't like." Barry Ganetsky
The following text focuses exclusively on bacteria, though the differences between these organisms and
yeast are probably minor and exist in the precise mechanisms involved and not in the overall biological
effects. Eukaryotes certainly do have repair systems, but they are not as well understood as are those of
bacteria.
Inherent mutation frequency. Understanding the actual mechanism by which rare spontaneous
mutations occur has been an exceedingly difficult research problem and will not be discussed here
(Genet148:1415 & 1667[98], TIG17:214[01]). Suffice it to say that, through some combination of errors in DNA synthesis and occasional failures in DNA repair systems, the frequency of any base pair being mutated to another is about 10^-7-10^-8. Therefore, if you start with a mutant strain containing a base substitution mutation, you might expect it to revert to the wild-type genotype at about this frequency. However, the measured reversion frequency (which is the return to an apparently wild-type phenotype) will typically be higher than this. There are often a variety of base changes (starting with the mutant genotype) that will restore a wild-type phenotype (or at least one close enough to the wild-type to satisfy the selection) and yet will not restore the wild-type genotype. Some geneticists, even some of my friends, use the term pseudo-reversion to refer to those revertants that do not restore the wild-type genotype. I find this usage awkward, because you do not know whether to call the strain a revertant or a pseudo-revertant until you sequence it. The 10^-8 figure thus provides a kind of benchmark or baseline for interpreting reversion frequencies.
A full discussion of the difference between mutation rate (the probability of mutation per cell per
generation) and mutant frequency (the proportion of mutants in a culture) is treated in The Genetics of
Bacteria and their Viruses by Hayes. That text also discusses four methods of determining such
frequencies in reasonably accurate ways. For our purposes, order-of-magnitude estimates are
satisfactory so the following simple methodology will suffice.
Operationally, one grows up a full-density culture of the strain under examination and plates out about 10^8 cells under selective conditions. If, after an appropriate period of growth (perhaps 1-3 days for a typical bacterium), one sees 10^2 colonies, the mutation frequency for the selected phenotype is said to be 10^-6 (10^2 colonies/10^8 cells plated). This method may seem to have little connection with the first calculation, which deals with "base pairs per generation." However, when one plates out 10^8 cells, one is effectively examining the mutation frequency of a replicon which has undergone 10^8 replications, so that the number of mutants seen reflects the frequency of mutations capable of causing the selected phenotype after 10^8 replications. If the only mutation capable of satisfying the selection is a return to the wild-type genotype, the frequency of revertants will be about 10^-8 in the case of a base-substitution mutation. In practice, mutation frequencies determined in this way will be a little high - perhaps as much as an order of magnitude, depending on the gene mutated - since some small amount of growth often occurs on the selection plate by all the plated cells. The assumption of 10^8 cells plated is therefore an underestimate by a factor that reflects the amount of growth that occurred. For our purposes, we will ignore these complications and assume that the experimentally derived number is correct. We will use the term "spontaneous," as in "spontaneous mutation frequency," to indicate "the lack of mutagenic treatment by the experimenter," a definition that is not universally agreed upon. Glass, for example, considers cosmic radiation as a source of induced, rather than spontaneous, mutations.
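In code, the operational calculation is nothing more than a ratio, but writing it out makes the comparison to the benchmark explicit (numbers as in the example above):

    # Mutant frequency from a plating experiment.
    cells_plated = 1e8
    colonies = 100
    frequency = colonies / cells_plated
    print(f"mutant frequency ~ {frequency:.0e}")           # 1e-06

    # Compare with the ~1e-8 benchmark for one specific base substitution:
    print(f"{frequency / 1e-8:.0f}x above the benchmark")  # 100x above the benchmark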
The above descriptions of analyses refer to batch cultures, in which one grows up a small culture


and analyzes the population after a small number of generations. This makes sense in that it is a method
that is routinely used in lab analyses of bacteria, but it is hardly the way the real world works. Instead,
more typical environments involve very long-term competition among bacteria in an ever-changing
environment and a modestly changing set of competitors. At least some aspects of this sort of mutational
situation can be studied in chemostat cultures, where growth rate of a long-term population can be
studied under controllable but consistent growth conditions. Results of analysis of such cultures have
interesting implications concerning mutation rates, since a surprising number of mutator strains are found,
though this depends on the actual treatment of the population (Nat387:700 & 703[97], Genet152:485[99], and Genet148:1667[98]).
In the discussions that follow, we will often use the term point mutation. This refers to any
mutation (e.g., base substitution or frameshift) where only a single base of the DNA is affected in the
mutant when compared to wild type. Typically such mutations revert at easily detectable frequencies (greater than 10^-8), in contrast to deletion mutations, which affect a number of bases and can almost
never be restored to a wild-type phenotype at a detectable frequency. The term marker will also be used
to refer to a mutation (of any type) that has a non-wild-type phenotype (and hence can mark a region of
the chromosome).
Mutations and lesions. A mutation is a change in the DNA sequence from that of the wild type, but what
of a situation where damage has occurred in the DNA, such as two adjacent pyrimidines being chemically
cross-linked by UV light? Is that a mutation? The answer is that a modification of the DNA that does not
yet look normal to the repair systems is not yet a mutation, because there is a reasonable possibility that it
will be repaired. Such a modification is best referred to as a premutational lesion or simply a lesion. In the
case of the cross-linked dimer, there are repair systems that seek and correct such lesions and, once
corrected, it is as if they never happened. However, if the replication fork reaches such a lesion before it is
corrected, then it will not be able to properly replicate that region and might well make errors. If such
errors now appear to be normal base pairs, then they cannot be corrected (except perhaps accidentally
by recombination) even if they are not the wild-type sequence, and they are now mutations. In other
words, if you detect a mutation, then there must have been a lesion that was not repaired prior to
replication.
Figure 2-1. Steps in changing a lesion to a mutation. Only the final product on the bottom right, with the A/T pair, is actually a mutation.
Consider the following case in Fig. 2-1. A reactive chemical in the cell has just reacted with the G of a G:C base pair to create a methylated version, which we will abbreviate G*. Let us further hypothesize that G* can pair fairly well with either C or T, which means that during subsequent rounds of replication, it might cause an incorrect base to be added to the new strand. However, the cell can recognize most improperly modified bases, so the G*:C base will probably be seen and corrected to G:C. The G*:C base pair is
clearly a lesion at this point. In our case, however, it happens that this modification took place immediately
before the replication fork moved through that region of the genome, so the lesion was not corrected
before replication. The two new strands (at that position) now might be G*:T (because G* mispairs with T
sometimes) and G:C (because the C on the other original strand will be a perfectly normal substrate in
replication). So is a G*:T base pair a mutation? Neither base is right here, so it is tempting to assume that
it is a mutation, but it is not. First, mismatch repair (described below) will very likely see this base pair and,
through its normal mechanism, replace the newly synthesized strand. With any luck, it will put in a C,
yielding G*:C. Now there are other repair systems that look for funny bases and, in this case, would
certainly either remove the modification from the G or replace the base entirely, and either process would
restore a wild-type G:C base pair. We only get a mutation from the G*:T base pair if a repair system cuts
out the G* and then adds an A to match the T, or the replication fork goes through again and the T serves
as a fine template to create an A:T. In either case these base pairs would now be mutations because they
are completely normal in appearance, but simply differ from the wild-type sequence. At least one of the
take-home lessons in all this is that a major factor in the creation of mutations depends on who reaches
the lesion first: if it is a repair system, it will probably be corrected, but if it is the replication fork, then a
mutation might be created.
Spontaneous mutations: source and type. (ASM2, 2218[96]). It is important for biological systems to
replicate themselves with reasonable fidelity. On the other hand, too great an attention to fidelity would


lead to an unacceptably low replication rate because speed and accuracy involve a trade-off. For
reasons that I do not fully understand, the maximum tolerated mutation rate cannot be much higher than
the reciprocal of the genome length (Naturwis58:465[71]), and it turns out that RNA viruses are at about
that level, while microbes tend to be about 300-fold lower (PNAS90:4171[93]). This maximum mutation
rate reflects the fact that if you make too many mutations, then almost none of the progeny will be
functional. RNA viruses push the envelope partly because they do not have access to the DNA repair
systems that exist in all hosts. Indeed, it was recently found that ribavirin, an anti-viral drug that has been
used for some time, takes advantage of this feature of RNA viruses. It enhances the mutation rates
approximately 10-fold by acting as a nucleotide analog and this is sufficient to push the virus over the
edge of what has been termed error catastrophe (PNAS98:6895[01]).
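The reciprocal-of-genome-length ceiling is simple arithmetic (the genome sizes below are round illustrative values, not taken from the references cited):

    # Maximum tolerated mutation rate ~ 1/(genome length), per base per replication.
    genomes = {"RNA virus (~10 kb)": 1e4, "E. coli (~4.6 Mb)": 4.6e6}
    for name, length in genomes.items():
        print(f"{name}: ceiling ~ {1 / length:.0e} per base per replication")
    # RNA viruses sit near their ceiling; microbes run ~300-fold below theirs, which
    # is why a mere ~10-fold boost from ribavirin can push a virus over the edge.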
As explained a few pages down (under hot spots), not all sites in DNA are mutated at the same
frequency for two general reasons. Because of differences in the surrounding regions, some bases are
more likely to be altered to become lesions than are others. Alternatively, and again because of the
context, some lesions are more likely to be repaired than others, which influences the number of
mutations that we see at these sites.
Here are the common mechanisms for spontaneous mutations:
(i) Mispairing errors in replication. Energetically, H-bonding should limit mutation rates due to mispairing to about 10^-8, but the occurrence of tautomeric forms causes this to be 10^-4 (NAR16:9377[88]), while 3'-5' proofreading by the polymerase drops the rate back to 10^-8. This is, however, too simplistic a way to look at the situation, since polymerase itself has a role in proper base selection beyond its proofreading function. A mini-review on the role of DNA polymerase in mutations that helps clarify the roles of hydrogen bonding and base stacking is in Genet148:1475[98].
(ii) Depurination. (ASM,p1044; ARG20:201[86]; model in PNAS85:5046[88]) The methylation of
either purines or pyrimidines often leads to loss of those bases spontaneously or through base cleavage.
If the replication complex hits this site before repair, a range of mutation types can result. It has also been
suggested that RNAPs are error-prone at such sites, apparently tending to insert an "A"
(PNAS90:6601[93]).
(iii) Cosmic radiation. This no doubt has an effect, but it is difficult to do the negative control.
(iv) Deletions and higher order changes. These are the subjects of LT5. Briefly, such mutations
are largely dependent on homologous recombination either at patches of homology or at chi sites. The
positioning of direct sequence repeats at deletion end points is consistent with a mechanism of either
recombination or copy-choice replication errors.
(v) ISs. These can be a very common form of spontaneous mutation type, depending on the
organism. While ISs almost always kill the affected gene product, there are rare cases where they can
cause the aberrant expression of a given gene (see LT7).
(vi) Error-prone repair (SOS). (ASM,pp1022) This repair system is listed in this section since it seems to be a source of spontaneous errors in vivo. For example, spontaneous mutations are rarer in recA, lexA, and umuC strains. UmuC has been given the name DNA Polymerase V (PolV) because it is able to replicate past seriously damaged template regions. For years, I felt that the fact that umuCD mutants are UV^r strongly suggested that this is not actually a repair system, but instead a mechanism for increasing the mutation rate under certain circumstances. However, I think the answer is simply that UV doesn't effectively produce the sorts of lesions that require PolV action.
Mutation rates in the real world. Most of our understanding of mutation rates in bacteria is from analysis of
E. coli under fairly fixed conditions, such as chemostats or agar plates, and typically under conditions of
reasonably good growth. However, there is growing evidence that harsh conditions not only select for
mutants that prosper in these conditions, but also that the population in such conditions has a higher
mutation frequency than we would predict from the standard conditions under which such rates have been
analyzed. There are two general ways of thinking about these higher rates, both of which are probably
correct for at least some situations. In the first, the mutation rate increases in each cell in the population,
and in the second, it increases dramatically for only a subset of the larger population. From the
bacterium's point of view, the logic in the first case might be that the growth situation is challenging, so
there is little to lose by raising mutation rates in the hope of creating a mutant progeny that does well. The
logic in the second case would be that it is dangerous to increase mutation rates in the entire population
because everyone becomes less fit due to deleterious mutations (in the event that a terrifically
advantageous mutation isn't found). However, it is a bit less risky to allow a sub-population to significantly increase the mutation rate, so that a helpful mutation still might be found, but the entire population is not put at risk. It is technically difficult to actually differentiate between these two possibilities (e.g. how can you really tell what the mutation rate is within a single cell?).
How might either of these possibilities take place? One can imagine that you can increase


mutations either by increasing the number of lesions, so that more slip past the repair system to become
mutations, or because you decrease the efficiency of the repair systems themselves. Fine, but how would
either of these happen? One way would be for mutations to occur in either the genes for replication (to
make the proteins more error-prone) or in the genes for repair; such mutants are termed mutators and are discussed below. The problem with this, from the bacterium's point of view, is that being a mutator
causes rather more deleterious mutations than helpful ones, so that even if a mutator gains an
advantageous mutation, it will continue to mutagenize its own genome, which will continue to cause
problems. This can still be advantageous in some circumstances, but you can see why a non-genetic
approach might be better. In other words, it has been argued that some cells might transiently reduce the
function of their repair systems by some mechanism, and then turn them back up to a normal level after a
period of time. The mechanism behind this is unclear as is any mechanism for defining a sub-population
as alluded to above.
At least some forms of error do seem to be more common in non-dividing cells: both O^6-MeG and O^4-MeT seem to accumulate in DNA under those conditions by an unknown mechanism. In wild-type Ec,
these seem to be repaired by the ada and ogt systems described below. The possible role of
environmentally stimulated, enhanced mutation rates is also addressed in the description of mutators in
this LT. Another possible mechanism, termed the translation stress-induced mutagenesis pathway, has
been proposed (JBact183:1796[01], MolMicro30:905[98]).
These phenomena may not be critically important to most laboratory manipulations, but they are
certainly relevant to both bacterial evolution and ecology. This is because in the real world most bacteria
are not growing very rapidly and therefore our models based on observations with log-phase enterics are
deceptive at best. The existence of hot spots for spontaneously derived mutations also suggests the
possibility of targeting, but presumably this class of targeting reflects primary sequence of the region
rather than regulation or function of the encoded product as discussed below (see ASM, p1018ff).
Are spontaneous mutations targeted? We can rationalize how and why mutation rates might be
dependent on growth conditions, but it has also been proposed that there might be situations where not
only are mutations rates increased, but the mutations are targeted to regions of the genome in which
advantageous mutations might be created. The initial notion involved the ebg story of Cairns
(Nat335:142[88]) and a similar argument, based on a rather different assay, was made by Hall
(Genet120:887[88]). Moreover the notion was not that only some regions were mutagenized (one might
imagine mechanisms for that) but rather that specific regions were targeted for mutagenesis, with the idea
that these regions were more likely to give the desired advantageous mutation. Models for this very
radical notion are discussed at some length in Genet126:5[90] and included: reverse transcriptase
(though probably not a major factor in bacteria), targeted failure-to-repair, and targeted mutation
generation. The generally accepted term for this class of mutations was adaptive. Now, I am skipping a
few of the ugly details here, because at least some of the original Cairns phenomenon happened to
depend on the fact that the genes were on a plasmid (?!?!) and it is clear that there are several different possibilities at play. However, I don't believe that there was ever a particularly compelling model on how
mutations would be targeted to specific regions. For those interested in the general issue, the following
references should bring you up to date: (JBact182:2993[00], JBact179:1550[97], EMBOJ16:3303[97],
Genet148:1453[98], Genet148:1559[98], Genet154:1427[00]).
I should emphasize that the term "adaptive" should be used carefully. In a linguistic sense, all
mutations that allow an organism to grow better are adaptive, but the term was used with the very specific
notion of targeted mutations. Unfortunately (or fortunately, depending on one's world view) the term is
now more typically used in the general sense and not in the specific (and probably fictitious) sense proposed by Cairns et al. So when you see the term, think a bit about what the author means to imply.
More recently, another hypothesis to explain the apparent targeting phenomenon has been proposed. This model notes that the original mutation used in the Cairns experiments was a leaky
frameshift and proposes that duplication/amplification of this region increased the level of functional
product in the cell, supporting some growth. The amplification necessarily resulted in homologous
recombination, which in turn induced the SOS response, leading to general mutagenesis
(Genet161:945[02]). This simpler hypothesis seems completely compatible with the available data on the
phenomenon and has the charm of not requiring the invocation of new regulated mechanisms for
transient mutagenesis of portions of a bacterial population.
To summarize, I think it is broadly agreed that mutation rates can fluctuate and can be particularly
high in periods of growth stress. It is much less clear if this affects the entire population or a subpopulation, nor is the mechanism of the effect(s) obvious. Finally, it is conceivable that some sort of
region-targeting might take place, but there seems to be only one viable model for that.


Clustered spontaneous mutations. In yeast, Sherman has sequenced a number of chemically and spontaneously derived mutations and found that ~10% were actually multiple base changes, with the several changes being within 6-20 bp of each other. Remarkably, in many of these cases, there was a nearby sequence that was either a direct or indirect repeat of the new sequence. This suggested a mechanism whereby a nearby, similar, but incorrect sequence is used as a template in polymerization. In such cases, mutagens seem to exert their effects more by stimulating this erroneous template response than by their direct alteration of bases (JMB201:471[88]).
A similar analysis has been done in Salmonella, starting with a frameshift mutation (Genet149:17[98]). There were five general classes of revertants: a 2-bp deletion hotspot, insertions, duplications, deletions and complex mutations. Some of these were further stimulated by mutagens or by a uvrB mutation and others were not, and plausible hypotheses for the modes of generation were proposed. These typically involved template slippage, such that polymerase utilized an inappropriate section of DNA as template, creating a mutation.

Figure 2-2. Frameshifts and reversion. The top line shows a WT sequence and the second shows the result of an addition of a G to a run of Gs. In this example, the mutation not only perturbs the sequence, but it also shifts the reading frame to disclose a stop codon in the new reading frame. A possible reversion mutation at a second site is shown that would restore a WT phenotype if the leu and gln residues coded in the WT sequence happened to be unimportant for protein function.

Frameshifts. Frameshifts are defined as the addition or deletion of 1 or 2 base pairs. While frameshifts
tend to occur at sequences that are monotonous runs of bases (e.g. {G}n and {GC}n), there certainly are
exceptions to this rule (JMB191:601[86]). At least some spontaneous frameshift mutations are apparently
formed by "slipped-strand mispairing", in which the redundant sequence allows the template strand to slip
in its pairing with the new strand (NAR15:5323[87]). Depending which way the slip occurs, replication
either adds or deletes a base in the new strand. That these sorts of changes are due to errors in
replication is suggested by the fact that recA mutations do not affect their appearance, while mismatch
repair mutants are >10-fold higher for this error type (NAR15:5323[93]). These sorts of effects are also
probably relevant to human diseases, as mismatch repair has been shown to be important for stability of
repetitive tracts of bases in yeast (Nat365:274[93]). As shown in Fig. 2-2, frameshift mutations shift the
reading frame and can thus cause termination codons to appear in the new reading frame.
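Since such monotonous runs are the main frameshift targets, it is easy to get a feel for how common
they are by scanning a sequence. The following is a minimal Python sketch (the function name and the
4-bp cutoff are my arbitrary choices, not values from the cited papers) that reports each mononucleotide
run above the cutoff:

# Minimal sketch: flag mononucleotide runs that are plausible
# slipped-strand mispairing (frameshift) hot spots.
def find_runs(seq, min_len=4):
    """Return (start, base, length) for each run of at least min_len."""
    runs = []
    i = 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i >= min_len:
            runs.append((i, seq[i], j - i))
        i = j
    return runs

# A made-up sequence with a run of Gs like the one in Fig. 2-2:
print(find_runs("ATGGGGGCTTAAACCCCTG"))  # [(2, 'G', 5), (13, 'C', 4)]

By this sort of criterion almost every gene contains several candidate runs, which is the point made
below about the frequency of frameshifts in screens for loss-of-function mutations.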
Surprisingly, some base substitution mutations can stimulate adjacent frameshifts, if the
substituted base can pair with a base on either side of the correct one. This effect has been demonstrated
in vitro and in one particular in vivo model system (PNAS87:4946[90]).
Tautomeric shifts. It has been suggested that tautomeric mismatches (base pairs that form hydrogen
bonds because one of the bases is in an atypical chemical tautomer) are the source of some spontaneous
mutations (NAR16:9377[88]). It appears that these mismatches, unlike G:T or A:C, are stable enough to
survive until another round of replication changes the lesion into a mutation. This paper has a very nice
discussion on the chemistry underlying mutagenesis and notes that there might be a category of
mutagens ("tauterogens") that mutagenize indirectly by stabilizing certain tautomeric shifts.
Neighbor effects. O6-MeG preferentially unstacks the 3' neighboring base as well as the pyrimidine to
which it is H-bonded, resulting in DNA bending. Because of these effects, the flanking sequences can
influence the frequency of insertion of a T opposite a MeG (JMB228:1137[92]).
Resistance to UV. At least in Bs, spore DNA is heavily bound with small acid-soluble proteins (SASPs)
that prevent formation of a range of mutagenic pyrimidine dimers, but allow formation of a different
dimeric product that is relatively efficiently corrected during spore germination (JBact174:2874[92]). The
basis for the amazingly high UV-resistance of Deinococcus radiodurans appears to be the following. First, it
has multiple copies of its genome, but more importantly after UV damage it stops DNA replication,
something that most other prokaryotes have not figured out how to do. It then uses homologous
recombination to recombine its many genome copies to reform at least one normal copy, which then
begins replication. (JBact184:1649[02]).
More recently, people have asked what sorts of mutations confer enhanced UV resistance onto E. coli by
giving independent cultures repeated doses of UV and then analyzing the genomes of the more-resistant
progeny. Though causation by the detected mutations has not been established, it is highly suggestive
when mutations in certain genes are found in independently derived strains. The tentative answer, for this
starting strain of E. coli at least, is that there are actually multiple changes that can enhance resistance,
though the data suggest that changes in recombination and the replication restart system are most
common (JBact191:5240[09]).
Spontaneous alkylation. At least some spontaneous mutations in Ec are due to DNA alkylation,
presumably from endogenous metabolic products (this is also true for S. cerevisiae - PNAS90:2117[93]),
and to the oxidation of DNA and its precursors by activated oxygen species. 8-oxo-dG has been
hypothesized as a major product of such oxidation and appears to have a range of miscoding properties
in vitro (TIG9:246[93]).
Repair of lesions. (ASM2,2277 & 2236[96]) The cell clearly is willing to devote vast amounts of energy
and effort to the repair of damage to DNA. As is the case with mechanisms of mutation, the mechanisms
of repair are difficult to study because the actual number of events to be repaired in vivo is very small. As
a consequence, we have only a vague idea of what the actual signals for different repair systems are, and
of their relative impact in a normal cell. Much of the identification of repair systems has come through the
isolation of mutants defective in repair, which therefore display higher mutation frequencies. Some of
these were recognized by increased sensitivity to certain mutagens, like UV, and some were seen to have
a higher spontaneous mutation frequency. Members of this latter class are termed mutators and are also
addressed later.
The following are some of the repair mechanisms that have been studied:
Proofreading. DNA pol I, II, III, and T4 pol all have associated 3'→5' proofreading exonuclease activity.
Direct repair. There are several types of direct repair. In the first, photolyases cause the direct catalytic
reversal of some pyrimidine dimers caused by UV crosslinking. Specifically, cyclobutane dimers are
repaired, but not the pyrimidine 6-4 photoproducts. For its activity, the enzyme requires near-UV light for
its two chromophores. Interestingly, while photorepair is of importance in studied bacteria, it is of no
apparent significance in humans (PNAS90:4389[93]).
Another form of direct repair is sometimes termed inducible repair. The product of the ada gene
removes the methyl group either from O6-MeG or O4-MeT and transfers it to cys-321 (as well as the
methyl from Me-phosphotriesters to cys-69). Ada destroys itself in the process since it can only run the
reaction once - it is therefore non-enzymatic. Ada is positively autoregulated at the transcriptional level
(JMB205:373[89]) by its cys-69 methylated derivative, since that signals that DNA damage has been
encountered (methylated cys-321 does not have this effect). This positive autoregulation raises the
possibility of runaway regulation, but the problem is solved by the demonstration that the unmodified Ada
interferes with this transcriptional activation. The regulated operon contains three other genes, at least
one of which encodes a product that repairs other forms of methyl damage (JMB216:261[90]).
Bs has these two activities encoded in two co-transcribed genes, only one of which performs
transcriptional regulation (NAR18:5473[90]). Ec has another methyltransferase with roughly similar
substrate specificities, but whose synthesis is constitutive, providing constant low-level protection
(JBact173:2068[91]).
Base excision. When certain incorrect bases are recognized, the repair process starts with a glycosylase
cut at the N-glycosidic bond between the base and the deoxyribose, leading to an AP (apurinic or
apyrimidinic) site. An AP endonuclease cuts the phosphodiester bond next to the AP site, the sugar is
removed, the nucleotide is replaced and ligated in. There are a variety of different glycosylases to handle
uracil (a result of misincorporation or deamination of C), hypoxanthine (a result of A deamination), and
3-MeA (resulting from the action of alkylators or S-adenosyl methionine).
Nucleotide excision. (Bioch42:12654[03], Sci286:1897[99]) The ABC exinuclease (uvrABC) is thought to
be used for only those lesions that distort the DNA helix. For example, it works on thymine cyclobutane
dimers, which cause a 30° bend in DNA (PNAS85:2558[88]), and psoralen and cis-platin adducts, which
cause >30° bends (JBC264:5172[89]). It has also been shown that this system works on O6-MeG, though
the reason for that was a mystery, since it was assumed that this base caused little or no helix distortion.
In fact this assumption has proven to be incorrect. The system's recognition of damage is also affected by
the surrounding sequences.
When an error is recognized, a complex of UvrA2-UvrB binds to the damaged region and then the
UvrA2 dimer dissociates, leaving a UvrB-DNA complex that recruits UvrC. This complex triggers the
endonucleolytic activity of UvrC, which cuts the DNA. This activity makes two cuts, so that 13-14 bases
are removed, filled in by polI and ligated. Helicase (UvrD), which promotes unwinding of duplex DNA
(PNAS87:6383[90]), is also involved. Surprisingly, it has been shown that pyrimidine dimers are much
more rapidly removed from the DNA strand acting as a transcription template than from one that is not;
the factors involved in this effect are unknown (Nat342:95[89]). It is also apparent that a transcribed
region of the chromosome is repaired more rapidly than a non-transcribed one (see MicroRev54:p42[90]
for models).
There is a line of evidence implicating RNA polymerase in this process: There is preferential
repair of targets of this repair system when they occur on the transcribed strand. The suggestion is that
the stalled RNA polymerase signals this system, but then the stalled complex must be removed
(PNAS91:8503[94]). Separately, there is a body of old data that suggests that transcribed regions are
more chemically mutable than non-transcribed regions (Genet66:31[70], JMB89:17[74], see also
JMB200:681[88] and effects of NTG, MGG208:4481[87]). Another argument is that transcription itself
might be mutagenic under stress conditions (PNAS86:5005[89]).
Mismatch repair. (CurrBiol10:R788[00] & MutRes460:245[00]) The role and mechanism of the mismatch
repair (MR) system was greatly clarified with the demonstration of functionality in a completely defined
system (Sci245:160[89]). The system can repair all mismatches but C:C and, since many of the mispaired
bases involve little helix distortion, the system must detect very subtle perturbations. The logic is that the
MR system follows the replication fork and when it finds a mismatched base pair, it assumes that the base
on the old strand is correct. It tells old from new strands through a system termed Dam, for DNA
adenine methyltransferase, encoded by dam. This protein methylates the A base on each strand of the
palindromic sequence GATC. At steady-state, both As are methylated, but immediately after replication,
the A on the new strand remains unmethylated for a while until the Dam protein sees it. In this period,
then, the old strand is the one on which the GATC sequences are already methylated. As you'll read later,
hemi-methylation of a Dam site is also used as a signal for other sensing systems in the cell that a region
has recently been replicated, and Dam is therefore indirectly involved in a curious range of issues. For
example, Dam appears to be involved in the posttranscriptional regulation of Vsr, another repair system
(see the section on VSP repair) (JBact183:3631[01]).

Figure 2-3. Mismatch repair. The mismatch is noted by the "bulge" in the DNA and there is a
hemi-methylated GATC site that has recently been replicated. Note that the distance between the
mismatch and the nearest GATC can be as much as 1 kb.
The MR system responds to a mismatched base pair by MutS binding to the mismatch. This
binding is recognized by, and complexes with, MutL (the genes are termed "mut," since their mutational
loss leads to a mutator phenotype). By an unknown mechanism, this complex finds MutH and activates its
endonuclease activity. Either before or after this event, MutH finds a hemi-methylated GATC (Dam) site,
cuts the unmethylated strand, and stabilizes the nick. Depending on which side of the mismatch the Dam
site falls (and how does the complex know that?), the cell uses either ExoVII to degrade 5'→3' or ExoI to
degrade 3'→5', removing the region between the Dam site and the mismatch. Then Ssb, DNA polIII
(MutD), and DNA helicase (MutU) resynthesize the deleted region and ligase seals the gap
(ASM2:766[96]). The functionality of the MR system has some dependence on the distance of the
mismatch to the nearest Dam site (NAR16:4875[88]), as well as on the context of the mutation
(Genet115:605[87]).
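To make the distance issue concrete, here is a minimal Python sketch (the function and variable names
are invented for illustration) that, given a mismatch coordinate, finds the nearest GATC site and which
side of the mismatch it lies on; the stretch between the two is roughly what the MutHLS pathway must
degrade and resynthesize:

# Sketch: locate the nearest GATC (Dam) site relative to a mismatch.
def nearest_gatc(seq, mismatch_pos):
    sites = [i for i in range(len(seq) - 3) if seq[i:i + 4] == "GATC"]
    if not sites:
        return None
    site = min(sites, key=lambda i: abs(i - mismatch_pos))
    side = "5' of mismatch" if site < mismatch_pos else "3' of mismatch"
    return site, abs(site - mismatch_pos), side

dna = "TTGACGGATCAATTCCGGTTAACGTGCAATC"  # toy sequence
print(nearest_gatc(dna, mismatch_pos=25))  # (6, 19, "5' of mismatch")

In a real chromosome that distance can approach 1 kb (Fig. 2-3), which is presumably why the efficiency
of the system falls off with distance to the nearest Dam site.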
Note that this scheme says something about how enzymes sense strandedness and use that
information some distance away. One might have imagined that a protein would bind to one strand and
then slide along it, but in this case, they do not - they just degrade the intervening region. Similarly,
proteins do not seem to track in choosing sites for site-specific recombination (Nat325:401[87]). On the
other hand, there must be some sort of tracking going on, so that the MutHLS complex knows where it is
with respect to the mismatch (i.e. which side of the hemi-methylated DNA site the mismatch is on).
On the MR system's specificity, it has been noted that while some A:G's can be repaired, those
allowed by MutT cannot be (MGG219:256[89]). It is conceivable that there are different forms of a given
mispair (A(anti):G(anti), A(syn):G(anti), and A(anti):G(syn)), some of which are produced in a mutT
background and others correctable by MutHLS. This argument, of course, raises any discussion of the
nature of repair and mutagenesis to a new level of complexity, since it dramatically increases the number
of relevant forms. It is also clear that both syn and anti conformers of O6-MeG (only the former base pairs
with C) exist in Ec. These two forms quite possibly differ in repairability as well as in their interaction with
DNA polymerase, possibly explaining why O6-MeG is both mutagenic and lethal (PNAS90:3983[93]). By
whatever mechanism, pur-pur pairs are repaired more efficiently than either pyr-pur or pyr-pyr,
presumably because the effects on the DNA backbone make the lesion more detectable (PNAS90:804[93]).
There is also an A:G-specific mismatch repair system involving micA and mutY. The MutHLS system also
works on 1-3 nt insertions or deletions, indicating that MutS can recognize a curious set of errors
(JBact171:6473[89], Genet125:275[90]).
Curiously, Dam sites are claimed to be most common in translated regions and rarest in rRNA
and tRNA genes (NAR16:9821[88]), though it's unclear if this is connected to its role in repair. It has also
become clear that not all GATC sites are methylated with the same rapidity following replication, a fact
that has implications in apparent localized mutation rates (Gene74:189[88] & JBC264:4064[89]). Indeed,
certain Dam sites may be typically (always?) unmethylated due to unknown mechanisms and for unclear
purposes (PNAS89:4539[92]).
MR also severely affects interspecific recombination and any other recombination where there are
more than a few mismatches, apparently through two mechanisms. (Note that this is a bit tricky because
you do not want MR to interfere with the normal recombination repair, which certainly also involves a
mismatch or two.) The first involves MutLS (and is MutH-independent), which act by aborting
recombination at the level of UvrD helicase. The second is MutH-dependent, requires unmethylated Dam
sites, and might involve incomplete long-patch mismatch repair. Because of these differences, mutH
mutations increase interspecies recombination 20-fold, while mutLS mutations increase it 1000-fold
(Genet150:533[98]). This means that mutants lacking mismatch repair have two enhanced abilities to
adapt to challenging environments: they have a higher frequency of spontaneous mutations, and they can
more readily accept heterologous DNA in horizontal transfer (Genet164:13[03]). Remember though that
these enhanced abilities have other mutational costs as well.
As described in a few pages, there is actually a selection for mutants lacking MR in any
environment that has rapid changes in selective conditions. MR has been detected in eukaryotes and its
absence causes a mutator phenotype (Nat362:652[93]).
VSP (very short patch) repair. This system corrects T:G mismatches arising by deamination of 5-MeC to T
by replacing the T. This system also repairs U:G to C:G although not as efficiently as base excision by
uracil DNA glycosylase (MGG243:244[94]). However, overexpression of vsr stimulates all sorts of
mutational events including frameshifts, leading to the hypothesis that vsr might physically inactivate the
mismatch repair system (JBact178:4294[96]). More recently it has been shown that VSP requires dam
methylase, apparently because Dam is involved in post-transcriptional regulation of Vsr
(JBact183:3631[01]).
The GO system repairs 8OH-G, which is an oxidation product of G. In the syn configuration, 8OH-G can
mispair with A. There are three levels to the GO system: MutT converts 8OH-GTP to 8OH-GMP, reducing
the likelihood of incorporation in the first place. MutM recognizes a variety of modified purines when
incorporated into DNA and removes the entire nucleotide. Finally, MutY recognizes A:8OH-G base pairs,
removing the A for its replacement with C (JBact174:6321[92]). Oddly, MutM also appears to block
illegitimate recombination caused by oxidative stress, while MutT has no such role (Genet151:439[99]).
Recombination. (see ARG35:53[01]) Recombinational repair involves the products of recABCFJN, ssb,
and uvrB. This system repairs single strand gaps and interstrand crosslinks. As argued in LT1, repair may
be the primary reason for the existence of homologous recombination. The value of this system is
exemplified in the extreme radiation tolerance of Deinococcus, for which homologous recombination is
central.
Now there might seem to be a paradox here, as we will say later that deletions and duplications
are very common types of mutations and that they are substantially created through the homologous
recombination system. How can a repair system be so error-prone? The paradox is actually a valid one in
that the repair system does create mutations sometimes, though rather less frequently than it repairs
them. However, the breakage-and-rejoining function of the homologous recombination system is highly
accurate, so there are no mutations introduced when the correct regions are matched and recombined.
The errors come in when the repair system accidentally lines up two sequences that are similar (or even
identical for a stretch) but where the sequences are not the two copies of the identical region in the two
daughter chromosomes (which is what homologous recombination seems to be set up to use). Such
recombination events then result in the deletions and duplications as described in LT5 and 6.
Error-prone/SOS repair. The recA gene product, when bound to single-stranded DNA, becomes activated
and stimulates the autocleavage of the lexA gene product, inducing the expression of recANQ, lexA,
uvrABD, and umuCD (PNAS98:8350[01]). Activated RecA also cleaves UmuD to the active form, UmuD*.
The role of the umuCD* gene products is apparently to function as E. coli's fifth DNA polymerase (PolV)
and to allow the synthesis of DNA opposite an unreadable template (Bioessays24:141[02] &
PNAS98:8350[01]). The recA gene product is also involved in this "mutasome" complex, perhaps by
inhibiting the activity of the dnaQ gene product, a 3'→5' editing exonuclease (PNAS83:619[86]). Mutants
lacking UmuCD are surprisingly UVr, suggesting this pathway is of minor significance to cell survival in
response to UV. These results make one wonder about the actual in vivo role of the system. In fact, it has
become increasingly clear that SOS repair and the production of error-prone polymerases are actually
induced under non-growing conditions (JBact187:449[05]). Strikingly, the mutagenic effect of this system
is substantially lower in an rpoS mutant because lower amounts of the mutagenic polymerases are
synthesized (Genet166:669[04]). The notion would be that the cell would transiently mutagenize itself in
the hopes of creating a beneficial mutation. One then wonders if the repair aspects of this system are
there simply to deal with the necessary problems created by this increase in mutagenesis, rather than to
actually deal with spontaneous mutations.
Mutators. Mutators have typically been identified as strains with a substantially higher than normal
mutation frequency (ASM2,2231[96]). These can involve defective replication systems that make more
lesions, or defective repair systems that either fail to repair existing lesions or create new lesions in the
effort to repair real or imagined lesions. Obviously the latter class will be more common because they
involve loss-of-function mutations rather than the rare alteration of replication functions. mutD mutations
affect the epsilon subunit of DNA polIII, leading to decreased 3'→5' proofreading activity and causing a
10^5-fold increase in all types of point mutations, including frameshifts. Part of this is due to more errors
and part is due to saturation of the mismatch repair system (JBact171:4494[89]). The mutagen
2-aminopurine was thought to mutagenize by mispairing during replication, but it actually mutagenizes by
saturating MR (JBact170:3485[88]). In other words, MR thinks that the presence of 2-aminopurine is a
dangerous lesion and wastes time repairing it, thus failing to repair other spontaneously occurring lesions.
mutH, S, L and uvrD mutations affect components of MR, leading to a 10- to 10^3-fold increase in
mutations, depending on the degree of damage to the MR system. mutT, M, and Y are described above.
Two other Ec mutator loci, mutA and mutC, cause transversions exclusively. The normal products of
these genes are probably part of an as yet undescribed error correction system (PNAS87:9211[90]).
miaA mutations cause G→T changes, and the miaA gene product is known to catalyze the
second-to-last step in the synthesis of the ms2i6A modification next to the anticodon of many tRNAs. Its
activity is modulated by environmental effects like Fe-limitation and anaerobiosis. This modulation then
affects a wide range of physiological properties, including poor translation by under-modified tRNAs. It
now appears that the increased mutation rate caused by miaA mutations requires the presence of a
functional recombination system (?!?!). This is rather similar to another system, termed the translation
stress-induced mutagenesis pathway, that is turned on in mutA mutants. Rather amazingly, this last
pathway seems to result from a common mistranslation of a critical part of the proofreading subunit of
DNA polymerase, so that it now fails to correct many mutations. The basis for the requirement for
recombination in this pathway is not apparent to me (JBact183:1796[01], MolMicro30:905[98]).
Mutator mutations are clearly going to contribute to the genetic load of an organism, as every
cycle of replication will have much less fidelity, and therefore many more mutations, than would be the
case in the wild type. (The genetic load refers to the relative decrease in the fitness of the organism,
though it really only has meaning in a population sense.) However, there are situations where this load is
more than compensated for by the ability of mutator organisms to rapidly respond to changes in the
environment. Essentially any situation where there are frequent, strong and different selections enriches
for mutators; this has been shown in the lab to be amazingly efficient (JBact179:417[97],
JBact181:1576[99]) and defended through computer simulations (Genet152:485[99]). Among the
implications are that mutation rates of organisms in the real world might actually be higher than we
appreciate and that successful horizontal transfer, which is typically precluded by mismatch repair unless
the sequences are extremely similar, might actually be more common, at least in a reasonable fraction of
certain populations.
The general argument is as follows. If an environment is constantly changing, then the average
bacterium would not be as fit as it could be and therefore certain mutations would significantly increase
fitness. Now any such mutation would be rare, but its frequency would be much higher in mutator
strains. As a consequence, the organisms most likely to adapt to the new conditions first would be those
with higher mutation rates - effectively the mutator mutation would be taking advantage of the fitness gain
caused by the advantageous mutation. Thus, while the mutators would still create more deleterious
mutations (the higher genetic load - see Genet154:959[00]), this might be compensated for by the
increased fitness of the advantageous mutation. This last citation is a very readable theoretical paper that
explains how this phenomenon is a function of the population size (in a big population, you don't need a
mutator in order for it to be likely that at least one cell in the population acquires the desired mutation),
the strength of the mutator (a higher error rate confers both a greater genetic load as well as a greater
advantage in fluctuations), and the magnitude of the gain in fitness caused by the desired mutation (if the
maximum gain is small, then it is simply not worth the cost of being a mutator). Discussions of the effects
of mutators on a population are in Nat387:700 & 703[97] & Genet152:485[99].
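If you prefer arithmetic to prose, the hitchhiking logic is easy to see in a toy calculation. All the numbers in
this Python sketch are invented for illustration: mutators are only 0.1% of the culture but mutate
10^3-fold faster, so they are expected to contribute about half of the survivors of a single hard selection:

# Toy calculation (all numbers invented) of mutator hitchhiking.
N = 1e8                # cells plated
f_mut = 1e-3           # fraction of cells carrying a mutator allele
rate_wt = 1e-8         # frequency of the adaptive mutation in WT cells
rate_mut = 1e-5        # same mutation in a 10^3-fold mutator

# Expected survivors of a hard selection for the adaptive mutation:
wt_survivors = N * (1 - f_mut) * rate_wt
mut_survivors = N * f_mut * rate_mut
share_after = mut_survivors / (wt_survivors + mut_survivors)
print(f"expected survivors: {wt_survivors:.2f} WT, {mut_survivors:.2f} mutator")
print(f"mutator share: {f_mut:.3%} before, {share_after:.1%} after selection")

One round of selection has raised the mutator allele from 0.1% to ~50% of the population; the genetic
load only catches up with the mutators later.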
So now the bug has a terrific advantageous mutation that more than compensates for the mutator
allele, but how would you ever get rid of the mutator allele (and it seems you must, or all natural
populations would be seen as mutators, if that were an irreversible decision)? One way, of course, is to simply revert
that allele back to the wild-type sequence, which might be possible depending on the original mutator
mutation. Alternatively, gene transfer from a very similar organism might replace that mutator allele, yet
retain the advantageous mutation. Presumably both occur.
Now the bad news. In late 2010, Richard Lenski gave a seminar about his very long-term
chemostats. In a couple of these, after many thousands of generations, strong mutators took over, even
though the potential gain due to advantageous mutations was very small (we know this because fitness of
the culture was improving at a very slow rate by this point). Now this makes no sense to me, especially
since they were strong (as in, a mutation rate 10^3 higher than WT!). I queried Jim Crow by email who
responded "I was first astounded by this in the experiments of Cox and Gibson on increase of a mutable
strain in chemostats. Apparently in the artificial environment of the chemostat an increased mutation
produces more cells that can thrive in this special environment." True enough, but it does mean that we
are missing something important here.
Anti-mutators. (ASMpp1022 and 1051) Anti-mutators can result from either more accurate polymerases
or, curiously, by affecting nucleotide pools (driving A pools down or C pools up). Anti-mutators are
extremely tough to find because you are looking for even fewer mutations than the already low normal
level. A mini-review on anti-mutators in phage T4 and Ec is in Genet148:1579[98] and another
mini-review on the role of DNA polymerase fidelity in this process is in Genet148:1475[98]. A mini-review
on mutators and anti-mutators in Saccharomyces is in Genet148:1491[98]. Remember that, as a rule of
thumb, speed and accuracy are inversely related - more precise polymerases are slower - so there is a
trade-off here to be considered.
Chemical Mutagenesis. (ASM2,2218 & 2277[96]) The major difficulty with use of mutagens is that you
are beating up your organism, possibly generating other mutations that might complicate your life later.
Deliberate use of transposons for mutagenesis will be discussed below and can be very useful since they
tend to be single hits of a known quality, but they do not generate the most biochemically interesting
mutations (those causing altered gene products) in that they are always "knockouts" (complete
loss-of-function for the encoded product) and are nearly always polar. Sometimes you are best off with a light
chemical mutagenesis followed by good screens and enrichments. For particular classes of mutation
types, you should always spend a little extra time thinking about how to devise a selection, rather than
rely on a screen.
Mutagens increase the frequencies of mutations and their effectiveness is the result of a complex
set of factors: how well the mutagen gets into the cell, how well it reacts with DNA, how well the various
repair systems respond to the particular chemical change stimulated by the mutagen, and how much
killing occurs relative to the amount of mutagenesis. With these factors in mind, a gross generalization
can be made that alkylators are very effective (10^2-10^4-fold stimulation compared to spontaneous
frequencies) while the base analog and frameshift mutagens are less effective (10^1-10^2 stimulation).
Remember that these frequencies refer to the surviving population, which might be somewhat smaller
than the starting one.
The analysis of chemical mutagens is interesting because an understanding of their effects
provides insight into normal cell repair systems and their response to both spontaneous and "external"
damage to the genome. Work on mutagens also might suggest environmental factors of importance to
long-term human health. As noted below, however, the problem of understanding mutagen action is
exceedingly difficult, because of the relative rarity of the events. As always, an effort to increase the
frequency of these events, which would make the analysis of mutagenesis more tractable, might
imbalance the system so much that the results are not reflective of what is going on inside the cell under
normal situations. This section and the following one provide some ideas on how to think about the
problem.
Mechanisms of mutagen action. The exact chemical mechanism of mutagenesis is largely an
imponderable, and many arguments based on correlations are simply not compelling. For example,
simply because a given mutagen reacts with DNA in vitro and yields a preponderance of a particular
lesion does not prove that this type of lesion leads to mutations. It is possible that the common lesion is
extremely well repaired and that a different lesion caused by the mutagen, although less common, is poorly
repaired and is the actual source of the observed mutations. Exactly this situation is the case for the
common but incorrect claim that cyclobutane dimers are the mutagenic product of UV, for example. In
fact, it is the rarer, but poorly corrected, pyrimidine 6-4 photoproducts that cause mutations.
technically possible to generate altered bases in vitro and introduce these in vivo to see what happens
(Sci245:169[89], JMB207:355[89], & PNAS90:5989[93]). This has the virtue of asking if a hypothesized
pre-mutational lesion really does lead to mutations. There are problems with this approach, however,
including the possibility of overwhelming repair systems and the fact that even a positive result does not
mean that the generated lesions actually occur in the cell.
Alternatively, the analysis of a large number of independent mutational events can provide insight
into mutagen site preference. The generation of a sufficiently large body of data is not trivial and a
statistical test has been proposed to compare data from different labs (JMB194:391[87]). Part of the
problem in determining actual mode of action is that the effects of mutagens can be either direct or
indirect, by affecting replication or repair (JMB201:471[88]). A further complication is that mutagens can
have effects through multiple biochemical pathways, each of which results in different mutation types. An
example is the case of nitrosourea, which methylates Gs and causes G→A transitions. However, when
SOS is induced, nitrosourea appears to cause A→G transitions (JBact171:4170[89]). As noted on ASM
p1021, while different mutagens induce SOS repair, they generate different spectra of mutations, as if the
specific chemical modification affects the way the SOS system mishandles it. Finally, even if a modified
base can be shown to lead to mutations when created in vitro and introduced into the cell, it does not
follow that the pathway is of significance in vivo. It must also be demonstrated that the modified base itself
can be created within the cell under some circumstances.
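As an aside on the statistics, I have not reproduced the specific test of JMB194:391[87] here, but a
generic contingency-table comparison conveys the flavor of the problem. In this hedged Python sketch
the counts are made up, and for real spectra with many rare mutation classes a Monte Carlo version of
such a test is more appropriate:

# Illustrative only: compare two made-up mutation spectra
# (counts per mutation class) with a chi-square contingency test.
from scipy.stats import chi2_contingency

#            G->A  A->G  G->T  frameshift
spectrum_a = [42,    7,   15,   6]
spectrum_b = [18,   11,   30,   5]

chi2, p, dof, expected = chi2_contingency([spectrum_a, spectrum_b])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3g}")

A small p value says the two spectra differ, but, as argued above, it says nothing about why they differ.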
The importance of a mutagen's target specificity. Generally speaking, this is not too important a
consideration for screens/selections involving loss-of-function mutations, since there will be many
appropriate sites within any given gene. For example, even a G-specific mutagen will find many targets in
any gene; no gene is so low in G's that it is a poor target for such a mutagen and at least some
substitutions at Gs will certainly destroy gene product function. Where a very specific base change is
demanded by the selection (typically in a selection for gain-of-function mutations), some base substitution
mutagens will be better than others depending on the particular base changes they generate. For
example, in seeking a genetically altered aminoacyl tRNA synthetase that discriminates against a specific
amino acid analog in favor of the proper amino acid, there might be very few specific base changes that
create the precise active site with that ability. Only a mutagen that stimulates one of those specific changes
will increase the frequency of analog resistance.
The fact that frameshifts, whether induced or spontaneous, often occur in redundant runs of
bases implies that frameshift mutations will be less random than base substitution mutations. Consistent
with this, binding of a known frameshift mutagen (9-aminoacridine) has been demonstrated at a frameshift
hot spot (Bioc27:8904[89]). Nonetheless, such short redundant runs of bases occur a number of times in
almost every gene, so the frequency of frameshift mutations in screens for loss-of-function mutations is
a function of the size of the target gene.
There are mutagens (insertion sequences and transposons) that possess much more specificity
in their choice of target sequences. If the specificity is sufficiently demanding, some genes will rarely if
ever be mutated by a given mutagen (as in the case of certain transposons). The critical question for the
experimenter is: what is the range of genetic alterations that cause the desired phenotype, and does the
mutagen in question cause those mutations at a reasonable frequency?
Requirements for DNA replication during mutagenesis. For reproducibility, you want to control the
amount of mutagenesis and therefore you want to expose a culture to a mutagen for a set period of time
and then remove it from that mutagen. In practice, a culture of bacteria (10^8-10^9 cells) is exposed to a
mutagen for some short period of time, the mutagen largely removed by centrifugation or filtering the
cells, and then the cells are grown under non-selective conditions for a few generations to allow fixation of
mutations and expression of mutant phenotypes. The culture is then analyzed by selections, screens or
enrichments for cells with the desired phenotype. Typically, then, the mutation occurs (i.e. is fixed) only
when DNA replication creates a double-stranded DNA that has the non-wt sequence on both strands.
Hot spots. (ASMpp1021) Hot spots reflect the non-random occurrence of mutations at particular sites in
the DNA. These might be as general (and therefore relatively unimportant as a factor in mutagenesis) as
"every C" or as specific as a particular base sequence occurring once in the genome (in the case of Tn7).
Hot spots arise because of one of two rather different mechanisms: selective creation or selective failure
to repair. In the first, some lesions simply happen more often in certain places. For example, frameshift
mutations certainly occur in redundant runs of bases, whether they occur spontaneously or through
mutagenesis. Spontaneous deletions and duplication typically arise because of recombination between
small regions of homology (LT5 & 6). Certain mutagens have preference for specific bases and even the
adjacent context of those bases.
Selective repair refers to the fact that some lesions are more readily corrected than others. Again,
this might be context-dependent, in which an identical lesion is repaired more readily in one context than
another. All other things being equal, then, the lesion that is repaired less well has a higher probability of
remaining until the next round of replication, when it has a decent chance of becoming a mutation. A
slightly different case is that of 5-Me-C bases, which are occasionally created in the genome in the course
of restriction/modification or other purposes. When a normal C spontaneously deaminates, it becomes U,
and the resulting G:U lesion is readily recognized as an aberrant G:C pair and the U is replaced.
However, 5-Me-C deaminates to T, which results in a G:T lesion and the repair system has a problem here
- it is an aberrant base pair, but which base is incorrect? Note that mismatch repair doesn't help here, in
part because the lesion had nothing to do with an error in replication, so the new strand is not the one
more likely to be erroneous. The cell actually makes specific repair enzymes to seek this lesion and
assumes that the T is the wrong base, but of course this will not always be the case. As a result, sites of
5-Me-C modifications tend to be hot spots.
Optimal mutagenesis. Clearly you want to mutagenize enough to make the desired mutations detectable
without generating excessive damage to the genetic background. The question of optimal levels of
mutagenesis therefore largely hinges on the ability one has to screen for the desired mutant type. A
discussion on optimal mutagenesis for function-altered products is found in p1154 of Sci244:1152[89].
Excessive mutagenesis has another problem that might not be obvious. It can effectively kill all cells
exposed to the mutagen, so that survivors will be those that have been physically protected from the
mutagen and are therefore actually unmutagenized.
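The trade-off can be made concrete with a toy dose-response model; every parameter in this Python
sketch is invented. Killing is assumed to rise exponentially with the average number of lesions per cell
while the chance of acquiring the desired mutation rises roughly linearly, so the number of mutants
recovered peaks at an intermediate dose (here at 1/k_kill = 20 lesions per cell):

# Hedged toy model of optimal mutagenesis (all parameters invented).
import math

N = 1e8           # cells treated
p_desired = 1e-6  # chance per lesion of creating the desired mutation
k_kill = 0.05     # per-lesion killing

for lesions in [0.5, 1, 2, 5, 10, 20, 50]:
    survival = math.exp(-k_kill * lesions)          # fraction surviving
    p_mut = 1 - math.exp(-p_desired * lesions)      # fraction usefully mutated
    recovered = N * survival * p_mut
    print(f"{lesions:5.1f} lesions/cell -> {recovered:7.1f} mutants recovered")

Of course, the real situation is worse than this model suggests, for the reason just given: at very high
doses the survivors are enriched for cells that were never mutagenized at all.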
Summary of mutagens. (i) No mutagen is completely random; there is a wide range of degrees of
specificity. (ii) When loss-of-function mutations are sought, the number of different mutations yielding that
phenotype is probably large and a variety of mutagens (or a variety of different spontaneous events) will
be capable of producing such mutations. (iii) When gain-of-function mutations are sought, the number of
particular mutations that cause the appropriate alteration of the product will be small and only those
mutagens (or specific spontaneous events) that can induce those particular mutations will be useful.
Systems for the assay of mutagenesis. (ASMp1023ff) A number of different systems have been
developed for testing the degree of mutagenicity of a given treatment. These have the inherent problem of
attempting to detect very rare events. There are two rather different ways to solve this problem:
demanding rare gain-of-function mutations (referred to as backward mutations in this context) or seeking
more common loss-of-function mutations (referred to as forward mutations in this context).
Detection of backward mutations. ("backward" refers to a return to the wild-type phenotype) This
approach demands the phenotypic reversion of a characterized mutation and therefore has the advantage
of allowing the detection of extremely rare events with resulting good statistics. Because it involves a
selection, it is technically very easy to perform the analysis. The approach is limited, however, in the
number of different sites, and therefore contexts, that can be examined. In other words, you are able to
determine if a potential mutagen does or does not increase the mutation frequency for a specific base
change at a specific site, but a negative result does not necessarily mean that the compound is not a
mutagen at other sites.
(i) trpA: Reversion studies in the trpA gene of Ec are based on the understanding of the range of
acceptable amino acid substitutions at a small number of known mutation sites. It therefore allows the
determination of not only the sites at which reversion is stimulated, but also which of the acceptable
changes are stimulated and which are not.
(ii) The Ames test: The use of the his region of St is similar to the trp case, but with better
background information on the range of effects of different mutagens (see MutRes455:29[00]). Amusingly,
it was an adaptation of something Waclaw Szybalski had described long before.
(iii) Deletion screens: (see ASM p.1025) Such selections include demand for loss of two linked
loci, relief of polar blocks, or activation of a promoterless gene. All these can be satisfied by events other
than deletions, but there will be a higher frequency of deletions in the population than after a less stringent
screen.
Detection of forward mutations. ("forward" refers to the creation of a mutant phenotype) (ASM p.1026)
This approach uses an entire gene as a target and detects loss-of-function mutations. It has the
advantage that you are potentially analyzing any type of mutation that can affect that gene, as well as a
large number of different contexts. The disadvantage is that spontaneous mutations with the same
phenotype will also be common, so it can be tricky to see weak mutagens since they might cause only
modest increases in mutations over the background. Another challenge is that you only detect those
mutations that have a sufficiently dramatic effect on the phenotype to be seen, so that all mutations are
not detected equally well. Finally, because you don't know which mutations would necessarily cause a
detectable phenotype, the failure to see any specific mutation might be because it either did not
occur or was not detected when it did occur.
(i) Generation of amber mutations: Miller solves the problem that not all mutations are comparably
detectable by concentrating on the creation of only amber mutations in lacI, but this creates other
problems in terms of the number of mutation types that can be sought. For example, only contexts that
are one base from amber codon can be examined in this scheme. When one focuses on a small class of
mutations like this, hotspots can also skew the statistics and in Miller's assay, there are rII modification
sites that happen to yield ambers at high frequency (these are sites methylated by a restrictionmodification system that just so happen to lead to amber mutations occasionally). It was further
complicated in the analysis of NTG, which turns out to make context-dependent G/C to A/T transition
mutations that only rarely occur in the very contexts required to create amber mutations.
(ii) Generation of any lacI knockout: The Burns analysis of lacI mutations is nicer
(JMB194:385[87]), since it examines a much larger set of potential targets. Essentially they have
generated all possible single-base pair mutations in lacI and know which have a detectably mutant
phenotype (hard to believe, but true). Therefore the inability to find a particular mutation (known to give a
detectable phenotype when it occurs) in their analysis indicates that the mutation did not occur at a
detectable frequency. However, remember that they are limited in how many individual mutations they
can sequence, so they are just detecting the most common ones. LacI has also been used as a mutation
assay in a wide variety of eukaryotes, as reviewed in Genet148:1441[98].
(iii) Loss of lambda cI function: The lambda cI screen is a positive selection for forward mutations
because loss of cI allows cell survival under conditions when the phage lysogens would otherwise lyse
the cell. Unfortunately, it does not have nearly the database that the lacI system does. It has been used to
analyze the variety of forward mutations caused by Aflatoxin B1 (Genet120:863[88]). A similar screen for
survivors of lambda induction uses cells with phage DNA fragments that contain killing functions under the
control of a temperature-sensitive repressor. These are used to screen for mutagens that increase cell
survival by damaging these killing functions. The defectiveness of the prophage ensures that there is no
killing of cells from outside by produced phage (MGG222:17[90]).
(iv) Loss of lacZ function: Pathak and Temin published a neat protocol for determining (roughly)
the forward mutation rate of a retrovirus. They cloned lacZ into a shuttle vector that depended on viral
replication in vivo, then isolated the target DNA by cutting whole-cell DNA and isolating the right fragment
by binding to lac repressor. These fragments were then cloned into Ec and screened for color
(PNAS87:6019,6024[90]). The authors were concentrating on the mutation rate for a region and not
worrying about the more subtle context effects that the above assays addressed.
Localized approaches to mutagenesis. There are two general ways of altering the genome of an
organism. Either you can create a specific mutation through some sort of in vitro manipulation and
subsequently engineer that allele into the organism, or you can create a population of mutant organisms
and then screen for ones that have an interesting phenotype (and then analyze the genome to determine
the genetic basis for that interesting phenotype). The first approach is called reverse genetics or perhaps
genetic engineering. One advantage of this approach is that you are testing a specific hypothesis (i.e. an
R157T substitution should alter the activity of this enzyme in a certain way or the deletion of gene X
should have the following effect on the phenotype, reflecting the hypothesis that the gene product runs
reaction Y in the cell.). This can be particularly powerful in trying to understand the structural basis for
biochemical properties. The problem with the approach is that a negative result (i.e. that the created
mutation doesn't have a discernibly mutant phenotype) doesn't mean very much except that your
hypothesis was wrong or incomplete. Since such negative results are not readily interpretable, you cannot
afford too many of them, so your hypotheses have to have a decent chance of being correct in order to
justify the mutant construction and analysis. As a consequence, for this approach to be effective, you
really have to know a lot about the gene/protein first.
The other approach is the classical one (simply because it was only rather recently that any sort
of reverse genetics was technically possible, even with E. coli). It relies on one's ability to detect an
interesting phenotype in a population (the general issue of looking for phenotypes is covered in LT8). Now
the population analyzed might be unmutagenized (relying on the presence of a number of spontaneous
mutations for reasons described earlier in this LT) or mutagenized with UV or a chemical mutagen. These
create mutations with certain frequencies and with certain target preferences, and the causative mutation
(i.e. the mutation that actually causes the interesting phenotype, as opposed to the large number of
mutations that have coincidentally been caused by the mutagen but are not responsible for the interesting
phenotype) can typically be located genetically. However, there are certainly cases where neither
approach is quite right: you don't want to mutate the entire genome because you already know that a
certain gene or region of a gene is interesting, but you also don't know exactly which mutations to create.
Here are some ways of dealing with that situation, where you more-or-less randomly mutate a specific
region and then screen that population for interesting mutants.
In one of the first applications of localized mutagenesis, Hong and Ames chemically mutagenized
phage P22 lysates and then performed generalized transduction, selecting a given marker. By definition,
the section of DNA that recombined into the chromosome with the selected marker had been heavily
mutagenized, but the rest of the chromosome in the recipient was unaffected. Any mutants found after
such a transduction would therefore have the interesting mutation (and some uninteresting ones)
transductionally linked to the selected marker (PNAS68:3158[71]).
Localized mutagenesis by this method, or by the more focused in vitro versions (LT4), has the
following advantages: (i) The mutagenesis can be so extreme that you can safely assume that almost any
possible phenotype that could exist by mutation in the region will be detected. (ii) If mutagenesis is
efficient (in that most every examined isolate has one or more mutations) then one is willing to use difficult
screens of all survivors of the treatment, looking for subtle phenotypes. This would not be reasonable
following truly random mutagenesis, since the frequency of interesting mutants would simply be too low.
(iii) This approach allows one to focus on a single region, so that mutations in other more mutable regions
do not statistically swamp them out.
Even very heavy localized mutagenesis can only create so many mutations in a region without
also creating gibberish. That is, if one wants to test a variety of different amino acids at a given position in
a protein, the genetic code's redundancy means that many single base changes lead to substitutions of
residues that are similar or identical to the one originally encoded. If you mutagenize a gene so much that
you can potentially change all three bases in a given codon, then it is certain that the gene will otherwise
have so many mutations that the product is destroyed. So how can you test all possible
substitutions without destroying a gene? The answer is an even more localized approach, the
randomization of one or more codons, as described in LT4. By this approach, all possible nucleotide
sequences are created through chemical synthesis, moved into an appropriate genetic background and
then screened.
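A quick Python illustration of that redundancy point: the sketch below enumerates all nine single-base
neighbors of one codon and translates them with the standard code. The 64-character lookup string is
just the standard genetic code with the bases ordered TCAG; the function names are mine:

# Translate all nine single-base substitutions of a codon.
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

def translate(codon):
    i = (BASES.index(codon[0]) * 16 + BASES.index(codon[1]) * 4
         + BASES.index(codon[2]))
    return AA[i]

def neighbors(codon):
    for pos in range(3):
        for b in BASES:
            if b != codon[pos]:
                new = codon[:pos] + b + codon[pos + 1:]
                yield new, translate(new)

for c, aa in neighbors("CTG"):  # CTG = leucine
    print(c, aa)

Four of the nine single-base changes at this leucine codon still encode leucine, which is exactly why
randomizing whole codons, rather than piling on point mutations, is the way to test all possible
substitutions.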
Effects of mutations. Some mutations cause effects on the phenotype of the organism, but many do not,
and it is important to understand why this distinction should exist.
Mutations in translated regions. Base substitution mutations can still encode the wild-type amino acid
(sometimes termed a samesense mutation). Though such mutations will probably not affect the
phenotype, there are ways that they might: effects on DNA structure, sites for DNA modification, sites
involved in mRNA synthesis, sites necessary for mRNA stability or instability because of binding of
protective or degradative proteins, changes in mRNA structure, and effects on translation through
changes in codon choice or context.
A missense mutation is a base substitution that causes a non-wild-type amino acid to be inserted.
These can give rise to normally active proteins, inactive proteins, less active proteins, unstable proteins
(where the protein is prematurely degraded by proteases within the cell), conditionally active proteins
(where the conditional step is either at the level of protein stability or protein synthesis), hyperactive
proteins, proteins with new function, charge-altered proteins, proteins that are unprocessed (e.g. where
processing is required for insertion into or extrusion through the membrane), and finally longer proteins
(where the termination signal itself is altered). The typical result will be a product with the same or lower
activity than that of WT.
This is probably an appropriate place to comment a bit on temperature-conditional mutations,
which are a bit more complicated than they seem at first glance. The typical mental image is that of a
base substitution mutation that causes an inappropriate amino acid to be inserted at a site in a protein
such that the protein functions well at one temperature (the permissive temperature) and not at another
temperature (the non-permissive temperature). This certainly can be the case sometimes, though
remember that the effect on the protein is probably not all or none. That is, at the permissive
temperature, the protein has sufficient activity that it provides much better growth than at the nonpermissive temperature, where it obviously has less activity. However, the absolute amounts of activity
relative to that of wild type under either condition cannot be easily guessed at because that depends on a
host of other considerations as detailed in Detectability of mutations... in a few pages. There are,
however, other reasons for a temperature-sensitive phenotype. In one example, a protein might be
normally active at both temperatures, but its synthesis might be aberrant at the non-permissive
temperature, perhaps because it folds improperly. More strikingly, mutations that completely destroy a
product (like a frameshift, insertion or deletion below) can cause a temperature-sensitive phenotype if the
gene product is only critical at one temperature range. In other words, if a protein is necessary for growth
only at high temperature, then any mutation that significantly lowers the level of functional protein will
have a temperature-sensitive phenotype. This is particularly true in multi-cellular eukaryotes, presumably
because proper development involves many proteins and subtle temperature differences can greatly
increase the importance of many of these proteins. Ganetzky notes that many temperature-sensitive
Drosophila mutations are actually loss-of-function.
Nonsense mutations cause shortened polypeptides that typically have little or no activity. In
prokaryotes, these peptides tend to be highly unstable because of intracellular proteases that degrade
such improper proteins. In eukaryotes, however, the degradation systems seem less robust, and such
shortened proteins can often accumulate. In prokaryotes, nonsense mutations often display some polarity
onto transcriptionally downstream genes (see LT1), but the rarity of polycistronic transcripts in eukaryotes
makes this property largely irrelevant. In bacteria, UAA (a.k.a. ochre), UAG (a.k.a. amber), and UGA
(a.k.a. opal) are usually recognized as stop signals.
Frameshift mutations tend to have a profound effect on a protein product, including both loss of
function and instability. Their primary effect is to place all of the mRNA 3' of the mutation in the wrong
reading frame, so that all protein encoded by that region is junk. Since frameshift mutations put ribosomes
out of the proper reading frame, they often disclose nonsense codons that then result in polarity onto
genes that are downstream. However, do not make the mistake of thinking that such nonsense codons
are the reason that the protein from the mutated gene is dead - these codons merely signal the truncation
of a section of the protein that is already useless. Obviously it is possible that frameshift mutations near
the 3' end of a gene might allow some functional, albeit aberrant, product to be made.
(The above-described mutations are referred to collectively as point mutations, though the latter
term is sometimes erroneously used only for base substitution mutations.)
Small deletions typically give a complete loss of protein function and cause proteins to be
synthesized that are less stable. Approximately two-thirds of such small deletions (if internal to a gene)
also shift the reading frame and therefore have properties of frameshifts including some degree of polarity
onto downstream genes.
Mutations in untranslated regions. In transcribed but untranslated regions, mutations might still affect the
translation system by affecting the recognition signal for binding of ribosomes. Alternatively, they might
affect mRNA stability, attenuation, and, where the gene product is an RNA, mutations might cause a loss
of product function or cause improper processing or modification of the product.
Mutations in regions that are neither transcribed nor translated might affect either transcriptional
start or stop signals and thus the regulation of the region in question. It is also possible they might affect
the structure of the DNA, and therefore affect gene expression indirectly.
Large deletions, inversions, and duplications. Such mutations can span both translated and transcribed
regions and obviously can have a variety of effects. For example, they can generate transcript fusions or
gene fusions, they can be strongly polar, and deletions and inversions typically eliminate the products of
the affected genes. Large inversions also have the subtle effect of changing gene position with respect to
the origin of replication, thus changing the average copy number per cell. Large duplications can have
mutant phenotypes when the copy number of that particular region is critical. Duplications within a gene
typically eliminate product function.
Insertions, insertion sequences, and transposons. Mutations involving such mechanisms will destroy
whatever function is encoded by the affected region. They typically cause polarity due to their encoding of
transcription termination signals, but can occasionally provide new promoters reading into the flanking regions.
Numerology. numerology: 1. a system of occultism built around numbers...; 2. divination by numbers.
divination: ... 3. a successful guess; a clever conjecture [from Webster's Unabridged Dictionary].
One of the central themes of this course is that the frequency at which events occur makes
predictions as to their type. The frequency of detection of a particular phenotype in the population is a
product of the number of different types of mutations that cause that phenotype times the frequency of
each of those different mutations. To put it another way, phenotypes that appear frequently must occur by
any one of a large number of infrequent mutations or by one or more very frequent mutations. The
frequency of events also determines the mode of analysis necessary in order to find the desired mutants
(rare mutants require a selection or powerful screen to find them in the crowd). Ignoring hot spots, the
following list gives an approximate idea of the sorts of frequencies with which various sorts of events
occur. These numbers refer to the well-studied enteric bacteria, but they are probably not unreasonable
for other prokaryotes and even yeast, though there is a note of caution here. Note too that these are
operational numbers - what you would see if you grew up a culture and plated 10^8 cells. But these ignore
the fact that there is typically some growth on the selective plate, because the bacteria are carrying along
goodies from the previous non-selective medium. Thus the true frequency might well be several fold
rarer, depending on the particular mutant.
very frequent:
    loss of various constructed plasmids (note 1) - 10^-2 - 10^-5
    loss of tandem duplications - 10^-1 - 10^-2 (by recA gene product)
    occurrence of a duplication of a given region - 10^-3
    site-specific recombination events - 10^-1 - 10^-2
frequent:
    spontaneous knockout of a gene (note 2) - 10^-4 - 10^-5
detectable:
    typical spontaneous reversion frequency of a frameshift,
        reversion of a missense or nonsense mutation - 10^-6 - 10^-8
    precise deletion of a transposon - 10^-6 - 10^-9
    loss of most natural plasmids - <10^-8
    spontaneous deletions - 10^-3 - 10^-10 (depends on the precise region to be deleted)
    occurrence of any specific base substitution - ~10^-8
Notes: Note 1: Constructed refers to plasmids that have been manipulated to make them smaller and easier to
handle, which eliminates many of the systems used for proper segregation. Most naturally occurring
plasmids are extremely stable. Note 2: Spontaneous knockout can arise by deletions (~10^-5), base
substitutions that destroy the product such as most nonsense mutations and some missense (~10^-6) and
frameshift mutations (~10^-6).
As a reminder, this table refers to the frequency of mutations in a population and not to mutant
phenotypes. Whether or not different mutations detectably alter the phenotype is the topic of the
discussion below. It is also a bit unclear how general these numbers are for prokaryotic organisms. For
obvious technical reasons, scientists abhor unstable phenotypes, so it is quite possible that the set of lab
strains we routinely use has been selected in the past for phenotypic stability. To the extent that this is
true, dramatic phenotype instability might be more the norm among non-lab strains than we fully
appreciate (thanks to Mark Martin for this insight).
Obviously, the above table is a bit deceptive. For example, some mutations due to insertion
sequences revert at a reasonable frequency and some scarcely revert at all. On the other hand, if an
event is occurring at a 10^-2 frequency, you may safely assume that it does not involve the occurrence of
some very particular base substitution mutation since such events are simply too rare. It must involve a
loss of a plasmid or something to do with duplications since these are the events in bacteria that can
occur at that frequency. Similarly, if a phenotype is occurring at very low frequency, then you may assume
it is not arising by some reasonably common event like the loss of a plasmid or the knockout of a gene,
but rather by some very specific base substitution or base change of some sort. The frequency of an
event reflects both the likelihood of the mutational mechanism as well as the target size for events that
cause the desired phenotype.
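The arithmetic behind this reasoning is trivial, but it is worth doing once. A back-of-the-envelope
sketch in Python, using illustrative frequencies taken from the table above:

    # Expected colony counts when plating 10^8 cells, for events occurring
    # at various frequencies (values are illustrative, from the table above).
    cells_plated = 1e8
    event_frequencies = {
        "loss of a tandem duplication":   1e-2,
        "spontaneous knockout of a gene": 1e-5,
        "reversion of a point mutation":  1e-7,
        "one specific base substitution": 1e-8,
    }

    for event, freq in event_frequencies.items():
        print(f"{event:32s} -> ~{cells_plated * freq:,.0f} colonies expected")
    # Read backwards, this is the argument in the text: ~10^6 colonies from
    # 10^8 plated cells cannot be one specific base substitution (~1 expected).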
Detectability of mutations is a function of their effect on the phenotype. All of this discussion has
focused on mutations (i.e. alterations in the primary structure of DNA), but what you typically look for are
mutants (cells with altered phenotypes that arise because of those mutations). The likelihood of a given
mutant being detected (with or without mutagenesis) is a function of a variety of factors, all of which reflect
the fact that one detects only those mutants that display a sufficiently altered phenotype. Other than
duplications, the most common type of mutations are base substitutions. This is true for both spontaneous
mutations as well as those following mutagenesis by a base analog, alkylating mutagen or UV. An
inspection of the genetic code shows that the majority of base substitutions cause missense mutations. In
the case where you are mutating a wild-type gene, the product of the mutated gene would typically have
the same or lower activity than that of the wild type, but clearly that activity will not necessarily be zero.
It is particularly common, for example, that temperature-conditional mutants have significant amounts of
activity under the non-permissive condition (and indeed less activity than wild type at the permissive
condition). The following arguments suggest the ways in which the typical partial loss of a product function
would affect the phenotype and therefore the detectability of the mutant.
Toughness of the gene product. If a protein serves a strictly structural role in the cell, there may be no
one region critical for function. Thus missense mutations will only occasionally have a sufficiently
detrimental effect on function to be detectable. The only mutations detected would be those with drastic
consequences for the function of the protein. By a similar line of reasoning, there are always regions of a
product that are more critical than others: a missense mutation affecting an amino acid at the active site
will be much more likely to have a discernibly mutant phenotype than one that affects a non-critical region
of the product. Therefore more random base substitution mutations will be detected in regions of the gene
encoding critical portions of the gene product, though presumably mutations will be generated in all
regions with roughly equal frequency.
Metabolism of the system. If a wild-type cell produces a protein at a level just barely sufficient for good
growth, then any loss of that activity in a mutant would likely be detectable because the protein activity
would become rate-limiting for the cell's growth. On the other hand, a product produced in large excess of
growth requirements would need to be damaged (in a mutant) much more severely for it to become
growth-limiting. As an example, assume there are two steps in arginine biosynthesis (there are more, of
course), ArgA and ArgB, and that ArgA has ten times the activity (the number of protein molecules times
the specific activity of each protein molecule) of ArgB. A missense mutation in argA that lowered ArgA
activity by 80% would still be undetected since the activity of ArgA would not be sufficiently low to cause a
discernibly mutant phenotype. A missense mutation in argB causing 80% less ArgB activity would
probably be detected (assuming, of course, this effect was sufficient to make the arginine supply limiting
for growth).
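The ArgA/ArgB argument in the form of a minimal Python sketch (all of the numbers are the
hypothetical ones used above, in arbitrary units):

    # A mutation is detectable only if it drives the damaged step below the
    # flux required for growth. The growth requirement is set to 1.0.
    required_flux = 1.0
    activities = {"ArgA": 10.0,   # ten-fold excess over the requirement
                  "ArgB": 1.0}    # just sufficient

    def phenotype(enzyme, fraction_lost):
        remaining = activities[enzyme] * (1 - fraction_lost)
        return "Arg- (detectable)" if remaining < required_flux else "Arg+ (undetected)"

    print("argA, 80% loss:", phenotype("ArgA", 0.80))   # 2.0 units left -> Arg+
    print("argB, 80% loss:", phenotype("ArgB", 0.80))   # 0.2 units left -> Arg-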
Regulation of the system. There are systems of genes that are typically expressed at a modest level but
can, through appropriate regulation, be expressed at much higher levels if the situation demands
(anabolic pathways are often examples of such systems). There are other systems in the cell that are
expressed at maximum level when they are turned on, but are otherwise expressed at a very low level
(examples tend to be catabolic systems). Expression in these cases refers not only to the amount of RNA
transcription of the region (the typical meaning), but also to the effects of post-transcriptional regulation of
mRNA translatability or stability. Highly tunable regulatory systems can interfere with the detection of
mutations in the genes that are regulated, as described below. However, please understand that such
regulatory possibilities do not exist because the cell plans to be mutated, but rather because there are certain
physiological conditions where elevated expression is useful to the cell. It is simply a coincidence that the
regulation affects mutant detection.
As an example of regulation affecting the detectability of mutations, E. coli, even growing on
minimal medium where it needs to synthesize its own histidine, expresses its his genes at only about 5%
of maximal derepression levels. A his mutation that causes a 90% reduction in activity of a his enzyme
(even in the limiting step in the pathway) will be undetectable, because the cell is still able to turn up the
expression of the mutated gene (in response to its perceived histidine shortage) to a level sufficient to
produce enough total enzyme activity (of the mutated gene's product) to make sufficient histidine for
growth. In the his regulon of E. coli, therefore, the only mutations detected as His- are those with a drastic
effect on a gene product.
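Again, the numbers are simple enough to lay out explicitly. A sketch in Python of the his example,
using the 20-fold headroom implied by the 5% figure above:

    # Expression on minimal medium is ~5% of maximal derepression, i.e. the
    # cell holds ~20-fold regulatory headroom over its normal requirement.
    headroom = 20.0            # maximal derepression / normal expression
    residual_activity = 0.10   # the mutant enzyme keeps 10% of its activity

    capacity = headroom * residual_activity   # relative to the growth requirement
    verdict = "His+ (mutation undetectable)" if capacity >= 1 else "His- (detectable)"
    print(f"fully derepressed capacity = {capacity:.1f}x requirement -> {verdict}")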
A counter example is the E. coli lac operon, which encodes the gene products involved in lactose
utilization. This system is fully expressed when lactose is the best available carbon source and any
diminution in activity of the pathway results in slower growth on lactose. Not surprisingly, missense
mutations are frequently isolated in lac (greater than 95% of detected Lac- mutations) because they are
often detectable, while in his they are rather rare (about 10% of detected His- mutations).

Essentiality of the system. If a gene product is essential for cell growth then a large fraction of mutations
in that gene will cause the cell to die, and strains bearing such mutations will not be found. The only
mutations that will be detected in such genes will be either conditional (e.g. they destroy the essential
function only at certain temperatures) or partial loss of function.
With fair frequency, when we actually create non-polar loss-of-function mutations in
unmutagenized ORFs, we find little or no alteration of the phenotype. Obviously this sometimes reflects
our failure to examine the phenotype under the proper conditions (i.e. a lacZ mutation only has a mutant
phenotype when lactose is the best carbon source), but there are other reasons as well. It is becoming
increasingly clear that a common reason for such observations is that there is a fair amount of functional
redundancy in the cell, so that the complete loss of a given protein is replaced in part by the function of
another. I do not believe that this redundancy reflects back-up systems on the cell's part, but rather the
fact that most genes have been created through gene duplication, so that there are often rather similar
proteins already extant in the cell. Not surprisingly, these replacements tend to work less well than the
original (for the specific function encoded by the original one), so that we often detect a leaky phenotype
in the mutant. I believe that if we looked carefully at the specific biochemical reactions, we would find that
the replacements function much less well. The general issue of genetic redundancy is addressed in
ASM2,p.2151[96].
Expression of the phenotype. A mutation may not express a mutant phenotype immediately upon its
creation in the genome, either because:
(i) In the case of a mutation causing a loss of function, it may take some time for the product of
the (pre-mutated) gene to be diluted out enough for the phenotype to become apparent, a phenomenon
referred to as phenotypic lag. For example, a cell that has just acquired a mutation in hisD (which
encodes the final enzymatic step in the biosynthesis of histidine) will still have a significant amount of both
functional hisD gene product and a pool of histidine that this protein has synthesized. It is only when the
amount of functional hisD gene product per cell gets too low (because of cell growth and therefore
partitioning of the functional hisD gene product to more and more progeny cells) to maintain the level of
histidine necessary for growth that the genotypic mutant displays its mutant phenotype. Part of the issue
is that normal proteins tend to be rather stable in bacterial cells.
(ii) In the case of a mutation causing an acquisition of function, it may take some time for the
product of the newly mutated gene to accumulate, through new synthesis, to a level high enough to
display the new phenotype. An example of this might be the reversion of a hisD mutation to a his+
genotype (a re-acquisition of function). A His+ phenotype will be detectable only when sufficient functional
hisD protein has been accumulated to satisfy the cell's requirement for histidine.
In both examples, the time required will be a function of (a) the level of gene expression before
the mutation took place, (b) the level of gene expression after the mutation took place, (c) the amount of
function in the product of the mutated gene, (d) the level of product required for the appropriate phenotype
under the conditions used, (e) the growth rate of the cell, and (f) the instability of the gene product.
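To put example (i) in numbers, here is a minimal Python sketch of phenotypic lag by dilution. The
starting level and the threshold are invented, and the sketch assumes a stable protein with no new
synthesis or degradation:

    # After a hisD null mutation, pre-existing functional HisD is stable but
    # is halved at each division. Count generations until it limits growth.
    protein_per_cell = 100.0    # functional HisD at the moment of mutation
    threshold = 10.0            # below this, histidine supply limits growth

    generations = 0
    while protein_per_cell >= threshold:
        protein_per_cell /= 2.0          # dilution by growth only
        generations += 1

    print(f"mutant phenotype appears after ~{generations} generations")   # ~4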
Measuring levels of macromolecules and metabolites in vivo. The several preceding sections are
necessarily hand-wavy. They refer to relative levels of a compound compared to the level of that
compound necessary for good growth. Ignoring for a moment that this is a comparison of two
concentrations that themselves fluctuate under different conditions and that the second is a complex
value based on everything else occurring in the cell, what can we say with confidence about the levels of
any single molecule in the cell? This is a hard problem, but the tools are now being developed, even if
they are not perfected. As an example, people are fairly rapidly measuring the concentration of all
measurable metabolites in the cell (NatChemBiol5:593[09]). One trick here is how to harvest cells so
rapidly that you are confident that you have not perturbed these rapidly fluctuating pools; the answer
largely involves the use of very small numbers of cells. Note that this still has the problem that you
measure total pools in the cell and you must ignore the issue, important for many metabolites, that much
of that pool is bound by macromolecules at varying affinity, so the notion of a free concentration is
problematic.
The related challenge is measuring the levels of specific proteins and specific mRNAs. A
technically dazzling approach has been described wherein all (possible) proteins are tagged with YFP and
then both protein and mRNA levels are analyzed simultaneously in either a population or within a single
cell (Sci329:533[10]). Having said all this, if you read the paper critically, you will see that there are some
very serious issues to resolve, many of which the authors happily ignore. But this is certainly a place
where technology will solve most of the problems fairly soon.
The relevance of all this is most clear in terms of "systems biology" (a term, but not a field, that I
am afraid I dislike). If one is to model metabolism in the cell, then one certainly needs to know with
confidence the real concentrations of both metabolites and proteins.


Failure to detect a particular mutant type. There are three reasons for not seeing a mutant class: (i) the
mutation at the desired gene or site did not occur in your population (you did not alter the genotype
appropriately); (ii) you did not recognize the generated mutant because it had an unexpected phenotype;
or (iii) you did not look hard enough.
Siblings/independence/non-identicalness. If you are going to go to the trouble of biochemically analyzing
the mutants that you find in a selection or enrichment, it would be a disappointment to analyze the
identical mutant 100 times. How might such a situation occur? Any time your enrichment scheme allows
significant periods of growth following mutagenesis or selection, you will tend to get a number of progeny
from any cell whose phenotype satisfies the hunt. Such identical progeny are termed siblings and by
definition they are genotypically identical. This problem can also arise if there are spontaneous mutants,
whose phenotype satisfies the particular selection or enrichment, pre-existing in the culture prior to the
mutagenic treatment.
What can be done about the problem? There are essentially two solutions: The first is to demand
independence and the second is to prove non-identicalness. When two mutants are deemed independent
it is because they "had to have arisen separately or independently." This does not guarantee that they
will be non-identical but it increases the likelihood that they will be because they certainly cannot be
siblings. Non-identical strains are those that have been shown to be genotypically or phenotypically
different from each other. If, for example, you perform separate enrichments from each of two single-colony
isolates, then mutants arising from each enrichment will be independent of each other and will very
likely be non-identical. Obviously, if one gets several mutants from the same selection or enrichment (and
therefore not independent), but shows them to be non-identical, then you have achieved the desired
result. On the other hand, it may well be very time-consuming to prove non-identicalness. What you want
are genotypically different mutants for further analysis, and independence provides some assurance of
that. However, if you do have a decent genetic or biochemical screen to assay non-identicalness, you
may want to analyze a number of non-independent isolates from the same enrichment. If you do not want
to do that, stick with independent mutants.
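The scale of the sibling problem is easy to see with a little arithmetic (the numbers below are
hypothetical):

    # One mutational event followed by g further generations of outgrowth
    # yields 2**g genotypically identical progeny (siblings).
    mutational_events = 3      # independent events early in the culture
    generations_after = 5      # growth allowed after the events occur

    total_mutants = mutational_events * 2 ** generations_after
    print(f"{total_mutants} mutant cells, but only {mutational_events} independent genotypes")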
Deleterious phenotypes. Any time a genotype causes a deleterious growth phenotype, there will be a
selection for compensatory (suppressor) mutations under those conditions. Many mutations confer poor
growth under all conditions and can therefore never be grown without a concern for changes in genotype.
Temperature-sensitive mutations often display a non-wild-type phenotype even at the permissive
temperature, since they often affect essential functions (a common reason for resorting to temperature-sensitive mutations in the first place). This selection for suppressor mutations that modify the deleterious
phenotype is a problem since your strain's behavior will then be due to the combination of mutations,
rendering it impossible to correctly determine the biological role of the gene product known to be affected
by the initial mutation. As an example, topA mutants lack topoisomerase I and are very sick, but fast-growing derivatives readily appear with mutations in gyrA (gyrase). Analysis of such a double mutant
would be (and was) very deceptive when interpreted only in terms of the topA mutation. Another example
of this phenomenon is the fact that dam mutants of Ec accumulate mutator mutations. Analysis of a
particular dam mutant, with the T4 version of the gene supplied in trans, appeared to show that the
mutator phenotype was dominant, as though the Ec dam gene product had another specific function
(JBact172:2812[86]). That result now appears to be due to accumulation of other mutations in the strain
background that caused the mutator effect and these were the cause of the dominant mutator effect
(JBact172:2812[90]).
There is no perfect solution to this dilemma of suppressor mutations, but it is best to (i) keep the
culture frozen (to minimize growth that selects for suppressors); (ii) always go back to the original frozen
stock, not to a subculture (where growth has already increased the frequency of suppressors); and (iii)
when it is critical, cross the mutant region into a new background prior to an important analysis (to remove
that mutation from any suppressors that might have accumulated in the original strain background).
Viable-but-non-culturable (VBNC) states. This doesn't exactly go in this section, but it deserves a mention
somewhere. It is not exactly genetics, but it does affect our ability to work with some bacteria. VBNC
refers to the situation where bacteria are alive but not culturable under standard conditions and it was first
suggested for Vibrio in 1982 (Micro.Ecol.8:313[82]). The notion was attractive to many people, since it
appeared to be a matter of technique - since viability is more or less a function of our ability to detect
growth, one could simply doubt the reliability of the claims of non-culturability. The situation changed
significantly when it was shown for Micrococcus luteus that cells could be reliably resuscitated by the
addition of a protein secreted from M. luteus cultures, and it was termed the resuscitation-promoting factor
(Rpf) (PNAS86:8916[98]). This protein turns out to have muralytic activity, but it remains unclear why
peptidoglycan hydrolysis leads to resuscitation.
Summary of mutant detection. Because of the physiology, biochemistry and regulation of any metabolic
pathway, there are a finite number of different genotypic alterations that produce a desired phenotype.
The actual frequency at which mutations are seen, with or without mutagenesis, is quite clearly a complex
function but several rules apply:
(i) There is a much larger target for loss of a product's function than there is for acquisition or
alteration of its function. For example, let's say that there are two ways to become resistant to an
amino-acid analog: altering the essential target enzyme or eliminating a non-essential permease. The latter will
occur 10^2-10^3 times more frequently than the former.
(ii) Due to their effect on the gene product, some mutation types nearly always have a detectable
phenotype: deletions, nonsense, frameshift, and insertion mutations (this of course assumes that the
complete loss of the gene product causes a discernible phenotype.) The mutagens that induce mutations
exclusively of this type (frameshift mutagens and transposons) thus produce mutations solely on the basis
of target size.
(iii) The frequency with which mutant classes occur provides some information about the system
that they are affecting. For example, a mutant phenotype that is detected relatively frequently must be
arising due to a frequent mutational event like the knockout of a gene product.
As stated in the Definitions sections of the Preface, remember that the developmental and
morphological complexity of eukaryotes makes mutant detection rather different than in prokaryotes.
Generally, much more subtle alterations of gene product function give a detectable phenotype in
eukaryotes, simply because detection itself is easier. This does not completely negate the arguments
above when thinking about eukaryotes, but it means that there are some important differences to
consider.

607 Lecture Topic 3............ REGULATION


Wer die Wahl hat, hat die Qual. German proverb (Who has the choice, has the agony.)
What is a regulatory factor? For example, IHF is a protein that is necessary for many regulatory
responses, but is it a regulatory protein? In a similar vein, RpoN is a factor necessary for transcription of
certain operons, but since neither its expression nor activity seems to be regulated, is it regulatory? (Note
that the level of many sigma factors certainly is regulated.) I would argue that only proteins that are
normally involved in the decision to change the level of expression (or activity, etc.) are regulatory, and
not simply all proteins that participate in the regulation. This view is not universally shared, however.
The ASM2 books have a vast amount of information on the subject of regulation, including
transcriptional regulation (p.1232, 1246), attenuation (p.1263), negative control (p. 2187), and positive
control (p.1300). This is then followed by 250 pages of examples in 12 different regulatory systems.
I want to start the chapter with some themes that are not specific to a particular system or
mechanism, but are found in many situations. Most of these are rarely discussed per se, but I think that
there is validity to the arguments nevertheless.
Themes in regulation.
On the notion of simplicity being correct. There is a tradition in science of seeking the simplest
hypotheses to test. This notion actually came from the Middle Ages, when William of Occam made the
proposal that, in terms of explanations, things should not be multiplied unless necessary (Entia non sunt
multiplicanda praeter necessitatem). Another way of viewing this is that it is always possible to come up
with more complicated explanations of something, but there are relatively fewer simple explanations. In a
sense, then, a simple explanation is more general because if it is falsified, then there are a large number
of more complicated derivatives of it that are also falsified. So we have this prejudice toward simple
hypotheses. But here's the dilemma - no biological system is ever really simple. The more we study it, the
more its complications are revealed. So if systems are necessarily complicated, are we making a mistake
to seek simple hypotheses?
I think that the solution to this paradox is the following. Each small part of a biological system
consists of the interaction of two macromolecules, or of a macromolecule and one or more small molecules. As a
consequence, each part is relatively simple by itself. However, biological processes are typically made up
of multiple sets of such interactions, so the process as a whole is almost always complicated. Often these
additional levels of interaction, such as allostery, do not change the fundamental process, but modulate it
in some way. So the overall logic of any biological process should be amenable to a simple model, but the
actual details of its behavior, with all of the modulatory features, are often complicated.
Regulatory processes are particularly prone to the addition of modulatory features that tweak the
systems just a bit, or modify its properties in some cases. All of these complications have presumably
arisen in evolution because the cost of their creation and maintenance is low, so if they provide any small
advantage, then they will be retained. In any event, it is clear that there is no selection for simplicity in
biological systems.
Integration of multiple signals. Having just discussed the layers of complexity in many biological systems,
it is appropriate to discuss integration, because the role of multiple layers is often to integrate dissimilar
signals. Integrative regulatory systems are those that use multiple inputs in determining the proper
response. It can be argued that this is the case for any multiply regulated operon, e.g. the hut operon
(histidine utilization), regulated by both nitrogen limitation and the presence of histidine. Typically these
systems involve a cascade of regulatory factors that sense different signals and interact with each other in
response. Some cascades, like the gln/ntr system for regulation of nitrogen levels in the cell
(ASM,p.1318[87]), have a goal of fine tuning the function of a handful of critical enzymes, as well as
tuning transcriptional regulation. Other cascades, like the lambda regulatory system are designed to
accentuate the regulatory decision, to yield a yes or no decision.
Again, the theoretical description of the implications of overlapping regulatory systems has been
described in ASM2,1310[96]. In that article, they suggest several useful terms: regulons, regions that
share a common regulator; stimulons, regions that respond to a common environmental stimulus; and
modulons, groups of regulons that respond to a single (global) regulator. Overlapping integrated
regulatory systems are a partial solution for the noise caused by the uneven partitioning of important
regulatory molecules during division (TIG15:65[99]).
Two-component regulatory systems. (TIBS26:369[01]) This paradigm of bacterial regulation is used by a
wide range of homologous, but not identical, regulatory systems. In general they involve a sensor protein,
often membrane-associated, that detects some chemical signal, autophosphorylates, and transfers
the phosphate to a regulatory factor, causing it to become active. This regulator might act at any level,
though transcriptional activation is fairly common. In different cases, other protein factors can be involved
in any of these steps. An interesting variation on this is the extent to which these similar systems may
cross-regulate, with the sensor of one affecting the regulator of another. It also appears that certain small
phosphate-containing molecules, when present at significant levels, can bypass the need for certain of the
sensor molecules by activating the regulator directly.
In a clever variation on this theme, there are phosphorelay systems that control sporulation in Bs,
where two additional factors are interposed between the typical players in a two-component system.
These receive and transmit the phosphorylation between the sensor and regulator, but also provide
regulatory opportunities for more sensing input.
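The logic of the basic (non-relay) version is simple enough to caricature in a few lines of Python.
The class and attribute names here are invented, and real systems add phosphatase activities,
cross-talk, and the like:

    # Minimal caricature of two-component signaling: the sensor kinase
    # autophosphorylates on seeing its signal, then transfers the
    # phosphoryl group to its cognate response regulator.
    class Sensor:
        def __init__(self):
            self.phosphorylated = False
        def sense(self, signal_present):
            self.phosphorylated = signal_present          # autophosphorylation

    class Regulator:
        def __init__(self):
            self.active = False
        def receive_from(self, sensor):
            if sensor.phosphorylated:                     # phosphotransfer
                self.active, sensor.phosphorylated = True, False

    sensor, regulator = Sensor(), Regulator()
    sensor.sense(signal_present=True)
    regulator.receive_from(sensor)
    print("regulator active:", regulator.active)          # True -> e.g. transcription on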
Stochastic events in a cell vs. populations. I would like to make three different points here: (i) Bacteria are
individually very small and therefore a number of interesting molecules are present in very small numbers,
so that there is a high probability that two daughter cells will differ transiently with respect to a number of
physiologically interesting properties. It would be difficult for a cell to maintain a precise level of any
macromolecule, especially ones present at low level such as many regulatory proteins (some repressors
are thought to be present at approximately 10 molecules per cell), but the situation is exacerbated upon
cell division. Given the importance of having a threshold level of a large number of important regulatory
molecules, it is likely that there will be at least some interesting variation between any two members of the
population. Some of the bases for this variation and the implications are discussed in ASM2,1640[96] and
the general problem for regulation is described in TIG15:65[99].
(ii) By examining populations of bacteria, we smooth out these differences between cells, but the
physiologically different individuals can sometimes affect the observations in a manner disproportionate to
their number. While some (most?) of these fluctuations are transient, it still means that at any given time,
a fraction of the population is poised to behave in a different way. For example, the response of a culture
to IPTG shows a very rapid appearance of a low level of β-gal activity followed by a slower appearance of
a much higher level. Apparently the first response reflects those cells that have spontaneously induced
lac and have therefore accumulated sufficient permease to rapidly respond to the added IPTG. The rest of
the culture must bootstrap its way up more slowly. (A trace of IPTG enters the cell and causes a bit of
induction, which increases the level of the permease, which brings in more IPTG.) Note that some of
these fluctuations can be permanent: If a cell happens to transiently lose all copies of the repressor for
their prophage, they lyse, and this is probably the way that spontaneous lysis occurs. The frequency of
10^-5 for such induction is probably instructive for this entire class of situations.
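A minimal sketch in Python of the partitioning arithmetic, assuming each of n copies assorts
independently to one of the two daughters:

    # If a cell carries n copies of a repressor and each copy goes to either
    # daughter with probability 1/2, what fraction of daughters gets none?
    for n in (5, 10, 20):
        p_zero = 0.5 ** n
        print(f"n = {n:2d} copies: P(a daughter inherits zero) = {p_zero:.1e}")
    # At n = 10 this is ~1e-3; with the further requirement that the cell
    # fail to resynthesize repressor in time, frequencies in the 1e-5 range
    # for spontaneous induction are not unreasonable.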
(iii) It is a very common student error to confuse frequencies of events within a cell with
frequencies of events within a population. The following examples might clarify the point: Let's say, for
example, that it takes a minimum of 50% of wild-type levels of RNA polymerase for a cell to grow. A
mutant with only 40% of wild-type levels does not grow some of the time; rather, it doesn't grow at all. A
different mutant with 60% of wild-type levels doesn't grow just most of the time; it grows all the time (almost - see below), though its rate of growth might not be quite that of wild type. When we talk about frequencies
within a population we are talking about the frequencies with which there will be cells that have sufficient
function to display enough growth for us to see. In the case of the mutants with 40% of wild-type RNA
polymerase activity, the only growth would be the result of the suppressor mutations that appear. Now it is
true that a population that is close to growing, say one with 40% of wild-type polymerase levels, might
revert more frequently than one with 5% of the wild-type levels because there is a larger set of mutations
that will raise the level of activity just enough, but that frequency reflects the mutation frequency at which
suppressor mutations occur.
Similarly, if there is a mutation that is 50% polar, it does NOT mean that it is only polar in 50% of
the cells, nor that only 50% of the population has a problem. Instead, every cell in the population has 50%
of the normal level of expression of the downstream gene and will show about the same growth phenotype,
which might be wild-type or mutant depending on the absolute level of activity required.
Having said that, now I will complicate things a bit. A cell that is sick (say, it has just barely the
minimum level of RNA polymerase for growth) will grow poorly and it might have a poor efficiency of
plating. That is, every time you grow up this culture and plate for single colonies, you will actually see
fewer than you expect based on the culture turbidity (we are referring to per cents here and not orders of
magnitude). This certainly reflects the fact that bacterial cells are so small that they display a rather
stochastic phenomenon. There is a statistical likelihood that the partitioning of the small populations of
important molecules will perturb their ability to survive and this is even more important to mutants that
already have a growth problem.
The importance of operons. Why is it that prokaryotes have operons and eukaryotes do not? It seems
apparent that there is not a strong selection for or against them in prokaryotes because genes found in
operons in some organisms are not so organized in others (PLoS Genet2:e96[06]). They certainly do
allow for molecularly simpler regulation (i.e. one regulatory system for several genes), and their
existence also makes it easier to have multiple regulatory controls over a set of genes. They might
sometimes be the result of horizontal gene transfer, either because the genes traveled together or
because they became expressed in the new host by the creation of a new operon under the control of a
pre-existing promoter. However, the record of apparent operon creation and loss (ibid) argues that they
are of marginal importance.
What is an appropriate signal for regulation? Regulation should be a reflection of the functionality of the
regulated system relative to the organism's requirements. This is not as direct as assaying the presence
of the gene product, but it is more to the point. For example, when the gene product's function is to
produce a metabolite, it makes sense to have the gene product's activity and expression reflect the level of
that metabolite relative to the cell's needs. When the gene product has some non-biosynthetic function
(RNase, DNA binding, etc), the cell tries to use this function to autoregulate the gene product's level.
However, how do you regulate the level of synthesis of an excreted product or an insoluble one (the case
for some secondary metabolites)? In other words, how does a cell decide if metabolic function is
appropriate when the metabolic product cannot really be measured? There is also the very odd case of
induction of lac where the inducer is not the apparent substrate lactose, but a transglycosylation product
of lactose: allolactose (JMB69:397[72]) or perhaps glycerol-β-D-galactopyranoside (ASM2:1642[96]).
Again, it is not usually the amount of a metabolic product that is important, but its amount relative to that
which is optimum for growth at that time.
The relative, rather than absolute, level of a compound is usually important. This notion comes up in two
rather different arguments. In the first, balance between certain things is often crucial to the cell. For
example, a cell is nitrogen-limited when nitrogen levels are proportionately low compared to everything
else necessary for growth, not because there is any particular level of nitrogen. For example, you can
often induce a nitrogen starvation response by the addition of excess carbon because this makes nitrogen
limiting.
In a rather different way, the cell often uses relative amounts or rates of things to make an
"absolute" determination, essentially because one of the two things to be compared is a constant.

49

Attenuation, for example, compares transcription rates to those of translation. In the pyrBI case,
translation is constant and transcription rate varies, while in the amino acid and antibiotic resistance
systems, it's the reverse.
Systems where components take part in >1 level of regulation. There are numerous examples of proteins
affecting both transcriptional and post-translational regulation of a given system, such as nifL and the nif
genes. However, each of the multiple layers responds somewhat differently to a given stimulus. If layers
are there to give maximum flexibility to the degree of regulation (the ratio of off to on) or to the timing
(post-translational regulation is inherently faster than transcriptional regulation), then the effect of stimuli
might be the same. Alternatively, if the layers are designed to allow sensory input from a variety of
metabolic sources, then the responses to a given compound will almost certainly be different.
On the timing/speed of regulation. Some decisions need to be made slowly and carefully (you don't go
into sporulation or begin nitrogen fixation unless you are really sure that these major commitments are
really necessary), but others need to be made quite rapidly. Phage, especially complex phage like T4, do
not have the time to modulate transcription through the sorts of feedback mechanisms used in many
bacterial regulons. Modulating expression by this mechanism is too slow for such an organism. This might
explain the apparently frequent use of post-transcriptional regulation by phage.
On reversibility of regulation. There are different degrees of reversibility of a regulatory decision, but some
switches are truly irreversible. An example is the regulated deletion of a region that allows proper nif
expression in certain cyanobacteria, but only in a specialized cell type, heterocysts, that will never
replicate (ARG26:113[92]). An irreversible decision might also result from the generation of a signal that
cannot be counteracted, or a signal that eliminates competitive signals (lambda cI regulation).
On the optimal regulatory system. While I feel that there is no optimal mechanism for regulation of a given
regulon, there are theoretical arguments that certain aspects of regulation (i.e. whether it is positive or
negative control) do depend on observable traits of the system, such as the basal level of expression, the
induction ratio, and the rate of change between low and high expression. The simplistic summary of a
complicated argument is that genes tend to be positively regulated when their typical expression is near
the upper end of their range, and negatively regulated when typically poised near the lower end. For a full
argument, see ASM2,1310[96].
Also be careful not to confuse optimal with maximal. A given DNA-binding protein typically has a
high affinity for its specific DNA sequence, but in at least some cases, variants of that protein can be
created with 10 or even 100 times greater affinity. In doing this, one has created a protein with greater,
and possibly maximal affinity, but it will certainly not work properly in vivo. It will presumably bind more
than it should or at times when it shouldn't. The evolved system is actually tuned to the optimal level of
function in the context of the rest of the cell.
Organisms do not plan on becoming mutants. Sometimes one hears the argument that "an organism has
a given pathway or regulatory scheme in case another set of genes is mutationally damaged." This almost
certainly cannot be the case, since the possibility of random mutations in any gene does not justify the
cost of carrying around a backup set (which itself can be mutationally affected). For example, having the
capability to derepress the his operon like crazy (remember it can turn transcription on 20-fold above the
level necessary for growth on minimal medium) might mask missense mutations, but that isn't the reason
that this reserve capability is built in. Instead such flexibility is there to provide a response to some
extreme physiological situations that are occasionally encountered.
What is the dynamic response range for a given regulatory mechanism? For any sensing system, there
will be a range in which you will get a more or less linear response to a change in concentration of the
signal. A single substrate/enzyme interaction can only respond over a rather limited concentration range
(1-2 orders of magnitude), simply because 2 orders of magnitude takes you from 1% on to 99% on. What
do you do if the desired degree of fluctuation is greater than that? If the physiological range the cell needs
is greater than 100-fold, then multiple layers of regulation are typically used with appropriately different
affinity for the molecule being sensed.
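The "two orders of magnitude" figure falls straight out of the hyperbolic binding equation, as a
couple of lines of Python show:

    # Fractional occupancy of a simple one-site interaction: theta = S/(K+S).
    K = 1.0                              # dissociation constant, arbitrary units
    for S in (0.1 * K, K, 10 * K):
        theta = S / (K + S)
        print(f"S = {S:5.2f} x K: occupancy = {theta:.0%}")
    # 0.1K -> ~9% and 10K -> ~91%: a single hyperbolic sensor spans only
    # ~100-fold of signal; a wider physiological range needs multiple
    # sensors of staggered affinity.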
Responses to rate of change. Most regulatory schemes assay the accumulated levels of proteins or
metabolites, but some systems need to assay the relative changes in such levels. However, measuring
change is extremely difficult for a microbe without a memory. The classic example of this situation is the che
system (ASM2,123[96]) and in a sense the solution really does involve a primitive memory system that
keeps track of changes in protein modification.


How does a prokaryote measure time? Prokaryotes can measure time either in absolute units or in
variable units like cell divisions. One mechanism is to start with a known pool size and ask how long it
takes to deplete it at a constant rate of depletion (reminiscent of plants and phytochromes). Remarkable
(at least in my cramped world view) is the fact that some prokaryotes have circadian clocks. That is, they
have internal clocks that work on the timing of a 24-hour day. These are best understood in cyanobacteria
that must balance oxygen-generating photosynthesis with oxygen sensitive nitrogen fixation. They do this
by performing photosynthesis during the day and nitrogen fixation at night (MolMicro21:5[96]
JBact180:2167[98]).
Regulation based on spatial arrangements among cells. Such regulation has been studied in
streptomycetes and myxobacteria, but it is almost certainly important in the real world for any prokaryote
in a non-aqueous environment, because the cells often exist in a colony in which they do not all see the
same environment. Reporter fusions have been used to demonstrate the non-identical state of cells in
different portions of B. subtilis colonies, for example (JBact178:1980[96]). Obviously, cells on the bottom
of the colony see a higher concentration of nutrients than those above, and cells in the "inside" of a colony
see little if any of the gas phase. Are there cases of spatial regulation within a cell? Certainly in
eukaryotes, with their organelles, this is possible, but our lack of understanding of the internal structure of
prokaryotes is a handicap here.
While the above argument deals primarily with bacteria in a solid matrix, there are also clear
cases of communication between cells that are not necessarily so constrained. The excretion and
subsequent uptake of small molecular weight compounds by bacteria is typically used to monitor density
of similar cells and is a very simple mode of deliberate cell-cell communication. It is often termed
autoinduction or quorum-sensing. Such compounds are known for many bacteria and can regulate
sporulation, infection, competence and luminescence, among other things (CurOpMicro6:56 & 191[03],
CurrOpBioT13:234[02]).
It has been suggested that one of the roles of antibiotics is to serve as signals between related
bacteria, though it is unclear what metabolic response they would be stimulating. At first blush this seems
silly, but in the real world the actual level of antibiotic production is exceedingly low, so that, whatever the
purpose, the antibiotics are likely to have an effect on only a very small immediate portion of the
environment.
Autoregulation. There can be autoregulation of the level of a gene product, or of the small molecule
product of a gene product, or of the small molecule product of the entire pathway. The general virtue of
autoregulation based on a given protein level is its directness, but this could have the disadvantage of not
testing for function, especially not testing if the level of function is appropriate for the physiological
conditions. An example is the apparent autoregulation, at the level of translation, of the RNA polymerase β-subunit (rpoB), where the regulation does not reflect the gene product's normal role as a polymerase
(JBact171:6234[89]). Metabolic regulation is less direct (i.e. it might assay other protein functions), but it
has the advantage of testing for function. Depending on precisely how it is done, it can also ask the
relative sufficiency of function in that particular environment. It seems like autoregulation ought to be
predominantly negative, since it will usually be set up to determine when there are sufficient amounts of
something and turn down synthesis at that point. When might it be positive?
RNAs as regulatory elements. Since 2000, it has become increasingly apparent that RNAs, typically small
non-coding RNAs (ncRNAs) have a variety of regulatory activities historically thought to be the province of
proteins. Examples of their activities are sprinkled in the following sections and have been reviewed
(ARB74:199[05]). It is not surprising that RNA can have regulatory effects, but it is somewhat surprising
that it has taken us so long to appreciate the range of possible mechanisms and the breadth of the
biological impacts. Some of these regulatory effects are because the ncRNAs are interacting with specific
proteins in the cell and modulating their activity. However, given the utility of base pairing, it is
understandable that most situations involve the interactions of the ncRNAs with mRNAs. Most of the trans
interactions (interactions of ncRNAs with other RNA molecules) are assisted in some way by a protein
termed Hfq. But there is another wrinkle that makes computer analysis of ncRNA-RNA interactions
difficult to predict. While the ncRNAs typically have a specific region that interacts with targets, either this
region or the paired region in the target can often form loops and bulges, so there is not a strict sequence
alignment, which in turn means that guessing possible binding partners is nearly impossible.

Regulation at the level of DNA


DNA quaternary structure. Supercoiling affects and is affected by a wide range of other physical
parameters and its possible roles in gene regulation have been discussed in LT1. DNA replication
ensures that there tend to be more copies of regions close to the origin than of those close to the
terminator of replication, and copy number can certainly affect gene expression. The possible role of
bacterial histones in regulation was noted in LT1.
There are obviously numerous examples of regulation of transcription involving the formation of
protein/DNA complexes, which require the correct positioning of proteins at sites on the DNA, leading to
the necessary tertiary DNA structures (e.g. DNA loops). Some of these structures no doubt affect local
stability of the DNA helix, while others allow protein:protein interactions to occur. These structures are
regulatory in the sense that their stringent requirements prevent the occurrence of productive complexes
at incorrect sites, and they allow a greater difference between the "on and off" levels of the promoter
because these effects require multiple affinities.
Obviously the more precise nucleosome structure of yeast chromatin opens the door for a variety
of regulatory effects.
DNA secondary structure. It appears to be the case that very select DNA sequences can function as
promoters when they are single-stranded (SS), or more precisely, when the relevant region does not have
a perfect DNA complement present and therefore forms an imperfect DNA duplex with another section of
SS DNA. Presumably, there is not a lot of SS DNA in most cells, so these situations have only been seen
so far in phage (Cell70:491[92]) and plasmids (Cell89:897[97]). Though there are some differences in the
two cases, such as the role of single-stranded binding protein, the notion seems to be that the imperfect
double-stranded region is adequate for binding holoRNAP and then easy to convert to an open complex
because of the base mis-pairing. Thus, the same regions when found in a normal DNA duplex do not
serve as promoters. At least in the case of F factor, this means that the single-stranded DNA that enters
the recipient cell during conjugation has a promoter near the oriT that allows expression of certain genes
only in the recipient cell. The same promoter is also involved in conversion of the entering SS DNA to a
DS copy that eventually circularizes in the recipient. This is covered a bit more under conjugation.
Rearrangements. (see Mobile DNA II, ASM[02]) There are the reversible systems that involve site-specific
recombination (ASM2,2256[96] and LT6 for overview). Examples include Mu gin
(EMBOJ7:1219,1229[88]), hin (MolMic51:1143[03]), and pilin switching in Moraxella bovis
(JBact172:310[90]). Such systems often (but not exclusively) involve the orientation of one or the other of
two similar, but non-identical, genes/transcripts downstream of a promoter. By this mechanism, a given
cell makes one or the other set of gene products, but other cells in the population make the other set.
There are also irreversible rearrangements, e.g. deletions of material in the middle of the nif
operon during the development of Anabaena, allowing the expression of nif functions in certain terminally
differentiated cell types (JBact171:4138[89]); and Losick's sporulation sigma factor of Bs (spoIVC) created
by a deletion fusing two ORFs (Sci243:507[89]). The gene encoding the site-specific recombinase for the
latter system has been identified (JBact172:1092[90]). In each of these cases a precise deletion occurs
that fuses two genetically distinct regions together to form a functional coding region. The irreversible
nature of these confines them to terminally differentiated cells and their only obvious function to make
sure that the un-rearranged gene is never expressed in non-differentiated cells.
A more complicated case seems to be plasmid shufflons (see ARG33:171[99]), where various
inversions at repeats of a region (7-19 bp in the case of IncI1 plasmid R64) at the 3' end of the pilin gene
allow a range of different proteins, with different C-termini, to be produced. These different pili have
differential ability to recognize receptors on various enteric bacteria, affecting the plasmid's identification
of hosts (JMB243:6[94]). These inversions are mediated by the rci gene product encoded near the
shufflons (JBact172:2230[90]).
Silencing. An important eukaryotic phenomenon, which may or may not have a counterpart in
prokaryotes, is silencing. In this phenomenon, regions of a chromosome, or indeed entire chromosomes,
can be modified in such a way that the genes are not expressed under at least some conditions. The
mechanisms for this appear to be numerous and include direct methylation of DNA, and the methylation,
acetylation, phosphorylation and ubiquitination of the core histones. Remarkably, at least some of this is
done under the direct action of small interfering RNAs that direct chromatin modification (see
Nat431:364[04] for a mini-review). This continues to be a rapidly evolving field and it is not without
possibility that some aspects of this will eventually be found in some prokaryotes.
Modification. The methylation of Dam sites at oriC (the chromosomal origin of replication) and in
transposons is used, at least some of the time, as an indication of recent DNA synthesis and therefore of
the availability of a sister chromosome (LT9). In the case of Tn5, there is a 10^3-fold effect of Dam
methylation on transposition; 10^2 of this is due to effects on the DNA target site of transposase action and the
other 10-fold is due to differences in transposase level due to altered transcription of the gene
(JMB199:35[88]). It is also clear that not all Dam sites are methylated at the same rates post-replication
(Gene74:189[88]). Phage P1 only cuts (and therefore packages) DNA at methylated pac sites, apparently
using methylation as a timing mechanism to prevent premature DNA packaging (PNAS87:8070[90]). The
pap (pilus formation) locus of Ec seems to be regulated at the level of transcription in response to the
methylation state of two Dam sites. This does not seem to be a timing mechanism, since specific protein
factors seem to block this methylation and the blockage itself is regulated (JBact173:17789[91]).
Regulation at the level of RNA. There are very nice reviews in ASM2 on gene expression (pp. 1225-1509) and on transcription and translation (pp. 792-860).
Transcription initiation. (see ASM2,792[96]) This is affected by DNA structure at a variety of levels, as well
as by various DNA-binding proteins and some ncRNAs (ARB74:199[05]). These proteins might generally
be grouped into three general classes: (i) General DNA-binding proteins (IHF, Fis) that bind specifically,
but do not appear to cause the regulation of any particular class of proteins (at least I don't believe so).
Rather they are a useful component of the regulatory mechanism of all sorts of genes. (ii) Global
regulatory factors that are directly involved in the regulation of many genes, but in a way to reflect some
overall cell response or regulatory plan. These might include the regulators responsible for anaerobic
gene control, catabolite repression and the like. It would also include specialized sigma factors, which
may or may not all be regulatory. (iii) The operon- or regulon-specific factors termed repressors and
activators. For the ncRNAs, they often seem to affect production of periplasmic proteins, though this might
be based on insufficient evidence, and they tend to support integration among different regulatory
networks.
The most significant factors in the actual strength and regulation of a given promoter are: (i) its
affinity for RNA polymerase (largely reflected in its similarity to the consensus promoter sequence); (ii)
flanking sequences such as UP elements or other 5' or 3' sequences that might either bind factors directly
or facilitate DNA bending; (iii) degree of supercoiling; (iv) repressor binding in competition with RNAP; (v)
transcription activation through direct protein-protein contact between an activator and RNA polymerase
(see ASM2:792[96]); and (vi) stability of the initiated RNA polymerase complex.
Because the major determinant for RNA polymerase recognition of specific promoters is the
sigma subunit, it is not surprising that there are a number of cases where the activity of certain sigma
factors is regulated. The level of σS (rpoS) of Ec, which is necessary for expression of many stationary
phase and osmotically regulated genes, is controlled at the levels of expression, translation, and protein
stability (G&D8:1600[94], JBact178:3763[96]), as is RpoH, the heat-shock sigma (cf JBact173:3904[91]).
Anti-σ factors are perhaps more common in Bs, but they are also found in Ec: FliA is the flagellar σF and it
is inhibited by FlgM (MolMicro6:3149[92]). In Bs, σB, which controls stress response, is regulated by the
anti-σ, RsbW, and this interaction is itself affected by other proteins. In sporulation, SpoIIAB is the anti-σF
(MolMicro31:1285[99]). Indeed, Bacillus even has anti-anti-σ factors (MolMicr47:1251[03]). On the
extracytoplasmic function (ECF) sigma factors, which are heavily regulated, see AdvMicPhys46:47[02].
Transcription elongation. This section will first address elongation per se, and then the implication of that
for regulation. It has been known for some time that RNA polymerase pauses, but the precise nature of
that pausing and its role have been somewhat unclear (as a recent ref, JBC222:37456[02]). Through a remarkable
set of measurements of the progress of single RNA polymerase molecules along DNA, it has become
clear that while polymerase is capable of elongating at 90-100 nt/sec, it actually works at 50-60 nt/sec in
most cases because there are periodic pauses (Cell115:437[03] & Nat426:684[03]). These pauses are of
one or a few seconds in duration and occur about every 100 nt. It remains unclear if these are sequence-dependent or not. There are a variety of ramifications: (i) This slows the RNA polymerization rate to
roughly that of translation, allowing the two processes to be coupled. (ii) It creates times where newly
synthesized RNA can restructure itself, which can potentiate other levels of regulation. (iii) It also provides
opportunity for proteins to bind to mRNA, again potentially effecting regulation.
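As a back-of-the-envelope check (my own arithmetic, not taken from the cited papers), the pause parameters quoted above do predict the observed effective rate. A minimal sketch in Python, assuming a maximal rate of 100 nt/sec and a 1-sec pause every 100 nt:

  # Effective elongation rate given periodic pausing (assumed values).
  intrinsic_rate = 100.0  # nt/sec while actively elongating
  pause_every = 100.0     # nt transcribed between pauses
  pause_length = 1.0      # sec per pause

  time_per_window = pause_every / intrinsic_rate + pause_length  # sec per 100 nt
  effective_rate = pause_every / time_per_window
  print(effective_rate)   # 50.0 nt/sec, at the low end of the measured 50-60 nt/sec

Nudging the assumed pause length or spacing moves the answer around within the measured range, so the simple picture is at least self-consistent.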
So how does such elongation fit into regulatory responses? One of the mechanisms is that of
attenuation, which leads to cessation of RNA synthesis. Attenuation is due to a competition for the
formation of different RNA:RNA structures resulting from either intermolecular RNA or intramolecular RNA
interactions (ASM2,1263 and 822[96]; Sci292:730[01]; BBA1577:337[02]; Bioessays24:700[02]). The
classic case involves the formation of certain stem structures in a leader region of an mRNA, leading to
transcription termination, when translation of the leader is unimpeded. Since the leader is rich in codons calling for a specific aa (which is the product of the proteins whose expression is being regulated), impeded ribosomes indicate a lack of the amino acid. This stalling causes other, mutually exclusive stems to form, which do not lead to termination of transcription. To a first approximation, the formation of
these stems depends on the rate of translation relative to the rate of transcription. (While I believe the
above statement and the following models are substantially correct, the actual molecular details of the
decision process in different systems are far more complicated. The simple view of attenuation effectively
being the relative ability of a particular stem to form can therefore be highly deceptive for a specific
situation.)
In the simple view, transcription rates are relatively constant (the fact that this is not precisely true,
as noted above, does not fundamentally alter the argument - one simply need posit that on the scale of
the leader in question, the rate is constant in the population of RNA polymerases). Therefore the
formation of terminators depends on the ability of ribosomes to translate a leader with an excess of
codons demanding a particular amino acid, with his and trp being the most studied. In other cases,
transcription rates can vary in regions that are non-random with respect to base composition. In the case
of the pyrBI system (PNAS85:7149[88]), the translated codons are essentially random, serving as a
benchmark for the ability of the RNA polymerase to produce the U-rich mRNA. In this case, the ability of
the ribosomes to keep up with the RNA polymerase indicates slow transcription, therefore uracil
deficiency, and leads to failure to terminate.
CRP is a positive regulatory protein that negatively regulates its own synthesis by inducing the
synthesis of an RNA that is complementary to a 5' portion of its own mRNA. The presence of sufficient
CRP causes the accumulation of this RNA, which binds to crp mRNA forming a stem that simulates a
terminator and leads to attenuation. There is an amazing case of protein-mediated anti-attenuation described, whereby BglG prevents a termination stem-loop from forming by binding to the mRNA (Cell62:1153[90]). This activity of BglG is regulated by phosphorylation through the BglF protein, depending on the level of β-glucosides (Cell58:847[89], ASM2,1278[96]).
Yet another variation on the anti-termination theme occurs in the biosynthetic operons of a
number of gram-positives. The presence of a specific uncharged tRNA leads to anti-termination, but not
through translation. Rather, (only) the uncharged tRNA base-pairs in both the anti-codon loop and the aa-accepting stem with sequences in the mRNA leader, which then favors structures that fail to terminate
transcription (PNAS100:12057[03], Bioessays24:700[02]).
Another case of attenuation is the regulation of the tna regulon of E. coli and Proteus mirabilis (ASM2,1279[96]), where there is a clear requirement for translation of the leader, as well as a clear role for Rho. The system is regulated by the level of tryptophan and involves a Trp codon in a leader sequence. However, the sensed levels of tryptophan are far higher than that necessary to saturate tRNA(Trp). In other words, the regulatory system is responding to changes in the concentration of tryptophan that should all lead to completely charged tryptophanyl tRNA. The paradox was resolved when it became clear that the level of tryptophan was sensed by a site formed by the nascent peptide on the ribosome. When this peptide binds tryptophan, it blocks the ability of the release factor to cleave the peptide from the final tRNA, thus jamming the ribosome at this position (JBC277:17095[02] & Sci297:1864[02]). Thus, though it has the general appearance of the process controlling trp biosynthesis, the mechanism is totally different. The specifics of this mechanism have been recently reviewed (JBact190:4787[08]), but it should be noted that this is merely one example of a nascent peptide on the translating ribosome having important biological effects. Others include ermC regulation (see under "translational elongation" in this LT) and "translation jumping" in LT1.
A different mechanism, but with a similar outcome, is seen in the trp region of Bacillus where
transcription elongation is regulated in large part by the binding (or not) of a ring-shaped 11-mer protein,
termed TRAP, that binds tryptophan as an allosteric regulator. This protein interacts directly with an
mRNA sequence to effect this regulation. There is even a protein that antagonizes TRAP function
(ARG39:47[05]).
In the last couple of years, a fairly remarkable mechanism of attenuation has been described in
some bacterial mRNAs in which the mRNA itself binds appropriate small molecules and this complex itself
signals transcriptional termination. An alternative RNA structure, stabilized by the absence of the small
molecule, does not support termination. Such a regulatory approach responds to flavin mononucleotide
levels in B. subtilis (CBC4:1024[03]).
Transcription termination. (see ASM2, 822[96], Cell114:157[03], BBA1577:240 & 251[02],
Bioessays24:700[02], ArchMic177:433[02]) Obviously, this is a lot like attenuation; the choice of terms in part reflects whether or not you have already transcribed a gene prior to reaching the regulatory site. It is
also clear that sequences near a promoter can determine the ability of a transcribing RNA polymerase to
read through a rho-independent termination signal.
Termination often occurs at sites where RNA polymerase pauses, at least in vitro. While the
nature of the pause determinant might include the newly synthesized RNA (as a hairpin?), it is clear that
the yet-to-be-transcribed regions are also critical to the event (JBC265:15145[90]). These events are very
difficult to address in vivo.
There are also well-described cases of regulated anti-termination in which proteins modify RNA
polymerase when at certain sites on the DNA and cause it to be resistant to termination signals
encountered later in transcription. The N protein of phage lambda recognizes an RNA segment and
becomes physically associated with the RNA polymerase in a rather large ribonucleoprotein complex.
Regulation is apparently effected through modulation of N levels and possibly by the presence of
collaborating or competing proteins.
In the case of antitermination by the lambda Q protein, a DNA segment is recognized, leading to a
modification of the RNA polymerase complex that reads through termination signals. The biochemical
nature of this modification is apparently the complex formed between it and lambda Q. The current model
is that the binding of lambda Q prevents RNA hairpins formed at terminators from disrupting the 5'-terminal bases of the RNA:DNA hybrid (G&D17:1281[03]).
mRNA processing (ASM2:849) RNA modification as a regulatory mechanism is clear in the case of
tRNAs, where a variety of modified bases must be created for proper tRNA function. mRNAs lack such
modification, but processing can still be an important process. In eukaryotes, the 3' ends of mRNAs are
polyadenylylated, which stabilizes the mRNA, but in prokaryotes, the situation seems to be oddly
reversed. Prokaryotic mRNAs do have poly-A tracts at the 3' end, but these tend to be much shorter than
in higher organisms and the presence of these tracts signals the degradation of the mRNA
(JBact184:4645 & 4658[02]). The processing of a completed mRNA can take the form of site-directed
cleavage by protein or RNA (including removal of introns). Indeed, discrete processing pathways of some
mRNAs in eukaryotes yield different mRNAs with functionally different gene products being created from
the same primary gene.
Though not a factor in microbes, we will mention the addition, deletion, and substitution of U's for
C's and vice versa in mammals, plants and flagellates (ARG34:499[00]). In this situation, there are
mitochondrial DNA regions that match the final mRNA product. These apparently make an anti-sense
"guide RNA" to allow processing of the true mRNA in a 3'-5' direction. In some cases the region affected
by these guide RNAs overlap, so that one guide creates the anchor sequence for the adjacent one
(Cell70:459[93]). The implications of this system include: (i) the two coding regions must coevolve,
although some separate evolution can occur (PNAS90:9242[93]); (ii) the guide RNA becomes a de facto
coding region; (iii) the system provides a mechanism for split genes and differential processing.
Surprisingly, the RNA-edited regions seem to accumulate mutations at twice the rate of unedited regions, for unknown reasons (Nat363:179[93]). A different sort of editing occurs in apolipoprotein B, where a C in the mRNA is specifically deaminated to a U, forming UAA (TIBS19:105[94]). Finally, in a human gene,
editing an A to an I (yielding a functionally different gene product) involves base pairing with a nearby
complementary sequence found in the adjacent intron. This has been detected both in vitro and in vivo
(Nat374:77[95]).
While mRNA editing of these sorts is found only in animals, it is interesting to note that the
specific cytosine deaminase from animals is closely related to that seen in chloroplasts and mitochondria,
suggesting an ancient origin for the activity. Also interesting is the detection of highly related ORFs, of no
known function, in Bacillus cereus, so that we might yet see some mRNA editing in prokaryotes
(TIBS19:105[94]). A hypothesis for the evolution of RNA editing is in TIG9:265[93].
mRNA stability. (see an excellent review by Joel Belasco on mRNA decay in proks and euks: Nature Rev/Mol Cell Biol 11:467[10]) In the proteobacteria, there are clearly both endonucleases and 3'-exonucleases in the cell that attack mRNAs. The stability of a given mRNA is a function of its ability to
resist all forms of degradation. Any regulation of stability of an mRNA involves the masking or unmasking
of a region of the mRNA to nuclease attack. The only known endonucleases in Ec, RNase E and RNase III, clearly do not account for all such activity, as some mRNAs are normally processed in their absence. It
seems likely that in some systems, the act of translation provides a stabilizing effect. In other cases, the
initiation of translation allows a ribosome to mask an endonuclease site, enhancing stability. Stability of
the primary transcript can be enhanced by interaction with other RNAs or with proteins and can be
reduced by proteins. It is also clear that there is a rather different mechanism of mRNA degradation in
many bacteria outside the proteobacteria. B. subtilis, for instance, certainly has a 5'-exonuclease that is a
major part of its processing system, but most bacteria seem to use a different system involving RNase E to
gnaw on the 5' end.
Most bacterial mRNAs appear to be highly unstable with half-lives of only a few minutes, but
some mRNAs possess unusual stability: the spliced-out version of the T4 td intron is very stable in both
its linear and circular form (Gene73:295[88]). It has also been demonstrated that certain terminators,
when cloned 3' of a given gene, produce a 3' mRNA end that is a target for RNase III cleavage which in
turn leads to mRNA lability. Other terminators cause stem/loops that do not signal RNase III cleavage and
they therefore can result in a more stable mRNA. Finally, some (all?) bacterial mRNAs are
polyadenylylated, which decreases stability; the pcnB product is a poly-A polymerase (recent ref:
MolMicro44:1287[02]). Interestingly, the addition of polyG stabilizes E. coli mRNA in extracts
(JBC276:23268[01]). In some cases, the presence or absence of the poly-A attachment can have
profound regulatory implications (PLOSBiol6:631[08]).
The situation in all eukaryotes is fundamentally different for the simple reason that transcription
and translation occur in discrete parts of the cell and therefore cannot be coupled as in prokaryotes. At
the very least, this implies a longer half-life to mRNAs to allow their transit from nucleus to cytoplasm. Not
surprisingly then, if the mRNAs are all more stable, then the stability and translatability of those mRNAs
have become a more common regulatory mechanism than in bacteria, where instability is the paradigm.
mRNA processing in eukaryotes is a topic that could not be adequately covered in 100 pages and
is studied by a number of faculty at Madison, including Mike Culbertson, Phil Anderson and Jeff Ross.
There is processing at both the 5' and 3' ends of the mRNAs and a curious twist to the bacterial model is
that the presence of long poly-A tracts at the 3' ends stabilizes eukaryotic mRNAs. There is also a
substantial amount of internal splicing for intron removal in many eukaryotic mRNAs, which allows the
possibility of multiple protein products from a given gene, though introns are relatively rare in yeast. These
processing steps increase the likelihood of errors, which, when coupled to long half-lives, means that
the potential cost of an erroneous mRNA is high. Not surprisingly, eukaryotes have a complex scanning
system whereby the first translating ribosome tests if the mRNA is at least apparently correct and leads
to mRNA destruction if there appear to be problems.
Somewhat surprisingly (to me, anyway) the archaea might have rather longer mRNA half-lives,
though the data set is based on one particularly slow-growing archaeal species. It is therefore less clear
that prokaryotes are really fundamentally different from eukaryotes in this respect. As stated in LT1:
Human cells have an average mRNA half-life of 16 h; with a generation time of 12 h, that yields a ratio of
about 1.3. In contrast, Sc has an mRNA half-life of 23 min and a generation time of 90 min, for a ratio of 0.26. In the best-studied archaeon, the slow-growing Sulfolobus solfataricus, the half-life is 54 min and the
generation time is 360 min, for a ratio of 0.15. Lastly, Ec has an average half-life of only 1 min and a
generation time of 30 min for a ratio of 0.03 (see Genome Res13:1975[03]). While the general claim that
regulation of mRNA stability or translation initiation is more common in eukaryotes than in bacteria, and to
some extent this is because of the longer mRNA half-lives in the former group, it is apparent that the
argument is a bit more complicated than this simple arguments implies.
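The ratios just quoted are simple arithmetic; a minimal Python sketch that recomputes them from the values given in the text:

  # Half-life-to-generation-time ratios from the numbers cited above.
  # (name: (mRNA half-life in min, generation time in min))
  data = {
      "human cells": (16 * 60, 12 * 60),
      "S. cerevisiae": (23, 90),
      "S. solfataricus": (54, 360),
      "E. coli": (1, 30),
  }
  for name, (half_life, gen_time) in data.items():
      print(name, round(half_life / gen_time, 2))
  # -> 1.33, 0.26, 0.15, 0.03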
Regulation at the level of translation
Initiation. (see ASM2,902[96])
(i) RNA structure can affect translation initiation (CSH66:363[01]). There are a number of cases
where a gene is set up so that you only get translation from the mRNAs starting at the correct (proximal)
promoter, while transcripts originating 5' of the proper promoter are ineffective because of the
formation of structures burying the ribosome binding site (the RBS) (G&D2:1791[88] & JMB192:781[86]).
There is also a case of two slightly different phage lysis proteins being made because of different starts
and different RNA structures (JBact172:204[90]). Qβ and other RNA phage form elaborate RNA
structures so that the translation of certain genes is prevented until the RBS is unmasked by the
translation of an adjacent gene (Nat260:500[76]). Ec cheA seems to have two different translation start
sites, making gene products of 69 and 78 kd and mRNA secondary structure seems involved in the
relative translation of the two versions (JBact173:2116[91]). Finally, there is apparently a case of a protein
binding and stimulating translation of a gene: Mu mom/com (Cell57:1201[89]).
(ii) Multiple starts. The generalization here is that the cell makes protein products from two different start sites and that these products differ in activity (some noted above). Typically one but not both of the proteins
is active and the inactive one complexes with the other to interfere with its activity under some condition.
An example is the holin function that lambda uses to lyse the host: There are two protein products,
differing by two N-terminal amino acids. The shorter one is active and the longer one serves as the
inhibitor. The combination has a role in determining the timing of lysis (MolMicro21:675[96]).
(iii) Small molecule-mRNA interactions: As mentioned above, there is a case where an mRNA
binds a small molecule to affect attenuation, but there are a set of other cases where small molecule
binding to mRNAs affects the availability of the Shine-Dalgarno, and therefore translation initiation. The
small molecules (each with its respective mRNA) include coenzyme B12, thiamine pyrophosphate, S-adenosylmethionine, lysine, guanine and adenine (CBC4:1024[03] & G&D17:2688[03]).
(iv) Translational repression by proteins: (ASM2,1280[96]; and the old case of RNA phage coat proteins binding to RNA and repressing translation, NAR17:6017[89]). As noted in paragraph i above,
there are cases of both repression and stimulation of translation initiation by proteins (ARB57:199[88]). In
general, the translational operator is in the vicinity of the RBS, but this is not necessarily the case. The
infC rpmI rplT region is translationally autoregulated at infC initiation (by the infC gene product).
Translation of rpmI and rplT is turned off by L20 (RplT), by its binding within infC (JMB213:465[90]).
Another example of translational repression is provided by the negative autoregulation of T4 DNA
polymerase, which binds a stem structure in its own mRNA which is reminiscent of the replication initiation
site (JMB213:749[90]). Threonine tRNA synthetase similarly binds to prevent formation of the 30S-mRNAfMet
complex (ARB53:75[84], PNAS83:4384[86]). Ribosomal protein S15 of Ec acts a little differently
tRNA
in that it traps ribosomes on the mRNA initiation site, through stabilization of a pseudo-knot
(PNAS90:4394[93]; the S4 protein appears to act similarly (PNAS90:4399[93]).
A variation of this occurs with the TRAP protein of Bacillus mentioned above in connection to
regulation of transcription elongation. This same ring-shaped 11-mer can bind mRNA to affect the
availability of the ribosome binding sites for several different genes in response to the presence or
absence of tryptophan (ARG39:47[05]).
(v) Inhibition by ncRNA: Typically this occurs at the level of initiation and can be due to either
intramolecular or intermolecular interactions (ARB74:199[05]). On IS10 antisense regulation, see
CurTopMicrImm204:49[96].
(vi) Activation of translation by ncRNAs: This appears to happen by several mechanisms, with the obvious one being the disruption of interfering secondary structure by the ncRNA. However, there are clearly other mechanisms at play that are only beginning to be deciphered (PLOSBiol6:631[08], NAR13:1018[07]).
Translational attenuation is very similar and exists where the presence or absence of translation of one gene determines RNA secondary structures, which in turn determine initiation on the next ORF (JMB206:69[89] & JBact172:1[90]). A particularly cute case of the latter exists in the control of Tet(r) and conjugation in Bacteroides. Here, the stalling of a ribosome, due to the presence of tetracycline, on a three-codon leader gene, leads to changes in mRNA structure that expose the RBS for the gene encoding resistance (JBact187:2673[05]).
Translational elongation. There are a number of ways by which elongation can be regulated.
(i) RNA structural effects on elongation are poorly understood, but Inouye has argued for
translation pause sites that might reflect differences in RNA structure having an effect on synthesis of
large proteins (JBC274:4428[89]). In the Ty element of yeast, the relative abundance of a rare tRNA
affects the frequency of frameshifting necessary to produce a critical transposition factor
(PNAS87:8360[90]).
(ii) tRNA availability can affect the rate of translation or even the ability to translate; availability
can influence both frameshifting and translation error rate.
(iii) Translational attenuation can affect translatability of the mRNA past the short leader gene. In
some of these cases, the stalled ribosomes also block what appears to be 5'-3' exonucleolytic degradation of the
mRNA (JBact171:5803 & 6680 [89]). As mentioned above, there are cases of protein-mediated
attenuation and anti-attenuation. A particularly interesting case is that of ermC of Bs, which encodes a
methylase conferring resistance to MLS antibiotics. The methylase modifies ribosomes when slow
translation suggests that MLS antibiotics are present. Recognition of the effect is through relief of
translational attenuation. While attenuated, the stalled ribosomes provide stability for the untranslated
mRNA, presumably by blocking an endonuclease site. Finally, the methylase apparently serves as a
translational repressor of its own synthesis. These three levels of post-transcriptional regulation are briefly
reviewed in MolMicro4:1419[90].
Among the problems in rationalizing the ermC model has been the nature of the specificity in the
stall site (as the ribosomes should presumably stall anywhere) and the specificity of expression in the
presence of the appropriate drug (erythromycin and chloramphenicol have similar effects on translation,
but are specific for the proteins they induce through translational attenuation). There was also the very
surprising data from the Weisblum lab that the sequence of the ermC leader was important
(JMB206:69[89]), but a role was unclear. It now turns out that the two genes regulated by CAM have
different leader peptides, but each is an inhibitor of peptidyl transferase. Thus these peptides themselves
provide the site-specificity for pausing. The model is that paused ribosomes become stalled (and lead to
translation downstream) due to the presence of the appropriate drug, presumably because of the specific
nature of the leader peptide (PNAS91:5612[94]). This also explains how a drug that causes translational
stalling can paradoxically lead to translation: the pause sites are hypersensitive to the drug.
(iv) The effects of such cis-acting peptides on translation are not confined to leader peptides, but
are also found with certain coding regions within functional proteins. It is apparently found in a variety of
prokaryotic and eukaryotic systems and is nicely discussed in MicroRev60:366[96].
Regulation of translational termination. (see ASM2,909[96])
There is not a lot of regulation at this level. RNA structural effects on termination exist, but these
are hardly regulatory (ARB57:199[88]). One can fail to terminate due to suppressors or frameshifting, but
again, the level of these is not regulated. The mis-coding discussed in LT2 whereby selenocysteine or
pyrrolysine are inserted at certain stop codons is again a coding issue and not a regulatory one.
Regulation at the level of protein
Protein stability. While most proteins in prokaryotes are fairly stable, there are cases of specific cleavage
(or stimulation of autocleavage, as with LexA/RecA), as well as proteins with naturally short half-lives. There is an
excellent mini-rev of regulation by proteolysis in ASM2,938[96]. At least some unstable proteins are
stabilized upon overproduction (in rather direct contrast to the case of many normal proteins, where
overproduction makes them less stable), apparently because the degradation system has been
overwhelmed. In yeast, at least, there is a strong correlation between half-lives and the N-terminal amino
acids (JBC264:16700[89] & Sci234:151 & 179[86]), apparently reflecting the action of ubiquitin, which is a
major pathway of protein turnover in eukaryotes.
While not directly related to regulation, this is a reasonable spot to mention chaperones (see ASM2, 922[96] & MolMicro45:1197[02], CurOpMic6:157[03], CellMolLifeSci59:1589[02]). It is clear that the
products of groEL and a number of other proteins are involved in formation of at least some protein
complexes and in the stabilization of many proteins affected by missense mutations. For example, groEL
mutations cause a defect in error-prone repair, apparently because they are necessary to stabilize the
umuD gene product, which is required for that process. These chaperones, whose homologs are found in
virtually every organism, bind to semi-denatured states of many proteins, thus increasing the likelihood
that they will properly fold. Many of these proteins are induced by heat shock, and others seem to serve the general function above, but for particular classes of proteins, for example, those to be exported from the cell. The roles of DnaK, DnaJ and GrpE (other heat-shock proteins in Ec) in protein secretion have also
been described. The crystal structure of GroEL has been solved and reveals a cylinder that apparently
functions by trapping the target protein inside, shielding it from interference with its proper folding by the
concentration of proteins in the cytosol.
There are also numerous cases of proteins that are normally stable when complexed with other
proteins, but which become highly unstable when unable to complex, either because the protein in
question is overproduced or the other protein is absent. Again, whether this should be considered
regulation is unclear.
Finally, it has been my prejudice that yeast is more tolerant of odd proteins than are the
commonly studied enteric bacteria. This prejudice is based on the observation that one often has trouble
achieving good accumulation of foreign or odd proteins in enterics because of protein instability, but yeast
seems more tolerant of such things. Mike Culbertson disagrees, however, and notes that they routinely
fail to over-accumulate aberrant proteins even when expressed at high levels, which is certainly
consistent with their degradation.
Covalent modification can be used to either activate (phosphorylation of NtrC or most other two-component regulators, and of the spo0A gene product to block sporulation in Bs, JMB215:359[90]), inactivate (glnA and numerous other examples), or alter function (III(Glc) of the PTS system, rev in JBC265:2993[90]).
Modifications include methylation, phosphorylation, ADP-ribosylation, adenylylation, uridylylation,
glycosylation, acylation (JBC265:17180[90]) and addition of cytochromes. Modifications can target
proteins for other processing, mask a critical site, interfere with activity by steric hindrance,... Such
modifications are typically reversible.
Allostery. Allostery is the situation in which the binding of a ligand to one site on a protein influences the
binding of another ligand at an independent site on the same protein. Innumerable examples of this
regulation exist. One specific type of allostery is feedback inhibition. This is not to be confused with
repression, which involves a blockage of transcription. Feedback inhibition is the inhibition of the activity
of an enzyme by a metabolic product of the pathway. Most typically, the first enzymatic step in the
pathway is inhibited by the final product of the entire pathway. This makes sense because it blocks the
flow of metabolites into the pathway, so that undesired intermediates are not produced. Brief discussions
of this in a variety of pathways are given in the ASM2[96] books, including the aromatic pathway (p.458) and histidine (p.485).
Do not be deceived by the short section here devoted to allostery. This mode of regulation is
probably the most important single method for all living cells. In prokaryotes, only the regulation of
transcription initiation even comes close in its impact on physiology and metabolism. Unfortunately, this
sort of regulation is missed by most of the genetic tools that we use for examining regulation, such as
fusions and arrays, since they focus on mRNA synthesis and accumulation.
Limiting factors (metals as the example). Metals are critical to a large number of proteins and they are not
always as available as an organism might like. For example, the most commonly found nitrogenase
system contains an atom of molybdenum at the active site, but many nitrogen-fixing organisms carry sets
of genes for one or even two additional nitrogenases. These alternate enzymes are slightly less efficient at
the process, but have either vanadium or iron at the active site, so they are used when there is insufficient
molybdenum in the environment.
There are a number of challenges in studying regulation in the cell, depending on the system.
Can we measure all of the physiologically important regulatory responses in a cell? Typically we struggle
to measure things in vivo with better than two-fold accuracy, which means that it is very tough to make
solid conclusions about regulatory systems that have only a three-fold range (or those with less, of
course) at maximum. We therefore have a tendency to dismiss such modest changes as unimportant,
where in fact they are simply technically tough to address. So are they really unimportant as well? Some
might be, but a case could be made that many systems with such a modest range are governing activities
that are so critical that this is the only variation in activity that the cell can tolerate. I will give a single
example: glutamine synthetase, or GS, is a central step in the nitrogen cycle of the cell, and so its activity
is regulated based on the ratio of carbon to nitrogen. It is subject to elaborate systems of transcriptional
regulation (or at least the gene, glnA is) as well as post-translational regulation through covalent
modification: at least six gene products are directly involved. However, under the most extreme growth
conditions (excess carbon with nitrogen starvation vs excess nitrogen with carbon starvation), the change in GS activity is only about three-fold. Indeed, very high or very low levels of GS are
lethal to most cells. The reason for all this is that the relative level of nitrogen, in the form of glutamine, is
so critical to so many aspects of cell metabolism, that there is only a rather limited range that can be
tolerated.
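The two-fold measurement problem mentioned above is easy to see numerically. A minimal sketch in Python (hypothetical numbers, not data from any actual GS experiment):

  # A three-fold regulatory range seen through a two-fold-accurate assay.
  true_uninduced = 1.0   # arbitrary activity units
  true_induced = 3.0     # three-fold regulation, as for GS
  error = 2.0            # assays reliable only to within two-fold

  # Worst-case measured values:
  print(true_uninduced * error)  # 2.0: uninduced could read this high
  print(true_induced / error)    # 1.5: induced could read this low

Since the worst-case uninduced reading (2.0) exceeds the worst-case induced reading (1.5), a single pair of measurements cannot reliably demonstrate the three-fold regulation; only careful replication can.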
What is the best way to study a regulatory system? Obviously this depends on the system, but I want to
make two points: we tend to apply our tools at hand rather than what is biologically appropriate, and we
often perturb the system under analysis. On the former issue: transcriptional regulation is certainly very
important, but there is a tendency to study it to the exclusion of other levels of regulation, in part because
it is the fashionable thing to do. Array analysis, for example, is very useful, but the result are often
phrased as if all interesting regulation in the cell is being studied, where what is actually analyzed is a
somewhat complex (and poorly defined) mix of transcriptional and post-transcriptional regulation. On
perturbation of the biology, remember that making mutants of any sort, and particularly transcriptional and
translational fusions (see below), itself perturbs the system. Now this is fine as long as we keep that
fact in mind, but we often do not. For this reason, there is something to be said for a direct physical
analysis of a wild-type population, such as a direct analysis of mRNA synthesis or accumulation, or
protein synthesis, accumulation or activity. Of course these all have their own special problems, but at
least the problems are different from those of genetic approaches, so a combination of genetic and
physical analyses is very powerful.
We tend to ignore the fact that biological populations are not homogeneous. There are two issues here.
The first was discussed in the paragraph earlier in this LT on stochastic processes, where the take-home
is simply that a population of cells is not uniform. Rather, there are a number of outliers in the population,
but our methods of analysis tend to average these outliers away. Sometimes this is fine, but sometimes
different sub-populations are sufficiently large and sufficiently different from the majority class that mixing
them all together gives a very distorted view of what is going on. The second issue was also touched on
before, but not very explicitly. That is, the behavior of populations of identical proteins in a cell is itself not
uniform. The best way of describing this is with an example. Catabolite activating protein (called CAP or
CRP) is said to be active for DNA binding and transcriptional activation in the presence of cAMP, but
inactive without cAMP. Of course, this cannot be absolutely correct when we think about the situation in
terms of thermodynamics. CRP must exist in an equilibrium between the active and inactive forms, which
are probably thermodynamically fairly stable, with a certain energy peak in between the two forms which
establishes the rate of interconversion. In the absence of cAMP, the equilibrium is strongly shifted toward
the inactive form, and in the presence of cAMP, it will be strongly shifted toward the active form. However,
in neither case will the population be homogeneously active or inactive. Indeed, in a homologous protein,
CooA, it appears that the absence of the effector strongly shifts the protein to the inactive form, but the
presence of the effector only creates a population of proteins that is about 25% active. Biologically,
however, this makes sense if the cell wants negligible activity without effector and requires only a
small total population of active protein in the presence of effector.
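The thermodynamic argument above can be put into a minimal two-state sketch (my own illustration; the equilibrium constants are hypothetical, chosen only to mimic the behaviors described in the text):

  # Two-state model: protein interconverts between inactive (I) and
  # active (A) forms with equilibrium constant K = [A]/[I].
  def fraction_active(K):
      return K / (1.0 + K)

  print(fraction_active(0.001))    # ~0.1% active: "inactive" without effector
  print(fraction_active(1.0 / 3))  # 25% active: like CooA with effector bound
  print(fraction_active(100.0))    # ~99% active: what "fully active" implies

Neither extreme is ever truly 0% or 100% active; effector binding simply moves K.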
But our ability to analyze single cells is improving dramatically. See Sci329:533[10], though this
analysis and the proposed interpretations are not without some serious problems.
Fusions
In the course of analyzing a regulated gene or genes, it is often useful to monitor the regulatory system by putting a foreign gene under the control of that system. The foreign gene typically encodes an easily assayed product (a reporter molecule), like the enzyme β-galactosidase, encoded by lacZ. For the activity of the reporter to reflect the regulatory scheme of interest, the foreign gene needs to be separated from its own regulatory signals and placed transcriptionally downstream from the regulatory sites under study. Such a merger of the two different regions is termed a fusion, and is used when the regulation of the system is the important topic. The inverse case, where a gene of interest is placed downstream of a strong or tunable promoter, is typically called an expression vector and is employed when the overexpression of the gene product is the critical issue. The use of fusions is substantially similar in yeast and bacteria, though perhaps less commonly employed in the former. The traditional reporter was also β-galactosidase, though GFP fusions have become much more popular.
[Figure 3-1. Examples of transcriptional and translational fusion.]
Fusions fall into two classes that require separate definitions. In transcriptional fusions, the reporter gene lacks a promoter, but possesses a functional ribosome-binding site, so the reporter gene product is made whenever the target gene promoter is being transcribed. This allows you to monitor transcriptional regulation of the target gene and ignores any posttranscriptional effects, since the mRNA is necessarily altered relative to the wild type. In translational fusions, the reporter gene lacks both a promoter and a functional ribosome-binding site, so the reporter gene product is made whenever the target gene is being transcribed and translated. In this case, you test for the presence and degree of both transcriptional and translational regulation, though such a fusion by itself does not allow you to tell which is present; the comparison of results with transcriptional and translational fusions identifies how much regulation is being done at each level.
[Figure 3-2. Fusions created through the use of engineered transposons.]
The nature of the reporter gene product is obviously crucial: you want an easily assayable product (lacZ, β-galactosidase, for which there are easy color
assays; galK, galactokinase, also with color assays; lux, luciferase, with photometric assays) or a
selectable gene product (for example, most any gene causing drug resistance by degradation or
modification of the drug). Further, the reporter protein must be functional even when synthesized as a
fusion protein.
If you have translational fusions at different positions within the target gene, the specific activity of
the different reporter proteins might be different because of the extraneous protein attached to the amino
terminus of each one. The hybrid proteins might also possess different degrees of stability and therefore
show different accumulated activity, which could be misinterpreted as a difference in regulation of the
target gene. One of the reasons for the utility of β-galactosidase as a reporter is that its activity is
remarkably unaffected by the presence of attached protein.
Use of fusions. Many genes are expressed at relatively low levels so that direct measurement of RNA synthesis is technically difficult and therefore the regulation of transcription of the gene cannot be so
determined. If the reporter gene encodes an easily assayable product, transcriptional regulation of the
affected transcript becomes possible. While such fusions give relative rates of expression of a given gene
under different conditions, care should be taken in trying to extrapolate back to the absolute levels of
expression in the wild type, since the reporter itself can affect the absolute expression in different ways.
Selection/screening of mutants: In a wild-type situation, there may be no easy selection or screen
for mutants that abnormally regulate expression of a given gene. However, introduction into that transcript
of a gene with an easily selectable or screenable gene product can allow the production of desired
mutants because the regulatory system is now regulating a more technically addressable gene and gene
product. The strain desired for such analyses should typically contain the fusion as well as the wild-type
allele of the region. Otherwise the regulation pattern seen will be that of the mutant and not necessarily
that of wild type (see below).
Though not exactly an aspect of regulation, fusions can be used for protein isolation and analysis
of protein localization. When fused or hybrid proteins are produced, antibodies recognizing either portion
of the hybrid will typically precipitate the entire hybrid. Alternatively, the hybrid protein often behaves like
the wild-type protein and can be used to monitor the behavior of that protein. For example, if the hybrid
protein is detected in the cell membrane, it is likely that the target gene's product is normally there (assuming, of
course, that the reporter product was already known not to be associated with the membrane by itself).
Similarly, alkaline phosphatase is only active when exported to the periplasm, so detection of its activity
as a reporter can serve to indicate that the product of the gene to which it is fused is normally exported.
Generation of fusions. As noted elsewhere, fusions can be generated by deletions and duplications. Such
modes of generation have a number of disadvantages if the fusion is to be useful as described above.
Most notably, the frequency of any given fusion will be low, and there will be severe limitations on which
genes can be so fused. More useful fusion systems involve the use of specially constructed transposons
for the generation of fusions in vivo or cloning methodology for in vitro fusion generation.
A small variety of transposable elements have been constructed to carry a gene (encoding an
assayable product) that is automatically fused to the transcript (or gene) into which the transposon inserts.
Different versions of these transposons generate either transcriptional or translational fusions as
described above. Desired fusions can be sought either by the phenotype that the fusion causes (after
mutagenesis with a fusion-forming transposon, a newly generated His strain will probably have a fusion in
the his region, but perhaps not in the "proper" orientation) or the regulatory phenotype itself (strains that
express the reporter gene in response to a growth condition or other stimulus). This latter point is
particularly interesting since it allows regions to be mutationally identified on the basis of their regulation,
rather than on the phenotype caused by their loss. Yeasts do not have appropriate elements for this, so
other construction methods must be employed.
It should be evident that having access to the DNA of the two genes of interest allows one to fuse
them in vitro. This was typically through the use of appropriate restriction enzymes, but now is more often
performed by appropriately synthesized PCR oligos.
Controls/concerns with fusions. The fusion is itself a mutation with respect to the wild-type genotype, and
a polar one at that. As such, it can perturb the very regulatory scheme under analysis in the following
ways: (i) the products of either the mutated gene or one transcriptionally downstream from the insertion
could be autoregulatory or (ii) the absence of the product of the mutated gene or of those downstream
might perturb the physiology of the cell so as to alter the regulation detected by the fusion. The latter
perturbation is particularly common, especially in anabolic pathways where regulation is typically based
on the level of the product of the pathway. Arguably, most gene products are autoregulatory in some way.
If the fusions are being used to monitor regulation, an obvious control is the introduction of a wild-type version of the affected region, typically on a low-copy-number plasmid, on an integrated specialized
phage, or in the normal chromosomal location. This copy supplies the mutated gene products and their
metabolic products, making the merodiploid much more like the wild-type situation. It is, however, still not
precisely the same as wild type, since there are two or more copies of the region in the cell and at least
one is not in the normal position on the chromosome. For most cases, these issues will not greatly perturb
regulation, but for some systems of extremely delicate regulation they might.
If the fusions are being used to select for regulatory mutations, the subsequent analysis of these
mutations in the absence of the fusion will eliminate concerns over perturbation by the fusion. On the
other hand, certain regulatory mutations may not be selectable because of the perturbation by the fusion.
Different reporter genes downstream of a given promoter can yield very different results, apparently
because the reporter gene itself has perturbed the supercoiling of the region (JBact176:2128[94]). Finally,
if the system under analysis employs post-transcriptional regulation, then the fusion will possibly perturb
that regulation by its effect on the mRNA produced.
If a fusion is generated by an in vivo transposition event, the possibility exists that one will end up
with a strain containing fusions at two different sites due to two transposition events (this is more likely
with Mu-generated events than with other transposons). Such a strain would be inappropriate for a study
of regulation since the assayable product would be produced by two differently regulated promoters.
Selections for altered regulation would be flawed for similar reasons. Southern analysis or genetic
methods should be used to indicate if the strain contains two copies of the fusion system.
Without belaboring the obvious, if one wishes to monitor the regulation of an operon with a lacZ
fusion and the compound X-gal (which β-galactosidase enzymatically cleaves, producing color), one needs to start with a strain which is itself lacZ- (preferably by deletion to avoid complications due to
recombination with the fusion system), and which is capable of transporting X-gal into the cell. If the
fusion system uses a transposon, it should be capable of transposing in that strain. Lastly, if fusions are
selected using a drug-resistance marker in the vector, the starting strain must be sensitive to that drug
and the resistance gene must be expressed in that strain.
Arrays. (CurOpMic6:114[03], JBioTech98:255[02], TIG18:255[02]). The goal of this method is to study
physiology by a direct analysis of the accumulated mRNAs in cells at any given time. The general
approach is that glass or plastic chips are created with many different oligonucleotides placed in precise
positions. These might be PCR products from unique portions of each gene in the organism, or sets of
short oligos that hybridize to a number of discrete positions within each gene. In either case, mRNA is
isolated and converted into dye-labeled cDNA, which is hybridized to the gene chip. There are a number
of controls that address tagged sequences sticking to similar, but inappropriate, sequences on the chip,
but this is not the place to go through that detail. The absolute level of a given mRNA can be determined
(averaged over all the cells in the population), as can a comparison of mRNA levels in cells growing under
different conditions. Alternatively, sets of co-regulated genes can be identified either because their mRNA
fluctuations are identical or because mutations in specific regulatory genes have similar effects on a
certain number of genes.
The method is already powerful and doubtless will become more so with technical improvements.
This has rapidly become a fairly vast and highly technical field. Initially it required so much information
and expensive methodology that it was not an approach that could be applied to more than a few
organisms in a handful of labs, but commercialization of the method has changed that.
Some general issues: (i) Low signals cannot be reliably detected, which makes analyses of poorly
expressed genes difficult. In other words, there is sufficient noise from tagged mRNAs hybridizing to
inappropriate oligos that very low positive signals cannot be recognized with confidence. This is not a
trivial issue, since many genes in the cell are expressed only at low level (such as most genes encoding
regulatory proteins), or only occasionally during the cell cycle (such as genes whose products perform
replication) so that the average accumulation in a non-synchronous population is very low. One implication of the problem with low signals is that it is often difficult to accurately determine the fold-increase in expression, because that is hugely affected by errors in the smaller of the two numbers (see the numerical sketch after this list). (ii)
The method does not measure gene expression or gene product synthesis, but rather mRNA
accumulation. Unstable mRNAs might be difficult to reliably quantify. Similarly, the presence of an mRNA
does not imply that it is being translated. See a brief discussion in ASMNews 68:432 [02]. (iii) While the
method can certainly be used to analyze mutants, detected effects might not be direct. That is, if a mutant
with a knockout mutation in gene X has a higher level of gene Y mRNA than does the WT under similar
conditions, that might mean that the product of gene X is a direct repressor of gene Y transcription, but it
might equally well mean something more complicated: the absence of the activity of the gene X product
might cause all sorts of physiological changes in the cell that, very indirectly, lead to a higher level of gene
Y mRNA.
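On the fold-increase problem flagged in point (i), a minimal numerical sketch (hypothetical signal values) shows how noise in the smaller number dominates the ratio:

  # Apparent fold-change when the low signal is near the noise floor.
  true_low, true_high = 10.0, 40.0  # true mRNA signals, arbitrary units
  noise = 8.0                       # plausible background/cross-hybridization

  print(true_high / true_low)            # 4.0: the true fold-increase
  print(true_high / (true_low + noise))  # ~2.2: noise inflates the low signal
  print(true_high / (true_low - noise))  # 20.0: noise deflates the low signal

The same absolute noise applied to the large signal would change the ratio only modestly, which is why well-expressed genes give trustworthy ratios and poorly expressed ones do not.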
The issue of directness can be handled by a related array method called ChIP-chip, for chromatin immunoprecipitation chip (array). One takes living cells and treats them with a cross-linking agent that rapidly kills the cells, freezes metabolism, and cross-links proteins to anything that they are near. The cells are lysed and sonicated to break up DNA into rather small (a few hundred bp) fragments. Then antibody to a given protein is used to precipitate that protein and anything it is cross-linked to. The cross-link is broken and any precipitated DNA, which was presumably bound to the protein in the cell, is amplified with a tag and hybridized to an array, which allows recognition of the specific DNA regions to which the precipitated protein was bound (see MolBiotech 45:87[10]). A related method, termed ChIP-Seq, uses a similar protocol, but simply sequences the DNA directly rather than using arrays for analysis.
Cell compartments. (Yes, it's true that cell compartments do not have anything directly to do with
regulation, but I think it likely affects regulation indirectly and is an important if underappreciated aspect of
prokaryotic cells, so...) We are well aware of the different compartments, or organelles, in eukaryotic cells
and the important roles they play, but it is typically assumed that prokaryotic cells are merely bags of
proteins and nucleic acid without any finer structure than that. But can this really be true? Consider the
huge and complex metabolic pathway charts, in which more than 80% of the small molecules are the product of a single reaction and the substrate for another single reaction. It is possible that all these compounds are made and simply float around in the cytoplasm to occasionally be bound and processed by the next catalytic step, but there are good reasons for doubting that this is the way things are. For
example, it would be much more efficient if the consuming enzyme were complexed with the producing
enzyme so that the substrate was immediately bound and processed because the substrate would never
have to accumulate at a meaningful level in the cytoplasm. Also, this would avoid the potential problem of
having the substrate bind to other enzymes for which it has some low affinity. (Obviously this does not
work for all small molecules; there are many that are parts of several pathways and therefore must be
generally available in the cell.) So protein complexes make sense, but is it really the case?
The answer is that there are several general cases where protein complexes certainly occur, but
the more general cases of metabolism, which were the example that started the discussion, have not
been shown to operate this way. For the known cases and then the rationale: The first striking case is
represented by proteins termed polyketide synthetases and the non-ribosomal peptide synthetases. In
these systems, the proteins themselves are typically massive chains that have multiple domains of
different function, essentially protein complexes formed by gene fusion rather than post-translational
assembly. These massive proteins are typically found in tight complexes with a few other members of the
same multi-domain type, though with domains serving different catalytic roles and therefore providing
different modifications to the growing substrate. Finally, it is typically the case that the substrate does not
leave the complex until completion, in part because it is covalently tethered to one or another of the
proteins throughout the process.
The other striking example is where the cell creates large protein complexes to contain
a given set of proteins and their products. These containing structures can resemble phage heads, but
that might simply reflect the limited number of shapes that can efficiently be formed out of a few monomer
types. One that has been known for a long time is that containing Rubisco (ribulose-1,5-bisphosphate
carboxylase-oxygenase) and carbonic anhydrase, which both perform CO2 fixation in some phototrophs (Sci319:1083[08]). A recently described case is in E. coli and Salmonella where the proteins involved in
ethanolamine utilization are sequestered in a 100 nm polyhedron, presumably to prevent the aldehyde
intermediates in the process from damaging other cell components (Sci327:81[10]). Other examples are
cited in that paper and in an accompanying article (Sci327:42[10]).
If these examples exist, and if it seems to make sense for many other metabolic pathways, why
isn't it the case? I think that the best answer is that we really do not know whether or not other pathways
are organized in complexes, but rather that they do not seem to be in particularly tight or obvious
complexes (or else they would be seen). My bet is that the enzymes involved in most pathways are
organized in rather loose complexes that are simply difficult to find by our methods, though of course
there might be a completely different factor, of which we are unaware, that more than compensates for the apparent advantages.
So what are the genetic (where I mean "genetic" very generally) implications of all this? In cases
of extreme sequestration of a substrate, one might have cases of the same substrate being present in two
separate pathways, yet not being shared, a phenomenon known in eukaryotes as channeling. One should
expect some interesting complications in complementation, where any given protein would need to
compete for integration into one of the complexes in order to be functional. An implication of very large
proteins is that the error-prone nature of translation means that most such peptides would harbor one or
more substitutions, which might affect function (and therefore the function of the entire complex).
607 Lecture Topic 4. GENETIC ENGINEERING. No cloning operation goes to completion and it is
important to concentrate one's effort on those vectors that have had a piece of DNA ligated into them.
These can be identified in one of several ways: Some vectors have a portion of lacZ that encodes the α fragment of β-gal. This fragment is able to "complement" a mutant β-gal version, typically encoded in the chromosome, to produce blue pigment from X-Gal. The cloned lacZ also contains an MCS (multiple cloning site), so that insertions into the vector disrupt the gene and eliminate color production. The mutant lacZ is employed because of its smaller size, allowing the vector to remain small. Another possibility is to have two drug-resistance markers on the vector, one or the other possessing cloning sites. Following ligation,
vectors that have lost one drug marker, but not the other, will often contain inserts. Finally, there are
vectors that contain genes that are conditionally deleterious, so that cloning into these allows host (and
vector) survival. The problem with such a selection for inserts is that the large target size for spontaneous
knockouts ensures many survivors, whether or not inserts have been obtained. If there are no problems,
screens are probably fine; if there are problems, selections will probably just yield a number of aberrant
clones.
Another generally useful property for a vector is a promoter capable of reading into the cloned
region. These can allow the expression of the cloned region in vivo, which can be used to either identify
the correct clones or as a useful end in itself, or in vitro, where the synthesis of RNA could be used in in
vitro translation or for the generation of labeled probes. The promoters typically have the property of
tunable high-level activity, but some are also designed to shut down expression completely under
desired conditions.
Types of vectors for prokaryotes.
(i) "Standard" plasmids: There is a vast array of plasmid types that are useful for cloning. Many of
the properties of these are covered in LT9 on plasmids, but some of the features that are particularly
useful for cloning are the following: small size, so that they are easier to isolate; ability to carry large
inserts; high copy-number, so that they are easy to isolate (high copy-number can also be a problem, so
an even nicer feature is controllable copy number, JBact171:5254[89]); drug resistance; multiple origins of
replication allowing replication in diverse organisms.
(ii) "Non-standard" plasmids: Very large fragments of DNA have been successfully cloned in yeast
on the Yeast Artificial Chromosomes (YACs) (Sci236:806[87]); these have been applied to prokaryotic
systems. It has been shown that a mini-F, lacking tra etc, can be used to clone fragments >100kb. Large
random fragments of DNA are cloned into a site in the mini-F flanked by phage cosL and cosR sites.
These provide long, sticky ends and a unique "restriction" site (NAR18:3863[90]). Alternatively, Bacterial
artificial chromosomes (BACs) have also been successfully employed. Construction methods are
described in Genomics 34: 213[96] and some uses are in PNAS96: 6451-6455[99], Infect. Immun.
66:4313[98], Microb. Comp. Genomics 3: 105[98].
(iii) Phage vectors: (see Sambrook chap. 2) Nearly all phage vector systems are based on
lambda and that is all that will be described below. Lambda vectors have the virtue that they can be
amplified by lytic growth of the phage, thus providing an excellent source of DNA; they can carry larger
inserts than many plasmids; and they can be introduced into cells by phage infection, avoiding the
problems of transformation of large pieces of DNA. Lambda requires about 60% of its genome to be able
to form plaques. It also has a requirement for the size of the packaged DNA, which must be 78-105% of
the size of wild-type lambda. This property is used in in vitro packaging, whereby extracts of two different
strains (each infected with a different phage mutant and unable to perform packaging by itself) are mixed
with appropriately sized DNA fragments carrying the cos sites. Given the constraints of the packaging
system, the desired DNA is preferentially packaged. The extracts complement and package up to 0.5% of
the input DNA. The virtue of this is that you do not need to use transformation to get large pieces of DNA
from the test tube into the cell.
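The packaging window does the arithmetic of insert selection for you. A minimal sketch of that arithmetic in Python (the arm and insert sizes are hypothetical; the 48,502-bp lambda genome length and the 78-105% window are from the text and the literature):

    # Sketch of the lambda packaging size constraint; only ligation products
    # inside the window are packaged efficiently.
    LAMBDA_BP = 48502
    LOW, HIGH = 0.78 * LAMBDA_BP, 1.05 * LAMBDA_BP   # ~37.8-50.9 kb window

    vector_arms_bp = 30000                 # hypothetical left + right arm total
    for insert_bp in (2000, 9000, 15000, 25000):
        total = vector_arms_bp + insert_bp
        print(f"insert {insert_bp/1000:4.0f} kb -> {total/1000:5.1f} kb; "
              f"packaged: {LOW <= total <= HIGH}")
    # With 30-kb arms, only inserts of roughly 8-21 kb give packageable DNA,
    # so packaging itself selects for insert-bearing molecules.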
There are lambda derivatives that support the lacZ screen (looking for colored plaques). Some
derivatives have cloning sites in the lambda cI gene, whose presence prevents the phage from plating on
hfl- Ec, allowing the selection of inserts. Similarly, only Spi- phage (lacking the red and gam phage
recombination genes) are able to plate on P2 (another temperate phage) lysogens. This allows the
selection of phage that have had the red gam region replaced by an insert. Selections have the small
problem that you will get things that survive the selection, but they may be due to mutations in the
counter-selected gene, rather than due to insertions.
(iv) Cosmid vectors: Cosmid vectors contain a plasmid replication system as well as a lambda cos
site, which allows DNA to be packaged in vitro, thus avoiding the transformation of large plasmids. Such
vectors are specifically designed for the cloning of larger pieces of DNA than lambda vectors can carry
(~24kb). Since the packaging system demands a certain sized piece of DNA, cosmids constitute a
demand for large inserts. They are often used by ligating a fragment carrying a replicon, a drug-resistance
marker, and a cos site with a partial digestion of DNA. The product is then packaged in vitro and used to infect cells,
selecting drug resistance. Particularly nice features of some cosmid vectors include (a) lack of homology
of the replicon to other replicons you are using, so that there are no compatibility problems and the latter
plasmids can be used in hybridization assays with the cosmid and not detect vector sequences. (b)
Presence of promoters reading into the cloning region. (c) Some cosmids use phage origins of replication
that can work better for large plasmids than do small plasmid replicons. pLAFR1 is a pRK290 derivative
with cos and mob so that it can be packaged in vitro and moved as a phage, or mobilized as a plasmid by
pRK2013, which supplies tra functions. There are a variety of problems with the use of cosmids, and they
are best used when you need to clone fragments larger than the lambda vectors can handle.
(v) Single-stranded DNA vectors: (based on M13, f1, or fd) These phage are approximately 6400
nt long and they replicate by forming a double-stranded DNA circle (termed the replicative form or RF)
and then making many copies of the (+) strand by "rolling circle" replication. These are immediately
coated by the gene V product and this nucleoprotein complex moves to and extrudes through the
membrane, with the gene V product being replaced by the capsid protein in this process. Since these are
male-specific phage, the host cells need to contain an F+ factor. The great virtues of these phage are that
they produce large amounts (up to 10^12 phage particles per ml in the medium) of a given single strand of
DNA with few size constraints (since there is not a defined head that needs to be filled). These vectors
are sometimes used as double-stranded plasmids, with the inserts cloned in and then transformed. Some
versions have been set up for use of the lacZ screen. These phage used to be enormously useful as a
source of DNA, but PCR has largely obviated their use.
A variation of these is the particular case of the phagemids, whose basic design includes a ColE1
replicon, a drug-resistance marker, and the sites necessary for viral DNA replication and packaging. Under normal conditions,
this vector replicates as a double-stranded plasmid. When cells carrying this plasmid are then infected
with a normal phage, rolling circle replication begins, causing the synthesis and packaging of the (+)
strand as determined by the orientation of the phage region on the phagemid. For a variety of reasons, a
common problem is poor production of the desired single-stranded DNA.
(vi) Suicide plasmids: Plasmids have a limited range of hosts in which they can replicate. If
plasmids are moved into an organism where they cannot replicate, then they are termed suicide vectors
and can be quite useful. Generally they are moved from a cell where they can replicate to one where they
cannot by conjugation, since this process does not require replication in the recipient. Such plasmids can
be vehicles for transposon mutagenesis (LT7) since the only drug-resistant cells will be those where the
transposon has moved from the non-replicating vector DNA to the chromosome. Alternatively, if there are
regions on the plasmids that are homologous to the chromosome, then they can integrate by homologous
recombination. The utility of this is discussed in the last section of this LT.
Types of yeast vectors:
There are two commonly used types of replicons that function well in Sc: replicons that act like
chromosomes in that they show 2:2 Mendelian inheritance and exist in single copy, and replicons that
exist at higher copy numbers and show a non-Mendelian inheritance. These are discussed in more detail
in LT9. One can also introduce suicide vectors in yeast, where they behave very much like non-replicating
plasmids in bacteria: if there is a selection for a gene they carry, one sees homologous recombination that
integrates them into the chromosome.
Other vector uses.
(i) Promoter/terminator probes: These are plasmids that contain a cloning site (MCS, preferably)
so positioned relative to a reporter gene that inserted DNA can be assayed for the presence of promoters
or terminators.
(ii) Shuttle vectors: Alluded to above, these are plasmids that have the ability to replicate in (very)
different hosts. Typically these have separate ori and rep systems appropriate for the desired hosts, e.g.
gram-positive/gram-negative or bacterial/yeast etc.
(iii) Promoters for expression vectors: These are partly designed for very high expression, but
also for very low expression under certain conditions, since the overexpression of a cloned gene product
might be deleterious to the cell.
The lambda PL promoter is regulated by the cI gene of lambda. It is a very strong promoter that
can be fairly well shut down, if the cloned functions are deleterious to growth. It is often regulated using a
ts version of cI (cIts857). This has the drawback that the temperature shift also induces the heat-shock
response. Another solution is to use wild-type cI and induce with mitomycin C or nalidixic acid.
Ptac is a very strong promoter that is a hybrid of the trp and lac promoters, but is still regulated by
the lacI gene product, which is inducible with IPTG. The system is rather good at shutting off unwanted
synthesis if a lacIq mutation, giving 10x more LacI product, is used. An even more tightly controlled
variant, designed by Szybalski's lab, has the promoter flanked by the lambda attP and attB sites, with the
promoter directed away from the cloned region. Induction of Int synthesis leads to inversion of the
promoter and allows expression (Gene56:145[87]).
Certain T7 promoters are only recognized by the T7 RNAP. The desired gene is cloned
downstream from such a promoter on a plasmid with the synthesis of T7 RNA polymerase (gene 1) under
the control of Ptac. Induction of this latter promoter allows the synthesis of T7 RNA polymerase and the
expression of the desired gene. Alternatively, T7 gene 1 can be introduced on a mutant phage to induce
expression.
Host strains. The choice of a host strain depends on the vector system being employed, but there are
several common properties that can be useful. The frequency of plasmid transformation is increased
dramatically if the recipient lacks a restriction system (hsdR). Occasionally, lack of a modification system
is also useful (hsdS). The recA, recBC, and recF pathways can all allow inserts to undergo recombination
with resulting loss in genetic material. There is usually a selection for this, since most plasmids want to be
as small as possible in competition for the replication machinery, so the use of Rec- strains reduces this
instability. Finally, many vectors, especially the lambda derivatives, grow only on amber suppressor-containing
strains. This is a remnant of the earlier fears over containment of genetically engineered vectors.
Genome sequencing. The complete sequence of many hundreds of microbial genomes are now
available. Probably the best way to check what is available is through the Venter web site:
http://www.jcvi.org/cms/research/groups/microbial-environmental-genomics/. Obviously sequencing is
getting faster, cheaper and more automated and the speed and cost will affect what we use it for. As of
April 2010, the use of Illumina machines has brought the cost of sequencing a previously uncharacterized
yeast genome (10 Mb) down to $200 and that of a similar genome for which a sequence is in hand (i.e. to
look for mutations) down to $50.
This will cause us to rethink not only how we do genetics, but also what experiments are now possible. A recent
and very nice review of both cutting-edge sequencing and the scientific implications is in NatRevMicro
7:287[09]. Given a sequence, the game then becomes a matter of deciding what conclusions can
reasonably be drawn from that sequence, as well as what testable hypotheses are worth examining.
These are critical and non-trivial questions, as sequencing is revealing vast numbers of homologs (i.e.
genes whose sequences indicate that they must be evolutionarily related); the question then becomes
whether or not the gene products have an identical or a similar function.
PCR. This approach relies on the temperature-resistant Taq polymerase and two short oligos chosen
to act as primers for DNA synthesis off opposite strands at a reasonable distance from each other (100-5000 nt).
Repeated cycles of heating (to denature all hybrids) and cooling (allowing the primers to find
their targets and direct polymerization) yield dramatic amplifications of the target DNA. The uses of PCR
include, but are not limited to: (i) the generation of specific double-stranded DNA sequences for probes;
(ii) the amplification of uncloned DNA by primers based on information obtained elsewhere (cDNA clones,
related genes, homologous genes from other organisms); (iii) the generation of DNA for direct
sequencing; (iv) the sequencing of mutations, generated through either classical or inverse genetic
means; (v) the construction of mutations by building them into the primers themselves, with subsequent
recloning into the genome; (vi) the construction of all manner of hybrid transcripts, genes, etc, again by
the use of appropriate primers.
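As a toy illustration of the primer logic, the sketch below locates two primers on opposite strands of a hypothetical template and reports the predicted amplicon; real designs must also weigh Tm, GC content and secondary structure, all of which are ignored here:

    # Toy in-silico PCR on a hypothetical template: find the amplicon that
    # two primers define. Mismatches, Tm and secondary structure are ignored.
    def revcomp(seq):
        return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

    template = ("ATGGCTAGCTTACGATCGATCGGGCTAGCTAGCTACGATCG"
                "ATCGATCGATTACGCGCGCTAGCTAGCATCGATCGATCGAA")
    fwd = "ATGGCTAGCTTACG"   # anneals to the bottom strand, extends rightward
    rev = "TTCGATCGATCGAT"   # written 5'->3'; anneals to the top strand

    start = template.find(fwd)
    end = template.find(revcomp(rev)) + len(rev)
    print(f"amplicon: {end - start} bp")          # 82 bp in this toy case
    print("copies after n cycles ~ 2**n per template molecule (ideally)")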
PCR has also been useful for the amplification of ancient DNA, which may provide some insight
into evolution (Nat352:381[91]). DNA sequence analysis of a 1.2 x 10^8-year-old weevil has been reported
(Nat364:536[93]), though others have argued that DNA is simply not that chemically stable.
Although Kary Mullis received a Nobel Prize for inventing PCR, it was actually very clearly
described in a Khorana paper in 1971 (JMB56:341[71]), despite Mullis' claims to the contrary. The most
significant thing Khorana lacked was the temperature-resistant polymerase; see a discussion of the Cetus
patent in Nat350:6[91].
Generation of mutations in vitro.
Site-directed mutagenesis, involving the use of specifically synthesized oligos. This is the precise and
premeditated alteration of the genotype with its subsequent phenotypic characterization. This is most
useful when the structure of the product is known, allowing structure-function assignments. Indeed, this
sort of approach is largely useless unless you have a fair amount of molecular understanding
of the gene product. A critical problem in the current misuse of this approach is summed up in the
question, "did the mutation alter the phenotype because of the nature of the substituted amino acid or the
absence of the wild-type one?" On a more positive note, however, you do certainly get a direct answer to
the question posed.
Almost all such mutageneses employ PCR. This scheme utilizes either 3 or 4 oligo primers. Two of
these are complementary to regions on either side of the target site and the other 1 or 2 encode one or
both strands of the mutant target site. The mutant product is amplified by PCR and then cut with
restriction enzymes and cloned into the desired vector.
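A sketch of the mutagenic-primer idea (the sequence, codon position and flank length are all hypothetical): the desired change is buried in the middle of the oligo so that perfectly matched flanks still anneal.

    # Build a mutagenic oligo: center the changed codon with ~15 nt of
    # perfect match on each side. All values here are hypothetical.
    template = "GCTTACGATCGATCGGGCTAGCTAGCTACGATCGATCGATTACGCGCGCTAGC"
    codon_start = 24        # 0-based position of the codon to alter
    new_codon = "GCG"       # e.g. substitute Ala at this position
    flank = 15

    primer = (template[codon_start - flank:codon_start] + new_codon
              + template[codon_start + 3:codon_start + 3 + flank])
    print(primer, f"({len(primer)}-mer, mismatch centered)")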
Localized mutagenesis. As with any version of localized mutagenesis, this allows the heavy mutagenesis
of a particular region without significant damage to the rest of the genome. The goal is to target only a
single region for intensive mutagenesis, followed by a phenotypic screen. It is employed when a region is
known or thought to be important, but the specific critical residues are unknown. The latter point makes
this more akin to classical, rather than inverse, genetics. This is an especially good idea when
spontaneous or random mutageneses yield an undesired event much more frequently than the desired
event due to target size.
"Poisoned" chemical synthesis, which is the addition of a small amount of an incorrect nucleotide
to one or more of the other nucleotide stock bottles, allows error frequency to be defined by the nucleotide
mix used (an example of "saturation mutagenesis" {1.7% errors} by this method is in JBact171:4852[89]).
Alternatively, you can synthesize the target region with a 20% inosine poisoning at each position, clone
into a vector (the authors cleverly chose a small fragment with restriction sites at each end), transform
and let errors occur in vivo. The result was about 2 errors per 24-base region, mostly of the A-to-G type
(MGG214:62[88]).
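The mutational load from any doped synthesis follows the binomial distribution, which makes it easy to pick a doping level; a quick sketch (the 60-nt length is an arbitrary example; the 1.7% rate is the figure quoted above):

    # Errors per n-nt region from doping at per-position rate p ~ Binomial(n, p).
    from math import comb

    def p_k_errors(n, p, k):
        return comb(n, k) * p**k * (1 - p)**(n - k)

    n, p = 60, 0.017                    # 60-nt region, 1.7% doping per position
    print("mean errors:", n * p)        # ~1.0 per molecule
    print("P(no error):", p_k_errors(n, p, 0))   # ~0.36: a third stay wild type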
Another variation allows mutagenesis of a large region (i.e. more than chemical synthesis can
usually handle) by synthesizing overlapping 60-nt single-stranded fragments in a mutagenic way and then
annealing these. A related approach, termed Staggered Extension Process (StEP), uses in vitro genetic
recombination, which allows the reassortment of mutations created through error-prone synthesis
(TIBiot15:523[97] & NatBiotech16:258[98]). By this approach multiple mutations in different regions of a
gene can be recombined and tested. Obviously a vast amount of useless stuff is created and a highly
efficient selection or screen for interesting variants must be available.
PCR is naturally a bit error-prone and the error rate can be enhanced by increasing the level of
metal ions or altering the ratios of nucleotides (MethMolBiol;2313[03]). This allows the mutagenesis of
virtually any region that can be PCR-amplified and cloned into the homologous unmutagenized region for
further analysis. Taq polymerase is a bit error-prone, typically causing A-to-G transitions, but the error rate
can be increased by either using higher levels of Mg2+ (to stabilize mismatched base pairs) or by using Mn2+.
Mn2+ appears to have different effects depending on the level used, either through interaction with the
DNA itself or with the enzyme (Bioc24:5810[85]), and can apparently affect the proofreading of some enzymes
(JBC258:3469[83]).
Randomization. As argued in LT2, there are situations where no level of random mutagenesis is sufficient
when you want to test very different amino acid residues at certain positions. The redundancy of the
genetic code means that multiple changes will be required in a given codon, which is beyond the reach of
a reasonable dosage of mutagenesis. However, it is obviously possible to have your Biotech center
synthesize an oligo that is completely random (as a population) for some region of a gene. Now the
randomized region has to be small, <10 codons, or virtually nothing functional will come out. This
approach has the advantage that it certainly allows the detection of multiple simultaneous changes in the
region of interest, but it generates a lot of garbage, so a powerful selection/screen is essential if more
than a couple codons are tested. Clearly the larger the region that is randomized, the rarer the functional
products will be and therefore the greater the required power of the selection/screen.
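The scale problem is easy to quantify; a sketch (the clone-count range is a typical-order assumption, not a measurement):

    # Library sizes for full NNN randomization of n codons: 64 DNA variants
    # per codon (61 sense + 3 stop) but only 20 amino acids.
    for n_codons in (2, 4, 6, 10):
        print(f"{n_codons:2d} codons: {64**n_codons:.1e} DNA sequences, "
              f"{20**n_codons:.1e} proteins")
    # A routine transformation yields perhaps 1e6-1e9 clones, so full
    # coverage collapses somewhere past ~5-6 simultaneously randomized codons.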
So what do the data from such an approach tell you and how are they different from other sorts of
data? Let's say that you randomize two residues known to lie in the active site of enzyme X and screen
for cells that grow under conditions where X must be functional. You will probably see colonies that grow
well (as you certainly must get back at least some clones with wild-type residues) and probably some that
grow poorly, and quite possibly there will be a number of clones that don't grow at all (and which you
won't even know about unless you also plate the mutagenized population on conditions that do not
demand enzyme X function). You then pick perhaps 10 fast-growers and 10 slow-growers and PCR
amplify and sequence the relevant region (which is trivial because you know exactly what you
mutagenized). If in the fast-growers, one residue is always His, for example, then you can conclude that
this residue is critical for function. If it is also always present in the slow growers, then it is absolutely
essential for whatever level of function was necessary for detectable growth. But you might find that you
get only hydrophobic residues, or only small ones or... The point is that, in contrast to Ala scanning, for
example, you can use the approach to completely determine the rules for functionality at a given position.
If there are synergistic effects (for example, His is OK at position 5 if and only if there is a Gly at position
6), then these can be revealed as well, assuming that the relevant codons were all simultaneously
mutagenized.
There is another interesting utility to this approach: If you want to alter the behavior of a protein in
a specific way (causing the protein to work on a rather different substrate, for example), you might well
need to change more than one residue and change each of them dramatically. Again, this cannot be done
by either site-directed mutagenesis or by random mutagenesis, but it is possible through
randomization, if you target the right residues and have a good screen or selection for the mutants with
the right protein function.
When to employ site-directed mutagenesis. You can make all sorts of specific changes in proteins,
but only a small fraction of these will yield useful information about the protein's structure and function. In
general loss-of-function mutations are not so informative, since there are many ways to destroy function,
even in functionally non-critical regions of the protein. You therefore need a good structural insight into
what you are changing in order to interpret the result of an inactive protein. A description of some of the
arguments and considerations is given in Sambrook, p15.82. Even with that information, you probably
need to biochemically characterize the mutant product. You are on firmest ground if you can demonstrate
that you have altered one function of the protein with the retention of others, thus arguing that you haven't
simply wreaked havoc on the entire structure. Certainly gain-of-function mutations have the clearer
interpretation. Examples of such functions include ability to bind or process substrates, ability to interact
with other proteins or form complexes, and the ability to be stably accumulated and recognized by
antibodies. Obviously you can make the most intelligent changes if the crystal structure of the product is
known. Remember that when an amino acid change has an effect, it could be because of the loss of the
wild-type amino acid or the introduction of the new one (which is why randomization, which tests many
substitutions, is often an attractive approach).
In vivo analysis. Virtually all of the above sections have described the construction of specific mutations,
not mutants (that is, changes in DNA and not organisms with those changes). Somehow, the products of
the mutated region must be analyzed to determine the functional effect of the mutation. In some cases,
where protein biochemistry is the theme of the research, it might be appropriate to use the cloned
mutation in an expression system and immediately purify and analyze the protein in vitro without
examining its behavior in the cell. Similarly, there may be occasions where overexpression in vivo is all
that is necessary for interpretation, so that the mutated region, on an expression vector, is merely
transformed into a cell. Presumably this would utilize a selectable marker on the vector itself, so the
generated mutation need not be selectable. However, if the critical question involves the physiological role
of the gene product in the cell, then the constructed mutation ought to be introduced into as
physiologically normal a situation as possible. This implies the use of nearly normal levels of expression.
This might be done with low-copy vectors or lysogenic phage, where the mutated gene is placed either
under its own promoter or under a tunable one that can be adjusted to a reasonable level. As before,
selectable markers on the vector allow analysis of both selectable and non-selectable mutations.
For the most proper analysis, however, the mutant allele should be used to replace the normal
chromosomal allele. In such a case, the mutant allele is in the correct copy-number, under the normal
promoter, and in the appropriate chromosomal environment (considering supercoiling and the like). If
such care is not taken, such that expression levels are unlike those in the wild type, then the resulting
phenotype might be because of expression per se and not because of the mutant allele. There are two general
approaches to performing this gene replacement using a selectable marker in the gene of interest.
(i) Non-replicating (suicide) plasmids: Introduction of the region with the selectable marker in your gene
(you want a Kb or so on either side of the selectable marker, so that recombination will be fairly frequent)
on a non-replicating plasmid (with another scorable marker, such as drug resistance, carried by the
plasmid but not within the region of interest), selecting the marker within the target gene. This will yield
two sorts of events: single recombination events resulting in the integration of the entire region, and
double recombination events (on either side of the selected marker), which lead to the replacement of the
wild-type allele with the mutated one from the plasmid. The suicide plasmids mentioned here are typically
ones that are mobilized into the strain from an E. coli donor. As a consequence, the plasmids typically
have the mob region cloned into the plasmid backbone and they are mobilized from a strain with the tra
genes somewhere in the donor's genome. See the section on Uses of plasmids in LT9.
Replacement of the wild-type allele in the chromosome with non-selectable alleles is a bit trickier,
but of great importance in many cases. One typically wants to study the effects of the damage to a single
gene and not worry about possible effects on the expression of other transcriptionally downstream genes,
and selectable markers will also most certainly perturb downstream expression. (Note that even if a
promoter is provided 3' of the marker, that by no means predicts that expression of the downstream genes
will be proper; if it is too low or even too high, such aberrant expression might well affect the phenotype
and therefore provide an erroneous interpretation of the phenotype of the original mutation.) There will
also be times when one wants to introduce a very specific sort of change (for example, it might be highly
interesting to determine the effect on cell physiology of a site-directed mutation that affects the region
of a protein involved in allosteric control), and these can never be created through selectable genes as
described above.
As above, the mutated (non-selectable) allele can be introduced in two ways, by variations on the
above methods: Using suicide plasmids (see Fig. 4-1), one creates the desired non-selectable mutation in
the gene of interest and then selects for plasmid integration, using another selectable marker on the
plasmid, but outside the cloned region, creating a strain merodiploid for the region of interest. One then
grows these cells without selection, which allows homologous recombination between the duplicated
regions carrying the two alleles. If the desired mutation is centrally located in the cloned region,
approximately 50% of the recombinants, which are recognized with a screen for those that have lost the
plasmid marker (i.e. drug-sensitive strains), will retain the wild-type allele and the other 50% will retain the
desired allele. Since the desired class is so common, it can be found by screening a few colonies by
sequencing (or by restriction analysis, if one has altered a restriction site).

Figure 4-1. Impact of the integration of cloned genes on the genetic region. In order to ensure that at
least one copy of all genes will be functional and expressed in the merodiploid, the region cloned on the
suicide plasmid must not be internal to a single transcribed region. In the figure, X, A, B, C and Y refer to
genes, where A, B, and C are transcribed from promoter P. The small numbers designate sections within
given genes. The large X represents the approximate site of recombination, though that can occur
anywhere in the homologous region without affecting the outcome. The faint dotted vertical lines simply
show the alignment of the ends of the cloned region with the homologous region in the chromosome. (A)
When the cloned region is completely within a given gene, the merodiploid will not only lack a functional
copy of that gene, but any downstream genes (C, in this example) will be separated from their normal
promoter. (B) When the cloned region carries parts of one or more genes but remains completely within a
given operon, the merodiploid will have at least one normal copy of every gene, but downstream genes (B
and C in this example) are separated from their promoter. (C) When at least one end of the operon is
contained in the cloned region, then at least one copy of the region will contain WT alleles that are
properly transcribed. Such a case is highly likely to yield a WT phenotype for the merodiploid.

A very similar method of allele replacement can be employed with yeast. Essentially one takes a
circular non-replicating plasmid with a
selectable marker and transforms it into a cell. This marker is often G418 resistance, where G418 is an
aminoglycoside antibiotic similar in structure to gentamicin B1 and blocks polypeptide synthesis by
inhibiting elongation. G418 can be inactivated by phosphorylation by the enzyme encoded by Tn5 that is
also capable of attacking neomycin and kanamycin. Selection for the marker demands that it become
associated with a replicon, which will occur through plasmid integration. Note that the fact that the replicon
with which it recombines is linear does not perturb things at all, though a linear vector (as the suicide
plasmid) would certainly not work, since recombination between that piece of DNA and the linear
chromosome would effectively break the chromosome. Other antibiotics that work in yeast are hygromycin
B, blasticidin S, phleomycin, and puromycin and note that the drugs that also inhibit bacterial translation
inhibit the mitochondrial ribosomes in yeast but not the cytoplasmic ones.
There are some lower eukaryotes in which homologous recombination is the rare exception: if
one clones a selectable marker into a cloned version of a gene carried on a non-replicating plasmid and
introduces it into the organism, only ~1% of the strains with the selectable marker will be the desired gene
replacements with the rest being integrants of the entire plasmid at apparently random sites in the
chromosome. This problem can be solved by having a counter-selectable marker on the vector outside
the cloned region of homology. An example of the latter is the gene for hypoxanthine-xanthine-guanine
phosphoribosyl transferase (HXGPRT), which can be selected against by the presence of mycophenolic
acid. By this method, greater than 50% of the mutants with the selected phenotype are the result of
homologous recombination.
(ii) Introduction of linear DNA: Linear DNA fragments of the target gene with a selectable marker can be
introduced into bacteria in the presence of a recombination system from phage lambda, with the result
that the region is reciprocally recombined into the chromosome at high frequency (PNAS97:6640[00],
JBact182:2336[00], Gene379:109[06]). A version of the approach is depicted in Fig. 4-2. A cloned region
is flanked by relatively short regions of homology, introduced into the recipient by transformation or
electroporation, and a selectable marker on the linear fragment is demanded, which requires
recombination with some replicon. In bacteria, however, this will fail if the recipient cell is WT because the
RecBCD and SbcCD nucleases will rapidly degrade the linear double-stranded DNA. However, in these
schemes, the red system of phage lambda is expressed either from a prophage or from a plasmid, and
the Exo, Beta and Gam proteins encoded by red block these nucleases, process the target DNA and pair
the complementary regions (see the above references for details). Note that this method demands a
selectable gene and therefore does not provide a simple non-polar substitution as the suicide plasmid
method does. However, Wanner (PNAS97:6640[00]) created a cute modification that allows the
subsequent elimination of the selectable marker by the FLP recombinase. This still leaves an 82- or 85-bp
scar (which is actually a chunk of the plasmid that is retained), which can be chosen to have little or no
polar effect. Because this method relies on stimulation of these cellular systems, it is inherently
mutagenic to the entire genome to some low degree.

Figure 4-2. Using PCR fragments as a substrate for in vivo gene replacement. An antibiotic-resistance
gene, antR, is flanked by two 40-bp regions in vitro by PCR (A). These added regions were chosen for
their homology to sequences internal to gene B (B). The PCR product is electroporated or transformed
into the recipient (C) and AntR is selected. Homologous recombination assisted by the red functions
yields the final product (D).

Note that the approach of using linearized fragments with a selectable marker works well with
Saccharomyces. However, with some very important pathogenic fungi, such as Blastomyces and
Histoplasma sp., homologous recombination is very poor. The result is that trying to move selectable
markers into the chromosome is very hard. Mutants with the selected phenotype will occur, but typically
fewer than 1% of them have recombined into the site of the WT gene, so lots of secondary screening is
necessary.
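A sketch of how such knockout primers are typically assembled (the homology arms are placeholders; the cassette priming sequences shown are the commonly used pKD-style sites and should be checked against whatever template is actually used):

    # Lambda-Red knockout primers: ~40 nt of chromosomal homology fused to
    # ~20 nt that primes on the resistance cassette. Arms are placeholders.
    up40   = "A" * 40        # stands in for 40 nt upstream of the target
    down40 = "T" * 40        # stands in for revcomp of 40 nt downstream
    cas_f  = "GTGTAGGCTGGAGCTGCTTC"   # P1-style cassette priming site
    cas_r  = "CATATGAATATCCTCCTTAG"   # P2-style cassette priming site

    fwd_primer = up40 + cas_f
    rev_primer = down40 + cas_r
    print(len(fwd_primer), len(rev_primer))   # two 60-mers to order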
In the past several years, this general approach has been automated to an amazing extent. For
example, systems have been designed such that the regions of homology (L1 and L2 in Fig 4-2c) are
synthesized as a single fragment on a chip, and therefore regions for every gene in the organism can be
synthesized simultaneously (and for about $1/gene as of 2010). These are then used to flank selectable
markers and create all possible mutants in a single pool. (The details of this are too tough to explain in
brief, but actually fairly straightforward. See NatBiotech28:857[10], especially Fig. 2. The usage of the
method in this paper is even more elaborate, since they try to turn up and down every E. coli gene. I am
unsure whether the physiology makes sense here, but the technology is very cool.)
(iii) RNA interference (RNAi): In many eukaryotes, including some yeasts, RNAi is very successful at
dramatically lowering the levels of accumulation of a specific mRNA, therefore mimicking the effect of a
loss-of-function mutation. The actual mechanism is complicated, but the procedure is to produce a
double-stranded RNA within the cell (or introduce it after production in vitro, but this is not practical in
yeast) that overlaps part of the target gene. This RNA is then processed to short sections, and through
interactions with RNA-induced silencing complexes, attacks and destroys the cognate mRNAs. Because
homologous recombination is not involved, this can be very useful in organisms such as the yeasts
mentioned immediately above. This approach has no utility in prokaryotes, as far as I know. There are
cases of antisense RNA having an inhibitory function on the translation of other mRNAs in bacteria, but this
has rarely been manipulated to modulate mRNA levels, probably because most prokaryotic mRNAs, and their
antisense counterparts, are short-lived.
607 Lecture Topic 5........DELETIONS
Deletions are defined as the loss of three or more bases relative to the wild-type sequence; clearly the distinction between frameshifts and very small deletions is arbitrary. They are detected with
fairly high frequency (10^-6 might be typical, but this number is very dependent on the region of the
chromosome examined). There are thought to be three general methods of deletion formation: (i)
Recombination between similar or identical sequences that lie at different places in the genome. This
mechanism is RecA-dependent and can involve large regions. (ii) Slippage between the template and the
newly synthesized strand during replication, such that a region is skipped or synthesized twice. Deletions
by this mechanism will presumably be hundreds of bases or less in size and require that somewhat
similar sequences exist in a direct repeat in the genome before the mutation to stabilize the slipped
structure. (iii) Recombination between dissimilar sequences anywhere in the genome. This mechanism is
RecA-independent and the actual enzymes involved remain unknown.
Most spontaneous deletions seem to be generated with at least some regard to small regions of
homology and, as will be mentioned below, deletions can also be stimulated by the presence of insertion
sequences. As with duplications (LT6), most deletions depend on the recA system for their creation. In
one well-studied case, deletion formation was reduced twenty-fold in a Rec- background (reminiscent of
the Rec- effect on duplication formation). The reason for this is probably that recombinational events
simultaneously create one progeny with a deletion and the other with a matching duplication (see Fig. 6-1
in LT6). The point here is simply that both legitimate and illegitimate recombination can give rise to
deletions and duplications, so that for the average region, homologous recombination accounts for about
95% of both types of event.
Deletions can also generate either transcript or protein fusions, where non-identical transcripts or
genes are fused together, respectively. Finally, deletions can be of almost any size, with the constraint
that your strain will be dead if you delete an essential function. Deletions may well occur at the same
frequency as duplications, but are detected less often because they will often be lethal events. The
general topic of the mode of deletion generation is addressed in ASM2,2256[96].
It appears highly likely that virtually all of the claims made here about deletions arising in
conjunction with duplications are similar for many higher organisms including humans. At least three
dozen human disorders have been shown to result from non-allelic homologous recombination
(TIG18,74[03]).
Analysis of the mechanism of deletion formation. The sites at which deletions occur can say
something about the mechanism of their occurrence. While it is generally true that a Rec+ phenotype is
worth at least 20-fold in the frequency with which a random region is deleted (Cell29:319[82]), some
deletions occur between highly similar regions while others occur in regions with little or no detectable
similarity (see the mini-review on illegitimate recombination in Genet115:581[87]).
RecA-independent deletions are also observed in regions that have short tandemly repeated
sequence, with the result that one copy is deleted. A specific analysis of the length dependence of direct
repeats on deletion formation in phage T7 showed that 5-bp repeats gave deletions at 10^-10, while 10-bp
repeats gave them at 10^-6, and all these events were RecA-independent. Deletions at short repeats that
are found within a few hundred bp in the genomes are consistent with a copy-choice mechanism
(MGG212:450[88] & EMBO8:3127[89]) in which the template strand slips with respect to the new strand
and two different copies of the repeated sequence pair with each other, such that the replication proceeds
after copying only one. On the other hand, at least some RecA-independent recombination does occur
and its magnitude is a function of the specific region examined. Some recombination, including that
between sister chromosomes, might be mediated as post-replication DNA repair (Genet135:631[93]).
The effect of the size of the region to be deleted has been examined by flanking variously sized
inserts with an identical 10-bp direct repeat. This was done in T7, with differently marked phage, so it
could be determined if the events were inter- or intramolecular (the former would yield phage with markers
from both parents). Surprisingly, there was a strong effect of the size of the region to be deleted: there was little
effect of differences in the <90-bp range, but an order of magnitude drop above that size, with a more
gradual effect up to 1 kb, the largest size examined. All events detected were the result of intramolecular
events and there was no recA effect, consistent with the notion that RecA is not fond of very short
stretches of identical sequences. These results are most easily rationalized in terms of a copy-choice
model (JBact173:869[91]).
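Because deletion frequency depends so strongly on repeat length and spacing, it can be useful to scan a region for direct repeats that could serve as deletion endpoints; a brute-force sketch (toy sequence, arbitrary cutoffs):

    # Scan for direct repeats >= min_len whose copies lie within max_gap;
    # such pairs are candidate endpoints for slippage-mediated deletions.
    def direct_repeats(seq, min_len=8, max_gap=300):
        hits = []
        for i in range(len(seq) - min_len):
            word = seq[i:i + min_len]
            j = seq.find(word, i + 1, i + 1 + max_gap + min_len)
            if j != -1:
                hits.append((i, j, word))     # overlapping seeds are reported
        return hits

    seq = "ACGTACGTTTGCATCGATCGGGACGTACGTTTAACCGGTT"   # toy sequence
    for i, j, word in direct_repeats(seq):
        print(f"'{word}' at {i} and {j} (spacer {j - i - len(word)} nt)")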
Hunts for deletions can themselves be strongly prejudicial, either because a particular end point is
demanded (or at least an end point in a particular region) or an end point is prohibited (because it would
delete an essential function). This latter point is an area where the analysis of duplications is probably
instructive, since they are much less likely to be detrimental, yet share common themes in their
generation. While there is some effect of palindromes on deletion formation, at least when they are
immediately adjacent to direct repeats, the mechanism is unclear (JBact173:315[91]). There is also a
potential influence of secondary structure within the region to be deleted, at least in the model case of
non-tandem duplications on a plasmid (Genet134:409[93]).
It has been argued that regions that are prone to Z-DNA structure are hot spots for deletions
(PNAS86:7465[89]), but is this because Z-DNA participates in deletion formation or because its presence
is a problem for the replicon and there is a selection for deleting a Z-prone region? Lastly, it has been
shown that the frequency of a specific deletion event in a given region can vary when that region exists in
different chromosomal or plasmid environments (Genetics126:17[90]). The reasons for this context effect
are unclear but may reflect superhelicity.
There is clearly not one set of rules for deletion formation that is valid for all regions. Specifically,
if chromosome restructuring is involved, some events may be disallowed by the mechanism, not by the
resulting genotype/phenotype (Sci241:1314[88]). Also, most of the above arguments are for random,
spontaneous deletions (whatever either of those terms means at this point). There are a few cases of
site-specific deletion events where a specific gene product is known to be required for this developmentally
regulated event: e.g. Anabaena and nif deletions driven by the xis gene product (see JBact171:4138[89])
and deletions generating functional sporulation genes in Bs. In these cases the deletion events are
necessary for the functionality of a given coding region. These events are also restricted to terminally
differentiated cells - not surprising for an irreversible genetic event.
There should be a caveat with a number of these studies: for obvious technical reasons, many
were performed on plasmids, and it is not clear that recombination in plasmids is representative,
especially in terms of frequencies, of recombination in the chromosome. In part this is because of
recombination systems specific to plasmids (resolvases), but also because of differences caused by
replication and by plasmid transfer between cells (ASM2, 2265[96]). For example, there was dramatic
stimulation of deletion formation on a plasmid when replication rates were increased modestly
(Genet154:971[00]).
Examination of mutants affected in deletion formation. Another way of gaining insight into the
mechanism of deletion formation is to examine genotypes that stimulate (deletion of a specific region of
the Ec genome stimulates illegitimate recombination 10-fold, JBact170:2898[88]) or inhibit the frequency
of deletion formation. If this is to be statistically valid, you need to examine lots of events and therefore
you need to set up an assay system for the rapid identification of deletions. This then opens the door for
prejudicial effects as noted above. It's not that the results will be wrong, but rather they will not be as
interpretable as they would appear. For example, your selection might either demand or prohibit one or
more mechanisms of deletion formation to satisfy the phenotypic requirements.
(i) recA mutations reduce deletion formation frequency at least 20-fold in some assays, implying
that most deletions larger than a few base pairs are a direct result of RecA action. As noted above,
deletions between very short regions of adjacent repeats seem to be formed by a replication mistake
rather than recombination.
(ii) RecA/RuvA: recA ruvA double mutants display more reciprocal recombination than a recA
alone, consistent with a role of RuvAB in branch migration that has taken place without RecA
(Genet135:631[93]).
(iii) recBC sbcB: Eighty percent of the clones of Physarum DNA can only be maintained in Ec
recBC sbcB mutants. In one case the instability was localized to a 360-bp region of the sequence
T45(320-bp)-T50. In vitro this region unwinds under torsional stress (Gene48:133[86]). Might we be talking
about non-B DNA here? recF mutations also have small but significant effects (ASM2, 2260[96]).
(iv) Gyrase involvement: When DNA from Dictyostelium is cloned into Ec, 18/21 deletions were
found to fall between two of the six existing sequences reading A6PyGGCXGCCPuT6. These deletion
events are blocked by raising a gyrAts strain to the non-permissive temperature. The authors argued that
gyrase likes palindromes and cleaves there. So gyrase might be necessary, but is it sufficient? Is it direct
(MGG214:1[88])? Deletion of a gene for topoisomerase has also been found to stimulate at least some
classes of deletions, both on plasmids and in the chromosome (ASM2, 2262[96]).
(v) Mismatch repair and ssb: An analysis was performed of deletions (in E. coli) formed between
two similar but non-identical sequences carried on the same plasmid. Not surprisingly, because the
sequences were non-identical, there was an effect of mismatch repair (a 30-50-fold increase in mutS, mutL
and uvrD mutants; UvrD also has a role in mismatch repair) and a 250-fold increase with an ssb-3 allele.
The ssb allele also altered the hot-spot preferences for the deletion endpoints (Genet145:563[97]).
Screens/selections that enrich for deletions. Most of these treatments do not increase the number of
deletion-containing strains in the population, but rather their frequency in the population.
There are mutagens that stimulate deletion formation (certainly UV, JBact170:2898[88]). Also,
chemical adducts, constructed in vitro, can lead to in vivo-generated deletions (PNAS85:1043[88]). It is
unclear if these effects reflect the stimulation of deletion-prone enzymes, the saturation of repair systems
that normally prevent deletions, or even more indirect effects. This is one of the few treatments that might
actually increase deletion number.
Screens that demand only tight or polar mutants enrich for deletions and insertions and
produce a lower percentage of missense mutations (JMB30:81[67]).
Screens for loss of the products of two genes that are linked dramatically enrich for deletions because
a deletion will typically be the only single event to kill the function of both gene products (ASM p1001).
The tets/fusaric acid selection (JBact145:1110[81]). Tetr happens to confer sensitivity to fusaric
acid. Thus a selection for fusr demands loss-of-function mutations in tet, many of which are deletions.
Survival of heat induction of a temperature-inducible prophage often occurs by deletion; this is not
because of phage induction, but because of the loss-of-function selection.
Selection for loss of galK cloned into a target region (MGG206:35[87]) allows a loss-of-function
selection in galE strains. This is because GalK causes lethal levels of Gal-P to accumulate (in the
presence of galactose) when there is no galE gene product to process it to a non-toxic metabolite.
Selection for loss of the ccdB gene, whose product normally regulates cell division but can be
lethal when improperly expressed (BioTech21:320[96]).
The sacB gene can also be selected against, since it blocks growth on sucrose. It encodes
levansucrase, which synthesizes levan from sucrose, resulting in toxicity. Since it is not found in most
genomes, it can be introduced and used in screens in a variety of organisms (e.g. on a constructed
transposon, Gene78:111[89]).
A Tn5 with rpsL, conferring Strs, is also available (Gene99:101[91]) and can then be selected
against with exogenous streptomycin.
Transductional shortening of chromosomal or plasmid-borne regions is done by demanding that a
phage either carry regions more than one phage-length apart or carry markers of a replicon that is too
large to be packaged in its entirety. For this hunt, you need good (highly efficient) transducing phage so
you can find and package these relatively rare deleted regions.
Citrate/heat selection of specialized transducing phage selects for those that have less than a full
head and therefore contain DNA deleted for part of the region of interest (JMB56:369[71]).
Homologous recombination between ISs or other regions of homology will lead to genetically
defined deletions at high frequency.
Site-specific deletions on a plasmid can be generated by using PCR to amplify all of the plasmid
except the region targeted for deletion, followed by self-ligation of the amplified portion (NAR17:3319[89]).
Other in vitro methods of generating deletions were covered in LT4.
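The outward-primer arithmetic behind that trick is simple; a sketch (circular plasmid, hypothetical coordinates; only the pBR322-like size is a real number):

    # Inverse-PCR deletion on a circular plasmid: outward-facing primers
    # amplify everything EXCEPT the targeted region; self-ligation closes it.
    plasmid_len = 4361                 # a pBR322-sized replicon
    del_start, del_end = 1200, 1800    # hypothetical region to remove

    # fwd primer starts at del_end reading away from the deletion; rev primer
    # is the revcomp of the sequence just before del_start, reading the other way.
    product_len = plasmid_len - (del_end - del_start)
    print(f"amplify {product_len} bp, self-ligate -> plasmid minus "
          f"{del_end - del_start} bp")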
607 Lecture Topic 6......... DUPLICATIONS AND INVERSIONS
The nature of tandem duplications. Duplications are one of the least appreciated and yet most
important mechanisms of genetic rearrangement in enteric bacteria and there is no reason to believe
they are not common in other bacteria as well as in eukaryotes. Indeed, genome analysis of
Saccharomyces shows that it has evolved from a large number of non-tandem duplications, which have
then subsequently drifted through further mutation.

Figure 6-1. Creation of a tandem duplication (and a matching deletion). The figure shows two
daughter chromosomes and the small "x" refers to a sequence that exists in two places in the
genome such that Rec can occasionally mediate recombination between the non-identical copies
of that sequence. After recombination (and a subsequent homologous recombination event to
resolve the double-chromosome into two monomers), one daughter has a tandem duplication and
the other has a matching deletion.
Duplications are present in E. coli at a very high level:
approximately 0.1% of a culture is duplicated for a
given region of the chromosome and there is no reason
to doubt that most other prokaryotes behave similarly.
Because of this high frequency, you may safely assume
that any time you have a situation where a duplication
will satisfy your selection, that class is virtually all you
will be able to detect because they will predominate in
the population. This general set of topics is covered in ASM2, 2256[96].
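That steady-state figure is roughly what a balance between formation and loss predicts; a back-of-the-envelope sketch (the loss rate is the ~1% figure given below; the formation rate is an illustrative assumption, and selection is ignored):

    # Steady state when duplications form at k_form and are lost at k_loss
    # (both per cell per generation; values illustrative, not measured here).
    k_form = 1e-5      # assumed formation rate for a given region
    k_loss = 1e-2      # ~1% of Rec+ duplication carriers lose it per generation

    freq = k_form / (k_form + k_loss)
    print(f"steady-state duplicated fraction ~ {freq:.1e}")   # ~1e-3, i.e. ~0.1%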
In general, bacterial duplications are rather large, up to one-third of the chromosome, and in a
Rec+ background they are highly unstable (approximately 1% of a Rec+ duplication-containing culture will
have spontaneously lost the duplication) as they are typically tandem duplications and are lost by
homologous recombination. On the other hand, the absence of a recombination system (Rec-) stabilizes
duplications fairly well (10^-6 of a Rec- duplication-containing culture will lose the duplication due to
illegitimate recombination events). Tandem duplications are those where the two copies of the duplicated
region are immediately adjacent to one another and in the same orientation (Fig. 6-1). For circular
chromosomes, these duplications tend to be tandem because their mode of generation involves either
legitimate or illegitimate recombination between daughter strands in the same cell. Since legitimate
(homologous) recombination can produce duplications, at least in some regions flanked by homologous
sequences, fewer duplications are seen in a Rec- strain. The degree of this reduction varies, but a
reduction of 20-100-fold is typical. Figure 6-1 gives an indication of the result of a recombination event
between two non-identical regions. While not immediately obvious, to regenerate the two separate
daughter chromosomes in a proper topological sense, another recombination event is required elsewhere
in the genome. Duplication loss can occur by either intermolecular (Fig. 6-2) or intramolecular
recombination (Fig. 6-3). Both types of events restore a wild-type sequence, but the former leads to an
amplification in the other chromosome.

Figure 6-2. Intermolecular recombination between duplications yields one daughter that is normally
haploid while the other daughter has a triplication of the previously duplicated region.

Figure 6-3. Loss of a duplication by intramolecular recombination restores a WT sequence without
simultaneous creation of a triplication.
Duplications can have an effect on the phenotype by: (i) A dosage effect on the products of the
duplicated genes; this response obviously depends on the regulation of the affected genes. (ii) The novel
expression of a gene due to a fusion event occurring at the join point of the duplication (G&Dev1:227[87])
(Fig. 6-4). (iii) Duplication of a region also allows two different (and independently selectable) alleles of the
same gene to exist simultaneously in the cell (e.g. argX+ for an Arg+ phenotype and argX::Tn5 for Kanr).
While the above mechanisms might allow detection of a duplication (because of its effect on the
phenotype), it is the frequency of appearance and loss that typically provide the clearest indication of their
presence.
Genome sequencing has also revealed the existence of stunning levels of tandem duplications in
some higher organisms as well. Seventeen per cent of all Arabidopsis genes are arranged in tandem
arrays of two or more copies and analysis of entire chromosome segments indicates that 58% of the
Arabidopsis genome reflects readily detectable duplications (Nat408:796[00]). Similarly, it is now felt that
5-10% of the human genome might be duplicated (this and subsequent claims from TIG18,74[03]), which
might well be an underestimate if the duplications are large and extend beyond the BAC clones that are
used to generate the sequence (in other words, we might ignore the presence of a duplication larger
than our clones). It is also clear that at least three dozen different human diseases can be explained by
nonallelic homologous recombination between repeats that range from 0.8-500-kb in size.
Let's say that you have the duplication shown in Fig. 6-4 and you want to move it into a different
strain by generalized transduction or transformation (to be discussed later). Let's further assume that the
duplicated region was 200 kb and the biggest piece of DNA you could move between cells was only 100
kb (the limit of the phage head, for example). Could you move the duplication? The surprising answer is
yes, because all you need to move is the join point of the duplication. If this recombines appropriately with
the two daughter strands during replication, one product will be a chromosome with the duplication and
the other product will be a strange inverted beast that will die (you need to draw this out). Somewhat
amazingly, this transduction event is fairly frequent relative to normal events.

Figure 6-4. The novel join point in a duplication can lead to altered expression of some genes by
creation of a fusion. In this example, duplication of genes CDEF has placed one copy of gene C under
the control of the promoter 5' of FG.
Consequences of tandem duplications. Duplications have two general features: they cause a region to
exist in more than one copy and they create a novel join point, where the two tandem copies meet; either
or both of these properties can contribute to a phenotypic effect. Generally duplications cause no loss of
function, unless the duplication is within a transcript, leading to polarity, or within a gene, leading to loss of
that product's function as well as polarity. They can be of almost any size, with the proviso that you do not
duplicate the terminus of replication, a lethal event. Duplications can be moved from one cell to another
by any gene transfer system, even if that system carries much less than the region actually duplicated.
Such a duplication will only be selectable in the recipient if the duplication itself is selectable or if the
selectable marker is appropriately linked to the join point.
It has been argued that duplications/amplifications can be seen as a regulatory response to
specific problems (ASM2, 2271), and as an essential mechanism for some "adaptive" mutations (ibid,
2273 and Genet161:945[02]).
In a longer view, duplications are obviously the mechanism by which genomes produce the raw
material for evolutionary experimentation, as evidenced by the numerous examples of sequence
similarities among genes whose products have rather related functions. There might even be cases where
regions of homology are so positioned that they are able to promote amplification of the intervening
region by homologous recombination (see the case of the rhs gene of Ec, JBact172:446[90] &
171:636[89]). A number of cases exist where there are small tandem or non-tandem duplications within a
gene, presumably generating repeated similar domains within the gene product. A particularly interesting
case is that of algP of Pseudomonas aeruginosa (this gene encodes a regulatory factor controlling
mucoidy, a critical factor in cystic fibrosis mortality). The gene contains about 44 nearly identical tandem
repeats of a 12-mer encoding LysProAlaAla, a motif similar to that of a eukaryotic histone that is involved
in DNA binding. Not surprisingly, the region, and therefore the mucoid phenotype, is highly unstable, and
not all mucoid strains are identical in the actual number of repeats (JBact172:5544[90]). An example of
duplications providing growth advantage is the case of St growing on a limiting carbon source
(Genet123:19[89]). Nevertheless, because of the extreme instability of tandem duplications, I assume that
the rarer, but more stable, non-tandem duplications are actually the source of most gene copies
manipulated evolutionarily through mutation.
If duplications are so frequent, why do they not perturb chromosomal structure more? That is,
how can chromosomes appear to be so stable (e.g. the E. coli and Salmonella genomes are really pretty similar; ASM2,1715 and 1903[96])? At least part of the issue is that there is a subtle selective pressure for
genes to be where they normally are with respect to the origin of replication. This position determines their
relative dosage in the cell, for instance. It is also possible that duplications perturb the (unknown) regions
involved in nucleoid condensation and perhaps even the position of Chi sites, with implications for
recombination and therefore repair (ASM2, 2256[96]).
Duplications in eukaryotes. Numerous examples exist for duplications in eukaryotes:
(CurrOpGenetDev12:393[02], TIG17:299 & 661[01]). As noted above, genome sequencing of Arabidopsis
showed that 17% of all genes are arranged in tandem arrays of two or more copies and analysis of entire chromosome segments indicates that 58% of the genome reflects readily detectable duplications
(Nat408:796[00]). Genome analysis has allowed improved estimates of the rates of duplication formation
and loss (TIG17:237[01]).
Perhaps more important is the demonstration of various numbers of (CAG)n repeats as a basis for myotonic dystrophy, including its genetically unstable nature (Nat355:545ff[92]), and for the "fragile X" syndrome, a (CGG)n repeat, which causes retardation (TIG8:249[92]). Oddly, with myotonic dystrophy, the repeat is NOT in the coding region, but rather in the 3' untranslated region. Apparently the affected gene product is irrelevant to the phenomenon; instead, it is the accumulation of the aberrant mRNA in the nucleus that then directly affects the level of available MyoD, which is a master regulator for muscle differentiation. Consistent with this indirect effect, another group of patients with MD have a completely different CTG expansion in a completely different mRNA that leads to the same effect on MyoD (JCellBiol159:419[02]). Interestingly, while (CAG)n tracts in humans tend to expand, causing increased severity of disease, such tracts in mice are stable, and they tend to rapidly contract in yeast and E. coli. A nice summary of some of this literature is Genet155:1657[00].
Obviously the instability of such highly repeated codons might come either through copy-choice errors in replication or through unequal recombination events.
Role of DNA structure in duplication formation. Essentially all of the arguments concerning the sites
and enzymatic functions involved in duplications are the same as already covered for deletions, since
duplications and deletions are two products of the same sorts of events (ASM2, 2256[96]).
In a case of unclear generality, small duplications, some also involving tandem insertions, are
found frequently in T4 tRNA genes and in particular regions of IS2. In both cases, the involved regions
have significant potential for secondary structure and it is argued that these serve as targets for
endonucleases with subsequent aberrant replication (JMB204:27 & 38[88]). This could correlate with the
rather common occurrence of palindromic sequences seen at the sites of some deletions and duplications
(and suggest the possibility that certain secondary structures can stimulate recombination).
In a screen of duplications ending in lacI, 28/30 occurred at small imperfect repeats: 7/7,
20/25, and 13/16 bp matches (G&D1:227[87]). It has been argued that there are different mechanisms for this
apparent recombination in short direct repeats, depending on the length of the repeat: <10 bp repeats
tend to be involved in a copy-choice error in replication (wherein the template and product strands slide
with respect to one another), while >18 bp will be used in a breakage-reunion error in recombination
(EMBO8:3127[89]). Unquestionably, larger regions of homology tend to yield duplications through the
efficient rec system.
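Junction repeats this short are easy to find computationally. Here is a minimal Python sketch (a toy scanner for perfect direct repeats only; real junction analyses, like the imperfect 20/25- and 13/16-bp matches cited above, must also tolerate mismatches, and the example sequence is invented):

def direct_repeats(seq, min_len=10):
    """Return (first_pos, second_pos, repeat) for perfect direct repeats."""
    seen = {}    # first position of each word of length min_len
    hits = []
    for i in range(len(seq) - min_len + 1):
        word = seq[i:i + min_len]
        if word in seen:
            hits.append((seen[word], i, word))
        else:
            seen[word] = i
    return hits

# Toy example: a 12-bp direct repeat flanking a "unique" segment, the sort of
# configuration that could slip during replication to yield a duplication
s = "ACGTACGTGGCC" + "TTTTTTTT" + "ACGTACGTGGCC"
print(direct_repeats(s, min_len=12))    # [(0, 20, 'ACGTACGTGGCC')]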
Non-tandem duplications. If the duplication is non-tandem, then homologous recombination between the duplicated regions will cause the loss of the unique material encoded between the duplications (Fig. 6-5). For this reason, a non-tandem duplication will typically be stable, since the loss of the unique information between the duplicated regions is typically deleterious for the cell. Perhaps because non-tandem duplications require two non-homologous recombination events for their generation, they are much less frequent than are tandem duplications, but this may also be because they are difficult to recognize.
Figure 6-5. Creation of non-tandem duplications requires two non-homologous events.
However, I have come to believe that non-tandem duplications might actually be the major mechanism for evolving new genes, rather than tandem duplications. The reason is simple. Unless a tandem duplication confers a very striking advantage, it will be readily lost, which means that any duplicated gene will have a limited time window to evolve to a new function. In contrast, the relative stability of non-tandem duplications should "buy some time" for the duplicated gene to acquire a new function through additional mutations.
Non-tandem duplications in higher organisms presumably arise through retroposition in which
the reverse transcription of a processed RNA leads to a DNA copy without introns that subsequently
integrates (at random) into the genome. Such a copy will be silent unless it happens to integrate
downstream from a functioning promoter. Such genes might then be modified through mutation and might
eventually evolve into an expressed functional gene (see TIG15:304[99]).
There are numerous examples of non-tandem duplications in Ec: (i) argI and argF (both encoding OTCase) map at 97' and 7'; the argF gene is flanked by IS1 elements (MGG181:230[81]). (ii) thrA
and metL genes (encoding aspartokinase homoserine dehydrogenase) of Ec (see ASM p970). (iii) The
rhs genes of Ec (JBact171: 636[89] & 172:446[90]). (iv) The tuf genes of most gram-negatives and
Clostridia, but not other gram positives (JBact171:581[89]). (v) The two nitrate reductase loci of Ec:
narGHJI and narZYWV (MGG222:104[90]). (vi) Two nearly identical glutamate decarboxylase genes in Ec
(JBact174:5820[92]). (vii) Most obviously, the seven rRNA operons.
There are some cases of duplications internal to genes, as well. (i) Within thrA and metL, two
120-bp repeats specifying sequences with 44% amino acid relatedness. (ii) In thrABC, there is a 35 aa
sequence that is found twice in A, and once each in B and C, in roughly similar positions (ASM p970). (iii)
Duplications and deletions lead to a family of related "M" proteins in Streptococcus (JBC261:1677[86]).
(iv) The mvhB gene of Methanobacterium thermoautotrophicum, which has six tandem copies encoding
an 8-Fe/8S ferredoxin, each linked to the next by α-helices. This generated a poly-ferredoxin (a biological
battery?) (PNAS86:3031[89]). Obviously sequence analysis indicates that duplications have had a major
role in evolution.
There is an additional complication with non-tandem duplication in organisms that rely on sex for
their population dynamics. In such cases, there is no longer a precisely homologous chromosome in any other cell with which the duplication-bearing cell mates. This in turn can cause
problems in disjunction and can, one assumes, be a driving force in speciation.
Constructed non-tandem duplications. "Pseudo-tandem duplications" can be generated by introduction of
a selectable, non-replicating plasmid carrying a region homologous to the chromosome. The homology
directs the integration of this vector, yielding a merodiploid wherein the vector sequences are flanked by
two copies of the cloned region, one from the introduced vector and one from the cell's chromosome.
Maintaining selection for the drug-resistance cassette stabilizes these duplications. (Well, actually it doesn't really; it just kills off all the cells that happen to lose the duplication, and we see only duplication-containing strains in the population.) As above, this only causes a deleterious phenotype if the cloned
region is entirely within a given transcriptional unit.
Amplifications. Amplifications are often just a special case of tandem duplications that have simply
continued to recombine to achieve higher copy number. Amplifications are even more dependent on RecA than
are duplications for the following reason: duplications can arise by a single illegitimate recombinational
event, but amplifications only occur when there are repeated recombination events between tandem
duplications. Such repeated events simply cannot occur with sufficient frequency without a homologous
recombination system. Because amplifications involve multiple copies of a region, recombination between
non-identical copies of those regions in the daughter chromosomes can frequently lower the copy number
of amplifications.
Amplifications can come and go through recombination, but they will only accumulate at
detectable levels in a population if the presence of the amplification confers some selective advantage.
Such an advantage is necessary for their accumulation because there is certainly a cost to having an
amplification: even though DNA replication is cheap relative to some other metabolic needs, it is still a real
cost and will be selected against absent an advantage. Amplifications are typically found experimentally
by gradually increasing the demand on the organism for the encoded gene product; that is, providing an
advantage for strains with amplifications. A particularly nice case has been done in lac, where 50-100
tandem copies of 5-20 kb have been successfully demanded (G&D1:227[87]). A related example may be
occurring in Bs when Tet^r was selected following protoplast fusion: resistant colonies appear to result from
80-100-fold amplifications of an 11-30 kb portion of a particular chromosomal region. Possibly the
protoplast fusion causes multinucleate cells, making recombination more likely for the generation of
amplifications (JBact172:4936[90]). (It is unfortunately almost impossible to use PubMed to find articles
about amplifications, because it turns up zillions of PCR references.)
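To put a rough number on that replication cost, here is a back-of-the-envelope Python sketch (the ~4600-kb genome size and the 100 x 20-kb amplification are illustrative values, the latter borrowed from the lac case above):

# Rough cost of carrying a large tandem amplification
genome_kb = 4600                 # approximate size of an Ec-like genome (kb)
copies, unit_kb = 100, 20        # e.g. 100 tandem copies of a 20-kb unit
extra_kb = (copies - 1) * unit_kb    # DNA beyond the single native copy
print(f"extra DNA: {extra_kb} kb = {extra_kb / genome_kb:.0%} of the genome")
# ~1980 kb, i.e. roughly 40% more DNA to replicate every generation: cheap
# per base pair, but a real cost that is selected against absent an advantage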
A more special case seems to exist in Streptomyces spp, where there is apparently a dedicated
mechanism for the generation of amplifications of certain regions. In one such case, a plasmid has been
found that carries the genes involved in synthesis of, and resistance to, an antibiotic, as well as an AUD sequence (amplified unit of DNA). Following protoplast fusion, one sees drug^s strains that have either
deletions or massive amplifications (500 copies of 10-kb). The amplified region is bounded by 2.8-kb
direct repeats and is apparently turned on by deletion of something else on the plasmid
(MGG202:348[86]). This association with a large deletion (up to 800 kb) on one side of the AUD region is
a common theme in the Streptomyces amplifications (JBact171:5817[89]). The associated deletions are
generally quite complicated and they are the apparent result of duplications, inversions, and
translocations (JBact173:3531[91]). Spontaneous amplification of regions of pBR has been detected when that plasmid was introduced into H. influenzae, with no obvious selection for the amplification (JBact171:1898[89]).
Amplifications are reasonably common phenomena in eukaryotes and a model has been developed of amplifications starting with large inverted duplications (EMBO7:407[88]).
Inversions. (see ASM2, 2256[96]) One would expect to see recombination events between indirect repeats at the same frequency as between direct repeats, causing inversions at the same frequency as duplications. Inversions are probably not that frequent, but detection is also difficult, since inversions have neither a dosage effect nor display the instability that characterizes duplications. Inversions have two non-wild-type join points and the best known examples are the large inversion between Ec and St (from 26' to 35' in Ec). (See ASM pp974-976 for discussion of this and other smaller inversions between these two organisms.) There are inversions due to recombination between rrnD and E in some Ec K12 derivatives (PNAS78:7069[81] & Genet119:771[88]), and the identification of inversions between Ec lab strains has also been reported (NAR17:5901[89]). Some inversions in Ec are known to occur between native IS elements (ASM p.987). By at least some assays, inversions are at least 10^2 times less frequent than duplications. As always, the requirements implicit in the hunt may perturb the generality of the result. Salmonella paratyphi, which causes paratyphoid fever, has a 100-kb insertion (relative to St) and an inversion of the genome (between rrnH and rrnG), possibly to compensate for the insertion (JBact177:6585[95]).
Figure 6-6. Permissive and non-permissive sites for inversion. As described in the text, regions of homology were deliberately placed at two different pairs of sites and the resulting strains were tested for their ability to invert the region between. Pairs of sites that inverted are indicated by the solid lines (which run from the site of one homologous region to that of the other), while pairs of sites that failed to invert are shown in dashed lines.
The mechanisms by which inversions are formed are unclear, but do seem to involve the RecA and RecBC proteins (Genet132:295[92]). The results below, which indicate regions where inversions are forbidden, also have implications for the nature of the mechanism (Genet122:737[89]). Several selections for the generation of inversions have been published (see Sci241:1314[88] and Genet137:919[94]) (Fig. 6-6): (i) Invert two non-functional his operons, one lacking the promoter end and the other lacking the distal end of the transcript, but with about a 2-kb overlap, and demand His^+. (ii) Move two copies of lacZ in inverse orientations into different positions on the chromosome, each with a dissimilar transposon in a non-identical position, using the Mud phage, and demand Lac^+. By either system, when inversions occur, they make up 30-90% of the events leading to the selected phenotype and generally grow quite well. At non-permissive sites, where they do not occur, inversions are <10^-3 of the events. Almost all the permissive regions involve either the ori or ter regions (Genet129:1021[91] & Genet137:919[94]).
In general, the sequences next to the sites are not important (or at least are not of overriding
importance) and non-permissive regions are found within permissive regions. End-points of non-permissive inversions in one case can often be acceptable as an end point in another inversion. The non-permissive inversions are not forbidden due to phenotype, because a non-permissive inversion was
constructed by another means (using transducing fragments, each of which carried one of the two desired
join points) and it was quite viable. Three models were proposed for the failure of certain classes to occur:
(i) a rigid chromosome might restrict certain folding events; (ii) differential domain supercoiling might make
sites recombinationally incompatible; and (iii) certain inversions may tend to invert a single replication
fork, with deleterious consequences.
Site-specific inversions. (Cell41:649[85] & ASM p.1055ff; see Mobile DNA II, ASM[02]) These are systems in which a specific pair of inverted sequences recombine through the enzymatic action of a site-specific enzyme, typically encoded very near the invertible region. By the inversion, one or the other gene
or set of genes is expressed.
In the case of the hin system, the invertase binds as a dimer to each of the two inversion sites,
apparently in the absence of other factors (JBC264:10072[89]). In this system, each end of the enhancer
region is bound by a dimer of FIS and this FIS tetramer complexes with the Hin tetramer, bringing the IRs together. Part of the driving force behind this is the binding of HU to the space between the enhancer and
the near IR, bending the DNA. The inversion occurs when the dimers of Hin exchange partners, and
therefore strands of DNA, while being held together by FIS. This inversion therefore involves a rotation
of the complex, providing an orientation to the inversion event. Finally, it is constrained to a single cycle
by the tension built up in the short loop bound with HU.
Other examples include Ec regulated phase variation (PNAS84:6506[87]), bacteriophage Mu gin
inversion (Cell41:771[85], JMB295:767[00]), and pilin genes of Moraxella bovis (JBact172:310[90]). In
general, 26-bp IRs are sufficient for inversion, but there is a 15-20-fold enhancement by the presence of
60-200-bp cis sequences. These might be acting either as entry sites for proteins or perturbing the local
DNA structure. There are host factors involved, such as IHF and FIS, depending on the specific system.
One of the most striking features of these systems is the strict adherence to intra- rather than intermolecular recombination, though such preferences are not unique to these systems. It appears that this
adherence may be mediated through a requirement for knotted or catenated DNA structures, thus
selecting for regions on the same DNA, rather than through sliding or looping of protein DNA complexes
(Cell58:147[89]). The invertases are clearly related to the resolvases of the Tn3 family and form a
covalent complex with the 5' DNA phosphate at the cleavage site (EMBOJ7:1229[88]).
All these systems are similar at the level of both function and sequence, since the invertase from
any of those described above can substitute functionally for the others. There is at least one other class of
inversion system, represented by the pilin expression system (pil) of Ec. This invertase does not display
homology to the above, but rather to phage lambda integrase.
607 Lecture Topic 7............... MOBILE GENETIC ELEMENTS
(See reviews in ASM2[96] on transposition (p.2339), site-specific recombination (p.2363), use of transposons (p.2588 & 2613), and native insertion sequences (p.2000), as well as the book Mobile DNA II, ASM[02]. A review of the use of transposons in genomics and proteomics is in ARG37:3[03].)
In the following, I often use the term "hop" instead of the more proper, but cumbersome,
"transpose." Do not be misled into thinking that the transposable elements move from one site to another,
leaving the first in its wild-type state; this is probably the one thing they never do. They certainly can make
a copy of themselves at another site, but they either leave the original copy at the original site (replicative
transposition) or leave a double-stranded break at that site (conservative transposition) - in neither case is
the original sequence of the region restored to what it was prior to the insertion of the element (which is
the typical mental image from the term hop). I also use the abbreviation MGE (mobile genetic elements)
but the very definition of these elements is a bit tricky. Some argue that only elements that move "by a
transposition mechanism" can qualify as transposons, and this demands a small amount of DNA
synthesis. As you will read below, there is a specific biochemical implication to the term transposition
(movement that involves DNA synthesis) and most elements that move do so by this mechanism.
Arguably, the term transposon should be restricted to those elements that use this mechanism. The other class also happens to remove itself and restore the original target sequence when it moves, so these elements do hop. Such elements use an integration/excision system like that of phage lambda
(with Int and Xis homologs, but without any DNA synthesis). The term MGE therefore encompasses both
groups.
MGEs are of interest for the insight they provide into basic molecular biology and evolution, as
well as for their use as basic genetic tools. In terms of their use in genetics, transposons carrying drug-resistance genes provide selectable mutagens. As mutagens, they destroy the function of the product of
the mutated gene, tend to be polar, and have site specificity ranging from very specific to nearly random.
While this LT focuses primarily on bacterial elements, there is a rich literature on eukaryotic
elements as well, particularly Ty1 in yeast (Sci303:240[04] & Genet165:83[03]). It happens that the vast
majority of transposable elements in eukaryotes are retroelements, meaning that they go through an RNA
intermediate. The genome sequencing of Arabidopsis showed a remarkable occurrence of transposable
elements: 10% of the genome represents transposons of a variety of types but most elements are not
transcribed, suggesting that they have acquired mutations and are therefore inactive in transposition.
General classes of bacterial MGEs.
Insertion sequences (IS) are defined as MGEs that move by a transposition mechanism and are known to
encode only functions involved in transposition (CurrOpMicro9:526[06]). This is to be contrasted with
transposons (Tn), which are mobile genetic elements that move by a transposition mechanism and
contain genes unrelated to insertion functions (for example, drug resistance). Since IS elements, by
definition, do not carry other genes, they tend not to be used as genetic tools. On the other hand, mutations caused by ISs tend to be fairly common among spontaneously derived mutants, so that
knowledge of their properties is relevant. Finally, recognize how very arbitrary the above definition of an
IS is: it is potentially our ignorance of other encoded products that so identifies an element as an IS
instead of a Tn.
In either case, MGEs are essentially pathogenic elements whose environment is the genome of
the host. As in the case of any pathogen, they can be neither too virulent (where they efficiently kill the very host on which they depend for their existence) nor too avirulent (in which case they cease to be identifiable as pathogens). The virulence of these pathogens depends on their ability to increase their
number in the genome, as well as on any negative effects caused by the insertion itself. The former is a
function of the frequency of transposition, while the latter reflects their target sites. Not surprisingly, then,
MGEs spend most of their regulatory efforts avoiding transposition in order to avoid putting their host at a
selective disadvantage. Some elements aid the host by supplying useful functions like drug-resistance.
Transposons. There are two general classes of transposons: Class 1 or compound Tns encode drug-resistance genes flanked by two copies of an IS as direct or indirect repeats. The IS sequences supply the transposition functions; these Tns can be thought of as ISs flanking another gene. Examples are Tn5, 9,
10, 903, and 1681. In all of these, the IS ends of the transposon are capable of transposing separately.
Indeed, the ISs transpose at least 10-fold more frequently than the entire Tn, though we rarely notice this
because we tend to monitor the drug-resistance gene carried only by the entire element. The second
class of transposons is known as complex or Class 2 in which the element is flanked by short (30-40 bp)
indirect repeats with the genes for drug resistance and transposition encoded in the middle. In other
words, the resistance and transposition functions exist as an integrated genetic unit. Not surprisingly,
these must always transpose as a unit.
Mu and other phage transposons. For all intents and purposes, bacteriophage Mu (named for its mutator
effects) belongs in the general category of a Class 2 transposon, since it is not flanked by separately
transposable insertion elements. Its physical size is 38 kb and it generates 5-bp duplications upon
insertion. Its site preference is remarkably random and the argument has been made that its specificity
can be for no more than one or two base pairs. However, in at least one particular region, it has been
found that a disproportionate number of insertions fall within one very small region of the gene suggesting
that there can be some site preference. Mu is rather strongly polar in both orientations, but it is clear that
there is an exceedingly low level of transcription out of one end of the prophage. The transposition of Mu
is known to generate deletions, as roughly 10% of the Mu prophages have adjacent deletions. These
deletions tend to start at one end or the other of the prophage and extend into the adjoining DNA, though there also seem to be cases where the deletions are unlinked to the prophage. Finally, precise deletion of Mu is rather rare, occurring at approximately 10^-9.
The advantages of the use of Mu are: it is not normally found in the bacterial genome and
therefore there are few problems with homology to existing sequences in the chromosome; in contrast to
most other transposons, Mu does not need a separate vector system (see below) since it is itself a vector,
being a bacteriophage; Mu prophages (at least the c^ts versions, where c encodes the repressor) are
inducible. The disadvantage of Mu is that it is a bacteriophage and therefore can kill the host cell. A wide
variety of useful mutants of Mu have been generated, especially the small defective variant known as
Mud.
In addition to the case of Mu, there is a Pseudomonas phage, D3112, that also generates
phenotypically distinct mutations, albeit at 100 times lower frequency than Mu (JBact171:3909[89]).
Elements that move by site-specific recombination. "Site-specific" here refers to the mechanism, not to the
specificity itself, which depends on the enzyme making the nicks in the target DNA. The site-specific
enzymes do tend to be quite specific, but transposases are not completely random either. Indeed, the
transposase of Tn7 (a standard transposon by any definition) is extremely specific, but it still uses a
transposition mechanism.
These MGEs do not duplicate target DNA, and therefore do not require DNA synthesis. By Nancy
Craig's definition, these are not transposable elements. Tn1545 and its relatives Tn916, 918, and 920
from gram-positive streptococci move using Int and Xis proteins possessing both sequence and function
similar to those of the lambdoid phage. They can thus restore the target sequence when they leave. This
sequence requirement confers a degree of target specificity (MolMicro4:1513[90]). On the other hand,
Tn916 does not necessarily restore the target, because it brings in a 5 bp sequence on either end, which may or may not be incorporated into the new chromosome upon replication (Cell59:1027[89]). Curiously, the int homolog is required only in the donor cell and therefore is only necessary for excision (or else it also moves to the recipient) (JBact174:4036[92]).
Other examples of site-specific recombination systems include phage integration and excision systems (lambda, P22, φ80, P2/P4), resolution systems of plasmids and some replicative transposons, and the various related inversion systems. These systems have the ability to recognize the site geometry, perhaps by protein tracking or by requirements for specific inter- or intramolecular structures. It has been shown that many site-specific integration systems, excepting lambda, integrate at a tRNA structural gene, and the elements are so designed that they recreate the gene and its associated sequences upon integration.
Conjugative transposons. (MicroRev59:579[95]) These elements have a number of properties of transposons, plasmids and even phage. The situation is exacerbated by the fact that there are few striking consistencies within the group. These elements employ site-specific recombination rather than transposition and do not seem to be able to replicate as plasmids, but they typically excise from the chromosome as a prelude to conjugal transfer.
These elements were originally found only in gram-positives, with Tn916 and Tn1545 being the best known. They have been found in Enterococcus, Streptococcus, and Lactobacillus. As befits a conjugative element, they are large, the smallest being 18 kb. Tn916 and its relatives have also been found in gram-negatives like Neisseria and Kingella, and two different unrelated conjugative transposons have been found in Bacteroides. At least one type of conjugative transposon in Bacteroides has the ability to cause excision and mobilization of sections of the Bacteroides chromosome. Tn916 and Tn1545 integrate at a specific tRNA^leu gene in Bacteroides. When introduced into Ec, they hop to multiple sites and at a much lower frequency, though both the specificity and frequency look like those in Bacteroides if the appropriate target is provided (JBact178:3595 & 3601[96]).
Figure 7-1. Model for replicative transposition.
Figure 7-2. Model for conservative transposition.
While these elements tend to transfer only themselves, there are cases of mobilization of the chromosome or of plasmids, either through integration or by supplying tra functions to mobilizable plasmids. Perhaps most interesting, the transfer of at least some of these elements is actually stimulated by Tet, to which these elements often provide resistance. In other words, the physical presence of the drug to which the element carries resistance somehow signals movement. While this has obvious implications of clinical importance, it is also a puzzle, as Tet has not been used for very long, yet an elaborate sensing and response system seems to have been developed. Finally, the detection of very similar resistance markers on such elements in a wide variety of bacteria and hosts is fairly suggestive evidence for horizontal transfer between species. These elements have sometimes been termed integrons and they show up frequently in clinical analyses of drug-resistance (AAC46:2427[02], 2400[02] & 2656[02]). For more on mechanism, see JBact184:3017[02].
Yeast MGEs. (See Cell93:1087[98], GenomeRes13:1975[03], CSHSQB66:249[01], & Yeast16:785[00])
There are five different but related classes of MGEs in the yeast genome, termed Ty1-Ty5, that are
roughly 6 kb in size. Each class has a pair of direct repeats at the ends, and in most cases these are different among different classes. Each encodes a number of different functions including an RNA-binding
domain, protease, integrase, reverse transcriptase and RNAse H, though these activities are apportioned
into one or two separate protein products. In all cases, there is a -1 programmed frameshift that is
necessary for the production of most of these critical activities. Very briefly, an mRNA copy is made, the protein products are translated and accumulate, and these (amazingly) assemble into virus-like particles. It is within these particles that reverse transcription into dsDNA occurs; the dsDNA is then inserted into the genome. The entire
process certainly looks as if these are viruses that at some point lost their ability to exit the cell.
As with prokaryotic MGEs, there is a danger in integrating into a site that is important for host
survival. The Ty elements exist in 7 to 217 copies each in the genome, so how do they avoid hurting the
host (remember that the yeast genome is not much larger than some prokaryotic genomes, so there is not
a lot of irrelevant DNA to integrate into)? They solve the problem through different degrees of site-specificity. Ty3 is the most site-specific and always inserts 16-18 bp 5' of tRNA genes. In these pol III-transcribed genes, the promoter is internal to the gene and these elements insert about where
transcription actually starts, which does not affect expression. It accomplishes this by associating its
integrase with host factors that bind at these sites. Ty1, 2, and 4 integrate within 500 bp of the promoters
of pol II-transcribed genes, so that they occasionally turn on or turn off expression of the genes
immediately 3' of them. This does not seem to be sequence-specific and the molecular basis for the
selectivity is unclear. Ty5 integrates near telomeres, which are often transcriptionally silent. The common
logic in all these cases is that the mechanism of site selection decreases the likelihood that the growth of
the cell will be damaged by the transposition event.
Because these elements are flanked by direct repeats, these sequences can undergo
intramolecular recombination that pops out the entire element except for a single copy of the repeat,
termed a solo LTR. These then serve as an indication that a Ty element existed at that site in the past.
Over time, of course, random mutations occur in these remnant sequences and we lose our ability to even
recognize them.
The Ty elements have certainly had an effect on the evolution of the yeast genome, as judged by
their position in the genome with respect to what are clearly old rearrangement events. However, their
effects have not been due to their transposition properties so much as to the fact that they create
multiple copies of sequence similarity that can then be targets of homologous recombination between
non-identical regions of the genome. In contrast to bacterial MGEs, Ty elements have not been
engineered for use as genetic tools.
Conservative vs. replicative transposition.
Current models for replicative transposition all involve cointegrants, with subsequent resolution by a
resolvase or recombination (mechanism of resolvase, Nat332:861[88]).
In replicative transposition, the transposon increases its copy-number in the cell directly by
making a new copy while maintaining the original one (Fig. 7-1). This process, apparently used by Tn3
and Mu, involves (i) single-strand breaks in the target DNA and on either side of the transposon; (ii)
ligation of ends as shown to produce a complicated hybrid molecule; (iii) (not shown) bi-directional
replication through the element with subsequent ligation, (iv) producing a single replicon where the donor
replicon is flanked by two copies of the transposon and fused to the target replicon. (v) A replication event
resolves this to the finished product.
Conservative transposition is so named because the copy number of the transposon is conserved
during the operation. It involves the transposon physically leaving one replicon and moving to another.
The original replicon is left with a double-stranded break and so is destroyed unless that is repaired. Most
elements try to transpose soon after DNA replication because there is then the highest probability that the
cell actually contains another replicon. Notice that while conservative transposition does not increase the
element's copy number in the cell (at least not initially), it does increase the ratio of transposons to
functional replicons in the cell. This might actually be the reason that Tns do not repair the site they have
left.
The molecular events are illustrated in Fig. 7-2. The model was developed for Tn5/IS50 and seems to describe Tn10/IS10 as well. The initial events are (i) double-strand breaks on both sides of the
transposon coupled with staggered single-strand breaks in the target replicon. (ii) The ends of the
transposon are then ligated to the single-strand segments in the target DNA. Finally, the gaps in the target
molecule are filled in.
There is an additional wrinkle here for conservative transposition. Consider what would happen if
the double-strand break at the site of the original insertion was repaired. This could only happen by
recombinational repair in which another copy of the region on either side of the break was used as a
template to reconnect the ends. However, you will realize that the only possible source for such a
template is the sister chromosome from which the Tn did not leave. Using this as a repair template would recreate the original insertion at the site from which it left. The result of this repair would look identical to
that of replicative transposition and it has been suggested that some elements might move by both
mechanisms.
What if the two replicons involved in replicative transposition were linear rather than circular? Obviously there
would be a problem with resolving the intermediate, or, more properly, with recreating the starting
replicons. This is not an absurd question, as linear plasmids, chromosomes and phage are all known to
exist.
Details of bacterial transposition.
Transposition factors encoded by MGEs. Though there is a bit of variability, notably with Tn7, it appears
that most prokaryotic MGEs encode a single key protein for transposition, typically abbreviated Tnp. The
assembly of a higher-order nucleoprotein 'synaptic' complex by transposase (Tnp) proteins is a central
checkpoint in the mechanisms of many DNA transposable elements. This protein must form a synaptic
aggregate of a complicated architecture, which ensures that only the end sequences of the MGE are
cleaved. Typically, the initial DNA binding is by single Tnp complexes, but DNA cleavage does not occur
until the formation of the complex structure and it is the Tnp bound to the other MGE end that makes the
cleavage (MolMicro62:1558[06]). With Tn7, there are five gene products involved in two different
transposition pathways: three genes are involved in both, while the other two are involved in target
selection. One of these latter two is extremely site-specific and predominates, resulting in the single site of
Tn7 transposition seen in all examined bacteria; the other is much more random in site selection
(NatRevMolCellBiol2:806[01]).
Target specificity of transposition. A wide range of target specificities exist among MGEs with Mu being
nearly random in most hunts, but showing identical, independent isolates in others. Tn7 is particularly cute
in that it has two different site-choice mechanisms (though only the first seems employed most of the
time): a very specific one for a sequence near the site of insertion and another system which is rather
random (JBact172:2774[90]). The particular gene product involved in the specific recognition of attTn7 is
TnsD (PNAS86:3958[89]). The protein involved in the non-specific transposition, TnsE, has a cute
property of its own: it likes to insert near double-stranded breaks and at replication termination sites,
which is thought to be a strategy for getting itself onto a conjugating plasmid (from Nancy Craig's lab, quoted in TIG16:202[00]). Tn5 has at least a preference for G/C pairs and for negatively supercoiled
regions, while Tn10 has rather strong site preferences, but is capable of utilizing other sites at lower
frequency. As always, the result depends on the nature of the hunt. For more on Tn7, see NatRevMolCellBiol2:806[01].
There can also be an orientation-specificity for a given transposon/target; that is, certain elements
only integrate into certain replicons in one orientation. The basis for this might arise from either the
asymmetric recognition of the site or because of nucleic acid topology constraints.
IS102 has been shown to have a preference for certain sites when they are transcriptionally
active, while Tn5 and 10 show a 4-10 fold reduction in transposition into transcribed regions. This may be
a secondary effect, however, since it has been shown that Tn5 prefers negatively supercoiled DNA for at
least some sites. The transcription effect above might therefore reflect positive supercoiling ahead of the
RNA polymerase.
Regulation of transposition frequency. (ResMicro155:387[04], G&D19:2224[05]) Because transposons are parasites on the host genome, they must not transpose very frequently or they will damage that host.
There is therefore a large set of mechanisms that transposons employ to maintain a very low frequency of
movement:
(i) Antisense RNA (in the case of IS10/Tn10): In this case the antisense RNA is complementary
and overlaps the 5' end of the mRNA encoding transposase. The antisense RNA is extremely stable
and hybridizes with the mRNA, forming a structure that is attacked by RNase III. It is also possible that the
hybrid occludes the ribosome-binding region of the transposase gene. This RNA becomes a trans-acting
negative regulator of transposition, presumably becoming more effective as the copy number of the
element increases.
(ii) Protein-protein interaction: In IS50, a shortened transposase variant, produced from an
internal translational start, inhibits the activity of the full-length one. It is unclear if this is actually regulated,
or simply a mechanism to reduce activity of the transposase.
(iii) Classical repression: The Mu c gene encodes a repressor of all other phage functions including transposase. Even neater, one of the sites recognized (and required) by Mu transposase is an operator site that is regulated by the Mu c gene product. The binding of the c repressor to that site not only blocks transcription, but also transposition. One of the ORFs in IS1 also seems to be a repressor of a transcript encoding itself and transposase activity. The Tn3 resolvase is a repressor of both its own synthesis and that of the transposase.
(iv) dam methylation can affect both the transcription of the gene for transposase and the site of
transposase action for both IS10 and IS50. In each case, the fully methylated target site is much less
active than the hemi-methylated version in terms of transposase binding. Also, the fully methylated
promoter regions are much less functional in supporting transcription of the transposase gene. These
properties have several effects: they reduce the time window after replication when transposition is
possible and they help the Tn to hop only when a new replicon has been produced in the cell.
(v) Prevention of aberrant transposase expression: In at least some cases, transcripts into the
transposase gene that start from outside the element necessarily contain secondary structure that
prevents the translation of transposase (JMB192:781[86]). In other words, the correct promoter starts
transcription within the region encoding the structure so that the structure cannot form. However, a
transcript from the outside carries regions of the MGE 5' of the normal promoter that cause this secondary
structure. A similar result has been seen in chromosomal genes, where promoters 5' of the proper ones
are precluded from causing gene expression (JBact173:3680[91]).
(vi) Translational frameshifting is required in the translation of the transposase of IS1. The regulatory impact of this is unclear; it may merely serve to further decrease production of functional transposase.
(vii) cis-acting transposases: Because of the high affinity of transposases for DNA, they tend to bind near their own gene. This reduces the potential increase in transposition from multiple copies of the element in the cell, which might otherwise have arisen if all copies served to increase the transposase concentration in the cell.
(viii) Transposases act as multimers, possibly requiring as many as a dozen molecules to form a functional aggregate. This severely reduces the likelihood of transposition since there are, on average, only about 0.2 transposase molecules in a given cell (see the sketch after this list).
(ix) Transposase seems to be poorly translated in most elements for one of several reasons: a
poor or absent Shine-Dalgarno sequence, occurrence of secondary structure occluding the RBS, and
relatively unstable mRNA (Genet124:449[90]).
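To see how strongly the multimer requirement in (viii) suppresses activity, here is a minimal Poisson sketch in Python (the 0.2 molecules/cell figure is from the text; treating cells as independent Poisson samples and requiring exactly a dozen molecules are simplifying assumptions):

from math import exp, factorial

def poisson_pmf(k, mean):
    """Probability of exactly k molecules when the per-cell mean is 'mean'."""
    return mean**k * exp(-mean) / factorial(k)

mean_tnp = 0.2    # average transposase molecules per cell (from the text)
needed = 12       # molecules taken as needed for a functional aggregate

# Sum the tail directly; computing 1 - CDF would underflow at this magnitude
p_enough = sum(poisson_pmf(k, mean_tnp) for k in range(needed, needed + 20))
print(f"P(a cell holds >= {needed} molecules) ~ {p_enough:.1e}")    # ~7e-18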
Other factors affecting transposition. (i) Replicon immunity is a phenomenon whereby some elements
interfere with the ability of elements like themselves to hop into the replicon where they exist. In the case
of Mu, this occurs by the Mu B protein (which targets sites) binding preferentially to non-immune DNA.
This differential binding is probably because the Mu A gene product binds to the Mu DNA and dissociates
the B gene product. A cis-acting site in Tn7 has also been implicated in transpositional immunity as has
that of Tn3.
(ii) Physiology of transposition: The above topics describe some of the mechanisms of regulation,
but not the underlying physiological properties that invoke these mechanisms. While these are almost
completely obscure, it has been noted that there is both temporal and spatial regulation of transposition
within a bacterial colony (JBact171:5975[89] and refs therein). That is, it appears that the very different
environments within bacterial colonies have differential effects on the regulation of transposition of at least
some elements. Such phenomena are largely ignored in most genetic analyses, but are almost certainly
of importance ecologically and evolutionarily. UV has been shown to stimulate Tn10 transposition, but not
in lexA or recA mutants, suggesting that the SOS stress response can have some effect on transposition
(Genet149:1173[98]).
Distribution and evolution of bacterial ISs. (CurrOpMicro9:526[06]) Until genome sequencing, one
typically detected a new class of IS because it appeared in a genetic hunt for loss-of-function mutants and
was recognized as an insertion of genetic material. Obviously now ISs are found through genome
sequencing, but the presence of sequence homologs to known MGEs does not verify that they are still
functional. Based on the similarity of most ISs in a given host to each other, it has been argued that they
have been recently acquired, which suggests that the elements are lost with some frequency as well.
The involvement of ISs in some chromosomal rearrangements is clear, such as the case of the
duplication of argI/F, which is flanked by IS1 elements. Another clear case is the observation that there are
many chromosomal rearrangements in old stabs (soft agar tubes with a bacterial inoculum that grows
exceedingly slowly, if at all, after the initial round of growth) of E. coli (Genet136:721[94]). In many cases,
the rearrangements could be directly correlated with IS5 or IS30, but the resulting strains often had growth problems on different media.
Now certainly few of us really care about yeast[:-)], but the major chromosomal rearrangement
events seen among related strains are translocations and it is clear these occur by homologous
recombination between various Ty elements throughout the chromosome (Nat405:451[00]).
So, for all the fanfare about the rearrangements that MGEs can stimulate, they do not seem to
cause these very often. Why not? Why are MGEs there? The best model is that they exist because they
have a mechanism to enrich themselves in their environment, the genome of the host (ASM, p.1066).
There need not be a selective advantage for the host to have them, merely a lack of a strong selective
disadvantage. On selections against inteins (LT1 - proteins that splice themselves out of the proteins
encoded by the gene into which their DNA has inserted), a specific form of selfish DNA, see
ARM56:263[02] & TIG17:465[01].
One case noted in the reviews is the curious positive effect of IS50 on growth of Ec under certain
circumstances (and this does not refer to drug resistance selections). This is certainly interesting, but can
it possibly be general? Until the mechanism of this effect is clear (e.g. how can they be sure that
transposition is not involved?), it would seem best to defer judgment on the implications of general
beneficial effects.
A second possibility is that ISs speed rearrangements, helping the host respond to novel
problems. The problem is that the data does not support this. A counter argument is that we have not
studied the pathways relevant to Ec survival in the real world: if we examined those, whatever they are,
we would see tons of rearrangements. However, by that argument we should also see lots of differences between Ec and St, but
they simply aren't there. It is of course possible that the positions of genes are sufficiently tuned for proper
function, so that most rearrangements are selected against.
ISs are pathogens and as such run a narrow gauntlet of being lost from the population due to
either insufficient or excessive virulence. Perhaps they have difficulty maintaining this balance and are
constantly failing either way, only to be reinvented due to their basic self-selection. We may not see some
of their handiwork, because we no longer recognize the corpses at the sites as ex-functional IS elements.
Bodies have probably been found on ColE and pSC101 plasmids (MGG217:269[89]). Skeletons of
retroviruses are detected in the human genome.
It is clear that some introns are also mobile and they do not have quite the problems of most
IS/Tns in that they do not impart a deleterious phenotype through gene disruption. It is of course possible
that they might have subtle effects on gene expression (e.g. effects on the stability of the transcript into
which they are inserted), but it would only be under conditions of massive multiplication that they would seem
to have an impact on the host. Why then are they not more common? The easy answer is that the known
T4 introns have an extremely site-specific transposase, but it would seem simple enough to evolve a
piece of DNA that had the transposition properties of a typical IS, but with splicing abilities also. Perhaps
they do exist, at relatively low copy number and simply have not been detected.
The transposases of at least some bacterial IS sequences are related to those of transposable
elements in higher organisms. Specifically, that of IS30 is rather related to transposases of HIV-1 and
RSV (retroviruses) and shows less similarity to the transposases of the LTR-retrotransposons, Copia and Gypsy. This entire group has a lower similarity to the transposases of IS911 and IS3. That of IS630 shows similarity to those of mariner, Pogo and Tc1, and all of the transposases listed here can be grouped into a single monophyletic group (TIG15:326[99]). Similarities among Tn7, Tn10, and V(D)J recombination have been noted (Sci271:1592[96], EMBOJ15:6348[96], Cell84:223[96]). Evolution of Ty elements in yeast has been rather well studied and provides a nice model system for the question because of the range of elements in the Saccharomyces genome (JMolEvol49:352[99]).
Effects of transposons other than gene disruption. Surprisingly, transposons can actually activate the
expression of otherwise silent genes in one of several ways:
(i) IS1 and IS3 have sequences near their ends that are similar to consensus -35 regions of
promoters. If these elements happen to insert at an appropriate distance from some sequence that is
similar to a -10 region, then a promoter will be created. The effect is the coincidental generation of a
functional promoter by the insertion.
(ii) Some elements have a completely functional promoter that reads out of the element into the
adjacent DNA. Typically this is the anti-sense promoter that the element employs in its own regulation of
transposition. This gene activation occurs with a number of Tns and ISs: in a specific screen in his, a
promoterless hisD gene was turned on by the transposition of Tn10 or Tn5 in 85-95% of the HisD^+ isolates, when the cell had the relevant Tn already in its chromosome. Most of these involved the smaller IS at the ends, rather than the entire Tn (Genet120:875[88]).
(iii) IS1 and IS5 have positive effects on bgl expression through effects on the local structure of
DNA (Nat293:625[81]). However, insertion of an element anywhere near the ebgA gene reduces its expression, perhaps by altering the helicity of the region and reducing promoter efficiency
(PNAS81:6115[84]).
(iv) ISs and other elements are involved in F factor integration into the chromosome, leading to
mobilization of the chromosome or other plasmids. R68 becomes a chromosome mobilizing plasmid
(R68.45) when there are tandem copies of IS21 on it, presumably stimulating transposition of the plasmid
into the chromosome (Genet115:619[87]). As noted above, there are also the conjugative transposons of
Bacteroides and other organisms that support their own transfer as parts of either plasmids or the
chromosome.
(v) Obviously Tn elements often encode drug resistance and different ones are noted in ASM
p1071 ff. As an aside, Tn5 was originally identified as conferring kanamycin resistance, but it was later
found to also confer streptomycin resistance in some bacteria other than Ec. Further analysis revealed a
gene for bleomycin resistance and all three genes are in the same transcript. It appears that the failure to
see the latter two resistances in wild-type Tn5 in Ec is because of a post-transcriptional event that
interferes with translation of str and ble, but drug-resistant mutants can be selected (MGG204:404[86]).
Use of bacterial Tns. (see ASM2,2588 & 2613[96]; for uses of Tns in genomics, see ARG37:3[03]) The
most common use of transposons is as a selectable mutagen. For this you need to arrange things so that
the only drug-resistant colonies that arise are the result of transposition events. We will assume you have
a selectable marker on the element, but remember that MGEs spend most of their time not transposing.
For transposition into the chromosome, you typically introduce the transposon on a vehicle that
either is incapable of replication in that host or that can be discouraged from replicating. For example, one
can use a phage with a nonsense mutation in a critical gene. Such a phage mutant can grow on a strain
that contains an appropriate informational suppressor, but when infecting other hosts, it is unable to
replicate or integrate and all drug-resistant colonies that arise are due to transposition of the element from
the phage into a host replicon. Alternatively, suicide plasmids can be conjugated into the strain one
wishes to mutagenize. Again, because the replicon with the transposon cannot replicate, only
transposition events will be detected as drug-resistant. Finally, linear DNA containing the transposon can
be introduced by transformation or generalized transduction. Such DNA cannot replicate and, assuming it
lacks homology with the chromosome, only transposition events will be detected.
For transposition onto a vector (plasmid or phage), it can either be done as above, where the
vector is already resident in the mutagenized cell, or you can simply move the vector out of a cell
harboring the element and demand those vectors with the selectable marker by selecting transfer of drug-resistance in a recipient cell. Mutagenized vectors can be demanded by making a phage lysate,
performing plasmid conjugation/mobilization, or by the physical isolation of the plasmid DNA with the
subsequent transformation selecting the element.
Use of bacterial Tns in eukaryotes. (This is a slightly deceptive title, but...) The bacterium
Agrobacterium tumefaciens is a plant pathogen with the remarkable property of being able to mobilize a
piece of DNA (T-DNA) from its genome into eukaryotic cells, where it moves to the nucleus and
integrates, more or less randomly. Ignoring the biology of this for the present, one can clone selectable
markers into the T-DNA and then use this as a MGE mobilization system. Importantly for our purposes, it
has been successfully used in some yeast that lack other useful tools (Sci312:583[06]).
Constructed Tn variants. (see the ASM listing on p.1072ff and the IS/Tn registry in Gene51:115[87]).
(i) Different drug markers: These add novel drug-resistance genes to the transposition properties of a given element.
(ii) High-hoppers: Variants of several Tns have been generated that are derepressed for
transposition, typically because of enhanced expression of the transposase, and therefore hop with higher
frequency.
(iii) One-time hoppers: If the gene encoding the transposase is removed from the element and
placed nearby (perhaps on the same piece of DNA), the element will hop to a new site, but, now
lacking a transposase gene, will hop no more. This is particularly useful when used in conjunction with
high hoppers, since continued transposition would cause problems in any interesting mutant found after a
transposon mutagenesis.
(iv) Gene- or protein-fusion elements: Some elements have been constructed that have
selectable or scorable reporter genes at one or the other end, which are unexpressed unless there is active
transcription/translation from the surrounding region. This provides an instant assay of the expression
of that region into which the element is inserted. Concerns with this are covered under Fusions in LT3. A
particularly interesting pair of fusion systems involves Tn5 derivatives of lacZ and phoA, the former only
functions when the hybrid protein remains in the cytoplasm and the latter only when hybrid proteins are

86

exported. Further, either fusion type can be genetically converted to the other at the same site
(JBact172:1035[90]). An allied construct allows fusions so generated to be easily cloned from multi-copy
plasmids onto a prophage (Gene90:135[90]).
(v) mob/oriT transposons: Transposons have been generated that carry the mob site for one of the conjugative plasmids, allowing any replicon into which the Tn has inserted to be mobilized by that plasmid. Some Tn5 derivatives have mob sites for different conjugation systems, which allows their mobilization by the corresponding systems (LT9).
(vi) Tn5tac1: This construct contains an outward-reading Ptac promoter, lacI, and Kan^r. It has been used in Bordetella to isolate mutants affected in virulence, with screening for those that regain virulence upon addition of IPTG. This allows the overexpression of gene products otherwise identified only by the phenotype of their loss (JBact172:1681[90]).
(vii) Counter-selectable Tns: Tn5 derivatives with rpsL (conferring dominant streptomycin sensitivity in otherwise Str^r strains) and Kan^r allow selection for both the presence and the absence of the element, as desired. TnsacB (sucrose sensitivity) is also available.

607 Lecture Topic 8.........SELECTIONS, SCREENS AND ENRICHMENTS


The most critical potential difference between selections and most screens is in the power of resolution:
How rare an event can either scheme detect? A strong selection can detect an event at <10^-10, while screens typically function in the range of 10^-2 to 10^-4, although much more effective screens are sometimes
possible. This difference in power affects not only the events that are detectable by either approach, but
how much effort is involved in obtaining the desired mutants. The ability to devise a selection that yields a
desired phenotypic class is what defines a geneticist. Roth and Botstein felt that genetics was a matter of
"demanding something from the bug instead of just taking what it is willing to give you.
Selections. A selection is the demand for a given phenotype; a review on all sorts of selectable
phenotypes is given in ASM2,2527[96]. Some of the general questions and concerns are:
The tightness of the selection. How well does a strain with the selected phenotype grow relative to the
counter-selected one? Surprisingly, it only takes a 2-3 fold difference in growth rate to provide a
significant enrichment, if sufficient cycles of growth in liquid are allowed. The problem then becomes
starting with a large enough inoculum to ensure that the desired mutant is present at the end. Selections
on plates require a more severe growth difference between the bulk of the population and the desired
strain because the desired strain must give colonies while the rest of the population cannot be allowed to.
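To make the arithmetic concrete, here is a minimal Python sketch (my illustration, not from the literature; it assumes pure exponential growth, no death, and constant relative growth rates) of how a rare mutant with only a 2-fold growth-rate advantage takes over a serially passaged liquid culture:

    def mutant_fraction(f0, rate_ratio, generations):
        """Fraction of mutants after 'generations' doublings of the majority class.
        f0 = starting mutant fraction; rate_ratio = mutant growth rate divided
        by the majority growth rate (e.g. 2.0 for a 2-fold advantage)."""
        mutants = f0 * 2 ** (rate_ratio * generations)
        others = (1 - f0) * 2 ** generations
        return mutants / (mutants + others)

    for gen in (10, 20, 30, 40):
        print(gen, mutant_fraction(1e-8, 2.0, gen))

Under these assumptions, a mutant starting at 10^-8 passes half the culture after roughly 27 generations of the majority class, which is why the starting inoculum must be large enough to contain the desired mutant in the first place.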
How many genotypes confer the selected phenotype and which are the most common? Another way of
phrasing this is: What is the window of your desired phenotype? Let's say that you are seeking promoter-up mutations and you develop a reporter that provides growth only for promoters that have at least a 10-fold increase in activity. That's fine and means that there are all sorts of undesired events (like
duplications) that you won't have to bother with, because they will not provide the demanded 10-fold
increase. However, it also means that you will only detect that class that has this effect. You will not
detect weaker promoter-up mutations nor will you be able to say anything about the frequency or nature
of that unfound class. Another example is a paper that dealt with duplications: The authors demanded
duplications with at least one end point in a given region (for the very good reason that it made
subsequent analysis easier and therefore more interpretable). However, this window meant that they could have detected only a small fraction of the possible duplication events. As above, nothing
can be said about that larger class, but more importantly, if your window happens to target a class that is
not representative of the unconstrained class, then the broader conclusion you hope to draw may well be
invalid.
This is not a trivial concern or issue. We always should strive for systems that allow us to answer
broader biological questions (and not merely the role of a specific amino acid in a specific protein in a
specific organism), but in choosing any model system with a demanded phenotype, we have established
a window and the degree to which it is representative of the broader biology is very important.
Does your selected phenotype demand compensatory mutations? As you surely know, every time you
grow up a culture, you are doing a selection for mutants that grow faster than the majority class. Normally,
this is not a strong selection because the potential gain in growth rate is small: there is no single mutation
that causes a mutant to grow significantly faster. Thus, even if such mutations occur in the population,
they will not be significantly enriched in the population, which in turn means that the population will be
roughly the same at the end of the growth as it was in the inoculum. However, there are very common
situations where one can be fooled by changes in the genotype of a population.


Remember that even a small advantage in growth rate can allow a mutant to dominate a
population if a sufficient number of generations are allowed. Thus, if you always start a culture from the
tube or flask that you last used, you are in effect providing a very large number of generations and
therefore creating a condition where the culture after a few days or weeks will be genetically different from
the one you started with. The solution is to always use a frozen stock as the inoculum, often, but not
necessarily, after streaking that onto agar plates to pick a single colony.
But why, you ask, does it matter if there is another mutation in the culture, since we agree that it
causes only a very slight change in growth rate under these conditions? The first response is that the
change is modest under these conditions, and might be significant under other conditions. The second is
that it very much depends on what you plan to do with the strain. If you are biochemically analyzing some
property of the strain, then it is likely that you are choosing that property because you expect it to be
affected in an interesting way by the strain's genotype. Then if that property is physiologically linked to the growth behavior (hardly a remote possibility), then you are selecting for genetic changes that directly or indirectly affect the very property you are going to analyze biochemically. Any conclusions you draw will be fundamentally flawed because they will be based on the assumption that the observed biochemical
properties are the result of only the genetic changes that you know about.
In fact, the presence of unrecognized suppressors is probably very common. I do not think that it
is a stretch to guess that one-third of the literature on microbial mutant analysis is completely wrong
because of this phenomenon (though I admit this is a complete guess). Here's a general scenario of how you might see, but not really notice, the warning signs: You decide to add a mutation, marked by a drug-resistance cassette, to a new strain background by transduction (but conjugation or transformation would have the same issues). You do the transduction and select drug resistance and see only a few colonies, where you expected hundreds. "Ah, the phage stock must be old (or the conjugation didn't work well, or...)," you say, and happily pick a colony and get on with your life. Perhaps you verify that the strain has the correct insertion (PCR?). So what's the problem? There might actually not be a problem; it could well be that your explanation for the poor numbers was correct. But if it were not, then it means that you have probably chosen a strain with two mutations: the desired one and a suppressor mutation that provides
decent growth. The number of transductants was low simply because the vast majority of them did not
also have the necessary suppressor mutation. Now you should ask yourself, if the frequency of transductants is only reduced 2-3 orders of magnitude (i.e. you got a few and expected hundreds), then must the suppressors in this case not be arising at an unreasonable frequency of 10^-2? Fair enough, but consider that the recombinational event that creates the mutant is taking place in a cell that's growing well, so it might be several generations before growth is impacted, so there is a much larger population present in which the advantageous mutations can occur: 10 generations give 10^3 cells, but still a microcolony too small to see. The unsuspected presence of the suppressor is almost certain to
completely invalidate any work done with that strain. Examples of this are numerous and, I assume, most
remain undiscovered. A classic is the apparent viability of mutants lacking gyrase, which is actually a
lethal event, but is suppressed by mutations that kill topoisomerase activity, so virtually all the work on gyr
strains was flawed for years.
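A quick calculation makes the microcolony argument concrete (the per-division suppressor rate of 10^-5 below is an assumed, illustrative number):

    def p_suppressor_in_clone(n_cells, u):
        """Probability that at least one suppressor arose during the ~(n_cells - 1)
        divisions that produced a clone of n_cells, with per-division rate u."""
        return 1 - (1 - u) ** (n_cells - 1)

    # 10 generations of residual growth -> 2^10, or ~10^3, cells per microcolony
    print(p_suppressor_in_clone(2 ** 10, 1e-5))   # ~0.01, i.e. the ~10^-2 seen above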
The situation is typically discovered by accident: someone tries to move the known mutation into
a new strain background and finds that it is an unusually rare event, because of course moving the known
mutation is unlikely to simultaneously move the suppressor mutation. (If the two are closely linked
genetically, then the problem is extremely difficult to detect except by sequencing, but even then, there is
no reason to assume that a detected genetic difference actually affects the phenotype.) The solution to all
this is to pay attention to frequencies and do controls.
How many isolates should be analyzed? If the desired phenotype arises because of frequent events, then
the frequency of the desired mutants will be high. In such a case the selection may require plating rather fewer than 10^8 cells/plate. Otherwise, you will have a confluent lawn of the desired mutants rather than
isolated cells. Rarer phenotypes may require multiple plates or higher concentration of cells in order to
find the desired mutant. Also, the desired phenotype could be arising by more than one class of genetic
alteration; if you desire one of the rarer class, you will need to screen many colonies of the desired
phenotype to obtain some with the desired genotype.
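When the genotype you want is a rare class among the selected phenotype, a short calculation (my sketch; it assumes independent isolates and that you know the class frequency) tells you how many isolates to work up:

    import math

    def isolates_needed(f, confidence=0.95):
        """Smallest N such that P(at least one hit) >= confidence when each
        isolate belongs to the desired class with probability f."""
        return math.ceil(math.log(1 - confidence) / math.log(1 - f))

    print(isolates_needed(0.01))   # ~299 isolates for a 1% class at 95% confidence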
Is there a concern over the amount of carry-over of media to the selective conditions or any trace
chemical contaminants? In cases of particular sensitivity to growth inhibitors or nutrients, it may be
necessary to wash cells or go through some other dilution scheme such as multiple cycles of replica-printing or re-streaking, in order to obtain a distinct mutant phenotype. For example, the level of some
trace requirements, such as lipoic acid, can be so low that strains unable to make it still give normal-sized
colonies on medium lacking it, if they have been picked from a plate containing it - there is sufficient carry-over in a single cell to support the growth of an entire colony. If that colony is streaked again on medium
lacking lipoic acid, very poor growth is finally evident.
Phenotypic resistors. In some conditions, particularly at high plating densities, some genotypically
sensitive cells can be protected from the selection by their neighbors: (i) Classes of drug^r siblings can protect their drug^s friends by either excreting degradative enzymes or otherwise detoxifying the medium. (ii) Phage-sensitive cells can stick together in such a way that some cells are physically masked from the phage and can therefore survive phage challenge, even though they are genotypically phage-sensitive. (iii) Similarly, cell clumping can physically mask UV^s cells from UV damage.
Selectable markers. The typical selectable markers in prokaryotes are drug-resistance genes, though of
course the utility of a given resistance gene to provide useful levels of resistance in a given bacterium
depends on a variety of factors including the natural level of sensitivity of the cell and the level of
resistance provided by the element. The situation is rather different in yeast, where the only antibiotic marker that exists is G418 resistance, which happens to be encoded by Tn5. So geneticists very early on
identified certain genes for critical metabolic functions by cloning random fragments of yeast DNA into E.
coli auxotrophs and demanding a version of complementation. Having obtained clones of a few genes,
such as URA3, they then mutated the normal copy in the genome, so that the cells were uracil auxotrophs
unless another copy of the gene was provided on a plasmid or by integration. Effectively, normal metabolic genes became the selectable markers in yeast, and they have worked quite well.
In addition to cloned WT genes, a number of antibiotics can be used in Saccharomyces: See the
list in the text of "non-replicating (suicide) plasmids" that is found near the end of LT4.
Counter-selections. There are obviously tons of examples of selectable markers, most notably those
conferring drug resistance or, of course, the wild-type alleles of genes whose loss leads to auxotrophy.
There are also a few genes that, when cloned into certain backgrounds, confer a deleterious phenotype
under some growth conditions. These include, but are not limited to, sacB (causing death when the cell takes up sucrose, apparently because the accumulation of levan causes cell death in some way, Gene78:111[89]), rpsL (Str^s) (Gene99:101[91]), ccdB (normally a killing function involved in plasmid partitioning [LT9]) (BioTech21:320[96]), and tet, which causes not only Tet^r but simultaneously sensitivity to fusaric acid (JBact145:1110[81] & JBact169:4285[87]). In yeast, the chemical 5-fluoroorotic acid (5-FOA)
is widely used. This compound is not itself toxic but becomes toxic when there is a functional URA3 gene
in the cell and, as noted in the above paragraph, URA3 is a commonly used selectable marker. In other
words, the presence of a functional copy of URA3 can be selected for by growth in the absence of uracil
(and 5-FOA), while the absence of a functional copy can be selected for by growth in the presence of
uracil and 5-FOA.
Some examples. Obviously, whenever one plates an auxotrophic strain on medium without its required
nutrient, there is a strong selection for prototrophic revertants. Similarly, a selection for the proper clones
from a variety of in vitro operations typically involves the selection for cells transformed with plasmids of a
certain drug-resistance. Suppressors (LT14) are found by taking a strain that grows poorly under some
condition and then selecting those derivatives that grow discernibly better. While these selections are
obvious, there are some others that can be exceedingly powerful, but are a bit more complicated.
Two-hybrid systems reflect a clever notion whereby a protein pair has been chosen that normally
interact in the cell to accomplish some task, typically the activation of gene expression. The domains of
these proteins that provide the necessary interaction surfaces have been removed, so gene expression
no longer takes place. Rather remarkably, when completely different protein domains are put on each of
these altered proteins in such a way that the new portions can interact, this is sufficient to allow the proper
protein complex to form and again activate transcription. I find this all remarkable because it happens that
all sorts of new partner domains work, suggesting that the precise geometry of the interacting proteins is
not very important (this is not typically the case with protein-protein interactions). In any event, this system
has been set up in such a way that one clones a gene or gene fragment encoding a domain of interest
into the gene of one of the potentially interacting proteins. One then clones a library of fragments into the
proper place of the gene of the interacting partner and introduces this library into a large set of cells,
selecting for the desired gene expression. Such expression indicates that the two necessary transcription
factors have been brought into contact with each other because of the interactions between the new
domains that were added to each. With surprising frequency, the two new domains that are detected in
this way can be shown to interact in vitro and in vivo. This is therefore a selection for any two interacting
domains in the cell. A nice version of this for yeast is described in Genet144:1425[96]; this is typically
termed the yeast two-hybrid system but is used for interacting proteins from virtually any organism. A
functionally related system, which creates interactions among phage lambda regulatory proteins, has also
been described and is termed the bacterial two-hybrid system (G&D12:745[98]).
Phage display is another clever notion that has been modified in a number of useful ways. At its
core is a small phage that encodes a specific protein that ends up on the outer surface of the virion, but
which has a region at its tip (i.e. the part most exposed to the environment) that is apparently irrelevant for
phage function. One clones small DNA fragments into the appropriate site of this gene and the peptide
encoded by that fragment is added to that tip of the protein and displayed. Typically this inserted DNA
fragment is actually a completely randomized DNA sequence that has been produced by chemical
synthesis. By this approach one can create a vast library of phage sequences. When these phage are grown in bacterial cells, each phage particle displays on its surface the peptide encoded by its specific randomized DNA fragment. One then takes this library and washes it over some
surface that has a specific protein already bound to it, allows interactions to occur, and then washes off
those phage that bind poorly. Now since there will be a lot of garbage in such a randomization, only a
very, very small fraction of the input phage bind. One then takes those that bind, removes them with a
more vigorous wash and goes through another growth cycle in cells to amplify all phage present. The
binding selection is repeated again and after 5-10 cycles (which are actually very quick and easy) one has
enriched for those phage that bind with high affinity. One then sequences the relevant regions of some
selected phage and identifies the sequence that encodes the interesting region (JMB234:564[93]). From a
seminar by Brian Kay, the remarkable claim emerged that, with a given protein bound as bait, all of the
interacting surfaces selected were similar (i.e. they were not identical - so they were not siblings - but they
were clearly encoding related domains). This implies that for a given bait, there is typically only one
surface capable of producing strongly interacting partners - a concept that I have not heard discussed
elsewhere.
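The power of repeated panning is simple multiplication, as this sketch shows (the recovery efficiencies are assumed numbers, not from any particular phage display protocol):

    def binder_fraction(f0, e_binder, e_background, rounds):
        """Binder fraction after 'rounds' of panning: binders are recovered with
        efficiency e_binder, non-binders carry over at e_background, and
        re-amplification between rounds renormalizes the pool."""
        b, n = f0, 1 - f0
        for _ in range(rounds):
            b, n = b * e_binder, n * e_background
            total = b + n
            b, n = b / total, n / total
        return b

    # e.g. 10^-7 starting binders, 10% binder recovery, 10^-4 background carry-over
    for r in range(1, 7):
        print(r, binder_fraction(1e-7, 0.1, 1e-4, r))

Each round enriches binders ~1000-fold under these numbers, consistent with needing only the handful of cycles described above.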
Screens. In a screen, all of the cells in the population grow, but you then use some tool to find those cells
with a particular desired attribute. At least some screens are destructive (e.g. you might lyse the cells), so
there is a need to maintain orderly stocks of the strains being screened (e.g. master plates from replica
printing etc.). A great screen can be as powerful as a selection.
One can screen for the physical presence, absence, or absolute amount of something. For
example, one can screen for RNA or DNA sequences by hybridization; protein, by antibody, stain or label
at the appropriate gel position, or direct assay of activity; metabolites by biological assays (feeding or
inhibiting a tester organism) or physical assays (spectroscopy, thin-layer chromatography, etc.).
Particular growth properties can also be screened for. These may range in specificity from a
general phenotype (e.g. good growth) to a specific property only created by alterations at a specific locus. For example, a His^- strain will always have a mutation in a gene whose product is involved in histidine biosynthesis. In contrast, a strain with a ts phenotype might have a mutation in any of several hundred
genes encoding critical metabolic steps. A particular behavior, such as motility, chemotaxis, sporulation,
virulence, or developmental impairment (inability or excessive desire to sporulate, improper regulation of
macromolecular synthesis, etc) can also be screened for. In any screen, your assay needs to reflect the
desired phenotype as closely as possible. What you will always get is a mutant class that presents the
desired phenotype under the conditions of your screen, not necessarily under the conditions of the
eventual use planned.
Examples of screens. There is an optimal strategy for screening for any phenotype that depends on the
frequency and detectability of the desired mutants and the difficulty of performing steps in the screening
procedure. As a result, the optimal approach is often a set of successive screens of increasing difficulty,
each eliminating successively smaller fractions of the undesired strains.
(i) Replica printing utilizes a master plate that can be useful in the case of destructive assays,
such as flooding the plate with a chemical that kills the cells but allows detection of enzyme activity.
Highly active colonies are recognized on this plate, but viable cells need to be recovered from another
copy of the plate. One can print to a number (>10) of plates from one master, but for certain phenotypes, care must be taken in the number of cells transferred (typically the first plate receives a lot of cells and therefore should be another non-selective master).
(ii) Color indicators: These can reflect what is in the cells (DNA/RNA hybridization blots, antibody
to protein), and therefore require cell lysis. Alternatively, indicators can show what cells are excreting or
taking up. There are color indicators for pH (and therefore sugar use), and a number of different reporter
enzymes. You can also use sub-optimally supplemented plates, seeking tiny colonies, or shift
experiments where the plates are changed in temperature or nutrient content (by using overlays).
Examples of useful reporter genes include lacZ (β-galactosidase), lux (luciferase), gusA/uidA (β-glucuronidase), phoA (alkaline phosphatase), and xylE (catechol-2,3-dioxygenase).


(iii) Screens of individual colonies: One can pick colonies with toothpicks to microtiter wells for
assays that do not lend themselves to agar plates. The most time- and effort-intensive case is picking
colonies to liquid culture for further growth. This is necessary when more cells are necessary than are
found in a colony, the product is only made in liquid culture, or the only assay is a physical one that does
not lend itself to plates.
(iv) Screening by SSCP: Single-strand conformational polymorphisms (SSCP) have been used as
a difficult, but valuable screen that can resolve single base differences in DNA fragments several hundred
nt long on non-denaturing PAGE by altered mobility.
(v) Use of pools: It is often possible and efficient to screen for the presence of something only
found in a small fraction of the population by using pools of organisms. For example, if a rare mutant in a
population can make a compound that is time-consuming to detect, but is quite sensitively detected, then
screening pools of dozens or even hundreds of pooled cultures will reveal if any cells in that pool are the
desired strain. Any positive pool can then be further divided and analyzed to obtain a pure culture of the
desired mutant.
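The savings from pooling are easy to quantify; here is a minimal sketch (mine; it assumes one assay per pool, a perfectly sensitive test, and halving a positive pool at each round):

    import math

    def assays_to_find_one(pool_size, n_pools):
        """Assays for the first round plus ~log2(pool_size) rounds of splitting
        one positive pool down to a single isolate."""
        return n_pools + math.ceil(math.log2(pool_size))

    # 9600 isolates screened as 100 pools of 96:
    print(assays_to_find_one(96, 100))   # ~107 assays instead of 9600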
(vi) IVET (In vivo expression technology - John Mekalanos) is a selection method for genes that are turned on only under conditions of virulence. Typically, this means genes that are only expressed within the host. I will describe the process with an arbitrarily chosen example: analyzing a mouse pathogen and utilizing purA as the critical reporter gene: (a) Make a purA mutant, preferably a non-reverting deletion, of the pathogen. (b) Clone random chromosomal fragments 5' of a plasmid-borne promoterless purA. In this case, purA will be expressed only if there is a promoter in the cloned fragment directed toward purA. (c) Screen the clones, each with a different plasmid, to eliminate prototrophs (by replica-printing?). This eliminates those promoters that are functioning outside the mouse. Alternatively, you can have another gene downstream of purA expressed from the same promoter but for which expression can be selected against. (d) Infect a mouse with the population of cells that were purine auxotrophs on the petri plate and, after a while, go in and retrieve pathogens that have survived. Presumably, any cell that has survived has been able to express purA inside the mouse. (e) Reconfirm that these cells do not express purA on petri plates, which would happen if they had reverted. Sequence out of some of the inserts to determine which genes are involved.
This is a powerful approach, but the results are hardly yes and no. For example, how much
expression is necessary for survival in the mouse? How little expression is necessary for us to see it as
an auxotroph on petri plates? If you use two different genes for the selection and the screen, the level of expression detected is even more likely to differ between mouse and petri plate. The answers to
these questions affect the range of genes that you can find by the method. If you choose an in vivo
reporter that demands very high level of expression, then you only find promoters that provide that. This
result is fine, so long as you recognize that constraint.
Another complication concerns the location of the reporter gene in the bacteria to be tested. The
description above refers to the reporter, and therefore the random chromosomal fragments, being on a
plasmid. This has certain advantages for ease of construction and use, but constructed plasmids can be
unstable, so that interesting mutants might be lost through poor plasmid segregation. An alternative is to
move the reporter and the cloned fragments into the chromosome by recombination. Such integrants are
likely to be very stable, but there is the important issue of where the recombination event occurs and
whether or not that will actually disrupt an important function in the chromosome. For example, if you put
a suicide plasmid into a cell with a selectable drug marker, the reporter, and cloned fragments, any drug-resistant colony would likely have arisen by a recombination event between the chromosome and the
cloned region. This might actually disrupt the expression of the chromosomal genes (see the last
paragraph in LT4) and if these gene products were actually involved in pathogenicity (which is the entire
reason for the hunt), the mutants might actually be defective for survival in the mouse and you would lose
them. Again, these issues can be dealt with, as long as one recognizes the constraints on the approach.
(vii) Signature-tagged mutagenesis (STM) has the goal of finding organisms that are defective in pathogenesis, but uses an approach a bit more akin to traditional genetics. It creates mutants and looks for those with a defective phenotype, but the insight here is to tag all members in the population first, so one can recognize which mutants failed to survive in the host. (a) Create a library of different insertions that each differ at a randomized sequence at a distinct site in the element. (b) Use these insertions to make a random library of insertion mutants in a population of the pathogen, selecting some drug resistance. (c) Pick a number of these (96 is a useful number because of the number of wells in a microtiter dish) and organize them in such a way that you can generate ordered arrays for some hybridization screen. (d) Pool the 96 organisms, amplify the randomized regions by PCR, label and probe your ordered array. They should all be there. (e) Inoculate the pool of organisms into the mouse, wait a bit, then go in and pull as many as possible back out. Again PCR-amplify the tagged region and probe the array. Anyone who is missing is apparently unable to survive in the mouse under this condition - presumably because of the insertion. Sequence the gene from the missing organisms.

Figure 8-1. Signature-tagged mutagenesis (from ARM53:129[99]).
As with IVET, this is a powerful approach, but the devil is in the details. For example, how
important does a gene have to be for survival in the mouse in order for you to detect it? Also, there is a
limit to how many different organisms you can screen in a single mouse and still have confidence that you
would be able to recover all of the viable ones reliably at the end. Finally, there is the "herd immunity"
phenomenon (essentially what parents are relying on when they choose not to vaccinate their children): If a mutant fails to produce something important for pathogenicity and that product is normally excreted or displayed on the cell surface, then a mutant unable to do so might well survive in the host because the
vast majority of its bacterial colleagues will compensate for it. Thus such mutants will be missed by this
approach, and they are among the more interesting types that one would want to find.
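The recovery-limit problem lends itself to a quick calculation (a sketch assuming even representation and random sampling of the surviving clones; the numbers are illustrative):

    def p_false_negative(k_tags, n_recovered):
        """Probability that a perfectly viable mutant is absent from a random
        sample of n_recovered clones when k_tags equally fit mutants went in."""
        return (1 - 1 / k_tags) ** n_recovered

    for n in (100, 500, 1000):
        print(n, p_false_negative(96, n))

With 96 tags, recovering only ~100 clones leaves each mutant a ~35% chance of dropping out by chance alone; on the order of 1000 recovered clones are needed before a missing tag is persuasive.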
(viii) Expression arrays are another powerful approach. (a) One makes a chip with one or more
fragments on it for every ORF in the cell. (b) Isolate mRNA from the cells under various conditions - in this
specific example, for cells growing inside and outside a mouse. (c) Label this RNA in some way (typically
by converting the RNA into a fluorescently tagged DNA probe) and see what hybridizes to the chip. (d)
mRNA that are much higher in cells growing in a mouse than in cells grown in medium are good
candidates for being relevant to virulence.
The power of this method is great, but there remain a number of technical issues that complicate
interpretation. Because bacterial mRNAs are typically highly unstable, they turn over fairly rapidly, so the
reproducible isolation of mRNA from mouse-grown cells is not trivial. Also, the ability to reliably compare
mRNA levels requires that a certain level of mRNA be present over the background of non-specific hybridization. This means that the expression of many genes that are only expressed at low levels cannot be analyzed, even though this potentially precludes the analysis of genes whose products are important
regulators.
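The detection-floor problem reduces to a trivial rule, sketched here with made-up intensities (the 2x-over-background cutoff is an arbitrary illustration, not a published standard):

    BACKGROUND = 50.0   # assumed non-specific hybridization signal (arbitrary units)

    def fold_change(in_mouse, in_medium):
        """Return the expression ratio, or None when either signal is too close
        to the non-specific background for the ratio to mean anything."""
        if in_mouse < 2 * BACKGROUND or in_medium < 2 * BACKGROUND:
            return None
        return in_mouse / in_medium

    print(fold_change(5000, 250))   # 20.0 -> a candidate virulence gene
    print(fold_change(80, 60))      # None -> e.g. a weakly expressed regulator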
Enrichments: What do you do when the desired mutant never grows better than the rest of the
population and actually grows less well under at least some conditions? Enrichments are experimental
systems under which the majority class is preferentially killed. Thus the desired minority class, which
is not killed, represents a larger fraction of the total surviving population than it did in the starting
population. In general, the point is to kill growing cells, and therefore the targets need to be cellular
functions utilized only by growing cells. The effects must be bactericidal, not bacteriostatic, or else you
really are not changing the fraction of the desired class among viable cells.
The general problem with enrichments is that you want conditions that kill the majority (growing)
class as efficiently as possible, without causing significant effects on the non-growing, desired class.
Depending on the relative degree of killing of the two classes, one sometimes has to repeat the enrichment
several times to bring the frequency of the desired class up to a level at which that can easily be detected.
Excessive treatment can cause killing of the desired class due to lysis of the undesired class (and then
resulting nourishment of the non-growers) or selection of cells that are resistant for phenotypic reasons.
Examples of enrichment methods. Cell wall analogs are compounds that mimic the constituents of cell walls sufficiently well to be incorporated, but they fail to make all of the necessary chemical bonds for an
appropriately cross-linked wall. As a consequence, rapidly growing cells that incorporate them severely
weaken their walls and eventually lyse. (The only problem with this description of events is that penicillin,
the best-known example of this class of agents, does not kill by cell lysis, since cells resistant to lysis are
comparably sensitive to the drug.)
DNA analogs can also be used for enrichment. For example, radiolabeled thymine is only
incorporated into cells undergoing DNA synthesis. Upon prolonged cell storage, radioactive decay leads
to cell death for those organisms that were capable of growth under the restrictive conditions. Similarly,
bromouridine is much more reactive to UV than are normal bases and incorporation of bromouridine can
make growing cells hypersensitive to UV, while non-growing cells fail to incorporate the agent and remain
relatively UV^r.
Causality. By any of these methods, you find mutants with interesting phenotypes, but you need to be
careful: just because you have a mutant with a signature tag and with a desired phenotype does not mean
that the phenotype is due to the signature tag. The notion that the constructed mutation caused the
phenotype is plausible of course because you didn't deliberately do some other form of mutagenesis.
However, spontaneous mutations do occur and could be the actual genetic basis for the altered behavior.
So any time you find an interesting mutant and before you spend a great deal of time guessing why the
mutation causes the phenotype, you had better make sure that it really does: you need to show that the
mutation is "causal" to the phenotype. There are two general methods. The best is to recreate the
mutation in a new background and verify that the new strain has the same mutant phenotype. Depending
on the organism, this can sometimes be hard, so an acceptable alternative is to show that introduction of
the wild type region causes the mutant phenotype to appear (much more) like the wild type. Neither of
these shows that there is no other mutation in the background, but they do demonstrate that the
interesting phenotype is caused by the known mutation.

607 Lecture Topic 9.................PLASMIDS, CHROMOSOMES, and CONJUGATION


(see ASM2[96] for plasmids {p.2295} and for conjugation {p.2377, 2402, and 2406})
"Plasmids and conjugation" mixes rather different things: conjugation can certainly occur without
plasmids, and many plasmids have nothing to do with conjugation. Nevertheless, there are sufficient
linkages between the two concepts that there is some didactic rationale to considering them together.
A plasmid is defined as a replicon, or replicating piece of DNA, that is stably inherited in an
extrachromosomal state. In older literature, the term episome was used for plasmids capable of
integration into the chromosome, but this term has largely fallen into disuse. Plasmids typically exist as covalently closed, circular pieces of double-stranded DNA that have the ability to replicate autonomously
in the cytoplasm. It is this property that allows their isolation and physical recognition. The closed covalent
nature of their structure allows them to be separated from chromosomal DNA by either gel electrophoresis
or cesium chloride buoyant density gradients. A chromosome is any replicon in the cell whose genetic
material is essential for growth of the cell. Finally, conjugation is the transfer of DNA from one cell to
another by cell-to-cell contact.
There are two features held in common by all prokaryotic plasmids: the ability to replicate and to
partition themselves between the daughter cells after cell division (see ASM2:2295[96]). Curiously, both of
these fall under an older term of incompatibility. Two plasmids are termed incompatible if the maintenance
of either (in a population) is lower in the presence of the other plasmid than in its absence (note that
"lower" is not necessarily zero). This can be pictured in two ways. If two plasmids share replication
functions, then those functions will maintain a certain copy number of the replicons in the cell. Because
this copy number will be the sum of the two different plasmids (because, in our example, their replication
functions are the same and cannot distinguish between the sites on two plasmids), a competition will
result between the two plasmids. Whichever plasmid is able to replicate faster, or has some other
advantage, will be represented to a disproportionate degree among the cells in the population, which we
will see as incompatibility. If two plasmids have different replication systems but happen to have the same
systems for ensuring that daughter cells have a copy of the plasmids (as described below), then a similar
competition will occur. Whichever plasmid is better at utilizing that partitioning system will be found more
consistently in the cell population and we will again see the two plasmids as incompatible.
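This competition can be mimicked with a few lines of stochastic simulation (entirely a toy model of my own: random choice of copies for replication, random assortment at division, a fixed total copy number, and no selection):

    import random

    def generations_until_pure(copies=8, max_gen=10000):
        """Generations until a lineage carries only one of two plasmids that share
        replication/partitioning control, so that only total copies are counted."""
        pool = ["A"] * (copies // 2) + ["B"] * (copies - copies // 2)
        for gen in range(1, max_gen + 1):
            while len(pool) < 2 * copies:          # replication of randomly chosen copies
                pool.append(random.choice(pool))
            pool = random.sample(pool, copies)     # one daughter gets a random half
            if len(set(pool)) == 1:
                return gen
        return max_gen

    runs = [generations_until_pure() for _ in range(200)]
    print(sum(runs) / len(runs))

Random drift alone fixes one plasmid type within a handful of generations at this copy number, and that loss of the mixed state is exactly what we score as incompatibility.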
Unfortunately, throughout the plasmid literature there are problems with changing and deceptive
nomenclature. For example, partitioning systems (see below) of two different plasmids have often been
given the same gene designations where the products might be entirely unrelated in sequence and
mechanism.
Types of prokaryotic replicons. What distinguishes a big plasmid from a chromosome
(ARG32:339[98])? It is not merely size: Rhizobium meliloti has two megaplasmids, 800-1000 kb, whose
products are primarily involved in plant infection, while a number of different bacteria have entire genomes
of less than 600 kb (see below). The largest known prokaryotic genome is that of Sorangium cellulosum,
a soil-dwelling Gram-negative bacterium, at 13 Mb (NatBiotech25:1281-9[07]), which is actually larger than
that of Saccharomyces cerevisiae, the famous yeast, at 12 Mb. The median size of prokaryotic
chromosomes is a bit above 2 Mb for those studied to date, although this value is probably skewed by the
focus on sequencing the genomes of pathogenic bacteria. Sam Kaplan has argued that R. sphaeroides
has two chromosomes, because the smaller replicon (0.9 Mb vs 3.0 Mb for the larger replicon) carries one
of the three rRNA genes (rrn) as well as other unique functions (Genet153: 525[99]). This notion that a
chromosome is a replicon that carries essential genes while a plasmid does not, seems to be generally
accepted. In the same vein, Burkholderia cepacia (which was until recently called Pseudomonas cepacia),
a bug that can degrade most everything from peanut butter sandwiches to strange organics, has three
chromosomes of 0.9, 2.5 and 3.4 Mb, all of which have genes for 16S and 23S rRNA. Similarly, Brucella
abortus, an animal and human pathogen related to R. sphaeroides, also has rRNA and heat shock genes
on each of its 1.2 and 2.1 Mb chromosomes.
Now it happens that these properties are actually somewhat flexible. For example, Kaplan has
also found that other natural isolates of R. sphaeroides vary greatly in the size of the smaller chromosome
(JBact181:1684[99]). As another example of chromosomal variation among closely related strains,
different isolates of Bacillus cereus (aka B. thuringiensis) have chromosomes of 2.4-5.3 Mb, but the larger
ones seem to carry new DNA inserted into the 2.4-Mb chromosome.
But are copies of rRNA genes a good criterion for identifying a chromosome? Are additional
copies of these genes really essential? It so happens that E. coli has seven rRNA operons and several
can be deleted without any observable problem, so why should these genes be considered diagnostic of a
chromosome? On the other hand, while the megaplasmids of some rhizobia are not essential for growth
in the lab, they certainly do carry genes that are critical for the organism's ability to infect plants, which is
(at least from the human perspective) a pretty central aspect of their biology. Indeed, if plasmids existed
because they supplied useful genes to the host, one would expect that these genes would be recombined
into the chromosome and the plasmid would lose its edge. Instead it has been argued that it is only the
ability to move newly selected genes that can explain their presence (Genet155:1505[00]).
The matter is further complicated by the challenge in trying to determine the importance of
plasmid-encoded functions. That is, you would presumably test the importance of a plasmid to the cell by
eliminating it and watching for the effect on growth. However, eliminating natural plasmids is not easy
(because of the partitioning systems described below) and when you find a rare isolate that lacks the plasmid, it is not trivial to show that some critical plasmid genes have not moved to the chromosome in the process. This seems to be yet another example where biology refuses to fit neatly into the little
categories we have created.
A related question is, what is the minimal prokaryotic genome? This has been guestimated by
counting the E. coli genes whose products have been shown to be essential, but it has now been
addressed experimentally. TIGR took the small M. genitalium chromosome, generated random
transposon insertions, and sequenced the sites. The analysis suggested that 265 to 350 of the 480
protein-coding genes of M. genitalium are essential under laboratory growth conditions, including about
100 genes of unknown function (Science286:2165[99]). This commonly held view of ~300 essential genes
took a drubbing in 2006, when the genome of Carsonella, an endosymbiont of sap-feeding insects, was
found to contain only 182 ORFs (213 genes) in a 160-kb genome (Sci314:267[06]). Fully one third of
these are involved in translation and almost one quarter have no obvious predicted function. In 2009, an
even smaller genome was found: the 144-kb genome (189 genes) of another insect symbiont, Hodgkinia cicadicola. Probably not coincidentally, this symbiont cohabits in the insect with another small-genome bacterium, Sulcia muelleri, and the two appear to have maintained different, but complementary, biosynthetic pathways (PLOS Genet 7:e1000565[09], PNAS106:15394[09]).
Finally, what if anything is the selective pressure for a genome to be of any particular size?
Eukaryotic genomes vary in haploid genome size over several orders of magnitude, with no obvious
correlation with the resultant organism. The various pressures that affect genome size are discussed in
TIG17:23[01].
Linear replicons: (ASM2:2309[96], TIG12:192[96], ARG32:339[98]) We have pretended that all
prokaryotic replicons are circular, but that is not the case. Among the bacteria that have been shown to
have linear chromosomes are Borrelia burgdorferi, Agrobacterium tumefaciens, Rhodococcus sp. and all
tested Streptomyces sp. including S. lividans. To show the range of possibilities, the Borrelia chromosome
is only 960 kb and the streptomycetes tend to have very large genomes (about 8 Mb), while
Agrobacterium apparently has both a linear and a circular chromosome. There are also a number of linear
plasmids (5-200 kb) including SLP2 in S. lividans and others in Borrelia (Borrelia also contains a number of circular plasmids). Finally, there are linear double-stranded DNA replicons in viruses such as phi29 of Bacillus subtilis and the adenoviruses.
These linear replicons have two different types of telomeres. Some have covalently closed hairpin
ends. Replication of these replicons proceeds bidirectionally from an internal origin and yields a double-stranded circular dimer. The telomeres are then cut with a resolvase and religated to form two linear
monomers (CurrOpMicro5:529[99]). In the other class there are proteins bound to either end and these
are termed "protein-capped." Replication of these also proceeds from a central origin, but because the
ends are not covalently closed, two linear monomers are directly formed, though there is a bit of work to
properly form the telomere (see Plasmid Biology, p. 291. Funnell and Phillips, ASM Press).
Finally, replicative transposition in and between linear replicons should cause some problems.
You probably need to draw it out, but the point is that you cannot go through the "replicon dimer"
intermediate, which is then resolved, if the two replicons are linear (TIG12:194[96]).
Yeast chromosomes (and other eukaryotic replicons) have the same challenge as linear
prokaryotic replicons: how do you replicate the ends? If you prime from the ends themselves, then you'll
shorten the ends at every replication cycle. The general solution is a short redundant repeat near the
ends that can fold-back on itself and base pair, but this folded-back section is not ligated closed. Then,
you solve the problem of losing the ends by adding them back through the enzyme telomerase, which
carries a section of RNA complementary to the ends and uses that to extend the single-stranded
chromosomal end.
Cryptic plasmids are those that serve no known function, with the LT plasmid of the standard lab strain of
St being the best-known, but by no means only, example (ASM2,2012 & 2041[96]). Now obviously, these
plasmids might provide a function to the cell that we simply don't know about. Quite possibly we have not
analyzed the success of strains with and without the plasmid in the actual competitive environment where
the bacteria actually live. However, if we assume that they really do not provide an advantage to the host,
why are they maintained in the population? One can imagine several possibilities. They might have such
terrific partitioning systems that they are simply never lost. Alternatively, one could imagine them having
such an aggressive conjugation system that they reinfect any daughter cell that failed to receive the
plasmid upon cell division. But the following example suggests there are other possibilities as well.
pACYC (a small, moderate copy-number plasmid) was moved into Ec and it made the cell less fit under
non-selective conditions. The cells were then grown under selection for the drug marker on the plasmid
for 500 generations. Surprisingly, these cells were now more fit with the plasmid (actually, they introduced a new, unselected version of pACYC, to make sure that the plasmid itself had not been altered) than the original strain was before it had seen the plasmid at all. These authors argued that the released
microorganisms with such plasmids will not necessarily be at a selective disadvantage in the real world
due to excess baggage as has been argued elsewhere (Nat335:351[88]). Is a similar physiological
phenomenon behind the retention of cryptic plasmids?
Integrative plasmids are those that can occasionally integrate into the chromosome and there are a
number of examples. There are the plasmid-like forms of the Bacteroides uniformis conjugal Tet^r elements
that are normally chromosomally located, but upon tet exposure, become plasmids and conjugate
themselves (JBact170: 1651[88]). Some Streptomycetes strains have the integrative SLP1 and pSAM2
plasmids. SLP1 inserts at a 112-bp region that is homologous to a region on the plasmid. The plasmid is
17-kb, self-conjugative, and generates various smaller versions of itself after mating into some hosts.
The integration site of many of these plasmids is a tRNA gene, and the integration event actually
regenerates the tRNA as well as the neighboring regions necessary for successful tRNA processing. In at
least some of these cases, the chosen tRNA is unique and essential. Given the conservation of tRNA
genes amongst related bacteria, this might be a way for plasmids moving through a population of diverse
but related bacteria to be sure of having an appropriate site. Perhaps not surprisingly (since yeast are just
big, slow bacteria), yeast retroposons, termed Ty elements, also integrate into tRNA genes
(TIG9:421[93]). tRNA genes are also the sites of insertion of phages P4 and P22.
Uses of plasmids. The uses are manifold, but in general: (i) Small and high-copy number plasmids are
useful because they are easy to isolate and handle for molecular biological methods. (ii) Low copy-number plasmids are best for pseudo-complementation experiments (including those involving reporter
genes for studying regulation), where you want to minimize the perturbation of the balance of things. Very
large low-copy plasmids (Bacterial Artificial Chromosomes) can be useful for genome sequencing. (iii)
Conjugative plasmids are nice because conjugation tends to be a very efficient mode of moving DNA
around - much better than transformation, for example. Conjugative plasmids have the disadvantage of
typically being large and low-copy number. These potential disadvantages can be overcome with the use
of mobilizable plasmids, which have the small mob region but for which the tra functions are supplied in
trans, typically from the chromosome of the donor strain (see section on conjugation below). (iv)
Chromosome mobilization can be useful for both gross mapping of genes on the chromosome and for the
generation of "primes" in vivo.
Plasmids that replicate in one host, but not in another (termed suicide plasmids in the latter host)
are of great importance in both prokaryotes and yeast because they serve as a powerful means for
introducing mutations constructed in vitro into the genome. Typically these plasmids are mobilizable as
described in the paragraph above. Use of these is discussed in detail at the end of LT4.
Replication functions in prokaryotes. In the simplest case, these consist of one or more origins of
replication with the trans-acting proteins necessary for that replication (ASM2, p.1579-1626). The broad
host range of some plasmids is at least in part explained by their multiple replication systems that allow
them to function in a variety of dissimilar hosts. However, multiple or extensive replication systems make
a plasmid larger and, unlike the case of chromosomes, more plasmid DNA may come at a cost in terms of
competitiveness.
Protein replication factors. (see ASM2,2295 & 2406[96] and MolMicro37,467[00], MMBR62,434[98]) Most
plasmids encode a trans-acting replicator (Rep) protein that interacts at the origin to stimulate replication.
The origin very often has multiple repeats, termed iterons, which are the binding sites for the Rep
activator proteins. These iterons exist in several different forms: some are near the start site of replication,
often one iteron is physically separated and within an A-T-rich region where replication actually begins,
and one or more iterons are in the vicinity of the promoter for the gene encoding the Rep protein itself.
These iterons were previously thought to titrate out the Rep protein and thus limit replication. However,
this is clearly incorrect for studied plasmids, because the number of Rep proteins in the cell far exceeds
the number of iterons.
It might seem surprising that Rep proteins have both positive and negative roles in replication, but
it is probably necessary to have both properties if there is to be control over copy number. The positive
role is clear from the fact that they are essential for plasmid replication. They bind to the iterons and bend
the DNA locally, which helps open up the strands, particularly in AT-rich regions typically located near to
the core of the ori. However, a negative role is clear from the detection of mutations in the genes for these
proteins that lead to high levels of plasmid (so-called copy-up mutations). The Rep proteins are also often
negative regulators of their own synthesis, acting as repressors of transcription. It has been challenging to
see how one protein can serve these conflicting tasks. That has now been partially explained by the
surprising observation that Rep proteins can exist in multiple aggregation states - monomers and different forms of dimers - each of which has somewhat different DNA-binding properties, as well as different population
sizes. Important to this model, the various Rep conformations are not in rapid equilibrium. It is argued that
the negative regulation of copy number in part reflects the binding of Rep dimers to the repeated iterons in
such a way that only one monomer is bound to that DNA and the other monomer (of each dimer) is free to
bind to a similar sequence on another piece of DNA, thus physically blocking replication of each DNA
molecule. This has been dubbed the "handcuffing" model because it posits that two DNA molecules are
locked together by Rep protein dimers.
Antisense RNA as replication regulator: (see ASM2, 2297) There are also plasmids, termed relaxed (e.g.
ColE1 derivatives), that do not require plasmid-encoded functions for replication, merely long-lived host
proteins. This property happens to allow them to be amplified by addition of protein synthesis inhibitors to
the medium; their replication continues, but chromosomal replication stops. This style of replication control
is very different mechanistically from that described for the Rep-protein-based plasmids described above.
In ColE1-type replicons (which include pBR and the pUC plasmids) an RNA (RNAII) is transcribed
from a site upstream of the replication start site and forms a persistent complex with the DNA that is
cleaved by RNase H to form a primer for DNA synthesis. Binding of an antisense RNA (RNAI, transcribed
from the opposite strand) to RNAII inhibits replication by inhibiting primer formation. The interactions of
these two RNAs is aided by the rom/rop gene products from the plasmid and by the pcnB gene product
encoded by Ec.
The rate of decay of RNAI is the key element in the control of pBR322 replication. It is cleaved
near the 5' end by RNase E, making an RNA that is 5 nt shorter, highly unstable, and incapable of
interfering with replication. Mutations in pcnB make the cleaved RNAI more stable, leading to better
interference, because the RNAI is not polyadenylylated. PcnB is a poly-A polymerase, and in prokaryotes,
poly-A leads to instability, while in eukaryotes, it actually stabilizes RNA.
In pUC19, a ColE1 derivative, a single base mutation next to the RNAII transcription initiation site
results in a RNAII transcript that is 3 nt shorter at the 5'end. This shorter RNA apparently interacts poorly
with the RNAI transcript, leading to a much higher copy number because of poorer inhibition of primer
formation. Apparently, the role of all of this regulation is to control the timing and amount of replication,
since RNAI is effectively a trans-acting regulator of copy number.
In R1 (and IncFII, pUB110) replicons, the copB (or rop) gene product is a protein repressor of
transcription of repA, the mRNA encoding the activator of DNA synthesis initiation, but there is also an
antisense RNA, copA, that overlaps the repA start site. Again, the interaction of the two RNAs seems to
be dependent on the pcnB gene product. It has been shown that the RNA-RNA hybrid is then cleaved by
RNaseIII, which eliminates mRNA function. In this system, the regulation is indirect, controlling not DNA
synthesis initiation, but synthesis of the protein required for that initiation.
Origins of replication (oris). The general paradigm is that there is an AT-rich region that is the actual site
of DNA strand separation, which begins replication. This region is flanked by a set of iterons in direct
repeat, where the Rep proteins bind, and on the other side by an enhancer region which also assists in
replication. Besides the Rep proteins, numerous other proteins bind these regions, including DnaA, IHF,
Fis, and IciA. Not surprisingly, there is overlap between different binding sites and therefore some
competition between factors, which is an aspect of replication and its control. Very often, the binding of
these proteins leads to substantial DNA bending, which can facilitate protein-protein interactions or DNA
strand separation.
The general pattern for replication is that the Rep proteins bind the repeated iterons and recruit
DnaA and IHF, which causes DNA strand separation. This allows the binding of DNA helicase and
primase, which then support the binding and activity of DNA polymerase, leading to a cycle of replication.
Plasmids with variable copy number. It has been possible to create plasmids where the copy number can
be manipulated and here are two examples. A plasmid was created with a pair of replication origins, one
from pSC101 and one from ColE1, and the par system from pSC101 was added (Gene49:311[86]). The
ColE1 ori was engineered to have lambda Pr promoter upstream and with lambda cI857 elsewhere on the
vector (Pr is a strong lambda promoter and cI857 is a temperature-sensitive allele of the lambda repressor
that controls that promoter). At low temperature, the vector exists at four copies per cell (due to the
pSC101 origin); at high temperature, the cI repressor fails to function, Pr turns on and the RNAII form
overwhelms the RNAI form to give dramatic turn-on of replication (~300 copies per cell). A similar game
can be played with the expression of repA mRNA in R1 replicons (Gene28:45[84]).
Another scheme takes advantage of the inability of ColE1 replicons to function in polA mutant
backgrounds by providing the same cell with an F factor carrying a Tn1000 defective for resolution.
Selection for the desired markers and subsequent mating yields F factors carrying a single copy of the
ColE1 replicon and the cloned region (JBact171:5254[89]). In a polA+ strain, the plasmid copy number is
that of ColE1.
Methylation and membrane association. (see ASM2,782[96]) It appears that plasmids are not floating
freely in the cell, but rather form communities or clumps (JBact181,7552[99]). The precise nature of these
is unclear and it is possible that the Rep proteins, or other proteins noted above, might have a role in this
organization, but it is also apparent that membrane association is part of the process. As covered before
regarding the role of dam methylation in regulation, there seems to be a role of methylation in membrane
association that probably leads to inhibition of successive rounds of replication as well as aiding
partitioning (see below).
Phage P1, when replicating as a plasmid, has an absolute requirement for dam for replication.
There is also a dam effect on ColE1 replicons, and it seems to be at the level of membrane association,
rather than any regulation of transcription. Why a randomly replicating (i.e., its replication is not connected
to the cell cycle) plasmid would have this timing mechanism is unclear, but it has been shown that part of
the effect is due to MutH binding to the under-methylated sites. It has therefore been argued that MutH
delays replication until it has had a chance to scan the replicon for mismatches (JBact173:3209[91]). (This
also explains why dam mutants accumulate mutHLS mutations.) The synchronicity of plasmid and
chromosome replication with cell cycle has been at least partially described (JBact174:2121[92]).
In the case of the Ec chromosome, it appears that hemi-methylated oriC (origin of replication) is
specifically and transiently sequestered from dam methylation by association with the membrane. For the
oriC of Ec, there are 11 GATC sites within a 245-bp region that binds to membranes only when hemi-methylated (remember that, while the average lifetime of a hemi-methylated site is about 4 minutes,
specific sites can be very different). Overexpression of Dam leads to a decreased time between rounds of
replication. The implication is that hemi-methylated sites bind to membranes both to help segregation and
to prevent premature rounds of replication. In addition, however, the promoter of dnaA, whose product is
necessary for chromosomal replication, is also non-functional when hemi-methylated, so that both cis and
trans functions are co-regulated by methylation.
More recently, the segregation of the chromosome in several bacterial species has been analyzed
through a combination of microscopy (using physical tags for specific chromosomal regions) and genetics.
It has become clear that bacteria have a primitive form of mitotic apparatus that causes newly synthesized
copies of oriC to be physically positioned at the cell poles, but the motor that underlies this positioning
remains unknown.
Replication in yeast. The most critical difference between yeast and prokaryotic replicons (at least to this
writer) is that origins of replication are frequent in the former, so that all sorts of DNA pieces can survive in
yeast with proper selection. Also relevant, however, are the facts that the centromere sets the copy
number of the replicon to one in a haploid cell, and that these chromosome-like replicons are linear, with
telomeres at the ends.
(i) Chromosome-like replicons. Unlike the case with bacteria, replication initiation in yeast is not
from one or a very small number of sites, but from sites that appear every 50-100 kB. These sites are
termed ARS for autonomously replicating sequence. The consensus seems to be something like
aTTTATuTTTa, where a refers to A or T and u refers to a purine. Because of the looseness of this
consensus, satisfactory sites for replication occur in almost any piece of DNA of reasonable length,
especially if it is of yeast origin, because the yeast genome is roughly 60% AT. So if the requirement for
an origin is hardly a restrictive one in defining a chromosome, what else is involved? Part of the problem
is how much like a chromosome do you want the vector to be?
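As a back-of-the-envelope check on how unrestrictive this consensus really is, one can compute the expected number of exact matches in random sequence. Equal base frequencies and a ~12-Mb genome are assumptions; the real genome's AT-richness and the tolerance for near-matches both push the true number higher, toward the observed one-origin-per-50-100 kB density.

```python
# Expected exact matches to the ARS consensus in random sequence.
# 'a' = A or T (p = 1/2), 'u' = purine (p = 1/2), specific bases p = 1/4.
# Equal base frequencies and a ~12-Mb genome are assumed; AT-richness
# and tolerated near-matches only push the real number higher.

consensus = "aTTTATuTTTa"
degenerate = {"a": 0.5, "u": 0.5}

p = 1.0
for ch in consensus:
    p *= degenerate.get(ch, 0.25)

genome_bp = 12_000_000
expected = 2 * p * genome_bp        # count both strands
print(f"match probability per position: {p:.1e}")   # ~1.9e-06
print(f"expected exact matches: {expected:.0f}")     # ~46, i.e. one per ~260 kB
```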
To be truly chromosome-like, a replicon needs to have one or more ARS (see above); it must be
linear, which means that it has telomeres at the ends; it must be present in single-copy in the haploid cell
and, through meiosis, segregate 2:2, and both of these properties are conferred by the presence of a
centromere; and lastly it must be fairly stable, which oddly enough requires a minimal size of greater than
50 kB. The telomeres have ~80 repeats of a C1-3A sequence that is recognized by telomerase. This
solves the dilemma of how you replicate the ends of things - not a problem in circular replicons. The
centromere is the region at which the microtubule associates with a complex set of other proteins to
cause chromosome disjunction. There is a set of three distinct sequence elements that create a yeast
centromere. Lastly, the size issue is odd, but it happens that a small replicon that has ARSs, a
centromere and telomeres (termed YTCp plasmids) will replicate and segregate, but it will not be very
stable, for reasons unknown to me. Insertion of random DNA, phage lambda for example, solves this
problem.
A replicon with ARS and telomeres but no centromere (termed YTp plasmids) cannot be
segregated by the microtubules, but the absence of the centromere means that it exists at a higher copy
number, perhaps 20. However, such plasmids are not particularly stable, presumably because of poor
segregation. A plasmid with an ARS, but neither a centromere nor telomeres (termed Yrp plasmids), is
circular, has a copy number up to 100, but is very unstable.
(ii) 2μ plasmids. The 2μ plasmid, so named because it is that length (2 microns) in electron microscopy (~6
kB), exists in normal yeast cells at ~50 copies per cell. It is a covalently closed circle and is fairly stable
because of a set of functions that promote the placement of one copy in the daughter spore cell. Rather
remarkably, only a single copy is transferred to the daughter cell upon cell division. This single copy then
undergoes an intramolecular rearrangement, through a site-specific recombination event driven by the Flp
recombinase encoded by the plasmid itself, in a region of direct repeats on the plasmid. This reorients the
origin of replication in such a way that it undergoes rolling-circle replication and rapidly brings the total
copy number back to about 50 in the daughter cell. Without such a rearrangement, replication is
bidirectional and significantly slower for some reason. When proper copy number is achieved,
autoregulation restores the slow replication mechanism. The partitioning of a single copy of the plasmid to
the daughter cell is accomplished by the products of two genes on the plasmid that attach to both the
plasmid and to the nuclear matrix. Because every daughter cell receives a copy of these plasmids (which
then amplify up to 50 copies), they appear to segregate 4:0. To the extent the 2μ plasmids are now used
as genetic tools (termed YEp plasmids), they have typically been deleted for the gene for the Flp
recombinase, so that they cannot rapidly amplify their copy number after cell division; they therefore exist at 5-10 copies per cell in steady state.
(iii) Other replicon types: Plasmids with both a centromere and a 2μ replication system exist at
single copy, indicating that the centromere exerts copy number control here as well. In some way, the
presence of unattached kinetochores shuts down replication, though for some reason, connecting a
strong promoter to read into the centromere prevents this copy number control and allows a plasmid that
also has the 2μ replication system to replicate as if there were no centromere.
Yeasts also have the mitochondrial genome, which is a 75-kb circular DNA molecule that exists in
about 100 copies per cell. Surprisingly, the nature of mitochondrial DNA replication has been difficult to
figure out, but the evidence seems to be coming down on the side of a rolling circle model of replication.
Lastly, there is the remarkable killer RNA genome in yeast. This ds RNA resides in virus-like
particles, which sequester it from the cellular mRNA pool. These three ds RNAs can be translated, which
is an unusual property, and they encode the capsid proteins and the toxin, and something in there
encodes immunity to the toxin. Virtually all yeast have some version of this, and the toxin also has the
odd property that it is necessary for pheromone production and therefore for mating. The toxin acts by
binding to an ion channel and increasing its activity. Under the microscope, the killer RNA capsids look
rather like the structures associated with the Ty elements. Research on this has been negligible for years,
but a nice review is in ARB55:373[86].
Partitioning in prokaryotes. (see the general review on plasmid segregation in ARG39:453[05], and for
chromosomal segregation, ARM56:567[02], ASM2,1652 & 1662[96]) All replicons have mechanisms that
increase the likelihood that both daughter cells contain a copy of the replicon following cell division. There
are two completely different approaches to this problem. The first is a set of strategies that increase the
likelihood that each daughter cell receives the replicon, while in the second, the daughter cells lacking
plasmids are killed (after cell division). While we refer to a plasmid being "lost" by a cell, the actual
mechanism is almost certainly that the cell never received the plasmid at the previous cell division due to
inappropriate partitioning. The loss of plasmids in a population is sometimes referred to as plasmid
segregation, though that term too is mechanistically deceptive. For most lab-created plasmids, either of
low or high copy-number, this loss occurs at (very roughly) 1% frequency (defined as the frequency of
plasmid-free cells in a population grown non-selectively starting with a low inoculum of a plasmid-containing isolate). In contrast, most naturally occurring plasmids are exceptionally stable, and in most
cases it is almost impossible to find plasmid-free isolates without a clever approach. Obviously some of
the DNA that has been removed in the course of creating smaller cloning vectors has eliminated functions
involved in plasmid inheritance.
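For perspective on that ~1% figure, it is worth seeing what random segregation alone would predict. A minimal sketch, assuming copies are distributed independently to the two daughters (the copy numbers tried are just examples):

```python
# What random segregation alone predicts for plasmid loss. If a cell has
# n copies before replication, 2n copies are distributed independently at
# division; the chance of producing a plasmid-free daughter is
# 2 * (1/2)**(2n) = 2**(1 - 2n). The copy numbers below are examples.

def loss_per_division(n_copies):
    return 2.0 ** (1 - 2 * n_copies)

for n in (1, 2, 5, 20):
    print(f"{n:2d} copies: {loss_per_division(n):.1e} plasmid-free daughters per division")
```

By this arithmetic a 20-copy plasmid should essentially never be lost, so the observed ~1% rate says the copies are not assorting independently; clumping and multimerization, discussed below, are two of the culprits.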
Occasionally, it is necessary to isolate plasmid-free derivatives of a strain currently containing a
plasmid, a procedure termed curing. Some methods of plasmid curing include: (i) spontaneous curing
(perhaps found by replica-printing isolated colonies if the plasmid confers a scorable phenotype); (ii)
following an enrichment (again, if the plasmid confers a growth phenotype); (iii) selection of a different, but
incompatible, plasmid in the cell; or (iv) treatment with elevated temperature or chemicals such as
acridines, ethidium bromide, sodium dodecyl sulfate and novobiocin that tend to interfere with plasmid
replication (since the first two chemicals are known mutagens, they should be used with restraint).
Prokaryotic systems affecting proper assignment of replicons. These systems involve either active
plasmid localization by filaments or the monomerization of plasmids. The former provides a mechanism
for drawing plasmid daughters into new daughter cells. The latter presumably decreases the likelihood of
a plasmid-free segregant by maintaining a maximal number of free plasmids. Consistent with this notion,
there is a phenomenon termed "the dimer catastrophe" in cases where monomerization is less than
normally efficient. In this case, dimers tend to predominate in the cell, because there are two origins of
replication and therefore they are twice as likely to replicate, but dimers also are much poorer at proper
partitioning, so plasmid-free cells are more common (TIG12:246[96]).
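The replication advantage of a dimer is easy to caricature in a simulation. This sketch is purely illustrative (origin-proportional replication to a fixed origin quota, then random halving at division); all numbers are invented.

```python
import random

# Toy simulation of the "dimer catastrophe." A dimer carries two origins,
# so it is twice as likely to be chosen for replication as a monomer once
# resolution to monomers fails. All numbers here are invented; the point
# is only the direction of the drift.

def grow(pool, origin_quota=40):
    """pool is a list of molecule sizes (1 = monomer, 2 = dimer).
    Replicate molecules origin-proportionally up to the origin quota."""
    while sum(pool) < origin_quota:
        pool.append(random.choices(pool, weights=pool)[0])
    return pool

random.seed(1)
pool = [1] * 19 + [2]                        # a single dimer appears
for gen in range(8):
    pool = grow(pool)
    random.shuffle(pool)                     # random segregation of molecules
    pool = pool[: len(pool) // 2] or [1]
    dimer_origins = sum(u for u in pool if u == 2)
    print(f"gen {gen + 1}: dimers hold {dimer_origins / sum(pool):.0%} of origins")
```

Individual runs vary (it is a biased random walk), but the dimer's share of the origins tends to ratchet upward, which is the catastrophe.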
Plasmid partitioning appears to come in one of two forms: actin-like ATPases that form dynamic
filaments, and a second class of ATPases whose mechanism is not yet understood (ARG39:453[05]).
pSC101 has a par locus of three, non-protein-encoding regions, at least two of which are
necessary for partitioning. It includes a binding site for gyrase. If either the par site or gyrase is
eliminated, the plasmid replicates normally and has approximately the correct superhelicity, but is poorly
maintained. Membrane association also requires par. The argument has been made that gyrase acts as a
recognition protein that causes appropriate association of the plasmid to the membrane. However, a
deletion of par results in plasmids with lower superhelicity and a topA mutation increases the stability of
Par- plasmids, suggesting that superhelicity is directly involved. The general idea might be that you need
supercoiling of the par region and that while this is typically done locally, more global supercoiling of the
plasmid will suffice. Alternatively, it might be that plasmids with partitioning defects are able to get by this
problem if they are particularly highly supercoiled (PNAS97:1671[00]). It is interesting that the
homologous T4 topoisomerase is membrane-associated and that gyrB(ts) mutants fail to segregate their
DNA properly, cannot decatenate, and place septa inappropriately (as if GyrB might also be involved in a
similar function for the chromosome).
similar function for the chromosome). In summary, there appear to be effects that can be assigned to
supercoiling as well as those that cannot. Are these each separate "systems" or does one cause its
effects on partitioning indirectly, by affecting other mechanisms?
pBR has lost the partitioning site (cer) of its parent, ColE1, and therefore partitions poorly;
addition of cer, or of par from pSC101, restores proper partitioning. The cer site is involved in a site-specific
recombination system that is necessary for generation of monomers and therefore aids stability of the
plasmid. Two different Ec recombinases, XerC and XerD, bind to separate halves of the recombination
site to support the reaction. Rec- strains tend to show decreased stability of these plasmids, suggesting
that some monomerization can occur through RecA. Similar systems seem to be operative in other small
multi-copy plasmids. Surprisingly, the large, low copy-number plasmid RP4 also seems to carry a system
for resolving multimers: one par gene encodes a product with homology to the Tn3-family of resolvases.
F factor contains the sop system (stability of plasmids; also called parABC) whose exact
mechanism is unclear. There are two trans-acting proteins, the products of sopAB, the latter binding to the
sopC site, which is itself a series of 12 direct repeats of a 43-mer. Hop mutants of Ec, defined by their
failure to maintain mini-F factors, have also been found, and their phenotypes speak to the partitioning
process. Mutants affected in hopA are defective in partitioning, and hopA turns out to be a gyrB allele.
These mutations cause relaxation of the plasmid and overproduction of SopB, resulting in the partitioning
defect noted above. A hopE mutation allows the formation of large linear plasmid multimers. hopE is
allelic to recD, and recD mutations give the same phenotype with pSC101. hopB, C, and D are all partially
defective in plasmid replication. None of these mutations affects Ec partitioning, which is rather
surprising given the similarities of F and Ec replication systems. Possibly the effects of the mutations on
the latter were not sufficient for detection.
As noted above, yeast replicons have two different segregation systems. Those with centromeres
use protein complexes known as kinetochores to attach the centromeres to microtubules. These are
themselves attached to the two spindle poles, which then pull apart the sister chromatids. The general
process of separating the two sister chromatids and sending them to the opposite poles is called
disjunction. The failure to do this properly, such that both chromatids end up in the same cell, is called
non-disjunction and is (to my mind, at least) remarkably frequent - about 10^-3 per cell division in yeast and
higher organisms. In humans, this sort of failure leads to things like Down syndrome, which is trisomy of
chromosome 21. The segregation system for the 2μ plasmids is much more prokaryotic-like in that there
are specific proteins that bind the plasmid to the nuclear matrix, which then leads to proper segregation.
Plasmid-based systems for killing plasmid-free daughter cells. Some plasmids have evolved systems that
prevent segregation by killing any daughter cell that has not received a plasmid. When these systems are
mutated, the general effect on plasmid segregation is similar to that of mutations affecting plasmid
partitioning, although the mechanism is totally different. These systems work by producing a
relatively long-lived killing function and a short-lived kill-override function. A daughter cell without the
plasmid will initially have both functions, because there are multiple copies of each in the cytoplasm
before cell division. However, without the plasmid, neither function will be replenished in the daughter cell
by new synthesis. After a while, the less stable override function decays and the cell is killed.
experimenter, these systems look like partitioning systems, since plasmid-free segregants are more
frequently detected in plasmid mutants lacking these functions. They can also appear to be inc functions,
for the following reason. Plasmids appear to be stably inherited because they kill plasmid-free daughter
cells. But if there are two different plasmids in the cell, but with the same kill system, a daughter cell that
fails to receive one plasmid will survive because of the presence of the kill-override provided by the other
plasmid. As a consequence, one or both plasmids will have the appearance of being less stable in the
presence of the other plasmid - the definition of incompatible plasmids. As described below, these kill
systems can either be RNA-based or protein-based.
All RNA-based systems use the same general motif: a long-lived mRNA encoding a lethal gene
product (often termed kil) and a short-lived antisense RNA that inhibits synthesis of the toxin (termed kor).
When the mRNA ends up in a plasmid-free cell, there is no replenishment of the unstable inhibitor RNA,
so translation occurs, and the cell dies. In the case of plasmid R1, the players are termed hok and sok
(host cell killing/suppression of killing). sok RNA binds to hok mRNA to create secondary structure that
occludes ribosome access to the Shine-Dalgarno sequence. This system can be added to, and stabilizes, a
range of other plasmids. F plasmid has a roughly similar system termed flmAB (F leading region
maintenance)(Gene66:259[88]).
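The kinetics of this differential-decay scheme are worth a quick sketch. The half-lives and the killing threshold below are hypothetical round numbers, chosen only to show why the daughter dies minutes, not generations, after losing the plasmid.

```python
import math

# Differential-decay timing for post-segregational killing (hok/sok
# style). Half-lives and the killing threshold are hypothetical round
# numbers; only the behavior of the ratio matters.

t_half_toxin = 60.0        # min: stable killer mRNA/protein
t_half_antidote = 2.0      # min: unstable antisense RNA or antitoxin

def fraction_left(t_min, t_half):
    return math.exp(-math.log(2) * t_min / t_half)

# Starting equimolar, suppose killing begins once the antidote falls
# below 10% of the toxin level (an arbitrary threshold).
for t in (0, 3, 6, 9, 12):
    ratio = fraction_left(t, t_half_antidote) / fraction_left(t, t_half_toxin)
    flag = "  <- below threshold, cell dies" if ratio < 0.1 else ""
    print(f"t = {t:2d} min: antidote/toxin = {ratio:5.3f}{flag}")
```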
Protein-based killing systems operate either by preventing successful cell division until partitioning
occurs or by killing those cells that have been generated without a plasmid (ASM pp1110-1133). With
chromosome partitioning of E. coli, there are two genes, kicAB, that become a killing system when moved
to a plasmid, with KicB being the lethal element, though the mechanism is unknown. F factor also uses a
pair of genes, termed ccdAB, where the role of CcdB is apparently inhibition of host cell division and the
CcdA antagonizes this action. The ccdB gene, encoding the killing function, has also been developed as a
counter-selectable marker (BioTch21:320[96]). Related systems have been found in IncP plasmids, P1,
IncN, R1, R100 and others.
Now the following might be a pointless aside, but I found it interesting: eukaryotic cells have a
system of cell suicide, termed apoptosis. Is it a complete coincidence or a remnant of ancient evolution,
that apoptosis also involves a long-lived toxin (caspase) and a short-lived toxin inhibitor?
Incompatibility. All prokaryotic plasmids fall into only one of the many existing incompatibility groups.
Two plasmids are termed incompatible if either is less stable in the presence of the other than it was by
itself. As an example: you grow a strain with plasmid A for 20 generations without selection and find that
98% of the cells have the plasmid at the end. You then add plasmid B to the same strain and again grow
the strain for 20 generations and measure the percentage of cells that have retained plasmid A. If it is
significantly less than 98%, the plasmids are incompatible. The point is that incompatibility is not
necessarily complete (i.e. the two plasmids cannot coexist at all), but rather statistical.
There are more than 30 incompatibility groups thus far described with no upper limit in sight.
Incompatibility, whose genotypic designation is inc, is often a necessary consequence of a plasmid's
desire to maintain a certain copy number in the cell. If plasmids of a given incompatibility group have a
certain copy number that they attempt to maintain, then a competition will result when two plasmids of the
same incompatibility group are found in the same cell. Whichever plasmid is able to replicate faster, or
has some other advantage, will be represented to a disproportionate degree among the copies allowed by
the incompatibility system. Somewhat surprisingly, plasmids can also be incompatible when they both
possess the same functions for partitioning themselves into daughter cells. Again, an example should
clarify that: Let's say that plasmids A and B have completely different replication systems but identical kil
systems. Growing a strain with either plasmid alone might result in virtually 100% of the cells retaining
each plasmid, since plasmid-free cells would have been killed. However, if both plasmids were put in the
same cell and the experiment repeated, many cells would be found that contain one or the other plasmid, but not
both. That's because a daughter cell that failed to get plasmid A after a cell division would not be killed if it
did get a copy of plasmid B, since the inhibitor of the toxin encoded by B would prevent killing by the
identical toxin from A. Note that you would not find cells lacking both plasmids (or at least they would be
as rare as they were in the original experiment starting with single plasmids per cell). In this example, the
plasmids are not exactly competing with each other, as in replication, but each would allow cells to survive
that lacked the other plasmid and we would see this as incompatibility.
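The statistical nature of incompatibility arising from a shared replication-control system can also be shown with a little simulation: hold the total copy number fixed and let random sampling at each division do the rest. Everything here (N, the trial count) is illustrative.

```python
import random

# Two co-resident plasmids, A and B, sharing a single copy-number control:
# the cell replicates everything to 2N copies but passes only N to a
# daughter. Random sampling alone then drives the line to pure A or pure
# B -- incompatibility as a statistical outcome. N and trials are invented.

def generations_to_pure(N=10, trials=500):
    total = 0
    for _ in range(trials):
        a = N // 2                     # start with half A, half B
        gens = 0
        while 0 < a < N:
            doubled = [1] * (2 * a) + [0] * (2 * (N - a))
            a = sum(random.sample(doubled, N))   # N copies reach the daughter
            gens += 1
        total += gens
    return total / trials

random.seed(0)
print(f"mean generations until one plasmid is lost: {generations_to_pure():.1f}")
```

Note that neither plasmid has any replication advantage here; pure chance under a shared copy-number ceiling is enough to make the pair look incompatible.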
Conjugation and secretion in prokaryotes. In the past several years it has become apparent that the
conjugation systems that move DNA among bacteria are related to a larger family of secretion systems
that are able, as a group, to transmit a variety of molecules from their hosts into different recipient cells or
simply out into the environment (CurOpMic6:519[03], AnnRevGenet44:71[10], MMBR74:434[10],
JBact192:3850[10]).
(i) There are different families of systems that transport proteins across the inner or cytoplasmic
membrane: Sec (general secretory), SRP (signal-recognition particle) and Tat (twin-arginine translocation
- see below). Rather remarkably, components of the Sec and SRP pathways have been found in every
organism, including eukaryotes, that has been examined. These two pathways also have phylogenies
identical to that of 16S rRNA, against suggesting that they are quite ancient and have not been
horizontally transferred.
(ii) Type II secretion. Such systems appear to transport both toxins and proteins. They have been
found in relatively few bacteria, including Pseudomonas and Yersinia and therefore are not as well
understood. The Sec system is used for moving the proteins across the inner membrane.
(iii) Type III (or TTSS) secretion systems. Such systems certainly transport proteins, but it is
unclear if they can transport other things as well. They were initially thought to be specific to pathogens,
but have since been found in a variety of microbes, though they might all be organisms that associate with
eukaryotes at some point in their lives. Some organisms have multiple systems, though the specificities
are unclear. This system uses a pilus system (a polymer of pilin) to attach to recipient cells (both
prokaryotic and eukaryotic), though the further roles in secretion are unclear (and discussed under the
conjugation model below).


(iv) Type IV (or TFSS) secretion systems. These systems also involve pili to identify recipient
prokaryotic and eukaryotic cells as detailed below, though not all examples appear to have such a broad
range of recipients. Somewhat remarkably, the conjugation systems that move DNA between cells are a
subset of this family and most members appear to simply transport proteins. A few organisms use TFSS
to either take up (Helicobacter) or secrete (Neisseria) DNA into the environment. Surprisingly (to me,
anyway) DNA conjugation between strains is much higher (up to 10^3-fold) when both partners express a
TFSS than when only the donor does.
(v) Type V secretion systems. These apparently function as protein-translocating outer
membrane porins and are only found in gram-negative bacteria, though they are fairly broadly distributed
within that group. Like the Type II system, the Sec system is also involved.
(vi) Type VI secretion systems were found in Vibrio and in Pseudomonas and appear to be
widespread in gram-negative bacteria. This system does not involve N-terminal signal
sequences and is therefore probably independent of the Sec system.
(vii) The Tat system (twin-arginine translocation - named for two adjacent arginines in the signal
sequence of substrate proteins) is unusual in that it transports completely folded proteins through the
membranes, rather than transiently unfolding them. The system is found in bacteria, mitochondria and
chloroplasts.
So there seem to be many variations on the theme of a system that can transport different
classes of molecules out of the cell. Some of these, as described in a bit of detail below, certainly involve
cell-cell contact and a pore or channel between those cells. But obviously some do not: those systems
that send molecules into cells with which a bacterium cannot "fuse" (like plant cells) and those systems that
simply move molecules out of the cell. So what to make of a recent publication with the title "Intercellular
nanotubes mediate bacterial communication" (Cell144:590[11])? Is this a fundamentally new observation,
as the authors suggest, or simply a variation on the above themes, sexed up with the term du jour
"nanotube?" I confess that I do not know, but my bet is that it is the latter, largely for a variety of technical
reasons.
Conjugation per se. (MolMicro45:1[02] and FEMSMicLett224:1[03]) Conjugation is defined as the
unidirectional transfer of genetic information between cells by cell-to-cell contact. As such, it is not
restricted to plasmids, but can occur with any DNA so long as the critical elements below are present in
the cell. The ability of a genetic element to promote the transfer of a piece of DNA carrying a specific site
from one cell to another is termed conjugative ability. The requirement for cell-to-cell contact
distinguishes conjugation from transduction and transformation, which will be discussed below. The term
unidirectional refers to the fact that a copy of the plasmid is transferred from one cell, termed the donor,
to another cell, termed the recipient.
As noted above, this system is a specialized form of the Type IV secretion systems. Indeed, the
argument has been made that it evolved from a strict protein secretion system by first attaching a protein
to the end of the single-stranded DNA to be transferred, so that the system effectively transferred the
protein and the DNA came along. Certainly now the system has evolved to also pump the DNA in an
ATP-dependent process (Fig. 9-1).
Figure 9-1. A two-step model for conjugal DNA transport. Horizontal thick black lines represent bacterial
membranes, traversed by grey cylinders that represent the T4SS. TrwC is represented as the two-domain
circle + oval (relaxase + helicase) shape; TrwB is represented as a hexamer, with an orange-like shape,
anchored to the inner membrane. DNA is represented by a thin black line; newly replicated DNA, by a
dashed arrow. The vertical arrowhead represents the nic site. Curved arrows indicate postulated motion
forces required for DNA movement. A. TrwB is coupling the T4SS and the relaxosome; a TrwC monomer
covalently linked to the nicked T-strand is the substrate for T4SS secretion. B. TrwB is pumping out the
T-strand as it is displaced from the donor plasmid. Upon reaching the nic site for the second time, the
TrwC monomer in the donor would perform a second strand-transfer reaction, thus liberating the T-strand.
The translocated TrwC monomer would rejoin the two T-strand ends by a reverse cleavage reaction.
(MolecMicro45:1[02])
There are two dissimilar functions involved in conjugative ability: the first is a site of initiation of
transfer that is called either oriT or mob. The former term is a mnemonic for "origin of transfer" and the
second is short for "mobility. In each case they refer to a site on the DNA and not to a diffusible product.
The second group of functions involves those proteins that cause the range of functions necessary for
mobilization to occur. These are encoded by the tra genes and have a variety of functions.
The first tra function is the formation of the pilus that makes contact with the recipient cell and
draws the donor and recipient cells together. It so happens that there are two slightly different versions of
pili: F factor (and IncH, -T, and -J plasmids) have long flexible pili, while P-type systems (IncP, -N, -W, and
-I) have short rigid pili (the Inc in these cases refer to incompatibility groups that happen to correlate with
certain molecular properties of the transfer systems). Somehow attachment of the tips of these to an
appropriate target seems to signal the retraction (which is also a de-polymerization) of the pili, drawing the
donor and recipient cell together, though some of this is still only hypothesis. At this point, there is
certainly something of a pore opened between the two cells. The identity of many of the players forming
that pore is known, but critical details about the function are opaque. It is certainly an important point that
all sorts of cells, including eukaryotes, can be recipients in conjugation (depending on the conjugation
system), so the entry mechanism is certainly dependent on the donor and not on the recipient.
This then raises the very old question of the role of the pilus. For many years, it was thought that
the DNA actually moved through the pilus from donor to recipient, though this was based on the
correlation between mating and the presence of pili and nothing more. Then this notion was completely
rejected, in part because it appeared that the pilus was simply too small in diameter to perform this
function. Now, however, the point is again debated, because the pilus would seem to be a good
mechanism for punching a hole in the membrane of some recipients, and there is some evidence to
support that view. However, it might simply be that the pilus serves a role in creating a pore, but that DNA
passes through that pore and not through the pilus itself. It is also true that some Agrobacterium mutants
fail to make pili but still transfer their DNA to plant cells, suggesting that in this case, at least, the pilus is
not essential for transfer.
Some event in this sequence triggers the nicking of oriT by a specific single-strand nuclease and
a subsequent binding of one or more pilot proteins to the free 5' end of the DNA. A single strand is then
transferred from this end to the recipient while a rolling circle form of replication occurs in the donor. The
protein pump that causes this transfer is an FtsK-family coupling protein (TrwB in Fig. 9-1), though a substantial protein complex of tra factors is
apparently involved. If the DNA being transferred is a plasmid, it is made double-stranded through the
action of the pilot protein serving as a primase (but see below). It is circularized in the recipient by an
unknown mechanism, whereupon it can presumably replicate. If the transfer DNA is chromosomal,
circularization does not occur, but homologous recombination with the chromosome can occur (in any
case, the incoming DNA must become associated with a replicon if it is to be inherited).
The conversion of SS DNA to DS DNA is actually a bit tricky, for a variety of reasons. One is how
a primase actually starts; the other is more complicated. SS DNA is not a common feature in
normal cells and actually serves as a signal of DNA damage, which elicits the SOS response and the
action of the Rec system. But if you are a conjugating plasmid, you do not want to go down this path, so
you need to prevent your transferred SS DNA from causing this response. This is apparently done by the
production of a protein, PsiB (in the case of F factor) that interferes with RecA action in some way and is
encoded by the leading end of the transferred F factor DNA (CritRevBiocMolBiol42:41[07]). But of course
the F factor cannot make this protein all the time because that would harm the ability of its host to deal
with real DNA damage, so this protein should only be produced in the recipient cell during conjugation. So
this raises two issues: how does the cell produce a protein from SS DNA and then how does it not
produce the same protein from DS DNA? It appears to be the case that very select DNA sequences can
function as promoters when they are single-stranded (SS), or more precisely, when the relevant region
does not have a perfect DNA complement present and therefore forms an imperfect DNA duplex with
another section of SS DNA. Presumably, there is not a lot of SS DNA in most cells, so these situations
have only been seen so far in phage (Cell70:491[92]) and plasmids (Cell89:897[97]). Though there are
some differences in the two cases, such as the role of single-stranded binding protein, the notion seems
to be that the imperfect double-stranded region is adequate for binding holoRNAP and then easy to
convert to an open complex because of the base mis-pairing. Thus, the same regions when found in a
normal DNA duplex do not serve as promoters. At least in the case of F factor, this means that the single-stranded DNA that enters the recipient cell during conjugation has a promoter near the oriT that allows
expression of certain genes only in the recipient cell. The same promoter is also involved in conversion of
the entering SS DNA to a DS copy that eventually circularizes in the recipient.
The problem with conjugative plasmids, at least for in vitro cloning work, is that the tra region is
large, so the plasmids are necessarily large. However, the oriT region is small and can be cloned onto
any plasmid, making it non-conjugative and yet mobilizable (if the tra products are supplied by another
plasmid). Many applications in inverse genetics employ small plasmids containing oriT regions. Finally, a
plasmid lacking both the tra functions and oriT functions would be non-conjugative and non-mobilizable.
A conjugative plasmid must make an interesting decision: Should it perform rolling circle
replication at oriT, which leads to conjugation, or should it start normal replication at oriV (these gene
names refer to the IncP family of plasmids)? To prevent the simultaneous use of both replicons, there is a complicated
regulatory circuit that determines which path will be followed: at the heart of it is a pair of divergent
promoters, of which only one can function at a time, that lead to the expression of the two competing
replication proteins.
It happens that F can mobilize other plasmids like pBR322 derivatives, by the occasional
transposition of the γδ element (Tn1000) of F into the smaller plasmid to form a cointegrant, which moves to the
recipient and resolves by homologous recombination to yield pBR with a γδ insert. The occasional
appearance of γδ-free pBR's in a recipient is apparently due to the presence of a pBR dimer in the donor,
which is resolved to monomers, with and without γδ, in the recipient.
Regulation of conjugation. F factor expresses its tra functions constitutively, but this is the exception.
Typically, regulation of transfer is negative and this is often relieved by a quorum-sensing system. One
implication of this is that zygotic induction should occur immediately following plasmid transfer to a new
host, effectively leading to a self-propagating wave of conjugation in the population (zygotic induction
refers to the situation in which a piece of DNA entering the cell does not yet have the normal complement
of regulatory proteins accumulated and therefore transiently behaves in an unregulated manner). Besides
the conjugation induction by drugs noted above, there are quorum-sensing plasmids whose conjugation
ability is induced by bacterial pheromones in Enterococcus faecalis (JBact175:6229[93]). These
organisms and plasmids are of clinical significance, but the mechanisms behind the pheromone-induced
aggregation, leading to enhanced conjugation, are poorly understood.
Suicide plasmids. One of the common uses of plasmids actually involves conditions when they cannot
replicate. Obviously, they must replicate as plasmids in some cell or they would not be plasmids, so one
takes plasmids that can replicate in some strains but not in others. The various manipulations for plasmid
construction are performed in the permissive strain and then the plasmid is moved to a non-permissive
strain. Typically this is done by conjugation (such plasmids are typically mob+ but tra-), but it can also be
done by transformation. Selection for a drug-resistance marker on the non-replicating plasmid demands
that it integrate into some replicon in the cell, and this will typically occur by homologous recombination
between a chromosomal region cloned on the plasmid and its counterpart in the chromosome. This is a
very useful method for creating mutations in the chromosome of any prokaryote that can accept DNA,
and the methodology is discussed in greater detail at the end of LT4.

607 Lecture Topic 10............. TRANSFORMATION (see ASM2,2449[96])


Transformation is the process of uptake of naked DNA into cells.
Natural transformation in prokaryotes. Natural transformation refers to situations where organisms
possess a specific system for taking up exogenous DNA and almost certainly use that system in their
environment (ARM40:211[86]). While natural transformation is of considerable interest to the organisms
that do it and to the researchers that deal with such organisms, the phenomenon is not so widespread
that it merits a vast amount of coverage in this course. It has been observed in Streptococcus,
Haemophilus, Neisseria, Bacillus, and others; all but the Neisseria seem to regulate the timing of
competence (the ability to take up DNA). Somewhat surprisingly, most of these beasts chew the incoming
DNA into single-stranded pieces and then use recombination to incorporate those pieces for which there
is homology. You would therefore expect to have trouble introducing plasmids by such a mechanism,
since there will typically not be such preexisting homology. Indeed you do have problems transferring
plasmids unless plasmid multimers are introduced (presumably, overlapping double-stranded regions are
created and the plasmid is circularized by a rec system), or through use of a high concentration of plasmid
monomers (probably overlapping fragments allow some version of the circular plasmid to reform). There
seem to be four stages to the process of natural transformation:
(i) Development of competence: Many of the gram-negatives develop competence when they
stop growing, but in at least some of the gram-positives, competency is induced by excreted protein
factors that reach a critical level only when the cell density is high (quorum-sensing). The gram-negative
Campylobacter spp. show competence for their own DNA during log-phase growth (JBact172:949[90]). In
Streptococcus pneumoniae, the comA and comB genes are necessary for the induction of competence in
response to population density. Bs also regulates competence in response to cell density, through
recognition of an extracellular peptide that starts a signal cascade. This results not only in a change in the
cell surface, but in the synthesis of a DNA uptake system (PNAS91:9397[94]).
(ii) DNA binding: Most of these organisms bind DNA randomly, but Haemophilus has a specific
recognition system that has a strong preference for a particular 9-bp sequence or its inverse complement
that is somewhat common in its own genome (JBact172:5924[90]). This implies that it might be a
deliberate process to exchange DNA among related strains (a chance-expectation calculation follows this
list). The specificity that Campylobacter also
shows for its own DNA indicates some recognition sequence or modification. In Bs, both a DNA-binding
protein and an associated nuclease (see below) have been identified and the genes cloned
(JBact170:3703[88]).
(iii) DNA uptake: Bacillus and Streptococcus make single-stranded nicks in the DNA, and then
only transport single-stranded sections, with the complementary strand being degraded. In the case of
Streptococcus, the nuclease responsible for the nicking, encoded by endA, is necessary for DNA uptake;
endA has been cloned and the EndA product analyzed (JMB312:727[90]). Haemophilus and Neisseria
(JBact170:756[86]) take up the double-stranded DNA fragments, the former into vesicles. It is unclear
how the DNA is removed from these or how it is processed to a single-stranded form.
(iv) DNA integration: It seems that integration of the incoming DNA into a replicon in the recipient
always involves a single-stranded displacement of DNA already in the host. In the case of Bs, there are a
set of gene products known to be involved in this integration. With S. pneumoniae, the recP gene product
has been recognized as necessary for efficient recombination of donor DNA into the chromosome.
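A quick calculation shows why a 9-bp preference implies deliberate self-recognition rather than chance. The genome size is approximate and equal base frequencies are an assumption; the actual count of the Haemophilus uptake sequence is reported to be on the order of a thousand copies, roughly a hundred-fold above the chance expectation computed here.

```python
# Chance expectation for a specific 9-bp sequence in the Haemophilus
# genome, assuming equal base frequencies (a simplification). The genome
# size is approximate; the reported count of the actual uptake sequence
# is on the order of a thousand copies, far above this expectation.

p_site = 0.25 ** 9                  # exact match probability per position
genome_bp = 1_800_000               # ~1.8 Mb
expected = 2 * p_site * genome_bp   # count both orientations
print(f"expected by chance: {expected:.0f} sites")   # ~14
```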
As an aside, remember the argument in LT1 that homologous recombination was mainly a
mechanism for DNA repair? One might expect that organisms with natural competence might be a little
different, since they expect to handle incoming DNA for inheritance purposes. On the other hand, bacteria
that do not have natural transformation have little reason to expect homologous DNA to enter.
Presumably then, successful transduction, transformation, or conjugation (where recombination is
required) might be by a mechanism that mimics some events in repair. Indeed, this is probably why
single-stranded DNA is sometimes taken up in transformation. Also, the double-stranded blunt ends of
transforming and transducing fragments are reminiscent of damaged DNA and are therefore
recombinogenic.
Chemically induced transformation. (ASM pp1177ff) This approach will receive the most treatment,
though electroporation (see below) has largely supplanted it.
Transformation methods. The general requirements for induced transformation are divalent cations and a
transient temperature near 0°C. Since different cell lines have different surface and membrane properties,
it is not surprising that other procedures can also work: treatment of some lines with cations, DMSO or
dithiothreitol (Cleland's reagent) has also been effective. It is unclear if this treatment's effects are
chemical or physiological.
One measures the success of transformation by the number of transformants, which are cells with
the inherited marker. This is normalized to either the number of input cells or the amount of input DNA. In
a sense the relevant number to the experimenter depends on what is limiting in their particular system. In
optimized systems, one obtains 10^9 transformants per µg DNA, and this reflects a probability of about 1%
for a given plasmid being inherited by some cell. As noted above, transformants per input cell can
for a given plasmid being inherited by some cell. As noted above, transformants per input cell can
approach 10%, under the best conditions.
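It is worth checking that arithmetic. For a hypothetical 3-kb plasmid (the ~650 g/mol per base pair figure is a standard approximation for double-stranded DNA), 10^9 transformants per µg corresponds to well under 1% of the input molecules:

```python
# Back-of-the-envelope: what fraction of input plasmids is inherited at
# 1e9 transformants per microgram? The 3-kb plasmid size is hypothetical
# and ~650 g/mol per base pair is the standard approximation for dsDNA.

AVOGADRO = 6.022e23
plasmid_bp = 3000
grams = 1e-6                                   # one microgram of DNA
molecules = grams / (plasmid_bp * 650) * AVOGADRO
print(f"plasmid molecules per ug: {molecules:.1e}")        # ~3e11
print(f"fraction inherited: {1e9 / molecules:.2%}")        # ~0.3%
```

That lands within a factor of a few of the ~1% quoted above, which is all an estimate like this can promise.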
In some organisms, only a fraction of cells are competent to take up DNA at any given time and
this has led to the technique of co-transformation. In this procedure, you move an unselected marker into
the recipient by selecting for the simultaneous inheritance of a selectable marker known to be on another
piece of DNA. The underlining is to emphasize that this flies in the face of the assumptions typically used
for mapping, that is, that two coinherited markers must be genetically linked. The principle is clearly that,
in systems where competent cells are rare, the inheritance of the selectable marker serves to flag
competent cells, which are the only ones likely to have acquired the other, desired marker. The problem is
that if an unselected marker is inherited in a reasonable fraction of the competent cells, they must be
taking up vast amounts of other DNA and incorporating it more or less randomly, with obvious implications
for mapping and strain construction.
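A toy calculation makes the flagging logic concrete; both numbers below are invented for illustration.

```python
# Why the selected marker "flags" competent cells. Both numbers are
# invented for illustration.

f_competent = 0.01      # fraction of the population competent at all
p_uptake = 0.10         # chance a competent cell inherits a given marker

# Unselected marker frequency among all cells vs. among transformants
# (uptake of the two unlinked markers is assumed independent):
among_all = f_competent * p_uptake
among_selected = (f_competent * p_uptake * p_uptake) / (f_competent * p_uptake)
print(f"among all cells:      {among_all:.1%}")       # 0.1%
print(f"among selected cells: {among_selected:.0%}")  # 10%
```

That is a 100-fold enrichment purely from conditioning on competence, with no genetic linkage anywhere.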
Biology of the process. It is clear, at least in the case of plasmids, that there are two phases in chemically
induced transformation: uptake of DNA from the environment and establishment of the plasmid, in which it
achieves a replicating state. The frequency of cells in the population capable of performing the first phase
varies (depending on the cells and the protocol) from 10^-5 to 10^-1 of the population.
There are a large but defined number of sites through which DNA can enter the cell. These
channels may be coincident with the zones of adhesion of the inner and outer membranes, but they could
also be spaces in the LPS surface. In any event it is clear that the LPS provides an impediment to DNA
uptake, since strains with less LPS tend to be more transformable. Treatment of cells with Mg++ seems to
cause the loss of LPS and this correlates with better transformability. Both DNA and LPS are polyanions
and would be expected to repel each other, so the typical treatment of cells with Ca++ ions to aid
transformation is consistent with masking the two anions. Shifts to low temperature, known to improve
DNA uptake, no doubt solidify the membrane, perhaps making it easier to generate a channel with the
anionic charges shielded. There are contrary reports, however, that argue that LPS is unimportant and
that the function of the treatment is to induce bacteria to synthesize poly-hydroxybutyrate and
incorporate this into their cytoplasmic membrane, causing a new lipid phase transition for both Ec and
Azotobacter.
Something curious also happens with the DNA itself. It must assume a relatively compact
structure, since the extended size of many transformable DNA molecules is substantially larger than the
bacteria themselves. The nature of the establishment phase of transformation is mechanistically unclear,
though the ASM review suggests that it might involve generation of an appropriate protein coat for the
introduced DNA, where the mechanism for the generation of that coat might be limiting. The inheritance of
replication-proficient plasmids, particularly supercoiled ones, could be imagined to involve little, if any,
DNA metabolism. However, transformation of genes not associated with a replicon would require some
recombinational event for eventual stable inheritance. Transformation, unlike generalized transduction,
does not involve double-strand crossovers, but rather single-strand invasion of the host replicon. This
argues that the state of the introduced DNA by the two schemes is different and, as will be touched on in
the section on transduction, it is likely that phage factors allow transduced DNAs to assume a circular
form, which might allow them to use a different recombination pathway. There are also reports that SOS
induction increases transformability due to the presence of more RecA.
Induction of transformation by protoplast fusion. In gram-positive bacteria, a useful method of causing
DNA uptake is through the generation of protoplasts, or cells lacking cell walls. These cells are generally
formed by growth under conditions that inhibit wall synthesis and also stabilize the cells that are so
formed (high osmotic strength); enzymatic treatment of the cells is often necessary. Such cells are then
exposed to DNA and treated in such a way that they fuse to each other, apparently incorporating DNA
during this cell fusion. The cells are then put on a high osmotic strength solid medium to allow them to reform
the walls and eventually generate colonies. Such fusion events are difficult to control and the
multinucleate intermediates can be slow to resolve back to a haploid state. Indeed, this has been used in
mapping in Streptomycetes when strains of dissimilar genotypes were fused, but the results were rather
cumbersome.
Electroporation. (MolBiotechnol7:5[97], Bio/Tech6:742[88]) This technique was originally developed for
the fusion of eukaryotic cells, but has been used increasingly with prokaryotes. It entails applying a short,
but intense, pulse of current to the cells that seems to open transient pores in the cell membranes
allowing exogenous substances to be taken up. If DNA is present in the solution, it is also taken up,
apparently in a manner analogous to protoplast fusion. These pulses seem to generate a greater potential
than the cell membrane can support, leading to its transient breakdown in localized regions.
Transformation frequency depends directly on the concentration of DNA in the solution and inversely on
the size of the DNA to be taken up. The best conditions for Ec so far give slightly better results than those obtained by the very best
conventional chemical transformation. If you employ this method on a different bacterium, you may need
to spend some time optimizing conditions, but it should be successful. Preparation of cells for
electroporation is also significantly easier than preparing competent cells for transformation. The method
has also been used to extract plasmids (and other cell components) from cell into the media. The large
size of eukaryotic cells allows more sophisticated uses of the method, including treatment of single cells
(CurOpBiotech14:29[03]).

607 Lecture Topic 11..........VIRUSES AND OTHER INFECTIOUS ELEMENTS


"Why, it's just like lambda!" (said with reference to almost anything in a seminar) Waclaw Szybalski
Viruses and other infectious agents: Phage are parasites like ISs and plasmids and also require bacterial
hosts for replication, but phage also have the ability to move between hosts in an autonomous form.
(Clearly, there is a rather fine line between phage and some conjugative plasmids.) Years ago, Andre
Lwoff defined viruses as infectious but non-autonomous agents consisting of proteins and nucleic acids.
As we have seen before, we can define things in whatever way we want, but there is no reason to hope
that there will not be cases that challenge the consistency of those definitions. At very minimum, it seems
that phage need a site for replication, though they typically also have proteins involved in their replication
as well as in forming a coat for their extracellular existence. As with ISs, they cannot afford to be too
virulent, but their ability to survive outside their host lessens this constraint somewhat (relative to ISs).
The definition of viruses is rendered more complicated by the existence of viroids and prions.
Both of these are infectious agents, but neither has genetic information in the normal sense, nor do they
have any other structural features of viruses. Nevertheless, they do manage to move between hosts and
to amplify themselves in the cell. By my rather general definition of viruses above, these elements
certainly qualify, though I think no one calls them viruses.
Viroids are small single-stranded RNAs of less than 400 nt that do not appear to encode anything.
Nevertheless they cause a number of plant diseases. Oddly, though the underlying biology simply has to
be interesting, these have largely disappeared as a topic of active research, perhaps because only plants
are known to be affected and the diseases are not of large economic impact. It is starting to look as
though an important aspect of their biology is through the ability to create viroid-specific RNAs that affect
host gene expression through the general mechanism of silencing (TrendPlntSci9:339[04]).
Prions are another matter, scientifically and politically. They are even more remarkable than
viroids scientifically, since they are infectious protein, and they are politically different because they cause
diseases in humans. In a number of diseases, which include scrapie in sheep and Creutzfeldt-Jakob
disease, kuru, and fatal familial insomnia in people, there is apparently a protein (PrP^sc or PrP^cjd, as
appropriate) that is a homolog of a normal cellular protein, but exists in a different conformation. The
altered protein is infectious because it can be taken into the cell and convert the normal cellular protein
(PrP^c) to this altered form. Thus, PrP^sc increases its level even though it is not exactly replicating itself.
The infectious form is more protease-resistant, less soluble, and has more β-sheet structure than the
normal form. Genetic analyses have identified a number of mutations in the gene encoding PrP that
cause inherited versions of the same disease (Sci302:814[03] & TICB13:337[03]). These last variants are
essentially proteins that are inherently more predisposed to adopt the PrP^sc conformation spontaneously.
The above description implies that a prion is a structurally altered protein that perturbs the
structure of other proteins with the same primary structure, but with different tertiary structures; that is,
products of the identical or nearly identical gene. However, one can envisage prions that interact with and
perturb the structure of completely different proteins, though that newly perturbed protein then needs to
be able to continue to propagate the prion state. Such a situation seems to be the case with some yeast
prions that are able to aggregate with different proteins as long as they share a region rich in Asn and Gln
residues (PNAS106:1892[09]). Interestingly, at least some prions need some modest level of activity of
certain chaperones, such as Hsp104 in yeast. The logic is that without any chaperone activity, the prions
all aggregate and are not partitioned into both daughter cells upon cell division. However, with high
chaperone activity, the prion is unable to successfully aggregate the non-prion form of the protein
(PNAS105:16596[08]).
Another complication for our conception of a phage/virus is the existence of some massive
viruses (not phage!) that are found in some amoeba and termed mimiviruses. The first curiosity is the
massive size, with a genome of 1.2 Mb and 1,200 ORFs. Another surprise is the large number of genes
whose products are normally only found in living cells, such as tRNA synthetases, some translation
machinery, DNA repair enzymes and chaperones. Perhaps most oddly, when some of the genes whose
products are conserved in all living organisms are compared phylogenetically, the mimiviruses fall
near the eukaryotic clade. The author actually suggests that it might represent a fourth branch of life that
has degenerated into a virus-like state (ASMNews71:278[05]).
Almost unique among living organisms, yeast are NOT known to have associated viruses, and I
cannot really imagine why this should be so. However, remember that both the Ty elements and the killer
RNAs of yeast are associated during at least parts of their life cycles with virus-like particles. It is unclear
if this reflects their prior history as viruses or if they have simply grabbed the genes for this process from
another long-forgotten element.
Molecular biology of phage (for a fuller treatment, see ASM2,2325[96]): Phage can typically be found
anywhere there is a population of their hosts. Most small soil samples have some phage capable of
infecting a random streptomycetes strain, for example. Indeed the claim has been made that, by direct
counts, there are 10-fold more tailed phage particles than cells in environmental samples
(Nat340:467[89]). As there are approximately 5x10^30 prokaryotic cells on earth (PNAS95:6578[98]),
that's a lot of phage.
Perhaps more interestingly, there is significant similarity in specific functions between phage from very
different hosts, leading to the hypothesis that all phages are evolutionarily related to some extent
(PNAS96:2192[99]). One can find phage by direct plating of samples onto a soft agar lawn seeded with
the desired hosts and looking for clearing of the lawn in small circles, termed plaques. Alternatively a pre-enrichment step can be used, where the phage source is incubated with a liquid inoculum of growing cells
for a time to allow phage propagation, and then filtered to remove bacterial cells and screened as above.
A bit of care must be taken in interpreting clearings as phage. A small colony of antibiotic-producing
bacteria can have a similar appearance, as can Bdellovibrio cells, which are bacteria that feed on other
bacteria.
Phage have been studied since their discovery in the late 19th century, but there are now three
main reasons for our interest in them. (i) They are fast and relatively simple models for many basic
biological phenomena. Their analysis can provide insight into the functions of their host cells directly. For
example, gro mutants of E. coli allow growth of defective phage and were first identified that way. A major
class of these turns out to be heat shock/ chaperone proteins, which are of major importance across all of
biology (ASM2,922[96]). (ii) They are useful as tools for genetic manipulation, and this is the concern for
this course. (iii) They have a potential impact on any industrial process involving the use of bacteria (see
"Uses" below). Comparisons of the genomes of different phage have also provided insights into evolution
and genetic exchange (ARG33:565[99]).
Phage methods. Phage stocks are generally kept cold, or even frozen with cryoprotective agents. Storage
conditions usually include divalent cations for both phage stability and subsequent bacterial infection, and
sometimes an agent (gelatin) to increase the solution viscosity, which seems to protect phage from shear
damage. Stocks for long-term storage should have as little bacterial debris as possible to prevent phage
from slowly killing themselves by infecting such debris. Obviously stocks should be sterilized, typically by
chloroform addition or filter sterilization.
One can count the number of phage in a lysate several ways. Electron microscopy has been
used, but has drawbacks: phage need to be present at extremely high titres to find a useful number in a
field of view (>10^11/ml). It can also be difficult to tell phage from debris, and impossible to resolve viable
phage from non-viable. The functional assay of plating dilutions of stocks on lawns of bacteria to detect
plaques is most commonly used. The use of a soft agar overlay for the bacterial suspension is particularly
good since it gives a lawn of even density and allows more consistent mobility of phage through the
lawn.
The common way of describing the ratio of phage to cells in any given infection experiment is by
the multiplicity of infection, abbreviated m.o.i.. Another common abbreviation is p.f.u., which stands for
plaque-forming units. This is the enumeration of the number of cleared zones on a bacterial lawn, each
representing the progeny of a single phage. It is therefore a measure of the number of viable phage at the
time of plating under the conditions tested.
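To make these definitions concrete, here is a minimal sketch (in Python, with hypothetical numbers) of the two calculations one actually does at the bench: computing a stock's titer in p.f.u./ml from a plaque count, and using the Poisson distribution to predict what fraction of cells is infected at a given m.o.i.:

```python
import math

def titer_pfu_per_ml(plaques, dilution_factor, volume_plated_ml):
    """Titer of a phage stock from a plaque count on a diluted sample."""
    return plaques / (dilution_factor * volume_plated_ml)

def fraction_infected(moi):
    """Poisson prediction: fraction of cells receiving >= 1 phage at a given m.o.i."""
    return 1 - math.exp(-moi)

# Hypothetical example: 150 plaques from plating 0.1 ml of a 1e-7 dilution
print(titer_pfu_per_ml(150, 1e-7, 0.1))   # 1.5e10 pfu/ml
print(fraction_infected(1.0))              # ~0.63: even at m.o.i. = 1, ~37% of cells escape infection
```

Note that even at an m.o.i. of 1, roughly a third of the cells receive no phage at all, which is part of why keeping the m.o.i. low is an effective way to limit reinfection (see "Uses of phage" below).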
Specific properties of different types of phage. There are almost an uncountable number of possibilities
for the properties of a phage, but this is a list of some common features and issues faced by phage.
(i) Phage can carry either single- or double-stranded DNA or RNA (though there are no phage
that carry more than one type). For purposes of gene transfer between bacteria, the double-stranded DNA
phage are the only ones known to be useful.
(ii) Phage genomes range in size from 3-400 kb.
(iii) Most phage have proteinaceous heads, but some also contain lipid (EMBOJ7:1821[88]),
which causes them to be chloroform-sensitive (strictly proteinaceous phage are chloroform resistant, so
phage stocks are often stored with a drop of chloroform to prevent any cells from growing in them).
(iv) Most phage have defined architectures to their capsids (the protein shell that contains the
nucleic acid). These structures are almost always inflexible because the proteins that form them can only
be fit together in a single pattern. As a consequence, most phage are limited in the maximum size of their
genome - they simply cannot exceed that which they can package in a single phage head. However, there
are some phage that have more amorphous capsid structures. These phage typically form a complex in
which a monolayer of protein subunits wraps around single- or double-stranded nucleic acid in a loose
helix, which results in a protein tube around that nucleic acid. As a consequence, the virion (a term
referring to the entire phage particle) can be as long as the nucleic acid requires (within certain limits).
(v) Most phage produce many progeny in the infected cell simultaneously and then lyse the cell to
release all those progeny (and the average number released is termed the burst size). However, some
phage cause the infected cell to continuously leak phage into the environment, without killing the cell. The
mechanism is complicated, but essentially involves placing the coat proteins in the cell wall and coating
the phage nucleic acid as it extrudes through the wall. Perhaps as remarkably, the infected cells grow
fairly well, at least under lab conditions, so that massive amounts of phage can be released into the
environment (up to 10^12 phage per ml).
(vi) Almost all phage have a lytic mode, except for those that extrude as described above. In this
mode, they synthesize progeny and kill the cell. But some phage also have a dormant stage, termed
lysogeny or the prophage state, where they repress the lytic functions and exist stably in the cell, typically
as an integrant into the chromosome (lambda and P22), but occasionally as an autonomously replicating
plasmid (P1).
(vii) Phage that integrate their genomes into a replicon in the host in the lysogenic state have two
rather different strategies. Some integrate by a roughly random transposition mechanism. The most
famous of these is Mu for Ec, but Ec phage D108 behaves similarly and there are at least two
Pseudomonas aeruginosa phage with similar insertional properties, though they differ in a number of
other respects (JBact172:1899[90]). This strategy has the advantage that you are not dependent on the
existence of specific sequence in the genome, but the phage runs the risks of damaging the cell because
of where it inserts. The alternate strategy involves integration by site-specific recombination. As
mentioned in the section on plasmids, some of these sites of integration are tRNA genes. The virtue of
these as targets is that the importance of these genes to the cell insures that they will be present. The
phage avoids the problem of damaging the gene by carrying part of the tRNA gene in its genome, so that
the recombination integration recreates a functional copy of the tRNA gene.
(viii) Temperate phage can often, but not always, be induced to switch from their dormant state to
a lytic one. Typically we accomplish this by putting their lysogenic hosts under conditions that indicate to the
phage that they are better off on their own. Agents that induce DNA damage, or conditions of poor cell
growth, can sometimes be effective for phage induction. There are many phage that seem to be
uninducible by external stimuli, which seems at first blush like an evolutionarily poor strategy. However,
the phage lysogens appear to spontaneously enter lytic phase at low frequency, which might ensure a
constant, albeit low, level of phage being released for as long as the population of infected cells survive.
(ix) The specificity of a phage for a certain host is due to not only the ability of the free phage to
recognize a given cell as a target, but also its ability to propagate in that cell, which reflects the various
host factors necessary for phage growth. Some phage have particularly specific targets, such as the
presence of a specific type of pili. Such phage do not seem to have obvious DNA injection abilities, so this
specificity might be because the phage gain entry into the cell upon pilus retraction. This specificity
creates a different problem, though, because a wide variety of cells can produce these pili and it is highly
unlikely that the phage can propagate in all these because of differences in necessary host machinery.
This specificity might therefore imply the fairly serious cost that many virions would infect cells in which
they could not propagate.
Others can vary their virion structure to recognize different hosts: Mu uses a site-specific
inversion system (gin) to produce either of two possible tail fibers. One allows the infection of E. coli, while
the other allows the infection of Citrobacter, another enteric bacterium. Some lysogenic phage alter the
host's surface to limit superinfection by similar phage (termed sie, for superinfection exclusion,
MicroRev42:385[78]). Phage have been isolated that are specific for each of the 3 porin systems of Ec,
allowing selection against the presence of each system (JBact172:1660[90]).
(x) When a phage decides to enter the lysogenic state, it runs the risk of accumulating random
mutations that prevent it from subsequently being able to make progeny under any circumstances.
Lysogens that have apparently undergone this process are sometimes termed cryptic prophage. They
might have only a single critical mutation initially, but over time, they will accumulate more. Eventually
large sections of the phage genome are certain to be deleted, because there is no selection for the host to
maintain them. Some cryptic phage can lyse cells and produce defective particles like phage tails, and
these are often detected as a class of bacteriocins (see below). More recently, of course, genome
sequencing has revealed regions of genomes that are homologous to known phage genes, which is a
strong indication of cryptic phage. See the general review in ASM2, 2041[96].
(xi) Bacteriocins are toxins produced by some cells that have the property of killing only closely
related bacteria (as opposed to antibiotics, which kill lots of things), and many of these are actually
defective (lacking DNA) phage or phage tails that can kill susceptible hosts by damage to the outside of
the cell. In the case of defective phage, one often finds that they fail to kill cells containing the same
phage due to an alteration of the cell surface that prevents adsorption. The widespread occurrence of
such defective phage through the Bacillus family suggests that this might be of significant competitive
advantage to the host (JBact172:2667[90]).
This is a reasonable place to mention the other class of bacteriocins that have nothing to do with
phage, but are species-specific antibiotics (because of phage specificity, phage tails will also be
species-specific, of course, which is why these two mechanistically very different phenomena were lumped
together). ColE8 encodes a colicin that is a DNase, while others encode small peptides that
permeabilize cells. Colicin M inhibits the synthesis of peptidoglycan in some way (MGG222:37[90]).
How do these bacteriocins differ from antibiotics? The distinction is fairly arbitrary: antibiotics are
typically active against a fairly broad range of organisms, while bacteriocins are defined largely by their
very narrow specificity, typically a single species. The rationale for the evolution of bacteriocins is that
they are produced by bits of selfish DNA that enrich for themselves in the environment by
eliminating hosts that are very similar to their own (and therefore competing for a similar niche) but lack
the selfish DNA. While this is the rationale for the species specificity, it is hardly the mechanism. If this
explanation is not persuasive, then you will also be bothered by the fact that we actually do not have a
much better idea of why antibiotics are produced.
(xii) All phage need to have some mechanism for packaging as little host nucleic acid as possible
into their capsids (ARM43:267[89]). After all, it is in the phage's interest to produce many viable progeny.
Double-stranded DNA phage with fixed head sizes accomplish this exclusion of host DNA in one of two
ways (with implications for their utility for gene transfer). Some use unit-length packaging, by which
identical DNA pieces are cut from the concatamer (a head-to-tail arrangement of multiple genome
copies) produced during rolling circle DNA synthesis, by a phage
enzyme that makes two nicks (separated by 10-20 bp) at specific sites in the phage genome (cos
sites)(JMB195:75[87]). These DNA fragments, with the short single-stranded regions at either end
resulting from the nicks are packaged into the virion. After injection of the DNA into another host, these
staggered ends anneal to each other to circularize the phage and are ligated together. Host DNA has few
if any sequences similar to the cos sites and these are never spaced one phage-length apart, so such
phage do not mis-package entire pieces of host DNA.
The alternate method is termed headfull packaging (JMB199:467[88]). In this case the phage
system starts packaging the DNA concatamer at a site (pac) and proceeds until the head is full. Since the
head can carry slightly more than one genome-length, the packaged DNA is 102-110% of a single
genome, which means that there is 2-10% redundancy at the ends. The concatameric DNA that didn't get
into the head is then cut by a non-specific nuclease and the newly formed concatamer end is fed into
another phage head. By this system, if host DNA is mistakenly cut at a pseudo-pac site, then it can be
added to multiple phage heads because that long DNA has entered this packaging pathway. As a
consequence, rather a lot of host DNA can be packaged in any cell once this mis-packaging starts. The
phage DNA (but not transduced host DNA) recircularizes in the next host by homologous recombination
between the redundant regions at either end, which generates a single circular copy of the genome. In
P1, pac cleavage involves the PacAB proteins, IHF and HU (JMB243:258,268[94]), and requires that the
site be methylated, apparently delaying cleavage until the proper time (PNAS87:8070[90]). P1 also does
not grow well on dam- cells and the phage produced from these cells do not have appropriate pac ends.
T4 does not seem to employ a pac system, perhaps relying on the degradation of host DNA to address
problems of DNA specificity. RNA phage also have mechanisms for selective packaging of their nucleic
acid (JMB204:939[88]). We'll return to both of these packaging processes in the sections on transduction.
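A sketch of the headful bookkeeping may help; the 40-kb genome and the 105% headful are illustrative values within the 102-110% range given above:

```python
def headful_packaging(pac_site, genome_len, headful_frac=1.05, n_heads=4):
    """Return (start, end) coordinates along a concatamer for successive headfuls.

    Packaging starts at the pac site; each head takes headful_frac * genome_len,
    and the next headful starts where the previous cut ended.
    """
    headful = int(headful_frac * genome_len)
    cuts = []
    start = pac_site
    for _ in range(n_heads):
        cuts.append((start, start + headful))
        start += headful          # next head is filled from the new concatamer end
    return cuts

for start, end in headful_packaging(pac_site=0, genome_len=40_000):
    # each fragment is 105% of a genome => 5% terminal redundancy;
    # note how the start point drifts, so only the first headful begins at pac
    print(start, end, (end - start) - 40_000)
```

The drifting start point is exactly why a single (pseudo-)pac cut can define many successive headfuls of "phage" (or, in a transducing event, host) DNA.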
The following packaging issues do not relate to transduction, but will be addressed because they
say something about phage biology. The Pseudomonas phage φ6 has a particular packaging problem,
since each virion contains one copy of each of three different double-stranded RNA molecules that
together make up the genome. The mechanism of correct packaging has apparently been elucidated and
it involves the selective packaging of one strand, which then provides a sequence that is necessary for
packaging the second strand and then this pattern is repeated yet again (MMolBR63:149[99]).
The discussion above addresses the organizational matters of packaging, but does not address
the actual mechanism by which a huge amount of negatively charged nucleic acid is jammed into a very
confined space. Not surprisingly, this involves energy and it has been thought for a long time that there
was a spooling mechanism by which ATP hydrolysis causes a rotational motor at the base of the phage
head to somehow translocate the DNA. The structure of such a motor, with a fairly plausible hypothesis
for function has been described. Essentially there is a rotating connector that pinches the DNA helix; ATP
hydrolysis causes that molecule to move 1/5 of a rotation. The DNA then has a choice - it too can rotate,
but that is energetically challenging. Instead, it translates 1/5 of its pitch (i.e. moves 2 bp into the phage
head), which allows it to return to its proper position with respect to the connector (Nat408:745[00]). Thus
ATP hydrolysis drives the DNA into the capsid where it spools into a tightly packaged structure.
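Given the 2 bp moved per ATP quoted above, the energetic cost of packaging is easy to estimate; the genome size below is just an example:

```python
def atp_to_package(genome_bp, bp_per_atp=2):
    """Rough count of ATP hydrolysis events needed to translocate a genome into the head."""
    return genome_bp / bp_per_atp

# e.g. a 48.5-kb genome (roughly lambda-sized) costs ~24,000 ATP just for translocation
print(atp_to_package(48_500))   # 24250.0
```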
(xiii) As with transformation, it is not obvious how the negatively charged nucleic acid gets through
the similarly charged membrane/LPS. Some indications are that ion channels in the membrane are
utilized. Also, phage need to get their nucleic acid back out of the capsid, and since they are outside of
the cell at that point, they cannot use further ATP hydrolysis. Phage seem to address this several ways,
not the least of which is chemically altering the environment of the DNA within the head after it has been
packaged to improve the thermodynamics of its leaving the capsid when it has the opportunity. In T7, at
least, several phage proteins disaggregate from the head and form a channel in both the outer and
cytoplasmic membranes. Two per cent of the genome is then injected and transcription begins. The act of
transcription is necessary for pulling the rest of the DNA in. One of these proteins has a translocase
activity. The notion of protein-mediated translocation is also consistent with the observation that the rate
of nucleic acid injection is constant and temperature-dependent, so it is not merely entropy-driven. The
sequence of the injected 2% is not important and the model is that the translocase uses the membrane
potential to get this far, whereupon the potential is exhausted and transcription is necessary. Other phage
simply seem to rely on the thermodynamics driving the nucleic acid from the capsid.
(xiv) Restriction and modification. In a related area, at least some phage like T4, T7 and λ encode
small acidic proteins that block the host type I restriction enzyme function, possibly by direct
protein-protein interaction (JBact174:5079[92]).
Problems and curiosities unique to phage. RNA phage have two inherent problems: First, there does not
seem to be an RNA recombination system, which limits their ability to evolve and hampers our genetic
analysis of them (TIG7:186[91]); second, RNA replication does not have the numerous levels of error
correction found with DNA synthesis, so these phage have a rather higher mutation rate
(Nat333:473[88])(remember that phage are much more able to tolerate a high frequency of errors within
their population, due to their ability to massively propagate and therefore sustain a population containing
non-growing variants). Finally, RNA phage mimic a common cell component, mRNA, that is generally not
stable, so these phage need to survive the cell machinery that degrades RNA.
The specific hurdle faced by DNA phage is the host restriction system, but since the phage
typically only propagate on a similar host, and will therefore already be modified, this should not be much of a
problem. In Bs, phage PBS-2 survives in part by using U instead of T in its DNA and specifically inhibiting
the U-glycosylase function in the cell. Under curiosities has to go the case of the introns in T4; where did
the phage find the idea (ARG24:363[90])?
All phage have an extremely short life cycle in the lytic phase, and the timing, order and degree of
gene expression are critical if the maximum number of progeny is to be produced. This gives rise to
several phenomena. First, the short time frame for proper regulatory decisions means that transcriptional
regulation is not sufficiently rapid for fine-tuning and might explain the numerous levels of posttranscriptional regulation seen in phage (see ARG33:193[99]). Second, phage typically have nucleic acid
size constraints, so that any additional complexity will need to be developed within the existing genetic
information - this gives rise to overlapping genes, close-packing of encoded functions, and multiple
functions for certain gene products. The cheapness of DNA in living organisms allows, but does not
demand, such adaptation.
Uses of phage.
Generalized transduction. (ASM p1154ff & ASM2,2421[96]) The general phenomenon is that one
occasionally has a phage particle in a lysate that is filled with host rather than phage DNA. This particle is
able to inject this DNA into other hosts, where it has a reasonable possibility of recombining with a
replicon in the host. As mentioned above, phage try to package their own DNA based on a sequence
recognition system, but packaging of host DNA by wild-type phage occurs at about a 2% frequency for
P22 and 0.3% for P1. For both P1 and P22, so-called HT variants (for high frequency transduction) have
been isolated that provide more transductants per pfu, because of their reduced ability to discriminate
between host and phage DNA. Generalized transduction is used for genetic mapping (to be considered
under that topic); localized mutagenesis (covered under mutagenesis, Hong & Ames PNAS68:3158[71]);
and strain construction (since you are only altering a small portion of the genome of the recipient).
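These numbers suggest a back-of-the-envelope estimate of transduction frequency. The headful and genome sizes below are illustrative assumptions, the 10% incorporation figure comes from the "Fate of the transduced DNA" section below, and the calculation deliberately ignores the pseudo-pac bias discussed next:

```python
def transductant_freq(frac_host_particles, headful_bp, genome_bp, p_recombine=0.1):
    """Expected transductants per pfu for a given marker.

    frac_host_particles: fraction of virions packaging host DNA (e.g. 0.02 for wt P22)
    headful_bp / genome_bp: chance a random host fragment covers the marker
    p_recombine: chance an injected fragment is incorporated into the recipient genome
    """
    p_marker_in_fragment = headful_bp / genome_bp
    return frac_host_particles * p_marker_in_fragment * p_recombine

# Hypothetical: 44-kb headful, 4.8-Mb genome, wild-type-P22-like 2% host packaging
print(transductant_freq(0.02, 44_000, 4_800_000))  # ~1.8e-5 transductants per pfu
```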
So how does host DNA ever get into the phage capsids to begin with? Remember that these
phage start packaging phage DNA at pac sites and then continue to package the same strand into
successive heads. Subsequent headfulls of phage DNA obviously cannot start from a pac site because
each headfull is slightly more than one phage genome length, so the "second pac site" probably ends up
in the first phage head. But the problem is solved as long as the same DNA concatamer continues to be
packaged in subsequent heads. This means, however, that essentially one pac site defines multiple
headfulls of DNA as being "phage." Thus, if the phage enzyme cuts at a chromosomal sequence that is
similar to the phage pac site, then a number of progeny in that cell will package host DNA. Since
packaging of host DNA will always begin at one of these pseudo pac sites, then some regions of the
chromosome will be transduced better than others because they are near one of these sites. In fact,
because packaging proceeds by headfulls, chromosomal markers on either side of that site will
almost never be co-transduced, while markers on one side of a pseudo pac site that are half a phage
length apart will still be frequently packaged into the same phage. As a consequence, chromosomal
regions near pac sites have a somewhat poor correlation between genetic linkage and physical distance.
The HT variants of P22, and presumably P1, have lost the ability to discriminate between any
different regions of DNA, including their own, so they appear to package DNA rather randomly. These
phage mutants therefore give many more transductants because a much higher fraction of the virions in
the lysate contain host DNA. These phage also eliminate the odd linkage phenomena referred to in the
previous paragraph.
Finally, in performing generalized transduction, remember that there are also viable phage
around, so you need to protect your transductants (cells that have the selected phenotype by virtue of
receiving DNA via transduction) by some mechanism. This might be either by using a low moi and limiting
reinfection, or by use of a temperature-sensitive transducing phage.
Fate of the transduced DNA in the recipient. As a rough estimate, an incoming piece of transduced host
DNA has about a 10% chance of being incorporated into the recipient's genome. This does not mean that
only 10% of the incoming fragments recombine, since there are end effects and multiple crossover events
to consider. The structure of the incoming fragments is not clear. They are typically drawn as being linear,
but this may not be the case (see below). It is even unclear if integration occurs by double-stranded DNA
exchange or by single-stranded DNA displacement.
What happens in the other 90% of the cases? In a dissecting microscope, one can very
occasionally see micro-colonies following a transduction, whose frequency is roughly ten times that of the
large colonies resulting from complete transduction. These seem to represent non-replicating versions of
the incoming DNA that are nevertheless capable of gene expression. They are also quite long-lived (at
least 5 hours) and they thus allow one daughter cell to express some gene products while the other gets
these in addition to the single copy of the selected region. The mechanism of survival of these fragments
is unclear, but they seem to exist in both relaxed and supercoiled forms held together by protein (ASM
p1161). They only rarely give rise to subsequent complete transductants, suggesting that their structure is
not a normal intermediate in the formation of complete transductants.
The general issue of a piece of DNA existing in a cell for some time and expressing products, but
neither being replicated nor degraded is not unique to generalized transduction. The same phenomenon
comes up when DNA is introduced into cells through transformation or conjugation. We simply fail to see
it in most cases because the amount of growth that results is too little to be detected.
Specialized transduction (ASM p1169ff, & ASM2:2442[96]) In specialized transducing phage every virion
carries the same portion of the host genome fused to part of the phage genome. Typically such phage are
temperate and defective, since at least some of their normal genome has been replaced by the host DNA.
The most studied specialized phage is λ, but P22 specialized phage are easily produced and virtually any
phage can have other sequences introduced into it by cloning.
Phage capable of specialized transduction are those that enter the temperate state by integrating
into the host genome at one or a small number of sites. By integrating into the host genome, they are
physically associated with DNA other than their own, so that inappropriate excision of the phage DNA
leads to a specialized phage. This excision involves recircularization of the phage and if the wrong sites
are used then a circle recombines out that is different from the phage genome that started. Now obviously
all sorts of recombination events could be imagined, but remember that the phage DNA then needs to be
recognized by its packaging system (so that site must be there) and the total size of the DNA must be
about right: if it is too large, it cannot get in and if it is too small, it turns out that the capsid doesn't function
properly. So if the DNA must contain some phage DNA (to be packaged) and yet it is not precisely
correct, then it must lack some phage DNA and it must also have some host DNA to compensate for the
missing phage genome.
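The size constraint can be written as a simple check. The 78-105% window below is the range often quoted for lambda and should be treated as an assumption, not a universal value:

```python
def is_packageable(dna_len, wt_genome_len, low=0.78, high=1.05):
    """Can this excised circle be packaged? It must carry the packaging site
    (checked elsewhere) and fall inside the capsid's tolerated size window."""
    return low * wt_genome_len <= dna_len <= high * wt_genome_len

wt = 48_500
print(is_packageable(0.9 * wt, wt))   # True: lost phage DNA replaced by host DNA
print(is_packageable(0.6 * wt, wt))   # False: too little DNA, the capsid malfunctions
```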
In order to produce specialized phage for a given region in vivo, the phage genome had to be
integrated into that of the host near the region of interest. This was accomplished several ways: deleting
the normal bacterial att site and looking for secondary sites near the desired region; seeking
rearrangements of the chromosome that move the att site appropriately; having similar ISs in the phage
and near the region, to provide homology for integration; or putting both the integrated phage and the
desired regions on incompatible plasmids and demanding that they exist in the same cell, leading to
plasmid fusions. By all of these schemes, the desired phage line is found by genetic selection: Inducing
the phage from the above construction and selecting for transductional transfer of the desired marker into
an appropriate recipient that has a normal att site. Resulting colonies were tested by induction of those
cells for another round of appropriate transfer, typically with a wild-type helper phage present to supply
the phage functions necessary for the generation of new virions.
A specialized phage can also be generated by cloning the desired region into an appropriate
region of the phage. If a plaque-forming derivative is desired, it is possible to start with a phage where
non-critical regions have been spontaneously deleted, typically by a selection for phage variants that
survive heat/chelator treatment. The phage head can now contain at least as much new DNA as was
deleted in the selection.
Specialized phage have a variety of uses, chiefly as a source of large amounts of DNA of the
desired region and as a means for generating merodiploids with only two copies of the target region in a
stable form (in distinction to plasmids). A major use of such phage is in in vitro schemes involving the
generation of libraries. The advantages of such vectors are two-fold: they can carry large amounts of DNA
(relative to multi-copy plasmids) and phage virions can be assembled in vitro. This last point means that
you don't need to transform your constructs into recipient cells, but rather transduce them in, which is
typically a much more efficient method, particularly with large fragments of DNA.
Industrial concerns with phage. A major concern in any industrial process using bacteria is the possibility
that phage contamination will ruin the process. There are two major ways to address this problem: keep
the phage out of the fermentation or utilize bacterial strains resistant to the known phage. The former
requires good housekeeping measures, monitoring of the environment, and use of sterile media and sterile air
(the last is non-trivial at perhaps 25,000 ft^3/min). The phage-resistant approach is possible, but fraught
with problems. Developing strains resistant to the next phage is impossible until you have that phage in
hand, so you cannot anticipate the nature of the next phage problem. Also, phage-resistance is often
accompanied by untoward growth defects, because phage resistance involves alteration of physiologically
important parts of the cell. The problem is particularly acute in the dairy industry because they rarely
sterilize the milk and pasteurization often fails to inactivate phage. A description of the problems (and
solutions) of phage in lactic acid bacteria fermentations in the dairy industry is in ASMNews68:388[02] &
ARM55:283[01]).
Phage and human health. There are some interesting cases where lysogenic phage carry genes that
affect the behavior of the host in surprising ways. One of the most striking is the case of Vibrio cholerae,
the causative agent of cholera, which has been known for some time to require a lysogenic phage (termed
CTXφ) carrying the cholera toxin itself. This is surprising enough, but it more recently became clear that
the receptor for CTXφ is encoded by another phage, termed VPIφ. A virulent strain therefore must be a
double lysogen, with the presence of VPIφ being necessary for CTXφ infection (Nat399:375[99]).
Phage have also been used in medicine as a therapeutic agent to kill pathogens since 1917, but
the general ignorance of the nature of phage made it impossible to understand problems with the
approach. The approach then became somewhat popular in the Soviet Union, which did nothing to
enhance its appeal in the West, but there is increasing sympathy for its utility in some situations
(ARM55:436[01]).
RNAi and CRISPR. These two processes are not the same or even homologous apparently and they are
also not phage (or virus) specific. However, they are mechanistically similar responses of eukaryotes and
prokaryotes respectively to deal with foreign DNA, which will often be viral, so I have lumped them
together at the end of this LT. Recognize however that each system might be employed against
viruses, MGEs or, in the case of the prokaryotes at least, plasmids.
RNAi refers to a system in many eukaryotes that recognizes ds RNAs and cuts these into short
(~20 bp) fragments with an enzyme termed dicer. These are then recognized by the RNA-induced
silencing complex (RISC) and processed to single-strands (termed guide RNAs). These in turn form
double-stranded complexes with a target RNA and a protein termed Argonaute cuts that target RNA at a
fixed distance from the end of the guide-RNA-target hybrid. This is a major process in almost all higher
eukaryotes. Though it is absent in many protozoa (trypanosomes) and Saccharomyces cerevisiae (so
how important can it be??), it is present in some budding yeast such as S. castellii and Candida albicans.
A recent review is in Nat457:405[09].
CRISPR (clustered regularly interspaced short palindromic repeats; folks in both fields are
pathologically inclined towards cutesy abbreviations!) refers to a superficially very similar system found in
many prokaryotes that, however, displays no apparent homology to the RNAi system. (See a recent
review in Cell134:401[09]). These systems were first recognized as genomic regions with multiple copies
of a 21-47-bp sequence, separated by 20-72-bp unique sequences, but their function was completely
unclear. These spacer-sequence repeats were always found 3' of a cluster of genes predicted to encode
RNA-binding proteins, helicases, nucleases, and polymerases, termed cas genes (for CRISPR-associated).
In 2007, it was shown that the presence of a viral-derived sequence in the CRISPR region
allowed the organism to be resistant to that virus (Sci315:1709[07]) and this has since been extended to
plasmid resistance. Reminiscent of the RNAi system, the CRISPR genes are transcribed and processed
by CAS products down to small RNAs. These bind to complementary ds DNA or ss RNA, depending on
the specific CRISPR, and target them for nucleolytic attack, though there might be other interfering
mechanisms as well (for RNA targeting, see Cell139:945 & 863[10]).
An obvious question is how this immune system recognizes self from non-self. A hint about this
might come from the following. Each CRISPR system chooses unique sequences that lie immediately 5'
of a short conserved sequence, termed a "proto-spacer adjacent motif" or PAM, and AGAA is one
example. This PAM ends up in the CRISPR repeats and therefore in the expressed RNA that is used to
target sequences AND its presence in a sequence to be degraded is essential. Thus (and this is only my
hypothesis) one can imagine the following: In the host, the PAM sequence is modified in some way
(methylation on each strand?), so host PAM sequences are "never" chosen for inclusion in a CRISPR, but
unmodified PAMs (from a virus or plasmid) can be. Thus, only non-self DNA is chosen for inclusion and
only non-host targets would be subsequently targeted by the system.
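As a caricature of the acquisition step under my hypothesis above, this sketch scans an invading sequence for PAMs and pulls out the sequence immediately 5' of each; the 32-bp spacer length is an arbitrary illustrative choice (real spacers vary), and AGAA is the example PAM from the text:

```python
def candidate_protospacers(invader_dna, pam="AGAA", spacer_len=32):
    """Return sequences lying immediately 5' of each PAM occurrence,
    i.e. candidates for acquisition into the CRISPR array."""
    spacers = []
    pos = invader_dna.find(pam)
    while pos != -1:
        if pos >= spacer_len:                      # enough sequence 5' of the PAM
            spacers.append(invader_dna[pos - spacer_len:pos])
        pos = invader_dna.find(pam, pos + 1)
    return spacers

# A host that marks (e.g. methylates) its own PAMs would skip them at this step,
# so only unmodified, invader-derived PAMs would feed the array (the hypothesis above).
phage_seq = "ATGC" * 20 + "AGAA" + "TTGC" * 5
print(candidate_protospacers(phage_seq))
```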
Another question concerns the roles of Cas proteins, though it seems we have plenty of functions
that need to be encoded for the system to function. The slight curiosity here is that every cas system
seems to have homologs for Cas1 and Cas2, but the other players vary. But again, there are easy
explanations for this.
A final question is why this is not found in every organism, if it is such a good idea (it has been
recognized in about 85% of archaea and 50% of bacteria)? Relevant to this, the distribution of systems
does not follow other phylogenies of the organisms, so it must be horizontally transferred. There are also
fairly clear cases of cas/CRISPR systems that have been mutated into non-function, and some organisms
have multiple non-identical systems (up to ~18). The possible advantage of such a system is fairly clear, but
what are the possible costs? There is a cost to the DNA sequence itself, but this seems minimal to me.
There is perhaps a greater cost to the expression of the CRISPR RNA, but I would guess that the biggest
cost is the possibility of targeting oneself. That is, even with the host-methylation system, mistakes
happen and either host DNA will end up in the CRISPR or, more likely, a viral sequence with high identity
to the host will end up there, which will kill off the host with some frequency.

607 Lecture Topic 12...................COMPLEMENTATION


In this LT, complementation in prokaryotes will be the focus, though most of the arguments are
also applicable to yeast. Some yeast issues will be covered in LT14. The traditional goal of
complementation analysis was to define complementation groups, which provided an insight into the
number of genes. This is, of course, largely irrelevant now, but we will go through it briefly for
historical reasons and because it says something about protein function.
In this analysis, one asked if two different mutations, each causing a similar phenotype, were in
the same complementation group. This was done by asking if a strain with one gene copy of each of
these mutations could supply all functions necessary for a wild-type phenotype. Complementation was
therefore a test of function.
Only functions necessary for the desired phenotype, under the conditions used, were demanded
in a complementation test. Mutations affecting genes whose products were not important for the examined
phenotype were ignored. Fig. 12-1 gives an idea of the results one could expect from straightforward
complementation tests. In these examples, when the two mutations in the separate mutant alleles affect the
same gene, then neither is capable of generating a wild-type product of that gene and the resultant
merodiploid strain is mutant in phenotype. On the other hand, if the two mutations affect different genes, so
that each copy of the region is able to generate some of the gene products required (and between them all
necessary gene products are synthesized), then the resulting strain is phenotypically wild-type.
Figure 12-1. A simple diagram of the possible interpretations of a complementation analysis.
One problem with this set of examples is that no one (in doing microbial genetics) routinely put the two
alleles in the "cis" configuration as a control for complementation (you do build such strains for other
purposes, however). It is too hard (for reasons we will cover when we get to "mapping") and it provides
very little information, since the presence of the wild-type allele on the other copy will nearly always be
dominant. It is, however, often appropriate to consider effects of a mutation on genes in cis, but this is not
the same as generating "double mutants" affected in the same small region.
There are three sorts of controls useful in analyzing the results of traditional complementation
experiments: (i) If either copy of the merodiploid contains a wild-type region, the phenotype of the
resulting strain should be wild type, and the wild type is said to be dominant to the mutant. If it is not, the
mutant allele is said to be trans-dominant to the wild type. In either case the merodiploid has the
phenotype of whichever allele is dominant. We see this as the failure of allele 1 to even complement the
wild type allele (and we know that the wild-type allele by itself confers growth, as shown by the bottom line
of the table which is the vector-only control.) (ii) A merodiploid strain constructed with the same mutant
allele in each copy should display the mutant phenotype. If it does not, it suggests that mere diploidy for
the region of interest can confer a wild-type phenotype. One way of this occurring would be if the mutation
conferred a leaky phenotype so that a double dose might yield a pseudo wild-type response. (iii) The
result should not depend on the location of the alleles; i.e. the same result should obtain no matter which
allele is on the chromosome. If this is not true, it indicates that the two locations are not equivalent and
therefore the test has marginal validity. This is a variation on the concerns noted for multi-copy plasmids
above.
Since complementation analysis treats only those functions necessary to generate the required
phenotype, it does not allow the detection of complementation groups unless their products are required
for the phenotype in question. If, for example, a region encoding such an unimportant product (at least for
the conditions of the selection) is transcriptionally polar onto an important function, then mutations in both
genes behave as a single complementation group. This reflects the fact that the only mutations detected
in the transcriptionally upstream gene would be ones polar onto the functionally important gene
downstream.
Complications in complementation analysis. The above examples imply that two mutations that
complement each other must affect different genes and gene products. This would suggest that the
results of complementation analysis would be to define the number of genes in the region. In fact, what
complementation analysis does is to define the number of cistrons or complementation groups. More
often than not, the number of complementation groups is coincident with the number of genes, but there
are a number of special cases where this correlation will not hold. The complications that give rise to
these special cases are discussed below and they fall into two general classes: when complementing
mutations actually map to the same complementation group and when non-complementing mutations
actually map to separate genes.
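Mechanically, assigning alleles to complementation groups from pairwise data is just grouping together alleles that fail to complement; the sketch below assumes an idealized data set, free of the complications discussed next:

```python
def complementation_groups(alleles, noncomplementing_pairs):
    """Group alleles into cistrons: two alleles that fail to complement are
    placed in the same group (naive transitive closure over the data)."""
    groups = {a: {a} for a in alleles}
    for a, b in noncomplementing_pairs:
        merged = groups[a] | groups[b]
        for allele in merged:
            groups[allele] = merged
    return {frozenset(g) for g in groups.values()}

# Hypothetical data: m1/m2 fail to complement each other; m3/m4 likewise
print(complementation_groups(
    ["m1", "m2", "m3", "m4"],
    [("m1", "m2"), ("m3", "m4")]))   # two groups -> at least two genes
```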
Cis-dominant mutations are a reasonably common type of complication in complementation analyses.
Cis-dominant mutations are those that affect the expression of genes encoded on the same piece of DNA
(as the mutation itself), typically transcriptionally downstream, regardless of the nature of the trans copy of
the merodiploid. Such mutations exert their effect through effects on transcription itself (its premature
termination or its failure to initiate), and not because of altered products they encode. There are two
dissimilar examples of these sorts of mutations:
(i) If a mutation in a transcriptionally upstream gene exhibits strong polarity onto downstream genes, then
that mutation will fail to complement mutations in downstream genes. (ii) Similarly, a mutation in the
promoter or in other regulatory regions outside the translated area can eliminate transcription of the entire
operon and thus be negative in complementation for all gene functions encoded by that operon. In each
of these cases, the mutation is eliminating the function of genes that are themselves genotypically wild
type. There are also rare reports of cis-acting proteins, which typically involve DNA-binding proteins with
high affinity for DNA, such that they bind near their own genes and fail to diffuse through the cytoplasm.
Negative complementation. Another complication involves the rare phenomenon known as negative
complementation or trans-dominant mutations with mutant phenotypes. Mutations of this type cause the
resultant merodiploid strain to have a mutant phenotype even when the other copy of the region is
genotypically wild type. The phenotype of the mutant allele is thus trans-dominant to the wild type
(obviously the reason that wild type is dominant to most mutants is because it supplies the function that
they have typically lost by mutation). There are three general schemes that can be envisaged for
mutations that display negative complementation. In each of them, it is necessary to propose that the
mutant allele generates a product that, while not wild type, nevertheless possesses some activity that
leads to the mutant phenotype. Possibilities include (i) multimeric enzymes where the merodiploid strain
would generate multimers whose subunits come from both the mutant and wild-type genes in a random
assortment. As shown in Fig. 12-2, if the protein is a tetramer and if any multimer containing one or
more mutant subunits is completely inactive, then the presence of the mutant allele decreases the amount
of functional wild-type gene product by approximately 8-fold (this number ignores regulation and assumes
a two-fold dosage of the product due to a two-fold dosage of the gene).
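The arithmetic behind the 8-fold figure and the 6% in Fig. 12-2 is simple binomial math, sketched here:

```python
def active_fraction(n_subunits=4, wt_monomer_fraction=0.5):
    """Probability that a randomly assembled multimer contains only WT subunits."""
    return wt_monomer_fraction ** n_subunits

frac = active_fraction()                # 0.0625: ~6% of tetramers are all-WT (Fig. 12-2)
relative_to_haploid = 2 * frac          # the diploid makes 2x total tetramers
print(frac, 1 / relative_to_haploid)    # 8.0-fold less active enzyme than a haploid WT
```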
(ii) The mutant gene might cause the generation of an altered protein that interferes in some reaction with
the cell and thus causes a deleterious phenotype. In this case, the presence of a wild-type allele would
restore the function missing in the mutant, but would not eliminate the deleterious phenotype caused by
the mutant protein. Thus, the mutant phenotype would be seen in the merodiploid. (iii) It is also
conceivable that the mutant allele generates an altered protein that, while it does not carry out the
wild-type function, might actually be competitive with the wild-type gene product. In each case, an altered
product is responsible for the trans dominance. Remember, these are rare, special cases: in general, the
wild-type allele is dominant to the mutant since the latter typically lacks a function that is performed by the
product of the wild-type gene. Such trans-dominant mutants are often very interesting for further
biochemical analysis because the protein product has altered function, rather than merely a lack of
function.
Figure 12-2. Possible explanation for negative complementation: a multimeric protein. The right panel
shows the situation in haploids, where either "all WT" or "all mutant" tetramers accumulate. In the diploid
strain, however, random mixing of monomers yields the percents of each tetramer type shown. If only a
fully WT tetramer is active, then only 6% of total tetramers will have this property.
Figure 12-3. The case of two completely separate protein functions within a single peptide. The cartoon
at the top of each panel depicts the folded protein. In a merodiploid strain containing the proteins depicted
in both the center and right panels, there would be at least one functional copy of each protein domain.
Intragenic complementation is yet another possible complication in complementation analysis. This term
refers to cases where two mutations that do affect the same gene, and therefore the same gene product,
are able nonetheless to give a wild-type phenotype in a complementation analysis. There are two general
cases of such a phenomenon: (i) If the product of the gene in question is a bi-functional protein, especially
when those functions are independent of one another, then mutations in the gene often show intragenic
complementation (Fig. 12-3). Such an example is easiest to understand if the product is pictured as two
beads on a string. If each bead has an independent enzymatic function, one could imagine that a mutation
affecting either (but not both) of the two functions might well leave the other function intact. If two such
mutations were put in a merodiploid situation, each would be able to produce one of the two required
enzymatic functions, giving rise to a wild-type phenotype. In the case of such a gene, intragenic
complementation would be fairly common because many missense mutations would affect only one of the
two functional regions. This model also predicts that mutations affecting each of the two functions would
cluster in either half of the gene. (ii) It is also possible, though less likely, for pairs of complementing
mutants to occur in cases where the gene product is a multimeric protein with a single function (Fig. 12-4).
Figure 12-4. Cartoon of two protein monomers that form a homodimer. In this simple example matching
charges are necessary for function and this match is missing in the two mutant haploids, but possible in
the merodiploid. Note that the merodiploid would presumably also contain some inactive homodimers.
In such cases, a particular mutant allele might encode a protein that only functions when allowed to
aggregate with another particular mutant allele. In this case, unlike the case of bi-functional protein above,
instances of intragenic complementation will be limited to specific pairs of mutants. Further, there is no a
priori reason to predict any clustering of complementing or non-complementing mutations to distinct
portions of the gene. Would such a case, where two mutations out of 100 in a given gene are capable of
complementing each other, be sufficient to say the gene had two complementation groups? This question
is largely a semantic one, but in general, unless intragenic complementation is fairly common, the few
exceptional complementing pairs would not be said to define separate complementation groups.

Diploidy itself can occasionally be a problem by affecting regulation or metabolism in such a way as to
perturb the results of a complementation assay. Two copies of the region might titrate out regulatory
factors or cause an imbalance between gene products encoded by the diploid region and those encoded
elsewhere. This is obviously much more likely to occur if one allele is on a multi-copy plasmid. A slightly
different problem can occur when the second copy is either on a different replicon (e.g. F' or multi-copy
plasmid) or in a different place in the chromosome (e.g. at the phage attachment site) and this altered
location affects gene expression, perhaps due to different local superhelicity.
Some curiosities in complementation analyses. For those interested, here is a laundry list of some
curiosities in complementation analyses: (i) Complementation of an N-terminal deletion by a C-terminal
deletion in mtlA (JBact172:1509[90]), as well as complementation between two non-overlapping deletions
in lacY, the lactose carrier protein (JBact172:5374[90]). A similar result is used in the lacZ
complementation story and in tetB. In the last case, there are two complementation groups within the
gene, although there is only a single known function, that of tetracycline efflux (JBact173:4503[91]). (ii)
The observation that only a fraction of Rif^r alleles of rpoB are dominant to wild type (JBact171:5229[89]).
This is the result of a set of issues: each allele has different levels of activity in the presence of rifampicin,
though they all have enough when in haploid. In complementation, however, the total amount of RNA
polymerase is fixed by the level of the other RNAP subunits, so an allele with barely sufficient activity in
haploid has insufficient activity in diploid, since it is effectively diluted two-fold by the wild-type. (iii) There
are examples of cis-acting proteins, which are therefore not trans-dominant (e.g.JMB202:495[88]). (iv)
Mutations can be dominant for one phenotype, but recessive for another: a mutation that turns the cell's
only tryptophanyl tRNA into a nonsense suppressor is dominant to the wild type in terms of suppression,
but recessive in terms of lethality.
Interspecific complementation. The goal of this approach is to clone a gene from a poorly understood
organism based on its ability to correct the phenotype caused by a known mutation in a well-characterized
recipient. The general notion is that you are likely to clone the gene homologous to the mutated one,
since that is the one most likely to provide the missing activity. The approach involves either in vivo or in
vitro cloning of DNA on high or low copy number plasmids (or specialized phage) with subsequent
introduction into the mutant recipient. A Rec- recipient is typically not necessary, since
microheterogeneity in DNA sequence prevents most recombination between the heterologous DNA and
the chromosome. On the other hand, it may be wise to use Rec- hosts, since most cloned regions will be
more stable to intramolecular recombination and deletion formation in a Rec- background.
However, because the foreign gene product finds itself in a different cytoplasm, with different
proteins to interact with, one cannot assume that apparent complementation really means that the foreign
protein has exactly the same role (in its normal genetic background) as the mutated product. Rather a
positive result means that at the level of protein accumulation achieved, there is a sufficient level of the
demanded activity for growth. Nor does the lack of complementation mean that the two proteins must
necessarily be functionally different. The proper interpretation of inter-species complementation
experiments depends on the nature of the question being asked in the experiment.
There are obviously a number of reasons for failure to detect interspecific complementation.
Quite possibly the native promoter will not be expressed in the heterologous host. This can typically be
overcome by cloning into an expression vector system. Alternatively, if the introduced gene is from an
organism with a radically different codon usage, then translation in the new host might be quite poor and
protein accumulation might not reach the threshold necessary for the demanded phenotype. The
synthesized product may also fail to function in a heterologous host for a variety of reasons including
failure to be properly processed or transported, to be stable, or to interact with appropriate
macromolecules.
Complementation in eukaryotes can be different. Complementation is performed in yeast by mating
two haploid parents and then analyzing the diploid product. By this approach, gene dosage will always be
normal and the results will generally be somewhat similar to those seen with prokaryotes. There are,
however, fundamental differences in the outcomes of complementation analyses in prokaryotes, yeast
and higher eukaryotes. In general, in higher eukaryotes, mutations are more likely to have discernibly
mutant phenotypes in complementation than do comparable mutations in microbes. The reason for this is
not that the gene products are fundamentally different (though sometimes they are), but rather because of
the difference in detectability of mutant phenotypes in the two classes of organisms. In microbes,
extremely deleterious mutations are detectable, largely through nutritional supplementation, and haploidy
also allows the easy identification of all sorts of recessive mutations. However, subtle alterations in
phenotype are often not detectable because mutant microbes do not necessarily appear to be very
different to our eyes. Higher eukaryotes differ in all three of these considerations. Mutations causing
severe phenotypic changes will almost certainly not survive development and any detectable mutation will
need to have some degree of dominance to be detected. But the critical issue is the detection of mutant
phenotypes: subtle changes in metabolism are much more detectable in these organisms because they
often cause macroscopically observable defects. The result is that in the latter organisms,
you are often detecting and analyzing mutations with rather less severe effects on the function of the gene
product than in the case of prokaryotes. As a consequence, complicated complementation and negative
complementation due to protein-protein interactions are probably more common.
As a separate matter, it so happens that aberrant proteins are typically degraded more rapidly in
bacteria than in eukaryotes, for reasons I do not understand, and this can affect complementation as well.
For example, the portions of a protein that are synthesized from genes with nonsense, frameshift or
insertion mutations are typically very rapidly degraded in bacteria. As a consequence, even if the protein
fragment was deleterious, it rarely can accumulate to a level where it has an effect. In eukaryotes, these
protein fragments tend to be much more stable and, because a diploid eukaryote will necessarily have the
protein product from the other allele, these fragments can often have negative complementation defects.
Again, such trans-dominant protein fragments are rare in bacteria.

607 Lecture Topic 13................Genetic Mapping in Prokaryotes (for a short summary of genetic
mapping, see ASM2:2511[96] and ASM2:2518 for physical mapping). I am aware that this chapter is
becoming obsolete with the increasing speed of sequencing. Nevertheless I am retaining some text
because linkage remains a useful concept in all sorts of strain constructions, and because there is still a
lot of useful old literature involving mapping that you might have to interpret. The topic of genetic mapping
in yeast is addressed in LT14 and, while it also involves the concept of linkage, it uses fundamentally
different methods and provides a rather different type of data than does any style of mapping in
prokaryotes.
The objective of mapping is to establish the position or relative order of mutations and, by
implication, the order and position of genes. Genetic mapping does not establish physical distance, but it
is possible to tentatively infer physical distance based on empirical observations. We will treat the
rationale behind mapping in two ways, one for gross mapping and one for fine structure mapping. One of
the virtues of genetic, as opposed to physical, mapping is that the former only sees those mutations that
cause detectable phenotypes. This is because you are actually ordering the genetic regions that cause
the different phenotypes. In sequencing, you detect all mutations whether or not they actually affect the
phenotype.
In prokaryotic mapping, but NOT in yeast, the experimenter takes a recipient cell that cannot grow
under the conditions used and supplies DNA from a strain that has the appropriate genetic loci to allow
the recipient to grow. However, mapping is always done in such a way that the incoming DNA (whether it
be by transformation, transduction or conjugation) cannot replicate on its own. There is therefore a
selection for recombination, if the recipient cell is to acquire the critical genetic material. We then monitor
the frequency with which that happens, typically normalized to the frequency with which some other
recombination event occurs as well. As you will read in the next LT, this is fundamentally different than the
case in eukaryotes, where both donor and recipient (or male and female) DNA are present in the cell
whether or not recombination (or crossing over, as they say with these organisms) occurs. A different
metric is therefore required for determining recombination frequency in these organisms.
One is performing gross mapping if one's intention is to place the marker of interest
somewhere on a chromosomal map. This usually means establishing the position of a mutation
relative to those of mutations in other genes. This sort of mapping was necessary to produce a genetic
map as a basis for further work, but it did not really tell you very much. It just set up the system for future
strain constructions, allowed preliminary genetic analysis of other mutations, helped in the construction of
plasmids for complementation analysis, and allowed some sort of comparison to genetically similar
systems. For example, if you knew you had three mutations with the same phenotype and showed that
they were each linked to different markers and were unlinked to each other, then you had established that
each mutation was in a separate gene and that at least three gene products were necessary for the wild-type phenotype. However, that was about all the interesting questions that could be answered with this
sort of genetic analysis.
One can also do gross mapping by physical analysis. This has required the identification of
restriction enzymes that cut very rarely (<20 times per genome) and the development of an
electrophoresis system, orthogonal field electrophoresis, capable of resolving very large DNA fragments.
The localization of a gene to a given fragment, using physical or genetic methods, provides gross,
physical mapping information. Increasingly this is done by simply sequencing the entire genome (~$8K as
of late 2007). There still does seem to be a utility to some sort of mapping in order to connect the last few
fragments of sequence (termed contigs) together. David Schwartz in Genetics has established a clever
way of doing this under a microscope, by a method termed optical mapping.
The goal of fine structure mapping was to order mutations, which were known to map to one or a
few contiguous genes, into a one-dimensional array. Properly, this array was ordered with respect to other
external markers. This ordering allowed you to make sense of your complementation data (you could then
tell polarity from allelism). It also allowed the clustering of mutations of similar phenotype that, in
conjunction with complementation, helped define genes and gene functions. Certainly, fine structure
mapping has now been completely replaced by sequencing.
A term that is used with great frequency in discussions of mapping is linkage. Linkage is defined
as the frequency with which two sites (a site can either be the site of a mutation or the site of the wild-type
version of the mutation) on a piece of DNA are co-inherited using a particular gene transfer system. As
such, it is a function of two variables: (i) The frequency with which the two sites are brought into the same
cell by that particular gene transfer system (termed end effects in some of the following sections, with
reference to the ends of the transferred DNA), and (ii) the frequency with which both sites from the donor
recombine into the recipient's chromosome. Another statement of the latter point is that, for linkage to be
observed, the recombination events must occur outside each site and not between them. Ignoring end effects,
linkage is inversely proportional to the likelihood of a recombinational event occurring between two sites
and therefore to the distance between the sites. This assumes that recombination events are random and
that their likelihood increases with the increasing size of homologous regions available for recombination.
The modest non-randomness of recombination sites is probably one of the reasons why genetic maps are
not coincident with physical ones. (As will be explained again in LT15, the term recombination is used
somewhat differently in eukaryotic genetics. It refers to reassortment of genes, which might well be on
different chromosomes, from the organization that existed in either parent. Importantly, this does not
necessarily have anything to do with breakage and rejoining of DNA as is the sense of the term in
prokaryotes. Instead the term crossing over is often used with eukaryotes to refer to the action of
breakage and rejoining of DNA.)
linkage (of two loci) ∝ 1 / (recombination frequency between the two loci) ∝ 1 / (distance between the two loci)

The product strain (the genotypically altered recipient) of a recombinational event is often referred to as a
recombinant.
Genetic mapping also makes the assumption that if two markers are inherited by a recipient cell in
a single cross, they must have entered on the same piece of DNA and they therefore must be linked by
that gene transfer system. If one utilizes a gene transfer system in which more than one distinct piece of
DNA can enter the same recipient cell, one of the assumptions used in mapping is violated and the
apparent linkage would reflect the frequency of the two markers entering the same cell separately and not
the genetic distance between them. This latter case can occur in either transformation or in generalized
transduction with the highly efficient transducing phage P22HT, since these two systems can be so
efficient at moving DNA into a recipient that more than one piece of DNA can enter a given recipient.
Such a phenomenon is known as congression.
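A quick simulation makes the danger concrete. The Python sketch below is mine, not from the notes: the 5% uptake frequencies are invented, and the point is only that when the two markers enter on separate DNA molecules, the apparent "linkage" simply equals the independent transfer frequency of the unselected marker and carries no information about distance.

    import random

    def apparent_linkage(n_recipients, f_selected, f_unselected):
        """Independent uptake of two separate DNA fragments: return the
        fraction of selected recombinants that also inherit the unselected
        marker (this is what would be scored as linkage)."""
        selected = co_inherited = 0
        for _ in range(n_recipients):
            got_a = random.random() < f_selected    # selected marker enters and recombines
            got_b = random.random() < f_unselected  # unselected marker enters separately
            if got_a:
                selected += 1
                if got_b:
                    co_inherited += 1
        return co_inherited / selected

    print(apparent_linkage(100_000, 0.05, 0.05))  # ~0.05 regardless of map distance

True linkage, in contrast, rises toward 1.0 as two sites get closer together; a low, distance-independent co-inheritance plateau is one signature of congression.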
Another mapping term is that of interference, which refers to the effect of a mutation on detected
recombination in its immediate vicinity. Positive interference, or just interference, refers to the case where
less recombination is detected than is expected and negative interference (the term itself is a double
negative) refers to the opposite situation. These might result from any of a variety of very different
mechanisms, some of which include the following. There might be sites that are hot spots or cold spots for
recombination itself. Alternatively, there might be features in one or the other recombining region that
make recombination less likely; for example, the presence of a deletion or insertion in either copy makes it
less likely that RecA will be able to efficiently pair the homologous regions (since there are blocks of non-homologous DNA that break up the homology of the regions to be recombined) and recombination
frequency near these will decrease. Lastly, detected events might depend on the actual genotype of the
region. For example, if an unselected allele makes a strain somewhat sick, then recombinants with that
allele will be more likely to die and we will detect fewer progeny that contain this allele. We would score
this as lower linkage between the selected and unselected sites (and therefore a higher frequency of
recombination). Similarly, if a mutation suppresses the poor growth of a linked mutation, then one will
detect fewer recombination events than expected between these sites, because such events will often leave just
the deleterious mutation (without the suppressor) and the progeny will not grow well.
Two-factor crosses as an example of linkage. I will use two-factor crosses to introduce the concept of
linkage, which remains useful. For example, when you move your cloned gene, which you
have altered in vitro, back into the cell on a suicide vector, it will recombine with the chromosome under
proper selection. However, the frequency of different types of recombinants will be governed by the rules of linkage as described. (This is discussed at the end of LT4.)
Imagine you are working with a new bacterium and you have isolated two mutants with different growth requirements, and you term the mutations arg-1 and his-1. Assume that you also have a reasonably efficient generalized transduction or transformation system. One generates transducing lysates on both strains and uses these to transduce the other mutant. First, consider using the arg-1 lysate to transduce the his-1 recipient: One does this on arginine-containing medium, which demands that the recipient lose its his mutation, but allows (but does not demand) the arg mutation to be inherited. If the arg-1 mutation maps far from the wild-type version of his-1 in the donor, then no transducing fragment bringing in the selected his+ allele will also carry the arg-1 allele. In this case, when the His+ transductants are subsequently analyzed for their Arg phenotype, they will all remain Arg+. If, however, the genes for these two phenotypes are close enough to both be carried by the same transducing fragment, then some of the time, when the donor his+ allele is recombined into the recipient chromosome, the donor arg-1 allele will also replace the recipient's arg+ allele. Subsequent analysis of the transductants will reveal some number of Arg- transductants. As the distance between the arg-1 mutation and the his-1 mutation gets smaller, the likelihood of both regions entering the cell increases as does the probability that when one donor allele is selected, the other will also be recombined in. This coinheritance of the unselected arg allele with the selected his allele means that the two mutations are linked. This analysis assumes (i) recombinational events are random and will occur more frequently in longer regions of DNA homology; and (ii) recombinational events do not depend on the location of the mutations. The system obviously requires that you have a gene transfer system capable of transferring both loci together at some detectable frequency; that you have two different alleles at each locus, each with a dissimilar phenotype; that at least one of the alleles must be selectable in each case; that the gene transfer system allows selection for recombination; and that both alleles at one locus can be selectively neutral while an allele at the other locus is being selected for.
Figure 13-1. Two-point mapping. In the top panel, the recipient's chromosome is shown as the bottom line and, displayed above that, the set of possible transducing fragments capable of bringing in the his+ allele. (No recipient cell would actually receive more than one of these fragments.) Note that some of the fragments carrying his+ would not even carry the arg allele. In the lower panel, the specific set of recombination events is diagrammed. The selection is for His+, so any fragment that yielded a colony must have had a recombinational event on each side of the site of the his mutation in the chromosome to replace the mutant allele. Thus one event must have been in the "x" region and the other in the "y+z" region. The detected co-transduction frequency between the his and arg alleles will reflect both the fraction of fragments that carry both alleles as well as the relative likelihood of recombination events in region y vs. region z.
Since the result depends on both the frequency with which the two loci are transferred on the same piece of DNA and the frequency of recombination events occurring between (relative to outside) the two loci, the raw linkage frequency is not linear with physical distance. The relationship between linkage and physical distance is fairly linear when the markers are close (i.e. much closer than the length of the transducing fragment), because the complicating factor of the end effects of the incoming DNA fragment can be ignored. As the markers get farther apart, the end effects become a greater factor in the equation and the relationship between linkage and physical distance becomes strikingly nonlinear. Such a relationship has been worked out and is described by the Wu equation (Fig. 13-2).
Figure 13-2. The Wu equation and its graphic depiction.
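For a quantitative feel for this nonlinearity, here is a minimal Python sketch of the commonly cited form of the Wu equation, in which the expected co-transduction frequency of two markers separated by distance d, carried on transducing fragments of length L, is (1 - d/L)^3. The sketch and the example numbers are mine, not from the notes; d and L just need to be in the same units (e.g. minutes or kb).

    def cotransduction_frequency(d, L=1.0):
        """Expected co-transduction (linkage) frequency for two markers
        separated by distance d on transducing fragments of length L."""
        if d >= L:
            return 0.0  # the two markers can never ride on the same fragment
        return (1.0 - d / L) ** 3

    # The relationship is strikingly nonlinear:
    for d in (0.1, 0.25, 0.5, 0.75, 0.9):
        print(f"d/L = {d:.2f}  expected linkage = {cotransduction_frequency(d):.3f}")

Halving the distance between two markers does not simply double their linkage, which is why raw co-transduction frequencies are converted through this relationship before being read as physical distance.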
Two-point mapping analysis using Hfrs is a rather different story (Fig. 13-3), since we are not dealing with very short pieces of transmitted DNA. End effects do exist because of the gradient of transfer inherent in conjugation (only 0.1% of chromosomal conjugation transfers the entire chromosome). However, if one selects a distal marker (relatively farther from the origin of transfer), then you know that any proximal marker (one relatively closer to the origin of transfer) must have been introduced into the cell. Since recombination events are reasonably common, then any proximal marker will be at least 50% linked to the selected distal marker. If the two markers are sufficiently close that the likelihood of a recombinational event between them is low, then the linkage would be greater than 50% and would indicate a true proximity of the two markers on the chromosome. When one selects a proximal marker and then scores a distal marker, the linkage detected will be a function of both recombination and the gradient of transfer. Figure 13-3 gives an indication of the sorts of linkages one might see for pairs of selected and unselected markers relative to the direction of transfer.
Figure 13-3. Hfr linkage. In the figure, the horizontal axis depicts the position of any unselected marker "b" on the chromosome relative to the oriT. "a" depicts the position of the selected marker, and the linkage on the vertical axis would be that detected when the recipients selected for inheritance of a from the donor are screened for the inheritance of b.
To gain a feeling for the frequency of recombination events, remember the figure used in the
discussion of generalized transduction: about 10% of the donor markers brought into the cell are
recombined into the chromosome. This suggests that several recombinational events can occur in a small
region (the transducing particles we are referring to are approximately 1% of the cell chromosome) at
reasonable frequency. Markers separated by several percent of a chromosome are therefore likely to be
separated by several recombination events.
Two-factor crosses were always a bad way to do fine structure mapping as they told you the
genetic distance between two markers, but not their order. An attempt to order a set of mutations against
one another would hit the snag of trying to interpret the difference between, for example, 70% and 80%
linkage. Another difficulty with two-factor crosses is the possibility that one of the markers perturbs the
linkage.
Classical three-factor crosses as an example of end effects. Any time you had phenotypically similar
mutations that mapped very close together, you had a problem in ordering them by two-factor crosses.
This is because you need to determine their linkage to a third, phenotypically dissimilar marker and then
compare the two linkages to determine their order with respect to that third marker. But because the two
markers were close together by definition, then the differences in their linkage to the third marker would
be unreliably small. The solution now is of course to simply sequence them, but there was a genetic solution that is instructive if no longer useful per se: you demanded the rare recombinational event between the closely linked markers and scored the inheritance of the phenotypically dissimilar marker a short distance away. As an example, consider the two cases shown in Fig. 13-4. In each case there are two arginine mutations, arg-1 and arg-2, which are each about 80% linked to a histidine marker. What we would like to resolve is the order of the two arginine mutations relative to the histidine marker. For purposes of the example, the recipient strain is arg-1 his-1 while the donor strain is arg-2 and his+. The example represents the case with generalized transducing phage, and the experiment is performed by demanding Arg+ transductants on medium containing histidine. Selection demands a recombinational event between the arg-1 and arg-2 mutations, thus bringing the wild-type version of the arg-1 mutation into the chromosome. The interpretation of the cross is based on the relative frequency with which the his+ allele is brought into the chromosome. It is important that the histidine marker be fairly highly linked, of the order of 70% or higher, to the arginine alleles, otherwise the two orientations cannot be distinguished because they will give rather similar results for the scored marker. This mapping system relies on the fact that linkage is inversely related to recombinational frequencies. Thus 80% linkage suggests that there is only a 20% possibility of a recombinational event occurring between the his and arg markers. In the two cases diagrammed, the order of the two arg mutations determines whether you typically bring in the donor his allele or leave the recipient his allele.
Figure 13-4. The two possible sets of results from a three-factor cross, depending on the relative order of the markers.
So what does this tell us of current relevance? First, it emphasizes that recombination is a
measure of distance, and it is more striking in this situation because the distances are defined: both
recombination events must occur within the length of the transducing fragment or the selected marker is
not inherited (and we never even see that event). Second, it emphasizes that there have to be an even
number of recombinational events between a circular chromosome and a linear fragment in order to
maintain the required circularity of the chromosome. Third, it demonstrates the ability of a genetic
selection, in this case the demand for a rare recombinational event between two closely linked mutations,
to identify an interesting subset of the much larger population of cells. We are able to ignore the vast
number of cells that either did not become infected by the appropriate transducing fragment or had
recombination events outside the region of interest.
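The logic of the cross is easy to verify with a small simulation. Everything in the following Python sketch is my own scaffolding rather than part of the notes: the transducing fragment is idealized as the unit interval, the marker positions are invented, and a cross is reduced to two random crossover points that define the donor segment replacing the recipient segment. Selecting Arg+ demands that the segment cover the arg-1 site (where the donor is wild-type) but not the arg-2 site (where the donor is mutant); we then score how often the donor his+ allele is co-inherited.

    import random

    def fraction_his_plus(his, arg1, arg2, trials=200_000):
        """Sites are positions on an idealized transducing fragment [0, 1].
        Recipient: arg-1 his-1.  Donor: arg-2 his+.  Returns the fraction
        of selected Arg+ transductants that co-inherit the donor his+ allele."""
        selected = his_plus = 0
        for _ in range(trials):
            x, y = sorted((random.random(), random.random()))
            covered = lambda site: x < site < y  # this stretch is replaced by donor DNA
            # Arg+ needs the donor's wild-type arg-1 site but NOT its arg-2 mutation
            if covered(arg1) and not covered(arg2):
                selected += 1
                if covered(his):
                    his_plus += 1
        return his_plus / selected

    # Order his .. arg-1 .. arg-2: the donor his+ allele is frequently co-inherited.
    print(fraction_his_plus(his=0.2, arg1=0.5, arg2=0.6))   # ~0.4
    # Order his .. arg-2 .. arg-1: bringing in his+ is essentially impossible.
    print(fraction_his_plus(his=0.2, arg1=0.5, arg2=0.4))   # ~0.0

The two possible orders give grossly different frequencies for the scored his marker, which is why the method could order sites that were far too close to distinguish by two-factor linkage.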
Deletion mapping. The above types of mapping use the frequencies of appearance of different
phenotypes to determine the linear order of mutations. Deletion mapping is a bit different in that it asks a
simpler question, "whether or not" rather than "how often." (As you'll see below, this statement is
somewhere between an exaggeration and a lie. One really is scoring linkage to a deletion when doing
deletion analysis. "Zero recombinants" indicates that the point mutation is 100% linked to the deletion and
the interpretability of this number is a function of how many events are scored. Nevertheless, it is true that
the data were typically interpreted as yes or no.)
Consider a cross between a donor and recipient where one strain contains a deletion of part of
the region of interest and the other strain contains a point mutation in the region of interest. If the point
mutation and some portion of the deletion coincide (affect the same base pair), then there is no way to
restore a wild-type genotype by a recombinational event. If the point mutation and the deletion mutation
do not coincide, then there will be some frequency of recombination events between the two, generating a
wild-type sequence. There are a number of reasons why this was the best way to do fine structure
mapping: (i) it tends to be unambiguous; (ii) it did not need to be done reciprocally; (iii) it tended to have a
very low background due to reversion, especially when the recipient was a deletion. Other mapping
systems have reversion of the one or two point mutations to deal with and this lowers the signal-to-noise
ratio of the mapping scheme. (iv) Deletion mapping could be performed with the entire range of gene
transfer systems.
Unless the deletions had been physically characterized, the actual use of deletions in genetic
mapping involved a circular argument and protocol. One took a number of point or insertion mutations in
the region of interest as well as a number of putative deletions and crossed them in all possible
combinations. The presence of a deletion mutation was verified genetically if it failed to recombine with
one or more point mutations that can be shown to recombine with each other (this might serve as the
genetic definition of a deletion). Thus one is using the point mutations to establish the presence and
identity of various deletion mutations and using the deletion mutations to order the point mutations. If one
is presented with a table of recombination data, one can typically identify deletion mutations by the very
fact that they fail to recombine with more than one point mutation.
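Reading such a cross table is mechanical enough to express in a few lines of code. The Python sketch below is purely illustrative (the table contents, strain names and cutoff are invented): it applies the genetic definition of a deletion just given and then groups point mutations by the set of deletions that fail to recombine with them, which is exactly what defines the deletion intervals discussed next.

    can_recombine = {
        # point mutation -> {deletion: True if wild-type recombinants were ever seen}
        "pt-1": {"del-A": False, "del-B": False, "del-C": True},
        "pt-2": {"del-A": False, "del-B": True,  "del-C": True},
        "pt-3": {"del-A": True,  "del-B": True,  "del-C": False},
    }

    def is_deletion(d, table):
        """Genetic definition: fails to recombine with two or more point
        mutations (which are assumed to recombine with each other)."""
        blocked = [m for m in table if not table[m][d]]
        return len(blocked) >= 2

    def interval_signature(m, table):
        """Point mutations sharing a signature fall in the same deletion interval."""
        return frozenset(d for d, ok in table[m].items() if not ok)

    for d in ("del-A", "del-B", "del-C"):
        print(d, "verified as a deletion:", is_deletion(d, can_recombine))
    for m in sorted(can_recombine):
        print(m, "fails to recombine with:", sorted(interval_signature(m, can_recombine)))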
A deletion map turns out to be a series of clusters of mutations (which might be points, insertions
or deletions), with each cluster defined by the end point of one or more deletions. Thus, the presence of
such a "deletion interval" is a function both of one's ability to generate a deletion end point as well as
one's ability to find point mutations in that region to allow recognition of different deletion end points.
A sort of "inverse deletion" mapping can be performed with physically characterized regions of
DNA. These can be introduced into various mutant backgrounds to ask if the cloned region is capable of
recombining with the mutation to give a wild-type genotype. If a positive result is seen, the mutation is
covered by the cloned region.

607 Lecture Topic 14.................. YEAST GENETICS


As noted previously, yeast and prokaryotes are not very similar in terms of their cell biology, but
some yeasts such as Saccharomyces cerevisiae happen to be very similar to many bacteria in terms of
their utility as genetically tractable organisms: (i) They share a small size, so that massive numbers can
be analyzed simultaneously. (ii) They grow readily on defined medium and with a reasonable doubling
time, so that genetic experiments are readily performed. (iii) They readily perform homologous
recombination with introduced DNA fragments, which makes it easy to create strains with mutations
created in vitro. As you will see below, the nature of the yeast mating ability and its production of haploid
spores create additional useful features for genetic analysis.
This LT is relatively short for the simple reason that much of the text on prokaryotic systems applies
here as well. The nature of mutations and mutation frequencies is substantially similar, as are selections
and complementation. Other yeast-specific issues have been addressed in the previous LTs. The most
and complementation. Other yeast-specific issues have been addressed in the previous LTs. The most
important differences with yeast are in mating and genetic mapping, and these are the primary topics of this
section. Additional information can be found on Fred Sherman's web site on yeast
(http://dbb.urmc.rochester.edu/labs/sherman_f/yeast/) and on the Saccharomyces genome database,
which has all sorts of cute features, at http://genome-www.stanford.edu/Saccharomyces/.
Yeast forms. Like most yeasts, S. cerevisiae can exist in
two rather different morphological forms, termed fungal
and yeast. The fungal form occurs when the cells are
growing on a solid surface under nitrogen-limiting
conditions. This form is also termed pseudohyphae and
has elongated cells that remain attached to each other,
branch and can grow into the agar itself. The more typical
form in the lab is an ellipsoid yeast form, and the rest of
the text of this chapter will refer to this form unless
stated otherwise.
In the yeast form, cells tend to cluster together
and the yeast form of S. cerevisiae forms progeny cells by
budding. On the mother cell surface, the site from which
the daughter cell grew is marked by a chitin ring, which
cannot be involved in the formation of another daughter
cell. As a consequence, when the surface of the mother
cell is covered in these rings, it is no longer able to
divide, even though it can remain metabolically active.
Figure 14-1. The yeast cell cycle.

The number of progeny that can be produced over a cell's lifetime is between 15 and 40.
Cell cycle. Like other eukaryotic cells, yeast cells have a distinct cell cycle, which means that a variety of
structural changes and metabolic activities are constrained to specific temporal periods and in an ordered
fashion. In other words, DNA synthesis only occurs in a certain period and this period is different from the
one in which mating with other cells can occur or the period in which cell division actually takes place.
What is unusual about the yeast cell cycle is primarily that it leads to budding of a daughter cell, rather than cell
fission. As with other eukaryotes, these different periods are evident because of large morphological
changes that can be observed in the cell, though of course there are many more molecular events that
underpin these morphological changes. The cell cycle exists for a variety of reasons, but in yeast it seems
that a major factor is the decision as to whether or not there are sufficient nutrients available to support
DNA replication and the generation of a new cell. This claim is based on the fact that the absence of
appropriate nutrient levels freezes the cell in an early stage of the cycle. Presumably this is not the case
in higher organisms, in which individual cells are less dependent on external nutrients and more
dependent on proper development and differentiation control.
At first (or even second) glance, prokaryotes do not seem to have a cell cycle, since they typically
lack such obvious distinct phases. Certainly it is true that log-phase bacteria can initiate multiple rounds of
DNA replication even within a single cell and DNA replication can be continuous during growth. However,
log-phase enteric bacteria might be a deceptive example. First, outside of lab conditions, cells rarely face
a situation of unlimited nutrients. In the starvation conditions that most cells face all the time, they have
something of the same problems as yeasts: do they want to start DNA replication if there is a possibility
that they will not be able to complete the process because of low nutrient levels? I don't know the answer,
and it is true that prokaryotes do not have quite the morphological challenges to solve during their cell
division, such as creation and movement of spindles and the migration and division of the nuclei.
However, it still seems like a poor plan to go into starvation when half the chromosome is replicated. This
simply cannot be an optimal situation. So it seems plausible, at least, that there should be some primitive
cell cycle controlling the initiation of DNA replication.
Second, there actually are cases of cell cycling in prokaryotes that undergo a specific
morphological change. This occurs in B. subtilis during sporulation where there are specific controls that
prevent the spore cell from behaving like a vegetative cell (EMBOJ13:1566[94]). It is also apparent in
Caulobacter when the mother cell (tethered to a solid surface, and termed a stalked cell) produces a
motile progeny. In this case, the mother cell remains competent for DNA replication, but the motile cell
remains incompetent until it also attaches to a surface (PNAS95:85&120[98]). Lastly, we should recognize
that some of our ignorance of possible cell cycle stages in prokaryotes is because these cells lack most
microscopically observable changes and, perhaps equally importantly, we have not looked at the
phenomenon much. But back to yeast.
The yeast cell cycle is depicted in Fig. 14-1. These phases were defined by the use of a large
number of mutants that were stuck in one or another place in the cell cycle, which could be identified
because of some of the morphological events depicted in the figure. Because freezing cells at different
stages in the cycle would be lethal events, these mutations were all conditional, typically cold- or heat-sensitive. G1 is the phase that cells spend the most time in under nutrient-limiting conditions because exit
from that phase requires approval from some factors that sense available nutrients. Once the process is
started, it must be taken to completion, so the S, G2, and M phases are largely the ordered steps of three
substantially separate processes: DNA replication, bud emergence and growth, and creation and utilization
of the spindle pole bodies. Governing the entire process is a complex set of signal pathways that order
each process and provide some level of coordination among them. The G1 period is also the time when a
diploid cell might decide that nutrients are so limiting that it wants to sporulate, which initiates the meiotic
pathway, which is dealt with below. (As an aside, the filamentous form can be rather different: nutrient
depletion of this form causes growth to pause in G2 rather than in G1 as in the yeast form.)
Mating Types and their effects. Natural isolates of yeast can have any number of mating types, but the
most commonly studied lab strain, S. cerevisiae, has two. These are unfortunately termed a and α, which
are awkward enough now, but must have been brutal in the days before word processing. For mating to
occur, the two cells must be of different mating type. A fungus in which progeny of the same individual
can mate is termed homothallic, and one in which gametes can be fertilized only by gametes from a
different individual is termed heterothallic.
The ability of a homothallic cell to have progeny of a different mating type is due to a remarkable
system that will be very briefly described. In the wild-type strain, there exist three copies of the mating
type locus, all on chromosome 3. Only one of these is functionally expressed; this is termed MAT and
is located rather near the centromere. The other two loci, HML and HMR, are located near the left and
right ends of the chromosome, respectively, and, as is often the case near the telomeres, their expression
is silenced. Each of these loci has two regions of several hundred base pairs of sequence identity, but
between these is a 600-bp region that exists in one of two forms. One form specifies the a mating type
and the other form specifies the α mating type. Typically the two silent loci, termed cassettes, each have
dissimilar versions of the region, but the mating type of the organism is determined absolutely by the
nature of the MAT locus. Although the silent cassettes do not affect the mating type, they can be involved in
a process that changes the cassette at MAT by the following mechanism. A specific endonuclease termed
HO makes a double-strand break in the conserved sequence region only at the MAT locus. This break is
then repaired by a gene conversion recombination event where the DNA used for the repair is from either
of the silent cassettes. If an α cell (and therefore with the α region at MAT) is cut and repaired with DNA
from the silent a locus, the cell becomes an a mating type cell.
The mating process is quite complex and developmentally regulated. Among other things, cells
secrete small peptides, termed a or α, appropriate to their mating type. These peptides are therefore
pheromones. These have the ability to arrest the growth of cells of the opposite mating type. As a
consequence, when two haploid cells of different mating types are near each other, they stop growing at a
stage in their cell cycle appropriate for mating. Moreover, since yeast cells are non-motile, the cells
elongate toward the source of the pheromone, which potentially brings them in contact with each other.
They then fuse and form a diploid bud, while the fused parental cells become anucleate. (For complex
regulatory reasons, diploid cells expressing both a and α factors are sterile.)
The literature on the molecular basis of mating types, and on the exceedingly clever genetic tools
used to dissect the entire pathway before cloning and sequencing where available, is substantial and very
interesting, but not critical for our analysis. For our purposes, one can take two strains of different mating
types, mix them together at high density (because they are non-motile) and obtain very effective creation of
diploid progeny. As will be described below, such progeny are actually selected by using parents that
each have recessive mutations causing nutrient requirements, so that only the diploid cells can grow.
Patterns of chromosome segregation and recombination. In mitosis, the pattern is relatively simple.
The chromosomes are each duplicated (and there are a total of 16 per haploid), they condense, and sister
chromatids are moved to opposite poles of the spindle, whereupon new nuclei form. Meiosis is more
complicated and more important for our purposes, since it is at the heart of genetic mapping and analysis.
In this process, a diploid (typically formed by the mating of two haploids) pairs its homologous
chromosomes and replicates these. At this point, crossing over occurs between different pairs of these
four chromosome copies (in prokaryotic terminology, this would be called recombination, but that term
unfortunately has a different meaning among eukaryotes, as was explained in LT1. For ONLY this
chapter, I will try to use the term crossing over to refer to the physical breakage and rejoining of DNA
duplexes). It so happens that this recombination frequency varies greatly among different organisms, and
organisms with higher frequencies are more genetically useful because it means that a smaller population
of cells is necessary to find individuals with crossing over events between two markers on the same
chromosome. One of the great virtues of the lab strain of S. cerevisiae is that it has one of the highest
crossing over frequencies of any organism. The centromeres of the chromosome pairs are separated
such that each daughter cell receives one member of each pair. This completes the first meiotic division.
During the second meiotic division, recombination is much less frequent and the primary effect of
this process is to create four haploid cells. In yeast, these are packaged together in a sac called an ascus.
In one of the key advantages of yeast for genetic analysis, it is possible to pick these ascal sacs with a
micromanipulator and remove the four individual spores, which can be physically positioned on an agar
grid to allow them to each produce colonies. This is termed tetrad analysis. The importance of this is
explained when it is contrasted with the situation with prokaryotes or with higher eukaryotes. In
prokaryotes, one needs a selection for a crossing over event, because crossing over is relatively rare and
we therefore need a mechanism to distinguish those cells where it has occurred from the majority where it
has not. We routinely do this by providing the selected marker on a non-replicating piece of DNA (either
by transformation or generalized transduction, or mating) such that progeny of such crossing over events
are those cells that have acquired a selectable phenotype. In yeast, this is not the case, since one can
mix any two genotypes together by mating and determining whether or not crossing over has occurred
between any two markers by analysis of the progeny. In other words, since mating is itself selectable (by
complementation of recessive alleles) and crossing over is frequent, there is no need for a selection for
crossover itself. Additionally in prokaryotes only one product of the crossing over is recovered, because
the event is between one fragment that can replicate and another that cannot, so the non-selected copy is
necessarily lost. Now this is not terrible and we can guess what was carried by the lost region, but in
yeast, one recovers all the progeny of the crossing over because all four spores are typically viable.
Yeasts present numerous advantages when compared to higher eukaryotes. One is that a sexual mating in higher eukaryotes yields diploid progeny, but few products of a given meiosis are recoverable (i.e. only the ones that give detectable progeny). As a consequence, the results of the cross have to be studied in large populations of random products of a cross, which is much less powerful than tetrad analysis. There is also the serious problem that the products of most eukaryotic crosses are diploid, so another round of matings is often required to get a sense of the genotype of the progeny. The fact that yeast generates haploid progeny removes these problems. Lastly, as already mentioned, S. cerevisiae is particularly enthusiastic about performing crossing over, which makes it especially valuable.
Before discussing the analysis of genetic crosses in any detail, we will start by noting the three general classes of progeny that come out of any cross (let's use a standard example of a cross between AB and ab, where the capital letter refers to the dominant allele in each case). The classes here refer to the patterns of the genotypes of the different types of tetrads detected. The first class is called the parental ditype (PD), where the last term refers to the fact that there are two distinct parental genotypes, so that two of the progeny match one of the two parents and the other two progeny match the other parent (AB and ab). While one might be tempted to think that such progeny result from cases where there is no crossing over, that is not the case in yeast where crossing over is so common. Rather this class reflects the situation that even with crossing over the genotype of each progeny cell matches one or the other parent.
The second class is termed the non-parental ditype (NPD) and reflects the situation in which each progeny cell is distinct from both parents (two are Ab and two are aB). Again, this might involve crossing over, but it might also simply result from differential segregation of chromosomes. Obviously, if the two scored markers are on different chromosomes and these segregate independently, then the progeny would be equally likely to have received chromosomes from either parent and so the frequency of PD would be equal to that of NPD. The third class is called tetratype (T) and in this situation, there are four distinct genotypes among the spores: one looks like one parent, another looks like the other parent, and the other two look like neither parent, nor like each other (Ab, aB, ab, and AB). A T tetrad can only occur when a crossing over has taken place. Hopefully this is clarified in Fig. 14-2 and its legend.
Figure 14-2. Consider a cross between two parents with genotypes AB and ab, and where you do not already know if the A and B loci fall on the same chromosome. When the frequency of PD>>NPD, then the genes are on homologous chromosomes, because this positioning will allow few NPD tetrads since they result only from rare four-strand cross-overs. In contrast, if the genes are on non-homologous chromosomes, or if they are on homologous chromosomes but are very distant from each other, then PD=NPD because of the independent assortment (or multiple crossovers in the case of homologous chromosomes). The figure and caption are a slight modification of http://dbb.urmc.rochester.edu/labs/sherman_f
So what sort of result indicates that two genes are linked? That is, if A and B are fairly close to
each other on the same chromosome, what would you expect in terms of the types of tetrads? If the two
loci were very close together, then one should almost always see lots of parental genotypes, since the
only way to get something different would be when a crossing over event occurred in the small space
between the loci. In other words, we see that two genes are linked if PD tetrads are more frequent than
are NPD tetrads. Obviously if they are so close that crossing over never occurs, then you will see
exclusively PD. Obviously too, if they are only very weakly linked, then the difference between PD and
NPD frequencies will be small. So how does one calculate a linkage in this system? Actually what is
calculated is a recombinant frequency, but even this is not trivial to calculate because two closely spaced
crossing over events cancel each other out, even though they did occur. Without belaboring the method,
the recombination frequency (RF) is calculated as follows:

RF = 100 × (NPD + ½T) / (total tetrads)

This value can then be converted into a physical distance if one has already determined the RF per
unit distance. This is quantified in terms of centimorgans (cM), where 1 cM reflects a 1% chance that the
two markers will be separated by crossing over. For Saccharomyces, the map distance is given by:

cM = 100 × (T + 6·NPD) / (2 × (PD + NPD + T))
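To make the bookkeeping concrete, here is a short Python sketch (mine, not from the notes; the two-character genotype strings such as "Ab" are an invented encoding of the alleles at the two scored loci, and normal 2:2 segregation of each marker is assumed) that classifies dissected tetrads from an AB x ab cross and applies the map-distance formula above.

    from collections import Counter

    def classify_tetrad(spores):
        """spores: the four genotypes from one ascus, e.g. ("AB", "ab", "Ab", "aB")."""
        kinds = set(spores)
        if kinds == {"AB", "ab"}:
            return "PD"   # parental ditype
        if kinds == {"Ab", "aB"}:
            return "NPD"  # non-parental ditype
        return "T"        # tetratype: all four genotypes present

    def map_distance_cM(tetrads):
        n = Counter(classify_tetrad(t) for t in tetrads)
        total = n["PD"] + n["NPD"] + n["T"]
        return 100 * (n["T"] + 6 * n["NPD"]) / (2 * total)

    # e.g. 40 PD, 1 NPD and 9 T among 50 dissected asci:
    tetrads = ([("AB", "AB", "ab", "ab")] * 40 +
               [("Ab", "Ab", "aB", "aB")] * 1 +
               [("AB", "ab", "Ab", "aB")] * 9)
    print(map_distance_cM(tetrads))  # 15.0 cM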
Now none of this would be particularly attractive as a modern mode of analysis if crossing over
were rare, because you would have to examine many tetrads to get a large enough sample size to say
with confidence that PD did or did not equal NPD, and to determine linkage. However, S. cerevisiae
performs cross-overs multiple times per chromosome, so even the examination of 50 tetrads can provide
a great deal of information. This frequent crossing-over is reflected in the fact that a cM in S. cerevisiae
reflects 3 kb, while in humans, it reflects 1Mb! The utility now is not so much for fine-structure mapping,
but rather to determine how many loci one is dealing with, and roughly where they are (and therefore
whether they lie near known loci that provide similar phenotypes). This rough mapping can also then be
used to focus cloning methods to precisely identify the gene and the mutation itself.
Cloning methods. We will mention two general yeast methods that have significant utility. The first is a
form of cloning by complementation. As described in LT4, there are a variety of replicons available for
yeast. The larger ones with centromeres and telomeres are not convenient for in vitro manipulation (to
shot-gun clone a gene for example), but the smaller multi-copy vectors work rather like plasmids in
prokaryotes. As noted previously, yeast does not have useful dominant antibiotic resistance markers, but
the use of the wild-type alleles of genes encoding biosynthetic enzymes has worked extremely well. One
would therefore take a ura3 cell with another recessive mutation and clone random DNA from the wild
type into a vector carrying the URA3 gene. Selection for Ura+ and either selecting or scoring for the other
marker would allow the wild-type allele to be cloned.
S. cerevisiae has two other very useful properties that allow the introduction into the chromosome
of alleles constructed in vitro. First S. cerevisiae is efficiently transformed and secondly it is able to
perform homologous recombination between the chromosome and non-chromosomal regions of
homology with good efficiency. Together these features mean that it is possible to transform in either
circular or linear DNA and, if it shares homology with the chromosome, it will recombine with the
chromosome. More specifically, the method can be used to introduce either selectable or non-selectable
markers into the genome by a method very similar to that used with suicide plasmids in prokaryotes and
described at the end of LT4. The fact that the replicon in the cell is linear rather than circular is irrelevant,
as you will see if you reconsider the mechanism in Fig. 4-1 from a yeast viewpoint. In the yeast case, the
suicide vector can be almost any circular piece of DNA, presumably a bacterial plasmid, carrying a
selectable yeast gene such as URA3 and a region of cloned yeast DNA into which a mutation is created.
Selection for Ura+ yields primarily plasmid integrants as in bacteria. These integrants are then resolved to
the single-copy state (but by a recombination event that might leave the constructed allele in the genome, Fig.
4-1) by random recombination. However, as noted in LT8, the URA3 gene of yeast has the charming
feature that it creates not only uracil prototrophy, but also sensitivity to 5-fluoroorotic acid (5-FOA). This
compound is not itself toxic but becomes toxic when there is a functional URA3 gene in the cell. So
plating the cells with integrated plasmids on uracil plus 5-FOA results in strains that have typically lost the
integrated plasmids and can be tested for retention of the desired allele.
Not all yeasts have these features. For example, some pathogenic yeasts that are important
targets for study are able to be transformed and integrate selectable suicide plasmids into their genomes,
but they do this without significant regard to homology. In other words, the plasmids integrate but not in
the region homologous to that carried by the plasmids. As a consequence, allele replacement is difficult to
impossible in these species.

607 Lecture Topic 15.................. SUPPRESSION


See the aged, but excellent, review by Hartman and Roth: AdvGenet17:1[73]; and a less complete but
updated version in TIG15:261[99]. Note, however, that this review makes a hash of the concept of
informational suppression (see below) in referring only to tRNA suppressors (gad!), an oversight that was
pointed out by our own Phil Anderson (TIG16:157[00]). On suppression of nonsense signals and
frameshift mutations, see ASM2, 909[96].
Suppressors are that class of second mutations that modify the strain's phenotype in the
presence of the original mutation. The effect of the modification is to make the phenotype more like wild-type. Occasionally, suppressors are termed compensatory mutations, which can be viewed as a slightly
more general term for the same phenomenon. It has become common to see the term pseudo-revertants
applied to suppressor-containing strains, and the term revertants used for restorations of the wild-type
genotype. While this usage is clear, I find it awkward: what does one call the strains (derivatives of a
mutant) that have not been sequenced to determine whether they are revertants or pseudo-revertants?
(This section focuses on suppressors that one is aware of, but the same arguments, of course, apply
even in cases where the scientist is unaware that a suppressor mutation has occurred. See the last
section on Selections in LT8.)
An important fact concerning suppressors is the following: When you start with a mutant that fails
to grow, any alteration that allows it to grow well enough to be seen is acceptable even if the growth is not
as robust as for the wild type. Revertants are therefore cells that are significantly more normal than the
mutant but are not necessarily of a fully wild-type phenotype. They are strains that have the requisite
amount of growth to be detectable under the arbitrarily chosen conditions.
All suppressors have two general problems: (i) they often entail the loss of some or all of the
normal function of the gene with the suppressor mutation, and (ii) they often involve the acquisition of a
new function, which might be deleterious under some or all conditions. Such suppressors are detected,
despite their deleterious properties, because they provide more growth than that of the original mutant.
Suppressors are fundamentally similar in prokaryotes and eukaryotes, except the more complex
cell biology of the latter allows for some additional classes not available to prokaryotes.
Use of suppressor analysis. In a sense, basic genetics is the isolation and the genetic and physiological
characterization of mutant classes. The analysis of suppressors is the next logical step: you alter a
function in such a way that a deleterious phenotype results, then ask for better growth, and determine
what alterations in the cell make up for the initial defect. The approach of seeking and analyzing
suppressor mutations is one of the oldest and most powerful in genetics. Its power comes from the fact that
you allow the organism to reveal the sorts of changes in metabolism, based on whatever functions are
affected in the suppressors, that have effects on the original pathway altered in the mutant. As a
consequence, you often find connections between aspects of physiology that you could not have
anticipated and are of real biological importance.
Genetically this requires mapping the suppressor mutation using standard mapping methods,
except that all strains must have the initial, suppressible mutation, so that the suppressor mutation can
always be mapped as a selectable marker.
As will be detailed below, there can be a vast range of possible mutations that suppress the
phenotype of another and it would be next to impossible to chase down the biochemical alteration
underlying all of them. Also one has the problem that different types of suppressors do not all appear at
equal frequencies. The most common suppressors will be those whose genetic events are most frequent
and these will greatly predominate in the population, even if other classes happen to be more interesting
to you. It is often better to start with a mutation and then ask if suppressors of it map to a particular gene
or region. This approach yields only those classes that map to regions that the experimenter can imagine,
but makes the subsequent timely understanding of those mutations much more likely.
The worst part of suppressors is the nomenclature, in part because of the convention for +/- superscripts. Because the wild-type allele of a gene should receive a + superscript, the allele that
actually performs suppression should be termed sup-. However, because the mutant allele provides the
appearance of function, many people refer to the mutant allele as sup+. An additional problem comes in
naming a suppressor some appropriate term and then realizing subsequently that the suppressor is an
allele of a gene with a previously established gene name. In the case of informational suppressors, for
example, supF (the old suppressor designation) is an allele of tyrT, the gene for a tyrosine-inserting tRNA,
yet the old sup name is pretty well built into the literature.
Informational suppression. (see review in ARG19:57[85]) These are suppressors that affect the way the
cell uses genetic information to produce the end-product of the gene, and function at the levels described
below. In effect, this class of suppressors causes some level of error to occur in the processing of
information and those errors allow some necessary level of product activity to be formed from the flawed
information. Analysis of these suppressors provides insight into protein synthesis and has had uses in
generation of conditional mutations (temperature-sensitive suppressors). The frequency with which these
suppressors make these errors, which is to say, the frequency with which they suppress, is termed the
efficiency of suppression.
These suppressors allow production of some wild-type or sufficiently effective product from the
mutant gene by altering the protein-synthesizing system so that it mis-processes certain types of mRNA
information to systematically create a small amount of functional product from a mutant mRNA.
Alternatively, the fidelity of transcription or the processing of mRNAs can be systematically error-prone so
that they mis-process certain information from mutant genes to produce some functional product. The
specific alterations in the transcription and translation machinery with such effects might directly affect
tRNAs, tRNA synthetases, or tRNA modification functions: in all these cases an amino acid might be
occasionally be replaced by an incorrect one. Altered release factors can be a factor in nonsense
suppression, because the suppressor tRNAs must compete with these proteins. Alterations in RNA
polymerase or in mRNA processing functions can also result in certain types of low-level errors. Such
suppressors can suppress nonsense, missense, and frameshift mutations. tRNA suppressors of the last
class of mutations might affect the acceptor stem, the TΨC loop, or the anticodon. These are typically
suppressors of +1 frameshifts that pair with four bases or slide, but suppressors that only move two
bases, and therefore suppress -1 frameshifts, also exist (JBact174:4179[92]). There are even ribosomal
suppressors that are specific alterations of the 16S rRNA that result in suppression of only UGA signals.
This alteration does not affect RF2 binding to the ribosome, which suggests a direct involvement of 16S
rRNA in reading the mRNA (PNAS85:4162[88] & NAR18:5625[90]). Thymine deficiency causes
suppression of both nonsense and frameshift mutations, perhaps because tRNAs are not properly
modified and thereby miscode or slip with greater frequency.
Informational suppressors only mis-process the information some of the time. Much of the time
they cause a product to be made that matches the mutant information, but it is the occasional error in
information processing that allows some functional product. The frequency with which these suppressors
mis-process the information (which in turn leads to some functional product) is referred to as the efficiency
of suppression. Note that a suppressor that has a high efficiency of suppression is one that routinely mis-processes and, since most of the information being handled is actually wild-type, an efficient
suppressor makes lots of errors in handling the majority of other (non-mutant) information in the cell.
A very cute suppressor is supK, a recessive UGA suppressor in St that maps in prfB, whose
product, RF2, recognizes and terminates at UGA signals. Remember (LT1) that the normal gene requires
a +1 frameshift at codon 26 to get by a UGA signal. The suppressor mutation is itself a UGA mutation at
codon 144, which reduces RF2 levels, allowing the suppression of other UGAs and itself; it allows the
synthesis of about 10% the wild type levels of RF2 (PNAS87:8432[90]).
Informational suppressors can be dominant or recessive to wild type, although most tRNA
suppressors tend to be dominant because they still suppress in the presence of the wild-type tRNA. In
contrast, most suppressors that result from the poor function of tRNA modification enzymes are recessive
because the presence of the wild-type allele causes normal tRNA modification.
As a bit of history, it has long been an oddity that many of the lab strains of E. coli have amber
suppressors and it is clear that these mutations existed in some of the strains used as progenitors for
most of the current strains. This fact was not known until we started to have a grasp on suppressors in the
1960's, so suppressors were inadvertently built into all the standard strains. So why did these suppressors
exist in the original strains? The current notion is that some of the early strains had amber mutations in
rpoS, the gene encoding the sigma factor for expression of genes in stationary phase. RpoS is not
important for logarithmic growth and only has an impact on cells when they are not growing, whereupon it
affects their survival. The notion is that the amber mutation occurred sometime between the isolation of E.
coli from a diphtheria patient in 1922 and the choice of strain K-12 by Tatum (a UW grad), and that
storage of that strain in the lab then selected for derivatives with the suppressors (Genet165:455[03]).
Non-informational suppression (of enzyme defects). None of the following types of suppressors affect
the way information is processed, but rather correct the original defect by some other mechanism. They
might affect a vast range of possibilities in cellular physiology, and their analysis has provided information
on protein structure and metabolism.
For the following cases, metabolites are designated by capital letters, and genes and enzymes by numerals (see Fig. 15-1). A mutation in gene 2 (the structural gene for enzyme 2) causes a metabolic defect (the absence of the necessary compound C) and might be suppressed by a second mutation that falls into one of the following classes. In each case, the suppressors work by either causing a sufficient increase in the level of the required compound C or by decreasing the requirement for the compound to a level that the cell already possesses.
Figure 15-1. A hypothetical set of enzymes and metabolites as a framework to consider different suppressor types discussed in the text. In this case, C is the essential metabolite that is lacking in a strain mutated for the genes encoding enzyme 2.
Intragenic suppression. The critical issue for the sorts of suppressors that are possible depends on why the original mutation causes a mutant phenotype. Depending on that point, the phenotype might be corrected by yet another mutation by the following:
(i) A change in a second letter in the same triplet codon affected by the first mutation might allow insertion of an amino acid more compatible with functioning of enzyme 2. Alternatively, a samesense mutation could make a codon more translatable or alter a site involved in transcriptional, translational or post-transcriptional regulation.
(ii) A second, genetically separate mutation in gene 2 that leads to a "doubly mutant" enzyme 2
might be a suppressor in the following ways: If enzyme 2 has lost enzymatic activity by the original
mutation, perhaps by an alteration of the active site, then that can be restored by other substitutions that
compensate for that change (e.g. trp synthetase intragenic suppressors, Bioc25:6356[86], and restoration
of a functional phage P22 tail spike protein by a compensatory charge-change, Genet125:673[90]).
Similarly, a compensatory frameshift at a different position in the gene can either restore protein function
or relieve polarity without restoring function of the affected gene product. If the deleterious phenotype was
due to the accumulation of a toxic polypeptide, then a suppressor might simply destroy that product. If
enzyme 2 was mutationally slowed, but also normally inhibited by ions, metabolites, or macromolecules,
then suppressor mutations might eliminate the ability of those inhibitors to bind (e.g. gyrB suppressors of
a leu-500 promoter mutation, EMBO7:1863[88]). It is even possible that the mutation creates a very
poorly translated codon, resulting in low protein synthesis, and this could be compensated for by
improving the codon context with mutations in adjacent codons.
(iii) Any mutation that increases the amount of partially defective enzyme 2 through increased
gene dosage or through altered regulation or altered gene expression could be a suppressor. These
include duplications/amplifications, transcript fusions by deletion or duplication, mutations that destroy a
repressor or stimulate the activity of an activator, or alter the local (or global) superhelicity in such a way
as to enhance transcription.
(iv) If enzyme 2 is inhibited by ions, metabolites, or macromolecules, then mutations in other genes that decrease the amount of the inhibiting substance will be suppressors. For example, if the wild-type enzyme 2 is inhibited to some extent and the original mutation further damages activity, then such suppressors would at least relieve the inhibition. Alternatively, the original mutation might make the mutationally altered enzyme 2 uniquely sensitive to these factors.
(v) If the original mutation causes a poor affinity of enzyme 2 for its substrate, or if it lowers the
activity of enzyme 2 so that it is below saturation by its substrate, then increasing substrate B would
suppress the original defect. For example, an increase in the amount (or the regulation of) enzyme 1
activity would have this property. Alternatively, the levels of B could be increased by damaging the ability
of a competitive enzyme (this is not shown in our example, but imagine another enzyme that also used B
as a substrate for a different reaction). An example of this is the case of ATCase and OTCase, which both
compete for carbamyl phosphate as a substrate; a mutation raising the Km of either gene product is
compensated for by a mutation causing a similar effect on the other enzyme because carbamyl
phosphate levels rise to allow both enzymes to function.
(vi) It is even possible that another protein can mimic enzyme 2 in its function. Suppressors might
require the expression of this duplicate function when it is not normally expressed, or the modification of
its active site so that it now runs this reaction instead of a slightly different one. Examples include the bgl
locus as revertants of lacZ, in which both regulation and enzymatic properties are altered
(Genet120:887[88]); or newD mutations as suppressors of leuD, where a homologous protein subunit is
stolen from another protein complex by mutationally killing the competing peptide (JBact170:3115[88]).
(vii) If the key requirement is the production of metabolite C, then it is conceivable that it might be
supplied by a completely different metabolic pathway, which might be parallel to the normal pathway (X = B, Y = C) or entirely distinct from it (X and Y distinct from B and C). The suppressor mutations could then affect
accumulation of Y by damaging the activity of enzyme 4, or cause the conversion of Y to C by altering the control of enzyme 5. These types of suppressors are particularly possible in cases of "channeled"
pathways, which are enzyme complexes that channel metabolites through a number of enzymatic steps
without their release. In such cases, similar or redundant enzymatic functions are necessary for each
pathway under normal conditions, but mutational alteration of one pathway can allow a flow of metabolites
to the other.
(viii) If the negative phenotype is actually caused by the accumulation of compound B such that it
inhibits other reactions (the inhibition of enzyme 3 in our example) then mutations that lowered the level of
compound B would be suppressors. These might include those that limited synthesis of B by altering the
regulation or efficiency of enzyme 1. Alternatively, the inhibitory effect itself could be addressed by
lowering the sensitivity of enzyme 3, by increasing the amount of enzyme 3 or by increasing the level of
the normal substrate, compound D, and out-competing the inhibitor B.
(ix) Though not depicted in our example, if the negative phenotype results from defects in the
interaction of enzyme 2 with other macromolecules, then altering those macromolecules might provide
suppression. Examples include mutations in genes encoding regulatory factors that compensate for
operator mutations (JBC263:6193ff [88]), and mutations in genes for tRNAs that compensate for
mutationally damaged tRNA synthetases. A nice example of suppressors involving protein:protein
interactions has been reported for tonB, btuB, and cir, which are Ec membrane proteins involved in
transport (JBact172:3826[90]).
Multi-copy suppressors. Imagine that you have a mutation in a strain that causes a deleterious
phenotype and you isolate derivatives that grow better and contain suppressor mutations. If you made a
library in a multi-copy plasmid of inserts from that suppressor containing strain and transformed these
into the mutant parent, you might reasonably hope that strains with a WT phenotype might have a plasmid
carrying the suppressor allele. Indeed, if the suppressor is dominant, then you probably will get some of
those. However, with surprising frequency, you also get plasmids that carry a completely WT region of the
chromosome. Moreover, these plasmids are causative of the WT phenotype (that is, the colonies did not
arise because there was a mutation in the chromosome that was coincident with receipt of the plasmid).
So what is going on? There are a number of possibilities, but the most plausible one is that some gene on
the plasmid, when in multi-copy, compensates for the original mutation, where that same gene in single
copy does not. Presumably, this is because there is more gene product accumulation in the former case,
but why should that suppress some other mutation? I believe the most common phenomenon would be
where the protein product of the originally mutated gene and that on the plasmid are homologous, such
that they have evolved distinct functions but where there is a bit of the activity of the former to be found in
the latter. Thus, overexpression can give enough total activity that the strain can grow. So this sort of
analysis allows one to find some minor activities of some proteins that are not readily detectable
otherwise. And of course, there are many other possibilities for such a scheme working, rather along the
lines of the non-informational suppressors described above. And, similar to those, all such odd cases of
suppression can reveal something about function and regulation that could not be readily found by any
other means.

607 Lecture Topic 16..............EVOLUTION


Discussions of evolution typically cover the period from the earliest living cells to the development
of more complex organisms, and that will be the focus of the bulk of this chapter for the simple reason that
genomic and biochemical information give us some insights into that process. However, we will start with
a brief comment on the transition from abiotic to biotic systems, which is actually the tricky bit, as the
British might say. While I think that continuity of all known life is pretty evident based on current
information, and I have no trouble guessing a plausible route for the evolution of even a complex structure
like the eye, I have not seen a plausible description of the formation of the primordial cell.
The key problems are the following: (i) You need to evolve a code - both a molecule and a coding
scheme - but that code is worthless unless it is able to replicate and also be decoded into something
useful. In other words, there could be no obvious selection for a coding molecule unless its replication and
coding utility were both already satisfied. (ii) Efficient replication of a coding molecule would certainly
require some special function, which certainly would have to be encoded by the coding molecule itself.
This causes an immediate problem - how did one simultaneously evolve a coding molecule and a
replication catalyst? However, maybe we can escape this problem by simply assuming that replication
was not efficient and really did occur by non-enzymatic polymerization of nucleic acid on a single-strand
template with subsequent, random strand separation. So perhaps one can imagine a piece of DNA sitting
in a vesicle that is able to replicate itself at some low rate. (iii) The more serious challenge to evolution is
to explain how such a self-replicating nucleic acid gained the ability to code for amino acid polymers. That
is, even if a piece of DNA can replicate in its vesicle, its vesicle will have no particular reason to survive (if
we can use the term for a non-living vesicle) unless the DNA code makes something useful for the
vesicle. And now the most serious problem comes in: the protocell needs the ability to convert the genetic
information into a product. Now we obviously do not need today's sophisticated ribosomes, but you do
need something. Importantly, there does not seem to be any sort of chemical affinity between amino acids
and specific codons that would allow this to happen without a special translation process. So if the cell
needs a simple translation system, it seems that it would have to encode such a thing, but how could it
have evolved such a coding region until the system was efficient enough to lead to its own translation?
The above argument ignores little problems like metabolism, the concentration of compounds
within a proto-cell, cell division, and the like. But even in this stripped down form, I think there is a real
problem in thinking about how this could conceivably have occurred. The other little oddity is that it
occurred fairly quickly, at least on the evolutionary scale. It was apparently only a couple hundred million
years between the earth's cooling to habitable temperatures and the first living cells.
The solution might involve what has been termed an "RNA world," because RNA has been shown
to be a catalyst of enzymatic reactions, thus resolving both coding and functional properties into a single
molecule. Most strikingly, rRNA has been shown to be the essential catalyst for peptide bond formation on
ribosomes (rather than merely serving as a scaffold for interesting proteins) (Sci256:1416,1420[92] and
Sci256:1396 & 1402[92]). The problem with the RNA world view is that it does not explain why or how
one would then develop the protein synthesis machinery that we see now. That is, the very solution of
having the same molecule be both code and product means that it is fundamentally different from the
nucleic acid-protein basis of life we see today. An interesting hypothesis for closing this gap in the model
has been proposed (TIG15:223[99]). A recent review on the evolution of the code is Naturwissenschaften89:542[02], and one with some rationalization of the transition from an RNA world is TIBS24:241[99], but I confess to a lack of enthusiasm about many of its arguments.
For reasons that I fail to understand, the above arguments do not appear in most texts. Quite
possibly the writers are reluctant to give fuel to those who see any gap in the scientific argument as
support for one or another type of special creation. Indeed, it is somewhat amusing that creationists have
fought their war on aspects of evolution (such as the evolution of man from other primates), which are,
scientifically at least, no-brainers. What caused me to expand this section of the text (which used to be a
few sentences that students failed to really understand) were a couple of new articles that restate the problem. I do not know if the authors have real insights, but at least the work is well written and has a lot of references to evolution papers from the past 40 years (CellBiolIntl28:729[04]).
While I am hardly an expert on evolution, the rise of fields like evolutionary psychology suggests
that my ignorance is commonly surpassed. The purveyors of these and many other social sciences
appear to believe that everything that we see in nature (both physical features of organisms as well as
behavioral traits) has been selected for in evolution. They further assume that these features and
behaviors are encoded in our genes and set themselves about the task of guessing the evolutionary
pressure for the presence of each and every trait. I believe that this approach is flawed for a number of
reasons: (i) It makes the assumption that every trait in every organism has been strongly selected for, and
this is just silly. To be sure, strongly deleterious features and behaviors have been strongly selected
against, and obviously some very advantageous properties, such as disease resistance, have been
strongly selected for. However that does not mean that there is a strong selection for everything else.
Many genetically encoded properties exist in the population because the feature in question was part of
an organism that did have a number of very positive traits that were selected for and that naturally brings
along all sorts of other baggage (the genetic term for this is "hitchhiking," as coined by Maynard Smith). (ii)
Evolution in big animals like people has been stunningly brief, especially for complicated behavioral traits.
Based on mitochondrial DNA, for example, it appears that all humans are descended from a single female
who lived less than 200,000 years ago, or roughly 10,000 generations. It is therefore hardly the
case that evolution has had its time to refine our more human-specific attributes, as it has for most
bacterial traits. It is also true that the population size for humans has been pretty puny for most of the
relevant period, so the number of interesting variants tested is rather small. (iii) Then, while I believe that
most human behavior is profoundly influenced by the genotype, it is a mistake to assume that this
relationship is simple and therefore that it can be simply selected. That is, the more complex the set of
genes whose products are involved in a given behavior, the less likely that improvements in this
behavior will be readily selected for in a limited time frame. (iv) Perhaps more importantly, it is probably a
mistake to spend a lot of time asking "why?" We are on firm ground when we ask "what" and "how," but "why" is always fairly unanswerable, albeit interesting. First, we cannot repeat evolution to test
hypotheses on why something happened. Second, we are always extremely good at rationalizing virtually
anything, so it is not as if our ability to come up with a hypothesis that is consistent with the data should
give us the slightest confidence in the correctness of that hypothesis. For these reasons, then, I would
suggest casting a baleful eye toward those who would explain the behavior, or much of any other human
property, in terms of evolution.
Increasingly, we think about evolution through molecular phylogenetics, which is defined as the
study of phylogenies and processes of evolution by the analysis of DNA or amino acid sequence data,
where phylogenies are the hierarchical relationships among organisms arising through evolution. The
process, and the analytical methods used, have undergone a huge evolution themselves, both because
increased interest has brought more intellectual rigor and because of the huge growth in data bases. See
TIG19:696[03] for review on the phylogeny of eukaryotes; NatRevGenet 4:275[03] for a review of the
comparison methods; and Bioessays24:203[02] on the general approach.
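To make the flavor of this concrete, below is a minimal sketch in Python of the distance-based end of molecular phylogenetics: compute pairwise p-distances from pre-aligned sequences and cluster them by UPGMA. The taxon names and sequences are invented for illustration, and real analyses use explicit substitution models and likelihood or Bayesian tree searches rather than this toy clustering.

    from itertools import combinations

    # Hypothetical pre-aligned sequences (gaps would be "-").
    aligned = {
        "taxonA": "ATGCTAGCTAGGT",
        "taxonB": "ATGCTAGCTTGGT",
        "taxonC": "ATGATAGCTTGAT",
    }

    def p_distance(s1, s2):
        """Fraction of aligned, ungapped sites that differ."""
        sites = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
        return sum(a != b for a, b in sites) / len(sites)

    # Distances keyed by an unordered pair of cluster names.
    dist = {frozenset(p): p_distance(aligned[p[0]], aligned[p[1]])
            for p in combinations(aligned, 2)}
    clusters = {name: 1 for name in aligned}  # cluster -> number of leaves

    while len(clusters) > 1:
        # Merge the closest pair; UPGMA averages distances, weighted by leaf count.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merged, na, nb = f"({a},{b})", clusters.pop(a), clusters.pop(b)
        for c in clusters:
            dist[frozenset((merged, c))] = (na * dist[frozenset((a, c))]
                                            + nb * dist[frozenset((b, c))]) / (na + nb)
        clusters[merged] = na + nb

    print(next(iter(clusters)))  # Newick-like topology, e.g. ((taxonA,taxonB),taxonC)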
Mechanisms of genetic change. The ASM2 series has a number of articles that are relevant here, from
pages 2627-2720.
Mutations of various sorts are clearly a central mechanism of change. For subtle modulations of
protein activity over time, base substitution mutations are key. Here we should simply mention that there
is variability in mutation rates as a function of chromosome position and growth conditions. For example,
studies on the evolution of a particular tRNA cluster suggest that tRNA genes may be hot spots for recombination, which affects their evolution (JBact171:6446[89]). But other issues covered in LT2, such as
the influence of fluctuating growth conditions on mutation rates, and the selections for mutators in rapidly
changing environments, are also relevant.
Duplications, deletions and insertion sequences all have roles in evolution, though their relative
impact is unclear. For example, detectable homology between genes indicates that multiple forms of
duplication have been involved in evolution, but the precise mechanism by which this occurred is unknown.
It has been argued that ISs (and other repeated elements) provide the homologous regions for
duplications to occur and that these duplications might even represent a sort of adaptive mutation (ASM2,
2256[96]). A review on transposable elements in chromosomal rearrangements - largely in eukaryotes - is
in ARG36:389[02].
As a mechanism of change, duplications imply that the organism must be willing to retain an extra
copy of a region while it mutationally plays with it. (As noted earlier, I assume that these extra copies are
in the form of non-tandem duplications, so that they are stable enough for advantageous mutations to
occur with some frequency.) While it does cost something to replicate DNA, it is clear that DNA is cheap
based on the amount of baggage that bacteria carry around. It is, however, even cheaper in eukaryotes - there is a claim that 75% of the DNA in maize and barley is that of transposable elements. Massive
duplications were apparently common in the evolution of Arabidopsis, since 17% of the genome is
arranged in tandem arrays of two or more copies and 58% exists as readily detectable duplications
(Nat408:796[00]). The data on duplications in humans are covered in Nat408:796[00], with a broader
discussion in Sci300:1707[03]. Based on this, there seems to be little selection for having less DNA, but
there are clearly prokaryotes that have tried to minimize their genomes. Buchnera, a relative of E. coli, is
an endosymbiont that has reduced its genome to 641 kb, essentially by many small deletions. The driving
force for this is not certain, but may reflect the fact that it is polyploid and that smaller chromosomes
replicate faster (TIG17:615[01]). In other words, there is coincidentally a competition for replication among
the genomes of the cell. It is less clear what has caused the reduction of the genome of Carsonella ruddii,
an intracellular parasite of some insects, to only 160 kb (Sci314:267[06]). This organism appears to be
well on the way to becoming an organelle.
The non-mutagenic approach to evolution is horizontal transfer, which clearly has had a role in
bacterial evolution, but until genome sequencing, it was often technically difficult to recognize. Now of
course any horizontally transferred region underwent mutation and selection in other hosts, so it is not as
if horizontal transfer exists completely apart from mutation. Rather, for the evolution of a given genome,
horizontal transfer has consequences not readily matched by mutation and selection, since it opens a
genome to independent evolutionary pathways.
Identifying the presence of horizontally transferred genes. (See the general arguments on this by Eisen in
Sci292:1903[01] & CurOpGenDev10:606[00] and then PNAS100:9658[03] & CurOpMic6:498[03]). The
dilemma is that everyone believes that horizontal transfer has occurred and continues to occur, and
because of this belief we are perhaps too ready to accept even very weak evidence on that notion's
behalf (but see below with reference to Perna's recent data). For example, the fact that a gene has
somewhat novel GC content compared to the rest of the genome, or a rather unusual codon choice, is
certainly consistent with the notion that the region evolved in another genetic background and was
subsequently transferred, but hardly proof of that. There are such things as random fluctuations in
sequences and there might well be selections for odd GC content in specific regions of the genome that
we are unaware of. I believe that, as a generality, such evidence is strongly suggestive of horizontal
transfer as a mechanism (though certainly not conclusive proof for any specific example). However, that
view might simply reflect my prejudice in support of the hypothesis. Even weaker is the argument that
your gene must have been horizontally transferred because the best hit in the data bases for your gene is
from a relatively unrelated organism. This argument is obviously based on the absence of data for other
organisms and such negative results are never compelling. The strongest arguments in support of
horizontal transfer are those based on phylogenetics: where it has been demonstrated that the
phylogenetic lineage of a given gene is different from that of the bulk of the genes in the organism,
including those for the 16S rRNA. This has been done in many cases, but these represent only a small
fraction of the claims for transfer.
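To illustrate the weakest of these arguments, here is a hedged Python sketch of a GC-content screen; the gene names and sequences are made up, and an outlier is only consistent with foreign origin, never proof of it.

    from statistics import mean, stdev

    # Hypothetical coding sequences; a real screen would use every gene in a genome.
    genes = {
        "geneA": "ATGGCGCGCGGCGCGGGCTAA",
        "geneB": "ATGAATAATTTAATATTATAA",  # unusually AT-rich for this "genome"
        "geneC": "ATGGCGAAGGCCGGAGCGTAA",
        "geneD": "ATGGCCGGCAAAGCGGCGTAA",
    }

    def gc_fraction(seq):
        return (seq.count("G") + seq.count("C")) / len(seq)

    vals = {name: gc_fraction(seq) for name, seq in genes.items()}
    mu, sigma = mean(vals.values()), stdev(vals.values())

    for name, v in sorted(vals.items(), key=lambda kv: kv[1]):
        print(f"{name}: GC = {v:.2f}, z = {(v - mu) / sigma:+.2f}")
    # In a real gene set, genes with |z| well beyond ~2 would be candidates for
    # horizontal transfer - suggestive, but hardly conclusive.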
What are the barriers to gene transfer? There are actually a couple of completely different issues
here. There must be a means for moving the genes from one organism to another, a mechanism for
stably recombining them in the genome, and then a selection for their retention. By this definition, I am of
course only referring to those genes that we detect as being transferred and not to all sorts of
unproductive movement of DNA that might be taking place, since this has no impact on evolution. These
different issues then directly affect the sorts of genes that are transferred.
The actual movement of DNA might be through transformation, transduction or conjugation.
Doubtless transformation is of great relevance to the relatively small number of microbes that are naturally
transformable, but I assume it is irrelevant to the rest, so we will ignore the mechanism with this caveat.
Transduction can also move DNA, and it has been argued to be a significant factor in horizontal transfer,
but I think this depends on what DNA you care about. There is no question that there are a lot of old dead
viral genomes lying around in most microbial genomes and that this does reflect a sort of gene transfer.
However, I think that it is not a very interesting one for more evolutionary matters, because most of these
only move viral genes. This is not to say that such genes can never be interesting to evolution, but rather
that they are typically only interesting to viruses. While I certainly believe that viruses can also move
prokaryotic genes by generalized or specialized transduction, I do not see this as a major pathway for
such movement for two reasons. First, phage do not move host DNA very often or move very much DNA
when they do. Second, the host requirements for transduction tend to be rather narrow, such that
transducing phage usually only move host DNA to closely related organisms. This is fine and certainly counts as horizontal gene transfer, and might be relevant for a mutator strain to return to a mut+ genotype, for example. However, such transfer between closely related strains is unlikely to explain a very odd GC
content or a screwy phylogeny. (Having made this argument for years, it dawns on me that I really do not
know that phage cannot transfer DNA to distantly related cells from their normal host. What I think that I
do know is that phage typically cannot plaque on cells that are much different from their normal hosts, but
it is plausible that they can inject their DNA in many cases without yielding viable progeny, and this is all
we really need for horizontal transfer. So while I remain skeptical, I have little basis for that skepticism.)
So that leaves us with conjugation, which seems to provide all that we require. Donors can transfer DNA
to almost any cell, sometimes including eukaryotes, and large amounts can be transferred. There would
be restriction systems to overcome, but this is a consideration for any transfer even between rather
closely related organisms.
So what happens in the recipient cell? Obviously if the transferred DNA is a replicon, then you're
all set because it need not recombine with anything in the recipient to replicate. If the transferred DNA is
chromosomal, then there are two issues: it needs to recombine with a replicon in the cell and it needs to
provide an advantage, or at least not a significant disadvantage, to the recipient or the recipient will not
prosper in the population. Remember that the mismatch repair system dramatically reduces the ability of
slightly different sequences to form heteroduplexes in recombination, so that the greater the difference
between the transferred gene and the chromosome sequence the less likely that recombination, at least
homologous recombination, will occur. While RecA is able to align sequences with only 70% identity in
vitro, differences in vivo of as little as 10% effectively eliminate recombination because of mismatch
repair. This problem for gene transfer is substantially eliminated in mismatch repair mutants (Genet150:533[98] and ASM2,2250). This has implications for adaptation through acquisition of
heterologous genes (Genet164:13[03]). Obviously rare events still happen and not all recombination is
homologous, nor are all recipients in possession of a functional mismatch repair system. In any event,
some sort of aberrant recombination event would have to occur to allow some very different incoming
genes to be retained.
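The divergence argument reduces to a trivial calculation; the following toy Python snippet (hypothetical sequences, with the ~10% figure quoted above used as a rough cutoff) asks whether a donor fragment is similar enough to the recipient locus for homologous recombination in a repair-proficient cell.

    # Toy sequences; real recombining regions are hundreds of bp or more.
    donor     = "ATGGCTAGCTTAGGCTAGGCTACCGGTTAGGA"
    recipient = "ATGGCTAACTTAGGCTAGGCTACCGGTTAGGA"

    identity = sum(a == b for a, b in zip(donor, recipient)) / len(donor)
    print(f"identity = {identity:.1%}")

    # Rough in vivo rule from the text: ~10% divergence effectively blocks
    # homologous recombination when mismatch repair is active.
    if identity < 0.90:
        print("effectively blocked in a repair-proficient (mut+) recipient")
    else:
        print("recombination plausible; more so in a mismatch repair mutant")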
So what sorts of genes might provide a selective advantage for the host, such that they would be
retained? (Here I am assuming transfer between rather different organisms. Transfer between rather
similar organisms is addressed below.) One example is a gene encoding drug-resistance in an
environment with antibiotics, but this situation is simply too unusual to explain the apparent transfer that
we have already detected. While it might well be possible that basic functions like 16S rRNA, polymerase
or central metabolism activities from one organism would work reasonably well in another, there is no
reason whatsoever to assume that they would provide an advantage. In other words, all organisms have
tuned their central metabolism functions so that they work well together: they are transcribed
appropriately, translated well, and the protein products interact as they should. An incoming gene simply
cannot be expected to improve on those properties, and seems more likely to cause a problem. I believe
this is the reason that phylogenies based on such genes tend to give consistent results with each other,
including those from 16S rRNA. It is not that such genes cannot be transferred or inherited, but rather that
there is no advantage to the recipient, and some obvious potential disadvantages.
We are therefore left with genes for non-central metabolism as transfer candidates (between
rather different organisms), but this actually makes sense in a couple of broadly important areas. First,
catabolic functions have obvious advantages for transfer. A perfectly functional organism might find itself
in an environment where nitrate is the only nitrogen source, so inheriting the catabolic genes for nitrate
utilization would be a terrific benefit. Moreover, these are precisely the sorts of genes that would be lost
over time: An organism that never saw a nitrate-rich environment would tend to lose such genes even if
its ancestor had them, but a change in environment would change its needs. As a consequence, genes
and operons whose products allow the degradation of compounds make terrific gene transfer candidates.
I think that it is not a coincidence that such genes are almost inevitably clustered, while genes for central
metabolic functions are not. For example, the nif genes (for nitrogen fixation) are almost always in a single
location, which suggests that they might well have been transferred as a group at some point - note that
transfer of part of the cluster provides no advantage whatsoever. In contrast, the genes for the
biosynthesis of many amino acids are found in multiple locations in most genomes. Presumably they have
been in that genome for so long that random mutation has moved them around without an impact on their
function.
The other candidate genes for transfer are those that allow a completely different lifestyle, such
as the ability to be a symbiont or pathogen. In this case, there certainly are a lot of genes involved and it
is not as if a bug that never saw a human cell could inherit a region or two and become a human
pathogen. Rather, it is that there are a large number of strategies necessary for living in and around hosts
and it is apparent that different symbiont/pathogens have traded these functions over time.
Nicole Perna has opened my eyes to a different aspect of all this (GenomeBiol7:R44[06]). She
has compared the genomes of seven different E. coli strains (and she lumps Shigella in there as well) and
has seen regions within a single gene of one strain that appear to have different histories: One part of the
gene had multiple changes that matched a different E. coli, and another portion of the gene had a set of
changes that matched a third strain. In other words, two different gene transfer events could be traced
within a single gene because of a modest but significant number of sequence differences. There are a
number of interesting points to be made here: (i) While you cannot detect gene transfer between very
similar organisms based on obvious GC content differences, you can detect transfer if there are multiple
sequence differences between different alleles in your data set. If your data set is rich enough, you can
break the transferred bits to very small regions. (ii) Such transfer and recombination is of course very
reasonable between rather closely related strains because mismatch repair will not serve as much of a
barrier. (iii) Because you are exchanging DNA between two organisms with very similar metabolic
systems, such transferred regions will almost always work fairly well (i.e. they will not be grossly out of tune with
the rest of the metabolic system, as will often be the case for genes from rather different organisms). This
means that genes involved in central metabolism might well be exchanged, though of course it also
means that they will not likely cause a dramatic positive or negative effect. Curiously, Nicole and
colleagues see that genes whose products are involved in protein synthesis, cell division, stress, and RNA
and protein metabolism are under-represented in her set of apparently transferred genes, while DNA
repair and replication, motility and (amazing to me) biosynthetic pathways were over-represented. She
argues that about 25% of a given E. coli genome was transferred horizontally but not by homology
(pathogenicity islands, phage, MGEs etc), while 5-10% has been transferred horizontally AND BY
homologous recombination. Note too that this latter number will only go up as we have more similar
genomes to compare, since it allows the detection of ever-smaller transferred units.
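A cartoon of the underlying logic - my sketch, not Perna's actual pipeline, with invented strains and sequences - fits in a few lines of Python: for one gene aligned across strains, record which reference strain each informative site of the query matches. Runs of sites matching one strain followed by runs matching another are the signature of separate recombination events within the gene.

    # One gene, pre-aligned across a query strain and two reference strains.
    query = "ATGAAACCTTTGGGAAATCCC"
    refs = {
        "strain1": "ATGAAACCTTTGGGTAATCCC",  # differs from the query late in the gene
        "strain2": "ATGTTACCTTTGGGAAATCCC",  # differs from the query early in the gene
    }

    for pos, base in enumerate(query):
        matches = [name for name, seq in refs.items() if seq[pos] == base]
        if len(matches) == 1:  # informative site: query matches exactly one strain
            print(f"site {pos:2d}: matches only {matches[0]}")
    # Output: the early informative sites match strain1, a later one matches
    # strain2 - a mosaic history within a single gene.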
In light of all of this horizontal transfer, I used to find it somewhat odd that similar sets of ISs were
not found in related bacterial species, or even among E. coli strains, since that suggests that horizontal
transfer is not all that common. That is, if horizontal transfer were very common, then certainly ISs would
have plenty of opportunity to move and all related strains should have the same set of ISs, although
certainly not in the same sites. Apparently, however, the notion that each organism has only very different
ISs was based on too small a data set of strains, since there is better agreement in a larger data set
(Genet133:449[93]). An example supporting horizontal transfer is the case of the msDNA where a large
random set of E. coli were examined for msDNA and 13% were found to have these structures, but this
group did not correlate with any other taxonomic determinants, suggesting the msDNA might have been
recently acquired (JBact172:6175[90]). In general, conjugative plasmids are fairly common in natural Ec
isolates (Genet143:1091[96]).
It has been argued that horizontal transfer has taken place in fungi, though I do not know how
compelling the data really are in terms of phylogeny (AR-Phytopathology38:325[00]). It has also been
argued that 0.5% of human genes were copied into the genome from bacterial sources (TIG17:235[01]),
but subsequent analysis has shown that the vast majority of the elements of this claim were based on
poor data, which is illustrative of the problem: The original analysis sought human genes that were found
in bacteria (and vertebrates), but not in non-vertebrates. The problem was that the sample set of non-vertebrates was small and possibly non-representative, so that many of the genes just happened to be
missing from these genomes. In fact a larger non-vertebrate data set has turned up all but a handful of the
candidate genes, which invalidates the notion at least for these genes. It is possible that the rest will eventually be similarly invalidated based on even larger data sets. The point is that it is dangerous to draw conclusions based on the absence of something, especially on its absence from a small data set.
Constraints and driving forces.
(i) Variable environments. Organisms need to both optimize growth under existing situations as
well as be as adaptable as necessary to deal with possible environmental changes. I believe that many
bacteria probably see this type of punctuated equilibrium in their environments more than do animals, but
I'd hate to try to defend this view. The fact that many bacterial species clearly exist as clonal populations
(i.e. one sees all the cells of a given organism in the environment being very closely related) supports the
notion that bacteria do behave this way (ASM2,2663[96]). In other words, clonality means that the
population has undergone a dramatic selection, which implies that the environment became somewhat
different from that to which the organism had already evolved. Bacteria in certain environments show
radically different rates of change, as populations, than they do in others (see LT2). This has been tested experimentally where identical cultures of a soil bacterium (Comamonas sp.) were cultured for 10^3 generations in liquid or solid medium. Cultures on solid medium showed greater variability and a greater
degree of change in fitness from the parent (PNAS91:9037[94]). The notion was that solid media provide
a more diverse set of microenvironments, so reinoculation causes a significant environmental change for
specific cells. Highly variable environments select for mutator strains that typically lack the mismatch
repair system (see LT2) which also has the implication that they more readily recombine heterologous
DNA into their genomes following horizontal transfer (Genet164:13[03]).
(ii) Biochemical considerations of gene product function cause constraints on allowable genetic
change. A gene whose product had several different metal clusters, and therefore several motifs for
holding those clusters, might be seen to change less because a high frequency of mutations affecting
those regions would have deleterious consequences.
(iii) Not all genetic possibilities are tested. The low probability of multiple simultaneous genetic
changes, coupled with the limited set of genes to experiment with, means that many genetic possibilities
are not examined. In a sense, evolution is smart in that it varies a theme that already provides some
function (instead of completely random sequences, for example), which increases the likelihood that a
functional product will arise. However, the downside to this is that evolution is largely a prisoner of its own
history - it does not explore radically different paths because, by definition, these cannot arise at any
reasonable frequency.
(iv) The coordinate evolution of codon choice and tRNA abundance affects the evolution of the
genetic code as well as the functionality of horizontally transferred genes. That is, when genes are
transferred into a new background with very different codon usage, they will simply not be translated very
efficiently (ASM2, 2627[96]). Having said that, it remains unclear how strong a selection for optimal codon
use is actually present. A mini-review on the topic of the data and theories concerning optimization of
translation (i.e. the balancing of tRNAs and codons) suggests that only highly expressed genes are under
a strong selection for optimal codon use (Genet149:37[98]). (A toy codon tally is sketched at the end of this list.)
(v) Gene conversion is a term used to describe the situation when two different alleles of a given
gene in the cell undergo a recombination event with the net result that one ends up identical to the other.
What is presumed to happen is that a strand of one copy is exchanged to the other duplex, where it is
used as a substrate to correct the complementary strand of the other copy. The exchanged strand
presumably returns to its original duplex and the corrected strand is then used in another cycle of repair to
correct its complement as well. Obviously this mechanism, or something like it, can act to reverse genetic
drift between two genes that have resulted from a duplication event (JMB211:395[90]). I assume that this
is also the reason that all the genes encoding 16S rRNA in a given organism are typically identical.
Without such a mechanism, one would expect a fair amount of sequence drift among the copies of these
genes.
This, however, raises the following paradox: if duplications are the substrate for most evolution
and the process relies on random mutations in the newly created copy, how can these mutations be
retained, and therefore a new function developed, if gene conversion is constantly homogenizing things?
A role for gene conversion in some aspects of evolution has been proposed (TIG17:177[01]).
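Returning to point (iv), here is the toy codon tally promised above, in Python with an invented coding sequence. A real analysis would group synonymous codons by amino acid and compare the gene's relative usage with genome-wide norms or with highly expressed genes.

    from collections import Counter

    cds = "ATGAAACTGCTGAAATTAAAACTGTAA"  # hypothetical CDS; must be a multiple of 3
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    usage = Counter(codons)

    for codon, n in usage.most_common():
        print(f"{codon}: {n}/{len(codons)} = {n / len(codons):.2f}")
    # A transferred gene rich in codons that are rare in its new host will be
    # translated inefficiently, whatever its other merits.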
Evolution of genes and pathways. (ASM2, 2638 & 2649 [96] on the evolution of metabolic pathways)
Note that much of the data implied in the following is based on our ability to detect evolutionary
relatedness and that there is a limit to how much two regions can have drifted and still be detectable as
homologous.
Genes evolve by simple mutation of existing (duplicated?) regions and by domain reassortment.
The latter can occur either by illegitimate recombination or intron reassortment (see below). Duplication
can either utilize similar functions already dedicated to the same pathway or recruit functions from other
pathways. As an example of the former, Horowitz has coined the term "retrograde evolution" for a situation
in which a pathway is created by successive duplications backwards through the pathway, until it ties that
pathway into other metabolic paths (JBact171:6084[89]). An example of the latter is pdxB of Ec
(JBact171:6084[89]). In this case, enzymes for pyridoxine (vitamin B6) and serine synthesis are very
similar and therefore one was probably recruited from the other pathway. There are four different systems
that can be mutationally recruited to allow β-glucoside utilization in E. coli lac mutants (Genet119:485[88])
and almost certainly these all normally degrade similar but non-identical sugars.
A separate and more complicated issue is the co-evolution of regulatory proteins and their target
sites. In other words, for a given site regulated by a given protein, how can you tolerate change in either
one (to evolve to a new regulatory role), since a change in either would likely be deleterious to the present
function? The answer is unclear, but the c1 activator of phage P22 binds at a site within its own coding
region, which allows the following: A single mutation has been found that changes both the protein and
the site such that they continue to form a functional pair, though neither interacts properly with the wild-type version of its partner (PNAS90:9562[93]). This can hardly be a very general solution to the problem,
however.
The role of introns in evolution is another curiosity. Lewin has argued that introns allow the cell to
experiment with different (RNA) deletions through splicing without making more permanent DNA
deletions. Introns also allow assortment of protein domains, in at least some cases, since exon:intron
boundaries often correlate with the boundaries between protein domains so that added modules tend not
to disrupt existing structure. The ongoing debate as to whether introns arose early (and have frequently
been deleted) or arose late remains unresolved, but see TIG14:132[98] and CurOpGenetDev12:701[02].
Finally, there is the slightly different matter of the evolution of portions of selfish DNA, such as
mobile genetic elements, phage and plasmids. The general relatedness of different lines of IS elements
from different organisms has been noted. Indeed, it has been argued that most ISs in a given host are
very similar to each other, as if they have been recently acquired. This in turn suggests that the elements
are lost with some frequency as well (CurrOpMicro9:526[06]).
Concerning plasmids, it has been my prejudice that the ability to replicate should be sufficient to
allow plasmids to exist, but this is apparently not correct. It is not the presence of genes beneficial to the
host that explains them either, as these genes would be recombined in to the chromosome and the
plasmid would lose its edge. Instead it has been argued that it is only the ability to move newly selected
genes into organisms that can explain their presence (Genet155:1505[00]). I have abbreviated the
argument and probably mis-stated it a bit, but it is unclear to me if one can therefore explain the ubiquity
of non-conjugative plasmids.
Evolution of prokaryotic species.
Molecular taxonomy. (For Woese's proposal for three domains of living systems: archaea, bacteria and
eukaryotes, see PNAS87:4576[90] and MicroRev58:1,10[94]. See Genet152:1245-1447[99] for a series of
mini-reviews on all aspects of archaeal genetics and biology.)
The 16S genes have been termed "chronometers" for evolution. However, is 16S taxonomy any
better than any other molecular clock? This breaks down into two rather different questions: (i) Is this
phylogeny any more likely to be correct than any other gene in describing the whole organism/species?
(ii) Is the phylogeny based on 16S RNA any better a predictor of unexamined traits (for that is the role of
such groupings after all), than similar trees based on more easily observable traits?
I think the answer to the first is a very clear yes. Traditional phylogenies were based on easily
detected differences, especially on nutrient source utilization (which might be the best genes for horizontal
transfer anyway), so they did not reflect the evolutionary history of the microorganism. Phylogenies based
on 16S rRNA are much better, largely because these are one of a number of genes encoding central
metabolism features that simply do not move between organisms for the reasons argued several pages
back. There are some issues here, of course, such as why all 16S rRNA genes in an organism tend to be
identical, but the coherence of phylogenies based on such central genes is comforting for the entire
approach. As a cautionary note, however, there is one well-characterized case where the 16S phylogeny
fails to match that from 23S and a number of other genes, implying horizontal transfer of the gene(s) for
16S (IntJSysEvolMicrobiol55:1021[05]).
The second question is trickier to answer, because these other properties (such as catabolism
and virulence) are the very things that are more likely to change through horizontal transfer. As a
consequence, the approach is useful for some predictions, but hardly perfect for predicting the sorts of
genes likely to be transferred.
The definition of species. (a species is a group of interbreeding individuals) Unlike species of higher
organisms, horizontal transfer continues to be a possibility for prokaryotes, even between relatively
unrelated organisms. While horizontal transfer is not frequent in the bench science sense, it is frequent
over evolutionary time scales. For example, sequence analysis indicates that there has been substantial
transfer between archaeal and bacterial hyperthermophiles. It follows that very different microbes that
inhabit a particular environment are excellent candidates for having participated in genetic exchange at
some point in the past. Many such events will not be recognizable if the organisms are somewhat related
to start with, because the sequence differences in genes from the two organisms will not be sufficiently
striking to verify transfer. Even extremely basic properties, like the ability to grow anaerobically, have
been shown to be transferable at imaginable frequencies (JBact180:3137[98]). To the extent that
horizontal transfer occurs, any attempt to identify phylogenies of species is less meaningful, since it
means that the recently transferred traits do not fit that phylogeny. On prokaryotic speciation, see the
numerous relevant arguments in ASM2 starting on pp.2620.
The species E. coli ..."consists of a set of widely distributed clones rather than a vast array of
allelic combinations expected in a frequently recombining species" (Genet120:345[88]). (I believe this is
so because enterics can move through their host populations fast enough that the world is one slightly
dysfunctional chemostat in terms of enteric bacteria. Therefore similar isolates are similar because they
have only recently been separated in time. This argument is not true for bacteria in other less fluid
environments.) The authors then argue that the clones are actually portions of the chromosome, because
of horizontal transfer of large regions carrying a favorable mutation (ibid, p359ff). It has certainly been
observed that different regions of E. coli (gnd and trp) diverge at different rates among wild type strains
and even give rise to different phylogenies (Genet113:s71[86]) (due to horizontal transfer??). On the other
hand, it must be asked how random these strains are, as well as how representative enterics really are as
models of general bacterial evolution. Similarly, all Mycobacterium leprae are genetically very similar by
RFLP (MolMicro4:1653[90]). This suggests that they are closely related and therefore the pool of
analyzed organisms is not diverse. However, this apparent clonality among enterics and some disease-carrying organisms probably results because we (their hosts) inadvertently select for the most productive
mutants.
A related topic is the use of genome sequences to determine the ancestral genome for a set of
existing organisms. This has been done by comparing both conservative and non-conservative base
substitutions between related organisms (TIG15:254[99]), with the conclusion that a reasonable guess
can be made with highly related pairs of organisms. Another analysis, which shows that the genomes of
related microbes can display profound differences in overall structure, is in ARG32:339[98] and a related
article ARG32:185[98].
The mismatch repair system provides a mechanism for speciation of relatively similar bacterial
types. That is, as genetic drift makes homologous regions unable to successfully recombine, due to
mismatch repair, the possibility of recombination keeping the two organisms similar is diminished.
This brings us to the question of whether or not there is even such a thing as the phylogeny of
species/organisms in the presence of horizontal transfer, or must each gene not follow its own phylogeny,
based on where its predecessors have been? In other words, while phylogenies based on a given gene
make sense, they only make sense for that gene, not the organisms in which they are found. See
PNAS100:9658[03], CurOpMic6:498[03] and TIGS16:196[00] for references on the implication of gene
transfer for taxonomy. Again, I think this is a case where biology has become too complicated for our
terminology. The term species clearly has a very different and weaker meaning with prokaryotes with
horizontal transfer than with eukaryotes, since the chromosomal structure (among other things) severely
lowers the likelihood of genetic exchange between dissimilar organisms. On the other hand, there is
substantial evolutionary coherence to the central metabolic properties, and therefore to the genes that
encode those properties, in most all microbes.
Detectability of organisms/species. Until about 1990, the range of organisms that we put on our
phylogenetic trees was limited to those organisms that we knew how to culture. However, it has been
demonstrated, through the use of PCR to amplify conserved regions of 16S rRNA from the environment,
that we are only able to culture about 10% of the bacteria that are present and that many of the non-culturable ones represent apparently novel species (Nat345:20,60,63[90]). Similar results were obtained
in the marine picoplankton community, with a PCR-amplification (JBact173:4371[91]). This latter paper is
interesting for the tricks they performed to avoid having their PCR primer select the better matches in the
population; the possibility of a bias in any step of the amplification procedure is a very serious one. A very
cute system has been devised to detect an organism based on its 16S sequence and then optically trap it
with a laser to remove it from the other organisms for culturing (Nat376:57[95]). It is unlikely that this
method will be broadly applicable! A nice summary review of microbial detection and its implications for phylogeny and evolution is SystBiol50:470[01].
I was always curious about how many prokaryotic cells there are in the world (relative to other
organisms) and I finally found a paper that did the guesstimate (PNAS95:6578[98]). They argue that there
are about 10^29 bacteria in aquatic habitats, 2 x 10^29 in soil (top 8 m), but an astounding 5 x 10^30 in the subsurface below the soil and oceans. The number living in and on animals is a puny 5 x 10^24.
Prokaryotes have about the same total carbon as plants (which is to say about half the carbon in
biomass), and about 10 x as much total nitrogen and phosphorus, and therefore represent the majority of
these elements in the biomass on earth. Finally, they probably make up a majority of the living protoplasm
on earth.
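For what it is worth, the arithmetic behind those statements is easy to check; the Python below just sums the published numbers (PNAS95:6578[98]).

    aquatic    = 1e29  # cells in aquatic habitats
    soil       = 2e29  # top 8 m of soil
    subsurface = 5e30  # below soils and ocean sediments
    in_animals = 5e24  # living in and on animals

    total = aquatic + soil + subsurface + in_animals
    print(f"total prokaryotes ~ {total:.1e}")              # ~5.3e+30 cells
    print(f"subsurface share ~ {subsurface / total:.0%}")  # ~94% of all cells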
Evolution experiments with microbes. Microbes present a unique set of organisms with which to test many
predictions and hypotheses of evolution theory because they actually grow fast enough to do experiments
with meaningful numbers of generations. Twelve identical cultures of Ec were grown for 10^4 generations
and analyzed for differences. (Note that this is roughly the number of generations in the entire history of
Homo sapiens sapiens.) Among the conclusions were that change was initially rapid and then slowed;
and that the end populations differed from each other for a range of properties including (surprisingly?)
fitness (PNAS91:6808[94]). This supports the Sewall Wright hypothesis of fitness peaks in evolution. (As
a very rough and brief description of this theory: the effect on fitness, or perhaps on the phenotype in
general, of a mutation is a function of the other mutations in the cell. As a consequence, two different
daughter cells with different advantageous mutations have potentially different sets of mutations that
would provide further advantage. Eventually each line could achieve a level of fitness that was high, but
for such different genetic reasons that they were each in completely different physiological and genetic
situations - hence the term "islands". Apologies to Wright and all real geneticists.)
Richard Lenski has monitored more than 50,000 generations of E. coli for cell size and fitness.
His results are important for various reasons, but not the least is that he is actually doing evolution in real
time, rather than monitoring its results after the fact. A paper on some recent results is Nat461:1243[09]. At
least for me, one of the striking results is that some of his cultures became mutators, even though the
fitness gain from any beneficial mutation was very small, which would seem to be more than
counterbalanced by the cost of being a mutator (see the discussion in PlantBreedingRevs24:225[04], p.246).
Microbial communities. It is clear that most microbes exist in rather complex communities that affect their
biology in absolutely critical ways. In some sense, we are as hard-pressed to say anything important
about microbes in isolation as we are to talk about the behavior of any given cell that has been excised
from a multi-cellular organism. Now obviously this is a bit strong, as there is a substantial body of subcellular biology that is worth studying, but behavior of isolated pure cultures is or only modest relevance to
the real world. If we then accept that it would be valuable to talk about the environment of a microbe, we
are presumably talking about its microbial community. What would we need to know in order to say
something smart about that community? (As a bit of an aside, realize that almost all plants and animals
are actually associated with large communities of microbes that have dramatic consequences for the host - both positive and negative. There are perhaps ten times as many microbial cells in and on us as there are
human cells.)
If one wants to talk about a population, you need to be able to classify that population into groups
(species and genera) and then quantify those groups. Then one would wish to follow fluctuations in those
group populations with changes in the environment of the community. Ultimately, one would wish to know
the contribution of each subgroup to the dynamics of the community, which might include specific
metabolism or specific physical interactions. The problem with all of this is that we cannot either name or
enumerate any reasonable fraction of any even moderately complex community - the complexity is simply
too great and the tools are not nearly sufficient. We have developed the ability to monitor the population of
specific groups, using nucleic acid probes, but in an environment of hundreds of distinct species, this
does not tell us much about the community as a whole. The problem is both intellectual and technical and,
while it is being addressed by a substantial number of very sharp people, I believe this will remain one of
the great challenges of biology for a while.
Evolution of pathogens. The issue here is how did pathogens for organisms such as animals evolve
before there were animals to infect? I think there are two general possible answers. One is made in ASM
News66:126[00] by Bernard Dixon, citing arguments by Michael Brown, where he notes that pathogens
might have evolved to infect simpler species, such as protozoa. Indeed, some animal pathogens such as
Legionella pneumophila actually thrive in amoebae (InfectImmun62:3245[94]). Dixon also argues that
many of the cellular functions that are useful in surviving host defense are also involved in tolerating
difficult environments, such as the role of RpoS in host defense as well as survival in stationary phase. I
might posit that another rather different general hypothesis exists: these pathogens didn't evolve to be
pathogens, but simply became the first microbes to successfully occupy that habitat. Essentially they have
evolved much faster than their hosts because of their relatively short generation time etc. Consistent with
this view, many of the genes that appear to be specific for pathogenicity are found to be clustered in
pathogenicity islands, an observation strongly consistent with horizontal transfer and recent exchange
(ARM54:641[00]). It might well be that there are a variety of useful properties for being virulent and, with
horizontal transfer, there is a strong selection for such genes to move through a variety of bacteria.
Novel Bacteria. (Yeah, I know this doesn't go here, but it's neat, so I must at least mention it somewhere)
Throughout this text there are a lot of things that bacteria are not supposed to have or do, including
introns, reverse transcriptase, linear chromosomes, multiple chromosomes, et cetera. One of our other
notions about bacteria is that they are small, and this may also not be true: at least some bacteria in the
Epulopiscium genus (fish symbionts) are up to 600 μm in length (PNAS105:6730[08]) and rumors
circulate that mm length bacteria have been detected near some shallow-sea vents. The Epulopiscium
have other amusing features, such as a massive number of genomes and a curious way of forming
multiple (2-7) internal progeny that are released from the dead mother cell. There is another giant
bacterium, Thiomargarita sp., with diameters of up to 0.7 mm, though almost all the volume is in the form
of a vacuole (Science284:493[99]). At the other end of the size spectrum is the claim of free-living
archaea with volumes less than 0.006 μm³ (Sci314:1933[06]).
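To put that size range in perspective, here is a quick back-of-envelope calculation (mine, not from the cited papers; it treats Thiomargarita as a simple sphere and ignores the fact that nearly all of its volume is vacuole):

```python
import math

# Figures from the reports cited above.
thiomargarita_diameter_um = 700.0  # 0.7 mm diameter, Thiomargarita sp.
tiny_archaeon_volume_um3 = 0.006   # claimed free-living archaeon volume

radius = thiomargarita_diameter_um / 2
thiomargarita_volume_um3 = (4 / 3) * math.pi * radius**3  # sphere approximation

print(f"Thiomargarita volume: ~{thiomargarita_volume_um3:.1e} um^3")
print(f"Volume ratio: ~{thiomargarita_volume_um3 / tiny_archaeon_volume_um3:.0e}")
# ~1.8e8 um^3 versus 0.006 um^3, i.e. a roughly 3e10-fold range in volume.
```

In other words, prokaryotic cell volumes span some ten orders of magnitude, which makes "small" a rather unhelpful generalization.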
Evolution of eukaryotes. There are two separate issues here: the original evolution of eukaryotes and
their subsequent speciation. The first point is a fairly well-known story: the ingestion of bacteria into
archaeal cells and their subsequent processing into organelles like mitochondria and chloroplasts. Rather
less well-known is the apparent fact that these events allowed large numbers of bacterial genes to
establish themselves in eukaryotic genomes (TIG14:307[98] and TIG17:431[01]). However, there are
other transfer events, such as that which allowed an animal mitochondrial DNA polymerase subunit from
the Thermus-Deinococcus group to enter the eukaryote lineage, which must have occurred by a completely
different route (TIG17:431[01]).
Subsequent evolution of eukaryotes, especially sexual ones, was substantially different from that of
prokaryotes. For one thing, chromosome number, which appears to be altered rather easily in eukaryotes,
becomes a substantial block to mixing species, since homologous chromosomes can no longer pair
properly and aberrant chromosomal segregation will lead to lethal effects. This tends to isolate eukaryotic
lines in a way that does not happen for prokaryotes and their normal methods of gene transfer. However,
the possibility, and of course the demand in most eukaryotes, for sexual mating provides a completely
new route for genetic variation that is not available to prokaryotes, or at least not at that frequency.
Animals complicate things yet more by allowing some choice in mate selection. This can have a
dramatic effect on the direction in which a species evolves, but I am skeptical of the common view that
this mate selection is always appropriate for the good of the species. As I argued before, strongly positive
traits will probably be selected for and strongly negative ones will be selected against, but many of the
specific traits involved in male/female mate choice might simply be evolutionarily unimportant. Whichever view is
correct, there is no doubt that it is exceedingly difficult to do the critical experiment and determine, for
example, if bright wing patches on a red-winged blackbird imply a level of fitness that is actually important.
Do you believe in evolution? This question reflects the way the evolution debate has been couched for
a number of years and it has always left me uneasy. The problem is in the semantics of the word
"believe." My dictionary defines it as "the acceptance of the truth or reliability of something without proof"
and there are aspects of this that make the scientist in me pause. The first point is that I do not think
science deals in truth, but this is the issue raised in the first paragraph of the next LT, so I'll say no more
here. The second issue is that saying "I believe in evolution" makes it sound like this is intellectually
similar to saying that "I believe in Zeus." (The specific example is an effort to avoid offending any current
religious belief.) That is, wouldn't "belief" in evolution be as "beyond rational" as any religious belief
("beyond rational" is NOT pejorative - virtually by definition, belief in the supernatural is, well, belief in the
beyond natural/rational)? I think the answer to this depends on the definition of "belief." (And I do not
think that beliefs in evolution and Zeus differ because of the existence of proof in the case of evolution.
There is certainly data consistent with evolution, but not exactly proof of it, and we cannot actually rerun
the experiment. I also note that virtually all religious beliefs are consistent with the reality we see, so I
don't think that data is what separates belief in evolution from belief in Zeus.)
I think the fact is that we scientists do not, or at least should not, believe in evolution. Rather we
assume it to be correct. That is, as a scientist, I would readily dump evolution for a new theory if there
were new and compelling data for that theory and against evolution. Now of course I don't expect this to
happen, because we have such a mammoth amount of data consistent with the general view of evolution,
but I certainly accept the possibility of that happening. In rather stark contrast, someone who truly
believes in Zeus, or in any other supernatural notion, cannot have that view changed by new data. This
is again almost by definition. A religious belief is supernatural and mere natural facts cannot change that
belief. So I think it is quite fine for scientists to believe or not believe in Zeus or any other supernatural
notion, but I think that they should not believe in evolution. Rather, they should assume it to be correct
pending more data.
607 Lecture Topic 17.................SCIENCE AND SOCIETY
"Science is a way of trying not to fool yourself. The first principle is that you must not fool yourself, and
you are the easiest person to fool." Richard Feynman
"And so we have not disproved our hypothesis, we just haven't found a way of proving it yet." Anon.
What is science? My definition (with apologies to Popper) is that science is the generation and testing of
falsifiable models of reality that employ natural (as opposed to supernatural) conditions. As such, science
does not deal in truth (descriptions of reality that are correct everywhere and for all time), but in facts (the
products of reproducible experiments). The claim of addressing truth must be left to (religious) philosophy.
Indeed, and this is a nod to the notion of intelligent design, science essentially sets itself to describe the
world only in terms of natural phenomena. Now if it happens that aspects of the world can never be
described in terms of natural phenomena, which is what Intelligent Design essentially asserts, then the
scientific approach will never lead to the truth. This latter view may be completely correct, but it is not at
all scientific. As a proponent of intelligent design has stated "If we've defined science such that it cannot
get to the true answer, we've got a pretty lame definition of science." (NY Times 8.22.05) Fair enough, but
the vast majority of scientists are comfortable with the lame definition for one very good reason. The
intelligent design position essentially says that anything we do not currently understand must be
explained by recourse to the supernatural. By this argument, there is no need to do science because we
then have an explanation for everything, albeit a supernatural one.
Note that, by my crass definition, social science will rarely come under the umbrella of science,
for the simple reason that experiments cannot be reproduced and therefore hypotheses cannot be
falsified. This is not, in and of itself, a criticism of social science, as the field grapples with far more
important issues, in terms of their impact on humanity, than do the natural sciences. It is hardly the fault of
the field that it is stuck with the most difficult experimental animal (man) one could imagine. Still, this little
problem does affect the confidence one can have in the conclusions in the field. I also think that it affects
the very mindset of the participants in the field concerning the nature of proof. In other words, in a field
where nothing is falsifiable, or verifiable, you lose the sense of what proof even means.
Louis Pasteur said that "There are no applied sciences....There are only applications of science,
and this is a very different matter...The study of the applications of science is easy for anyone who is
master of the theory of it." As Pasteur noted, science and its applications are very different, even if the
tools are the same. Basic science strives to improve our model of reality, where applied science
(technology) is arguably the effort to bend perceived reality to our will. In basic science any result of a
properly done experiment is interesting. In applied science, the only interesting results are those that bring
you closer to your specific goal. Either type of science can be good or bad (from a scientific point of view)
depending on the creativity and care with which it has been done. The moral question in either case is, to
what extent is science (and scientists) responsible for the technology it renders possible?
Science is probably neither culture- nor sex-neutral. I think we can agree that objectivity is neutral
(by definition) in these matters and therefore the interpretation of results should also be. The bias comes
in terms of what questions are being asked. For technology, this bias is of course even more dramatic,
because technology will be driven almost exclusively by economic forces and these can rarely be
accused of being enlightened or objective. Put more baldly, both science and technology do the bidding of
the elements of society controlling the funding. Science therefore tends to support the establishment, the
status quo, the moneyed interests.
Obviously, science produced by companies does not have the best interests of the society as the
main goal, but public science (federally funded) is not necessarily employed in the general good either. In
our society, the direction of the latter effort is a function of the ability of the electorate to understand
technical issues and communicate this to their elected representatives. Unfortunately, when neither the
electorate nor their representatives are technologically literate, the process often fails to serve societal
needs. It is therefore incumbent upon scientists, whatever their official job, to educate the larger society.
Admittedly this is becoming more difficult in a society where anyone with any form of knowledge is
derided as "elitist."
The role of scientists in society. In general, scientists should behave in a way that reflects their
appreciation of the above observations. For example, they should not see themselves as being purveyors
of indisputable truth, but rather as experts in a way of looking critically at reality. The most central element
in science is honesty: the bald desire for as much objectivity as we can muster. This has a variety of
forms.
As scientists we need to be honest and self-critical in the detailed interpretation of our results. We
have a tendency to believe results, however poorly achieved, if they conform to our expectations. This is
true even if we know the experiment has flaws. We assume the flaws were not so critical because we still obtained the
desired result. We do this because we want to have guessed correctly - to prove our own cleverness and
insight. When a result is not the desired one, we have a strong impulse to say that the experiment didn't
work. This challenge to our honesty is a matter that we face virtually every day. To the extent that we
deceive ourselves, we are not only less efficient as scientists (since we will waste time following wrong
leads), we fail as scientists.
Most of the time when we perform experiments, we are examining hypotheses that we ourselves
have created. As normal human beings, we have the problem that we want to be right; we want to be able
to say "I guessed the way nature works. This leads to a severe, but subtle problem in what experiments
we chose to do. When possible, you want experiments that yield interesting and interpretable conclusions
from any result. In point of fact, we often must design experiments where only one of several results is
interpretable. The danger is that we often choose to do experiments where the interpretable result will be
supportive of our model; but where the uninterpretable result is simply that and not a disproof of our
hypothesis. With some frequency we must frame experiments the other way, so that we actually do test
our models. This problem becomes severe as we become increasingly wedded to a particular model. In
one of the clearest descriptions of how not to do science, I once heard a seminar speaker say, "We have
not disproved our hypothesis, we just haven't found a way of proving it."
When one speaks as a scientist, you have an obligation to be critically honest. That is, the public
should expect and receive an objective analysis when a scientist is asked a scientific question. Examples
of failures here abound: scientists minimizing the technical difficulties of Star Wars; industrial scientists
overstating the advantages or understating the problems of a particular drug; academic scientists
returning the desired results on contractual agreements with companies so they will continue to get
contracts; academic scientists overstating the usefulness of their work to funding agencies or the press
(N2-fixing corn is around the corner?). The common thread in these is the corruption of honesty for
personal gain, directly or indirectly.
Publications. The point of scientific publications is to communicate to other scientists. As a
consequence, they should provide the reader with sufficient information to see exactly what was done and
what results were obtained. A good publication is therefore one with clearly stated assumptions, methods
and interpretations. It is not necessary that the conclusions of the publication turn out to be correct.
Similarly, careless and poorly described science yields a bad publication regardless of the correctness of
the conclusions, because it cannot serve to communicate useful information to other scientists. Carefully
done (and described) science will be useful to others, even if the conclusions were incorrect (presumably
because of an error in the assumptions).
There are, at the extremes, two views of the role of publications and therefore the nature of an
appropriate publication. One position holds that the role of publications is to communicate to the
community immediately concerned with the results, to speed the advance of the field. In this view, valid
publications are those that contain reproducible, useful observations. Saving data for a larger, more
conclusive and broadly interesting paper means that other scientists won't see the interesting results until
a later date. This extreme view supports rapid publication of relatively modest advances in our
knowledge.
At the other extreme are those who feel that there is simply too much scientific literature already
for very small advances to be very helpful. Instead, these people demand publication of results that are
interpretable to a broader community. In this view, the observations of value to a few specialists are
impenetrable to others. The result is a massive literature where few can tell the wheat from the chaff
outside of their own field. The current response of the scientific community to this problem is to have
different sorts of journals, each with different audiences and therefore with different requirements for
broader interpretability.
Orthodoxy. Science has an interesting dilemma in that the current dogmas are perceived to be more
likely correct than alternative models and therefore results that are consistent with current dogma are held
to a lower standard of evidence than are results that fly in the face of dogma. There is considerable
justification for this, as results in conflict with the prevailing dogma typically are wrong. However, this has
the sinister problem that a low threshold for results supporting the dogma can give the appearance of
broader independent support for the dogma than is correct. This is the reason behind the occasional
catastrophic collapses of certain models and even disciplines: once a dogma is established, it can
become a self-fulfilling prophecy. Worse, it is in a scientist's political interest to play the game. No one will
crucify you for being as blinded as everyone else, and few will support your attempt to overthrow the
current dogma, especially when weighty reputations depend on its correctness.
A similar theme exists with the responsibility of companies to be concerned with the effects of
their technology. It is in a company's legal interest not to seek information beyond conventional
wisdom, thus allowing a sort of deniability.
Scientific success. To discuss success, we should start by asking why we do science in the first place. I
argue that there are two general motivations: the desire to understand the world and the enjoyment in
participating in the intellectual game of science. While we all want to know things for their own sake, it is a
rare scientist who does not take immense pleasure in explaining their findings to peers, predominantly for
the very human reason that we want people whom we respect to say, "Gee, that was clever!" (This was
perhaps best said by a mathematician who stated, "All I have desired is the grudging respect of a few
colleagues whom I admire.")
With the honest fervor of someone who never has been, and doubtless never will be, the recipient of an
award in science, I argue that awards are not a constructive part of science, nor a measure of success.
Science is not the product of great scientists, just as history is not the summation of the deeds of the
occasional king or general. Science is the product of the scientific community and the larger society that
funds it, albeit unknowingly and often for the wrong reasons. Many famous and awarded scientists are
excellent and many unknown scientists are wretched as scientists, but the correlation is not what you
might expect. Much of fame and success is a function of two things, in addition to certain abilities: the luck
of being in the right spot at the right time and the ability to achieve name recognition, especially in
association with an important problem. About the role of great scientists: Ask yourself how many
discoveries are so novel that they would not have been found by someone else within a couple of years. I
suggest that there are very few and probably none in biology.
I think that it is probably true that the most important issue in perceived success is the choice of
project, rather than how good your science actually is. Being extremely good in an area that people don't
care about is a fine way to do science (no competition) but will never get you awards. Moreover, many of
the best-known scientists have been extremely single-minded in their attention to a given problem. This
has allowed them to become associated with that problem in the scientific community's mind. This is
neither wrong nor improper, but scientists with wider intellectual horizons are in a sense punished by their
inability to have a single problem associated with their name. Obviously this also has implications for
awards: awards are often given for the perceived progress in a field and the award goes to the perceived
leader in the field. Awards have several other problems (i) they have a life of their own (the winner of the
prestigious "X" award cannot be a nobody, it must be someone who has already received certain lesser
awards); (ii) as a consequence, awards tend to pile up on a few individuals for years after their great work
(this is a problem because it gives disproportionate credit to a few and ignores the progress made by the
scientific community); and (iii) very little creative science is done by those who are driven by the prospect
of awards. This impetus for awards produces poor science and destructive competition amongst scientists
(see the book The Nobel Duel).
To give an idea of how complicated awards are, at least in their relationship to the actual science
that they presumably represent, read the paper on Barbara McClintock in TIG17:475[01]. It explains that
she never thought that her work on transposition was so important, but rather was proud of the regulation
aspects of the work. The former gained in significance as it was discovered to be broadly relevant, while
the latter was eventually shown to be substantially incorrect. Now she certainly was a highly creative
important scientist, but what does this say about the process of awards? Remember, however, that one's
view of one's own work is often a function of how hard we had to work or how clever we had to be, and
not necessarily how broadly important the results turned out to be.
Luck. You have already seen, as graduate students, that some people enter a lab and start on a project
that is ready to go, and, if they are motivated, they publish massively. Another student might do better
science on a more difficult problem, but where less of the preliminary work has been done, and have
many fewer publications. Success is also affected by the quality and quantity of one's competitors, over
which we have relatively little control. Success and fame are often a function of the breadth of applicability
of the research. In many cases this is intentional (you chose the problem for that reason), but
occasionally, it is serendipity (in fairness, such serendipity happens repeatedly only to the prepared
scientist). Success can even be a function of timing: you've just finished your great work on Drosophila
development and entered the job market simultaneously with two other folks similarly successful in the
same area: MIT has an opening in developmental biology and so does SW Utah State and UW Beaver
Falls. It takes little imagination to guess which of these people will make it to the National Academy by
age 40. Heroes almost always prosper, and turkeys almost never do, but the bulk of us are competent,
reasonable folks for whom luck plays a significant role, a fact that few successful people seem to
understand.
Ability. The organization of science, particularly scientific management (either in academics or industry),
is extremely difficult. It demands both technical competence and managerial skills, so essentially no one
is fully equipped to pull the job off very well. Certainly to be a scientist you should be intelligent,
knowledgeable, honest, hard working, careful and imaginative; but you should also be a good (and rapid)
writer, a clear and interesting speaker, organized and an organizer, a good motivator, and, of course, no
one is all of these. If they were, they would have a different problem: they would be insufferable. Clearly, you must
concentrate on what you can do well and try to do the other things well enough to be adequate.
As a graduate student, you are here to learn to do science, but more importantly to become a
productive scientist and a satisfied human being in the context of your own particular strengths and
weaknesses.
I might argue that you have now reached the point in your career where, for the very first time,
intelligence no longer matters much. The point is that all of your colleagues and competitors are intelligent
and what matters now is productivity. While intelligence is not irrelevant to productivity, it is not as
important as motivation, organization and maturity, all of which are "acquired skills."
Molecular biology and the evolution of science. Scientific problems evolve and each generation of
scientists chooses and solves their chosen problems, typically those appropriate to the understanding and
tools of the time. Do not confuse methodology with science: better technology looks spiffy, but does not
indicate that today's scientists are any more knowledgeable or insightful than their predecessors.
Continue to ask biological questions. Don't be arrogant concerning the primitive work of our predecessors
or contemptuous of new technologists when they threaten to displace you.
*************************

INDEX
2μ plasmids, 98
alkylation, 33
allele number, 3
allostery, 58
Ames test, 39
amplification, 74,77
antisense RNA, 96
anti-σ, 53
arrays, 62
ARS, 98
attenuation, 54
autoregulation, 51
auxotroph, 3
bacterial artificial chromosomes, 96
bacteriocins, 109
base excision, 33
B-DNA, 8
bent DNA, 9
cell compartments, 63
cell cycle, 124
Chi, 75
chromatin, 10
chromosome, 93
cis-dominant, 115
cistrons, 115
complementation, 114, 115
conditional, 3, 41
congression, 119
conjugation, 93, 102
conjugative transposons, 81
cosmid vectors, 64
counter-selections, 89
cruciforms, 8
cryptic plasmids, 95
curing, 99
dam, 34, 84
deletion mapping, 122
deletions, 71
dominant, 3, 114
electroporation, 106
enrichments, 92
error-prone, 36
error-prone repair, 30
expression arrays, 92
expression vector, 60
F factor, 103, 104
FIS, 24, 78
forward mutations, 40
frameshift mutations, 32, 42
fusion, 60
GC content, 19
GC-rich, 11
gene, 3, 11
generalized transduction, 111
genetic code, 16
genomes, 18
genotype, 2, 28
gyrase, 72
Hfrs, 121
homologous recombination, 26
homology, 13
horizontal transfer, 133, 138, 140
hot spots, 31, 39, 133
identity, 13
IF3, 21
IHF, 9, 24, 79
incompatibility, 93, 101
informational suppression, 128
insertion sequences, 79
interference, 120
intragenic complementation, 116
introns, 14
inversions, 78
IVET, 91
kil systems, 100
knockouts, 37
LacZ, 117
leaky, 3
lesion, 29
lethal, 3
linear replicons, 95
linkage, 119
localized mutagenesis, 40, 66
m.o.i., 108
marker, 29
mating types, 124
mismatch repair, 34
missense mutation, 41
mRNA processing, 21, 55, 56
Mu, 80
mutant, 2
mutant frequency, 28
mutation, 2
mutation rate, 28, 133
mutators, 36
negative complementation, 115
non-permissive, 3, 42
nonsense mutations, 42
non-tandem duplications, 76
nucleotide excision, 33
ORF, 11
origin of replication, 97
oriT, 103
overlapping genes, 12
P1, 111
P22, 111
pac sites, 111
palindromes, 8
partitioning, 99, 101
PCR, 66
permissive, 3
phage, 64
phage display, 90
phagemids, 65
phenotype, 2, 28, 43
phenotypic lag, 45
plasmids, 93
point mutation, 29
PolA, 97
polarity, 25
poly-A tail, 20
populations, 59
prions, 107
proofreading, 33
prototroph, 3
pseudohyphae, 123
quorum-sensing, 51
randomization, 67
RecA, 26, 71
RecBC, 27
recessive, 3
recombination, 119
recombinational repair, 35
Rep proteins, 96, 97
repetitive DNA, 9
replica printing, 90
replicon, 93
reporter, 60
reverse transcriptase, 15
reversion, 3
reversion frequency, 28
RF1, 23
Rho, 25
RNA world, 132
RNAi, 70, 113
sacB, 73
screens, 87, 90
segregation, 125
selections, 87
Shine-Dalgarno, 16, 20, 21
siblings, 46
signature-tagged mutagenesis, 92
silencing, 52
similarity, 13
site-directed mutagenesis, 68
site-specific recombination, 80
specialized transduction, 112
species, 138
spontaneous, 28, 32
stable, 3
stochastic events, 48
suicide plasmids, 65, 68, 120
supercoiling, 5, 52
suppressors, 58, 128
synthetic phenotype, 3
T7 promoters, 65
tandem duplications, 73
temperature-conditional, 41
tight, 3
transcription elongation, 24, 53
transcription termination, 25, 54
trans-dominant, 114, 115
transformation, 104
translation initiation, 56
translational attenuation, 57
translational coupling, 21
translational repression, 57
transposition, 79, 82
transposons, 79
tRNA suppressors, 23
two-factor crosses, 120
Ty elements, 82
unstable, 3
UV, 32
viroids, 107
wild type, 2
X-gal, 62
YAC, 64
Z-DNA, 8, 72
σ, 24, 53
