Computational Methods in Protein Evolution 2019
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life and Medical Sciences,
University of Hertfordshire,
Hatfield, Hertfordshire AL10 9AB, UK
Edited by
Tobias Sikosek
GlaxoSmithKline, Cellzome - a GSK company, Meyerhofstrasse 1,
Heidelberg, Baden-Württemberg, Germany
© Springer Science+Business Media, LLC, part of Springer Nature 2019, corrected publication 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to
be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty,
express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Humana Press imprint is published by the registered company Springer Science+Business Media, LLC, part of
Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface
Proteins are the most versatile kind of molecule that we know and the result of a long
evolutionary process. During this process, countless rearranging, mutating, and replicating
strands of DNA have managed to both encode and conserve proteins that would allow them
to replicate and stay intact and on the other hand have allowed their proteins to change and
ultimately help them replicate more than other strands of DNA. All cells make proteins in
their protein factories called ribosomes, where the messenger RNA copy of a gene is translated
according to the ancient genetic code into strings of amino acids, which follow the laws of
thermodynamics and molecular forces to fold up into specific wobbly three-dimensional shapes. Protein
evolution happens whenever an accidental “typo”—or mutation—in the gene is translated
into a modified protein, and that protein is released into the busy commotion within the cell,
packed within a dense soup of other molecules in water. Whatever this new protein does
differently than its predecessor can determine the fate of that mutation, making it either an
essential innovation, a terrible mistake that gets erased, or something that just stays around
for a while without being noticed, maybe to play a role in the distant future.
This book is a compilation of methods that can be applied to various problems related to
protein sequence and structure. It is a diverse collection of approaches ranging from broad
conceptual (“protein space”) to very specific applications (“antibody modeling”). The term
“evolution” is used slightly differently in various fields of science. While evolutionary
biologists think about the natural process of Darwinian evolution (and other post-Darwinian
forms of evolution of organisms living in populations and environments), biochemists
take a more design-oriented approach to evolution, using the evolutionary process
in vitro or in silico to make proteins with certain desired properties. Physicists, on the other
hand, use the term evolution to describe a continuous process in time that changes a system
from one state to another. While physics plays a significant role in this book, it is the first two
notions of evolution that will be described in the following chapters.
Evolutionary research has made extensive use of computers. While the result of evolution
can be readily studied at the macroscopic, phenotypic level, evolutionary biology has
always had a strong theoretical component, since for a long time the actual process could
rarely be observed directly. The underlying patterns of inheritance and the interplay between
geography and population dynamics have been described in mathematical terms and have
always accompanied the progress made in the Molecular Biology of cells that eventually
elucidated the core mechanisms of inheritance: the information stored in DNA and how it is
replicated and passed on—imperfectly—to future generations. The field of Bioinformatics
was born as soon as the first sequences of genes and proteins had been published at a large
enough quantity to be amenable to direct sequence-to-sequence comparisons. The fields of
Molecular Evolution and Phylogenetics were close companions of this development where
mathematical models and computational algorithms were combined to reconstruct the most
likely evolutionary history given the observed DNA sequences. Protein sequences have been
a free giveaway due to the ready translatability of the amino acid sequence from DNA based
on the almost universal genetic code. DNA sequences became the main source material of
molecular evolution research for quite a while, further spurred by the Human Genome
Project and later the advent of the next-generation sequencing data explosion. Evolutionary
relationships within populations and among species were revealed in ever greater detail.
Still, no matter how much genetic sequence data has become available, many aspects
of how genetics translates into observable (phenotypic) changes
cannot be understood at that level of description. Network science is another toolkit rooted
in math and computation that is used to study evolution at the genotypic to phenotypic
interface. There are networks representing physical and chemical molecular interactions
within a cell, the flow of information and cell-level “computation” and communication, as
well as more abstract networks describing the relationships and similarities between gene
and protein sequences, including the entire “universe” of known proteins. While biological
network science—often called systems biology—comes close to providing a working model
of the cellular phenotype, the real “gap” in understanding where a mutation in the DNA
sequence makes a difference to the survival and fitness of an entire organism is how physical
interactions, the “edges” or connections in systems biology networks, are a result of
biophysical properties of proteins, which can be altered by mutations. It is this point—
where changes of DNA translate into altered protein structure and function—that most of
the methods in this book are focused on.
While Molecular Evolution has been a backward-facing, almost historical, discipline in
its early days, it has increasingly matured into an “applicable” science due to its intersections
with Biochemistry and Biophysics. Protein evolution is therefore much more than just the
description of evolutionary relationships based on sequence differences. It has become a
powerful tool for interfering with the evolution of pathogens, for devising therapies against
mutation-based diseases such as cancers, and for designing novel enzymes with properties
that can go beyond naturally evolved functions. Methods from evolution can be easily
applied whenever genetic variation is at play, and this variation is what makes all humans
unique and sometimes even determines why diseases and infections affect each of us
differently.
While each chapter in this book is the unique work of its authors and there is no
predefined “narrative” to this book, some common themes become apparent.
The first theme is that of mutations of single amino acids, i.e., point mutations. Predicting
their effect on the physical structure of a protein is an important capability that links the
abundance of sequence information with the comparatively few known structures (Chapters
1 and 2). Other mutational mechanisms lead to gene duplication (Chapter 3) and even de
novo emergence of new genes (Chapter 4).
Likewise, the understanding of pairwise correlated mutations can be used to reveal
structure information where none is available because the fates of spatially close (and
physically interacting) amino acids are evolutionarily linked and coevolve (Chapters 5, 6
and 7).
Going back into evolutionary history, the structure and function of proteins can be
reconstructed and used productively, since these may bear similar functions to their extant
descendants yet also may have some new functional properties (Chapters 8 and 9). Many
formerly sequence-based methods such as sequence alignments and phylogenies can be
improved by applying a more structural and biophysical viewpoint (Chapters 10 and 11).
Instead of exploring similar proteins along evolutionary time, one can of course also
compare existing proteins based on their similarity in sequence and structure. A number of
classification schemes for organizing all known proteins exist, and it is possible to explore an
entire “protein universe,” often by breaking full proteins into even smaller building units
called domains (Chapters 12, 13, 14, 15 and 16). Homology modeling makes use of these
similarities by fitting the sequences of proteins without known structure to those known
structures of proteins with similar sequence (Chapter 17). This structure prediction can also
be extended to protein-protein interactions (Chapter 18) and even some structural properties
of proteins lacking a fixed structure, i.e., disordered/unstructured proteins, can be
predicted (Chapter 19). Another important aspect related to disorder is the intrinsic
dynamic nature of folded proteins that always exist as an ensemble of conformations,
some of which become favored or disfavored with evolutionary changes (Chapter 20).
Finally, evolutionary principles are at work in shaping such versatile proteins as antibodies
or enzymes, which can also be designed to have certain properties in silico by
applying directed evolution, i.e., where the evolutionary endpoint, but not its path, is
determined by the researcher (Chapters 21 and 22).
The book covers a wide range of computational approaches, including the dynamic
programming techniques of sequence alignments, the clustering methods of phylogenies,
physics-based approaches such as molecular dynamics simulations, and a range of statistical,
graph-based, and machine learning methods. While the authors take the time to give some
background and references in the introductory sections, this book is not a textbook, and
more detailed descriptions of underlying theory and algorithms may have to be found
elsewhere. Nevertheless, I think that there is a lot to be learned from this book for an
interdisciplinary readership.
I sincerely hope that this book offers many useful workflows and techniques that help
many researchers and students working with proteins computationally. I also strongly
encourage the reader to go beyond the individual protocol and mix and match the different
methods to come up with new innovative solutions. That’s what evolution would do.
Contents
Preface
Contributors
KELSEY AADLAND Department of Microbiology & Cell Science, Institute for Food and
Agricultural Sciences, University of Florida, Gainesville, FL, USA
MATTEO ALDEGHI Max Planck Institute for Biophysical Chemistry, Computational
Biomolecular Dynamics Group, Göttingen, Germany
BEAT ANTON AMREIN Associate Scientist, Tecan Schweiz AG, Männedorf, Switzerland
MIGUEL ARENAS Department of Biochemistry, Genetics and Immunology, University of
Vigo, Vigo, Spain
MARIANO AVINO Department of Pathology and Laboratory Medicine, Western University,
London, Canada
UGO BASTOLLA Centre for Molecular Biology, Severo Ochoa (CSIC-UAM), Madrid, Spain
NIR BEN-TAL Department of Biochemistry and Molecular Biology, George S. Wise Faculty of
Life Sciences, Tel Aviv University, Tel Aviv, Israel
MARTINO BERTONI Biozentrum, University of Basel and SIB Swiss Institute of
Bioinformatics, Basel, Switzerland
STEFAN BIENERT Biozentrum, University of Basel and SIB Swiss Institute of Bioinformatics,
Basel, Switzerland
LORENZA BORDOLI Biozentrum, University of Basel and SIB Swiss Institute of
Bioinformatics, Basel, Switzerland
ERICH BORNBERG-BAUER Institute for Evolution and Biodiversity, University of Münster,
Münster, Germany
CARLES CORBI-VERGE Terrence Donnelly Centre for Cellular and Biomolecular Research,
University of Toronto, Toronto, ON, Canada
LILIANA M. DÁVALOS Department of Ecology and Evolution, Stony Brook University, Stony
Brook, NY, USA
CHARLOTTE M. DEANE Department of Statistics, University of Oxford, Oxford, UK
MARIA SILVINA FORNASARI Departamento de Ciencia y Tecnología, Universidad Nacional
de Quilmes, CONICET, Bernal, Argentina
NICHOLAS FURNHAM London School of Hygiene and Tropical Medicine, London, UK
VYTAUTAS GAPSYS Max Planck Institute for Biophysical Chemistry, Computational
Biomolecular Dynamics Group, Göttingen, Germany
NICK V. GRISHIN Department of Biophysics, University of Texas Southwestern Medical
Center, Dallas, TX, USA; Howard Hughes Medical Institute, University of Texas
Southwestern Medical Center, Dallas, TX, USA
BERT L. DE GROOT Max Planck Institute for Biophysical Chemistry, Computational
Biomolecular Dynamics Group, Göttingen, Germany
EMINE GUVEN-MAIOROV Cancer and Inflammation Program, Leidos Biomedical Research,
Inc., Frederick National Laboratory for Cancer Research, National Cancer Institute,
Frederick, MD, USA
JOSEPH L. HERMAN Department of Biomedical Informatics, Harvard Medical School,
Boston, MA, USA
KRISTINA STRAUB Institute of Biophysics and Physical Biochemistry, University of
Regensburg, Regensburg, Germany
Abstract
The function of a protein is largely determined by its three-dimensional structure and its interactions with
other proteins. Changes to a protein’s amino acid sequence can alter its function by perturbing the energy
landscapes of protein folding and binding. Many tools have been developed to predict the energetic effect
of amino acid changes, utilizing features describing the sequence of a protein, the structure of a protein, or
both. Those tools can have many applications, such as distinguishing between deleterious and benign
mutations and designing proteins and peptides with attractive properties. In this chapter, we describe how
to use one such tool, ELASPIC, to predict the effect of mutations on the stability of proteins and the
affinity between proteins, in the context of a human protein-protein interaction network. ELASPIC uses a
wide range of sequence-based and structural features to predict the change in the Gibbs free energy of protein
folding and protein-protein interactions. It can be used both through a web server and as a stand-alone
application. Since ELASPIC was trained using homology models and not crystal structures, it can be
applied to a much broader range of proteins than traditional methods. It can leverage precalculated
sequence alignments, homology models, and other features, in order to drastically lower the amount of
time required to evaluate individual mutations and make tractable the analysis of millions of mutations
affecting the majority of proteins in a genome.
Key words Computational biology, Structural biology, Bioinformatics, Protein stability, Mutations,
Protein engineering
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_1, © Springer Science+Business Media, LLC, part of Springer Nature 2019
Alexey Strokach et al.
1.3 Combination of Sequence and Structure
Several tools have been developed that attempt to combine sequence- and structure-based
information in order to make more accurate predictions about the deleteriousness [22] and the
structural impact [21, 23, 24] of mutations. Those tools generally are “meta-predictors” which
integrate the results of several sequence- and structure-based tools using machine learning
algorithms trained on an appropriate dataset [25, 26]. Most of those tools remain limited in
their coverage because only a small fraction of all proteins and protein-protein interactions
have an experimentally determined structure [27]. ELASPIC, developed by Berliner et al. [23],
overcomes this limitation by using homology models, instead of crystal structures, to evaluate
the structural impact of mutations. ELASPIC still achieves relatively high accuracy in
predicting the effect of mutations on protein stability and protein-protein interaction
affinity, but it has much higher coverage, including the majority of proteins in the human
proteome and hundreds of thousands of protein-protein interactions.
In this protocol, we describe how to set up and run ELASPIC
on a local machine. We describe how precalculated homology
models and other data can be downloaded and installed in order
to greatly reduce the time taken by ELASPIC to evaluate new
mutations. Finally, we show how to use ELASPIC to perform
alanine scanning of a protein-protein interaction interface and
how to evaluate the structural effect of several thousand mutations
that have been implicated in cancer.
2 Materials
involved and may take several hours. If you do not wish to make
changes to the ELASPIC source code and plan to evaluate fewer than
a few thousand mutations, we encourage using the ELASPIC web server
[28], available at http://elaspic.kimlab.org. The
web server may also be used to verify the results obtained using a
local installation of ELASPIC.
The source code for ELASPIC is available at https://gitlab.com/kimlab/elaspic/
and is provided under an MIT license. The documentation for ELASPIC is available at
https://kimlab.gitlab.io/elaspic/. ELASPIC should work on any Linux distribution with
a version of glibc 2.14 (e.g., CentOS 6 or newer, Ubuntu 12.04
or newer). At the moment, it does not work on Windows or
MacOS (although see Notes 1 and 2).
ELASPIC can be run using two different “pipelines,” the database pipeline and the local
pipeline, as shown in Fig. 1. The database pipeline allows us to evaluate the thermodynamic
impact of mutations on a proteome-wide scale, without having to specify a structural template
for each protein. This pipeline takes as input the UniProt ID of the protein being mutated and
one or more mutations affecting that protein. At each decision node, the pipeline queries the
database to check whether or not the required information has already been calculated. If the
required data has not been calculated, the pipeline executes the appropriate code and stores
the results in the database for later retrieval. The pipeline proceeds until homology models
of all domains in the protein, and all domain-domain interactions involving the protein, have
been calculated and the ΔΔG has been predicted for every specified mutation. The local pipeline
can be used without downloading and installing a local copy of the ELASPIC databases but
requires a PDB structure or template to be provided for every protein. The output from this
pipeline is saved as JSON files inside the working directory, rather than being uploaded to the
database, as in the case of the database pipeline. Both pipelines use the same internal
libraries to perform the majority of the computation.
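The decision logic of the database pipeline (query the database, compute only what is missing, store the result for later retrieval) can be sketched in a few lines of Python. This is an illustrative sketch only, not ELASPIC's actual code: the dictionary standing in for the database, the function names `get_or_compute` and `run_database_pipeline`, the placeholder step functions, and the identifier `P00000` are all hypothetical.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for the ELASPIC database: maps (step, key) -> result.
Database = Dict[Tuple[str, str], str]


def get_or_compute(database: Database, step: str, key: str,
                   compute: Callable[[str], str]) -> str:
    """Return a stored result if present; otherwise compute and store it."""
    cache_key = (step, key)
    if cache_key not in database:
        # Not precalculated: run the (expensive) step once and keep the result.
        database[cache_key] = compute(key)
    return database[cache_key]


# Placeholder computations; a real pipeline would call external tools here.
def build_alignment(uniprot_id: str) -> str:
    return f"alignment for {uniprot_id}"


def build_model(uniprot_id: str) -> str:
    return f"homology model for {uniprot_id}"


def run_database_pipeline(database: Database, uniprot_id: str) -> Dict[str, str]:
    """Visit the decision nodes in order, reusing stored results."""
    return {
        "alignment": get_or_compute(database, "alignment", uniprot_id, build_alignment),
        "model": get_or_compute(database, "model", uniprot_id, build_model),
    }


if __name__ == "__main__":
    db: Database = {}
    run_database_pipeline(db, "P00000")  # first call computes and stores both results
    run_database_pipeline(db, "P00000")  # second call only reads from the database
    print(sorted(db))
```

The point of the design is that the second call for the same protein does no new work, which is what makes evaluating many mutations per protein cheap once the alignments and models exist.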
The ELASPIC database, required by the database pipeline,
includes many external datasets, which are listed in Table 1. The
use of the external datasets is made transparent to the ELASPIC
user, who simply has to load the data from the ELASPIC download
page (http://elaspic.kimlab.org/static/download) into their local
ELASPIC database using the elaspic database load-basic or elaspic
database load-complete commands. The only exception is the
BLAST nr database, which is required by both the database pipeline
and the local pipeline and has to be downloaded separately from the
NCBI website (although see Note 4). This is described in detail in
Subheading 3.
Predicting the Effect of Mutations
[Figure 1: flowchart. Recoverable content is listed below; "DB" marks a database look-up.]
Database pipeline. Input: UniProt ID + mutation(s).
1. Do we have a Provean MSA for this protein? (DB) If no: run Provean to construct a multiple sequence alignment for the specified protein.
2. Do we have homology models for all domains in this protein? (DB) If no: run Modeller to create homology models for all domains in this protein.
3. Do we have homology models for all interactions involving this protein? (DB) If no: run Modeller to create homology models of all pairs of domains mediating interactions involving this protein.
4. Does the specified mutation fall inside a domain for which we have a structural template? If no: return None. ELASPIC only works for mutations that fall inside domains.
5. Have the features and ΔΔG values been calculated for the specified mutation(s)? (DB) If no: run FoldX and other programs and internal scripts to calculate all the features required by the machine learning classifier. Run the classifier to predict a value of ΔΔG for every mutation and every domain / domain pair.
Success! Return the predicted ΔΔG caused by the mutation for all domains and domain-domain interactions.
Local pipeline. Input: PDB [+ target sequences] + mutations.
ELASPIC internals:
elaspic_sequence.py (create and mutate sequence objects). Input: fasta file with domain sequence. Output: provean supporting set. .mutate(mutation): to compute sequence-based features of a mutation.
elaspic_model.py (create and mutate model objects). Input: fasta file with target sequences, pdb file of the template. Output: homology model + model properties. .mutate(mutation): to compute sequence-based features of a mutation.
elaspic_predictor.py (compute ΔΔG). Input: DataFrame of all features, with one mutation per row (as if pulled out from the database). Output: ΔΔG predictions.
Results.
Fig. 1 Schematic providing a general outline of ELASPIC. ELASPIC provides two different pipelines: a database
pipeline and a local pipeline. The database pipeline takes as input the UniProt ID of a protein and a mutation
and constructs homology models of the domains and domain-domain interactions involving the protein
automatically. The local pipeline takes as input the structure of a protein, or the sequence of a protein and
a structural template, and a mutation. It requires no precalculated data and can run in the absence of the
ELASPIC database. Both pipelines use the same code to perform the majority of the calculation
3 Methods
3.1 Installing ELASPIC
1. First, we should set the environment variables, which are required for installing and
using ELASPIC, in our ~/.bashrc file. This way, those environment variables will be set whenever
we start a new bash shell. The required environment variables,
Table 1
External databases that were used in the construction of the ELASPIC database
optional arguments:
-h, --help show this help message and exit
command:
{run,database,train}
run Run ELASPIC
database Perform database maintenance tasks
train Train the ELASPIC classifiers
3.2 Running ELASPIC
3.2.1 Evaluating the Effect of Mutations on a Single Protein (Local Pipeline)
The first use case for ELASPIC is to predict the thermodynamic effect of mutations on a
protein or a protein-protein interaction for which a crystal structure is available (local
pipeline in Fig. 1). In this case, the crystal structure of the protein can be provided to
ELASPIC directly, and no homology model needs to be created. In the following example, we will
show how to use ELASPIC to perform alanine scanning of the dimerization interface of
glutathione S-transferase.
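An alanine scan replaces each selected residue with alanine, one at a time. The sketch below builds such a list of point mutations from a sequence and a set of interface positions. It is an illustration under stated assumptions: the function name `alanine_scan`, the toy sequence, and the chosen positions are hypothetical, and the wild-type/position/mutant notation (e.g., K3A) is the common convention, which may differ from the exact mutation format that ELASPIC expects.

```python
from typing import Iterable, List


def alanine_scan(sequence: str, positions: Iterable[int]) -> List[str]:
    """Return mutations such as 'K3A' for the given 1-based positions.

    Residues that are already alanine are skipped, because mutating
    alanine to alanine changes nothing.
    """
    mutations = []
    for pos in sorted(set(positions)):
        if not 1 <= pos <= len(sequence):
            raise ValueError(f"position {pos} is outside the sequence")
        wild_type = sequence[pos - 1]
        if wild_type != "A":
            mutations.append(f"{wild_type}{pos}A")
    return mutations


if __name__ == "__main__":
    toy_sequence = "MAKLVE"          # hypothetical sequence, not glutathione S-transferase
    interface_positions = [2, 3, 5]  # hypothetical interface residues
    print(alanine_scan(toy_sequence, interface_positions))
```

In practice, the interface positions would come from the crystal structure of the dimer rather than being typed in by hand.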
1. Make sure that the environment variables that we set in step 1
of the ELASPIC installation are available.
$ [[ -z ${BLAST_DB_DIR} ]] && source ~/.bashrc
3.2.2 Evaluating the Effect of Mutations Proteome Wide (Database Pipeline)
A second use case for ELASPIC is to evaluate the effect of mutations in a large number of
proteins and protein-protein interactions for which a crystal structure may not be available
(database pipeline in Fig. 1). In the following example, we will show how to use ELASPIC to
predict the effect of missense mutations found in the OncoKB database [29] on protein stability
and protein-protein interaction affinity. OncoKB is a database of mutations in known
cancer genes with well-established clinical ramifications.
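Before submitting a large batch of missense mutations, it can help to check that each mutation string is well formed and that its wild-type residue matches the protein sequence. The following is a minimal sketch, assuming mutations are written in the common wild-type/position/mutant form (e.g., V600E). The names `parse_mutation` and `matches_sequence` are hypothetical helpers and are not part of ELASPIC or OncoKB.

```python
import re
from typing import NamedTuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes
_MUTATION_RE = re.compile(rf"([{AMINO_ACIDS}])([1-9][0-9]*)([{AMINO_ACIDS}])")


class Mutation(NamedTuple):
    wild_type: str
    position: int  # 1-based position in the protein sequence
    mutant: str


def parse_mutation(text: str) -> Mutation:
    """Parse a string such as 'V600E'; raise ValueError if it is malformed."""
    match = _MUTATION_RE.fullmatch(text.strip().upper())
    if match is None:
        raise ValueError(f"not a valid missense mutation: {text!r}")
    wild_type, position, mutant = match.groups()
    if wild_type == mutant:
        raise ValueError(f"wild-type and mutant residues are identical: {text!r}")
    return Mutation(wild_type, int(position), mutant)


def matches_sequence(mutation: Mutation, sequence: str) -> bool:
    """Check that the wild-type residue is found at the stated position."""
    return (mutation.position <= len(sequence)
            and sequence[mutation.position - 1] == mutation.wild_type)


if __name__ == "__main__":
    toy_sequence = "MAKLVE"  # hypothetical sequence for illustration
    for text in ["K3A", "K4A"]:
        print(text, matches_sequence(parse_mutation(text), toy_sequence))
```

Mutations that fail this check usually point to a mismatch between the annotated isoform and the sequence being used, which is worth resolving before any structural evaluation.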
1. Make sure that the environment variables that we set in step 1
of the ELASPIC installation are available.
$ [[ -z ${BLAST_DB_DIR} || -z ${ELASPIC_DB_STRING} || -z ${ELASPIC_ARCHIVE_DIR} ]] && source ~/.bashrc
4 Notes
Acknowledgments
References
1. Rockah-Shmuel L, Tóth-Petróczy Á, Tawfik DS (2015) Systematic mapping of protein mutational space by prolonged drift reveals the deleterious effects of seemingly neutral mutations. PLoS Comput Biol 11:e1004421
2. Huber CD, Kim BY, Marsden CD, Lohmueller KE (2017) Determining the factors driving selective effects of new nonsynonymous mutations. Proc Natl Acad Sci U S A 114:4465–4470
3. Brender JR, Zhang Y (2015) Predicting the effect of mutations on protein-protein binding interactions through structure-based interface profiles. PLoS Comput Biol 11:e1004494
4. Albanaz ATS, Rodrigues CHM, Pires DEV, Ascher DB (2017) Combating mutations in genetic disease and drug resistance: understanding molecular mechanisms to guide drug design. Expert Opin Drug Discov 12:553–563
5. Jelesarov I, Bosshard HR (1999) Isothermal titration calorimetry and differential scanning calorimetry as complementary tools to investigate the energetics of biomolecular recognition. J Mol Recognit 12:3–18
6. Sahni N, Yi S, Taipale M et al (2015) Widespread macromolecular interaction perturbations in human genetic disorders. Cell 161:647–660
7. Sun MGF, Seo M-H, Nim S et al (2016) Protein engineering by highly parallel screening of computationally designed variants. Sci Adv 2:e1600692
8. Weile J, Sun S, Cote AG et al (2017) Expanding the atlas of functional missense variation for human genes. BioRxiv 166595
9. Ng PC, Henikoff S (2003) SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res 31:3812–3814
10. Adzhubei I, Jordan DM, Sunyaev SR (2013) Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet Chapter 7: Unit 7.20
11. Li B, Krishnan VG, Mort ME et al (2009) Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics 25:2744–2750
12. Kircher M, Witten DM, Jain P et al (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46:310–315
13. Shihab HA, Gough J, Mort M et al (2014) Ranking non-synonymous single nucleotide polymorphisms based on disease concepts. Hum Genomics 8:11
14. Choi Y, Sims GE, Murphy S et al (2012) Predicting the functional effect of amino acid substitutions and indels. PLoS One 7:e46688
15. Dorfman R, Nalpathamkalam T, Taylor C et al (2010) Do common in silico tools predict the clinical consequences of amino-acid substitutions in the CFTR gene? Clin Genet 77:464–473
16. Shirts M, Mobley D (2013) An introduction to best practices in free energy calculations. In: Monticelli L, Salonen E (eds) Biomolecular simulations, Methods in molecular biology. Humana Press, Totowa, NJ, pp 271–311
17. Benedix A, Becker CM, de Groot BL et al (2009) Predicting free energy changes using structural ensembles. Nat Methods 6:3–4
18. Pires DEV, Ascher DB, Blundell TL (2014) mCSM: predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics 30:335–342
19. Laimer J, Hofer H, Fritz M et al (2015) MAESTRO - multi agent stability prediction upon point mutations. BMC Bioinformatics 16:116
20. Petukh M, Li M, Alexov E (2015) Predicting binding free energy change caused by point mutations with knowledge-modified MM/PBSA method. PLoS Comput Biol 11:e1004276
21. Dehouck Y, Grosfils A, Folch B et al (2009) Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics 25:2537–2543
22. Baugh EH, Simmons-Edler R, Müller CL et al (2016) Robust classification of protein variation using structural modelling and large-scale data integration. Nucleic Acids Res 44:2501–2513
23. Berliner N, Teyra J, Çolak R et al (2014) Combining structural modeling with ensemble machine learning to accurately predict protein fold stability and binding affinity effects upon mutation. PLoS One 9:e107353
24. Li M, Simonetti FL, Goncearenco A, Panchenko AR (2016) MutaBind estimates and interprets the effects of sequence variants on protein-protein interactions. Nucleic Acids Res 44:W494–W501
25. Kumar MDS, Bava KA, Gromiha MM et al (2006) ProTherm and ProNIT: thermodynamic databases for proteins and protein–nucleic acid interactions. Nucleic Acids Res 34:D204–D206
26. Moal IH, Fernández-Recio J (2012) SKEMPI: a structural kinetic and energetic database of mutant protein interactions and its use in empirical models. Bioinformatics 28:2600–2607
27. Rose PW, Prlić A, Altunkaya A et al (2017) The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res 45:D271–D281
28. Witvliet DK, Strokach A, Giraldo-Forero AF et al (2016) ELASPIC web-server: proteome-wide structure-based prediction of mutation effects on protein stability and binding affinity. Bioinformatics 32:1589–1591
29. Chakravarty D, Gao J, Phillips SM et al (2017) OncoKB: a precision oncology knowledge base. JCO Precis Oncol 2017. https://doi.org/10.1200/PO.17.00011
30. Das R, Baker D (2008) Macromolecular modeling with rosetta. Annu Rev Biochem 77:363–382
31. Moult J, Fidelis K, Kryshtafovych A et al (2014) Critical assessment of methods of protein structure prediction (CASP)--round x. Proteins 82(Suppl 2):1–6
32. McGibbon RT, Beauchamp KA, Harrigan MP et al (2015) MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys J 109:1528–1532
33. Consortium TU (2015) UniProt: a hub for protein information. Nucleic Acids Res 43:D204–D212
34. Calderone A, Castagnoli L, Cesareni G (2013) mentha: a resource for browsing integrated protein-interaction networks. Nat Methods 10:690–691
35. McGinnis S, Madden TL (2004) BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 32:W20–W25
36. Webb B, Sali A (2016) Comparative protein structure modeling using MODELLER. Curr Protoc Bioinformatics 54:5.6.1–5.6.37
37. Choi Y (2012) A fast computation of pairwise sequence alignment scores between a protein and a set of single-locus variants of another protein. In: Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine - BCB ’12. ACM, New York, NY
38. Schymkowitz J, Borg J, Stricher F et al (2005) The FoldX web server: an online force field. Nucleic Acids Res 33:W382–W388
39. Sanner MF, Olson AJ, Spehner J (1996) Reduced surface: an efficient way to compute molecular surfaces. Biopolymers 38:305–320
40. Heinig M, Frishman D (2004) STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res 32:W500–W502
Chapter 2
Abstract
Molecular dynamics based free energy calculations allow for a robust and accurate evaluation of free energy
changes upon amino acid mutation in proteins. In this chapter we cover the basic theoretical concepts
important for the use of calculations utilizing the non-equilibrium alchemical switching methodology. We
further provide a detailed step-by-step protocol for estimating the effect of a single amino acid mutation on
protein thermostability. In addition, the potential caveats and solutions to some frequently encountered
issues concerning the non-equilibrium alchemical free energy calculations are discussed. The protocol
comprises details for the hybrid structure/topology generation required for alchemical transitions, equilib-
rium simulation setup, and description of the fast non-equilibrium switching. Subsequently, the analysis of
the obtained results is described. The steps in the protocol are complemented with an illustrative practical
application: a destabilizing mutation in the Trp cage mini protein. The concepts that are described are
generally applicable. The shown example makes use of the pmx software package for the free energy
calculations using Gromacs as a molecular dynamics engine. Finally, we discuss how the current protocol
can readily be adapted to carry out charge-changing or multiple mutations at once, as well as large-scale
mutational scans.
Key words Molecular dynamics, free energy calculations, alchemistry, amino acid mutation, pmx,
hybrid structure, hybrid topology, non-equilibrium transitions
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_2, © Springer Science+Business Media, LLC, part of Springer Nature 2019
20 Matteo Aldeghi et al.
[8, 9]. Engineered stable proteins with high affinity and specificity
toward their binding targets may also serve as biopharmaceuticals
[10, 11]. Accurate and robust estimation of the free energy differ-
ences between protein sequence variants, thus, is crucial to the
successful design of proteins with the desired thermodynamic
features.
Different approaches have thus been developed that can return
an estimate of free energy changes that relate to the different
stabilities or binding affinities of wild-type and mutant proteins.
These include fast scoring methods [12–16], implicit-solvent
approaches based on the post-processing of molecular dynamics
(MD) simulations [17–19], and the computationally more expen-
sive but theoretically rigorous (from a statistical mechanics view-
point) alchemical free energy methods [1, 20]. In this chapter, we
focus on the latter category of calculations, which are based on
all-atom computer simulations that correctly sample the Boltz-
mann distribution of microstates and inherently take into account
entropic and discrete solvent effects.
In alchemical free energy calculations, an amino acid can be
transformed into another one via a non-physical path, hence the
name that is reminiscent of the ancient practice that aimed at the
transmutation of lead into gold. The amino acid transformation can
be carried out reversibly, in what are referred to as equilibrium free
energy calculations, or irreversibly, in non-equilibrium calculations
[21]. In both cases, the amount of work needed for the transfor-
mation and free energy difference between the initial and final states
can be recovered. However, the setup of the calculations differs. In
this chapter, we discuss non-equilibrium approaches that carry out
this transformation irreversibly and describe protocols that can be
used for the accurate estimation of free energy changes upon amino
acid mutation. In the text, we use the prediction of protein stability
changes upon an amino acid mutation as an example application.
The methodology and protocol presented here are of generic char-
acter and can be applied to study other biophysical processes,
assuming a suitable thermodynamic cycle can be built, e.g., changes
in protein–protein, protein–DNA, or protein–ligand binding
affinities.
In this chapter, we first provide some background concepts that
are at the foundation of the non-equilibrium alchemical free energy
method; for a more detailed description we give references to more
specialized literature sources. Further, we concentrate on the
description of the practical steps involved in preparing and subse-
quently carrying out the free energy calculations following a gen-
eral protocol. As an example, we use a Trp cage mini protein [22]
that provides a real case on which we illustrate setting up and
running alchemical free energy calculations of protein mutation.
We assume the reader is familiar with the general principles of
molecular dynamics simulations. Throughout this chapter, we
Accurate Calculation of Free Energy Changes upon Amino Acid Mutation 21
discuss the potential caveats and solutions for some of the fre-
quently encountered issues. In the last section of the chapter, we
describe how the protocol can be easily modified and expanded to
perform large-scale mutational scans or to calculate other free
energy changes of interest, such as changes in protein–protein or
protein–ligand affinities upon protein mutation. Finally, in the
Notes section, we provide a few technical remarks that may prove
helpful when setting up alchemical free energy calculations using
Gromacs 2016 [23] and the pmx Python library with its specialized
set of scripts [24].
2 Theory
2.1 Definition of Free Energy and Irreversible Work
The free energy surface of a system determines its thermodynamic
and kinetic properties and, as such, it provides access to
understanding biophysical processes, including protein folding, ligand
binding, protein–protein association, etc. For instance, a polypeptide
chain in solution may be found in many disordered conformations,
or in ordered conformations with well-defined secondary and
tertiary structure. We can define the set of disordered conformations
as the unfolded state of the system (state A), and the set of
ordered conformations as the folded state (state B). It is rarely
possible to sample the whole phase space of a protein, which
would require observing all the folded and unfolded conformations
multiple times. However, in practice free energy differences rather
than free energies are typically of interest. The difference between
the free energies of states A and B alone gives the relative equilibrium
probability of finding the protein in its unfolded form with
respect to the folded form; i.e., the free energy difference ΔG is
proportional to the logarithm of the ratio of probabilities of finding
the system in state A or B:
$$\frac{p_A}{p_B} = \frac{e^{-\beta G_A}}{e^{-\beta G_B}} = e^{-\beta (G_A - G_B)} \tag{1}$$

$$\Delta G = G_A - G_B = -k_B T \ln \frac{p_A}{p_B} \tag{2}$$
where G is the free energy of the whole phase space of the system for
an ensemble with a fixed number of particles, constant pressure and
temperature (T), i.e., isothermal–isobaric conditions. GA is the free
energy of the unfolded state, GB is the free energy of the folded
state, and β = 1/(kB T), where kB is the Boltzmann constant and
T is the absolute temperature.
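As a minimal numerical illustration of Eq. 2, the following Python sketch converts state populations into a free energy difference. The function name, the default temperature of 298.15 K, and the example populations are assumptions made for illustration only; energies are expressed per mole, so the gas constant takes the place of kB.

```python
import math

# Gas constant in kJ/(mol K); plays the role of kB when energies are per mole.
R_KJ_PER_MOL_K = 0.0083144626


def delta_g_from_populations(p_a, p_b, temperature=298.15):
    """Return dG = G_A - G_B = -kB T ln(p_A / p_B) in kJ/mol (Eq. 2)."""
    if p_a <= 0 or p_b <= 0:
        raise ValueError("populations must be positive")
    return -R_KJ_PER_MOL_K * temperature * math.log(p_a / p_b)


# Hypothetical populations: 1% unfolded (state A), 99% folded (state B).
print(round(delta_g_from_populations(0.01, 0.99), 2))
```

A positive value means that state A (unfolded) has the higher free energy and is therefore the less populated state.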
This free energy difference also determines the maximum
amount of work that can be extracted from the closed system
during a thermodynamic process, which can only be achieved in
the limit of reversibility. During a reversible process, the system is
always in thermodynamic equilibrium, which implies that only
infinitesimal changes are applied to it and the transformation is
infinitely slow. However, for any finite time interval τ, the system
will be driven out of equilibrium, resulting in heat dissipation and
hysteresis effects, so that the process will be irreversible. In fact, in
accordance with the second law of thermodynamics, the work done
during a process is on average equal to or larger than the free energy
difference between the initial and final states, with the excess being
dissipative work:
$$\langle W(\tau) \rangle \geq \Delta G \tag{3}$$
2.2 Estimating Free Energy Differences from Non-equilibrium Simulations
From the considerations above, it is possible to derive estimators
that allow calculating free energy differences from equilibrium and
non-equilibrium simulations. Both Zwanzig’s formula [38], which lies
at the basis of free energy perturbation (FEP) approaches, and
thermodynamic integration (TI) [39] make use of ensemble averages
obtained from equilibrium simulations for the
2.2.1 Jarzynski’s Equality
The equality derived by Jarzynski in 1997 [30, 40] relates the
unidirectional non-equilibrium work average to the equilibrium
free energy difference:
From Eq. 5 one can directly estimate the free energy difference
as follows, with N being the number of non-equilibrium trajectories
sampled:
$$\widehat{\Delta G} = -k_B T \ln \left[ \frac{1}{N} \sum_{i=1}^{N} e^{-\beta W_i} \right] \tag{6}$$
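The estimator in Eq. 6 can be sketched in a few lines of Python. This is an illustrative implementation, not the pmx code: the function name and the default kT of 2.479 kJ/mol (roughly 298 K) are assumptions, and the log-sum-exp rearrangement is only a numerical safeguard that leaves the result of Eq. 6 unchanged.

```python
import math


def jarzynski_delta_g(work_values, kT=2.479):
    """Estimate dG from one-directional work values (Eq. 6).

    work_values and kT must share the same energy unit; the default kT
    corresponds to roughly 298 K in kJ/mol.
    """
    if not work_values:
        raise ValueError("at least one work value is required")
    exponents = [-w / kT for w in work_values]
    # Log-sum-exp trick: factor out the largest exponent to avoid overflow.
    largest = max(exponents)
    mean_exp = sum(math.exp(x - largest) for x in exponents) / len(exponents)
    return -kT * (largest + math.log(mean_exp))
```

Because the exponential average is dominated by the smallest work values, the estimate never exceeds the mean work, in line with Eq. 3.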
2.2.2 Crooks Fluctuation Theorem
Jarzynski’s equality considers the transitions in one direction only,
e.g., from λ = 0 to λ = 1. The Crooks Fluctuation Theorem (CFT)
takes into account the work values obtained from performing the
process in both forward (λ: 0 → 1) and reverse (λ: 1 → 0) directions.
According to the CFT, the forward and reverse work distributions
relate to the free energy difference as follows:
$$\frac{P_f(W)}{P_r(-W)} = e^{\beta (W - \Delta G)} \tag{7}$$
where Pf(W) and Pr(−W) are the normalized probability distributions
of work values obtained from the forward and reverse
transformation paths. Note that Jarzynski’s equality can be derived
from Eq. 7 by integration over W [21]. With enough overlap
between the forward and reverse work distributions, the free energy
difference can be estimated directly from Eq. 7 as follows:
$$\widehat{\Delta G} = W - k_B T \ln \frac{P_f(W)}{P_r(-W)} \tag{8}$$
with the estimate equal to W at the intersection of the work distributions.
However, this approach has known limitations: firstly, for certain paths it
might be difficult to obtain substantial overlap between Pf(W) and
Pr(−W). Secondly, mainly the tails of the distributions, which are
defined by rare events of low work dissipation, will contribute to
the free energy difference.
To partly alleviate these problems, one can approximate the
work distributions with an analytical function [43]. One such strategy,
which leads to accurate free energy estimates, was proposed by
Goette and Grubmüller [29]. By using a Gaussian approximation, a
Crooks Gaussian Intersection (CGI) estimator was derived:
$$\widehat{\Delta G} = \frac{\dfrac{\langle W_f \rangle}{\sigma_f^2} - \dfrac{\langle -W_r \rangle}{\sigma_r^2} \pm \sqrt{\dfrac{1}{\sigma_f^2 \sigma_r^2} \left( \langle W_f \rangle + \langle W_r \rangle \right)^2 + 2 \left( \dfrac{1}{\sigma_f^2} - \dfrac{1}{\sigma_r^2} \right) \ln \dfrac{\sigma_r}{\sigma_f}}}{\dfrac{1}{\sigma_f^2} - \dfrac{1}{\sigma_r^2}} \tag{9}$$
where σf² and σr² are the variances of the forward and reverse work
distributions. Note that the accuracy of this estimator relies on the
Gaussian approximation. Thus, it is advised to check this assumption
by, for instance, using a statistical test like the Kolmogorov–Smirnov
test [44]. The CGI estimator does not have an analytical error
estimate, but the error can be estimated by the bootstrap
approach [45].
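A minimal Python sketch of the Gaussian intersection idea is given below. It is not the pmx implementation: the function name is hypothetical, sample variances are used, the reverse work values are assumed to be supplied as measured along the reverse path (their sign is flipped internally), and the rule for choosing between the two intersection points (the one closest to the midpoint of the two means) is an assumption made for this illustration.

```python
import math
import statistics


def cgi_delta_g(work_forward, work_reverse):
    """Crooks Gaussian Intersection estimate from forward/reverse work values."""
    mu_f = statistics.mean(work_forward)
    mu_r = -statistics.mean(work_reverse)  # mean of the sign-flipped reverse work
    var_f = statistics.variance(work_forward)
    var_r = statistics.variance(work_reverse)
    if var_f <= 0 or var_r <= 0:
        raise ValueError("work values must have non-zero variance")
    midpoint = 0.5 * (mu_f + mu_r)
    if math.isclose(var_f, var_r):
        # Equal widths: the Gaussians cross exactly halfway between the means.
        return midpoint
    a = 1.0 / var_f - 1.0 / var_r
    b = mu_f / var_f - mu_r / var_r
    # Note: 2 * ln(sigma_r / sigma_f) equals ln(var_r / var_f).
    discriminant = (mu_f - mu_r) ** 2 / (var_f * var_r) + a * math.log(var_r / var_f)
    root = math.sqrt(discriminant)
    candidates = [(b + root) / a, (b - root) / a]
    # Assumption: keep the intersection nearest the midpoint of the means.
    return min(candidates, key=lambda x: abs(x - midpoint))
```

For well-separated distributions the selected root lies between the mean forward work and the mean sign-flipped reverse work.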
Another ΔG estimator, termed BAR (Bennett’s Acceptance
Ratio), does not require an analytical approximation for the work
distributions. Originally, the BAR relation was derived in 1976 by
Bennett for a system sampling two states at equilibrium and
performing instantaneous transformations between the states. Bennett
showed that the information from the forward and reverse
distributions of the potential energy difference (ΔU) could be
combined in order to obtain an optimal estimate of the free energy
difference [46]. For a non-equilibrium process carried out during a
finite amount of time, the same derivation holds by substituting
ΔU with the non-equilibrium work W. In 2003, Shirts and coworkers
showed how the same estimator can be derived starting from
the Crooks Fluctuation Theorem using maximum-likelihood arguments
[47]. The BAR estimates the free energy difference by
satisfying the following relation:
$$\sum_{i=1}^{N_f} \frac{1}{1 + \frac{N_f}{N_r} e^{\beta (W_i - \widehat{\Delta G})}} = \sum_{j=1}^{N_r} \frac{1}{1 + \frac{N_r}{N_f} e^{-\beta (W_j - \widehat{\Delta G})}} \tag{10}$$
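Eq. 10 has no closed-form solution, so the estimate has to be found numerically. The sketch below solves it by bisection and is meant as an illustration rather than the pmx implementation. The function names and the default kT of 2.479 kJ/mol are assumptions, and the reverse work values are assumed to enter Eq. 10 with flipped sign, which the code applies internally to the values measured along the reverse path.

```python
import math


def _fermi(x):
    """Numerically stable 1 / (1 + exp(x))."""
    if x > 0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))


def bar_delta_g(work_forward, work_reverse, kT=2.479, tol=1e-9):
    """Solve the BAR relation (Eq. 10) for dG by bisection."""
    n_f, n_r = len(work_forward), len(work_reverse)
    if n_f == 0 or n_r == 0:
        raise ValueError("both directions need at least one work value")
    flipped = [-w for w in work_reverse]
    log_ratio = math.log(n_f / n_r)

    def imbalance(dg):
        # Left-hand side minus right-hand side of Eq. 10; increases with dg.
        lhs = sum(_fermi(log_ratio + (w - dg) / kT) for w in work_forward)
        rhs = sum(_fermi(-log_ratio - (w - dg) / kT) for w in flipped)
        return lhs - rhs

    values = list(work_forward) + flipped
    lo, hi = min(values), max(values)
    step = max(hi - lo, kT)
    while imbalance(lo) > 0:   # widen the bracket downwards if needed
        lo -= step
    while imbalance(hi) < 0:   # widen the bracket upwards if needed
        hi += step
    for _ in range(200):
        if hi - lo < tol:
            break
        mid = 0.5 * (lo + hi)
        if imbalance(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Bisection is used here because the difference between the two sides of Eq. 10 changes monotonically with the trial free energy, so a bracketed root is unique.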
2.3 Free Energy Differences Upon Protein Mutation: The Alchemical Path
To calculate a free energy difference, firstly we need to define the
initial and final states of interest, and secondly the path connecting
them. If we consider the folding example already used, then the
initial state would be the unfolded protein and the final state would
be the folded protein, with the free energy difference we want to
calculate being the protein folding free energy. If the structure of
2.3.1 The Thermodynamic Cycle
As shown in Fig. 1, one can define a cycle where for both the initial
(unfolded) and final (folded) states the wild-type protein is transformed
into a mutant of interest via a non-physical path. The free
energy difference of protein folding upon an amino acid mutation
($\Delta\Delta G^{\mathrm{Mutation}}_{\mathrm{Folding}}$) can be recovered by following both the physical
paths of folding the WT and mutant protein
($\Delta G^{\mathrm{Mut}}_{\mathrm{Folding}} - \Delta G^{\mathrm{WT}}_{\mathrm{Folding}}$),
and the alchemical paths of morphing the amino acids in the folded
and unfolded states
($\Delta G^{\mathrm{Folded}}_{\mathrm{Mutation}} - \Delta G^{\mathrm{Unfolded}}_{\mathrm{Mutation}}$).
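The equivalence of the two routes follows from free energy being a state function: the free energy changes along any closed cycle sum to zero. As a short worked sketch, assuming the cycle is traversed from the unfolded WT to the folded WT, then to the folded mutant, the unfolded mutant, and back to the unfolded WT:

$$
\begin{aligned}
0 &= \Delta G^{\mathrm{WT}}_{\mathrm{Folding}}
   + \Delta G^{\mathrm{Folded}}_{\mathrm{Mutation}}
   - \Delta G^{\mathrm{Mut}}_{\mathrm{Folding}}
   - \Delta G^{\mathrm{Unfolded}}_{\mathrm{Mutation}} \\
\Delta\Delta G^{\mathrm{Mutation}}_{\mathrm{Folding}}
  &= \Delta G^{\mathrm{Mut}}_{\mathrm{Folding}} - \Delta G^{\mathrm{WT}}_{\mathrm{Folding}}
   = \Delta G^{\mathrm{Folded}}_{\mathrm{Mutation}} - \Delta G^{\mathrm{Unfolded}}_{\mathrm{Mutation}}
\end{aligned}
$$

The right-hand side contains only the alchemical legs, which is why the physical folding process itself never needs to be simulated.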
2.3.2 Single and Dual Topology
We have described how alchemical transformations can be used to
build thermodynamic cycles that allow one to calculate changes in
free energy differences upon an amino acid mutation. However,
how can one alchemically mutate one residue into another during a
simulation? Given the separate Hamiltonians at the two end states,
it is necessary to define a hybrid topology that contains both
physical states. In the specific case of mutating an amino acid into
another one, the residue being mutated must be able to represent
both the wild-type and mutant residue. This is typically achieved
using the single or dual topology approach [55–57].
Fig. 2 Example of the single and dual topology setup for the mutation of valine
into serine. Dummy atoms in the three-dimensional rendering are shown as
transparent balls and sticks, whereas in the chemical structure drawings they
are shown in grey. In the single topology approach, a methyl part of valine’s side
chain is transformed into serine’s hydroxyl group, with a carbon becoming an
oxygen, while two hydrogens are turned into non-interacting dummy particles;
all hydrogens of the second methyl are decoupled as well, while the carbon
becomes a Cβ hydrogen. In the dual topology approach, no element mutation
occurs, because both valine and serine side chains are present in both states,
where, however, only one of the two is coupled to the system, with the other one
being non-interacting
states is not equal, thus not all atoms of the states A and B can be
matched. Therefore, non-interacting particles are used either in
state A or B. These dummy atoms do not have electrostatic and
van der Waals (vdW) interactions with the system; however, they
maintain their bonded interactions, so that they effectively are in a
vacuum-like state. In the example in Fig. 2, five of valine’s hydro-
gen atoms are turned into dummy atoms.
In the dual topology approach, atoms that are different
between the two end states are not morphed directly, but rather
transformed into dummy particles [26, 56, 57]. For amino acids,
this effectively means that the side chains of both residues are
present at the same time. However, at λ = 0 the side chain of the
initial state is interacting with the system and the side chain of the
final state is present as non-interacting particles. On the other hand,
at λ = 1 the side chain of the final state is interacting and that of the
initial state is turned into non-interacting dummy atoms. This can
be seen in Fig. 2: in the initial state, the hydroxymethyl side chain of
serine is decoupled, whereas in the final state it is the isopropyl side
chain of valine being turned off.
In practice, there does not need to be a clear separation
between a single and dual topology setup. While some atoms may
be morphed between the states following a single topology
approach, other atoms in the same system may be turned into
dummies according to a dual topology approach.
It is important to bear in mind that the free energy change (ΔG)
of the mutation differs depending on whether the single or dual
topology approach is used. This is due to the fact that the end states
are effectively different due to different dummy atom construc-
tions. In addition, in the single topology approach there is a con-
tribution to the free energy difference from the change in bond
lengths. However, the contributions to the free energy difference
resulting from the details of the atom mapping between the end
states cancel out in a thermodynamic cycle like the one in Fig. 1,
such that the final ΔΔG value is independent of how the hybrid
topology is implemented [57, 59].
Using dummy particles in alchemical transitions requires the
introduction of particles into, and their annihilation from, the system.
Such transformations impose a large perturbation, e.g., creating a particle
interacting with the environment in place of a non-interacting
dummy atom results in strong van der Waals repulsions and Coulombic
interactions. In turn, large forces are exerted on the atoms,
which leads to instabilities in dynamics and integration artifacts. To
circumvent these issues, it is common practice to modify, or
“soften,” the non-bonded interactions during the alchemical transformations.
A number of functional forms and parameter sets for
such soft-core interactions have been proposed [60–64]. Altering
the non-bonded interactions along the alchemical pathway does
not affect the final free energy estimates, because the physical end
3.1 Setting Up pmx
pmx is a Python library that allows the convenient manipulation of
biomolecular structure and topology files. Within the framework of
pmx, a number of scripts have been developed and specifically
designed to prepare and analyze alchemical free energy calculations.
pmx generates topology files that are compatible with the Gromacs
simulation engine.
Mutations in a number of contemporary molecular mechanics
force fields are supported. This is achieved by means of
pre-generated mutation libraries compatible with the Gromacs
force field organization. After installing Gromacs and pmx, the
GMXLIB environment variable needs to be set to specify the
path to the mutation libraries that come with the pmx package (see
Note 2).
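Before starting a setup, it can be useful to verify that GMXLIB is set and points to an existing directory. The following Python helper is a hypothetical convenience check, not part of pmx; it does not verify that the directory actually contains the mutation libraries.

```python
import os


def check_gmxlib(environ=os.environ):
    """Return the GMXLIB directory, or raise if it is unset or missing."""
    path = environ.get("GMXLIB")
    if not path:
        raise RuntimeError("GMXLIB is not set; point it to the pmx mutation libraries")
    if not os.path.isdir(path):
        raise RuntimeError(f"GMXLIB points to a missing directory: {path}")
    return path
```

Passing the environment as an argument keeps the check easy to exercise without modifying the real shell environment.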
3.2 Hybrid Structure
The first step in the setup comprises the generation of the hybrid
structure for the amino acid to be mutated (Fig. 3). The only file
required for this step is the protein structure in .pdb or .gro format.
The protein structure needs to be complete, i.e., all heavy and
hydrogen atoms need to be present. In order to add missing
heavy atoms, external software needs to be used, e.g., Rosetta
[15], Modeller [66], or PyMOL [67]. Furthermore, given that
structures resolved by means of X-ray crystallography usually con-
tain no hydrogen atoms, these need to be added as well. Various
software packages, like WhatIf [68] or Rosetta, offer assignment of
hydrogen coordinates for protein structures. The Gromacs tool
pdb2gmx can do this too. In fact, it is convenient to pre-process
a .pdb file with pdb2gmx because it produces a structure file with
atom names already compatible with the Gromacs internal atom
naming given the selected force field. pdb2gmx also identifies
whether any heavy atoms in a protein are missing, so that the tool
can be used to identify incomplete residues. While pdb2gmx will not
model missing heavy atoms, it will inform about such deficiencies.
Note that pdb2gmx will fail if the input structure contains molecules
that are not readily recognized by Gromacs. Therefore, molecules
that are not present in the force field file have to be removed from
the structure at this stage and processed independently.
For the Trp cage model system we use an NMR structure
(PDB-ID 1L2Y) [22] that was deposited with 38 conformers.
After manually extracting conformer #2, we pre-process the struc-
ture by running it through pdb2gmx:
3.3 Topology
At this point we use the hybrid structure from the previous step
(“mut.pdb”) as an input to pdb2gmx (Fig. 3). This time we want to
obtain the topology file containing all the information needed by
Gromacs to run the simulations. The topology file will also include
the description of the hybrid mutated residue; however, parameters
for only one physical state (state A) are defined in the output
topology file. It is also important to note that at this step the
“-ignh” flag should not be set, since the hydrogen atoms have
already been added in the previous step.
gmx pdb2gmx -f mut.pdb -o mut_pdb2gmx.pdb -ff amber99sb-star-ildn-mut \
    -water tip3p -p topol.top
3.4 Hybrid Topology
The generated topology file (“topol.top”) has the hybrid residue
W2F incorporated. However, it is a non-standard hybrid amino
acid with two physical states (A and B). While state A is included
in the topology, state B still needs to be included explicitly. The
3.5 Webserver
The procedure detailed above (and summarized in Fig. 3) can also
be executed via a webserver interface: http://pmx.mpibpc.mpg.de.
Provided with a protein structure file, the pmx webserver will
perform a user-selected mutation in one of the supported molecu-
lar mechanics force fields.
The webserver runs a number of additional structure
pre-processing steps that simplify the setup procedure. While bro-
ken or incomplete proteins will not be repaired, a number of other
useful modifications are applied: residue and atom names are
matched to the force field nomenclature, terminal residues are
dealt with, and if needed hydrogen atoms may be added via
pdb2gmx. Optionally, the structure may be checked before the
mutation is performed, so that the user is informed about any
potential deficiencies in the input file. In addition, the setup offered
by the webserver is not limited to single amino acid mutations, but
also allows preparing files for mutation scans over selected protein
chains.
3.6 Alchemical Simulations
The hybrid structures and topologies we just obtained can readily
be used for MD simulations and to calculate free energy differences.
Numerous protocols for relative alchemical free energy calculations
are currently available: equilibrium approaches (TI, FEP) as well as
non-equilibrium methods. Here, we employ non-equilibrium cal-
culations based on the Crooks Fluctuation Theorem.
3.6.1 System Preparation
Firstly, the hybrid structure and topology are used in preparing the
system for molecular dynamics simulations following a standard
procedure. The protein needs to be placed in a simulation box
and solvated. Then ions need to be added to neutralize the system
and, optionally, reach a desired salt concentration. These are con-
ventional steps used to prepare an ordinary MD simulation: for a
more detailed description of this procedure in Gromacs we refer the
reader to a specialized protocol [72].
3.6.2 Equilibrium Simulations
Next, we set up two equilibrium simulations: one for the WT Trp
cage (W6, state A, λ = 0) and another for the mutated protein (F6,
state B, λ = 1) (Fig. 4). We start with an energy minimization
performed on both states separately. The parameters for the energy
minimization (.mdp) are the same as those used in non-alchemical
simulations, with the exception of two flags. The free-energy
flag has to be set to yes. This indicates that the free energy code in
Gromacs will be activated for those interactions that have two sets
of parameters (states A and B) in the topology file. In addition, the
init-lambda flag has to be set to 0 for the simulation in state
A (WT Trp cage) and to 1 for state B (mutated Trp cage).
Fig. 4 The procedure of non-equilibrium alchemical simulations for one leg of the
thermodynamic cycle: mutation in the folded state of a protein. Two independent
equilibrium simulations are performed by keeping the system in its physical
states: WT (λ = 0) and mutant (λ = 1). These simulations need to sufficiently
sample the end state ensembles, as the accuracy of the free energy estimate will
depend on the convergence of the equilibrium sampling. Typically, the equilibrium
simulations are in the nanosecond to microsecond time range. From the
generated trajectories, snapshots are selected to start fast (typically 10–200 ps)
transitions driving the system in the forward (λ: 0 → 1) and reverse (λ: 1 → 0)
directions. The work values required to perform these transitions are collected
and the Crooks Fluctuation Theorem is used to calculate the free energy
difference between the two states
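The two end-state settings described above differ only in the value of init-lambda. As a small illustration, the following Python sketch prints the extra .mdp lines for each state; the function name is hypothetical, and the remaining minimization parameters are assumed to be taken unchanged from an ordinary .mdp file.

```python
def free_energy_mdp_lines(state):
    """Return the two extra .mdp lines for an end-state simulation.

    state: "A" (WT, lambda = 0) or "B" (mutant, lambda = 1).
    """
    init_lambda = {"A": 0, "B": 1}[state]
    return ["free-energy = yes", f"init-lambda = {init_lambda}"]


for state in ("A", "B"):
    print(f"; state {state}")
    print("\n".join(free_energy_mdp_lines(state)))
```

Generating both variants from one template keeps the two end-state parameter files identical in every other respect.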
After the energy minimization runs for the A and B states are
complete, MD simulations can be started from the energy mini-
mized conformations. Similarly to the energy minimization, the
simulation parameters are identical to the conventional MD runs,
except for setting the free-energy and init-lambda flags for
the simulations in state A and B, respectively (Fig. 4). These equi-
librium runs are used to sample the relevant phase space volumes,
i.e., the conformational changes in the WT and mutated variant of
Trp cage. Therefore, the ensembles generated during the equilib-
rium runs will define how accurately the free energy difference will
be estimated. This consideration dictates the sampling time: the
simulation time should be sufficient to sample the transitions that
are considered to be relevant. For example, if a protein is known to
undergo large-scale conformational changes and the introduced
mutation may be affecting the populations of these conformers,
the simulation time has to be long enough to properly sample such
transitions. Equilibrium simulations in this case could require
microseconds or longer to converge. On the other hand, it is
often important to estimate the free energy difference for a struc-
ture that would remain close to its experimentally resolved struc-
ture. In this scenario, it is sufficient to sample smaller changes in
rotameric states of the side chains and minor backbone motions. In
previous large-scale amino acid scans investigating protein thermo-
dynamic stabilities, we have observed good agreement with experi-
mental data when using 10–20 ns of equilibrium sampling [1, 20].
Another issue to consider when choosing the sampling time is
the definition of states for which the free energy difference will be
calculated. In the Trp cage example, we aim to estimate the
mutation-induced change in the folding free energy.
This implies that one of the end states that we need to simulate
needs to be the folded state, while the other needs to be the
unfolded state. If we were to introduce a destabilizing mutation
(in fact W6F has been shown to strongly destabilize Trp cage
[73, 74]), over a longer simulation time the protein would unfold.
Thus, the definition of the folded state used in the free energy
calculation would be violated, rendering the calculated free energy
differences inaccurate. For the Trp cage W6F mutation example,
we will use equilibrium simulations of 10 ns: short enough such
that no spontaneous unfolding occurs.
3.6.3 Non-equilibrium Transitions
Once the equilibrium simulations are completed, we can proceed to
the non-equilibrium part of the simulation protocol. Fast
non-equilibrium transitions serve the purpose of connecting the
two physical states (A and B) and allow obtaining the free energy
difference between them. These transitions are started from snap-
shots extracted from the two equilibrium trajectories. From each
equilibrium trajectory, we discard the first 2 ns as equilibration
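; forward transitions (started from state A, lambda: 0 -> 1)
nsteps = 25000
nstcalcenergy = 1
nstdhdl = 1
free-energy = yes
init-lambda = 0
delta-lambda = 4e-5
sc-alpha = 0.3
sc-sigma = 0.25
sc-power = 1
sc-coul = yes
; reverse transitions (started from state B, lambda: 1 -> 0) use instead
init-lambda = 1
delta-lambda = -4e-5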
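The value of delta-lambda follows from the requirement that λ covers the full interval between 0 and 1 over the nsteps steps of a transition. The Python sketch below reproduces the numbers in the settings above; the function names are hypothetical, and the 2 fs time step (dt_ps = 0.002) is an assumption that is not stated in the settings shown.

```python
def delta_lambda(nsteps, forward=True):
    """Lambda increment per MD step so that lambda spans 0..1 (or 1..0)."""
    if nsteps <= 0:
        raise ValueError("nsteps must be positive")
    step = 1.0 / nsteps
    return step if forward else -step


def transition_length_ps(nsteps, dt_ps=0.002):
    """Length of one transition in ps for an assumed time step dt_ps."""
    return nsteps * dt_ps


print(delta_lambda(25000), delta_lambda(25000, forward=False))
print(transition_length_ps(25000))
```

When the transition length is changed, nsteps and delta-lambda have to be adjusted together; otherwise λ will not reach the opposite end state.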
3.7 Analysis
The integration over the ∂H/∂λ curves and the free energy difference
estimation can be performed with the pmx script analyze_dhdl.py:
python analyze_dhdl.py -fA stateA/dhdl*.xvg -fB stateB/dhdl*.xvg
The script will output the summary of results in a text file
containing the estimate of the free energy difference using three
estimators: Crooks Gaussian Intersection (CGI), Bennett’s Acceptance
Ratio (BAR), and Jarzynski’s equality. While CGI and BAR
use the work distributions generated in both the forward and reverse
directions, Jarzynski’s estimator is one-directional. We recommend
using the BAR estimation for the ΔG value, as it utilizes all the
available work values from both directions and makes no assumptions
about the shape of the work distributions. Conveniently, the
script also generates plots of the work values over time and of their
distributions (Fig. 5), which are useful for detecting potential sampling
issues or insufficient overlap of the work distributions.
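To make the integration step concrete, the following Python sketch computes the work of a single transition from its ∂H/∂λ values. It is an illustration, not the code of analyze_dhdl.py. It assumes that comment and plot-directive lines in the .xvg file start with '#' or '@', that the first two columns hold the time and ∂H/∂λ, and that λ changes linearly between the first and last recorded frame.

```python
def read_dhdl(lines):
    """Parse (time, dH/dlambda) pairs from the lines of an .xvg file."""
    series = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped[0] in "#@":
            continue  # skip comments and plot directives
        fields = stripped.split()
        series.append((float(fields[0]), float(fields[1])))
    return series


def work_from_dhdl(dhdl_values, lambda_start=0.0, lambda_end=1.0):
    """Trapezoidal integral of dH/dlambda over a linear lambda ramp."""
    n = len(dhdl_values)
    if n < 2:
        raise ValueError("at least two dH/dlambda values are required")
    d_lambda = (lambda_end - lambda_start) / (n - 1)
    inner = sum(dhdl_values[1:-1])
    return d_lambda * (0.5 * (dhdl_values[0] + dhdl_values[-1]) + inner)
```

For a reverse transition, calling work_from_dhdl with lambda_start=1.0 and lambda_end=0.0 yields the work with the sign appropriate for the λ: 1 → 0 direction.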
The convergence of the results can be assessed in various ways.
Firstly, if a systematic drift of the work values over time is observed,
it usually indicates lack of convergence during the equilibrium
sampling stage. The work values are likely to drift due to a confor-
mational change and it may be important to thoroughly sample the
significant conformational motions in the protein. Lack of conver-
gence may also be deduced from the error values provided together
with the free energy estimates. The uncertainties of the CGI and
BAR estimators are sensitive to the lack of the overlap between the
forward and reverse work distributions (see Note 5). A large uncer-
tainty in the ΔG estimate indicates that the overlap between the
work distributions might be insufficient. Slower transitions keep the
system closer to equilibrium, so that less work is dissipated along the
path and the overlap between work distributions generally increases.
Running more non-equilibrium transitions increases the probability
of observing work values with low dissipation, which also contri-
butes toward good overlap of the work distributions.
The most reliable way to assess the precision of the free energy
estimates obtained is to repeat the whole procedure, including
equilibrium and non-equilibrium simulations, multiple times. The
calculated ΔG values and their spread obtained from multiple
independent calculations more accurately capture under-sampling
issues. For the Trp cage W6F mutation, we have obtained a ΔG
value of 4.29 ± 0.63 kJ/mol (Fig. 5) from a single calculation.
Then, we repeated the whole calculation five times, from the system
preparation to the equilibrium and non-equilibrium simulations.
The average free energy value we obtained was 3.73 kJ/mol
with a standard error of 0.88 kJ/mol. This result confirms that the
ΔG estimate obtained can be considered to be reliable.
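Summarizing independent repeats amounts to computing their mean and standard error. A minimal Python sketch is shown below; the function name is hypothetical and the numbers in the example are placeholders, not the values obtained for Trp cage.

```python
import math
import statistics


def mean_and_standard_error(estimates):
    """Return the mean and the standard error of independent dG estimates."""
    if len(estimates) < 2:
        raise ValueError("at least two independent estimates are required")
    mean = statistics.mean(estimates)
    sem = statistics.stdev(estimates) / math.sqrt(len(estimates))
    return mean, sem


# Placeholder values only; these are NOT the results of the repeats above.
print(mean_and_standard_error([3.0, 4.0, 5.0]))
```

The standard error computed this way reflects the spread between complete repeats, so it also captures variability that the uncertainty of a single calculation can miss.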
3.8 Double Free Energy Difference
So far we have calculated the free energy difference for one leg of
the thermodynamic cycle (Fig. 1): mutation in a folded protein. To
obtain the final double free energy difference the same procedure
needs to be performed for the unfolded Trp cage peptide. It has
been demonstrated that in the context of the alchemical free energy
calculations the unfolded state can be approximated by a capped
tripeptide with the residue of interest surrounded by two
glycines [20].
Thus, in our Trp cage example, the ΔΔG of folding for the W6F
mutation is estimated to be 13.67 ± 0.71 kJ/mol. This calculated
estimate closely matches the experimentally measured destabilization
of 12.5 ± 0.6 kJ/mol [73, 74]. A previous large-scale study
compared calculated and experimental ΔΔG values for protein
thermostability changes upon mutation for the proteins barnase
and Staphylococcal nuclease [1]. It was found that the mean
unsigned error in the predictions was approximately 4 kJ/mol,
with the uncertainty due to finite sampling, the force field, and the
experimental error contributing equally to the discrepancy between
calculated and experimental ΔΔG values. Therefore, the calculated
ΔΔG value for the Trp cage W6F mutation falls well within the range
of the expected accuracy.
4.2 More Than One Mutation at Once
The protocol in this chapter described an example of a single amino
acid mutation in a protein. pmx, however, also allows introducing
multiple mutations at once. This can be done either by
interactively selecting more than one mutation to be applied or by
providing an external file with every mutation defined in a new line
of a text file. The pmx webserver also provides the option to
introduce multiple mutations.
The caveat of performing an alchemical transformation for
several amino acid mutations at once is a slower convergence of
the free energy estimate. Having more mutations imposes a larger
perturbation to the system. Hence, more work will be dissipated
along the path and the free energy estimate will become less accurate.
In such a case, performing the non-equilibrium transitions
more slowly may be necessary.
Another way to calculate the effect of multiple mutations is to
perform the mutations sequentially. For example, the free energy
difference of introducing the mutations X and Y at once is equal to
the combined ΔG of performing the mutation X first and in a
separate setup calculating ΔG for the Y mutation in a system where
the X mutation is already present. In fact, since free energy is a state
function, the sequence of introducing the mutations does not
influence the final ΔG estimate, thus the mutation Y can be
performed first and then the mutation X can follow. The free
energy differences calculated in all three scenarios (X and Y at
once, first X then Y, first Y then X) ought to yield the same
estimate. Therefore, the spread of these three ΔG values could
serve as an indicator of the uncertainty in the calculations.
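The consistency check across the three routes can be sketched as below. All ΔG values are hypothetical and the function names are assumptions for illustration; in practice each value would come from its own set of equilibrium and non-equilibrium simulations.

```python
def route_estimates(dg_xy, dg_x, dg_y_given_x, dg_y, dg_x_given_y):
    """Total ΔG for introducing X and Y along three alternative routes."""
    return {
        "X and Y at once": dg_xy,
        "first X, then Y": dg_x + dg_y_given_x,
        "first Y, then X": dg_y + dg_x_given_y,
    }


def spread(estimates):
    """Difference between the largest and smallest route estimate."""
    values = list(estimates.values())
    return max(values) - min(values)


# Hypothetical values in kJ/mol.
routes = route_estimates(dg_xy=10.2, dg_x=4.1, dg_y_given_x=5.6,
                         dg_y=7.0, dg_x_given_y=3.5)
for name, value in routes.items():
    print(f"{name:18s} {value:6.2f} kJ/mol")
print(f"spread across routes: {spread(routes):.2f} kJ/mol")
```

Since free energy is a state function, a spread much larger than the individual uncertainties would point to insufficient sampling in at least one route.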
5 Summary
6 Notes
References
molecule biosensors in eukaryotes. eLife 4:323–329
4. Zhou L, Bosscher M, Zhang C, Özçubukçu S, Zhang L, Zhang W, Li CJ, Liu J, Jensen MP, Lai L, He C (2014) A protein engineered to bind uranyl selectively and with femtomolar affinity. Nat Chem 6(3):236–241
5. Correia BE, Bates JT, Loomis RJ, Baneyx G, Carrico C, Jardine JG, Rupert P, Correnti C, Kalyuzhniy O, Vittal V, Connell MJ, Stevens E, Schroeter A, Chen M, MacPherson S, Serra AM, Adachi Y, Holmes MA, Li Y, Klevit RE, Graham BS, Wyatt RT, Baker D, Strong RK, Crowe JE, Johnson PR, Schief WR (2014) Proof of principle for epitope-focused vaccine design. Nature 507(7491):201–206
6. Koday MT, Nelson J, Chevalier A, Koday M, Kalinoski H, Stewart L, Carter L, Nieusma T, Lee PS, Ward AB, Wilson IA, Dagley A, Smee DF, Baker D, Fuller DH (2016) A computationally designed hemagglutinin stem-binding protein provides in vivo protection from influenza independent of a host immune response. PLoS Pathog 12(2):e1005409
7. Clark AJ, Gindin T, Zhang B, Wang L, Abel R, Murret CS, Xu F, Bao A, Lu NJ, Zhou T, Kwong PD, Shapiro L, Honig B, Friesner RA (2017) Free energy perturbation calculation of relative binding free energy between broadly neutralizing antibodies and the gp120 glycoprotein of HIV-1. J Mol Biol 429(7):930–947
8. Fowler PW, Cole K, Gordon NC, Kearns AM, Llewelyn MJ, Peto TEA, Crook DW, Walker AS (2018) Robust prediction of resistance to trimethoprim in Staphylococcus aureus. Cell Chem Biol 25:339–349
9. Hauser K, Negron C, Albanese SK, Ray S, Steinbrecher T, Abel R, Chodera JD, Wang L (2018) Predicting resistance of clinical Abl mutations to targeted kinase inhibitors using alchemical free-energy calculations. Commun Biol 1:70
10. Tinberg CE, Khare SD, Dou J, Doyle L, Nelson JW, Schena A, Jankowski W, Kalodimos CG, Johnsson K, Stoddard BL, Baker D (2013) Computational design of ligand-binding proteins with high affinity and selectivity. Nature 501(7466):212
11. Yang W, Lai L (2017) Computational design of ligand-binding proteins. Curr Opin Struct Biol 45:67–73
12. Brender JR, Zhang Y (2015) Predicting the effect of mutations on protein-protein binding interactions through structure-based interface profiles. PLoS Comput Biol 11(10):e1004494
13. Pires DEV, Ascher DB, Blundell TL (2014) mCSM: predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics 30(3):335–342
14. Schymkowitz J, Borg J, Stricher F, Nys R, Rousseau F, Serrano L (2005) The FoldX web server: an online force field. Nucleic Acids Res 33(Suppl 2):W382–W388
15. Kortemme T, Baker D (2002) A simple physical model for binding energy hot spots in protein-protein complexes. Proc Natl Acad Sci USA 99(22):14116–14121
16. Leaver-Fay A, Tyka M, Lewis SM, Lange OF, Thompson J, Jacak R, Kaufman K, Renfrew PD, Smith CA, Sheffler W, Davis IW, Cooper S, Treuille A, Mandell DJ, Richter F, Ban YEA, Fleishman SJ, Corn JE, Kim DE, Lyskov S, Berrondo M, Mentzer S, Popović Z, Havranek JJ, Karanicolas J, Das R, Meiler J, Kortemme T, Gray JJ, Kuhlman B, Baker D, Bradley P (2011) Rosetta3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487(C):545–574
17. Petukh M, Li M, Alexov E (2015) Predicting binding free energy change caused by point mutations with knowledge-modified MM/PBSA method. PLoS Comput Biol 11(7):e1004276
18. Beard H, Cholleti A, Pearlman D, Sherman W, Loving KA (2013) Applying physics-based scoring to calculate free energies of binding for single amino acid mutations in protein-protein complexes. PLoS ONE 8(12):e82849
19. Moreira IS, Fernandes PA, Ramos MJ (2007) Computational alanine scanning mutagenesis - An improved methodological approach. J Comput Chem 28(3):644–654
20. Seeliger D, de Groot BL (2010) Protein thermostability calculations using alchemical free energy simulations. Biophys J 98(10):2309–2316
21. Chipot C, Pohorille A (eds) (2007) Free energy calculations: theory and applications in chemistry and biology, vol 86. Springer, Berlin
22. Neidigh JW, Fesinmeyer RM, Andersen NH (2002) Designing a 20-residue protein. Nat Struct Mol Biol 9(6):425–430
23. Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E (2015) GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 2:1–7
24. Gapsys V, Michielssens S, Seeliger D, de Groot BL (2015) pmx: automated protein structure and topology generation for alchemical perturbations. J Comput Chem 36(5):348–354
25. Chipot C (2014) Frontiers in free-energy calculations of biological systems. Wiley Interdiscip Rev Comput Mol Sci 4(1):71–89
26. Gapsys V, Michielssens S, Peters JH, de Groot BL, Leonov H (2015) Molecular modeling of proteins, vol 1215. Humana Press, New York
27. Pohorille A, Jarzynski C, Chipot C (2010) Good practices in free-energy calculations. J Phys Chem B 114(32):10235–10253
28. Hansen N, van Gunsteren WF (2014) Practical aspects of free-energy calculations: a review. J Chem Theory Comput 10(7):2632–2647
29. Goette M, Grubmüller H (2009) Accuracy and convergence of free energy differences calculated from nonequilibrium switching processes. J Comput Chem 30(3):447–456
30. Jarzynski C (1997) Nonequilibrium equality for free energy differences. Phys Rev Lett 78(14):2690–2693
31. Jarzynski C (1997) Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach. Phys Rev E 56:5018–5035
32. Crooks GE (1998) Nonequilibrium measurements of free energy differences for microscopically reversible Markovian systems. J Stat Phys 90(5/6):1481–1487
33. Crooks GE (1999) Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences. Phys Rev E 60(3):2721–2726
34. Crooks GE (2000) Path-ensemble averages in systems driven far from equilibrium. Phys Rev E 61(3):2361–2366
35. Hummer G, Szabo A (2001) Free energy reconstruction from nonequilibrium single-molecule pulling experiments. Proc Natl Acad Sci USA 98(7):3658–3661
36. Hummer G (2001) Fast-growth thermodynamic integration: error and efficiency analysis. J Chem Phys 114(17):7330–7337
37. Hummer G, Szabo A (2005) Free energy surfaces from single-molecule force spectroscopy. Acc Chem Res 38(7):504–513
38. Zwanzig RW (1954) High-temperature equation of state by a perturbation method. I. nonpolar gases. J Chem Phys 22(8):1420–1426
39. Kirkwood JG (1935) Statistical mechanics of fluid mixtures. J Chem Phys 3(5):300–313
40. Cuendet MA (2006) The Jarzynski identity derived from general Hamiltonian or non-Hamiltonian dynamics reproducing NVT or NPT ensembles. J Chem Phys 125(14):144109
41. Wood RH, Mühlbauer WCF, Thompson PT (1991) Systematic errors in free energy perturbation calculations due to a finite sample of configuration space: sample-size hysteresis. J Phys Chem 95(17):6670–6675
42. Gore J, Ritort F, Bustamante C (2003) Bias and error in estimates of equilibrium free-energy differences from nonequilibrium measurements. Proc Natl Acad Sci USA 100(22):12564–12569
43. Nanda H, Lu N, Woolf TB (2005) Using non-Gaussian density functional fits to improve relative free energy calculations. J Chem Phys 122(13):134110
44. Massey FJ Jr (1951) Kolmogorov-Smirnov test for goodness of fit. Test 46(253):68–78
45. Efron B, Tibshirani RJ (1994) An introduction to the bootstrap, vol 5, 1st edn. Chapman and Hall/CRC, London/West Palm Beach
46. Bennett CH (1976) Efficient estimation of free energy differences from Monte Carlo data. J Comput Phys 22(2):245–268
47. Shirts MR, Bair E, Hooker G, Pande VS (2003) Equilibrium free energies from nonequilibrium measurements using maximum-likelihood methods. Phys Rev Lett 91(14):140601
48. Nelder JA, Mead R (1964) A simplex method for function minimization. Comput J 7(4):308–313
49. Hahn AM, Then H (2010) Measuring the convergence of Monte Carlo free-energy calculations. Phys Rev E Stat Nonlinear Soft Matter Phys 81(4):041117
50. Lindorff-Larsen K, Trbovic N, Maragakis P, Piana S, Shaw DE (2012) Structure and dynamics of an unfolded protein examined by molecular dynamics simulation. J Am Chem Soc 134(8):3787–3791
51. Rauscher S, Gapsys V, Gajda MJ, Zweckstetter M, de Groot BL, Grubmüller H (2015) Structural ensembles of intrinsically disordered proteins depend strongly on force field: a comparison to experiment. J Chem Theory Comput 11(11):5513–5524
52. Prevost M, Wodak SJ, Tidor B, Karplus M (1991) Contribution of the hydrophobic effect to protein stability: analysis based on simulations of the Ile-96 → Ala mutation in barnase. Proc Natl Acad Sci USA 88(23):10880–10884
53. Sneddon SF, Tobias DJ (1992) The role of packing interactions in stabilizing folded proteins. Biochemistry 31(10):2842–2846
54. Pitera JW, Kollman PA (2000) Exhaustive mutagenesis in silico: multicoordinate free energy calculations on proteins and peptides. Proteins Struct Funct Bioinf 41(3):385–397
55. Pearlman DA, Kollman PA (1991) The overlooked bond-stretching contribution in free energy perturbation calculations. J Chem Phys 94(6):4532
56. Pearlman DA (1994) A comparison of alternative approaches to free energy calculations. J Phys Chem 98(5):1487–1493
57. Boresch S, Karplus M (1999) The role of bonded terms in free energy simulations: 1. Theoretical analysis. J Phys Chem A 103(1):103–118
58. Boresch S, Karplus M (1996) The Jacobian factor in free energy simulations. J Chem Phys 105(12):5145–5154
59. Boresch S, Karplus M (1999) The role of bonded terms in free energy simulations. 2. Calculation of their influence on free energy differences of solvation. J Phys Chem A 103(1):119–136
60. Beutler TC, Mark AE, van Schaik RC, Gerber PR, van Gunsteren WF (1994) Avoiding singularities and numerical instabilities in free energy calculations based on molecular simulations. Chem Phys Lett 222(6):529–539
61. Zacharias M, Straatsma TP, McCammon JA (1994) Separation-shifted scaling, a new scaling method for Lennard-Jones interactions in thermodynamic integration. J Chem Phys 100:9025–9031
62. Pham TT, Shirts MR (2011) Identifying low variance pathways for free energy calculations of molecular transformations in solution phase. J Chem Phys 135(3):034114
63. Gapsys V, Seeliger D, de Groot BL (2012) New soft-core potential function for molecular dynamics based alchemical free energy calculations. J Chem Theory Comput 8(7):2373–2382
64. Buelens FP, Grubmüller H (2012) Linear-scaling soft-core scheme for alchemical free energy calculations. J Comput Chem 33(1):25–33
65. Gapsys V, de Groot BL (2017) pmx Webserver: a user friendly interface for alchemistry. J Chem Inf Model 57(2):109–114
66. Šali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234(3):779–815
67. Schrödinger, LLC (2015) The PyMOL molecular graphics system, version 1.8, November 2015
68. Vriend G (1990) WHAT IF: a molecular modeling and drug design program. J Mol Graph 8(1):52–56
69. Hornak V, Abel R, Okur A, Strockbine B, Roitberg A, Simmerling C (2006) Comparison of multiple amber force fields and development of improved protein backbone parameters. Proteins Struct Funct Bioinf 65(3):712–725
70. Best RB, Hummer G (2009) Optimized molecular dynamics force fields applied to the helix-coil transition of polypeptides. J Phys Chem B 113(26):9004–9015
71. Lindorff-Larsen K, Piana S, Palmo K, Maragakis P, Klepeis JL, Dror RO, Shaw DE (2010) Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins Struct Funct Bioinf 78(8):1950–1958
72. Lindahl E (2015) Molecular dynamics simulations. In: Molecular modeling of proteins. Springer, Berlin, pp 3–26
73. Barua B, Andersen NH (2001) Determinants of miniprotein stability: can anything replace a buried H-bonded Trp sidechain? Lett Pept Sci 8(3–5):221–226
74. Barua B, Lin JC, Williams VD, Kummler P, Neidigh JW, Andersen NH (2008) The Trp-cage: optimizing the stability of a globular miniprotein. Protein Eng Des Sel 21(3):171–185
75. Darden T, York D, Pedersen L (1993) Particle mesh Ewald: an Nlog(N) method for Ewald sums in large systems. J Chem Phys 98(12):10089–10092
76. Essmann U, Perera L, Berkowitz ML, Darden T, Lee H, Pedersen LG (1995) A smooth particle mesh Ewald method. J Chem Phys 103(19):8577–8593
77. Rocklin GJ, Mobley DL, Dill KA, Hünenberger PH (2013) Calculating the binding free energies of charged species based on explicit-solvent simulations employing lattice-sum methods: an accurate correction scheme for electrostatic finite-size effects. J Chem Phys 139(18):184103
78. Lin Y-L, Aleksandrov A, Simonson T, Roux B (2014) An overview of electrostatic free energy computations for solutions and proteins. J Chem Theory Comput 10(7):2690–2709
79. Hub JS, de Groot BL, Grubmüller H, Groenhof G (2014) Quantifying artifacts in Ewald simulations of inhomogeneous systems with a net charge. J Chem Theory Comput 10(1):381–390
Chapter 3
Abstract
Gene duplication is an important process in the evolution of gene content in eukaryotic genomes.
Understanding when gene duplicates contribute new molecular functions to genomes through molecular
adaptation is one important goal in comparative genomics. In large gene families, however, characterizing
adaptation and neofunctionalization across species is challenging, as models have traditionally quantified
the timing of duplications without considering underlying gene trees. This protocol combines multiple
approaches to detect adaptation in protein duplicates at a phylogenetic scale. We include a description of
models for gene tree-species tree reconciliation that enable different types of inference, as well as a practical
guide to their use. Although simulation-based approaches successfully detect shifts in the rate of duplica-
tion/retention, the conflation between the duplication and retention processes, the distinct trajectories of
duplicates under non-, sub-, and neofunctionalization, as well as dosage effects offer hitherto unexplored
analytical avenues. We introduce mathematical descriptions of these probabilities and offer a road map to
computational implementation whose starting point is parsimony reconciliation. Sequence evolution
information based on the ratio of nonsynonymous to synonymous nucleotide substitution rates (dN/dS)
can be combined with duplicate survival probabilities to better predict the emergence of new molecular
functions in retained duplicates. Together, these methods enable characterization of potentially
adaptive candidate duplicates whose neofunctionalization may contribute to phenotypic divergence across
species.
Key words Gene duplication, Gene tree, Birth-death models, Molecular evolution, dN/dS
1 Introduction
1.1 Gene Duplication and Membrane Proteins
The evolutionary mechanisms for generating novelty are key to
understanding variation in phenotypic and taxonomic diversity
across the Tree of Life. Identifying the genetic mechanisms behind
the origin and maintenance of phenotypic diversity is therefore a
fundamental objective of evolutionary genetics. While base pair
substitutions provide a means for understanding the novel function
of existing genes, the duplication of entire genes and genomes
offers a source of new variation for functional diversification. Dupli-
cations are primary sources of innovation, from large-scale whole-
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_3, © Springer Science+Business Media, LLC, part of Springer Nature 2019
50 Laurel R. Yohe et al.
[Figure 1 labels: complete duplication; candidate fates: dosage effect, pseudogenization, neofunctionalization]
Fig. 1 Theoretical model of single-copy gene duplication and mechanisms for how a duplicate is fixed or lost in
a population (top). Different patterns indicate different fixed amino acid differences. Grayed genes indicate
loss of function. Note changes can also happen in regulatory regions, but are not shown here. The species-
level model (bottom) is a cartoon of hypothetical scenarios that may be observed across species and their
potential mechanisms. Figure adapted from [11]
2 Methods
2.1 Approaches and Limitations to Studying Gene Duplication
There are two major approaches to investigating the evolutionary
process of gene duplication among species: birth-death models fit to
a species tree and gene tree-species tree reconciliation. Several
methodologies have been published using gene tree-species tree
reconciliation [18–22]. This approach allows detection of branches in
2.1.1 Parsimony-Based Reconciliation
One early and common approach to gene tree-species tree
reconciliation is to use the principle of parsimony to minimize either the
duplication or the loss cost associated with mapping lineages of
gene trees to branches of the species tree. This approach provides a
valuable preliminary analysis for identifying discordance between
the gene tree topology and the species tree (when the species tree
relationship is not recovered within the gene family). Early
approaches required the gene and species trees to be fully resolved
with binary nodes, but subsequent approaches relaxed this assump-
tion (see [27] for a review). As in parsimony-based tree reconstruc-
tion, the insensitivity of parsimony to duplication rates on branches
with different lengths is a potential problem. A previous study has
evaluated the relationship of different costs of accounting for gene
Molecular Evolution of Gene Duplication 53
2.1.2 Birth-Death Models of Gene Duplication
Early models for gene duplication were traditional birth and death
models. In these, the number of duplicate copies evolves through a
stochastic birth-death process in which retention and loss are
modeled with an exponential distribution [32]. Key parameters
estimated in birth-death models are the birth and death rates of
the genes, as well as the number of gene copies at each internal
node. These models set up a statistical framework that describes
how rates of gene duplication and loss may vary in different parts of
the tree.
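A constant-rate birth-death process of this kind can be simulated in a few lines. The sketch below is a generic Gillespie-style simulation with per-copy birth and death rates and exponentially distributed waiting times; it is not the implementation used by any of the programs cited in this chapter, and the function name and parameter values are hypothetical.

```python
import random


def simulate_copy_number(n0, birth, death, t_max, rng):
    """Simulate the number of gene copies after t_max time units.

    Each copy duplicates at rate `birth` and is lost at rate `death`,
    so the waiting time to the next event is exponential with rate
    n * (birth + death).
    """
    if birth < 0 or death < 0:
        raise ValueError("rates must be non-negative")
    n, t = n0, 0.0
    per_copy = birth + death
    while n > 0 and per_copy > 0:
        t += rng.expovariate(n * per_copy)
        if t > t_max:
            break
        if rng.random() < birth / per_copy:
            n += 1  # duplication
        else:
            n -= 1  # loss
    return n


# Hypothetical rates: distribution of copy numbers over replicate simulations.
rng = random.Random(42)
replicates = [simulate_copy_number(10, 0.01, 0.01, 100.0, rng) for _ in range(1000)]
print("mean copy number:", sum(replicates) / len(replicates))
```

Replicate simulations of this type provide a null distribution against which an observed number of duplications or losses on a branch can be compared.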
In the context of our example with the APC transporters in
hemipteran insects, the parsimony inference suggests there may be
an increased rate of gene duplication in Sternorrhyncha compared
to other insects in the order. Likelihood-based birth-death models
[Figure 2: panel (a) species tree with inferred gene copy numbers and branch events; panel (b) gene tree with tips labeled by species (citrus mealybug, potato psyllid, whitefly, pea aphid, cicada, kissing bug, human body louse, fruit fly)]
Fig. 2 (a) Species tree for the insect order Hemiptera, with the sap-feeding suborder Sternorrhyncha
marked. The human body louse is an outgroup. The fruit fly (Drosophila melanogaster) was omitted from the
species tree for clarity. Gray boxes indicate the number of gene copies inferred for each species and at each
ancestral node. Branch labels indicate the number of duplications (+) or losses (−) inferred to have occurred
at each respective branch as inferred using parsimony. (b) Gene tree of the APC amino acid transporter family.
Each tip is a unique gene copy belonging to the species labeled at the tip
[Figure 3: panel (a) time-calibrated species tree with inferred copy numbers and branch events, time axis in Ma (200 to 0); panel (b) histogram of simulated duplications, p < 10⁻⁴; panel (c) histogram of simulated losses, p = 0.24; y-axis of (b) and (c): Replicate]
Fig. 3 Likelihood-based reconciliation of the APC transport proteins in Hemiptera. (a) Duplications and losses
labeled on branches were inferred from reconciliation analyses in DupliPHY-ML v. 1.2 [31]. Gray boxes are
number of APC transporter gene copies in each species or inferred at the ancestral node. (b) Simulation of
expected number of duplications for Sternorrhyncha under a null birth-death process. The dotted line is the
cumulative number of duplications observed from the DupliPHY-ML results. (c) Simulation of expected number
of losses for Sternorrhyncha under a null birth-death process. The dotted line is the cumulative number of
losses observed from the DupliPHY-ML results. P-values test whether the observed value is significantly
different from the null distribution. Simulations were performed using GenPhyloData within the JPrIME v. 0.3.6
software [21]. Code for simulations is available in the supplementary material of [30]
Table 1
Hemiptera APC transporter gene family parameter estimates of likelihood-based birth-death model
and likelihood ratio test results of model comparisons between a null model of a single birth rate (b)
for the entire tree or two rates of b, one for the background branches and one for sternorrhynchans
2.2 Modeling Different Fates of Gene Duplicates: Integrating Reconciliation and Birth-Death
Several biological models have been proposed to depict the
mechanisms that lead to different evolutionary fates for a gene
duplicate (Fig. 1), including pseudogenization, neofunctionalization,
subfunctionalization, or dosage effect. These mechanisms
give rise to quite different retention dynamics that can lead to a
time-dependent loss rate of gene duplicates, expressed as a function
λ(t). For nonfunctionalization, the loss rate is constant over time.
In contrast, the loss rates of neofunctionalization and subfunctio-
nalization decline over time and have been described with a Weibull
hazard function [8]. For dosage effect, the rate of loss increases
over time unless dosage effects are combined with subsequent
neofunctionalization or subfunctionalization [33]. Alternative for-
mulations with very similar dynamics have also been proposed
[13]. Figure 4 depicts the shapes of these hazard functions under
different scenarios.
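The qualitative shapes described above can be reproduced with a plain Weibull hazard, whose shape parameter controls whether the loss rate falls, stays constant, or rises with duplicate age. This is only a sketch: the exact functional forms used in [8] and [33] may differ, and using a shape parameter above 1 to mimic the dosage scenario is an assumption made here for illustration.

```python
import math


def weibull_hazard(t, shape, scale=1.0):
    """Instantaneous loss rate at duplicate age t > 0 for a Weibull hazard."""
    if t <= 0:
        raise ValueError("t must be positive")
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def weibull_survival(t, shape, scale=1.0):
    """Probability that a duplicate is still retained at age t."""
    return math.exp(-((t / scale) ** shape))


# shape < 1: declining loss rate (neo- or subfunctionalization-like)
# shape = 1: constant loss rate (nonfunctionalization)
# shape > 1: rising loss rate (used here as a stand-in for dosage effects)
for label, shape in (("declining", 0.5), ("constant", 1.0), ("rising", 2.0)):
    rates = [weibull_hazard(t, shape) for t in (0.5, 1.0, 2.0, 4.0)]
    print(label, [round(r, 3) for r in rates])
```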
From Reconciliation Probabilities to Birth-Death Models
In most birth-death model frameworks, the time-dependent
loss rates have been incorporated in a generalized birth-death
process to model the fate of gene duplicates. This means the
evolution of the gene copies in a gene family is modeled as a pure
birth process with a time-dependent birth rate, which is a function
of the loss and birth rates in the original birth-death process. Since
the loss rate characterizes the underlying retention mechanisms, the
inference of the loss rates can identify either nonfunctionalization,
subfunctionalization, dosage, or neofunctionalization as responsi-
ble for the observed site patterns of gene family data. However, an
important caveat of all time-dependent models is that any rate of
loss that is computed is a function of time along branches of the
[Figure 4: λ(t) plotted against time for nonfunctionalization, neofunctionalization, subfunctionalization, and dosage]
Fig. 4 Shape of the hazard function through time representing the rate of gene
loss under the four different gene retention scenarios. Figure modified from [39]
and the probability that the number of copies stays the same
\[
P(n_{t+\Delta t} = n_t) = 1 - \left( n_t b + \sum_{i=1}^{n_t} \lambda(t_i^{*}) \right) \Delta t + o(\Delta t).
\]
The parameter $b$ is the birth rate; $n_t$ is the number of gene copies
at the present time; $\lambda(t_i^{*})$ is the loss rate of gene copy $i$ at age $t_i^{*}$.
The three equations lead to a stochastic differential equation char-
acterizing the age-dependent birth-death process. When the loss
rate is constant (nonfunctionalization), the age-dependent birth-
death model is identical to the time-dependent birth-death model
derived from the reconstructed process (see [34] for derivation).
For neofunctionalization and subfunctionalization, it has been
demonstrated by simulation that the likelihood function of the
[Figure 5: gene tree with duplication and loss events drawn inside a species tree; speciation times t0 to t3, gene divergence times g1 and g2, events E1 to E5 with E4 squared; species tips D, C, B, A; gene tips D, C, B1, A1, B2, A2, A3]
Fig. 5 Cartoon of species tree-gene tree reconciliation. Speciation times (t_i) and
gene divergence times (g_i) are noted on nodes. E4 is squared because it is
counting both branches from time t1. Event probabilities are listed in Table 2
Table 2
Events and probabilities of Fig. 5

Event | Description               | Probability
E2    | Retain duplicate          | $e^{-\int_{g_1 - t_2}^{g_1 - t_1} \lambda(t)\,dt}$
E3    | Lose duplicate            | $1 - e^{-\int_{g_1 - t_2}^{g_1} \lambda(t)\,dt}$
E4    | Retain duplicate          | $e^{-\int_{g_1 - t_1}^{g_1} \lambda(t)\,dt}$
E5    | Duplication and retention | $e^{-\int_{0}^{g_2} \lambda(t)\,dt}$

The probability of the reconciled tree in Fig. 5 is the product of all event probabilities.
Gray arrows indicate probabilities that do not include a speciation event. The branch
length-dependent birth rate can also be incorporated, when relevant.
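Event probabilities of the kind listed in Table 2 can be evaluated numerically for any hazard function. The sketch below integrates λ(t) with the trapezoidal rule; the function names, the constant hazard, and the values of g1, g2, t1, and t2 are hypothetical, and the integration limits follow one reading of Table 2.

```python
import math


def integrated_hazard(hazard, a, b, steps=1000):
    """Trapezoidal approximation of the integral of hazard(t) from a to b."""
    h = (b - a) / steps
    total = 0.5 * (hazard(a) + hazard(b))
    for i in range(1, steps):
        total += hazard(a + i * h)
    return total * h


def p_retain(hazard, a, b):
    """Probability that a duplicate survives from age a to age b."""
    return math.exp(-integrated_hazard(hazard, a, b))


def p_lose(hazard, a, b):
    """Probability that a duplicate is lost between age a and age b."""
    return 1.0 - p_retain(hazard, a, b)


# Hypothetical example: constant loss rate and made-up node times.
loss_rate = lambda t: 0.3
g1, g2, t1, t2 = 5.0, 0.5, 1.0, 3.0
e2 = p_retain(loss_rate, g1 - t2, g1 - t1)
e3 = p_lose(loss_rate, g1 - t2, g1)
e4 = p_retain(loss_rate, g1 - t1, g1)
e5 = p_retain(loss_rate, 0.0, g2)
print("product of E2, E3, E4 squared, E5:", e2 * e3 * e4 ** 2 * e5)
```

A hazard that diverges at age zero, such as a Weibull hazard with shape below 1, would need a lower limit slightly above zero or an analytical integral instead of this simple rule.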
duplicates in the gene tree. The best-fit hazard function model can
be determined by model selection using the Akaike or Bayesian
Information Criterion.
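Model selection between candidate hazard functions can be sketched as follows. The log-likelihoods and parameter counts are hypothetical, and what counts as the number of observations for the Bayesian Information Criterion in a reconciliation setting is an assumption left to the user.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def best_model(fits, n_obs, criterion="aic"):
    """Return the name of the lowest-scoring model and all scores.

    `fits` maps a model name to a (log_likelihood, n_params) pair.
    """
    scores = {}
    for name, (lnl, k) in fits.items():
        scores[name] = aic(lnl, k) if criterion == "aic" else bic(lnl, k, n_obs)
    return min(scores, key=scores.get), scores


# Hypothetical fits of two hazard models to the same reconciled gene tree.
fits = {"constant hazard": (-120.0, 1), "Weibull hazard": (-112.5, 2)}
name, scores = best_model(fits, n_obs=50)
print(name, scores)
```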
It should be emphasized that this example only accounts for a
single set of events for one proposed reconciliation solution, as
opposed to multiple hidden events that may have also occurred.
Integrating over all possible reconciliation histories is, in theory, the
only way to account for all possible hidden events. However, this is
not a feasible solution given the possible number of hidden events
that may have occurred. A more tractable solution is to begin with a
parsimonious reconciliation and iteratively consider hidden events
and alternative reconciliations according to a branch and bound-
style approach. In this regard, a finite set of events (such as those
shown in Table 2) for each reconciliation history can be compared
with one another, and the most probable solution among this finite
set of specific histories can be calculated.
2.3 Combining Survival Probabilities with dN/dS
For each outcome in Fig. 1, there is an expected behavior of the
ratio of the rates of nonsynonymous (dN) to synonymous (dS)
substitutions (dN/dS or ω) for the gene copy (Fig. 6). The behaviors of this
ratio can reveal biologically meaningful interpretations relevant to
molecular adaptation. For example, analyses of mammalian olfac-
tory receptors, a hyperdiverse gene family that encodes G-protein-
coupled chemosensory receptors, have shown that some particular
orthologous gene groups have undergone rapid expansions and
have high dN/dS relative to the median, suggesting functional
diversification of these receptor types [35]. However, dN/dS is
not currently modeled in any methodology used to study gene
duplication, despite predictable functions under different gene
retention scenarios. When genes are initially redundant following
[Figure 6: dN/dS plotted against time since birth for nonfunctionalization, neofunctionalization, subfunctionalization, and dosage; reference line at dN/dS = 1]
3 Concluding Thoughts
Acknowledgements
References
1. Hoegg S, Brinkmann H, Taylor JS et al (2004) Phylogenetic timing of the fish-specific genome duplication correlates with the diversification of teleost fish. J Mol Evol 59:190–203
2. Jaillon O, Aury J-M, Brunet F et al (2004) Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431:946–957
3. Lien S, Koop BF, Sandve SR et al (2016) The Atlantic salmon genome provides insights into rediploidization. Nature 533:200–205
4. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815
5. De Bodt S, Maere S, Van De Peer Y (2005) Genome duplication and the origin of angiosperms. Trends Ecol Evol 20:591–597
6. Hollister JD (2015) Polyploidy: adaptation to the genomic environment. New Phytol 205:1034–1039
7. Liebeskind BJ, Hillis DM, Zakon HH (2015) Convergence of ion channel genome content in early animal evolution. Proc Natl Acad Sci U S A 112:E846–E851
8. Konrad A, Teufel AI, Grahnen JA et al (2011) Toward a general model for the evolutionary dynamics of gene duplicates. Genome Biol Evol 3:1197–1209
9. Hughes T, Liberles DA (2007) The pattern of evolution of smaller-scale gene duplicates in mammalian genomes is more consistent with neo- than subfunctionalisation. J Mol Evol 65:574–588
10. Hahn MW (2009) Distinguishing among evolutionary models for the maintenance of gene duplicates. J Hered 100:605–617
11. Sikosek T, Bornberg-Bauer E (2010) Evolution after and before gene duplication? In: Dittmar K, Liberles D (eds) Evolution after gene duplication. Wiley-Blackwell, Hoboken, NJ, pp 105–131
12. Zhao J, Teufel AI, Liberles DA et al (2015) A generalized birth and death process for modeling the fates of gene duplication. BMC Evol Biol 15:275
13. Teufel A, Zhao J, O’Reilly M et al (2014) On mechanistic modeling of gene content evolution: Birth-death models and mechanisms of gene birth and gene retention. Computation 2:112–130
14. Chothia C, Gough J, Vogel C et al (2003) Evolution of the protein repertoire. Science 300:1701–1703
15. von Heijne G (2006) Membrane-protein topology. Nat Rev Mol Cell Biol 7:909–918
16. Poolman B, Geertsma ER, Slotboom D-J (2007) A missing link in membrane protein evolution. Science 315:1229–1231
17. Nei M, Rooney AP (2005) Concerted and birth-and-death evolution of multigene families. Annu Rev Genet 39:121–152
18. Chen K, Durand D, Farach-colton M (2000) NOTUNG: a program for dating gene duplications. J Comput Biol 7:429–447
19. Berglund-Sonnhammer AC, Steffansson P, Betts MJ et al (2006) Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. J Mol Evol 63:240–250
20. Doyon JP, Ranwez V, Daubin V et al (2011) Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform 12:392–400
21. Sjöstrand J, Sennblad B, Arvestad L et al (2012) DLRS: gene tree evolution in light of a species tree. Bioinformatics 28:2994–2995
annotation using CAFE 3. Mol Biol Evol 30:1987–1997
27. Eulenstein O, Huzurbazar S, Liberles DA (2010) Reconciling phylogenetic trees. In: Dittmar K, Liberles D (eds) Evolution after gene duplication. Wiley-Blackwell, Hoboken, NJ, pp 185–206
28. Górecki P, Eulenstein O (2014) Refining discordant gene trees. BMC Bioinformatics 15:S3
29. Duncan RP, Husnik F, Van LJT et al (2014) Dynamic recruitment of amino acid transporters to the insect/symbiont interface. Mol Ecol 23:1608–1623
30. Dahan RA, Duncan RP, Wilson AC et al (2015) Amino acid transporter expansions associated with the evolution of obligate endosymbiosis in sap-feeding insects (Hemiptera: Sternorrhyncha). BMC Evol Biol 15:52
31. Ames RM, Money D, Ghatge VP et al (2012) Determining the evolutionary history of gene families. Bioinformatics 28:48–55
32. Arvestad L, Lagergren J, Sennblad B (2009) The gene evolution model and computing its associated probabilities. J ACM 56(7):44
33. Teufel AI, Liu L, Liberles DA (2016) Models for gene duplication when dosage balance works as a transition state to subsequent neo- or sub-functionalization. BMC Evol Biol 16:45
34. Nee S, May RM, Harvey PH (1994) The reconstructed evolutionary process. Philos Trans R Soc Lond Ser B Biol Sci 344:305–311
35. Niimura Y, Matsui A, Touhara K (2014) Extreme expansion of the olfactory receptor gene repertoire in African elephants and evolutionary dynamics of orthologous gene groups in 13 placental mammals. Genome Res 24:1485–1496
36. Pegueroles C, Laurie S, Albà MM (2013) Accelerated evolution after gene duplication: a time-dependent process affecting just one copy. Mol Biol Evol 30:1830–1842
22. Hermansen RA, Hvidsten TR, Sandve SR et al (2016) Extracting functional trends from
whole genome duplication events using com- 37. Spielman SJ, Wilke CO (2015) The relation-
parative genomics. Biol Proced Online 18:11 ship between dN/dS and scaled selection coef-
23. Bielawski JP, Yang Z (2003) Maximum likeli- ficients. Mol Biol Evol 32:1097–1108
hood methods for detecting adaptive evolution 38. Mugal CF, Wolf JBW, Kaj I (2014) Why time
after gene duplication. J Struct Funct Genom matters: codon evolution and the temporal
3:201–212 dynamics of dN/dS. Mol Biol Evol
24. Hahn MW, De Bie T, Stajich JE et al (2005) 31:212–231
Estimating the tempo and mode of gene family 39. Liberles DA, Teufel AI, Liu L et al (2013) On
evolution from comparative genomic data. the need for mechanistic models in computa-
Genome Res 15:1153–1160 tional genomics and metagenomics. Genome
25. Liu L, Yu L, Kalavacharla V et al (2011) A Biol Evol 5:2008–2018
Bayesian model for gene family evolution. 40. De Bie T, Cristianini N, Demuth JP et al
BMC Bioinformatics 12:426 (2006) CAFE: A computational tool for the
26. Han MV, Thomas GWC, Lugo-Martinez J et al study of gene family evolution. Bioinformatics
(2013) Estimating gene gain and loss rates in 22:1269–1271
the presence of error in genome assembly and
Chapter 4
Abstract
De novo genes, that is, protein-coding genes originating from previously noncoding sequence, have gone
from being considered impossibly unlikely to being recognized as an important source of genetic novelty in
eukaryotic genomes. It is clear that de novo gene evolution is a rare but consistent feature of eukaryotic
genomes, being detected in every genome studied. However, different studies often use different compu-
tational methods, and the numbers and identities of the detected genes vary greatly. Here we present a
coherent protocol for the computational identification of de novo genes by comparative genomics. The
method described uses homology searches, identification of syntenic regions, and ancestral sequence
reconstruction to produce high-confidence candidates with robust evidence of de novo emergence. It is
designed to be easily applicable given basic knowledge of bioinformatic tools and scalable so that it can
be applied to large and small datasets.
Key words De novo genes, Gene birth, New gene evolution, Novel genes, ORF formation, Protein-
coding genes, Genome-wide detection, Genome evolution
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_4, © Springer Science+Business Media, LLC, part of Springer Nature 2019
64 Nikolaos Vakirlis and Aoife McLysaght
cellular functions (see [5] for a complete review). More and more,
researchers are coming to recognize de novo emergence as a uni-
versal evolutionary phenomenon and to appreciate its potential as a
mechanism of rapid phenotypic innovation [6] and as a genome-
shaping force [7]. As the interest around de novo genes grows, so
does the need for their accurate identification. This, however, is not
a trivial task. De novo genes are a subset of “orphan genes” also
known as “ORFans” or species-specific genes. These are genes that
are found only in a single genome (or in a closely related group of
genomes, in which case the term taxonomically restricted gene is
used) and lack homologues in any other organism. Disentangling
the evolutionary origins of orphan genes can be challenging [8],
and the results highly depend upon the employed methodology.
The initial de novo gene studies necessarily followed a stringent,
painstaking approach involving a substantial amount of manual
curation and multiple lines of evidence [9–12]. The goal was to
provide solid proof that a functional species-specific gene had
emerged from an ancestrally noncoding or nonfunctional region.
Since then, multiple studies have adopted a different type of
approach, with more relaxed criteria, but its advantages and pitfalls
are still a matter of debate [13–17].
In this chapter, we will present what can be considered as a
stringent best practice for the identification of all protein-coding
genes that have emerged entirely de novo in a single genome.
Conceptually this is the same as identifying genes that have origi-
nated de novo on a particular branch of a tree with the only
difference being that the novel gene will be present in the organ-
isms descended from that branch, and not in any outgroups. The
methodology described here can be easily adapted to that type of
study. The evolution of a novel gene requires, at the very minimum,
the origin of an open reading frame (ORF) and regulatory signals
for transcription and translation. Here we will mostly focus on the
emergence of the ORF. Starting with the complete set of annotated
protein-coding genes, we remove the ones with significant similar-
ity to genes in other genomes, resulting in a set of species-specific
genes. This set is then further reduced to the ones with identifiable
sequence similarity to their orthologous noncoding regions in
closely related outgroup genomes, from which an ancestral
sequence can be inferred. Finally, the ones for which the inferred
ancestral sequence can be shown to lack coding potential are
retained as de novo gene candidates (see Fig. 1 for a complete
outline). The approach is designed to (1) err on the side of caution
(i.e., we endeavor to avoid false positives), (2) be applicable as
widely and easily as possible, and (3) be scalable so that it can
work for large and small datasets. It is for this reason that the
method we are describing here is command-line oriented and
specific commands are provided. Nevertheless, the choice of para-
meters that one has to set throughout the application of this
Prediction of De Novo Genes 65
2 Materials
12. The GNU parallel command line tool available from
https://www.gnu.org/software/parallel/.
13. The faSomeRecords and faSize tools available from
http://hgdownload.cse.ucsc.edu/admin/exe/.
14. The SAMtools suite of programs available from
http://samtools.sourceforge.net/.
15. The EMBOSS suite of programs available from
http://emboss.sourceforge.net/download/.
16. The phyml [23] phylogenetic reconstruction program available
from http://www.atgc-montpellier.fr/phyml/binaries.php.
17. The tantan [24] tool for low-complexity masking in biological
sequences.
3 Methods
3.1 Retrieve the Data
The first step is to download the necessary data to a local machine,
where all the subsequent computations will take place. To start the
analysis, we need the genomic sequences in FASTA format, anno-
tation files in one of the commonly used formats (GenBank,
EMBL, GFF), and the amino-acid and coding DNA sequences
(CDS) for all annotated protein-coding genes for the focal genome
and the outgroup genomes. There exist multiple sources for
genome data, and the choice depends on the genome being inves-
tigated. NCBI’s Genome Resource (https://www.ncbi.nlm.nih.
3.2 Identify All Species-Specific Genes
By definition, de novo genes are derived from previously noncoding
sequence. If we are considering recently evolved de novo genes,
then they will be a subset of species-specific genes, having no
homologs outside the focal genome. Thus the first step is to identify
all species-specific genes in our focal genome. De novo genes
can be categorized according to whether or not they contain any
genetic material that is copied or descended from a pre-existing
gene [3]. The most intuitive cases are type I de novo genes, which
are completely derived from noncoding sequence. However,
depending on the purpose of the study, one may also be interested
in de novo genes with small or large portions derived from
sequences previously under selection (type II and type III de
novo genes, respectively), and the similarity search criteria can be
adjusted accordingly.
3.2.1 Similarity Search in NCBI’s NR Database
We will first perform a similarity search of all the protein-coding
genes in the focal genome against NCBI’s NR database using the
blastp executable from the BLAST suite of programs. A commonly
used E-value threshold is 10^-3, generally accepted to result in a
good trade-off between sensitivity and specificity. Using thresholds
more permissive than 10^-3 is very likely to produce many false
hits. The NR database is relatively large; the total size of the
uncompressed files of the preformatted version, as of this writing,
is 106 GB. To speed up the search, we can use BLAST’s multithreading
option with an appropriate number of threads
(-num_threads X) to parallelize the alignment step, as well as the
GNU parallel command line tool to further accelerate the search.
Let’s assume that we have downloaded the NR database preformatted
files along with the taxonomy information file (taxdb.tar.gz)
and we have uncompressed them and placed them in the
NR_DIRECTORY directory. The command to execute is the
following:
$ cat focal_aa.fsa | parallel --GNU --block 100k --recstart '>' --pipe 'blastp -query - -db [NR_DIRECTORY]/nr -outfmt "6 std slen qlen stitle staxids sscinames" -max_target_seqs 500 -num_threads [NUM_OF_CPUS] -evalue 0.001' > focal_nr_out.txt
The time that this command will take to complete will depend
on the number and size of the query protein sequences. The
“-max_target_seqs 500” argument is used to limit the number of
target sequences reported, since we are only interested in the
general presence or absence thereof, and not in the sequences
themselves.
We then need to parse the blastp output and store in a file a list
of all the genes with hits, not counting self-hits. This list will
then be compared to the full list of genes to extract a list of genes
without any BLAST hit. We then select the protein sequences of
these genes and store them in a separate file; the same is also done
All the genes with at least one significant match outside of the
focal taxon are now stored in the file “focal_found_genes.txt.”
Then we remove them from the list of all the genes:
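As a minimal sketch of these two steps, the shell functions below first collect every query with at least one hit whose taxonomy ID differs from that of the focal taxon and then subtract those names from the full gene list. The function names, the file name focal_all_genes.txt, and the use of a single NCBI taxonomy ID to define the focal taxon are assumptions made for illustration. The position of the staxids column (16) follows from the -outfmt string used above.

```shell
# Print the names of all queries with at least one hit outside the focal taxon.
# $1 = NCBI taxonomy ID of the focal taxon (assumption: one ID is enough)
# stdin = blastp output in the 17-column tabular layout requested above
genes_with_outside_hits() {
    awk -F '\t' -v focal="$1" '
        {
            # staxids (column 16) may hold several IDs separated by ";"
            n = split($16, ids, ";")
            for (i = 1; i <= n; i++) {
                if (ids[i] != focal) { print $1; break }
            }
        }
    ' | sort -u
}

# Print the entries of a gene list that are absent from a second list.
# $1 = file with all gene names, $2 = file with names to remove (one per line)
remove_found_genes() {
    grep -v -x -F -f "$2" "$1"
}

# Hypothetical usage (4932 is only an example taxonomy ID):
#   genes_with_outside_hits 4932 < focal_nr_out.txt > focal_found_genes.txt
#   remove_found_genes focal_all_genes.txt focal_found_genes.txt
```

Any hit carrying at least one non-focal taxonomy ID counts as an outside match here, which removes more genes rather than fewer and so fits the cautious stance of the protocol.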
3.2.2 Similarity Search in Outgroup Genomes
Using the species-specific genes, we will perform a similarity search
in the outgroup genomes’ sequences. This is needed to ensure that
no homologous, unannotated genes exist in the outgroup species
and is also required for other downstream steps. First, it is a good
idea to mask low-complexity segments on the chromosome
sequences so that we avoid spurious matches. This can be achieved
with the tantan program:
$ export GBLAST_PATH="/users/User/Documents/tools/genBlast_v138_mac_os_X/"
At this stage, we will also use the tfasty executable from the
fasta suite of programs, to do a similarity search using as query the
protein sequences of the species-specific genes and the masked
chromosome sequences from the four outgroup species as subject.
The command is run twice to get two different output formats
(controlled by the “-m” argument): the tabular one which is useful
for parsing and the detailed one which is useful for visual inspection
and manual curation (see Note 1):
Next, we need to filter out low identity and low coverage hits.
These thresholds can vary, but here we will apply a percentage
identity threshold of 50% and a protein coverage threshold of 50%.
To apply a coverage percentage threshold, we first need to use
the faSize utility to calculate the length of each species-specific
sequence:
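The identity and coverage filter described above can be sketched as a small awk-based shell function. This is an illustration under stated assumptions rather than the exact command of the protocol: it assumes that the lengths are stored in a two-column, tab-separated file (name, length), as written by faSize -detailed on the protein file, and that the tfasty hits are in the 12-column BLAST-style tabular layout with the aligned query range in columns 7 and 8. The function name filter_hits and the file names in the usage comment are hypothetical.

```shell
# Keep only hits with >= 50% identity that cover >= 50% of the query protein.
# $1 = tab-separated file: sequence name, sequence length
# $2 = tab-separated hits: query, subject, %identity, alignment length,
#      mismatches, gaps, qstart, qend, sstart, send, E-value, bit score
filter_hits() {
    awk -F '\t' -v min_id=50 -v min_cov=50 '
        # First file: remember the length of every query sequence
        NR == FNR { len[$1] = $2; next }
        # Second file: test each hit against both thresholds
        ($1 in len) && len[$1] > 0 {
            cov = 100 * ($8 - $7 + 1) / len[$1]
            if ($3 >= min_id && cov >= min_cov) print $0
        }
    ' "$1" "$2"
}

# Hypothetical usage:
#   filter_hits focal_ss_lengths.txt outgroup_tabular.txt > outgroup_tabular_filtered.txt
```

Coverage is computed here from the aligned span on the query, so both the coordinates and the lengths are counted in amino acids.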
3.3 Showing that the Ancestral Sequence Lacked Protein-Coding Potential
The most robust evidence that a gene emerged de novo is provided
when its ancestral sequence can be shown to have lacked protein-coding
potential. In order to achieve that, one needs to detect the
candidate gene’s orthologous genomic sequences in at least two
closely related outgroup species. At this point it is crucial to note
that while this step is relatively straightforward for single-exon
genes and genes with simple gene structures, it necessitates signifi-
cantly more manual curation and involves more uncertainty in the
case of complex gene structures. For simplicity and because in the
majority of cases young genes are short and have very simple gene
structures, we will consider only the single-exon gene case. To
extend to multiple exons, each exon would have to be searched
separately during the tfasty step described in Subheading 3.2.2.
That would then allow the manual “stitching together” of the
orthologous region of each exon into a single putative CDS
which can then be aligned and inspected as described in the follow-
ing subsections (see Note 2).
3.3.1 Retrieving the Orthologous Regions in Outgroup Genomes
Check if orthology or synteny information between the focal
genome and the neighboring genomes exists in comparative genomic
resource databases, a list of which can be found at the Quest for
Orthologs website https://questfororthologs.org/orthology_databases.
The goal at this step is to locate, if possible, the orthologous
region of the candidate genes in closely related outgroup
genomes (see Fig. 2), and for this we will need lists of orthologous
pairs of genes, which can be extracted from one of the aforemen-
tioned databases. If orthology information is not available, the
orthologous pairs need to be computed from scratch using a dedi-
cated tool.
By combining the results of Subheading 3.2.2 and the orthol-
ogy information, we can build multiple alignments of each candi-
date gene and its orthologous sequences. These MSAs will be used
in the next step.
Fig. 2 Graphical representation of the configuration described in Subheading 3.1, in a scenario with four
closely related outgroup species. The regions of interest are highlighted in green. Note that in actual cases,
neighboring genes and regions might overlap, and so the region of interest might not be as clearly defined as
in the example here
#!/bin/bash
# For each candidate gene, write its CDS and every filtered outgroup hit
# region to <gene>_ortho.fsa. Requires bash (the "<<<" here-string below).
cat focal_ss_names_final.txt |\
while read f_line ;
do
  echo $f_line > faIn_temp ;
  faSomeRecords focal_ss_cds_final.fsa faIn_temp ${f_line}_ortho.fsa ;
  grep $f_line chrom/*tabular_with_lengths_filtered.txt |\
  while read line ;
  do
    # Start and end coordinates of the hit on the outgroup chromosome
    START=$(echo $line | cut -f 9 -d ' ') ;
    END=$(echo $line | cut -f 10 -d ' ') ;
    REV_FLAG=0 ;
    # A start greater than the end indicates a hit on the reverse strand
    if [ $START -gt $END ] ;
    then
      read START END <<< "$END $START" ;
      REV_FLAG=1 ;
    fi ;
    FILE_NAME=$(echo $line | cut -f 1 -d ':') ;
    CHROM_NAME=$(echo $line | cut -f 2 -d ' ') ;
    samtools faidx \
      chrom/${FILE_NAME%%\.*}.chrom.fsa.masked ${CHROM_NAME}:$START-$END \
      | tr ':' '_' > temp ;
    if [ $REV_FLAG -gt 0 ] ;
    then
      revseq -tag -sequence temp -outseq temp_rev ;
      mv temp_rev temp;
    fi ;
    # Tag the outgroup record and append it to the candidate's file
    cat temp | sed 's/^>\(.*\)/>\1 OUTGR/' | tr -d ':' >> ${f_line}_ortho.fsa ;
  done ;
done
To execute, after replacing with the correct file names, copy and
paste this code into a file called parse_candidate_gen_hits.sh. Then,
permit that file to be executed by running the following:
$ chmod +x parse_candidate_gen_hits.sh
3.3.2 Reconstruction of the Ancestral Sequence
Next, we infer the state of the ancestral sequence. Almost always in
the literature, this step would be performed manually, by “walking
along” the alignment and trying to identify “shared disablers.”
Simply put, this consists of manually inspecting the alignments
and identifying common ORF-disrupting mutations in the out-
group orthologous sequences. The ancestral state of these positions
can then be parsimoniously inferred, revealing whether the ances-
tral sequence had an intact ORF. We consider the ORF as ances-
trally not “intact” when it’s shorter than 70% of the de novo
candidate gene’s length.
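The 70% criterion can also be checked programmatically once an ancestral sequence has been inferred. The sketch below is a simplified illustration, not the procedure used in this protocol: it takes an ungapped ancestral nucleotide sequence, finds the longest run of stop-free codons in any of the three forward frames, and compares its length with that of the candidate gene. Scanning all three frames and ignoring the start codon are assumptions chosen so that the check errs on the side of calling an ORF intact. The function name is hypothetical.

```shell
# Classify an ancestral sequence as "intact" or "not_intact".
# $1 = ancestral nucleotide sequence with alignment gaps already removed
# $2 = length of the de novo candidate's CDS in nucleotides
classify_ancestral_orf() {
    printf '%s\n' "$1" | awk -v ref_len="$2" '
        {
            seq = toupper($0)
            best = 0
            for (frame = 0; frame < 3; frame++) {
                run = 0
                # Walk through complete codons only
                for (i = frame + 1; i + 2 <= length(seq); i += 3) {
                    codon = substr(seq, i, 3)
                    if (codon == "TAA" || codon == "TAG" || codon == "TGA") {
                        run = 0          # a stop codon ends the current run
                    } else {
                        run += 3
                        if (run > best) best = run
                    }
                }
            }
            # Not intact when the longest run is shorter than 70% of the candidate
            if (best < 0.7 * ref_len) print "not_intact"
            else print "intact"
        }'
}

# Hypothetical usage:
#   classify_ancestral_orf "ATGAAACCCGGGTTT" 15
```

A result of "not_intact" for the inferred ancestor is what supports de novo emergence; manual inspection of the alignment remains advisable whenever frameshifts are involved.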
The first step is to align the sequences. The alignments can be
generated using the linsi executable of MAFFT, as follows
(as before, replace [CPU_NUM] by the number of cores in your
machine):
The aligned sequences are now stored in the files with the
extension “_ortho.aln.” By opening the alignment files with an
alignment viewer, we can detect frameshift mutations (see Fig. 3a)
and stop codons (Fig. 3b) in the orthologous sequences. It is
important that we follow any change of reading frame that occurs
Fig. 3 Two hypothetical examples of shared disablers. (a) A single-nucleotide deletion (highlighted in yellow)
that occurred along the terminal branch of the focal genome results in a frameshift, making available a
different potential translation of the sequence that avoids the TGA stop codon that is in frame for the potential
ORF in other species. (b) Two base substitutions in the focal genome lineage lead to the removal of two stop
codons (nonsense-to-sense mutations, highlighted in blue) leading to the formation of a longer ORF
Then, we can build trees, one for each alignment, that follow
the species topology by executing the following command:
Fig. 4 Inferring de novo emergence for a hypothetical example alignment combining both frameshifts and
nonsense-to-sense mutations of Fig. 3, using ancestral reconstruction. (a) The phylogenetic tree generated by
PRANK, the same as the input guide tree but with the assigned ancestor identifiers. (b) The PRANK alignment
containing the extant and ancestral sequences. The positions of interest are highlighted as in Fig. 3. The
ancestral states at these positions confirm the results of the manual inference. (c) The “de-aligned,”
translated ancestral proteins and the focal extant one (see below for relevant commands), allowing to verify
that no intact ORF existed before the focal leaf of the tree
The final results are stored in the file results.txt. The candidates
tagged with “keep” are the ones that do not have an intact ORF
3.4 Showing that the Candidate Genes Are Protein-Coding/Functional
By default, young de novo genes are only present in a single
genome. They therefore lack one of the main lines of evidence
that is put forth to prove that a piece of DNA is indeed protein-coding
or functional, namely, conservation due to purifying selection.
It is thus necessary to provide some evidence that the putative
de novo gene expresses a functional protein and is not simply a
spurious result. This is a difficult issue which depends on the very
definition of a functional protein-coding gene, complicated further
by pervasive transcription [25] and pervasive translation [26]. Con-
sequently, what constitutes sufficient evidence of “coding-ness”
and functionality will depend on the context and the assumptions
of the study.
In the absence of specific functional annotation or identification
of the protein by other means, experimental evidence for its
expression can be provided if proteomics or ribosome profiling data
are available. Major repositories of results from mass spectrometry
proteomic experiments include PRIDE
(https://www.ebi.ac.uk/pride/archive/) and PeptideAtlas
(http://www.peptideatlas.org/), where one can check whether peptides
matching a de novo candidate have been experimentally detected
(see [27] for additional information on mass spectrometry proteomic
resources). Ribosome profiling result databases include GWIPS
(http://gwips.ucc.ie/) and RPFdb
(http://sysbio.sysu.edu.cn/rpfdb/index.html) (see [28] for a more
complete list of resources). Alternatively, one can calculate what
are referred to as “coding scores” based on the intrinsic sequence
composition of the candidates [17, 29]. This is sometimes done as
part of the initial genome annotation, as is the case, for example,
in Saccharomyces [30]. One possible solution is the CPAT tool [31],
which has been developed to be applied to entire transcripts but can
work on single ORFs as well. It involves first training the model on
known coding and noncoding sequences (unless your genome is one of
human, fly, mouse, or zebrafish, for which models are already
available at http://lilab.research.bcm.edu/cpat/) and so will not be
covered in detail here. At any rate, a sequence annotated as coding
remains at the very least a candidate, even if no other evidence
exists of its functionality.
4 Notes
4. In order for the last two commands to work, the names at the
leaves of the tree files and the names of the sequences in the
FASTA files must match. That means we first must remove the
part of the header following the species name in the de novo
candidate’s record in the FASTA file and accordingly remove any
extra information from the header of the orthologous matches.
References
24. Frith MC (2011) A new repeat-masking method enables specific detection of
homologous sequences. Nucleic Acids Res 39:e23–e23
25. Clark MB, Amaral PP, Schlesinger FJ et al (2011) The reality of pervasive
transcription. PLoS Biol 9:e1000625
26. Ingolia NT, Lareau LF, Weissman JS (2011) Ribosome profiling of mouse
embryonic stem cells reveals the complexity and dynamics of mammalian
proteomes. Cell 147:789–802
27. Chen T, Zhao J, Ma J et al (2015) Web resources for mass spectrometry-based
proteomics. Genomics Proteomics Bioinformatics 13:36–39
28. Wang H, Wang Y, Xie Z (2017) Computational resources for ribosome
profiling: from database to Web server and software. Brief Bioinform.
https://doi.org/10.1093/bib/bbx093
29. Ruiz-Orera J, Messeguer X, Subirana JA et al (2014) Long non-coding RNAs as
a source of new peptides. Elife 3:e03523
30. Scannell DR, Zill OA, Rokas A et al (2011) The awesome power of yeast
evolutionary genetics: new genome sequences and strain resources for the
Saccharomyces sensu stricto genus. G3 (Bethesda) 1:11–25
31. Wang L, Park HJ, Dasari S et al (2013) CPAT: coding-potential assessment
tool using an alignment-free logistic regression model. Nucleic Acids Res 41:e74
Chapter 5
Abstract
The analysis of coevolutionary signals from families of evolutionarily related sequences is a recent concep-
tual framework that provides valuable information about unique intramolecular interactions and, therefore,
can assist in the elucidation of biomolecular conformations. It is based on the idea that compensatory
mutations at specific residue positions in a sequence help preserve stability of protein architecture and
function and leave a statistical signature related to residue-residue interactions in the 3D structure of the
protein. Consequently, statistical analysis of these correlated mutations in subsets of protein sequence
alignments can be used to predict which residue pairs should be in spatial proximity in the native functional
protein fold. These predicted signals can be then used to guide molecular dynamics (MD) simulations to
predict the three-dimensional coordinates of a functional amino acid chain. In this chapter, we introduce a
general and efficient methodology to perform coevolutionary analysis on protein sequences and to use this
information in combination with computational physical models to predict the native 3D conformation of
functional polypeptides. We present a step-by-step methodology that includes the description and applica-
tion of software tools and databases required to infer tertiary structures of a protein fold. The general
pipeline includes instructions on (1) how to obtain direct amino acid couplings from protein sequences
using direct coupling analysis (DCA), (2) how to incorporate such signals as interaction potentials in Cα
structure-based models (SBMs) to drive protein-folding MD simulations, (3) a procedure to estimate
secondary structure and how to include such estimates in the topology files required in the MD simulations,
and (4) how to build full atomic models based on the top Cα candidates selected in the pipeline. The
information presented in this chapter is self-contained and sufficient to allow a computational scientist to
predict structures of proteins using publicly available algorithms and databases.
Key words Coevolution, Structure-based model, Energy landscapes, Molecular dynamics, Pro-
tein Folding, Structure prediction
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_5, © Springer Science+Business Media, LLC, part of Springer Nature 2019
84 Ricardo Nascimento dos Santos et al.
2 Materials
2.1 UniProt Server
UniProt is a comprehensive genomic sequence and analysis database
containing a large dataset of protein sequences, accompanied
by diverse biological annotations for biological function, domain
composition, subcellular location, and possible molecular interactions
[28]. This database is freely accessible at www.uniprot.org.
2.3 Direct Coupling Analysis (DCA) Server
Correlations in amino acid mutations can be identified by applying
statistical inference to MSAs [31]. Several algorithms and refinements
to perform this analysis have been developed by our group
and others, with good performance [1, 4, 11]. Direct coupling
analysis is an efficient technique to compute coevolutionary signals
that is able to disentangle indirect and direct correlations, which are
hard to distinguish by usual correlation analysis such as mutual
information [1, 5]. Identification of direct couplings is especially
important in structure predictions, since they can be interpreted as
regions that are physically interacting in the functional state of
macromolecules. A web server for estimation of direct correlations
in residue pairs using the DCA approach is available at
http://dca.rice.edu/ and http://morcoslab.org. Moreover, a standalone
version of DCA is also available on the same web page. A more detailed
description of DCA usage can be found in another publication [5].
2.4 HMMER Profiling
HMMER is a software framework developed for the identification
of homologous sequences in biological databases and for efficient
multiple sequence alignment. Its methodology employs
probabilistic Hidden Markov Models for pattern recognition.
HMMER comprises several tools, such as hmmscan, to search a query
sequence against a database of profile HMMs, and hmmbuild, to build
a profile HMM from an MSA. This software is freely accessible and
can be downloaded at http://hmmer.org/.
2.6 Gnuplot
Gnuplot is open-source, portable software for graphical
visualization of data and mathematical functions. It is very intuitive
and native to most Linux distributions, in addition to running on all
major operating systems. To check if Gnuplot is already installed on
your operating system, type gnuplot in a terminal. Details about
download and usage can be found at http://gnuplot.sourceforge.net/.
2.7 Jpred Server
While recognition of tertiary structure is still challenging and
methodologies are under development, tools for estimation of local
secondary motifs are very mature and display good accuracy
[34, 35]. One of the most recent implementations of secondary
structure prediction methods, which was selected for this protocol,
is Jpred version 4 [36]. This approach uses multilayered neural
networks that are trained to identify secondary motif patterns
from primary sequence as input. Sequence queries can be submitted
to the Jpred server at http://www.compbio.dundee.ac.uk/jpred/index.html.
Alternatively, other prediction methodologies
can also be used, such as PSIPRED, Spider2, and PredictProtein
[37–39].
Coevolution and SBMs for Protein Structure Prediction 87
2.9 Protein Data Bank
Protein Data Bank (PDB) is a structural database containing the 3D
relative coordinates of thousands of biological macromolecules
obtained from experimental techniques such as X-ray and electron
crystallography, nuclear magnetic resonance (NMR), and cryo-electron
microscopy (cryo-EM). Most of the database comprises
protein structures, but nucleic acid structures are also included
[42, 43]. This source of structural genomic information is in continuous
growth, with more than 130,000 entries as of August 2017.
It can be accessed at https://www.rcsb.org/pdb/home/home.do.
2.10 UCSF Chimera
UCSF Chimera is a free-of-charge software for visualization and
analysis of molecular structures. It supports a wide range of file
formats and provides many tools for data analysis,
such as sequence alignment, charge distribution and energy optimizations,
interpolation of structure conformations, and solvation.
Moreover, Chimera also provides a platform to generate high-quality
scientific images and animations. It can be downloaded at
https://www.cgl.ucsf.edu/chimera/download.html.
2.12 REMO Server
REconstruct atomic MOdel (REMO) is a program that generates
full atomic coordinates of proteins from Cα models. This recon-
struction process employs an algorithm to optimize the network of
3 Methods
3.1 Protein Sequence and Family
As a first example of protein folding prediction, we will work with
the human transmembrane protein aquaporin-1. This macromolecule
is part of a large family of transport proteins that controls the
flow of water through the cell [48–50]. The presence of water
channels allows cells to increase and decrease intracellular water
content at a faster rate than diffusion through membranes, and
functional defects are related to diseases [48]. Moreover, structural
characterization of transmembrane proteins such as aquaporin-1 is
particularly challenging for experimental studies, due to limitations
of available techniques. Therefore, computational methodologies
like the one presented herein are particularly useful. Also,
while many full-atom ab initio methods have been successful in
predicting the conformation of small proteins (<100 residues),
the study of larger systems such as aquaporin-1 is still intractable
even when using high-performance computation. Together, these
attributes make aquaporin-1 an excellent example for folding
studies that is comparable in complexity to real-case research
problems.
In order to infer coevolutionary signals for a given system, we
need the primary amino acid sequence of the target. To obtain it, we
will access the UniProt server (Subheading 2.1) and type human
aquaporin 1 in the main query field at the top of the page, with the
option UniProtKB selected. A list of aquaporin entries ordered by
relevance will be returned. We should select the most relevant entry
that corresponds to our target of study (entry P29972, AQP1_HUMAN).
Clicking on the link of this entry opens another page that displays all
information already annotated for this molecule (such as biological
function and molecular interactions). In order to perform
coevolutionary analysis, we need the amino acid sequence and the Pfam
family of this protein.
In section Sequences of the aquaporin-1 UniProt page, we can
Coevolution and SBMs for Protein Structure Prediction 89
Fig. 1 Parameters to download MSA for a specific family in Pfam server. The generated format is compatible
with DCA
This step removes insertions from the original MSA file
PF00230_full.txt downloaded from Pfam and writes the result to a file
named PF00230_full_clean. Furthermore, in order to identify the region
and the specific residues of the human aquaporin-1 sequence that are
part of the MIP family, we also need the profile used to generate the
MSA (Pfam employs hidden Markov models for multiple alignments). We can
download the model, named MIP.hmm, from the Curation & model tab using
the download link provided at the bottom.
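The removal of insertions can be illustrated with a short Python sketch. It is not the script distributed with this chapter. It assumes that the alignment is in a FASTA-like format in which insert states are written as lowercase letters and "." characters, while match states are uppercase letters and "-" gaps; the function name strip_insertions and the two toy sequences are hypothetical.

```python
def strip_insertions(fasta_text):
    """Return (header, sequence) pairs with insert columns removed.

    Assumption: insert states are lowercase letters or '.', while match
    states are uppercase letters or '-' (HMMER/Pfam-style alignments).
    """
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Keep only match-state characters.
            chunks.append("".join(c for c in line if c.isupper() or c == "-"))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-sequence alignment with one insert region.
example = ">seq1\nMKa.LV-\n>seq2\nM-.cLVG\n"
cleaned = strip_insertions(example)
```

After cleaning, every sequence has the same length, which is what a column-based analysis such as DCA requires.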
3.2 Coevolutionary Information from Direct Coupling Analysis
From the MSA of the MIP family, we can now perform a statistical
analysis over many sequences using direct coupling analysis (DCA).
This analysis provides a quantitative measure of the direct coupling
between each possible pair of positions in the MSA, which can be used
to infer physical contacts in aquaporin-1. To perform this step, go to
the DCA server website (see Subheading 2.3), and use the full MSA
obtained for the MIP family (PF00230_full_clean) as input. Use the DCA
button on the Workbench tab. Enter a job name in the first blank field,
and upload the MSA file in the FASTA_IN option. Set the
relative_pseudo_count option to 1.0 and the homolog_radius parameter to
0.8. Details about the influence of these parameters on DCA performance
are provided on the submission page as well as in [1]. Finally, run the
coevolutionary analysis with the button Start Job.
After finishing the calculations, the DCA server provides a
preliminary heatmap with an overview of how the highly ranked
correlations are distributed in the MSA. Furthermore, a file containing
the list of direct correlations for each pair of positions in the
family can be downloaded from the link named DI_values.DI. Save this
file to your computer, and open it using any text editor. Check that
each line contains three columns, corresponding to the two positions of
the pair in the MSA and to their level of direct correlation, named
direct information (DI). Notice that lines in the DCA list are sorted
by pair numbering starting at 0, whereas we are interested only in the
pairs with the highest DI values (third column). Moreover, since
adjacent positions in the MSA usually correspond to neighboring
residues in the backbone, we expect a trivially high correlation among
those residues. In order to sort this list by DI value and to filter
out neighboring positions with local correlations, we will open a
terminal in the folder containing the downloaded files and execute the
following command to generate a new list of pairs (see Note 3):
hmmpress MIP.hmm
This script will generate two files: (i) a new list of top
coevolutionary couplings corresponding to residue pairs in the target
protein sequence and (ii) a reference table relating each matching
position in the MSA to the protein sequence (see Note 4). You can check
that these files were created by typing the command “ls” in the
terminal. Open the first file generated (named
P29972_MIP_scan_ranked_matched.DI), and compare it with the previous
MSA list obtained from DCA (PF00230_full_ranked.DI). Next, check the
generated reference file for mapping (P29972_MIP_scan_reference.txt),
and observe that the map list can contain insertions (represented by
“-”) that are not in the original MSA of the assigned family and vice
versa. These instances demonstrate why a detailed pair matching is
necessary, rather than knowing only the absolute beginning and end of a
family's location in a protein (see Note 5).
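The idea behind the reference table can be sketched in Python. This is an illustration under stated assumptions, not the script provided with the chapter: it assumes that the aligned model line uses "." where the sequence carries an insertion and that the aligned sequence line uses "-" where a model position is deleted. The function names, the toy alignment strings, and the start positions are all hypothetical.

```python
def build_reference(model_line, seq_line, model_start, seq_start):
    """Pair each alignment column with (model position, sequence position).

    Assumption: '.' in the model line marks an insertion in the sequence
    and '-' in the sequence line marks a deleted model position.
    """
    table = []
    m, s = model_start, seq_start
    for mc, sc in zip(model_line, seq_line):
        m_pos = s_pos = "-"
        if mc != ".":
            m_pos, m = m, m + 1
        if sc != "-":
            s_pos, s = s, s + 1
        table.append((m_pos, s_pos))
    return table


def remap_pairs(pairs, table):
    """Translate (i, j, DI) records from model to sequence numbering."""
    lookup = {m: s for m, s in table if m != "-" and s != "-"}
    return [(lookup[i], lookup[j], di) for i, j, di in pairs
            if i in lookup and j in lookup]


# Hypothetical aligned domain: model positions 1-5, sequence from residue 10.
table = build_reference("avl.gG", "AV-rGG", model_start=1, seq_start=10)
mapped = remap_pairs([(1, 5, 0.30), (2, 3, 0.10)], table)
```

Pairs that involve a model position with no counterpart in the target sequence are dropped, which is why knowing only the first and last residue of the family region would not be enough.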
Moreover, to get an idea of the diversity of interactions in our
predicted coevolutionary data, we can visualize the top-ranked
couplings using residue-residue plots. To do this, we should open a
Gnuplot terminal (Subheading 2.6) by typing gnuplot in the working
shell terminal and plot the first 200 obtained
Fig. 2 Representation of top 200 DCA contacts for human aquaporin-1 (UniProt
code: P29972). The lower right distribution corresponds to the native contacts
from the reference structure PDB: 4CSK
where the file DCA_top226 has the top 226 coupled residue pairs
separated by more than 4 residues in the amino acid chain.
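The ranking and filtering used to build such a list of top couplings can be sketched in Python. This is a minimal illustration, assuming that each line of the DI file holds two integer positions and one DI value separated by whitespace; the function name rank_di_pairs and the demonstration values are hypothetical and do not come from the DCA server.

```python
def rank_di_pairs(lines, min_separation=4, top_n=None):
    """Sort (i, j, DI) records by decreasing DI, dropping near neighbors.

    Pairs with |i - j| <= min_separation are discarded as local
    correlations along the backbone.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        i, j, di = int(fields[0]), int(fields[1]), float(fields[2])
        if abs(i - j) > min_separation:
            pairs.append((i, j, di))
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs if top_n is None else pairs[:top_n]


# Hypothetical DI records: position, position, DI value.
demo = ["1 2 0.90", "1 10 0.20", "3 40 0.75", "5 9 0.60", "7 30 0.05"]
top = rank_di_pairs(demo, min_separation=4, top_n=2)
```

In practice the lines would be read from the downloaded DI file, for example with open("DI_values.DI") as handle, and top_n would be set to the number of couplings to keep.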
3.3 Secondary Structure Prediction
After quantifying coevolutionary couplings for our protein of
interest, aquaporin-1, we can now use this information for folding
prediction. But first, we need to describe how neighboring residues are
locally organized and packed. Clues about local information (secondary
structure) can also be inferred from coevolution analysis; however,
several mature and more accurate methodologies are available. These
approaches use statistical, knowledge-based, or machine-learning
techniques (or combinations of them) to predict packing patterns. One
robust tool for secondary structure prediction, selected for this
methodology, is Jpred (Subheading 2.7).
In order to predict the secondary structure of aquaporin-1, go
to the Jpred server website (Subheading 2.7), delete the sample
sequence in the query field, and paste the amino acid sequence of the
protein obtained in Subheading 3.1 and saved in a .fasta file (copy
the text inside P29972.fasta and paste it in the query field). Run the
analysis with the button Make prediction. On the page returned after
job submission, a message should appear showing validated structures
matching this query sequence. Since we are performing this exercise as
a validation example for the method,
3.4 Structure-Based Models from Coevolution
Once we have obtained data for secondary structure and
coevolutionary couplings as proxies for the secondary- and
tertiary-fold levels, respectively, we can merge these data and use
them as input for structure prediction. First, in order to run folding
with molecular dynamics simulations using structure-based models, we
need to generate: (i) an initial unfolded model for aquaporin-1 and
(ii) a topology file containing all details about the physical
properties of the system, such as the mass of atoms, covalent bonds,
interaction potentials between specific pairs obtained from DCA, and
energies involved in conformational movements (variations in bond
lengths, angles, and dihedrals). Further general information about
physical models and the approach of molecular dynamics simulations can
be found elsewhere [26, 54–56]. To generate these files, we should use
a Python script provided at the link described in Subheading 2.5 (file
named dcasbm.tar.gz). Extract the file into a folder inside the working
directory using the following command:
Now, run the script to generate the protein model and topol-
ogy files including DCA signals using the following command:
3.5 Folding Simulations
With the initial coordinates and topological description files for
aquaporin-1 at hand, we can now run folding simulations. This procedure
drives the association of protein residue pairs identified as highly
coevolving by applying a combination of repulsive and attractive energy
potentials (SBMs) in molecular simulations [19, 25, 26, 32]. In this
process, the system temperature is gradually reduced, in an approach
known as simulated annealing, until the total conformational energy of
the system reaches a minimum where most predicted interactions are
satisfied, yielding a native-like folded conformation [6, 25, 33].
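The gradual temperature reduction can be pictured with a small Python sketch of a linear annealing schedule. The function name and the start and end temperatures are hypothetical and are given in arbitrary reduced units; they are not the values stored in the simulation parameter file, and a linear ramp is only one possible choice of schedule.

```python
def annealing_schedule(t_start, t_end, n_steps):
    """Return n_steps temperatures decreasing linearly from t_start to t_end."""
    if n_steps < 2:
        raise ValueError("need at least two steps")
    if t_end > t_start:
        raise ValueError("annealing requires t_end <= t_start")
    step = (t_start - t_end) / (n_steps - 1)
    return [t_start - k * step for k in range(n_steps)]


# Hypothetical schedule in reduced temperature units.
schedule = annealing_schedule(t_start=150.0, t_end=50.0, n_steps=5)
```

GROMACS exposes comparable settings through the annealing options of the .mdp parameter file, so in the protocol itself no such code needs to be written.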
A GROMACS version with support for Gaussian potentials
(see Subheading 2.8) is required. We should use the files created
in the last section as input for the simulation using the generated
topology and coordinate files for aquaporin-1 and the file with the
simulation parameters downloaded as part of the package provided
in Subheading 2.5 (file named sbm_calpha_SA.mdp) (see Note 6).
This procedure can be done using the following script:
3.6 Analysis of Predicted Models
After running a folding simulation for aquaporin-1, we can now
check the predicted conformation and compare it with the respective
experimental models available in the literature. We can look for
Fig. 3 Visual comparison of predicted and experimental conformations of human aquaporin-1 using UCSF
Chimera
The -p1 and -p2 options define the two structures to be aligned,
and the -seqnum option ensures that the sequence alignment matches
exactly the corresponding residue numbering in the PDB files, which is
the case here since the two proteins share the same
3.7 Rebuilding All-Atom Protein Structures
In this step, we will generate an all-atom protein structure from our
predicted Cα folded model. To perform this step, we will make use of
the REMO server [45]. Go to the web page of REMO (Subheading 2.12), and
upload the final folded model obtained for aquaporin-1 converted to pdb
format (run.pdb) using the Browse... button. Fill in the e-mail form,
and submit the job using the button run REMO at the bottom of the page.
After a couple of minutes, you should receive the link to the REMO
results at the e-mail address that you provided. Download the generated
model and visualize it using Chimera (Fig. 4).
Fig. 4 All-atom model for aquaporin-1 generated by REMO using the coarse-
grained predicted fold model. The full protein TM-score between predicted
(purple) and experimental (white) models is 0.68, considering Cα carbons
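The TM-score quoted for Fig. 4 follows the scoring function of Zhang and Skolnick [59]. The Python sketch below evaluates that formula for a single, already superimposed pair of structures. It is an illustration only: the published score is the maximum over all superpositions, which this sketch does not search for, and the function name and the lower bound of 0.5 Å on d0 are choices made here.

```python
def tm_score(distances, target_length):
    """TM-score formula for one fixed superposition.

    distances: Cα-Cα distances (in Å) between aligned residue pairs.
    target_length: number of residues in the reference structure.
    """
    if target_length <= 15:
        raise ValueError("d0 is undefined for 15 residues or fewer")
    # Length-dependent distance scale d0 from Zhang and Skolnick [59].
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    d0 = max(d0, 0.5)  # lower bound chosen for this sketch
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length


# A perfect match of all 100 residues, and one covering only half of them.
perfect = tm_score([0.0] * 100, 100)
half = tm_score([0.0] * 50, 100)
```

Because the sum is divided by the length of the reference structure, residues that are missing from the model lower the score even when the aligned part fits perfectly.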
3.8 Additional Examples
For additional test cases, we suggest that the reader try the same
protocol for folding prediction using other interesting biological
systems. Some suggestions are provided:
1. The human small G protein RAP2A. UniProt code P10114 and
PDB code 1KAO.
2. The bacterial ABC transporter, a larger transmembrane protein.
UniProt code P06609 and PDB code 1L7V.
3. The receiver domain of DesR from Bacillus subtilis, a
transcriptional regulatory protein. UniProt code O34723 and PDB
code 4LE1.
In this chapter we discussed a convenient methodology to
predict 3D coordinates of folded protein structures based on
coevolutionary information and molecular dynamics. We provide
resources and software tools that are free to access and instructions
on how to use them to elucidate structures with high TM-scores.
The information contained in this chapter is general and can be
used to study and infer structures of many proteins for which no
structural information is available. With the advent of sequencing
4 Notes
Acknowledgments
The authors acknowledge financial support from the São Paulo Research
Foundation (FAPESP) (Grants 2015/13667-9, 2010/16947-9,
2013/05475-7, and 2013/08293-7) and funding from the University of
Texas at Dallas.
References
1. Morcos F, Pagnani A, Lunt B et al (2011) Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A 108:E1293–E1301
2. Hamilton N, Burrage K, Ragan MA, Huber T (2004) Protein contact prediction using patterns of correlation. Proteins 56:679–684
3. Ivankov DN, Finkelstein AV, Kondrashov FA (2014) A structural perspective of compensatory evolution. Curr Opin Struct Biol 26:104–112
4. de Juan D, Pazos F, Valencia A (2013) Emerging methods in protein co-evolution. Nat Rev Genet 14:249–261
5. Morcos F, Hwa T, Onuchic JN, Weigt M (2014) Direct coupling analysis for protein contact prediction. Methods Mol Biol 1137:55–70
6. Sulkowska JI, Morcos F, Weigt M et al (2012) Genomics-aided structure prediction. Proc Natl Acad Sci U S A 109:10340–10345
7. Hopf TA, Colwell LJ, Sheridan R et al (2012) Three-dimensional structures of membrane proteins from genomic sequencing. Cell 149:1607–1621
8. Ovchinnikov S, Kamisetty H, Baker D (2014) Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. Elife 3:e02030
9. Kamisetty H, Ovchinnikov S, Baker D (2013) Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci U S A 110:15674–15679
10. Skwark MJ, Abdel-Rehim A, Elofsson A (2013) PconsC: combination of direct information methods and alignments improves contact prediction. Bioinformatics 29:1815–1816
11. Ekeberg M, Lövkvist C, Lan Y et al (2013) Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys Rev E Stat Nonlinear Soft Matter Phys 87:012707
12. Hayat S, Sander C, Marks DS, Elofsson A (2015) All-atom 3D structure prediction of transmembrane β-barrel proteins from sequences. Proc Natl Acad Sci U S A 112:5413–5418
13. Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Nat Biotechnol 30:1072–1080
14. Jones DT, Singh T, Kosciolek T, Tetchner S (2015) MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31:999–1006
15. Sadowski MI, Taylor WR (2013) Prediction of protein contacts from correlated sequence substitutions. Sci Prog 96:33–42
16. Hopf TA, Morinaga S, Ihara S et al (2015) Amino acid coevolution reveals three-dimensional structure and functional domains of insect odorant receptors. Nat Commun 6:6077
17. Schug A, Weigt M, Onuchic JN et al (2009) High-resolution protein complexes from integrating genomic information with molecular simulation. Proc Natl Acad Sci U S A 106:22124–22129
18. Tamir S, Rotem-Bamberger S, Katz C et al (2014) Integrated strategy reveals the protein interface between cancer targets Bcl-2 and NAF-1. Proc Natl Acad Sci U S A 111:5177–5182
19. dos Santos RN, Morcos F, Jana B et al (2015) Dimeric interactions and complex formation using direct coevolutionary couplings. Sci Rep 5:13652
20. Morcos F, Schafer NP, Cheng RR et al (2014) Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection. Proc Natl Acad Sci U S A 111:12408–12413
21. Mallik S, Kundu S (2015) Co-evolutionary constraints of globular proteins correlate with their folding rates. FEBS Lett 589:2179–2185
22. Morcos F, Jana B, Hwa T, Onuchic JN (2013) Coevolutionary signals across protein lineages help capture multiple protein conformations. Proc Natl Acad Sci U S A 110:20533–20538
23. Sfriso P, Duran-Frigola M, Mosca R et al (2016) Residues coevolution guides the systematic identification of alternative functional conformations in proteins. Structure 24:116–126
24. Cheng RR, Morcos F, Levine H, Onuchic JN (2014) Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proc Natl Acad Sci U S A 111:E563–E571
25. Jana B, Morcos F, Onuchic JN (2014) From structure to function: the convergence of structure based models and co-evolutionary information. Phys Chem Chem Phys 16:6496–6507
26. Noel JK, Levi M, Raghunathan M et al (2016) SMOG 2: a versatile software package for generating structure-based models. PLoS Comput Biol 12:e1004794
27. Noel JK, Whitford PC, Sanbonmatsu KY, Onuchic JN (2010) SMOG@ctbp: simplified deployment of structure-based models in GROMACS. Nucleic Acids Res 38:W657–W661
28. UniProt Consortium (2015) UniProt: a hub for protein information. Nucleic Acids Res 43:D204–D212
29. Bateman A (2000) The Pfam protein families database. Nucleic Acids Res 28:263–266
30. Finn RD, Coggill P, Eberhardt RY et al (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44:D279–D285
31. Göbel U, Sander C, Schneider R, Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins Struct Funct Genet 18:309–317
32. Lammert H, Schug A, Onuchic JN (2009) Robustness and generalization of structure-based models for protein folding and function. Proteins 77:881–891
33. Onuchic JN, Luthey-Schulten Z, Wolynes PG (1997) Theory of protein folding: the energy landscape perspective. Annu Rev Phys Chem 48:545–600
34. Pirovano W, Heringa J (2010) Protein secondary structure prediction. Methods Mol Biol 609:327–348
35. Yang Y, Gao J, Wang J et al (2018) Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Brief Bioinform 19:482–494. https://doi.org/10.1093/bib/bbw129
36. Drozdetskiy A, Cole C, Procter J, Barton GJ (2015) JPred4: a protein secondary structure prediction server. Nucleic Acids Res 43:W389–W394
37. Yachdav G, Kloppmann E, Kajan L et al (2014) PredictProtein: an open resource for online prediction of protein structural and functional features. Nucleic Acids Res 42:W337–W343
38. Buchan DWA, Minneci F, Nugent TCO et al (2013) Scalable web services for the PSIPRED Protein Analysis Workbench. Nucleic Acids Res 41:W349–W357
39. Heffernan R, Paliwal K, Lyons J et al (2015) Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Sci Rep 5:11476
40. Pronk S, Páll S, Schulz R et al (2013) GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics 29:845–854
41. Kutzner C, Páll S, Fechner M et al (2015) Best bang for your buck: GPU nodes for GROMACS biomolecular simulations. J Comput Chem 36:1990–2008
42. Meyer EE (1997) The first years of the Protein Data Bank. Protein Sci 6:1591–1597
43. Young J, RCSB PDBj PDBe Protein Data Bank (2009) Annotation and curation of the Protein Data Bank. Nat Preced. https://doi.org/10.1038/npre.2009.3379.1
44. Martínez L, Andreani R, Martínez JM (2007) Convergent algorithms for protein structural alignment. BMC Bioinformatics 8:306
45. Li Y, Zhang Y (2009) REMO: a new protocol to refine full atomic protein models from C-alpha traces by optimizing hydrogen-bonding networks. Proteins 76:665–676
46. Maupetit J, Gautier R, Tufféry P (2006) SABBAC: online Structural Alphabet-based protein BackBone reconstruction from Alpha-Carbon trace. Nucleic Acids Res 34:W147–W151
47. Rotkiewicz P, Skolnick J (2008) Fast procedure for reconstruction of full-atom protein models from reduced representations. J Comput Chem 29:1460–1465
48. Agre P (2006) The aquaporin water channels. Proc Am Thorac Soc 3:5–13
49. Ishibashi K, Sasaki S (1997) Aquaporin water channels in mammals. Clin Exp Nephrol 1:247–253
50. Agre P, Kozono D (2003) Aquaporin water channels: molecular mechanisms for human diseases. FEBS Lett 555:72–78
51. Marks DS, Colwell LJ, Sheridan R et al (2011) Protein 3D structure computed from evolutionary sequence variation. PLoS One 6:e28766
52. Ash RB (2012) Information theory. Courier Corporation, Dover Publications Inc, Mineola, NY
53. Freedman D, Pisani R, Purves R (2007) Statistics: fourth international student edition. W. W. Norton & Company, New York, NY
54. Rapaport DC (2004) The art of molecular dynamics simulation. Cambridge University Press, New York, NY
55. Karplus M, Kuriyan J (2005) Molecular dynamics and protein function. Proc Natl Acad Sci U S A 102:6679–6685
56. Scheraga HA, Khalili M, Liwo A (2007) Protein-folding dynamics: overview of molecular simulation techniques. Annu Rev Phys Chem 58:57–83
57. Ruiz Carrillo D, To Yiu Ying J, Darwis D et al (2014) Crystallization and preliminary crystallographic analysis of human aquaporin 1 at a resolution of 3.28 Å. Acta Crystallogr F Struct Biol Commun 70:1657–1663
58. Subbiah S (1996) Protein motions. Springer, Berlin
59. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57:702–710
Chapter 6
Abstract
The comparative study of homologous proteins can provide abundant information about the functional and
structural constraints on protein evolution. For example, an amino acid substitution that is deleterious may
become permissive in the presence of another substitution at a second site of the protein. A popular
approach for detecting coevolving residues is by looking for correlated substitution events on branches of
the molecular phylogeny relating the protein-coding sequences. Here we describe a machine learning
method (Bayesian graphical models) implemented in the open-source phylogenetic software package
HyPhy, http://hyphy.org, for extracting a network of coevolving residues from a sequence alignment.
Key words amino acid coevolution, Bayesian graphical model, hepatitis C virus, HyPhy, epistasis
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_6, © Springer Science+Business Media, LLC, part of Springer Nature 2019
106 Mariano Avino and Art F. Y. Poon
1.1 Correlated Substitutions
The simplest approach to detect interactions between amino acids
from the comparative study of protein sequences is to look for
correlations between different positions in the protein with respect
to the biochemical properties of residues [13], empirical substitution
rates [7, 14], or the occurrence of specific amino acids [15].
Correlations are often measured using mutual information [16–18] or
extensions of this approach that incorporate other information
[19, 20]. One of the major confounding factors affecting the
comparative study of protein sequences is that the amino acids in
different sequences are not independent observations; instead, they are
copies that descend from a common ancestor such that shared genotypes
may be due to identity by descent. In the worst-case scenario, a
comparative method may predict a false interaction between residues at
different positions of a protein because of two ancestral substitution
events that have been propagated to the observed descendants with no
further evolution [21]. This confounding due to common ancestry has
been independently recognized across fields, and a large number of
techniques have been proposed to resolve it, e.g., [22–25]. One common
approach is to change the focus from the observed characteristics of
residues at different sites to the amino acid substitutions that have
accumulated in the evolutionary history of the protein sequences [22].
Hence, we look for substitutions at different sites that occur on the
same branch of a phylogenetic tree, which implies causality by
proximity in time. This transformation of the data can
Detecting Amino Acid Coevolution with Bayesian Graphical Models 107
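The mutual information measure mentioned in Subheading 1.1 can be written out as a short Python sketch. It treats each sequence as an independent observation, so it is subject to exactly the confounding by common ancestry described above; it is shown only to make the quantity concrete and is not the Bayesian graphical model method of this chapter. The function name and the toy columns are hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two alignment columns."""
    if len(col_x) != len(col_y):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_x)
    count_x, count_y = Counter(col_x), Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (a, b), n_ab in count_xy.items():
        p_ab = n_ab / n
        mi += p_ab * math.log2(p_ab / ((count_x[a] / n) * (count_y[b] / n)))
    return mi


# Two hypothetical columns that covary perfectly, and two that do not.
linked = mutual_information("AAVV", "LLII")
unlinked = mutual_information("AAVV", "LILI")
```

In the first pair, knowing the residue in one column fully determines the residue in the other (one bit of information); in the second pair the columns are statistically independent and the mutual information is zero.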
2 Program Usage
2.1 Obtaining HyPhy and Scripts
HyPhy binaries can be downloaded for free at http://hyphy.org. If you
want to compile the software package, the source code can be
obtained at http://github.com/veg/hyphy. Alternatively, a
POSIX-threaded HyPhy binary (hyphymp) for Linux can be obtained with a
package manager; for example, in Ubuntu: sudo apt install hyphy-pt.
The scripts and data used here are available at
http://github.com/PoonLab/comet-prot. If you are running HyPhy from
the command line, then all commands should specify the path to your
local installation, e.g.:
HYPHYMP BASEPATH=/usr/local/lib/hyphy <path to script>.
1
The scripts in this chapter were tested with HyPhy version 2.220170201beta and release 2.2.7. HyPhy is a large
and complex software package that is constantly undergoing development by a small team of researchers and
programmers, and some of the more specialized features such as BGMs may temporarily break as newer versions
are released. If you compiled HyPhy from source, make sure that you are using a single-threaded (HYPHYSP) or
multiprocessing-enabled (HYPHYMP) build and not a message passing interface (MPI)-enabled (HYPHYMPI)
build; at the time of writing, there were residual issues in the source code related to MPI processing. If you
encounter any other problems, please submit an issue at https://github.com/veg/hyphy/issues and we will
attend to it as soon as possible.
Otherwise, you can run the scripts through the graphical user
interface by opening the file through the file selection dialog (⌘-O
on macOS, Ctrl-E on Windows, or File > Open > Open Batch File...).
2.2 Preparing Input Data
To run this analysis, you need a codon sequence alignment and a
phylogenetic tree relating these sequences. A codon sequence has a
single reading frame, excluding any frameshifts or stop codons. In
other words, the first three bases should map to a codon, and so on.
It does not have to cover the entire gene. Any stop codons need to be
replaced with gaps (interpreted as missing data); otherwise, the entire
codon site will be stripped from the alignment, throwing out useful
data and making it difficult to interpret the end result of the
analysis. HyPhy also has strict requirements on sequence names, which
cannot contain any characters other than alphanumeric characters and
the underscore character “_”. This name restriction also applies to tip
labels in the tree, so it is often more convenient to reconstruct a
tree after the following step.
A convenient tool for adjusting sequence names and simultaneously
replacing stop codons with gap characters is provided in the HyPhy
standard library. In the GUI, you can open the Standard Analysis menu
by pressing ⌘-E (macOS) or Ctrl-E (Windows), or by selecting
(Analysis > Standard Analyses...), then expanding the “Data File Tools”
tab and selecting CleanStopCodons.bf. From the command line, you can
launch an interactive menu by calling the HyPhy executable (HYPHYMP, or
hyphymp if you used a package manager) and then selecting the option
(4) Data File Tools followed by (6) to run the same script. You will be
prompted to specify a genetic code and codon data file (see below). The
last query is whether to discard duplicate sequences and/or codon sites
that are entirely gaps. Duplicate sequences cannot be separated in a
phylogeny, so unless you will be using a tree relating these sequences
based on additional information, there is no reason to retain all
copies for the analysis. Similarly, entirely gapped codon positions are
not phylogenetically informative and may be dropped unless they are
needed to preserve the coordinate system of the alignment.
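What this cleaning step does to the data can be pictured with a short Python sketch. It is not the CleanStopCodons.bf script and makes no claim about how HyPhy implements it; it assumes the universal genetic code, an in-frame sequence, and replacement of disallowed name characters by underscores. All function names and the example name and sequence are hypothetical.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}  # universal genetic code only


def clean_name(name):
    """Replace every character other than letters, digits, and '_'."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


def mask_stop_codons(seq):
    """Replace in-frame stop codons with '---' (read as missing data)."""
    seq = seq.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join("---" if c in STOP_CODONS else c for c in codons)


# Hypothetical sequence name and in-frame sequence with two stop codons.
name = clean_name("HCV|1a.H77 isolate")
masked = mask_stop_codons("ATGTAAGGCTGA")
```

Renaming sequences before building the tree keeps the tip labels and the alignment names identical, which avoids a separate renaming pass on the Newick file.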
A phylogenetic tree can be reconstructed from the sequence
alignment using any standard maximum likelihood program such as
RAxML [35] (https://github.com/stamatak/standard-RAxML)
or PhyML [36] (https://github.com/stephaneguindon/phyml).2
2
For this type of analysis, we prefer using maximum likelihood (ML) methods to reconstruct trees. If it is not
feasible to use ML methods due to excessive numbers of sequences and/or sequence lengths, we suggest using the
approximate ML program FastTree 2 [37], which can be orders of magnitude faster than the standard ML
programs. Neighbor-joining (NJ) methods also scale favorably with larger alignments, but tend to be less accurate
for reconstructing branch lengths. While there are NJ and ML tree reconstruction methods implemented in HyPhy,
they are not as efficient as these specialized programs and we do not recommend using them for larger data sets.
2.3 Fit Codon Model
The first step in our analysis pipeline is to fit a codon substitution
model [39] to the sequence alignment by running the script
fit_codon_model.bf (which depends on the utility file
fit_codon_model.ibf).4 Although there are standard methods for this
task in the default HyPhy menu, we implemented a customized method that
constrains the branch lengths in the input tree to be rescaled by a
global factor.5 This yields significant savings in computing time,
since we do not need to re-estimate the length of every branch in the
tree.
2.3.1 Choose a Genetic Code
In most cases, we will select option 1, the universal genetic code.
However, there is a large selection of genetic codes available in
HyPhy, and selecting an appropriate code is important for this
analysis because it will determine how nucleotide substitutions are
interpreted as missense, nonsense, or silent mutations.
2.3.2 Specify a Codon Data File
Enter a relative or absolute path6 to the file containing the cleaned
sequence alignment, or if using the GUI, use the filesystem dialog
to navigate to the file. Again, we assume that this alignment
comprises codon sequences with a consistent reading frame. The
presence of frameshifts due to alignment errors or actual sequence
insertions/deletions will prevent HyPhy from correctly reconstructing
non-synonymous and synonymous substitutions.
2.3.3 Model Options
This option determines how the model parameters are distributed
across branches in the tree. The “Local” option assigns an instance
of each parameter to every branch in the tree. For example, if we are
fitting a model with a transition/transversion bias parameter, then
this bias will be estimated independently for every branch. While
this results in a more flexible model, there is a greater danger of
3
A bootstrap support value is an empirical measure of confidence in a specific clade given the data. Most
phylogeny reconstruction programs should have an option to omit these values. If you already have a Newick
tree file and you just need to remove the support values, you can use the following UNIX command:
sed -E 's/\)[0-9.]+:/):/g' [input] > [output].
4
From this point onward, we assume that you are using the command-line interface. Unfortunately, this script
may not work properly with the GUI because of how HyPhy handles file paths. Even on the command line, this is
not straightforward. For example, we used the following invocation in the macOS Terminal:
HYPHYMP BASEPATH=/usr/local/lib/hyphy/ `pwd`/fit_codon_model.bf
If you want to take advantage of a multi-core CPU, you can add the argument CPU=[number of cores] immediately
after HYPHYMP. Note that not all steps in this analysis are able to utilize multiple threads.
5
If you want to examine this scaling factor, you can find it in the serialized likelihood function generated by this
script by searching for the parameter name scalingB.
6
If you’re using an operating system with a desktop environment, it’s often easier to drag the icon representing
your file into the terminal window instead of typing out the corresponding path. This works when running HyPhy
on the command line, but you need to use backspace to remove the space that is automatically appended to the end
of the path. HyPhy won’t be able to locate the file otherwise.
2.3.4 Nucleotide Model
The codon substitution model implemented in these scripts has a nested model of nucleotide
substitution8 that needs to be specified by the user. This step uses the 6-digit PAUP*-style model
specification string [42], which defines equality constraints for the six symmetric substitution
rates in alphabetical order: A ↔ C, A ↔ G, A ↔ T, C ↔ G, C ↔ T, and G ↔ T. For example, the
Tamura-Nei model [43] is specified by the string 010020, in which all the nucleotide transversions
share a single rate identifier (0). The most appropriate nucleotide model can be determined using a
model selection method such as ModelTest [44].
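The reading of the 6-digit string can be checked with a short Python sketch. It is not part of the HyPhy scripts; the function name parse_model_string and the constant RATE_ORDER are hypothetical and chosen only for illustration. The sketch groups the six rates by their shared identifier, following the alphabetical order given above.

```python
# Order of the six symmetric rates in the PAUP*-style string, as given in the text.
RATE_ORDER = ("AC", "AG", "AT", "CG", "CT", "GT")


def parse_model_string(spec):
    """Group the six symmetric rates by their shared identifier digit.

    Returns a dict that maps each identifier to the list of nucleotide
    pairs constrained to share a single rate.
    """
    if len(spec) != 6 or not spec.isdigit():
        raise ValueError("expected a string of exactly six digits")
    classes = {}
    for pair, ident in zip(RATE_ORDER, spec):
        classes.setdefault(ident, []).append(pair)
    return classes


# Tamura-Nei: one rate for all transversions, one per transition type.
tn93 = parse_model_string("010020")
for ident, pairs in sorted(tn93.items()):
    print(ident, pairs)
```

Under this reading, the string 012345 assigns every rate its own identifier (six free rate classes), whereas 000000 constrains all six rates to be equal.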
2.3.5 Specify a Tree File
At the prompt, enter a relative or absolute path to the file containing the reconstructed phylogeny
in a Newick tree string format. The tip labels in this tree need to correspond one-to-one with the
sequence labels in your alignment file.9
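Before running the script, the one-to-one correspondence between tip labels and sequence labels can be checked with a few lines of Python. The following is a minimal sketch and not part of the HyPhy workflow: it assumes a FASTA alignment and unquoted Newick labels, and the function names are hypothetical. Whitespace in the sequence labels is replaced with underscores, which is the quick fix for truncated labels.

```python
import re


def tip_labels(newick):
    """Extract tip labels from a Newick string (unquoted labels only)."""
    # A tip label directly follows '(' or ','; internal node labels follow ')'.
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))


def fasta_labels(fasta_text):
    """Return the sequence labels, with whitespace replaced by underscores."""
    labels = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            labels.add(re.sub(r"\s+", "_", line[1:].strip()))
    return labels


def label_mismatches(newick, fasta_text):
    """Return (labels found only in the tree, labels found only in the alignment)."""
    tree, alignment = tip_labels(newick), fasta_labels(fasta_text)
    return sorted(tree - alignment), sorted(alignment - tree)


# Hypothetical toy data: seq_C is missing from the alignment, seq_D from the tree.
tree = "((seq_A:0.1,seq_B:0.2):0.05,seq_C:0.3);"
fasta = ">seq A\nATGAAA\n>seq_B\nATGAAG\n>seq_D\nATGAAC\n"
print(label_mismatches(tree, fasta))
```

Two empty lists indicate that every tip has a matching sequence and vice versa.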
2.3.6 Fit a Likelihood Function
Finally, you are prompted to specify a relative or absolute path to a file to write a serialized
likelihood function, which encodes the data, model, and parameter estimates.10 After providing a
file path, the analysis will run and eventually converge to the maximum likelihood estimates of the
model parameters. It is usually a good idea to open the likelihood function output in a text editor
and inspect the
7
Prior to version 2.3.4, the text in HyPhy implies that these options allow rates to vary among branches, not sites:
“. . .branch lengths come from a user-chosen distribution.” We have revised this help text as of version 2.3.4 to
indicate that the distributions are used to model rate variation across sites, not branches.
8
A standard codon model is described by a 61-by-61 transition rate matrix and a single parameter R that
corresponds to the ratio of non-synonymous and synonymous substitution rates. The model assumes that the
system moves from one codon to another by single nucleotide substitutions; codon substitutions that require
more than one nucleotide change are not allowed.
9
Some phylogeny reconstruction programs truncate sequence labels and cause an error at this stage—for
example, neither RAxML nor FastTree2 will read sequence labels beyond a whitespace character. A quick fix in
this situation is to replace all whitespace characters with underscores in a text editor or with sed.
10
By convention, we use the file extension .lf and keep the same basename as the codon data file. This makes it
easier to track files that belong to the same workflow.
112 Mariano Avino and Art F. Y. Poon
2.4 Map Substitutions to the Tree
The next step in our pipeline is to reconstruct ancestral sequences in the tree based on the
maximum likelihood parameter estimates of the model [46]. If the descendant sequence has a
different codon than its ancestor, then we infer that at least one substitution has occurred along
the intervening branch in the tree [47]. This step is implemented by the script
MapMutationsToTree.bf. Upon running the script, the user is prompted to provide a relative or
absolute path to the file containing the serialized likelihood function from the previous step.
2.4.1 Select Reconstruction Option
HyPhy implements the fast joint ancestral reconstruction algorithm
formulated by Tal Pupko and colleagues [48]. Our script prompts
the user to decide whether to sample ancestral sequences from the
posterior distributions at each node of the tree. Sampling enables us
to accommodate the uncertainty in reconstructing ancestral states,
which is exacerbated for ancestral nodes that are further back in
time relative to the observed sequences. On the other hand, each
sample will comprise a set of ancestral sequences that compounds
the number of replicate analyses to be performed further along the
pipeline. We recommend using your discretion for this step: if it is
likely that the most recent common ancestor is separated from all
the observed sequences by excessive amounts of evolutionary time,
then it may be important to sample ancestral states for a more
robust but time-consuming analysis.
2.4.2 Output Options
This script was designed to generate two different kinds of outputs.
The first option is to generate a binary matrix where each row
corresponds to a branch of the tree, and each column corresponds
to a codon site in the alignment. This matrix is written to the
output file in a comma-separated tabular (CSV) format. A 1 indi-
cates that a non-synonymous substitution was mapped to the
respective branch and site. This matrix output is the raw material
for a BGM analysis, where each codon site is a potential node in the
graph and the branches represent independent observations. The
second option is to output a tab-delimited tabular file where each
11
NEXUS is a widespread format with known issues with standardization and usability, and has been implemen-
ted in diverse and often incompatible ways by multiple programs.
2.5 BGM Analysis
A Bayesian graphical model (BGM) analysis is implemented in the
script bayesgraph.bf, which depends on the utility (“include”)
file bayesgraph.ibf. These scripts were designed to emulate the
workflow provided by the Spidermonkey application on our
datamonkey.org webserver. We note here, however, that BGM ana-
lyses in HyPhy are more versatile than our example demonstrates.14
2.5.1 Input Data Matrix
First, the user is prompted for a relative or absolute path to the CSV file containing the
substitution map matrix that was produced by the MapMutationsToTree.bf script. If the CSV does not
contain a header row with column labels that indicate what each variable represents, then they
will be assigned integer values. It is preferable to use the ancestral residue labels generated by
the MapMutationsToTree.bf script because we are going to filter out columns based on the number of
inferred non-synonymous substitutions.
2.5.2 Filter Sites
Next, the program will prompt you for the minimum number of
substitutions for a site to be included in the BGM model. This
cutoff cannot be less than 1 because sites without any
non-synonymous substitutions contain no information for infer-
ring conditional dependencies. The script automatically determines
the maximum cutoff based on the largest number of substitutions
mapped at any single codon site. Once the user selects a number in
this range, the script will filter sites that do not meet this cutoff
from the data set and populate a BGM model with the remaining
variables.15
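The filtering rule can be sketched in Python as follows. This is an illustration only, not the code used by bayesgraph.bf: the function name filter_sites and the toy header labels (ancestral residue followed by position) are assumptions. Rows are branches, columns are codon sites, and a column is kept when its number of mapped non-synonymous substitutions reaches the cutoff.

```python
import csv
import io


def filter_sites(csv_text, min_subs):
    """Keep the columns (codon sites) with at least min_subs substitutions."""
    if min_subs < 1:
        raise ValueError("the cutoff cannot be less than 1")
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]
    data = [[int(value) for value in row] for row in rows[1:]]
    # Column sums give the number of substitutions mapped to each site.
    counts = [sum(column) for column in zip(*data)]
    keep = [i for i, count in enumerate(counts) if count >= min_subs]
    kept_header = [header[i] for i in keep]
    kept_data = [[row[i] for i in keep] for row in data]
    return kept_header, kept_data


# Hypothetical matrix: 4 branches (rows) by 3 codon sites (columns).
example = "M1,S2,Y3\n1,0,0\n1,0,1\n0,0,1\n1,0,0\n"
labels, matrix = filter_sites(example, min_subs=2)
print(labels, matrix)
```

Raising the cutoff to 3 in this toy example leaves only the first site, which shows how the cutoff trades the number of nodes in the graph against the information available per node.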
12
We have previously found this list output to be a more convenient format for debugging the script. It’s usually a
good idea to manually compare entries in this list against your sequence alignment to make sure that things make
sense.
13
Most phylogenetic tree reconstruction methods, such as maximum likelihood or neighbor-joining, will output
an unrooted tree. For an unrooted tree, the labels will be generated for the deepest internal node.
14
For example, you can customize on a node-by-node basis the number of “parental” nodes on which a given
node can be conditionally dependent. You can also load a serialized BGM from an XML Bayesian Interchange
format file and use this model to simulate additional data sets. For more details, please refer to the file
bayesgraph.ibf and the batch file tests/hbltests/BayesianGraphicalModels/TestBGM.bf in the HyPhy source
code distribution.
15
As a general rule of thumb, we try to not build a BGM model that has many more nodes than observations. The
number of substitutions provides a meaningful criterion for reducing the dimensionality of our data.
2.5.3 MCMC Settings
There are four settings that the user needs to specify for running a
Markov chain Monte Carlo (MCMC) sample. First, the user has to
specify the maximum number of parents that will be allowed per
node. This determines the complexity of the analysis. An analysis
with a one-parent maximum per node will run very fast and scales
easily with large numbers of variables, but loses the sensitivity to
detect complex interactions among nodes. Conversely, an analysis
that allows many more parents per node is far more computation-
ally complex.16 Second, the user needs to indicate the number of
steps to discard as a “burn-in” period. This budgets an amount of
time that one estimates it will require for this random walk to travel
from its initial point to a “reasonable” area of model space. Third,
the user needs to specify the number of steps to run the chain
sample following the “burn-in” period. This length sets an upper
limit to the effective sample size, which will almost surely be much
smaller because of the highly autocorrelated nature of MCMC.
Lastly, the user must specify the number of steps to extract from
this chain sample. Because of autocorrelation in the chain, there is
usually no benefit in retaining every step. To reduce the output file
sizes and increase the efficiency of post-processing, it is standard
practice to reduce the chain by sub-sampling at regular intervals.
The script defaults to a sub-sample of 100 steps, which results in
gaps of 1000 steps (see Fig. 1). The user should adjust the size of
the sub-sample roughly in proportion to the length of the post-
burn-in chain sample.
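The relationship between chain length, sub-sample size, and the gap between retained steps is simple arithmetic, sketched below in Python. The function names are hypothetical, and the chain length of 100,000 steps is an assumption inferred from the figures above (100 retained steps separated by 1000 steps); it is not stated explicitly in this section.

```python
def thinning_interval(chain_length, n_samples):
    """Number of chain steps between two consecutive retained samples."""
    if not 1 <= n_samples <= chain_length:
        raise ValueError("n_samples must be between 1 and chain_length")
    return chain_length // n_samples


def thin(chain, n_samples):
    """Retain n_samples regularly spaced steps from a post-burn-in chain."""
    step = thinning_interval(len(chain), n_samples)
    return chain[step - 1::step][:n_samples]


# Assumed post-burn-in chain of 100,000 steps, labelled 1 to 100000.
chain = list(range(1, 100001))
sample = thin(chain, 100)
print(thinning_interval(len(chain), 100), len(sample), sample[:3])
```

Doubling the chain length while keeping 100 retained steps doubles the gap, which is why the sub-sample size should grow roughly in proportion to the chain length.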
2.5.4 Output Settings
The bayesgraph.bf script generally produces three kinds of outputs. The script will prompt the user for only one relative or
absolute path for an output file, and paths for the other output
files will automatically be generated based on this first path. First,
the script will output the marginal posterior probabilities for
directed edges as a CSV formatted file. This is the raw material for
assembling the consensus BGM. Next, the script will write this
consensus BGM using the network visualization language DOT,
which can be converted into an image by several programs such as
GraphViz [49], Cytoscape [50], and Gephi [51]. Finally, the script
will record the posterior probability trace for all steps sub-sampled
from the original MCMC sample. This is important information for
assessing the convergence of the chain sample to the posterior
distribution (e.g., Fig. 1).
16
This is where the ability to customize the analysis implemented in the bayesgraph.bf script can be very
useful. If you have prior information that a subset of codon sites are involved in a large number of interactions, the
computational complexity of increasing the number of parents can be greatly reduced by modifying this parameter
for only these sites.
[Fig. 1 graphic. Axis labels: Posterior probability, Autocorrelation, MCMC step.]
3 Example
18
In an MCMC run, we observe autocorrelation when successive samples of parameter values lie very close together in
the parameter space and are therefore unrepresentative of the true underlying posterior distribution. We try to
decrease autocorrelation so that the MCMC sample provides a more precise estimate of the posterior distribution.
One way to accomplish this is by down-sampling to every n-th step.
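Autocorrelation in a trace can be quantified directly. The Python sketch below computes the sample autocorrelation at a given lag using the standard estimator (lagged covariance divided by the variance); it is not taken from the HyPhy scripts, and the function name is hypothetical.

```python
def autocorrelation(trace, lag):
    """Sample autocorrelation of a numeric trace at the given lag."""
    n = len(trace)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(trace) - 1")
    mean = sum(trace) / n
    variance = sum((value - mean) ** 2 for value in trace)
    if variance == 0:
        raise ValueError("the trace is constant")
    covariance = sum(
        (trace[i] - mean) * (trace[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


# A steadily drifting trace is positively autocorrelated at lag 1.
drifting = list(range(10))
# A trace that flips sign at every step is negatively autocorrelated.
alternating = [1, -1] * 5
print(autocorrelation(drifting, 1), autocorrelation(alternating, 1))
```

Applying the same function to a thinned trace (every n-th step) shows whether the chosen sub-sampling interval is long enough for the autocorrelation to decay toward zero.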
19
We have provided most of the data files in this example on our GitHub repository at https://github.com/
PoonLab/comet-prot/tree/master/data.
Fig. 2 Excerpt of phylogenetic tree reconstructed from the HCV sequence data using PhyML. The tree was
rooted at the midpoint (the halfway point on the longest path separating two tips in the tree). This image was
generated using the R package ggtree [63]. Branches are colored with respect to the number of
non-synonymous substitutions (increasing from blue to red). The shape of the tree is generally consistent
with the sequences belonging to a single HCV subtype. However, the tree also contains some clusters (inset) of
highly related sequences that may represent multiple sequences from the same individuals, or recent
transmission outbreaks of HCV
20
To generate an amino acid sequence from the column labels, we used the regular expression “[0-9]+,*” to
replace all instances with an empty string. In Python, this can be achieved with the re module:
seq = re.sub("[0-9]+,*", "", header.strip()), where header is a string variable containing the first line of the CSV file.
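As a self-contained illustration of this substitution, the sketch below applies the same regular expression to a toy header line. The header content is hypothetical; it assumes that each column label consists of the ancestral residue followed by the codon position.

```python
import re


def residues_from_header(header):
    """Strip positions and commas from the column labels, leaving the residues."""
    return re.sub(r"[0-9]+,*", "", header.strip())


# Hypothetical first line of the CSV file written by MapMutationsToTree.bf.
header = "M1,S2,Y3,T10\n"
seq = residues_from_header(header)
print(seq)
```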
21
This can be accomplished with the following R commands:
require(coda)
chain1 <- read.csv("chain1.trace.csv", header=FALSE)
chain2 <- read.csv("chain2.trace.csv", header=FALSE)
chains <- mcmc.list(mcmc(chain1$V1), mcmc(chain2$V1))
gelman.diag(chains, autoburnin=FALSE)
Fig. 3 Visualization of a consensus Bayesian graphical model of residue–residue interactions in HCV1b NS5b
proteins. Each node represents a codon site in the NS5b protein, labelled with the ancestral residue and
position. The size of the node is scaled to the log-transformed number of non-synonymous substitutions
reconstructed at the respective codon sites. Nodes are colored with respect to the NS5b domains: fingers
(red), palm (green), and thumb (blue); nodes representing residues in the C-terminal tail are left uncolored.
Arrows (edges) are drawn to represent inferred coevolution between the respective codon positions. The
edges are annotated with the marginal posterior probabilities (MPP, %). Only edges with an MPP value
exceeding 90% were included in this graph
Acknowledgements
References
1. Kihara D (2005) The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci 14(8):1955–1963
2. Sprinzak E, Margalit H (2001) Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 311(4):681–692
3. Horner DS, Pirovano W, Pesole G (2007) Correlated substitution analysis and the prediction of amino acid structural contacts. Brief Bioinform 9(1):46–56
4. Taylor WR, Hamilton RS, Sadowski MI (2013) Prediction of contacts from correlated sequence substitutions. Curr Opin Struct Biol 23(3):473–479
5. Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Nat Biotechnol 30(11):1072–1080
6. De Juan D, Pazos F, Valencia A (2013) Emerging methods in protein co-evolution. Nat Rev Genet 14(4):249
7. Göbel U, Sander C, Schneider R, Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins Struct Funct Bioinf 18(4):309–317
8. Korber B, Farber RM, Wolpert DH, Lapedes AS (1993) Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc Natl Acad Sci 90(15):7176–7180
9. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K (2002) A comprehensive review of genetic association studies. Genet Med 4(2):45–61
10. Kowarsch A, Fuchs A, Frishman D, Pagel P (2010) Correlated mutations: a hallmark of phenotypic amino acid substitutions. PLoS Comput Biol 6(9):e1000923
11. Weinreich DM, Delaney NF, DePristo MA, Hartl DL (2006) Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312(5770):111–114
12. Ivankov DN, Finkelstein AV, Kondrashov FA (2014) A structural perspective of compensatory evolution. Curr Opin Struct Biol 26:104–112
13. Neher E (1994) How frequent are correlated changes in families of protein sequences? Proc Natl Acad Sci 91(1):98–102
14. Olmea O, Rost B, Valencia A (1999) Effective use of sequence correlation and conservation in fold recognition. J Mol Biol 293(5):1221–1239
15. Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW (2000) Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol 17(1):164–178
16. Tillier ER, Lui TW (2003) Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics 19(6):750–755
17. Martin L, Gloor GB, Dunn S, Wahl LM (2005) Using information theory to search for co-evolving residues in proteins. Bioinformatics 21(22):4116–4124
18. Gouveia-Oliveira R, Pedersen AG (2007) Finding coevolving amino acid residues using row and column weighting of mutual information and multi-dimensional amino acid representation. Algorithms Mol Biol 2(1):12
19. Fernandes AD, Gloor GB (2010) Mutual information is critically dependent on prior assumptions: would the correct estimate of mutual information please identify itself? Bioinformatics 26(9):1135–1139
20. Jeong CS, Kim D (2012) Reliable and robust detection of coevolving protein residues. Protein Eng Des Sel 25(11):705–713
21. Felsenstein J (1985) Phylogenies and the comparative method. Am Nat 125(1):1–15
22. Shindyalov IN, Kolchanov NA, Sander C (1994) Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng 7(3):349–358
Abstract
Defining the extent of epistasis—the nonindependence of the effects of mutations—is essential for under-
standing the relationship of genotype, phenotype, and fitness in biological systems. The applications cover
many areas of biological research, including biochemistry, genomics, protein and systems engineering,
medicine, and evolutionary biology. However, the quantitative definitions of epistasis vary among fields,
and the analysis beyond just pairwise effects can be problematic. Here, we demonstrate the application of a
particular mathematical formalism, the weighted Walsh-Hadamard transform, which unifies a number of
different definitions of epistasis. We provide a computational implementation of such analysis using a
computer-generated higher-order mutational dataset. We discuss general considerations regarding the
null hypothesis for independent mutational effects, which then allows a quantitative identification of
epistasis in an experimental dataset.
Key words Epistasis, Higher-order epistasis, Context-dependent mutations, Amino acid interactions,
Evolutionary biology, Fitness, Combinatorial mutagenesis
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_7, © Springer Science+Business Media, LLC, part of Springer Nature 2019
124 Frank J. Poelwijk
1.1 Walsh-Hadamard Transform and Epistasis
In this protocol we will calculate the epistasis present in a complete combinatorial mutant
dataset, i.e., a set that contains all mutants that can be made by recombination of two parental
protein sequences that differ at N positions. More precisely, we start with a vector y containing
the phenotypic measurements for all 2^N combinations of mutations that can be generated at N
positions, where each position has two states. Calculating epistasis in such a dataset consists of
a linear mapping of the vector of 2^N phenotypes y
Context-Dependent Mutation Effects in Proteins 125
\[ \omega = \Omega\, y \tag{1} \]
\[ \omega = V H\, y \tag{2} \]
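Equation 2 can be sketched numerically in Python with NumPy. The definitions of V and H are not reproduced in this part of the text, so the sketch makes an explicit assumption: H is the Hadamard matrix built from the block [[1, 1], [1, -1]] and V is the diagonal weighting matrix built from the block diag(1/2, -1), both by repeated Kronecker products, which we take to be the recursive definitions of ref. [5]. The function name and the toy phenotypes are hypothetical.

```python
import numpy as np


def epistasis_operator(n_positions):
    """Return the 2^N x 2^N matrix V H as a Kronecker power of a 2 x 2 block."""
    hadamard = np.array([[1.0, 1.0], [1.0, -1.0]])
    weights = np.diag([0.5, -1.0])  # assumed recursion block for V
    block = weights @ hadamard      # equals [[0.5, 0.5], [-1.0, 1.0]]
    operator = np.ones((1, 1))
    for _ in range(n_positions):
        operator = np.kron(block, operator)
    return operator


# Two positions, with genotypes ordered 00, 01, 10, 11.
y_additive = np.array([0.0, 1.0, 2.0, 3.0])   # the double mutant is the sum of the singles
y_epistatic = np.array([0.0, 1.0, 2.0, 4.0])  # the double mutant exceeds the sum by 1
omega_additive = epistasis_operator(2) @ y_additive
omega_epistatic = epistasis_operator(2) @ y_epistatic
print(omega_additive)
print(omega_epistatic)
```

With this convention, the first entry of ω is the mean phenotype, the entries whose index has a single bit set are the averaged single-mutation effects, and the last entry is the pairwise term, which vanishes for the additive example.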
1.2 Overall Nonlinearities and the Null Hypothesis
Most experimental assays will exhibit overall nonlinearities, meaning that our observation of a
quantity of interest x is “distorted” according to some nonlinear function f(x) (Fig. 1a, b). This
can be specific to the measurement, for example, a limited linear range of
fluorescence detection in a flow cytometer or a limited linear con-
centration range in a binding assay due to non-specific binding.
Additionally, it can be inherent to the biological system, for exam-
ple, a saturating dependence of protein expression on an activator’s
binding affinity. If such nonlinearities are not taken into account,
the empirical dataset may appear more epistatic than it actually is
(Fig. 1b). To meaningfully quantify epistasis, an explicit null
hypothesis needs to be expressed, defining what it means for muta-
tions to act independently. Note that the null model also addresses
the question of whether mutational independence implies additiv-
ity or multiplicativity of effects: in fact, additivity in a quantity of
interest φ can appear in the dataset as multiplicativity if the assay
measures a quantity with φ in the exponent, for example, when we
measure equilibrium dissociation constants but are interested in
epistasis with respect to the binding free energy. If we have suffi-
cient knowledge about the system, we can directly choose the
applicable nonlinear scaling (in the case of dissociation constants,
this would be their logarithm) and define independence as additiv-
ity. In general, especially when the system is more complex, we can
remove (part of) the overall nonlinearities using a linear-nonlinear
optimization (see [16, 17] for similar approaches). Here, the vector
y containing the observables is transformed using a nonlinear
function g(y) that only has a small number of free parameters,
after which we attempt to optimize those parameters by maximiz-
ing the variance captured with first-order or low-order epistatic
[Fig. 1 graphic. Panel (a): latent variables. Panel (b): f(x) plotted against x, annotated “no epistasis in x” (x_{a+b} = x_a + x_b) and “epistasis in f” (f_{a+b} ≠ f_a + f_b).]
Fig. 1 Latent variables and observables. (a) Here the biological system of interest
is represented by variable x that can be decomposed into its epistatic components
ω. However, x is latent, and we can only observe its effects after some
nonlinear transformation f(x) has occurred, which may indicate saturation in the
experimental assay, or an instrument function. Experimental noise is modeled
through a random variable η, so that the observed phenotypic data are given by
y = f(x) + η. (b) Calculating epistasis requires an explicit null model. Trivial
nonlinearities in the transfer function f(x) can result in apparent epistasis in the
observable y without epistasis in the underlying latent variable x
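The situation in panel (b) can be reproduced with a few lines of Python. In this sketch, the saturating function x / (1 + x) and the latent values are hypothetical choices made only for illustration; they are not the transfer function or the data used in the protocol.

```python
def pairwise_epistasis(p_wt, p_a, p_b, p_ab):
    """Deviation of the double mutant from the sum of the single-mutant effects."""
    return p_ab - p_a - p_b + p_wt


def saturating(x):
    """A hypothetical saturating transfer function."""
    return x / (1.0 + x)


# Latent values are strictly additive: x_ab = x_a + x_b, relative to a wild type at 0.
latent = (0.0, 1.0, 2.0, 3.0)
observed = tuple(saturating(x) for x in latent)

latent_epistasis = pairwise_epistasis(*latent)
observed_epistasis = pairwise_epistasis(*observed)
print(latent_epistasis, observed_epistasis)
```

The latent values show no pairwise epistasis, whereas the observed values show a negative pairwise term that is produced entirely by the saturation.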
Here, from right to left after the equal sign, first, all epistatic
coefficients are calculated by multiplication with VH. Then this
vector is multiplied by matrix S_1, the identity matrix with entries
S_{1,ii} = 0 at positions that do not pertain to first-order (linear)
terms, so that first-order epistatic terms are kept intact, but everything
else is set to zero. Lastly, the inverse transformation H^{-1}V^{-1}
reconstructs the data using only the information contained in the
first-order terms. The linear-nonlinear optimization procedure
now consists of finding the values for the parameters in function
g that minimize the quantity
\[ h = \frac{\operatorname{var}\left(g(y) - g_1(y)\right)}{\operatorname{var}\left(g(y)\right)} \tag{6} \]
which is the sum of squares of the residuals divided by the total sum
of squares. For this approach to be successful, there are a number of
requirements for the form of the nonlinear function, which will be
discussed in the protocol steps and the Notes.
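The quantity h can be sketched in Python with NumPy as follows. This is a minimal illustration and not the implementation that accompanies the protocol. It assumes that V H is the Kronecker power of the block [[0.5, 0.5], [-1, 1]] (our reading of the definitions in ref. [5]) and that the first-order terms are the entries of ω whose index has exactly one bit set; the function names are hypothetical.

```python
import numpy as np


def epistasis_operator(n_positions):
    """Return V H as a Kronecker power of an assumed 2 x 2 block."""
    block = np.array([[0.5, 0.5], [-1.0, 1.0]])
    operator = np.ones((1, 1))
    for _ in range(n_positions):
        operator = np.kron(block, operator)
    return operator


def unexplained_fraction(values):
    """Compute h = var(g - g_1) / var(g) for already transformed data g(y)."""
    values = np.asarray(values, dtype=float)
    n_positions = values.size.bit_length() - 1
    if 2 ** n_positions != values.size:
        raise ValueError("expected 2^N phenotype values")
    operator = epistasis_operator(n_positions)
    omega = operator @ values
    # Diagonal of S_1: 1 for first-order terms (one bit set in the index), else 0.
    s1 = np.array([bin(i).count("1") == 1 for i in range(values.size)], dtype=float)
    # Solving the linear system applies the inverse transformation.
    first_order = np.linalg.solve(operator, s1 * omega)
    return float(np.var(values - first_order) / np.var(values))


h_additive = unexplained_fraction([0.0, 1.0, 2.0, 3.0])
h_epistatic = unexplained_fraction([0.0, 1.0, 2.0, 6.0])
print(h_additive, h_epistatic)
```

Because the variance ignores a constant offset, it makes no difference to h whether the zeroth-order (mean) term is retained in the reconstruction.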
1.3 Error Propagation and Significant Terms
After finding the nonlinear transformation that optimally removes the overall nonlinearities,
epistatic coefficients can be determined according to
\[ \omega = V H\, g(y) \tag{7} \]
Since this is a one-to-one mapping from measurements to
epistasis, the full set of 2^N epistatic coefficients in ω also captures
any measurement noise that is present in the transformed dataset
g(y). As the error for epistatic terms propagates exponentially, with
a factor two for each increasing order [5], the effects of noise are
more pronounced for higher-order terms. To prevent overfitting,
this protocol illustrates a self-consistent approach that determines
the noise contributions and establishes a significance threshold for
epistatic coefficients of each order. We will show at the end of the
protocol that these significant terms allow reconstruction of our
original computer-generated model data at high accuracy and with
little modeled measurement noise. Not surprisingly, when mea-
surement noise is too large, this approach will break down.
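The factor of two per order can be made concrete with a short NumPy sketch. It rests on two assumptions that are not stated in the text: the measurement noise on g(y) is independent with equal variance for all genotypes, and V H is the Kronecker power of the block [[0.5, 0.5], [-1, 1]]. Under these assumptions, the standard deviation of an epistatic coefficient equals the noise standard deviation multiplied by the Euclidean norm of the corresponding row of V H. The function names are hypothetical.

```python
import numpy as np


def epistasis_operator(n_positions):
    """Return V H as a Kronecker power of an assumed 2 x 2 block."""
    block = np.array([[0.5, 0.5], [-1.0, 1.0]])
    operator = np.ones((1, 1))
    for _ in range(n_positions):
        operator = np.kron(block, operator)
    return operator


def noise_gain_by_order(n_positions):
    """Map each epistatic order to the noise amplification of its coefficients."""
    gains = {}
    for index, row in enumerate(epistasis_operator(n_positions)):
        order = bin(index).count("1")  # number of positions involved in the term
        gains[order] = float(np.sqrt(np.sum(row ** 2)))
    return gains


gains = noise_gain_by_order(3)
for order in sorted(gains):
    print(order, round(gains[order], 4))
```

For N = 3, the gain doubles with each order, so a third-order coefficient carries eight times the noise of the zeroth-order term. This is the reason a separate significance threshold is needed for each order.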
2 Materials
3 Methods
3.1 Generate a Combinatorial Mutant Dataset
1. Initialize the parameters and matrices used for the calculation of the epistasis operator
(see Note 1) and its inverse. We use auxiliary variables A and B to indicate which positions are
involved in an epistatic term and what the order of that term is.
2. Generate a vector ω of length 2^N containing the epistatic contributions. The entries are
generated randomly according to a model of preferential attachment, where higher-order
terms are more likely to be non-zero if they involve positions
\[ f(x) = \frac{c \log(1 + x)}{1 + \log(1 + x)} \]
3.2 Removing the Overall Nonlinearities
1. Decide on the general form of the nonlinear scaling g(y) to be tested. Ideally, g is the
inverse of f, and we have g(y) ≈ x, but in general, f is not known accurately. The particular
functional form can be chosen based on knowledge of the instrument transfer function or the
experimental assay. If we do not have such knowledge about the system, monotonically increasing
or decreasing test functions can be tried that have properties consistent with the expected
nonlinearities, e.g., saturation for high values of the phenotypic data x (see Note 6). Here,
for simplicity, we assume we know the transfer function f up to some constant c, which we will
try to find using the linear-nonlinear optimization. The chosen test function is therefore the
inverse of f:
\[ g(c; y) = e^{\frac{y}{c - y}} - 1 \]
2. Initialize the parameters for the linear-nonlinear optimization.
If the transfer function for which we chose the test function has
pronounced saturating behavior, a small amount of
[Fig. 2 graphic: four scatter-plot panels (a-d). R² values printed on the panels: 0.94366, 0.80931, 0.97473.]
Fig. 2 Example data generated in the protocol. (a) Observable mutant phenotypes y versus latent variable x. A
clear nonlinear relation is present. (b) After removing overall nonlinearities, data is reconstructed using the
obtained significant epistatic terms. A linear relation with a high correlation can be observed, provided that the
measurement noise is not too large. (c) Reconstructed data from all epistatic terms (i.e., the data directly after
applying the nonlinear scaling). Large noise contributions are present for data points at larger values of x.
Comparing this to panel b shows the noise suppression that can be achieved when only significant epistatic
terms are used for reconstruction. (d) Comparison of the values of the significant epistatic terms and their
counterparts in the latent data
4. Find the value c* that minimizes h(c). With this value, we can perform the nonlinear
transformation of the observables y → g(c*; y).
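The steps above can be sketched end to end in Python with NumPy. The sketch is a simplified stand-in for the protocol, under stated assumptions: the latent data are purely additive and noise-free, V H is the Kronecker power of the block [[0.5, 0.5], [-1, 1]], and the minimization is a plain grid search, whereas the optimizer used by the protocol is not specified in this section. All function names, the mutation effects, and the value of c are hypothetical.

```python
import numpy as np


def epistasis_operator(n_positions):
    """Return V H as a Kronecker power of an assumed 2 x 2 block."""
    block = np.array([[0.5, 0.5], [-1.0, 1.0]])
    operator = np.ones((1, 1))
    for _ in range(n_positions):
        operator = np.kron(block, operator)
    return operator


def transfer(c, x):
    """f(x) = c log(1 + x) / (1 + log(1 + x))."""
    log_term = np.log1p(x)
    return c * log_term / (1.0 + log_term)


def inverse_scaling(c, y):
    """g(c; y) = exp(y / (c - y)) - 1, the inverse of f for the same c."""
    return np.expm1(y / (c - y))


def unexplained_fraction(values):
    """h = var(g - g_1) / var(g), with g_1 rebuilt from first-order terms only."""
    n_positions = values.size.bit_length() - 1
    operator = epistasis_operator(n_positions)
    s1 = np.array([bin(i).count("1") == 1 for i in range(values.size)], dtype=float)
    first_order = np.linalg.solve(operator, s1 * (operator @ values))
    return float(np.var(values - first_order) / np.var(values))


# Hypothetical additive latent phenotype for N = 4 positions.
effects = np.array([0.3, 0.5, 0.8, 1.1])
bits = np.array([[(i >> j) & 1 for j in range(4)] for i in range(16)])
x = 0.2 + bits @ effects
c_true = 2.0
y = transfer(c_true, x)

# Grid search over c. The grid starts above max(y) so that g stays finite.
grid = np.linspace(1.05 * y.max(), 5.0 * y.max(), 2000)
h_values = np.array([unexplained_fraction(inverse_scaling(c, y)) for c in grid])
c_star = float(grid[np.argmin(h_values)])
x_rescaled = inverse_scaling(c_star, y)
print(c_star, h_values.min())
```

In this noise-free example, the minimum of h lies at the grid point closest to the value of c used to generate the data. With measurement noise or genuine epistasis in the latent data, the minimum is shallower, and a finer or adaptive search may be preferable.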
4 Notes
Acknowledgments
References
1. Bateson W (1907) Facts limiting the theory of heredity. Science 26:649–660
2. Fisher RA (1918) The correlation between relatives on the supposition of Mendelian inheritance. Trans Roy Soc Edinb 52:399–433
3. Phillips PC (1998) The language of gene interaction. Genetics 149:1167–1171
4. Phillips PC (2008) Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet 9:855–867
5. Poelwijk FJ, Krishna V, Ranganathan R (2016) The context-dependence of mutations: a linkage of formalisms. PLoS Comput Biol 12:e1004771
6. Weinreich DM, Watson RA, Chao L (2005) Perspective: sign epistasis and genetic constraint on evolutionary trajectories. Evolution 59:1165–1174
7. Weinreich DM, Delaney NF, DePristo MA, Hartl DL (2006) Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312:111–114
8. Poelwijk FJ, Kiviet DJ, Weinreich DM, Tans SJ (2007) Empirical fitness landscapes reveal accessible evolutionary paths. Nature 445:383–386
9. Poelwijk FJ, Tănase-Nicola S, Kiviet DJ, Tans SJ (2011) Reciprocal sign epistasis is a necessary condition for multi-peaked fitness landscapes. J Theor Biol 272:141–144
10. Siepel A, Haussler D (2004) Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol 21:468–488
11. Beer T (1981) Walsh transforms. Am J Phys 49:466–472
12. Stoffer DS (1991) Walsh-Fourier analysis and its statistical applications. J Am Stat Assoc 86:461–479
13. Weinberger E (1991) Fourier and Taylor series on fitness landscapes. Biol Cybernetics 65:321–330
14. Stadler PF (2002) Spectral landscape theory. In: Crutchfield JP, Schuster P (eds) Evolutionary dynamics—exploring the interface of selection, accident, and function. Oxford University Press, Oxford, pp 231–272
15. Poelwijk FJ, Socolich M, Ranganathan R (2017) High-order epistasis linking genotype and phenotype in a protein. Submitted
16. Otwinowski J, Nemenman I (2013) Genotype to phenotype mapping and the fitness landscape of the E. coli lac promoter. PLoS One 8:e61570
17. Sailer ZR, Harms MJ (2017) Detecting high-order epistasis in nonlinear genotype-phenotype maps. Genetics 205:1079–1088
18. Theil H (1950) A rank-invariant method of linear and polynomial regression analysis. I, II, III, Nederl Akad Wetensch Proc 53:386–392, 521–525, 1397–1412
19. Sen PK (1968) Estimates of the regression coefficient based on Kendall’s tau. J Am Stat Assoc 63:1379–1389
Chapter 8
Abstract
Ancestral protein sequence reconstruction is a powerful technique for explicitly testing hypotheses about
the evolution of molecular function, allowing researchers to meticulously dissect how historical changes in
protein sequence impacted functional repertoire by altering the protein’s 3D structure. These techniques
have provided concrete, experimentally validated insights into ancient evolutionary processes and help
illuminate the complex relationship between protein sequence, structure, and function. Inferring the
protein family phylogenies on which ancestral sequence reconstruction depends and reconstructing the
sequences, themselves, are amenable to high-throughput computational analysis. However, determining
the structures of ancestral-reconstructed proteins and characterizing their functions typically rely on time-
consuming and expensive laboratory analyses, limiting most current studies to examining a relatively small
number of specific hypotheses. For this reason, we have little detailed, unbiased information about how
molecular function evolves across large protein family phylogenies. Here we describe a generalized protocol
that integrates ancestral sequence reconstruction with structural homology modeling and structure-based
molecular affinity prediction to characterize historical changes in protein function across families with
thousands of individual sequences. We highlight key steps in the analysis protocol requiring particularly
careful attention to avoid introducing potential errors as well as steps for which computationally efficient
subroutines can be substituted for more intensive approaches, allowing researchers to scale the analysis up
or down, depending on available resources and requirements for reproducibility and scientific rigor. In our
view, this approach provides a compelling complement to more laboratory-intensive procedures, generating
important contextual information that can help guide detailed experiments.
Key words Ancestral sequence reconstruction, Structural modeling, Protein function prediction,
Affinity prediction, Protein evolution, Molecular evolution
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_8, © Springer Science+Business Media, LLC, part of Springer Nature 2019
136 Kelsey Aadland et al.
2 Methods
2.1 Sequence Collection and Protein Family Curation
The first step in any ancestral sequence reconstruction (ASR) study
is the collection and curation of protein sequences from the family
of interest. In almost every case, the root of the protein family
under study is of interest, requiring the collection and curation of
“outgroup” sequences. The goal of this step is to collect all avail-
able members of the protein family under study—including appro-
priate outgroup sequences—and no sequences that are not
members of the target protein family or outgroup. Given the
efficiency of modern phylogenetic analysis software, we see no
reason to limit the amount of protein sequence data analyzed,
other than elimination of potentially erroneous sequences and
redundant sequences. Ideally, a roughly equal number of
“ingroup” and outgroup sequences should be included in the
analysis.
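As a minimal illustration of the redundancy-elimination step, the following sketch drops exact duplicate sequences from a FASTA-formatted collection. It is not one of the chapter's scripts: the function names and the example records are hypothetical, and it only catches identical sequences, whereas near-identical sequences would require an identity threshold.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_exact_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    kept = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            kept.append((header, seq))
    return kept


# Hypothetical input: seqA and seqB are identical once line wraps are joined.
example = """>seqA
MKTAYIAK
QRQISFVK
>seqB
MKTAYIAKQRQISFVK
>seqC
MKTAYIAKQRQISFVR
"""
unique = drop_exact_duplicates(parse_fasta(example))
print([header for header, _ in unique])
```

Keeping the first occurrence is an arbitrary choice; in practice one might prefer to retain the better-annotated record.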
There are many approaches to collecting sequence data, nearly all of which rely on some form of
sequence similarity search to identify members of the protein family of interest. The most common
approach starts with a small number of well-annotated protein family members from heavily studied
model organisms and collects homologs using some form of protein BLAST search, typically with an
e-value cutoff of 1.0e-5 or an alternative value thought to be appropriate for the family under study.
While this approach is
Reconstruction of Protein Sequence, Structure, Function 139
rescalePSSM.py (see footnote 1):
#!/usr/bin/env python
import sys, re
fname = sys.argv[1]
f = open(fname, 'r')
content = f.read()
f.close()
1 The Python scripts described in this chapter are hosted with the online version of the book.
psiblast -in_pssm cd00021.smp -db nr -out cd00021.nrhits.csv \
  -outfmt '10 qstart qend sstart send sacc ssciname stitle evalue qlen sseq' \
  -evalue 0.01 -max_target_seqs 10000000
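removePartialDomainHits.py:
#!/usr/bin/env python
import sys
# minimum proportion of the query (domain) length that a hit must cover #
min_prop = 0.75
handle = open(sys.argv[1],"r")
for line in handle:
    linearr = line.strip().split(",")
    # columns 3 and 4 (sstart, send) give the extent of the hit #
    hitlength = (int(linearr[3]) - int(linearr[2])) + 1
    # qlen is the second-to-last column; counting from the end stays #
    # correct even if the hit title (stitle) itself contains commas #
    qlen = int(linearr[-2])
    if float(hitlength) / float(qlen) >= min_prop:
        sys.stdout.write(line)
handle.close()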
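The coverage test in removePartialDomainHits.py can be checked on individual lines. The sketch below wraps the same arithmetic in a function (the name covers_domain and both example lines are hypothetical) and shows why the query length is read from the end of the line: a hit title containing a comma shifts the later columns but not their position relative to the end.

```python
def covers_domain(csv_line, min_prop=0.75):
    """Return True if the hit spans at least min_prop of the query length."""
    fields = csv_line.strip().split(",")
    sstart, send = int(fields[2]), int(fields[3])
    qlen = int(fields[-2])  # second-to-last column, robust to commas in the title
    hit_length = (send - sstart) + 1
    return hit_length / float(qlen) >= min_prop


# Hypothetical lines in the column order requested from psiblast above.
# The first title ("kinase, putative") contains a comma.
full_hit = "1,100,5,104,ACC00001,Homo sapiens,kinase, putative,1e-30,100,MKT"
partial_hit = "1,40,10,49,ACC00002,Mus musculus,hypothetical protein,1e-05,100,MKT"

print(covers_domain(full_hit))     # hit length 100 of 100
print(covers_domain(partial_hit))  # hit length 40 of 100
```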
combineDomainHits.py:
#!/usr/bin/env python
import sys, glob
# maps accession -> [species, (begin,end,domain), (begin,end,domain), ...] #
combined_hits = {}
for f in glob.glob("*.nrhits.csv"):
    domname = f.split(".nrhits.csv")[0]
    handle = open(f, "r")
    for line in handle:
        linearr = line.strip().split(",")
        acc = linearr[4]
        spp = linearr[5]
        beg = int(linearr[2])
        end = int(linearr[3])
        if acc in combined_hits.keys():
            combined_hits[acc].append((beg,end,domname))
        else:
            combined_hits[acc] = [spp,(beg,end,domname)]
    handle.close()
# write one line per accession, with domains sorted by start position #
for acc in combined_hits.keys():
    spp = combined_hits[acc][0]
    doms = combined_hits[acc][1:]
    doms.sort()
    sys.stdout.write("%s,%s" % (acc,spp))
    for (b,e,name) in doms:
        sys.stdout.write(",%s:%d..%d" % (name,b,e))
    sys.stdout.write("\n")
2.2 Alignment and Phylogenetic Tree Inference
Multiple sequence alignment forms the basis for inferring the protein family phylogeny and
reconstructing ancestral sequences. Conceptually, aligning protein sequences amounts to making
residue-level statements of homology: aligned residues from two different sequences are inferred to
have arisen from a common ancestor; aligning a residue from one sequence to a gap in another sequence
(“-”) amounts to inferring that the residue in the first sequence does not have a homologous residue
in the second sequence, either due to an insertion in the first or a deletion in the second. Most
approaches to phylogenetic inference treat alignment and tree inference as separate problems, first
aligning the sequences and then inferring the most likely tree, given that alignment. However, there
are approaches that attempt to simultaneously infer the sequence alignment and the phylogeny
[19, 41–44].
Unfortunately, different sequence alignment algorithms can produce different residue-level statements
of homology, even when overall alignment accuracies are similar [45–47]. Additionally, some regions of
the alignment may be easier to infer (e.g., highly conserved functional domains), while other regions
may be more error-prone [47]. Any errors in the sequence alignment can potentially impact phylogenetic
inference and ancestral sequence reconstruction [48–50]. Ideally, we would like to eliminate alignment
errors, but this is typically not possible.
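To make the notion of differing residue-level statements of homology concrete, the sketch below lists every pair of residues that an alignment places in the same column and reports the fraction of those pairs that a second alignment of the same sequences also contains. This is an illustrative measure of our own choosing, not a procedure from this chapter; the function names and toy alignments are hypothetical.

```python
from itertools import combinations


def residue_pairs(alignment):
    """Return the set of residue-level homology statements in an alignment.

    alignment maps sequence id -> gapped sequence (all of equal length).
    Each statement is ((id1, residue_index1), (id2, residue_index2)).
    """
    ids = sorted(alignment)
    # For every sequence, map each column to an ungapped residue index (None for a gap).
    col_to_res = {}
    for sid in ids:
        idx = 0
        mapping = []
        for ch in alignment[sid]:
            if ch == "-":
                mapping.append(None)
            else:
                mapping.append(idx)
                idx += 1
        col_to_res[sid] = mapping
    pairs = set()
    ncols = len(alignment[ids[0]])
    for a, b in combinations(ids, 2):
        for col in range(ncols):
            ra, rb = col_to_res[a][col], col_to_res[b][col]
            if ra is not None and rb is not None:
                pairs.add(((a, ra), (b, rb)))
    return pairs


def shared_fraction(aln1, aln2):
    """Fraction of the homology statements in aln1 that aln2 also makes."""
    p1, p2 = residue_pairs(aln1), residue_pairs(aln2)
    return len(p1 & p2) / float(len(p1)) if p1 else 0.0


# Two hypothetical alignments of the same sequences that place one gap differently.
aln_x = {"s1": "MK-TA", "s2": "MKLTA"}
aln_y = {"s1": "MKT-A", "s2": "MKLTA"}
print(shared_fraction(aln_x, aln_y))
```

In this toy case the two alignments agree on three of the four aligned residue pairs, even though both look equally plausible.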
Here we take a “robustness approach” to sequence alignment
and phylogenetic inference; we use a variety of alignment strategies
to generate a large number of plausible sequence alignments, use
each of these alignments to infer the protein family phylogeny, and
then combine these inferences—both formally and informally—to
identify the protein family phylogeny most “robust” to uncertainty
in the sequence alignment.
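One simple, formal way to combine several alignments of the same sequences is to concatenate them end to end, so that every alternative alignment contributes columns to a single matrix (often called an elision alignment). The sketch below illustrates that idea under our own assumptions; whether the chapter's makeElision.py script does exactly this cannot be confirmed from the listing, and the function name and toy data are hypothetical.

```python
def make_elision(alignments):
    """Concatenate alternative alignments of the same sequences end to end.

    alignments is a list of dicts mapping sequence id -> gapped sequence.
    """
    ids = sorted(alignments[0])
    for aln in alignments:
        if sorted(aln) != ids:
            raise ValueError("all alignments must contain the same sequence ids")
        if len(set(len(seq) for seq in aln.values())) != 1:
            raise ValueError("sequences within an alignment must have equal length")
    return dict((sid, "".join(aln[sid] for aln in alignments)) for sid in ids)


# Two hypothetical alignments of the same pair of sequences.
first_aln = {"seq1": "MK-TA", "seq2": "MKLTA"}
second_aln = {"seq1": "MKT-A", "seq2": "MKLTA"}
elided = make_elision([first_aln, second_aln])
print(elided["seq1"])
print(elided["seq2"])
```

Columns on which the alignments agree are effectively counted more than once, which is one reason this kind of combination is a heuristic rather than a statistical model of alignment uncertainty.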
As an example, we will use the popular (and relatively fast) alignment algorithms from clustalw2
[51, 52], muscle [53], and mafft [54] to align protein sequences. If computational resources permit,
other alignment programs such as msaprobs [55], probalign [56], probcons [57], and tcoffee [58] can
also be used.
makeElision.py:
#!/usr/bin/env python
import sys, glob
def parseFasta(infname):
    alnlen = 0
    alignment = {}
    handle = open(infname, "r")
    line = handle.readline()
    while line:
        if line[0] == ">":
            id = line[1:].strip()
            seq = ""
            line = handle.readline()
            while line and line[0] != ">":
                seq += line.strip()
                line = handle.readline()
            alnlen = len(seq)
            alignment[id] = seq
        else:
            line = handle.readline()
    handle.close()
    return (alnlen, alignment)
allids = []
handle = open("unaligned.fulllength.fasta", "r")
for line in handle:
    if line[0] == ">":
        allids.append(line[1:].strip())
handle.close()
full_aln = {}
for id in allids:
    full_aln[id] = ""
#!/usr/bin/env python
import sys
fname = sys.argv[1]
handle = open(fname, "r")
readmatrix = False
for line in handle:
    linearr = line.split()
    if len(linearr) > 0 and linearr[0] == "matrix":
        readmatrix = True
        continue
    if readmatrix:
        if len(linearr) > 0 and linearr[0] == ";":
            break
        elif len(linearr) > 0:
            id = linearr[0]
            seq = linearr[1].replace("?","-")
            sys.stdout.write(">%s\n%s\n" % (id,seq))
handle.close()
RAxML_marginalAncestralStates.alignment.ancseqs
RAxML_nodeLabelledRootedTree.alignment.ancseqs
RAxML_marginalAncestralProbabilities.alignment.ancseqs
makePresenceAbssenceMatrix.py:
#!/usr/bin/env python
import sys
putAncestralIndels.py:
#!/usr/bin/env python
import sys
handle = open(SEQF,"r")
for line in handle:
    linearr = line.split()
    id = linearr[0]
    seq = linearr[1]
    ancseqs[id] = seq
handle.close()
handle = open(INDF,"r")
for line in handle:
    linearr = line.split()
    id = linearr[0]
    ins = linearr[1]
    seq = ancseqs[id]
    sys.stdout.write(">%s\n" % id)
    for i in range(len(seq)):
        if ins[i] == "0" or seq[i] == "?":
            sys.stdout.write("-")
        else:
            sys.stdout.write(seq[i])
    sys.stdout.write("\n")
handle.close()
getBayesASR.py:
#!/usr/bin/env python
import sys
from scipy import stats
import random
# four arguments are required, so sys.argv must hold at least five entries #
if len(sys.argv) < 5:
    sys.stderr.write("randomASR.py N nodeID RAxML_marginalAncestralProbabilities.ancseqs RAxML_marginalAncestralProbabilities.ancindels\n")
    sys.stderr.write(" generates N random ancestral sequence 'draws' from the marginal probability distributions\n")
    sys.stderr.write(" including indels\n")
    sys.exit(1)
nseqs = int(sys.argv[1])
nodeid = sys.argv[2]
ancseqprobfname = sys.argv[3]
ancindprobfname = sys.argv[4]
    if len(linearr) < 2:
        break
    pdist = []
    idist = []
    i = 0
    for k in linearr:
        p = float(k)
        # keep only states with non-zero probability #
        if p > 0.0:
            pdist.append(p)
            idist.append(i)
        i += 1
    ind_probdists.append(stats.rv_discrete(values=(idist,pdist)))
    line = handle.readline()
handle.close()
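The idea behind drawing ancestral sequences from marginal probability distributions can be sketched without external dependencies. The example below is an illustration, not the chapter's getBayesASR.py: it uses random.choices from the Python standard library in place of scipy's rv_discrete, and the function name, the input layout (one dictionary of residue probabilities per site plus one presence probability per site), and the toy numbers are all assumptions.

```python
import random


def sample_ancestor(site_probs, presence_probs, rng):
    """Draw one ancestral sequence from per-site marginal distributions.

    site_probs: list of dicts {residue: probability}, one dict per site.
    presence_probs: list of probabilities that a residue (not a gap) is present.
    rng: a random.Random instance, so that draws can be made reproducible.
    """
    seq = []
    for probs, p_present in zip(site_probs, presence_probs):
        # Draw the indel state first: the site is a gap with probability 1 - p_present.
        if rng.random() >= p_present:
            seq.append("-")
            continue
        # Then draw a residue among the states with non-zero probability.
        states = [aa for aa, p in sorted(probs.items()) if p > 0.0]
        weights = [probs[aa] for aa in states]
        seq.append(rng.choices(states, weights=weights, k=1)[0])
    return "".join(seq)


# Hypothetical marginal distributions for a four-site ancestor.
site_probs = [{"M": 1.0}, {"A": 0.6, "S": 0.4}, {"G": 1.0}, {"K": 0.9, "R": 0.1}]
presence_probs = [1.0, 1.0, 0.0, 1.0]  # the third site is inferred to be a gap
rng = random.Random(1)
draws = [sample_ancestor(site_probs, presence_probs, rng) for _ in range(5)]
for d in draws:
    print(d)
```

Because each site is sampled independently, this treats the marginal distributions as if they were uncorrelated across sites, which is a simplification.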
structureAlign.py:
#!/usr/bin/env python
import os
import sys
import glob
from modeller import *
import modeller.salign
if len(sys.argv) < 2:
    sys.stderr.write("usage: structAln.py directory\n")
    sys.stderr.write(" will perform an iterative structural alignment of all\n")
    sys.stderr.write(" the .pdb files in the input directory\n")
    sys.stderr.write(" the resulting alignment is printed to directory_it.pap\n")
    sys.stderr.write(" and directory_it.ali\n")
    sys.exit(1)
thedir = sys.argv[1]
if thedir[-1] == "/":
    thedir = thedir[:-1]
2.3.1 Structural Modeling and Optimization
It is widely thought that proteins function primarily through their three-dimensional structure, which
determines the spatial distribution of biochemical properties and its dynamics [83–85]. Here we will
exploit this structural basis for molecular function to provide high-throughput molecular affinity
predictions across ancestral and extant protein sequences. We will use structural homology modeling to
infer 3D structures of protein sequences for which empirical structures are not available [79]. To
facilitate downstream affinity predictions, we will need at least one empirical structure of the
functional domain of interest from a protein family member or distantly related homolog in complex
with a ligand of interest, which can be a small molecule, DNA/RNA, or another protein. Most often,
this will be retrieved from the Protein Data Bank by sequence search [86]. Note that if the 3D
structure of the functional domain of interest has not been empirically solved in complex with an
appropriate ligand, it will need to be generated before proceeding with this protocol. Generating a
starting protein-ligand complex is probably best done using an empirical structure-determination
protocol, although de novo structure prediction or protein-ligand docking is an alternative method
[87–91].
Once an appropriate protein-ligand complex has been generated, its protein sequence should be aligned
to the same alignment used to infer ancestral sequences; this will ensure that all extant and
ancestral protein sequences are aligned to the structural template, facilitating high-throughput
structural modeling. Given an alignment of a protein sequence to a structural template, modeller can
be used to generate 100 structural models and evaluate their accuracy using a number of validation
scores [79].
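Choosing among many candidate models usually comes down to ranking them by an assessment score. The sketch below ranks models by DOPE score, where lower (more negative) values indicate a more favorable model. It runs without modeller installed: the input is a plain list of dictionaries that we assume resembles the per-model summaries modeller reports after a run, and the function name, file names, and scores are hypothetical.

```python
def pick_best_model(outputs, key="DOPE score"):
    """Return the successfully built model with the lowest value of `key`."""
    ok = [m for m in outputs if m.get("failure") is None]
    if not ok:
        raise ValueError("no model was built successfully")
    return min(ok, key=lambda m: m[key])


# Hypothetical summaries for three models of one ancestral sequence.
outputs = [
    {"name": "ANC948.B99990001.pdb", "failure": None, "DOPE score": -10250.1},
    {"name": "ANC948.B99990002.pdb", "failure": None, "DOPE score": -10433.7},
    # A failed model is skipped even though its score looks best.
    {"name": "ANC948.B99990003.pdb", "failure": "optimization failed",
     "DOPE score": -11000.0},
]
best = pick_best_model(outputs)
print(best["name"])
```

Ranking on a single score is a simplification; combining several validation scores, as the protocol suggests, guards against a model that happens to score well on one measure only.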
generateStructuralModel.py:
#!/usr/bin/env python
from modeller import *
from modeller.automodel import *
# create the modeller environment required by automodel #
env = environ()
# ALNFILE, KNOWNS, SEQ and NMODELS must be set for the system under study #
a = automodel(env,
              alnfile = ALNFILE,
              knowns = KNOWNS,
              sequence = SEQ,
              assess_methods=(assess.DOPE, assess.DOPEHR, assess.normalized_dope,
                              assess.GA341))
a.starting_model= 1
a.ending_model = NMODELS
a.make()
>P1;ANC948
sequence:ANC948:::::::0.00: 0.00
----QCDPDNDPSKTPISL-LSQLCEKRN-LCSPEYD------LVSQ-QG---P----
PH---TRTFTMRVTVGD----FV-F-QGT---GRSKKEAKHNAAEKMLDHLRQ-
CPDVPYPT--
/
........../
..........*
>P1;3ADL
structure:3ADL::A:::::0.00: 0.00
-------------SHEVGA-LQELVVQKG-WRLPEYT------VTQE-SG---P----
AH---RKEFTMTCRVER----FI-E-IGS---GTSKKLAKR-
NAAAKMLLRVHT----------
/
........../
..........*
generateStructuralModels.py:
#!/usr/bin/env python
import sys
import glob
import os
## NOTE: you will need to change the PDBID to the template you are using ##
PDBID = "3ADL"
>P1;%s
structure:%s::A:::::0.00: 0.00
%s
/
........../
..........*
"""
modelerpy="""
#!/usr/bin/env python
from modeller import *
from modeller.automodel import *
a = automodel(env,
alnfile = ALNFILE,
knowns = KNOWNS,
sequence = SEQ,
assess_methods=(assess.DOPE, assess.DOPEHR))
a.starting_model= 1
a.ending_model = NMODELS
a.make()
"""
def launchSeq(mydir,myid,myseq,number,mystruct):
    # group working directories in batches of 100 sequences (integer division) #
    index = number // 100
    topdir = "D%d" % index
    if not os.path.exists("%s/%s" % (mydir,topdir)):
        os.mkdir("%s/%s" % (mydir,topdir))
    if not os.path.exists("%s/%s/%s" % (mydir,topdir,myid)):
        os.mkdir("%s/%s/%s" % (mydir,topdir,myid))
    wrkdir = "%s/%s/%s" % (mydir,topdir,myid)
    #write alignment file#
    handle = open("%s/alignment.ali" % wrkdir, "w")
    handle.write(alnstr % (myid,myid,myseq,PDBID,PDBID,mystruct))
    handle.close()
    #write modeler run file#
    handle = open("%s/runModeller.py" % wrkdir, "w")
    handle.write(modelerpy % (PDBID,myid))
    handle.close()
    os.system("chmod 775 %s/runModeller.py" % wrkdir)
    os.system("cp parseBestModel.py %s/" % wrkdir)
    os.chdir("%s/" % wrkdir)
    os.system("./runModeller.py > SCORES.txt")
    os.system("./parseBestModel.py SCORES.txt")
    os.chdir("../../../")
    sys.stderr.write("finished: %s %s\n" % (mydir,myid))
num = 0
dr = "models"
parseBestModel.py:
#!/usr/bin/env python
import os
import sys
r"""
Calculate mean and standard deviation of data x[]:
mean = {\sum_i x_i \over n}
std = sqrt(\sum_i (x_i - mean)^2 \over n-1)
"""
def meanstdev(x):
    from math import sqrt
    n, mean, std = len(x), 0, 0
    for a in x:
        mean = mean + a
    mean = mean / float(n)
    for a in x:
        std = std + (a - mean)**2
    std = sqrt(std / float(n-1))
    return (mean, std)
scorefname = sys.argv[1]
printlines.sort(reverse=True)
for (s,v) in printlines:
    print(v)
print("BEST MODEL (out of %d): %s" % (len(models),bestmodel))
extractChains.py:
#!/usr/bin/env python
import sys
pdbfname = sys.argv[1]
chains = sys.argv[2:]
collectPKDs.py:
#!/usr/bin/env python
import sys
import glob
    seqid = f.split("/")[0]
    pkds = []
    handle = open(f,"r")
    for line in handle:
        linearr = line.split()
        # skip blank or short lines before testing the first two fields #
        if len(linearr) > 2 and linearr[0] == "predicted" and linearr[1] == "pKD:":
            pkds.append(float(linearr[2]))
    handle.close()
    pkds.sort(reverse=True)
    outf.write(seqid)
    for k in pkds:
        outf.write("\t%f" % k)
    outf.write("\n")
outf.close()
createPKD_BaseTree.py:
#!/usr/bin/env python
import sys
# need to point to the tree with branch lengths and the labelled tree #
BL_TREE="ASRinput.tre"
LA_TREE="RAxML_nodeLabelledRootedTree.alignment.ancseqs"
# parse trees
structurals = ["(",")",",",":",";"]
numericals = ["0","1","2","3","4","5","6","7","8","9",".","e","E","-"]
i1 = 0
i2 = 0
outf.write("\n")
outf.close()
colorPKDTree.py:
#!/usr/bin/env python
import sys
pkdfname = "all_pkds.txt"
trefname = "pkd_base_tree.tre"
colors=["#0025e5","#1926d2","#3327c0","#4c28ae","#66299b",
"#7f2a89","#992b77","#b22c64","#cc2d52","#e52e40","#ff302e"]
## the max and min pKd values should be chosen based on the system under study ##
new_max = 9.75
new_min = 4.75
def getColor(nodename):
    if nodename not in pkd_map.keys():
        return "[&!color=#d3d3d3]" ## missing data gets gray color ##
    else:
        pkd = pkd_map[nodename]
        for i in range(len(breaks)):
            if pkd < breaks[i]:
                return "[&!color=%s]" % colors[i]
        return "[&!color=%s]" % colors[-1]
structurals = ["(",")",",",":",";"]
numericals = ["0","1","2","3","4","5","6","7","8","9",".","e","E","-"]
i1 = 0
while i1 < len(nodetree):
    # parse branch length #
    if nodetree[i1] == ":":
        sys.stdout.write(nodetree[i1])
        i1 += 1
        brlen = ""
        while i1 < len(nodetree) and nodetree[i1] in numericals:
            brlen += nodetree[i1]
            i1 += 1
        sys.stdout.write(brlen)
    else:
        label = ""
        while nodetree[i1] not in structurals:
            label += nodetree[i1]
            i1 += 1
        colorlabel = getColor(label)
        sys.stdout.write("%s%s" % (label,colorlabel))
sys.stdout.write("\nend trees;\n")
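The getColor function above relies on a list of break points that is not shown in the listing. One plausible way to derive it, sketched here as an assumption rather than as the chapter's actual code, is to divide the range between the chosen minimum and maximum pKd into equal-width bins, one bin per color. The function names make_breaks and color_for are hypothetical.

```python
def make_breaks(vmin, vmax, ncolors):
    """Return ncolors - 1 equally spaced upper bin edges between vmin and vmax."""
    step = (vmax - vmin) / float(ncolors)
    return [vmin + step * (i + 1) for i in range(ncolors - 1)]


def color_for(pkd, breaks, colors):
    """Return the color of the first bin whose upper edge exceeds pkd."""
    for i, edge in enumerate(breaks):
        if pkd < edge:
            return colors[i]
    # Values at or above the last edge fall into the final bin.
    return colors[-1]


# Color ramp and pKd limits taken from the listing above.
colors = ["#0025e5", "#1926d2", "#3327c0", "#4c28ae", "#66299b",
          "#7f2a89", "#992b77", "#b22c64", "#cc2d52", "#e52e40", "#ff302e"]
breaks = make_breaks(4.75, 9.75, len(colors))

print(color_for(4.0, breaks, colors))   # below the minimum: first color
print(color_for(7.25, breaks, colors))  # midpoint of the range: middle color
print(color_for(10.0, breaks, colors))  # above the maximum: last color
```

Values outside the chosen limits are clamped to the end colors, which is why the limits should be set with the system under study in mind.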
References
1. Dean AM, Thornton JW (2007) Mechanistic approaches to the study of evolution: the functional synthesis. Nat Rev Genet 8(9):675–688. https://doi.org/10.1038/nrg2160
2. Harms MJ, Thornton JW (2013) Evolutionary biochemistry: revealing the historical and physical causes of protein properties. Nat Rev Genet 14(8):559–571. https://doi.org/10.1038/nrg3540
3. Cole MF, Gaucher EA (2011) Exploiting models of molecular evolution to efficiently direct protein engineering. J Mol Evol 72(2):193–203. https://doi.org/10.1007/s00239-010-9415-2
4. Ogawa T, Shirai T (2014) Tracing ancestral specificity of lectins: ancestral sequence reconstruction method as a new approach in protein engineering. Methods Mol Biol 1200:539–551. https://doi.org/10.1007/978-1-4939-1292-6_44
5. Yang Z, Kumar S, Nei M (1995) A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141(4):1641–1650
6. Shih P, Malcolm BA, Rosenberg S, Kirsch JF, Wilson AC (1993) Reconstruction and testing of ancestral proteins. Methods Enzymol 224:576–590
7. Zmasek CM, Godzik A (2011) Strong functional patterns in the evolution of eukaryotic genomes revealed by the reconstruction of ancestral protein domain repertoires. Genome Biol 12(1):R4. https://doi.org/10.1186/gb-2011-12-1-r4
8. Whitfield JH, Zhang WH, Herde MK, Clifton BE, Radziejewski J, Janovjak H, Henneberger C, Jackson CJ (2015) Construction of a robust and sensitive arginine biosensor through ancestral protein reconstruction. Protein Sci 24(9):1412–1422. https://doi.org/10.1002/pro.2721
9. Malcolm BA, Wilson KP, Matthews BW, Kirsch JF, Wilson AC (1990) Ancestral lysozymes reconstructed, neutrality tested, and thermostability linked to hydrocarbon packing. Nature 345(6270):86–89. https://doi.org/10.1038/345086a0
10. Clifton BE, Jackson CJ (2016) Ancestral protein reconstruction yields insights into adaptive evolution of binding specificity in solute-binding proteins. Cell Chem Biol 23(2):236–245. https://doi.org/10.1016/j.chembiol.2015.12.010
11. Bridgham JT, Carroll SM, Thornton JW (2006) Evolution of hormone-receptor complexity by molecular exploitation. Science 312(5770):97–101. https://doi.org/10.1126/science.1123348
12. Bridgham JT, Ortlund EA, Thornton JW (2009) An epistatic ratchet constrains the direction of glucocorticoid receptor evolution. Nature 461(7263):515–519. https://doi.org/10.1038/nature08249
13. Voordeckers K, Brown CA, Vanneste K, van der Zande E, Voet A, Maere S, Verstrepen KJ (2012) Reconstruction of ancestral metabolic enzymes reveals molecular mechanisms underlying evolutionary innovation through gene duplication. PLoS Biol 10(12):e1001446. https://doi.org/10.1371/journal.pbio.1001446
14. Ugalde JA, Chang BS, Matz MV (2004) Evolution of coral pigments recreated. Science 305(5689):1433. https://doi.org/10.1126/science.1099597
15. van Hazel I, Sabouhanian A, Day L, Endler JA, Chang BS (2013) Functional characterization of spectral tuning mechanisms in the great bowerbird short-wavelength sensitive visual pigment (SWS1), and the origins of UV/violet vision in passerines and parrots. BMC Evol Biol 13:250. https://doi.org/10.1186/1471-2148-13-250
16. Hall BG (2006) Simple and accurate estimation of ancestral protein sequences. Proc Natl Acad Sci U S A 103(14):5431–5436. https://doi.org/10.1073/pnas.0508991103
17. Ashkenazy H, Penn O, Doron-Faigenboim A, Cohen O, Cannarozzi G, Zomer O, Pupko T (2012) FastML: a web server for probabilistic reconstruction of ancestral sequences. Nucleic Acids Res 40(Web Server issue):W580–W584. https://doi.org/10.1093/nar/gks498
18. Redelings BD, Suchard MA (2005) Joint Bayesian estimation of alignment and phylogeny. Syst Biol 54(3):401–418. https://doi.org/10.1080/10635150590947041
19. Suchard MA, Redelings BD (2006) BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22(16):2047–2048. https://doi.org/10.1093/bioinformatics/btl175
20. Anderson DP, Whitney DS, Hanson-Smith V, Woznica A, Campodonico-Burnett W, Volkman BF, King N, Thornton JW, Prehoda KE (2016) Evolution of an ancient protein
83. Ashtawy HM, Mahapatra NR (2012) A comparative assessment of ranking accuracies of conventional and machine-learning-based scoring functions for protein-ligand binding affinity prediction. IEEE/ACM Trans Comput Biol Bioinform 9(5):1301–1313. https://doi.org/10.1109/TCBB.2012.36
84. Ashtawy HM, Mahapatra NR (2015) BgN-Score and BsN-Score: bagging and boosting based ensemble neural networks scoring functions for accurate binding affinity prediction of protein-ligand complexes. BMC Bioinformatics 16(Suppl 4):S8. https://doi.org/10.1186/1471-2105-16-S4-S8
85. Brylinski M (2013) Nonlinear scoring functions for similarity-based ligand docking and binding affinity prediction. J Chem Inf Model 53(11):3097–3112. https://doi.org/10.1021/ci400510e
86. Rose PW, Bi C, Bluhm WF, Christie CH, Dimitropoulos D, Dutta S, Green RK, Goodsell DS, Prlic A, Quesada M, Quinn GB, Ramos AG, Westbrook JD, Young J, Zardecki C, Berman HM, Bourne PE (2013) The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res 41(Database issue):D475–D482. https://doi.org/10.1093/nar/gks1200
87. Comeau SR, Gatchell DW, Vajda S, Camacho CJ (2004) ClusPro: an automated docking and discrimination method for the prediction of protein complexes. Bioinformatics 20(1):45–50
88. Kastritis PL, Bonvin AM (2010) Are scoring functions in protein-protein docking ready to predict interactomes? Clues from a novel binding affinity benchmark. J Proteome Res 9(5):2216–2225. https://doi.org/10.1021/pr9009854
89. Kozakov D, Beglov D, Bohnuud T, Mottarella SE, Xia B, Hall DR, Vajda S (2013) How good is automated protein docking? Proteins 81(12):2159–2166. https://doi.org/10.1002/prot.24403
90. Lensink MF, Wodak SJ (2013) Docking, scoring, and affinity prediction in CAPRI. Proteins 81(12):2082–2095. https://doi.org/10.1002/prot.24428
91. Roberts VA, Thompson EE, Pique ME, Perez MS, Ten Eyck LF (2013) DOT2: macromolecular docking with improved biophysical models. J Comput Chem 34(20):1743–1758. https://doi.org/10.1002/jcc.23304
92. Dolinsky TJ, Czodrowski P, Li H, Nielsen JE, Jensen JH, Klebe G, Baker NA (2007) PDB2PQR: expanding and upgrading automated preparation of biomolecular structures for molecular simulations. Nucleic Acids Res 35(Web Server issue):W522–W525. https://doi.org/10.1093/nar/gkm276
93. Pronk S, Pall S, Schulz R, Larsson P, Bjelkmar P, Apostolov R, Shirts MR, Smith JC, Kasson PM, van der Spoel D, Hess B, Lindahl E (2013) GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics 29(7):845–854. https://doi.org/10.1093/bioinformatics/btt055
94. Dias R, Timmers LF, Caceres RA, de Azevedo WF Jr (2008) Evaluation of molecular docking using polynomial empirical scoring functions. Curr Drug Targets 9(12):1062–1070
95. De Paris R, Quevedo CV, Ruiz DD, Norberto de Souza O, Barros RC (2015) Clustering molecular dynamics trajectories for optimizing docking experiments. Comput Intell Neurosci 2015:916240. https://doi.org/10.1155/2015/916240
96. Seo MH, Park J, Kim E, Hohng S, Kim HS (2014) Protein conformational dynamics dictate the binding affinity for a ligand. Nat Commun 5:3724. https://doi.org/10.1038/ncomms4724
97. Kruger DM, Ignacio Garzon J, Chacon P, Gohlke H (2014) DrugScorePPI knowledge-based potentials used as scoring and objective function in protein-protein docking. PLoS One 9(2):e89466. https://doi.org/10.1371/journal.pone.0089466
98. Camacho CJ, Zhang C (2005) FastContact: rapid estimate of contact and binding free energies. Bioinformatics 21(10):2534–2536. https://doi.org/10.1093/bioinformatics/bti322
99. Dias R, Kolaczkowski B (2017) Improving the accuracy of high-throughput protein-protein affinity prediction may require better training data. BMC Bioinformatics 18(Suppl 5):102. https://doi.org/10.1186/s12859-017-1533-z
100. Dias R, Kolaczkowski B (2015) Different combinations of atomic interactions predict protein-small molecule and protein-DNA/RNA affinities with similar accuracy. Proteins 83(11):2100–2114. https://doi.org/10.1002/prot.24928
101. O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011) Open Babel: an open chemical toolbox. J Cheminform 3:33. https://doi.org/10.1186/1758-2946-3-33
Chapter 9
Abstract
Ancestral sequence reconstruction (ASR) is a powerful tool to infer primordial sequences from
contemporary, i.e., extant ones. An essential element of ASR is the computation of a phylogenetic tree
whose leaves are the chosen extant sequences. Most often, the reconstructed sequence at the root of
this tree is of greatest interest: it represents the common ancestor (CA) of the sequences under study.
If this sequence encodes a protein, one can “resurrect” the CA by means of gene synthesis technology
and study the biochemical properties of this extinct predecessor with the help of wet-lab experiments.
However, ASR also deduces sequences for all internal nodes of the tree, and a careful analysis of these
“intermediates” can help to elucidate evolutionary processes. Moreover, one can identify key mutations
that alter proteins or protein complexes and are responsible for the differing properties of extant
proteins. As an illustrative example, we describe the protocol for the rapid identification of hotspots
determining the binding of the two subunits within the heteromeric complex imidazole glycerol phosphate
synthase.
Key words Ancestral sequence reconstruction, Vertical analysis, Evolutionary biochemistry, In silico
mutagenesis, Protein–protein interaction
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_9, © Springer Science+Business Media, LLC, part of Springer Nature 2019
172 Kristina Straub and Rainer Merkl
2 Method
3 Notes
for details see the MAFFT manual. For the initial generation of large MSAs, the option --auto is also
appropriate.
3. Modeling the history of insertions and deletions on an evolutionary time scale is difficult and,
for most ASR algorithms, requires the manual adjustment of primordial sequences. One can minimize
errors by choosing a set of sequences of relatively uniform length.
4. Jalview is an excellent tool for the preparation of sequence sets used in ASR. The Jalview command
Edit\Remove redundancy allows the selection of a percentage identity threshold and initiates the
subsequent comparison of all sequence pairs. If the similarity of any two sequences exceeds this
cutoff, the shorter sequence is discarded. A cutoff of 95% or lower is useful to remove redundant
sequences and to avoid highly articulated subtrees. The command Calculate\Sort by length makes it
possible to easily identify sequences that are much shorter or longer than the reference sequence.
These sequences and those introducing strikingly long indels can be erased by clicking their name and
the delete button. The command Web Service\Alignment offers several alternatives for MSA creation,
among them MAFFT.
5. Concatenation helps to deduce a robust tree due to the stronger phylogenetic signal spread over a
larger set of residue positions. Make sure that the sequences originate from the same species by using
for their linkage the Tax-Id assigned by the taxonomy browser of the NCBI. Note that concatenation is
only valid for sequences that coevolve and share the same evolutionary history for the entire period
under study.
6. For the visual inspection of trees, it is helpful to replace the hard-to-interpret database
identifiers with names that indicate the function of the proteins and/or the phylogenetic position of
the species contributing the sequences. We use our in-house tool Key2Ann [41] to denote the
phylogenetic lineage; see Fig. 1 for an example.
7. A detailed description of all the programs and their options belonging to the software suite
PhyloBayes can be found at www.phylobayes.org. For the reconstruction of amino acid sequences, we use
the CAT or JTT model and specify a minimal effective sample size of 100. Congruence can be tested by
calculating the maximum difference (maxdiff) of posterior probabilities of tree bipartitions by using
the PhyloBayes tool bpcomp; the maxdiff value should be below 0.3 [29]. Computation time can be reduced
by using the multi-core version PhyloBayes-MPI. Note that an MCMC calculation may take several weeks
if a large number of recent sequences were chosen.
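The redundancy filter described in Note 4 can be approximated in a few lines for readers who prefer a scripted workflow. The sketch below is not Jalview's implementation: how identity is computed (here, over columns where both aligned sequences carry a residue) and the greedy order of comparison are our assumptions, and the function names and toy alignment are hypothetical.

```python
def percent_identity(a, b):
    """Percent identity over columns where both aligned sequences have a residue."""
    matches = compared = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def remove_redundant(alignment, cutoff=95.0):
    """Greedy filter: of any pair above the cutoff, discard the shorter sequence."""
    def ungapped_len(sid):
        return len(alignment[sid].replace("-", ""))
    # Visit longer sequences first so that they are the ones retained.
    order = sorted(alignment, key=lambda sid: (-ungapped_len(sid), sid))
    kept = []
    for sid in order:
        if all(percent_identity(alignment[sid], alignment[k]) <= cutoff for k in kept):
            kept.append(sid)
    return kept


# Hypothetical aligned sequences: "b" is a truncated copy of "a".
aln = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAK--",
    "c": "MRSAYLGKQR",
}
print(remove_redundant(aln, cutoff=95.0))
```

Here "b" is discarded because it is identical to the longer "a" over all shared columns, while "c" (60% identical to "a") is retained.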
Elucidating a Stepwise Evolutionary Adaptation by Means of ASR 179
Acknowledgement
References
1. Lee D, Redfern O, Orengo C (2007) Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 8(12):995–1005. https://doi.org/10.1038/nrm2281
2. Schymkowitz J, Borg J, Stricher F et al (2005) The FoldX web server: an online force field. Nucleic Acids Res 33(Web Server issue):W382–W388
3. Janda JO, Meier A, Merkl R (2013) CLIPS-4D: a classifier that distinguishes structurally and functionally important residue-positions based on sequence and 3D data. Bioinformatics 29(23):3029–3035. https://doi.org/10.1093/bioinformatics/btt519
4. Zellner H, Staudigel M, Trenner T et al (2012) PresCont: predicting protein-protein interfaces utilizing four residue properties. Proteins 80(1):154–168. https://doi.org/10.1002/prot.23172
5. Plach MG, Löffler P, Merkl R, Sterner R (2015) Conversion of anthranilate synthase into isochorismate synthase: implications for the evolution of chorismate-utilizing enzymes. Angew Chem Int Ed 54(38):11270–11274. https://doi.org/10.1002/anie.201505063
6. Edgar RC, Batzoglou S (2006) Multiple sequence alignment. Curr Opin Struct Biol 16(3):368–373
7. Harms MJ, Thornton JW (2010) Analyzing protein structure and function using ancestral gene reconstruction. Curr Opin Struct Biol 20(3):360–366. https://doi.org/10.1016/j.sbi.2010.03.005
8. Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4):406–425
9. Gerlt JA (2017) Genomic enzymology: web tools for leveraging protein family sequence-function space and genome context to discover novel functions. Biochemistry 56(33):4293–4308. https://doi.org/10.1021/acs.biochem.7b00614
10. Merkl R, Sterner R (2016) Ancestral protein reconstruction: techniques and applications. Biol Chem 397(1):1–21. https://doi.org/10.1515/hsz-2015-0158
11. Thornton JW (2004) Resurrecting ancient genes: experimental analysis of extinct molecules. Nat Rev Genet 5(5):366–375. https://doi.org/10.1038/nrg1324
12. Liberles DA (2007) Ancestral sequence reconstruction. Oxford University Press, Oxford
13. Hochberg GKA, Thornton JW (2017) Reconstructing ancient proteins to understand the causes of structure and function. Annu Rev Biophys 46:247–269. https://doi.org/10.1146/annurev-biophys-070816-033631
14. Bornscheuer UT, Huisman GW, Kazlauskas RJ et al (2012) Engineering the third wave of biocatalysis. Nature 485(7397):185–194. https://doi.org/10.1038/nature11117
15. Romero-Romero ML, Risso VA, Martinez-Rodriguez S et al (2016) Engineering ancestral protein hyperstability. Biochem J 473(20):3611–3620. https://doi.org/10.1042/BCJ20160532
16. Massiere F, Badet-Denisot MA (1998) The mechanism of glutamine-dependent amidotransferases. Cell Mol Life Sci 54(3):205–222
17. Zalkin H, Smith JL (1998) Enzymes utilizing glutamine as an amide donor. Adv Enzymol Relat Areas Mol Biol 72:87–144
18. Beismann-Driemeyer S, Sterner R (2001) Imidazole glycerol phosphate synthase from Thermotoga maritima. Quaternary structure, steady-state kinetics, and reaction mechanism of the bienzyme complex. J Biol Chem 276(23):20387–20396
19. List F, Vega MC, Razeto A et al (2012) Catalysis uncoupling in a glutamine amidotransferase bienzyme by unblocking the glutaminase active site. Chem Biol 19(12):1589–1599. https://doi.org/10.1016/j.chembiol.2012.10.012
20. Reisinger B, Sperl J, Holinski A et al (2014) Evidence for the existence of elaborate enzyme complexes in the Paleoarchean era. J Am Chem Soc 136(1):122–129. https://doi.org/10.1021/ja4115677
21. Holinski A, Heyn K, Merkl R, Sterner R (2017) Combining ancestral sequence reconstruction with protein design to identify an interface hotspot in a key metabolic enzyme complex. Proteins 85(2):312–321. https://doi.org/10.1002/prot.25225
22. Bar-Rogovsky H, Stern A, Penn O et al (2015) Assessing the prediction fidelity of ancestral reconstruction by a library approach. Protein Eng Des Sel 28(11):507–518. https://doi.org/10.1093/protein/gzv038
23. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
24. Pruitt KD, Tatusova T, Klimke W, Maglott DR (2009) NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res 37(Database issue):D32–D36. https://doi.org/10.1093/nar/gkn721
25. Apweiler R, Martin M, O’Donovan C et al (2013) Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res 41(D1):D43–D47
26. Hunter S, Jones P, Mitchell A et al (2012) InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res 40(Database issue):D306–D312. https://doi.org/10.1093/nar/gkr948
27. Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4):772–780. https://doi.org/10.1093/molbev/mst010
28. Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ (2009) Jalview Version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics 25(9):1189–1191. https://doi.org/10.1093/bioinformatics/btp033
29. Castresana J (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17(4):540–552
30. Lartillot N, Lepage T, Blanquart S (2009) PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25(17):2286–2288. https://doi.org/10.1093/bioinformatics/btp368
31. Ali RH, Bark M, Miro J et al (2017) VMCMC: a graphical and statistical analysis tool for Markov chain Monte Carlo traces. BMC Bioinformatics 18(1):97. https://doi.org/10.1186/s12859-017-1505-3
32. Ronquist F, Huelsenbeck JP (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19(12):1572–1574
33. Bouckaert R, Heled J, Kuhnert D et al (2014) BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol 10(4):e1003537. https://doi.org/10.1371/journal.pcbi.1003537
34. Abascal F, Zardoya R, Posada D (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21(9):2104–2105. https://doi.org/10.1093/bioinformatics/bti263
35. Perriere G, Gouy M (1996) WWW-query: an on-line retrieval system for biological sequence banks. Biochimie 78(5):364–369
36. Rambaut A (2012) FigTree v1.4. http://tree.bio.ed.ac.uk/software/figtree/
37. Ciccarelli FD, Doerks T, von Mering C et al (2006) Toward automatic reconstruction of a highly resolved tree of life. Science 311(5765):1283–1287
38. Puigbo P, Wolf YI, Koonin EV (2009) Search for a ‘Tree of Life’ in the thicket of the phylogenetic forest. J Biol 8(6):59. https://doi.org/10.1186/jbiol159
39. Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24
Abstract
For highly divergent sequences, there is often insufficient information to reliably construct alignments and
phylogenetic trees. Since protein structure may be strongly conserved despite large divergences in
sequence, structural information can be used to help identify homology in such cases.
While there exist well-studied models of sequence evolution, structurally informed alignment methods
have typically made use of geometric measures of deviation that do not take into account the underlying
mutational processes. In order to integrate structural information into sequence-based evolutionary
models, we recently developed a stochastic model of structural evolution on a phylogenetic tree and
implemented this as the StructAlign plugin for the StatAlign statistical alignment package.
In this chapter, we will outline the types of analyses that can be carried out using StructAlign, illustrating
how the inclusion of structural information can be used to inform joint estimation of alignments and trees.
StructAlign can also be used to infer branch-specific rates of structural evolution, and analysis of an example
globin dataset highlights strong variation in the inferred rate across the tree. While structure is more highly
conserved within clades, the rate of structural divergence as a function of sequence variation is larger
between functionally divergent proteins. Allowing for the rate of structural divergence to vary over the tree
results in an improved fit to the empirically observed pairwise RMSD values.
Key words Protein structure, Structural alignment, RMSD, Statistical alignment, Alignment uncer-
tainty, Bayesian hierarchical models, MCMC, Parallel tempering, Molecular phylogenetics, Globins
1 Introduction
The original version of this chapter was revised. The correction to this chapter is available at https://doi.org/
10.1007/978-1-4939-8736-8_23
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_10, © Springer Science+Business Media, LLC, part of Springer Nature 2019
184 Joseph L. Herman
2 Materials
2.1 Running StatAlign
StatAlign is written in Java and is run via a JAR file, which can either
be obtained from the pre-compiled distribution or built from
source. The pre-compiled package can be obtained from GitHub at
https://github.com/statalign/statalign/releases/download/v3.3/StatAlign-v3.3.zip.
Source code can also be downloaded and compiled from the
GitHub repository if desired.
The graphical version of StatAlign can be run on Windows,
Mac, and Linux by double-clicking on the JAR archive (see Note 1).
Instructions for using this GUI version can be found on the StatAlign
website at http://statalign.github.io/doc/user_manual.html.
For longer analyses that run on multiple cores, we need to make
use of the command-line version of StatAlign. In this chapter, we
present the commands required to run the command-line version
under Unix-based systems; Windows users can run these commands
in a Unix-like environment such as Cygwin.
The single-threaded version can be invoked from the JAR via
2.2 Example Dataset
In this chapter we will investigate the phylogenetic relationship
between cytoglobin [43, 44] and a set of nine other globin struc-
tures (Table 1). The functional role of cytoglobin is currently
unknown, and there has been recent interest in determining its
relationship to other globins [45–47].
For input to StatAlign, each PDB file should contain a single
chain to be analyzed (see Note 2). To construct the example
Table 1
Protein structures used in example dataset, with heme coordination number and exogenous ligand
shown
dataset, the A-chain was extracted from the PDB file corresponding
to each structure in Table 1; for 2hhbB, the B-chain was used.
The PDB files corresponding to this dataset can be found in the
examples/10_globins subdirectory distributed with StatAlign,
along with a FASTA-formatted file 10_globins.fasta contain-
ing only the primary sequences.
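The chain-extraction step described above can be sketched in R as follows. This is a minimal illustration that assumes the standard fixed-column PDB layout (record name in columns 1 to 6, chain identifier in column 22); the helper names `make_ca_record` and `extract_chain` and the toy records are hypothetical and are not part of StatAlign. Only ATOM records are kept, so hetero atoms such as the heme group are dropped, which may differ from how the distributed example files were prepared.

```r
# Hypothetical helper: build a minimal C-alpha ATOM record
# in the fixed-column PDB layout (toy data only)
make_ca_record <- function(serial, res, chain, resseq) {
  sprintf("ATOM  %5d  CA  %3s %s%4d", serial, res, chain, resseq)
}

# Hypothetical helper: keep only the ATOM records belonging to one chain
extract_chain <- function(pdb.lines, chain = "A") {
  is.atom  <- substr(pdb.lines, 1, 6) == "ATOM  "
  is.chain <- substr(pdb.lines, 22, 22) == chain
  pdb.lines[is.atom & is.chain]
}

# Toy input: a header line plus records from chains A and B
toy <- c("HEADER    TOY EXAMPLE",
         make_ca_record(1:4, c("VAL", "LEU", "VAL", "HIS"),
                        c("A", "A", "B", "B"), c(1, 2, 1, 2)))
chain.a <- extract_chain(toy, "A")
print(chain.a)

# For a real file one would read and write lines, e.g.
# writeLines(extract_chain(readLines("input.pdb"), "A"), "chainA.pdb")
```

The input and output file names in the final comment are placeholders.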
2.3 Analysis of StatAlign Output
We will make use of R in order to analyze the output files generated
by StatAlign, and code is provided for each analysis step in this
chapter. Unless otherwise specified, the example commands are to
be run from within the directory where StatAlign is installed,
with the path to this directory saved as a STATALIGN_HOME variable
in the shell and R environments. Several R packages and some
additional scripts will also be required; once installed, the requisite
packages can be loaded using the code below:
packages = c('dplyr','coda','magrittr','ape','ggplot2',
'reshape2','data.table','gridExtra')
for (package in packages) require(package,character.only=TRUE)
3 Methods
3.1 Running StatAlign in Sequence-Only Mode
As a preliminary analysis, we will first analyze the globin dataset
using the original sequence-only model in StatAlign; later we will
go on to assess the effect of including structural information.
Enhancing Alignment and Tree Inference with Protein Structures 187
Table 2
Output files created by running StatAlign on the example dataset
StatAlign can be invoked via the command below (see Note 3).
Note that the backslashes split the command over multiple lines;
it can equally be run on a single line:
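# path prefix shared by the StatAlign output files for this dataset
base.name = paste0(STATALIGN_HOME,
"/examples/10_globins/10_globins.fasta")
# log likelihood trace
log.likelihood = paste0(base.name,".ll") %>%
fread %>% select(ll.all)
# alignment length trace
ali.length = paste0(base.name,".length") %>%
fread %>% select(ali.length)
# trace plots
plot(mcmc(cbind(log.likelihood,ali.length)))
# autocorrelation plots, side by side
layout(t(1:2))
acf(log.likelihood)
acf(ali.length)
# effective sample size for the alignment length
effectiveSize(ali.length)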
## ali.length
## 108.9655
Fig. 1 Trace plots for the log likelihood and alignment length, illustrating some noticeable autocorrelation
Fig. 2 Autocorrelation plots for the log likelihood and alignment length, indicating
poor mixing
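The link between the autocorrelation plots and the effective sample size can be illustrated with a simplified estimator, ESS = N / (1 + 2 Σ ρ_k), where the sum over lag-k autocorrelations is truncated at the first non-positive value. This is a sketch for intuition only: `ess_acf` is a hypothetical helper, both series are simulated, and coda's `effectiveSize` uses a different (spectral) estimator, so the numbers will not match it exactly.

```r
# Simplified ESS: N / (1 + 2 * sum of autocorrelations),
# truncated at the first non-positive autocorrelation
ess_acf <- function(x, lag.max = 200) {
  rho <- acf(x, lag.max = lag.max, plot = FALSE)$acf[-1]  # drop lag 0
  cut <- which(rho <= 0)[1]
  if (!is.na(cut)) rho <- rho[seq_len(cut - 1)]
  length(x) / (1 + 2 * sum(rho))
}

set.seed(1)
n <- 5000
iid <- rnorm(n)                                    # well-mixed trace
ar1 <- as.numeric(arima.sim(list(ar = 0.95), n))   # strongly autocorrelated trace
ess <- c(iid = ess_acf(iid), ar1 = ess_acf(ar1))
print(ess)
```

A slowly decaying autocorrelation function therefore translates directly into a small effective sample size, even when the raw number of samples is large.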
3.3 Using Parallel Tempering to Improve Mixing
Slow mixing between different regions of the posterior is a common
issue when sampling high-dimensional hierarchical models
involving discrete parameters such as sequence alignments and
tree topologies. Although StatAlign employs a number of advanced
proposal distributions to efficiently explore the parameter space,
transitions between modes of similar posterior density may still
require traversal of configurations that are highly unfavorable,
making these transitions very infrequent.
One way to address this is to use a multiple-chain MCMC
sampler where each chain has an associated heat parameter, t, used
to flatten out the posterior surface in order to increase the frequency
of transitions between configurations [48]. Under this scheme, the
chain with t = 1 (the “cold chain”) generates samples from the true
posterior, and the heated chains (with t > 1) sample from flattened
versions thereof, which can be traversed more easily. Heats are
swapped between chains, with each proposed swap accepted or
rejected according to the appropriate Metropolis-Hastings ratio, so
that the stationary distribution of the cold chain is maintained.
When several chains are run in parallel at different heats, parallel
tempering can be implemented efficiently by swapping temperatures
between specified chains according to a random sequence that
is shared between all chains [49]. Empirically, we have found linear
spacing between adjacent inverse temperature values to be effective,
with the default step size set to 0.01 in StatAlign v3.3 (this can
be modified via the -tempDiff argument to StatAlign). As well as
exchanging temperature parameters, StatAlign also swaps the
parameters that determine the variance of the MCMC proposal
distributions, ensuring that optimal acceptance ratios for individual
moves are maintained.
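The swap move can be illustrated with a toy sampler. The sketch below is not StatAlign's implementation: it runs four chains on a made-up bimodal one-dimensional target, uses a much coarser inverse-temperature ladder than the StatAlign default so that the effect is visible, and proposes swaps between randomly chosen pairs of chains. All names (`log.target`, `beta`, `cold`) are illustrative.

```r
# Toy parallel tempering on a bimodal 1-D target (illustration only)
log.target <- function(x) log(0.5 * dnorm(x, -4, 0.5) + 0.5 * dnorm(x, 4, 0.5))

set.seed(42)
n.chains <- 4
beta <- 1 - 0.3 * (0:(n.chains - 1))   # linear ladder of inverse temperatures
x <- rep(-4, n.chains)                 # all chains start in the left-hand mode
n.iter <- 20000
cold <- numeric(n.iter)

for (it in seq_len(n.iter)) {
  # Within-chain Metropolis update; chain k targets pi(x)^beta[k]
  for (k in seq_len(n.chains)) {
    prop <- x[k] + rnorm(1, 0, 1)
    if (log(runif(1)) < beta[k] * (log.target(prop) - log.target(x[k]))) {
      x[k] <- prop
    }
  }
  # Propose swapping the heats of a random pair of chains
  ij <- sample(n.chains, 2)
  i <- ij[1]; j <- ij[2]
  log.ratio <- (beta[i] - beta[j]) * (log.target(x[j]) - log.target(x[i]))
  if (log(runif(1)) < log.ratio) beta[c(i, j)] <- beta[c(j, i)]
  # Record the state of whichever chain currently holds beta = 1
  cold[it] <- x[which.max(beta)]
}
frac.right <- mean(cold > 0)   # fraction of cold samples in the right-hand mode
print(frac.right)
```

Because heats rather than states are exchanged, the cold-chain trace is assembled from whichever chain holds beta = 1 at each iteration. A single unheated Metropolis chain started at -4 with the same proposal would be expected to stay in the left-hand mode for a very long time.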
3.4 Running the Parallel Version of StatAlign
Before running the parallel version on the test dataset, we will first
copy the input files to a new directory, so that the new output does
not overwrite the existing output files:
mkdir -p mpi_output
cp examples/10_globins/10_globins.fasta mpi_output
3.5 Analyzing Parallel Output
When running in parallel mode, each chain generates a separate set
of output files, indicated by chainX in the filename (see Note 4).
The .coreModel.params files now contain the inverse tempera-
ture parameter (beta) as the second column. We will first extract
this information in order to aggregate the samples based on the
chain temperature:
n.chains = 8
chains = 0:(n.chains-1)
base.name = paste0(STATALIGN_HOME,
"/mpi_output/10_globins.fasta")
# read in core model (TKF92) MCMC output file
coreModel.list = lapply(chains,function(x)
paste0(base.name,".chain",x,".coreModel.params") %>% fread
)
coreModel = do.call(rbind,coreModel.list)
# extract the values of the beta (inverse temperature)
# parameter
beta.values = as.character(sort(unique(coreModel$beta)))
log.likelihood =
Map(function(i)
log.likelihood %>%
filter(coreModel$beta==i) %>%
arrange(sample) %>%
select(ll.all),
beta.values)
ali.length =
Map(function(i)
ali.length %>%
filter(coreModel$beta==i) %>%
arrange(sample) %>%
select(ali.length),
beta.values)
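The aggregation step can be illustrated on a toy example. Because heats are swapped during the run, each chain's output file contains samples taken at several inverse temperatures; pooling the chains and partitioning by beta reassembles one trace per temperature. The data frames below are made-up stand-ins for the real output files, and base R is used in place of the dplyr pipeline above.

```r
# Toy stand-ins for the output of two chains (values are made up)
chain0 <- data.frame(sample = 1:4,
                     beta   = c(1.00, 0.98, 0.98, 1.00),
                     ll.all = c(-10, -14, -13, -11))
chain1 <- data.frame(sample = 1:4,
                     beta   = c(0.98, 1.00, 1.00, 0.98),
                     ll.all = c(-15, -9, -12, -16))
pooled <- rbind(chain0, chain1)

# Partition the pooled samples by inverse temperature,
# ordering each partition by sample index
by.beta <- lapply(split(pooled, pooled$beta),
                  function(d) d[order(d$sample), "ll.all"])

# Cold-chain trace reassembled across the two chains
print(by.beta[["1"]])
```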
Table 3
Frequency of inverse temperature parameters sampled in each MCMC chain
Beta Chain 1 Chain 2 Chain 3 Chain 4 Chain 5 Chain 6 Chain 7 Chain 8
0.86 3200 2686 3177 2748 3472 3578 3477 2662
0.88 3171 2982 3252 2901 3216 3350 3381 2747
0.90 3033 3102 3395 3021 3223 3141 3222 2863
0.92 3137 3319 3311 3280 3100 3111 2932 2810
0.94 3227 3161 3119 3209 3041 2973 3051 3219
0.96 3153 3239 3057 3218 2978 2962 2967 3426
0.98 3126 3153 2901 3294 3036 2979 2979 3532
1.00 2953 3358 2788 3329 2934 2906 2991 3741
Fig. 4 Autocorrelation plots for the log likelihood and alignment length, sampled
using parallel tempering
do.call(cbind,ali.length) %>%
boxplot(outline = FALSE,
names = beta.values,
xlab = expression("Inverse temperature ("*beta*")"),
ylab = "Alignment length",
cex.axis = 0.8)
layout(t(1:2))
acf(log.likelihood$`1`)
acf(ali.length$`1`)
effectiveSize(mcmc(ali.length$`1`))
## ali.length
## 970.4661
## R Lambda Theta
## 2588.848 5630.171 12228.428
3.6 Running with Different Random Seeds to Assess Convergence
In addition to basic checks on the trace plots and autocorrelation
functions of individual parameters, a common approach for assessing
convergence is to run multiple MCMC samplers with different
random seeds or starting configurations. This can be accomplished
by rerunning StatAlign using a different value for the -seed argu-
ment, storing the output into a separate directory for each run. If
each of these samplers ends up sampling from the same posterior
distribution, it is a good indication that they have converged.
Agreement between the independent runs can be quantified more
rigorously via the Gelman-Rubin potential scale reduction factor
[51]. We will return to this when analyzing the output of
StructAlign.
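For intuition, the basic form of the potential scale reduction factor can be computed directly from the within-run and between-run variances. The sketch below uses simulated draws and a hypothetical helper `psrf_basic`; it omits the sampling-variability corrections applied by coda's `gelman.diag`, so its values will differ slightly from those reported by coda.

```r
# Basic potential scale reduction factor for one parameter.
# chains: numeric matrix with one column per independent run
psrf_basic <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))    # mean within-run variance
  B <- n * var(colMeans(chains))      # between-run variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(7)
# Four runs sampling the same distribution
agree <- matrix(rnorm(4000), ncol = 4)
# Same draws, but the fourth run is shifted to a different region
disagree <- agree + rep(c(0, 0, 0, 3), each = nrow(agree))

psrf <- c(agree = psrf_basic(agree), disagree = psrf_basic(disagree))
print(psrf)
```

Values close to 1 indicate that the runs are indistinguishable with respect to this parameter, whereas a run stuck in a different region inflates the between-run variance and pushes the factor well above 1.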
3.7 Consensus Trees
The .tree files generated by StatAlign contain samples from the
posterior distribution over phylogenetic trees. To summarize this
distribution, a majority consensus tree can be generated by using
StatAlign’s ConsensusTree plugin, which can be called from the
command line on the MCMC output via the following command
(see Note 5):
Fig. 5 Consensus tree for trees sampled under the sequence-only model
root(consensus,"2oif") %>%
plot(show.node.label = TRUE,
use.edge.length = TRUE,
edge.color = "grey",
cex = 1.2,
show.tip.label = TRUE,
edge.width = 4,
no.margin = TRUE)
add.scale.bar(lwd=4,lcol="grey")
3.8 Including Protein Structures
In order to better resolve the relationships between the different
clades, we can utilize the StructAlign plugin for StatAlign, which
models structural divergence using a continuous-time stochastic
process on C-alpha coordinates, combined with a Markov model
of sequence evolution [41]. The rate of structural divergence along
each branch is modeled via a diffusivity parameter, σ. To account for
non-evolutionary sources of structural variability (e.g., due to con-
formational flexibility, differences in experimental conditions, or
technical noise), each residue has an intrinsic baseline variability
parameterized based on the crystallographic B-factors (see Note 6).
StructAlign can read protein structure coordinates directly
from PDB files and can be run on the example globin dataset
using the following command (see Note 7):
Table 4
Additional output files generated by StructAlign
Fig. 6 Comparison of posterior distributions for alignment length and TKF92 model parameters under the
sequence-only and sequence + structure models
effectiveSize(mcmc(tkf.struc))
## R Lambda Theta
## 1528.826 3241.599 11782.182
comparison =
rbind(
cbind(`Alignment length`=ali.length.seq$`1`$ali.length,
tkf.seq,
Model=rep("seq",nrow(tkf.seq))
),
cbind(`Alignment length`=ali.length.struc$`1`$ali.length,
tkf.struc,
Model=rep("seq+struc",nrow(tkf.struc))
)
)
df.m = melt(comparison, id.var="Model")
ggplot(df.m, aes(x=variable, y=value)) +
geom_boxplot(aes(fill=Model)) +
facet_wrap( ~ variable, scales="free")
3.9 Effect of Structural Information on Sequence Alignments
We can visualize the change in distribution over alignments in more
detail by computing a summary alignment annotated with associated
posterior probabilities for each column. To do so, we will utilize the
program WeaveAlign, which is distributed along with StatAlign.
WeaveAlign takes as input a file or files containing multiple alignments for
the same set of sequences and computes the summary alignment that
maximizes the expected accuracy under a chosen scoring scheme [7].
When running StatAlign in parallel mode, the alignment
samples generated by each chain must be combined into a single
file before running WeaveAlign. This can be achieved using the
combine_logfiles.pl script distributed with StatAlign, run as
shown below (with the environment variables set to the appropriate
values):
scripts/combine_logfiles.pl $SEQ_DIR/$FASTA.chain{0..7}.log
scripts/combine_logfiles.pl $STRUC_DIR/$PDB.chain{0..7}.log
Fig. 7 Summary alignment under sequence-only model (left) and structural model (right), annotated with
posterior probabilities for each column (blue lines)
Fig. 8 Zoomed-in view of two regions of the summary alignment corresponding to Ala128-Glu135 and Val82-
Ala87 in the sequence for 1bin (left and right panels, respectively), aligned under the sequence-only model
(left within each panel) and structural model (right within each panel)
Fig. 9 Aligned regions of helix H for leghemoglobin structure 1bin (green) and myoglobin structure 1myt (blue),
with the corresponding aligned regions shown in red using the sequence-only model (center), and structural
model (right). Figures generated using VMD [52]
Fig. 10 Maximum likelihood superposition for the ten globin structures in the
example dataset, oriented with the E-helix descending from top right to bottom
left, and the F-helix ascending vertically (left), and a view of the EF-loop section
of 2oif, including heme and cyanide ligand (right). Highlighted in red and green
on the left panel are sections of the structures 1bin and 2oif, illustrating the
large deformation that occurs in the latter at the start of the F-helix, which may
stabilize the ligand-bound conformation. Figures generated using VMD [52]
(see Note 8). As shown in Fig. 10, the structure 2oif exhibits a large
deformation at the start of the F-helix, with a number of side chains
coming into contact with residues within the E-helix, for example,
Arg94 with Glu80. As discussed by Hoy et al. [53], this deforma-
tion may help to stabilize the conformation in which the exogenous
ligand displaces His70.
3.10 Posterior Distribution of Structural Model Parameters
As mentioned in the previous section, the large structural displacement
in the EF-loop region of 2oif relative to the other structures
causes StructAlign to treat Thr92-Thr97 as an indel. To understand
how the model makes this decision and how this affects parameter
inference, we can examine the individual parameters of the struc-
tural model in more detail.
The structural parameters can be read in from the
.struct.params output files in a manner similar to the TKF92
model parameters. Here we will illustrate reading in parameters
from four independent runs conducted with different random
number seeds, each executed in its own 'run_x' subdirectory,
in order to assess the consistency of the parameter estimates
across runs:
core =
lapply(chains,function(x)
fread(paste0(base,".chain",x,".coreModel.params"))
) %>%
do.call(rbind,.)
struct.params =
lapply(chains,function(x)
fread(paste0(base,".chain",x,".struct.params"))
) %>%
do.call(rbind,.) %>%
filter(core$beta==1) %>%
select(c(tau,eps,s2_g,nu))
return(struct.params)
}
)
comparison =
lapply(run,
function(r) cbind(struct.list[[r]],Seed=r)
) %>%
do.call(rbind,.)
comparison$Seed = factor(comparison$Seed)
df.m = melt(comparison, id.var="Seed")
ggplot(df.m, aes(x=variable, y=value)) +
geom_boxplot(aes(fill=Seed)) +
facet_wrap( ~ variable, scales="free")
Fig. 11 Comparison of posterior distributions for structural model parameters with four different starting seeds
Table 5
Gelman-Rubin potential scale reduction factors for the structural model
parameters
gelman = gelman.diag(mcmc.list(lapply(struct.list,mcmc)))$psrf
## lower upper
## tau 14.66535 16.48737
## attr(,"Probability")
## [1] 0.95
The top end of this range includes the value of 16.4 ± 0.2
reported by Lobanov et al. [54] for all-alpha proteins of comparable
size taken from SCOP.
The ε parameter acts as a multiplier on the baseline variance
associated with each alignment column (estimated via squared
normalized B-factors), with $B_i \sqrt{3\varepsilon}$ yielding the
expected standard deviation for site i arising from non-phylogenetic
sources of structural variability (including uncertainty in the
structural superposition).
We can examine the correlation between these predicted values and
the observed per-site RMSD via three of the additional files generated
via the printRmsd option to StructAlign, i.e., the .mle.fasta
alignment file and the .mle.rmsd and .mle.bfactors files that
contain the RMSD and B-factor-based predictions, respectively.
When running in parallel mode, each chain again generates its own
version of these files; we will select the chain with the highest likeli-
hood MLE (in our example case, chain 7) for further analysis. Wea-
veAlign can again be used to generate an image with these
annotations plotted above the alignment. As shown in Fig. 12, the
correlation between predicted and observed values is very high, with
higher structural variability occurring mostly in the areas of high
alignment uncertainty in the loop regions (see Note 9):
Fig. 12 Maximum likelihood alignment, annotated with predicted (green) and observed (red) per-site RMSD
Fig. 13 Consensus tree for trees sampled under the sequence+structure model, with branches scaled
according to branch length (left) and structural diffusivity (right)
consensus.seq = read.tree(paste0(SEQ_DIR,"/10_globins.fasta.ctree"))
consensus.struc = read.tree(paste0(STRUC_DIR,"/1bina.pdb.ctree"))
n = length(consensus.struc$tip.label)
# map ordering of sequences in consensus.seq
# to that of consensus.struc
map = pmatch(consensus.struc$tip.label,consensus.seq$tip.label)
grid.arrange(p1,p2,ncol=2)
Fig. 14 Left: branch lengths for branches leading to tips of the tree. Right: Pairwise tree distances
(in substitutions per site) between each pair of structures for the consensus tree computed under the
sequence-only and sequence + structure models
base = paste0(STRUC_DIR,"/run_1/",PDB)
core = lapply(chains, function(x)
fread(paste0(base,".chain",x,".coreModel.params"))
) %>% do.call(rbind,.)
rmsd =
lapply(chains,
function(x) fread(paste0(base,".chain",x,".msd"))
) %>%
do.call(rbind,.) %>%
filter(core$beta==1) %>% colMeans %>% sqrt
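Note the order of operations in the last line of the code above: the mean squared deviation is averaged over the cold-chain samples first, and the square root is taken afterwards. The toy sketch below, using made-up gamma-distributed MSD values and hypothetical pair labels, shows that this is not the same as averaging per-sample RMSD values; by Jensen's inequality the former is never smaller.

```r
set.seed(3)
# Made-up MSD samples: rows = MCMC samples, columns = structure pairs
msd <- matrix(rgamma(300, shape = 20, rate = 10), ncol = 3,
              dimnames = list(NULL, c("A_B", "A_C", "B_C")))

rmsd.of.mean <- sqrt(colMeans(msd))   # average the MSD, then take the root
mean.of.rmsd <- colMeans(sqrt(msd))   # take the root per sample, then average

print(rbind(rmsd.of.mean, mean.of.rmsd))
```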
The distance between each leaf on the tree can then be com-
puted using the cophenetic function from ape:
d1 = cophenetic(consensus)
We can now plot the two tree distances versus RMSD for each
pair of proteins:
p1 = ggplot(df,aes(x=dist,y=rmsd)) +
geom_point() +
xlab("Substitutions per site") +
p2 = ggplot(df,aes(x=diffusion,y=rmsd)) +
geom_point() +
xlab("Weighted tree distance") +
ylab(expression(paste("Pairwise RMSD / ",ring(A)))) +
# relationship implied by StructAlign model
geom_abline(intercept=baseline.sd,slope=1)
grid.arrange(p1,p2,ncol=2)
3.15 Distinguishing Structural Drift from Conformational Change
In the case of 2oif, much of the local structural deviation observed
in the EF-loop region can be attributed to the effect of binding the
exogenous cyanide ligand [53]. In contrast, a rice globin with very
similar sequence was crystallized in the hexacoordinate form [57]
and does not display this large deviation at the start of the F-helix.
Fig. 15 Structural deviation as a function of evolutionary distance computed using the tree distance, before
and after weighting by the branch-specific structural diffusivity parameters (left and right, respectively). The
line on the left plot shows the linear relationship inferred by Illergård et al. [29] for the globins; the line on the
right shows y = x + 0.25
4 Notes
References
1. Godzik A (1996) The structural alignment between two proteins: is there a unique answer? Protein Sci 5:1325–1338
2. Sela I, Ashkenazy H, Katoh K, Pupko T (2015) GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters. Nucleic Acids Res 43:W7–W14
3. Morrison DA, Ellis JT (1997) Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of apicomplexa. Mol Biol Evol 14:428–441
4. Ogden TH, Rosenberg MS (2006) Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol 55:314–328
5. Wong KM, Suchard MA, Huelsenbeck JP (2008) Alignment uncertainty and genomic analysis. Science 319:473–476
6. Lunter G, Rocco A, Mimouni N, Heger A, Caldeira A, Hein J (2008) Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res 18:298–309
7. Herman JL, Novák Á, Lyngsø R, Szabó A, Miklós I, Hein J (2015) Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. BMC Bioinformatics 16:108
8. Nelesen S, Liu K, Zhao D, Linder CR, Warnow T (2008) The effect of the guide tree on multiple sequence alignments and subsequent phylogenetic analyses. In: Proceedings of the 2008 Pacific Symposium on Biocomputing. World Scientific. p 25–36
9. Lunter G, Drummond AJ, Miklós I, Hein J (2005) Statistical alignment: recent progress, new applications, and challenges. In: Statistical Methods in Molecular Evolution. Statistics for Biology and Health. Springer, New York, NY
10. Redelings BD, Suchard MA (2005) Joint Bayesian estimation of alignment and phylogeny. Syst Biol 54:401–418
11. Westesson O, Lunter G, Paten B, Holmes I (2012) Accurate reconstruction of insertion-deletion histories by statistical phylogenetics. PLoS One 7:e34572
12. Holmes IH (2017) Historian: accurate reconstruction of ancestral sequences and evolutionary rates. Bioinformatics 33:1227–1229
13. Redelings BD (2014) Erasing errors due to alignment ambiguity when estimating positive selection. Mol Biol Evol 31:1979–1993
14. Satija R, Pachter L, Hein J (2008) Combining statistical alignment and phylogenetic footprinting to detect regulatory elements. Bioinformatics 24:1236–1242
15. Satija R, Novák Á, Miklós I, Lyngsø R, Hein J (2009) BigFoot: Bayesian alignment and phylogenetic footprinting with MCMC. BMC Evol Biol 9:217
16. Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M, Wörheide G, Baurain D (2011) Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol 9:e1000602
17. Kumar S, Filipski AJ, Battistuzzi FU, Kosakovsky Pond SL, Tamura K (2012) Statistics and truth in phylogenomics. Mol Biol Evol 29:457–472
18. Talavera G, Castresana J (2007) Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol 56:564–577
19. Wu M, Chatterji S, Eisen JA (2012) Accounting for alignment uncertainty in phylogenomics. PLoS One 7:e30288
20. Gatesy J, DeSalle R, Wheeler W (1993) Alignment-ambiguous nucleotide sites and the exclusion of systematic data. Mol Phylogenet Evol 2:152–157
21. Lee MS (2001) Unalignable sequences and molecular evolution. Trends Ecol Evol 16:681–685
22. Löytynoja A, Goldman N (2008) Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320:1632–1635
23. Hasegawa H, Holm L (2009) Advances and pitfalls of protein structural alignment. Curr Opin Struct Biol 19:341–348
24. Johnson MS, Šali A, Blundell TL (1990) Phylogenetic relationships from three-dimensional protein structures. Methods Enzymol 183:670–690
25. Bujnicki JM (2000) Phylogeny of the restriction endonuclease-like superfamily inferred from comparison of protein structures. J Mol Evol 50:39–44
26. Lundin D, Poole AM, Sjöberg B-M, Högbom M (2012) Use of structural phylogenetic networks for classification of the ferritin-like superfamily. J Biol Chem 287:20565–20575
27. Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5:823
28. Panchenko AR, Wolf YI, Panchenko LA, Madej T (2005) Evolutionary plasticity of protein families: coupling between sequence and structure variation. Proteins 61:535–544
29. Illergård K, Ardell DH, Elofsson A (2009) Structure is three to ten times more conserved than sequence: a study of structural response in protein cores. Proteins 77:499–508
30. Echave J, Spielman SJ, Wilke CO (2016) Causes of evolutionary rate variation among protein sites. Nat Rev Genet 17:109–121
31. Worth CL, Gong S, Blundell TL (2009) Structural and functional constraints in the evolution of protein families. Nat Rev Mol Cell Biol 10:709–720
32. Gilson AI, Marshall-Christensen A, Choi J-M, Shakhnovich EI (2017) The role of evolutionary selection in the dynamics of protein structure evolution. Biophys J 112:1350–1365
33. Choi SC, Hobolth A, Robinson DM, Kishino H, Thorne JL (2007) Quantifying the impact of protein tertiary structure on molecular evolution. Mol Biol Evol 24:1769–1782
34. Kleinman CL, Rodrigue N, Lartillot N, Philippe H (2010) Statistical potentials for improved structurally constrained evolutionary models. Mol Biol Evol 27:1546–1560
35. Rodrigue N, Philippe H, Lartillot N (2006) Assessing site-interdependent phylogenetic models of sequence evolution. Mol Biol Evol 23:1762–1775
36. Sadowski M, Taylor W (2010) On the evolutionary origins of “fold space continuity”: a study of topological convergence and divergence in mixed alpha-beta domains. J Struct Biol 172:244–252
37. Rackovsky S (2015) Nonlinearities in protein space limit the utility of informatics in protein biophysics. Proteins 83:1923–1928
38. Sadreyev RI, Kim B-H, Grishin NV (2009) Discrete–continuous duality of protein structure space. Curr Opin Struct Biol 19:321–328
39. Holzgräfe C, Wallin S (2014) Smooth functional transition along a mutational pathway with an abrupt protein fold switch. Biophys J 107:1217–1225
40. Challis CJ, Schmidler SC (2012) A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol 29:3575–3587
41. Herman JL, Challis CJ, Novák Á, Hein J, Schmidler SC (2014) Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol 31:2251–2266
42. Novák Á, Miklós I, Lyngsø R, Hein J (2008) StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics 24:2403–2404
43. Burmester T, Ebner B, Weich B, Hankeln T (2002) Cytoglobin: a novel globin type ubiquitously expressed in vertebrate tissues. Mol Biol Evol 19:416–421
44. de Sanctis D, Dewilde S, Pesce A, Moens L, Ascenzi P, Hankeln T, Burmester T, Bolognesi M (2004) Crystal structure of cytoglobin: the fourth globin type discovered in man displays heme hexa-coordination. J Mol Biol 336:917–927
45. Hoffmann FG, Opazo JC, Storz JF (2010) Gene cooption and convergent evolution of oxygen transport hemoglobins in jawed and jawless vertebrates. Proc Natl Acad Sci U S A 107:14274–14279
46. Hoffmann FG, Opazo JC, Storz JF (2011) Differential loss and retention of cytoglobin, myoglobin, and globin-e during the radiation of vertebrates. Genome Biol Evol 3:588–600
47. Hoffmann FG, Opazo JC, Hoogewijs D, Hankeln T, Ebner B, Vinogradov SN, Bailly X, Storz JF (2012) Evolution of the globin gene family in deuterostomes: lineage-specific patterns of diversification and attrition. Mol Biol Evol 29:1735–1745
48. Geyer C (2011) Importance sampling, simulated tempering, and umbrella sampling. In: Brooks S, Gelman A, Jones G, Meng X (eds) Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC, Boca Raton, pp 295–311
49. Altekar G, Dwarkadas S, Huelsenbeck JP, Ronquist F (2004) Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 20:407–415
50. Thorne JL, Kishino H, Felsenstein J (1992) Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 34:3–16
51. Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Stat Sci 7:457–472
52. Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14:33–38
53. Hoy JA, Robinson H, Trent JT, Kakar S, Smagghe BJ, Hargrove MS (2007) Plant hemoglobins: a molecular fossil record for the evolution of oxygen transport. J Mol Biol 371:168–179
54. Lobanov M, Bogatyreva N, Galzitskaia O (2008) Radius of gyration is indicator of compactness of protein structure. Mol Biol 42:701–706
55. Christensen AB, Herman JL, Elphick MR, Kober KM, Janies D, Linchangco G, Semmens DC, Bailly X, Vinogradov SN, Hoogewijs D (2015) Phylogeny of echinoderm hemoglobins. PLoS One 10:e0129668
56. Gupta KJ, Hebelstrup KH, Mur LA, Igamberdiev AU (2011) Plant hemoglobins: important players at the crossroads between oxygen and nitric oxide. FEBS Lett 585:3843–3849
57. Hargrove MS, Brucker EA, Stec B, Sarath G, Arredondo-Peter R, Klucas RV, Olson JS, Phillips GN (2000) Crystal structure of a nonsymbiotic plant hemoglobin. Structure 8:1005–1014
58. Sharir-Ivry A, Xia Y (2017) The impact of native state switching on protein sequence evolution. Mol Biol Evol 34:1378–1390
59. Maadooliat M, Zhou L, Najibi SM, Gao X, Huang JZ (2016) Collective estimation of multiple bivariate density functions with application to angular-sampling-based protein loop modeling. J Am Stat Assoc 111:43–56
60. Golden M, García-Portugués E, Sørensen M, Mardia KV, Hamelryck T, Hein J (2017) A generative angular model of protein structure evolution. Mol Biol Evol 34:2085–2100
Chapter 11
Abstract
Phylogenetic inference from protein data is traditionally based on empirical substitution models of evolu-
tion that assume that protein sites evolve independently of each other and under the same substitution
process. However, it is well known that the structural properties of a protein site in the native state affect its
evolution, in particular the sequence entropy and the substitution rate. Starting from the seminal proposal
by Halpern and Bruno, where structural properties are incorporated in the evolutionary model through
site-specific amino acid frequencies, several models have been developed to tackle the influence of protein
structure on sequence evolution. Here we describe stability-constrained substitution (SCS) models that
explicitly consider the stability of the native state against both unfolded and misfolded states. One of them,
the mean-field model, provides an independent sites approximation that can be readily incorporated in
maximum likelihood methods of phylogenetic inference, including ancestral sequence reconstruction.
Next, we describe its validation with simulated and real proteins and its limitations and advantages with
respect to empirical models that lack site specificity. We finally provide guidelines and recommendations to
analyze protein data accounting for stability constraints, including computer simulations and inferences of
protein evolution based on maximum likelihood. Some practical examples are included to illustrate these
procedures.
Key words Stability-constrained substitution models, Mean-field substitution model, Protein folding
stability, Protein evolution, Ancestral protein reconstruction
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_11, © Springer Science+Business Media, LLC, part of Springer Nature 2019
216 Ugo Bastolla and Miguel Arenas
methods such as Monte Carlo [23, 24] that have inherent limita-
tions in computer efficiency and may get trapped in local maxima.
An alternative is to derive a model with independent sites that
effectively enforces stability constraints, in the spirit of mean-field
models from physics. Of course, the resulting model will be less
realistic than a model with dependencies between sites but still may
represent real data better than empirical substitution models that
neglect stability constraints. One of these models is the mean-field
model (MF) [25, 26].
In the present chapter we describe models of protein evolution
with explicit stability constraints and models that effectively incor-
porate these constraints into site-independent substitution matri-
ces, highlighting their implementation in phylogenetic frameworks.
We also present some applications of these frameworks in the
simulation and evolutionary analysis of diverse protein data. Finally,
we provide guidelines and recommendations for using the presented
frameworks.
2.1 Modeling Protein Evolution with Stability Constraints
The thermodynamic model adopted in the simulator of protein
sequence evolution ProteinEvolver [29] estimates the stability of
the native state not only against the unfolded state but also against
compact, wrongly folded conformations (misfolded states) that are
usually neglected in other models of protein stability. The
characteristics of protein sequences that weaken the stability of
frequently formed misfolded conformations are referred to as negative
design. Its importance is recognized through statistical analysis of
protein sequences [29], and it has been proposed to have important
evolutionary consequences (for a review see [15, 30]).
The stability of the native state is estimated from the contact
matrix representation of one native structure in the Protein Data
Bank (PDB), $C_{ij} = 1$ if any two atoms in residues $i$ and $j$ are
closer than 4.5 Å and 0 otherwise:
\[ G_{\mathrm{nat}}(C^{\mathrm{nat}}, A) = \sum_{ij} C_{ij}^{\mathrm{nat}} \, U(A_i, A_j) \tag{1} \]
where $A_i$ is the amino acid at site $i$ (for instance leucine),
$C^{\mathrm{nat}}$ is the native contact matrix, and $U(a, b)$ are the
210 contact interaction parameters derived in [31]. Contacts with
$|i - j| < 4$ are not
where $U_{ij} = U(A_i, A_j)$, $\langle C_{ij} \rangle$ represents the
frequency of contacts between residues at sequence distance $|i - j|$
in compact structures of $L$ residues, and $\langle C_{ij} C_{kl} \rangle$
represents contact correlations, which are precomputed from a
representative subset of the PDB.
The program DeltaGREM computes stabilities ΔG for sequence-
structure pairs in the PDB and a list of user-supplied mutations or
for multiple sequence alignments that include the PDB sequence. It
is freely available from https://ub.cbm.uam.es/index.php.
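As a rough illustration of the native-state term in Eq. 1 (not of DeltaGREM itself), the contact matrix and the contact energy can be sketched in a few lines of Python. The function names, the toy coordinates, and the dictionary layout of the interaction parameters are hypothetical. The sketch also assumes that each contacting pair is counted once and that pairs with $|i - j| < 4$ are skipped.

```python
import math
from itertools import combinations

CONTACT_CUTOFF = 4.5   # angstroms, the distance threshold quoted in the text
MIN_SEPARATION = 4     # assumption: pairs with |i - j| < 4 are skipped


def contact_matrix(residue_atoms):
    """C[i][j] = 1 if any two atoms of residues i and j are closer than the cutoff."""
    n = len(residue_atoms)
    contacts = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if j - i < MIN_SEPARATION:
            continue
        if any(math.dist(a, b) < CONTACT_CUTOFF
               for a in residue_atoms[i] for b in residue_atoms[j]):
            contacts[i][j] = contacts[j][i] = 1
    return contacts


def native_energy(contacts, sequence, u):
    """Sum U(A_i, A_j) over contacting pairs i < j (each contact counted once)."""
    total = 0.0
    for i, j in combinations(range(len(sequence)), 2):
        if contacts[i][j]:
            # u is keyed by unordered amino acid pairs (hypothetical layout)
            total += u[frozenset((sequence[i], sequence[j]))]
    return total
```

Here `residue_atoms` is a list with one entry per residue, each entry being a list of (x, y, z) atom coordinates, and `u` maps unordered amino acid pairs to interaction parameters. The misfolded-state terms that enter ΔG are not covered by this sketch.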
Given the estimate of ΔG, two alternative models are used to
compute the acceptance probability of a mutation.
1. In the neutral model, all sequences with ΔG < ΔGthr are con-
sidered viable and equally fit, and all other sequences are elimi-
nated by negative selection. The threshold is chosen as 98% of
the ΔG of the sequence in the PDB, so that this sequence would
be selected and less stable sequences would be discarded.
2. In the fitness model, the fitness of the protein sequence is
computed as the fraction of the folded protein:
\[ f = \frac{1}{1 + e^{\Delta G / kT}} \tag{4} \]
Note that, for low temperature and large protein sequences, $f$ tends
to a step function, $f = 1$ if $\Delta G < 0$ and $f = 0$ otherwise. This
binary fitness function enforces neutral evolution that is unable to
Phylogenetic Inference with Stability Constraints 219
Fig. 1 Illustrative example of the recursive algorithm to simulate protein evolution along an ancestral
recombination graph based on two recombination events. White and gray circles correspond to coalescence
and recombination nodes, respectively. (1) The evolution starts from the GMRCA node; the protein is evolved
along branches according to the SCS substitution model and the branch lengths. (3) The simulation reaches a
recombinant node and because its parental recombinant node has not been assigned to a protein yet, the
evolutionary process continues in the other direction (4). (6) The simulation reaches a parental recombinant
node, and because its parental has already been assigned to a protein, (7) the simulation combines the two
proteins according to the recombination breakpoint at position 3. (9) Another recombinant node is reached,
and because its parental node has not been reached yet, a protein is assigned to this node and the simulation
continues in the other direction (10). (11) The parental node is reached, and (12) the recombinant fragments
are combined according to the recombination breakpoint at position 4. At the end of the process, a sequence
was simulated for every internal and tip node
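The two stability-based rules introduced before Fig. 1 (the neutral model with its ΔG threshold, and the fitness model of Eq. 4) can be sketched as follows. This is a minimal illustration, not code from ProteinEvolver: the function names are hypothetical, ΔG of the PDB sequence is assumed to be negative, and the fixation probability that turns fitness into an acceptance probability is not reproduced here.

```python
import math


def is_viable_neutral(delta_g, delta_g_pdb, fraction=0.98):
    """Neutral model: a sequence is viable if its dG lies below the threshold,
    taken as 98% of the dG of the PDB sequence (assumed negative)."""
    threshold = fraction * delta_g_pdb
    return delta_g < threshold


def folded_fraction(delta_g, kt=1.0):
    """Fitness model (Eq. 4): f = 1 / (1 + exp(dG / kT))."""
    x = delta_g / kt
    if x > 700.0:
        # exp(x) would overflow a float; the folded fraction is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(x))
```

Lowering `kt` in `folded_fraction` makes the curve steeper around ΔG = 0, which is the step-like limit mentioned in the text.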
2.3 SCS Models Outperform Empirical Substitution Models in Terms of Distribution of Frequencies Among Sites and Maximum Likelihood
We tested whether the SCS models improve results obtained with
traditional empirical substitution models by analyzing ten protein
families (photoactive yellow proteins, triosephosphate isomerases,
rubredoxins, kinesins, phage lysozymes, ferredoxins, DNA ligases,
heat shock proteins, oxysterol-binding proteins, and retroviral
aspartyl proteases) [29]. For each protein family, we downloaded
from the Pfam database a multiple sequence alignment (MSA),
together with its associated phylogenetic tree and a representative
protein structure deposited in the PDB. We also selected the
best-fitting empirical amino acid substitution model with ProtTest [44].
model the native model. This model is inferior to the full model
in several respects: (i) if misfolding is not considered, the
resulting sequences are on average more hydrophobic than sequences
in the PDB; (ii) in particular, exposed sites with few contacts are
more hydrophobic than observed, indicating that it is negative
design against misfolding that acts to limit the hydrophobicity of
exposed sites; (iii) the likelihood of observed sequences is much
higher with the full model than with the native model; (iv) the
average folding free energy (taking into account both unfolded and
misfolded states) is negative with the full model but positive with
the reduced model, i.e., the sequences produced with the reduced
model are not stable. These results confirm that it is important to
impose stability against misfolding in SCS models.
3.2 The Wild-Type (WT) Model
Another possibility to develop a model with independent sites that
implements stability constraints consists in computing the effect on
stability and fitness of any possible mutation at site $i$ starting from
the wild-type sequence. We thus compute site-specific amino acid
frequencies from Eq. 6 with
$\phi_i(a) = \log f(A_1^{\mathrm{WT}} \ldots A_L^{\mathrm{WT}};\, A_i = a)$,
i.e., the wild-type sequence with the mutation $A_i^{\mathrm{WT}} \to a$.
Note that the WT evolutionary model is only valid one mutation away
from the sequence in the PDB, while the MF model is designed to remain
valid after a long evolutionary divergence. The parameters
$P^{\mathrm{mut}}(a)$ and $\Lambda$ are determined as in the MF model.
3.3 The Substitution Process
To fully specify the substitution process, the site-specific amino acid
frequencies $P_a^i = P_i(a)$ modeled with Eq. 6 must be complemented
with site-specific exchangeability matrices $E_{ab}^i$, and the
site-specific substitution rates that define the substitution process
used to compute the likelihood function are computed as
$Q_{ab}^i = E_{ab}^i P_b^i$. The exchangeability matrices that
characterize the dynamics of the substitution process are assumed to be
symmetric; thus, the detailed balance is satisfied, and $P_a^i$ are the
stationary distributions. The matrices $E_{ab}^i$ are computed with the
method of Halpern and Bruno as the product between a global
exchangeability matrix $E_{ab}^{\mathrm{mut}}$ that represents the
mutation process and a fixation probability analogous to Eq. 5, such
that the site-specific frequency of amino acid $a$ is the power of its
site-specific fitness [47]. Specifically, if we write
$P_a^i = P_a^{\mathrm{mut}} F_a^i$, where $F_a^i$ are site-specific
selective factors, the exchangeability matrices are given by:
\[ E_{ab}^i = E_{ab}^{\mathrm{mut}} \, \frac{\log F_a^i - \log F_b^i}{F_a^i - F_b^i} \tag{7} \]
which is also a symmetric matrix that fulfills detailed balance. The
substitution rates are maximal if the two amino acids have the same
selective factors, in which case the fixation probability tends to
3.4 Implementation in the Ancestral Sequence Reconstruction Framework ProtASR
The MF model was implemented in the computer simulator ProtEvol
and in the ancestral sequence reconstruction (ASR) framework
ProtASR [50].
ProtEvol computes global (whole protein) and local (site-specific)
amino acid frequencies and exchangeability matrices that satisfy
stability of the native state against both unfolding and misfolding.
The program is freely available from https://ub.cbm.uam.es/index.php.
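To make the relation between site-specific frequencies, exchangeabilities (Eq. 7), and substitution rates concrete, the following Python sketch builds a site-specific rate matrix for a toy three-letter alphabet. It is not taken from ProtEvol: the function names and all numerical values are hypothetical. The sketch also assumes the usual convention that the diagonal of the rate matrix makes each row sum to zero, and it uses the limit 1/F when two selective factors coincide.

```python
import math


def site_exchangeability(e_mut, f):
    """E^i_ab = E^mut_ab * (log F_a - log F_b) / (F_a - F_b), following Eq. 7."""
    n = len(f)
    e_site = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            if math.isclose(f[a], f[b]):
                factor = 1.0 / f[a]   # limit of the ratio as F_b approaches F_a
            else:
                factor = (math.log(f[a]) - math.log(f[b])) / (f[a] - f[b])
            e_site[a][b] = e_mut[a][b] * factor
    return e_site


def rate_matrix(e_site, p_site):
    """Q^i_ab = E^i_ab * P^i_b for a != b; diagonal chosen so rows sum to zero."""
    n = len(p_site)
    q = [[e_site[a][b] * p_site[b] if a != b else 0.0 for b in range(n)]
         for a in range(n)]
    for a in range(n):
        q[a][a] = -sum(q[a])
    return q


# Hypothetical three-letter alphabet, for illustration only.
p_mut = [0.4, 0.4, 0.2]        # global (mutational) frequencies P^mut_a
p_site = [0.6, 0.3, 0.1]       # site-specific frequencies P^i_a
e_mut = [[0.0, 1.0, 2.0],
         [1.0, 0.0, 3.0],
         [2.0, 3.0, 0.0]]      # symmetric global exchangeabilities E^mut_ab
selective = [p / m for p, m in zip(p_site, p_mut)]   # F^i_a = P^i_a / P^mut_a
e_site = site_exchangeability(e_mut, selective)
q = rate_matrix(e_site, p_site)
```

Because `e_site` is symmetric, the resulting rates satisfy detailed balance with respect to `p_site`, that is, `p_site[a] * q[a][b]` equals `p_site[b] * q[b][a]` for every pair of states.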
ProtASR is an evolutionary framework to infer ancestral pro-
tein sequences from a multiple sequence alignment (MSA) of pro-
teins, a rooted phylogenetic tree, a protein structure representative
of the proteins of the MSA, and a set of thermodynamic parameters.
Internally, ProtASR runs ProtEvol to generate global and local
amino acid frequencies and exchangeability matrices. Next, these
frequencies and matrices are transferred to the well-established
program PAML [51] where the ASR is performed under joint or
marginal maximum likelihood (ML) approaches [52]. ProtASR is
freely available from https://github.com/MiguelArenas/protasr.
4.1 Guidelines and Recommendations for Simulating Protein Evolution with ProteinEvolver
As for any computer simulator, the first step is to design the
simulation study, including the choice of the parameters to mimic
the desired evolutionary scenario, the required number of
simulations, and the output format. Second, ProteinEvolver includes
detailed documentation and several examples, which we recommend
reading in detail. Next, we describe the input and output
information of this framework.
Since the simulation of molecular evolution is a stochastic
process [43], the user has to indicate the number of computer
simulations to be performed. The simulation of protein evolution
is performed upon a phylogeny. This phylogeny can be user-
specified or can be simulated with ProteinEvolver under the coales-
cent with recombination, demographics, longitudinal sampling,
population structure, and migration (see Subheading 2.2). For the
latter, the user has to specify the sample size (number of protein
sequences of the simulated MSA), population size, and, optionally,
other population genetics parameters (e.g., recombination rate,
distributions for recombination hotspots, population growth rate,
demographic periods, number of populations, and migration rate).
Next, the user has to specify a substitution model of
protein evolution, which could be empirical or stability-
constrained. Concerning SCS models, the user has to indicate a
protein structure, a representative set of alternative contact matrices
(already included in the package), and some thermodynamic para-
meters (see Subheading 2.2). Proportion of invariable sites and
additional rate heterogeneity among sites can be optionally speci-
fied. Finally, a sequence for the MRCA node can also be user-
specified or, alternatively, internally computed by sampling from
the amino acid frequencies.
Concerning the outputs, the program generates an MSA of
proteins of the sample (and, optionally, of proteins of ancestral
nodes) that can be written in formats such as FASTA, PHYLIP, or NEXUS.
Optionally, the program also outputs the simulated recombination
breakpoints and folding energies of the simulated proteins.
Next, we describe a practical example to simulate data with
ProteinEvolver under a site-dependent SCS model. We apply the
second example (simulation of protein sequences under the neutral
site-dependent SCS model) included in the program package.
1. Setting up the input files. First, we can explore the file
parameters, which is the main input file. In this file, text in
brackets is ignored by the program. The default specifications
in this example indicate the simulation of two replicates. Since
the setting input tree/s file is empty, the program will perform a
coalescent simulation. The coalescent simulation considers a
sample of 8 individuals (proteins) with length 255 amino acids.
Effective population size is 1000 individuals, and its variation
over time is considered with the specification of a population
4.2 Guidelines and Recommendations for Inferring Ancestral Protein Sequences with ProtASR
ProtASR is a computer program written in C and Perl that runs on
the command line. The program includes detailed documentation
and several examples, which we also recommend reading in detail.
Its input is simple: a main input file that calls secondary
input files. The input files are an MSA of protein sequences, a
rooted phylogenetic tree for the MSA, a PDB protein structure that
should be representative of the MSA, and a series of parameters to
specify the desired substitution model. For beginners we recommend
applying the default parameters provided in the examples
included in the package, since those parameters have provided a
good fit to diverse real data [25, 29, 50].
5 Concluding Remarks
Acknowledgments
References
1. Schmitt AO, Schuchhardt J, Ludwig A, Brockmann GA (2007) Protein evolution within and between species. J Theor Biol 249(2):376–383. https://doi.org/10.1016/j.jtbi.2007.08.001
2. Gao F, Bhattacharya T, Gaschen B, Taylor J, Moore JP, Novitsky V, Yusim K, Lang D, Foley B, Beddows S, Alam M, Haynes B, Hahn BH, Korber B (2003) Consensus and ancestral state HIV vaccines. Science 299(5612):1515–1518
3. Arenas M, Posada D (2010) Computational design of centralized HIV-1 genes. Curr HIV Res 8(8):613–621
4. Wilson C, Agafonov RV, Hoemberger M, Kutter S, Zorba A, Halpin J, Buosi V, Otten R, Waterman D, Theobald DL, Kern D (2015) Kinase dynamics. Using ancient protein kinases to unravel a modern cancer drug’s mechanism. Science 347(6224):882–886. https://doi.org/10.1126/science.aaa1823
5. Perez-Jimenez R, Ingles-Prieto A, Zhao ZM, Sanchez-Romero I, Alegre-Cebollada J, Kosuri P, Garcia-Manyes S, Kappock TJ, Tanokura M, Holmgren A, Sanchez-Ruiz JM, Gaucher EA, Fernandez JM (2011) Single-molecule paleoenzymology probes the chemistry of resurrected enzymes. Nat Struct Mol Biol 18(5):592–596
6. Wijma HJ, Floor RJ, Janssen DB (2013) Structure- and sequence-analysis inspired engineering of proteins for enhanced thermostability. Curr Opin Struct Biol 23(4):588–594. https://doi.org/10.1016/j.sbi.2013.04.008
7. Cole MF, Gaucher EA (2011) Utilizing natural diversity to evolve protein function: applications towards thermostability. Curr Opin Chem Biol 15(3):399–406. https://doi.org/10.1016/j.cbpa.2011.03.005
8. Arenas M (2015) Trends in substitution models of molecular evolution. Front Genet 6:319. https://doi.org/10.3389/fgene.2015.00319
9. Liberles DA, Teichmann SA, Bahar I, Bastolla U, Bloom J, Bornberg-Bauer E, Colwell LJ, de Koning AP, Dokholyan NV, Echave J, Elofsson A, Gerloff DL, Goldstein RA, Grahnen JA, Holder MT, Lakner C, Lartillot N, Lovell SC, Naylor G, Perica T, Pollock DD, Pupko T, Regan L, Roger A, Rubinstein N, Shakhnovich E, Sjolander K, Sunyaev S, Teufel AI, Thorne JL, Thornton JW, Weinreich DM, Whelan S (2012) The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci 21(6):769–785
10. Bastolla U (2014) Detecting selection on protein stability through statistical mechanical models of folding and evolution. Biomol Ther 4:291–314
11. Wilke CO (2012) Bringing molecules back into molecular evolution. PLoS Comput Biol 8(6):e1002572
12. Sikosek T, Chan HS (2014) Biophysics of protein evolution and evolutionary protein biophysics. J R Soc Interface 11(100):20140419. https://doi.org/10.1098/rsif.2014.0419
13. Goldstein RA (2011) The evolution and evolutionary consequences of marginal thermostability in proteins. Proteins 79(5):1396–1407
14. Serohijos AW, Shakhnovich EI (2014) Merging molecular mechanism and evolution: theory and computation at the interface of biophysics and evolutionary population genetics. Curr Opin Struct Biol 26:84–91. https://doi.org/10.1016/j.sbi.2014.05.005
15. Bastolla U, Dehouck Y, Echave J (2017) What evolution tells us about protein physics, and protein physics tells us about evolution. Curr Opin Struct Biol 42:59–66. https://doi.org/10.1016/j.sbi.2016.10.020
16. Echave J (2008) Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett 457
44. Abascal F, Zardoya R, Posada D (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21(9):2104–2105
45. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
46. Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A (2000) Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29:291–325
47. Halpern AL, Bruno WJ (1998) Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol 15(7):910–917
48. Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18(5):691–699
49. Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8(3):275–282
50. Arenas M, Weber CC, Liberles DA, Bastolla U (2017) ProtASR: an evolutionary framework for ancestral protein reconstruction with selection on folding stability. Syst Biol 66:1054–1064. https://doi.org/10.1093/sysbio/syw121
51. Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24(8):1586–1591
52. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13(5):555–556
53. Merkl R, Sterner R (2016) Ancestral protein reconstruction: techniques and applications. Biol Chem 397(1):1–21. https://doi.org/10.1515/hsz-2015-0158
54. Liberles DA (2007) Ancestral sequence reconstruction. Oxford University Press, Oxford
55. Kothe DL, Li Y, Decker JM, Bibollet-Ruche F, Zammit KP, Salazar MG, Chen Y, Weng Z, Weaver EA, Gao F, Haynes BF, Shaw GM, Korber BT, Hahn BH (2006) Ancestral and consensus envelope immunogens for HIV-1 subtype C. Virology 352(2):438–449
56. Gaucher EA, Govindarajan S, Ganesh OK (2008) Palaeotemperature trend for Precambrian life inferred from resurrected proteins. Nature 451(7179):704–707
57. Hobbs JK, Shepherd C, Saul DJ, Demetras NJ, Haaning S, Monk CR, Daniel RM, Arcus VL (2012) On the origin and evolution of thermophily: reconstruction of functional precambrian enzymes from ancestors of Bacillus. Mol Biol Evol 29(2):825–835. https://doi.org/10.1093/molbev/msr253
58. Bastolla U, Moya A, Viguera E, van Ham RC (2004) Genomic determinants of protein folding thermodynamics in prokaryotic organisms. J Mol Biol 343(5):1451–1466
59. Williams PD, Pollock DD, Blackburne BP, Goldstein RA (2006) Assessing the accuracy of ancestral protein reconstruction methods. PLoS Comput Biol 2(6):e69
60. Lartillot N, Lepage T, Blanquart S (2009) PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25(17):2286–2288. https://doi.org/10.1093/bioinformatics/btp368
61. Lartillot N, Philippe H (2004) A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol 21(6):1095–1109
62. Mustonen V, Lassig M (2009) From fitness landscapes to seascapes: non-equilibrium dynamics of selection and adaptation. Trends Genet 25(3):111–119. https://doi.org/10.1016/j.tig.2009.01.002
63. Arenas M, Patricio M, Posada D, Valiente G (2010) Characterization of phylogenetic networks with NetTest. BMC Bioinformatics 11(1):268
Chapter 12
Abstract
Present-day protein space is the result of 3.7 billion years of evolution, constrained by the underlying
physicochemical qualities of the proteins. It is difficult to differentiate between evolutionary traces and
effects of physicochemical constraints. Nonetheless, as a rule of thumb, instances of structural reuse, or
focusing on structural similarity, are likely attributable to physicochemical constraints, whereas sequence
reuse, or focusing on sequence similarity, may be more indicative of evolutionary relationships. Both types
of relationships have been studied and can provide meaningful insights to protein biophysics and evolution,
which in turn can lead to better algorithms for protein search, annotation, and maybe even design.
In broad strokes, studies of protein space vary in the entities they represent, the similarity measure
comparing these entities, and the representation used. The entities can be, for example, protein chains,
domains, supra-domains, or smaller protein sub-parts denoted themes. The measures of similarity
between the entities can be based on sequence, structure, function, or any combination of these. The
representation can be global, encompassing the whole space, or local, focusing on a particular region
surrounding protein(s) of interest. Global representations include lists of grouped proteins, protein
networks, and maps. Networks are the abstraction that is derived most directly from the similarity
data: each node is the protein entity (e.g., a domain), and edges connect similar domains. Selecting
the entities, the similarity measure, and the abstraction are three intertwined decisions: the similarity
measures allow us to identify the entities, and the selection of entities influences what is a meaningful
similarity measure. Similarly, we seek entities that are related to each other in a way, for which a simple
representation describes their relationships succinctly and accurately. This chapter will cover studies that
rely on different entities, similarity measures, and a range of representations to better understand protein
structure space. Scholars may use publicly available navigators offering a global representation, and in
particular the hierarchical classifications SCOP, CATH, and ECOD, or a local representation, which
encompass structural alignment algorithms. Alternatively, scholars can configure their own navigator
using existing tools. To demonstrate this DIY (do it yourself) approach for navigating in protein space,
we investigate substrate-binding proteins. By presenting sequence similarities among this large and
diverse protein family as a network, we can infer that one member (pdb ID 4ntl; of yet unknown
function) may bind methionine and suggest a putative binding mechanism.
Key words Protein space navigation, Structure space, Evolutionary relationships in protein space
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_12, © Springer Science+Business Media, LLC, part of Springer Nature 2019
234 Aya Narunsky et al.
1 Introduction
1.1 Protein Structure Space
Protein structure space is an abstract model which we use when we
study large, representative sets of protein structures and their
interrelationships. Inspecting these large datasets allows us to bet-
ter understand protein evolution and biophysics. While protein
space is not real, the entities that populate it are: for example,
these can be protein chains or domains; furthermore, their compar-
isons are meaningful. Thus, the first and essential step when study-
ing protein structure space is to decide on the set of entities and the
measure of similarity among them (coupled with a method to
compute it). We can then calculate all-against-all comparisons of
these entities to construct the initial dataset. Because the abstract
model is derived from these comparisons, it is essential that this
initial set is as accurate and comprehensive as possible. Navigating
in protein structure space is in many ways navigating within this
initial dataset, and we can do this either locally or globally.
1.3 The Potential of Studying Protein Structure Space
Studying protein structure space can help us better understand
protein evolution and biophysics. It may also have a practical
value: insights could be used in protein structure prediction,
protein function prediction, and protein design. By way of motivation,
we list a few examples; there are many more (e.g., those listed in
[1, 2]). Evolution scholars have navigated protein space looking for
clues in the remnants of evolutionary processes [3, 4]. For example,
Choi et al. [5] derive the “multiple birth model” for proteins from
maps, Dokholyan et al. [6] offered support for all proteins evolving
from a few precursors, Alva et al. [7] studied the relationship
between convergent and divergent evolution, Farias-Rico et al.
traced the evolutionary relationships between ancient superfolds
Navigating Protein Space 235
2.1 The Entities
The entities are derived from the proteins of known structure in the
Protein Data Bank (PDB) [15] and can be parts of proteins of
different scales, depending on the question at hand. With minimal
processing, these can be protein complexes or protein chains. One
could also consider protein domains [16, 17] (or even supra-
domains [18]), or meaningful sub-domain entities: protein frag-
ments (e.g., [19, 20]), protein themes [9], protein interfaces [21],
protein-peptide complexes [22], repetitive secondary structure ele-
ments (e.g., Smotifs [23]), or tertiary structural motifs (TERMS)
[12]. Alternatively, the structures could possibly be predictions
[24], or homology models [25]. Typically, one would use datasets
that were curated by others (e.g., the domain sets in SCOP [26],
CATH [27], or ECOD [28]). It is important to consider if the
entities are mutually exclusive, or not. For example, domains are
mutually exclusive because when partitioning chains to domains,
each residue is associated with only a single domain; in contrast,
themes cover multiple (nested) segments in a protein chain.
2.3 Addressing Redundancy
The PDB is redundant, and some proteins are far more abundant
than others (e.g., due to research interests of the scholars studying
these proteins) [41]. This suggests that when seeking a global
1 Notice that the terms used here characterize the similarity measure, not the style of navigation in protein space,
to use the same terms as in the Needleman–Wunsch and Smith–Waterman sequence alignment algorithms.
2.4 Data Structures for Global Representation
For a global perspective, one must derive a data structure, or an
abstract model, from the dataset of all proteins and their
comparisons. Scholars used three types of models: (1) networks,
(2) classifications, and (3) maps (for a review of these, see [2]). A network is
the data structure closest to the raw data. To construct it, one only
needs to list the meaningful similarities, and the network is a
straightforward representation of the entities (as nodes) and the
similarities (as edges connecting these nodes.) A classification
groups the entities into nonoverlapping sets of proteins. It is
assumed that proteins in the same set in the classification (i.e.,
with the same classification) are similar to each other, while those
not in the same set are not (or less so). The classifications are
hierarchical, and proteins are grouped with decreasing degrees of
similarity. Hence, to construct a classification, one needs to weight
the importance of the similarities identified among the protein
entities: emphasizing the ones that are within a set and downplay-
ing the ones between sets. Finally, in a map, each protein is repre-
sented by a point, and the points are positioned in two or three
dimensions, so that the distance between them approximates the
dissimilarity between the proteins they represent. The mapping is
calculated by first converting the measures of similarity between the
protein entities to an all-by-all dissimilarity matrix, followed by a
multidimensional scaling (MDS) to project this matrix to a lower
(two or three) dimension. Because the position of a protein is not
indicative of its relationship to other proteins in a straightforward
manner, maps were not used for local navigation. Rather, the
insights were derived from a global perspective [5, 14, 35, 44, 45].
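As a minimal sketch of the first and third data structures described above, the Python code below builds a network from a list of pairwise similarities and converts the same list into an all-by-all dissimilarity matrix of the kind that could be passed to an MDS routine. The function names, the entity names, and the threshold are hypothetical. The conversion `1 - similarity` is one simple choice that assumes scores in [0, 1], unlisted pairs are treated as maximally dissimilar, and the MDS projection itself is not implemented here.

```python
def build_network(similarities, threshold):
    """Nodes are entities; an edge joins two entities whose similarity
    reaches the threshold (i.e., only 'meaningful' similarities are kept)."""
    graph = {}
    for a, b, score in similarities:
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group entities that are linked by a path of edges (depth-first search)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(sorted(component))
    return sorted(components)


def dissimilarity_matrix(similarities, entities):
    """All-by-all dissimilarities; assumes scores in [0, 1] and uses 1 - score.
    Pairs without a listed similarity default to the maximum dissimilarity 1.0."""
    index = {name: k for k, name in enumerate(entities)}
    n = len(entities)
    matrix = [[0.0 if r == c else 1.0 for c in range(n)] for r in range(n)]
    for a, b, score in similarities:
        matrix[index[a]][index[b]] = matrix[index[b]][index[a]] = 1.0 - score
    return matrix
```

For example, with similarities `[("d1", "d2", 0.9), ("d2", "d3", 0.8), ("d3", "d4", 0.2)]` and a threshold of 0.5, the network links d1, d2, and d3 into one component and leaves d4 on its own.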
2.6 Navigators with a Global Perspective
The most established resources for navigating protein structure
space are the hierarchical classifications; the popular ones are
SCOP from the Murzin lab, CATH from the Orengo lab, and
ECOD from the Grishin Lab; another popular classification—
Pfam [46]—is not discussed here because it is based on sequence
rather than structure. For a recent and extensive review of the
classifications, see [47]. The classifications organize the data in a
hierarchy: a user can gain a perspective of the whole space by
drilling down, starting at the top. For example, starting at the
highest level of SCOP, we see that structure space has regions of
all-alpha domains, all-beta domains, alpha+beta domains, and
alpha/beta domains, where the two latter classes include both
alpha and beta elements, separated or intertwined, respectively
[48]. Alternatively, one can search for a specific protein and con-
sider the classification of its domains and the list of all its related
proteins—ones whose domains are classified similarly (at different
levels of the hierarchy). In short, the data structure that is used in
the hierarchies is a collection of sets (or lists), organized as a tree;
each entity is classified in several (nested) sets (depending on the
height of the hierarchy). The similarity measure used is based on
the sequences (at the lower levels of the hierarchy) and structures
(at the higher levels of the hierarchy). The entities classified are
domains: nonoverlapping subsections of the protein chains, which
cover all chain residues (or, in other words, each PDB chain is
segmented into one or more domains such that each residue is
part of exactly one domain). There is much discussion, and controversy,
about the correct definition of a domain [49–51]; the existence of
several domain databases (rather than one) is a clear
indication of this.
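The "collection of nested sets organized as a tree" can be made concrete with a small sketch. It assumes a four-level hierarchy with dotted labels in the style of SCOP identifiers; the domain names and labels below are hypothetical and serve only to show the data structure.

```python
from collections import defaultdict

# Hypothetical domains, each labeled by its path from the root of the tree
# (here: class, fold, superfamily, family). The labels are illustrative only.
CLASSIFICATION = {
    "dom_A": ("a", "a.1", "a.1.1", "a.1.1.1"),
    "dom_B": ("a", "a.1", "a.1.1", "a.1.1.2"),
    "dom_C": ("a", "a.2", "a.2.1", "a.2.1.1"),
    "dom_D": ("b", "b.1", "b.1.1", "b.1.1.1"),
}

def nested_sets(classification):
    """Map every node of the tree to the set of domains classified under it."""
    sets = defaultdict(set)
    for domain, path in classification.items():
        for label in path:
            sets[label].add(domain)
    return dict(sets)

def shared_depth(classification, first, second):
    """Number of levels, counted from the top, at which two domains agree."""
    depth = 0
    for x, y in zip(classification[first], classification[second]):
        if x != y:
            break
        depth += 1
    return depth

sets = nested_sets(CLASSIFICATION)
```

Drilling down corresponds to following labels from `sets["a"]` to `sets["a.1"]` and so on, while `shared_depth` expresses how closely two domains are classified: here dom_A and dom_B agree on three levels, dom_A and dom_C on one.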
In practical terms, domains are the entities classified in SCOP,
CATH, ECOD, or in servers curating domains like CDD
[52]. More formally, there are several (not necessarily overlapping)
definitions of a domain [16, 17, 53]: (1) a structurally distinct
region (perhaps a compact unit) [54], (2) a segment that is identi-
fied as an evolutionary unit based on observations of reuse in
protein space, (3) an independently folding unit, and (4) a section
with assigned biochemical function. The domains in the hierarchi-
cal classifications are defined based on reuse. Unfortunately, these
domains, which are classified in the different databases, are not the
same ones (for comparisons, see [50, 51, 55, 56]); a recent study
estimates that only 60% of CATH domains have a similar SCOP
counterpart [53]. Nonetheless, the domains in the hierarchical
classifications have similar lengths of approximately 100 residues;
this is the average for the distributions of domain lengths in the
2.7 Publicly Available Navigators for Local Environments of Structure Space
Another way of navigating protein structure space is zooming into
a local region, while ignoring the global view, and exploring by
moving between such local environments. Starting from the protein
of interest, we think of its local environment as a list of its
structural neighbors (sorted from near to distant ones); we can
then move in space by selecting one of these neighbors to see its
slightly shifted local environment (centered on this neighbor). We
think of this process as navigating in protein structure space, like a
driver following a navigation app without seeing the full landscape.
For this, all one must have is the list of neighbors for each protein in
the dataset. The entities considered are typically both PDB chains
and domains (either taken from the classifications or calculated with
240 Aya Narunsky et al.
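Local navigation needs nothing more than a neighbor list per protein, which the following minimal sketch illustrates. The neighbor lists and identifiers are hypothetical; in practice they would come from a structural alignment server or from precalculated comparisons.

```python
# Hypothetical neighbor lists, each sorted from the nearest to the most distant.
NEIGHBORS = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P4"],
    "P3": ["P1"],
    "P4": ["P2", "P5"],
    "P5": ["P4"],
}

def local_environment(neighbors, protein, size=10):
    """The `size` nearest structural neighbors of a protein."""
    return neighbors.get(protein, [])[:size]

def navigate(neighbors, start, choices):
    """Follow a list of choices; each one picks the k-th neighbor of the current protein."""
    path = [start]
    for k in choices:
        environment = local_environment(neighbors, path[-1])
        path.append(environment[k])
    return path

# Move from P1 to its nearest neighbor, then twice to a second-nearest neighbor.
path = navigate(NEIGHBORS, "P1", choices=[0, 1, 1])
```

Each step re-centers the view on the chosen neighbor, so the path can reach proteins (here P5) that are not in the local environment of the starting point.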
2.8 DIY: Build-Your-Own Navigator
There are several reasons why scholars may want to customize their
own navigator to explore protein structure space, or parts of
it. First, the entities they wish to include may be specific to their
problem: a set of proteins that is not covered in the public servers
(perhaps a more redundant one), unpublished structures, or even
predicted ones. Also, one may want to study subsections of pro-
teins, which are different from chains or domains, for example,
shorter themes [9] or loops [73]. Second, scholars may want to
compare the entities themselves, as it gives them flexibility in the
choice of a specific sequence or structure alignment program, full
control over the parameters used, and the ability to enforce addi-
tional conditions when comparing proteins (e.g., a minimal align-
ment length). In some cases, even though there is a publicly
available structural alignment server, it is not fast enough for navi-
gating structure space; for these, one may prefer to pre-calculate all-
against-all comparisons (e.g., using the parallel power of a com-
puter cluster). We list just a few examples of comparison methods
that were used in a similar context: HHSearch [30], Matt [74], CE
[75], Mammoth [76], 3D-BLAST [77], FragBag [78], TM-align
[71], SSM [68], GRASP [79], and STRUCTAL [80]. Third, the
structural alignment servers do not offer a global perspective of
structure space, only a local one, and one may be interested in this
global perspective. Finally, scholars have different preferences when
template and runs the molecular viewer with a copy of this script.
The source code of CyToStruct is publicly available
(https://bitbucket.org/sergeyn/cytostruct/wiki/Home), along with a series of
demos that users can rely on as a starting point. The demos include
visualization using the four popular molecular viewers (each with
their own syntax), configuring the visualization of complete struc-
tures, protein interfaces, structurally aligning multiple structures,
and selecting specific residues. CyToStruct can also be used within
the web-based version of Cytoscape (Cytoscape.js), to provide an
online visualization combining a network and a molecular viewer.
We present two examples for DIY navigators. The first is the
navigator that Nepomnyachiy et al. customized for a global view of
protein structure space [11]. The entities, or nodes in the network,
are 9710 SCOP domains (70% nonredundant set). These domains
were compared using the structural aligner SSM [68]; for suffi-
ciently meaningful alignments, Nepomnyachiy et al. calculated
measures of the similarity of the domains. Then, they define several
networks, each characterized by its edges, which connect all domain
pairs that were aligned with parameters better than some fixed
thresholds: a minimal alignment length (55, 75 residues), maximal
RMSD (2, 2.5, and 3 Å), and minimal percent sequence similarity
(30, 40, and 50%). By coloring the nodes based on their SCOP
class, all-alpha, all-beta, alpha/beta, and alpha+beta, they could see
that protein structure space has a continuous region (the alpha/
beta domains) and discrete regions [11]. The Cytoscape networks
provide a global view, but navigating in specific regions of structure
space is also interesting. Nepomnyachiy et al. link and configure the
molecular viewer using CyToStruct [82] to see the domains and
the alignments, and they package and distribute the data and configuration
files (http://cs.haifa.ac.il/~trachel/domain_motif_networks/),
allowing anyone to study protein structure space in this way.
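The way a family of networks follows from a grid of thresholds can be sketched as below. The threshold values are the ones quoted above; the alignment records, the domain names, and the choice of inclusive comparisons are assumptions made for illustration and are not taken from [11].

```python
from itertools import product

# Hypothetical alignment records:
# (domain, domain, aligned length, RMSD in angstroms, percent sequence similarity)
ALIGNMENTS = [
    ("d1", "d2", 80, 1.8, 55.0),
    ("d2", "d3", 60, 2.4, 35.0),
    ("d3", "d4", 50, 1.5, 60.0),
]

def edges_for(alignments, min_length, max_rmsd, min_similarity):
    """Edges of the network defined by one combination of thresholds."""
    return [
        (a, b)
        for a, b, length, rmsd, similarity in alignments
        if length >= min_length and rmsd <= max_rmsd and similarity >= min_similarity
    ]

# One network per combination of the thresholds listed in the text.
networks = {
    thresholds: edges_for(ALIGNMENTS, *thresholds)
    for thresholds in product((55, 75), (2.0, 2.5, 3.0), (30.0, 40.0, 50.0))
}
```

Stricter thresholds give sparser networks: the pair d2 and d3 is connected under the most permissive settings but drops out once the minimal length rises to 75 residues, and d3 and d4 never connect because their alignment is shorter than both length cutoffs.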
2.9 Case Study
We present here a new example, where Cytoscape and CyToStruct
are used to navigate protein space for function inference. The
navigator helps because a careful examination of populated regions
in the protein universe can help decipher unknown qualities of
proteins found in these regions. Here, we demonstrate this using
substrate-binding proteins (SBPs) [90]. SBPs are involved in trans-
port of substrates into the cell, where their role is to recognize the
substrate and relay it to its transmembrane transporter. Although
they vary in size and share relatively low sequence similarity, they
share a similar, highly conserved, fold. In general, their shape is a
lung-like structure, formed of two structurally similar globular
domains, connected by a hinge. The hinge facilitates alternation
between substrate-free and substrate-bound conformations; substrate
binding to a cavity between the two domains brings them
closer to one another, into a bound, or “closed,” conformation.
Fig. 1 Navigating protein structure space to study proteins with unknown function. Left panel: network of
substrate-binding proteins. Each node represents a single PDB chain; two nodes are connected by an edge if
they share some sequential and structural similarity. The nodes are colored according to the substrate; see
color-code at the bottom. White nodes represent proteins of unknown function. Middle panel: zooming-in on
the top-right cluster. This cluster is composed mostly of amino acid binding proteins. Right panel: zooming-in
on one connected component. Violet nodes represent methionine binding proteins. 4ntl, represented here by a
white node encircled in orange, has no bound substrate, and its function is unknown. It is connected to the two
central nodes, 4qhq and 3tqw (encircled in blue and purple). The figure was created using Cytoscape [94]
Fig. 2 Methionine binding in the SBPs 4ntl, 4qhq, and 3tqw. (A) Structural superposition of the 4ntl query
(orange) with 4qhq and 3tqw (blue and purple), respectively. The superposition is over the C-terminal lobe to
highlight the conformational change between the bound (closed; 4qhq and 3tqw) and unbound (open; query)
states of the SBPs. The bound methionine is shown in red spheres. (B) The methionine binding site in 4qhq.
Methionine is presented using sticks model, and the polar residues of the binding site are depicted as
wireframes. The hydrogen bonds that mediate methionine’s interactions with these residues and with water
molecules (red sphere) are marked as red dashed lines. The highly conserved Arg143 is also marked. (C) The
methionine binding site in 3tqw. The highly conserved Arg113, equivalent of Arg143 in panel B, is marked. (D)
Putative encounter complex between methionine and the query. Arg144 (depicted as wireframe) has the same
location and rotameric state as its equivalents: Arg144 of 4qhq and Arg113 of 3tqw. The dashed line shows
the putative hydrogen bond, which could form between the arginine and the methionine carbonyl group. The
figure was created using the PyMOL molecular viewer [84]
using ConSurf [92, 93], shows that the binding cavity is highly
conserved, providing further support for the inferred function and
binding mode. In particular, the three binding sites feature a highly
conserved arginine residue (conservation grade of 9 on a 1–9 scale).
Furthermore, in all three proteins, the arginine populates the exact
same rotameric state, which allows it to form a hydrogen bond with
the methionine substrate (Fig. 2b–d). In addition, water molecules
that participate in the binding are also found in all the structures.
However, not all the interactions that are found in the two bound
states have equivalents in the query, and the structural superposition
indicates that the query is in an open conformation (Fig. 2a). This
suggests that binding may follow the population shift theory,
where methionine is initially recognized by the conserved arginine
residue in the open conformation. This interaction may induce a
shift of the protein to its closed conformation, where additional
residues interact with methionine. Further investigation is needed
to examine this suggestion.
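The first step of this kind of inference, proposing a function from the annotations of network neighbors, can be sketched as a simple vote. The two edges and the methionine annotations below are the ones described for 4ntl in Fig. 1 and Fig. 2; the voting rule itself is an assumption made for illustration, and it only generates a hypothesis that still has to be checked structurally, as was done above.

```python
from collections import Counter

# The query 4ntl and the two annotated neighbors named in the text.
EDGES = [("4ntl", "4qhq"), ("4ntl", "3tqw")]
SUBSTRATE = {"4qhq": "methionine", "3tqw": "methionine"}

def suggest_substrate(edges, substrate, query):
    """Most common substrate among annotated neighbors, with its share of the votes."""
    neighbors = {b if a == query else a for a, b in edges if query in (a, b)}
    votes = Counter(substrate[n] for n in neighbors if n in substrate)
    if not votes:
        return None  # no annotated neighbor, so nothing to suggest
    label, count = votes.most_common(1)[0]
    return label, count / sum(votes.values())

suggestion = suggest_substrate(EDGES, SUBSTRATE, "4ntl")
```

For 4ntl both annotated neighbors bind methionine, so the vote is unanimous; a split vote would be a signal to inspect the alignments and binding sites more carefully before proposing a function.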
References
physics to Darwinian selection. Annu Rev Phys Chem 59:105–127
4. Trifonov EN, Berezovsky IN (2003) Evolutionary aspects of protein structure and folding. Curr Opin Struct Biol 13(1):110–114
5. Choi IG, Kim SH (2006) Evolution of protein structural classes and protein sequence families. Proc Natl Acad Sci U S A 103(38):14056–14061. https://doi.org/10.1073/pnas.0606239103
6. Dokholyan NV, Shakhnovich B, Shakhnovich EI (2002) Expanding protein universe and its origin from the biological big bang. Proc Natl Acad Sci 99(22):14132–14136. https://doi.org/10.1073/pnas.202497999
7. Alva V, Remmert M, Biegert A, Lupas AN, Söding J (2010) A galaxy of folds. Protein Sci 19(1):124–130. https://doi.org/10.1002/pro.297
8. Farías-Rico JA, Schmidt S, Höcker B (2014) Evolutionary relationship of two ancient protein superfolds. Nat Chem Biol 10(9):710–715. https://doi.org/10.1038/nchembio.1579
9. Nepomnyachiy S, Ben-Tal N, Kolodny R (2017) Complex evolutionary footprints revealed in an analysis of reused protein segments of diverse lengths. Proc Natl Acad Sci U S A 114:11703
10. Skolnick J, Arakaki AK, Lee SY, Brylinski M (2009) The continuity of protein structure space is an intrinsic property of proteins. Proc Natl Acad Sci 106:15690. https://doi.org/10.1073/pnas.0907683106
11. Nepomnyachiy S, Ben-Tal N, Kolodny R (2014) Global view of the protein universe. Proc Natl Acad Sci 111:11691. https://doi.org/10.1073/pnas.1403395111
12. Mackenzie CO, Zhou J, Grigoryan G (2016) Tertiary alphabet for the observable protein structural universe. Proc Natl Acad Sci U S A 113(47):E7438–E7447
13. Kolodny R, Petrey D, Honig B (2006) Protein structure comparison: implications for the nature of 'fold space', and structure and function prediction. Curr Opin Struct Biol 16(3):393–398
14. Osadchy M, Kolodny R (2011) Maps of protein structure space reveal a fundamental relationship between protein structure and function. Proc Natl Acad Sci 108(30):12301–12306. https://doi.org/10.1073/pnas.1102727108
15. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28(1):235–242
16. Koehl P (2006) Protein structure classification. In: Reviews in Computational Chemistry. John Wiley & Sons, Inc., New York, pp 1–55. https://doi.org/10.1002/0471780367.ch1
17. Ponting CP, Russell RR (2002) The natural history of protein domains. Annu Rev Biophys Biomol Struct 31(1):45–71. https://doi.org/10.1146/annurev.biophys.31.082901.134314
18. Vogel C, Berzuini C, Bashton M, Gough J, Teichmann SA (2004) Supra-domains: evolutionary units larger than single protein domains. J Mol Biol 336(3):809–823. https://doi.org/10.1016/j.jmb.2003.12.026
19. Kolodny R, Koehl P, Guibas L, Levitt M (2002) Small libraries of protein fragments model native protein structures accurately. J Mol Biol 323(2):297–307
20. Vanhee P, Verschueren E, Baeten L, Stricher F, Serrano L, Rousseau F, Schymkowitz J (2011) BriX: a database of protein building blocks for structural analysis, modeling and design. Nucleic Acids Res 39(Suppl 1):D435–D442
21. Davis FP, Sali A (2005) PIBASE: a comprehensive database of structurally defined protein interfaces. Bioinformatics 21(9):1901–1907
22. Vanhee P, Reumers J, Stricher F, Baeten L, Serrano L, Schymkowitz J, Rousseau F (2009) PepX: a structural database of non-redundant protein–peptide complexes. Nucleic Acids Res 38(Suppl 1):D545–D551
23. Fernandez-Fuentes N, Dybas JM, Fiser A (2010) Structural characteristics of novel protein folds. PLoS Comput Biol 6(4):e1000750
24. Ovchinnikov S, Park H, Varghese N, Huang P-S, Pavlopoulos GA, Kim DE, Kamisetty H, Kyrpides NC, Baker D (2017) Protein structure determination using metagenome sequence data. Science 355(6322):294–298
25. Pieper U, Eswar N, Davis FP, Braberg H, Madhusudhan MS, Rossi A, Marti-Renom M, Karchin R, Webb BM, Eramian D (2006) MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res 34(Suppl 1):D291–D295
26. Lo Conte L, Ailey B, Hubbard TJP, Brenner SE, Murzin AG, Chothia C (2000) SCOP: a structural classification of proteins database. Nucleic Acids Res 28(1):257–259
27. Orengo C, Michie A, Jones S, Jones D, Swindells M, Thornton J (1997) CATH - a hierarchic classification of protein domain structures. Structure 5(8):1093–1108
28. Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, Shi S, Kim B-H, Grishin NV (2014) ECOD: an evolutionary classification of protein domains. PLoS Comput Biol 10(12):e1003926. https://doi.org/10.1371/journal.pcbi.1003926
29. Lupas AN, Ponting CP, Russell RB (2001) On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J Struct Biol 134(2–3):191–203
30. Söding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21(7):951–960
31. Eddy SR (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform 1:205–211
32. Alva V, Söding J, Lupas AN (2016) A vocabulary of ancient peptides at the origin of folded proteins. elife 4:e09410
33. Kosloff M, Kolodny R (2008) Sequence-similar, structure-dissimilar protein pairs in the PDB. Proteins 71(2):891–902
34. Narunsky A, Nepomnyachiy S, Ashkenazy H, Kolodny R, Ben-Tal N (2015) ConTemplate suggests possible alternative conformations for a query protein of known structure. Structure 23(11):2162–2170
35. Holm L, Sander C (1996) Mapping the protein universe. Science 273(5275):595–603
36. Skolnick J, Gao M, Zhou H (2014) On the role of physics and evolution in dictating protein structure and function. Israel J Chem 54(8–9):1176–1188
37. Hasegawa H, Holm L (2009) Advances and pitfalls of protein structural alignment. Curr Opin Struct Biol 19(3):341–348
38. Kolodny R, Koehl P, Levitt M (2005) Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J Mol Biol 346(4):1173–1188
39. Kolodny R, Linial N (2004) Approximate protein structural alignment in polynomial time. Proc Natl Acad Sci U S A 101(33):12201–12206
40. Carugo O (2007) Recent progress in measuring structural similarity between proteins. Curr Protein Pept Sci 8(3):241
41. Yanover C, Vanetik N, Levitt M, Kolodny R, Keasar C (2014) Redundancy-weighting for better inference of protein structural features. Bioinformatics 30(16):2295–2301
42. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13):1658–1659
43. Wang G, Dunbrack RL (2003) PISCES: a protein sequence culling server. Bioinformatics 19(12):1589–1591. https://doi.org/10.1093/bioinformatics/btg224
44. Choi I-G, Kim S-H (2007) Global extent of horizontal gene transfer. Proc Natl Acad Sci 104(11):4489–4494. https://doi.org/10.1073/pnas.0611557104
45. Orengo CA, Flores TP, Taylor WR, Thornton JM (1993) Identification and classification of protein fold families. Protein Eng 6(5):485–500. https://doi.org/10.1093/protein/6.5.485
46. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR (2014) Pfam: the protein families database. Nucleic Acids Res 42:D222. https://doi.org/10.1093/nar/gkt1223
47. Pearl FMG, Sillitoe I, Orengo CA (2015) Protein structure classification. In: eLS. John Wiley & Sons, Ltd., New York. https://doi.org/10.1002/9780470015902.a0003033.pub3
48. Levitt M, Chothia C (1976) Structural patterns in globular proteins. Nature 261(5561):552–558
49. Holland TA, Veretnik S, Shindyalov IN, Bourne PE (2006) Partitioning protein structures into domains: why is it so difficult? J Mol Biol 361(3):562–590
50. Hadley C, Jones DT (1999) A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure 7(9):1099–1112
51. Day R, Beck DAC, Armen RS, Daggett V (2003) A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci 12(10):2150–2160. https://doi.org/10.1110/ps.0306803
52. Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR (2010) CDD: a conserved domain database for the functional annotation of proteins. Nucleic Acids Res 39(Suppl 1):D225–D229
53. Kelley LA, Sternberg MJ (2015) Partial protein domains: evolutionary insights and bioinformatics challenges. Genome Biol 16(1):1–3. https://doi.org/10.1186/s13059-015-0663-8
54. Veretnik S, Gu J, Wodak S (2009) Identifying structural domains in proteins. In: Gu G, Bourne P (eds) Structural bioinformatics, 2nd edn. Wiley-Blackwell, Hoboken, NJ, pp 485–513
55. Schaeffer RD, Jonsson AL, Simms AM, Daggett V (2011) Generation of a consensus protein domain dictionary. Bioinformatics 27
the entire PDB quickly and accurately. Proc Natl Acad Sci U S A 107(8):3481–3486. https://doi.org/10.1073/pnas.0914097107
79. Petrey D, Xiang Z, Tang CL, Xie L, Gimpelev M, Mitros T, Soto CS, Goldsmith-Fischman S, Kernytsky A, Schlessinger A, Koh IY, Alexov E, Honig B (2003) Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins 53(Suppl 6):430–435. https://doi.org/10.1002/prot.10550
80. Subbiah S, Laurents DV, Levitt M (1993) Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr Biol 3(3):141–148
81. Saito R, Smoot ME, Ono K, Ruscheinski J, Wang P-L, Lotia S, Pico AR, Bader GD, Ideker T (2012) A travel guide to Cytoscape plugins. Nat Methods 9(11):1069–1076
82. Nepomnyachiy S, Ben-Tal N, Kolodny R (2015) CyToStruct: augmenting the network visualization of cytoscape with the power of molecular viewers. Structure 23(5):941–948
83. Morris JH, Huang CC, Babbitt PC, Ferrin TE (2007) structureViz: linking Cytoscape and UCSF chimera. Bioinformatics 23(17):2345–2347. https://doi.org/10.1093/bioinformatics/btm329
84. Schrodinger, LLC (2010) The PyMOL molecular graphics system, Version 1.3r1. Schrodinger, LLC, New York
85. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE (2004) UCSF chimera - a visualization system for exploratory research and analysis. J Comput Chem 25(13):1605–1612
86. Jmol: an open-source java viewer for chemical structure in 3D. http://www.jmol.org/
87. Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14(1):33–38
88. Rose AS, Hildebrand PW (2015) NGL viewer: a web application for molecular visualization. Nucleic Acids Res 43(Web Server issue):W576–W579. https://doi.org/10.1093/nar/gkv402
89. O'Donoghue SI, Goodsell DS, Frangakis AS, Jossinet F, Laskowski RA, Nilges M, Saibil HR, Schafferhans A, Wade RC, Westhof E (2010) Visualization of macromolecular structures. Nat Methods 7:S42–S55
90. Berntsson RP-A, Smits SH, Schmitt L, Slotboom D-J, Poolman B (2010) A structural classification of substrate-binding proteins. FEBS Lett 584(12):2606–2617
91. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A (2013) A large-scale evaluation of computational protein function prediction. Nat Methods 10(3):221–227
92. Glaser F, Pupko T, Paz I, Bell RE, Bechor-Shental D, Martz E, Ben-Tal N (2003) ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics 19(1):163–164
93. Ashkenazy H, Abadi S, Martz E, Chay O, Mayrose I, Pupko T, Ben-Tal N (2016) ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules. Nucleic Acids Res 44(W1):W344–W350
94. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504. https://doi.org/10.1101/gr.1239303
Chapter 13
Abstract
Reconstructing evolutionary relationships in repeat proteins is notoriously difficult due to the high degree
of sequence divergence that typically occurs between duplicated repeats. This is complicated further by the
fact that proteins with a large number of similar repeats are more likely to produce significant local sequence
alignments than proteins with fewer copies of the repeat motif. Furthermore, biologically correct sequence
alignments are sometimes impossible to achieve in cases where insertion or translocation events disrupt the
order of repeats in one of the sequences being aligned. Combined, these attributes make traditional
phylogenetic methods for studying protein families unreliable for repeat proteins, due to the dependence
of such methods on accurate sequence alignment.
We present here a practical solution to this problem, making use of graph clustering combined with the
open-source software package HH-suite, which enables highly sensitive detection of sequence relationships.
Carrying out multiple rounds of homology searches via alignment of profile hidden Markov models, large
sets of related proteins are generated. By representing the relationships between proteins in these sets as
graphs, subsequent clustering with the Markov cluster algorithm enables robust detection of repeat protein
subfamilies.
Key words Repeat proteins, Sequence homology, Graph clustering, Profile-HMM alignment, Protein
families, Evolution
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_13, © Springer Science+Business Media, LLC, part of Springer Nature 2019
252 Jonathan N. Wells and Joseph A. Marsh
[Figure 1 graphic. Labels: Poly-X repeats; Collagen G-X-Y repeats (1BKV); And-1 N-terminus WD40 repeats (5GVA). Horizontal axis: repeat unit length (amino acids), with ticks from 0 to 90.]
Fig. 1 Repeat protein classes. Repeat proteins can be roughly categorized according to the length of the
repeated sequence motif. The very simplest repeats are simply long tracts of a single amino acid, most
commonly alanine or glutamine. These are often associated with disease, most famously the poly-Q tracts in
Huntington’s disease, but are nonetheless prevalent and have recently been shown to play an important role in
facilitating rapid protein divergence in eukaryotes [28]. The most diverse classes, both structurally and
functionally, are III and IV, which include families of solenoid proteins such as the HEAT repeat containing
Hawk family [22], of which Pds5B is a member, and ubiquitous domains such as the WD40 beta-propeller
[4]. At the other extreme are proteins such as titin, which comprises hundreds of repeated domains joined by
short linker regions. The method described here is likely to perform best on those proteins in classes III and IV
2 Methods
Fig. 2 Example network showing leucine-rich repeat subfamilies. Starting with leucine-rich repeat protein
1 (LRR1, not shown) to initiate searches, a network showing different subgroups within the large and diverse
LRR family was generated, using the protocol that follows, with an MCL inflation parameter of 3.0 for the final
clustering. Several known families are recapitulated here, most notably the Toll-like receptors, highlighted
top-left. The networks generated are very dense, with the majority of proteins being connected to tens to
hundreds of others. To aid visualization, only those with a reported true positive probability of 1.0 are shown.
Darker edges represent higher mutual ranks
2.2 Generating Homology Networks
Once the required species databases have been built, we are ready to
begin carrying out the sequence searches needed to generate our
network. This can be initiated using either multiple candidate
sequences or just a single representative of the family of interest
2.3 Clustering Networks and Assessing Significance
At this stage, the network is undirected and has been trimmed
down considerably from the size that would otherwise be produced.
Clustering is carried out with the MCL algorithm, which
can be downloaded as a standalone program from Stijn van Dongen's
personal website https://micans.org/mcl/index.html, or as
implemented in other programs such as Cytoscape [27]. A sensible
inflation parameter should be used (roughly speaking, this is a
measure of the granularity of the resulting clusters): I = 2.5 is a
good starting point but may need to be changed depending on the
properties of your network (see Note 4). The workflow leading to
this point is summarized graphically in Fig. 3.
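To make the role of the inflation parameter concrete, the following is a toy, pure-Python sketch of the two alternating steps of Markov clustering (expansion and inflation). It is not the MCL program itself and is far too slow for real networks; the graph, the edge weights, the fixed number of iterations, and the numerical cutoff are assumptions chosen for illustration.

```python
def mcl(edges, inflation=2.5, iterations=50):
    """Toy Markov clustering of an undirected weighted graph {(u, v): weight}."""
    nodes = sorted({n for pair in edges for n in pair})
    idx = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    # Symmetric adjacency matrix with self-loops of weight 1.
    m = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for (u, v), w in edges.items():
        m[idx[u]][idx[v]] = m[idx[v]][idx[u]] = w

    def normalize(mat):
        # Rescale every column so that it sums to 1 (column-stochastic matrix).
        for j in range(size):
            total = sum(mat[i][j] for i in range(size))
            for i in range(size):
                mat[i][j] /= total
        return mat

    m = normalize(m)
    for _ in range(iterations):
        # Expansion: squaring the matrix lets flow spread along longer paths.
        m = [[sum(m[i][k] * m[k][j] for k in range(size)) for j in range(size)]
             for i in range(size)]
        # Inflation: element-wise power plus renormalization favors strong flow.
        m = normalize([[x ** inflation for x in row] for row in m])

    # Rows that keep flow on the diagonal are attractors; the columns that
    # send flow to an attractor form its cluster.
    clusters = {frozenset(nodes[j] for j in range(size) if m[i][j] > 1e-6)
                for i in range(size) if m[i][i] > 1e-6}
    return sorted(sorted(c) for c in clusters)

# Hypothetical network: two triangles joined by one weak edge.
EDGES = {("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "C"): 1.0,
         ("D", "E"): 1.0, ("D", "F"): 1.0, ("E", "F"): 1.0,
         ("C", "D"): 0.1}
clusters = mcl(EDGES, inflation=2.5)
```

Higher inflation sharpens the contrast between strong and weak flow and therefore tends to give more, smaller clusters, which is why the value may need tuning for a given network. For real data, the standalone MCL program or its Cytoscape implementation should be used instead of this sketch.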
Once you have obtained clusters containing your proteins of
interest, their statistical significance can be assessed with permuta-
tion tests of the underlying network. Specifically, the probability
of obtaining a specific cluster by chance can be calculated by
randomizing the ranks of the underlying alignments in each result
file, regenerating the network and clustering it. This is then
repeated as many times as is computationally practical—on a
powerful desktop computer (tested on 8 Intel® Core™
i7-4790K CPUs @ 4.00 GHz), a network containing
[Figure 3 graphic (flowchart). Panels: proteins of interest (Scc3, Ycs4, ...); search for paralogues of starting proteins (example result list for the query Scc3: Ycg1, rank 1; Scc2, rank 2; ...; Xyz1, rank n); search for paralogues of all hits returned; edges weighted by ranks; MCL clustering to identify families; clusters of closely related proteins.]
Fig. 3 Summary of network construction. First choose a small number of proteins to initiate the searches. Use
hhblits and the clustered UniProt database to generate profile HMMs for each sequence, and then use this with
hhsearch to search the species database. Carry out a second round of searches using the results from the first
round as queries in the second. After choosing appropriate thresholds for alignment significance, build a
network using the alignment ranks as edge weights. Simplify this network by removing nodes with degree = 1
and collapsing edge pairs using the geometric mean. Cluster this network using the MCL algorithm to reveal
subfamilies within the larger repeat motif family. Figure adapted from Wells et al. [22]
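The simplification step in this caption can be sketched as follows. The protein names are taken from the figure, but the ranks are hypothetical, and several details are assumptions made for this example: degree is counted over distinct partners regardless of direction, pruning is applied once, and only pairs found in both search directions are kept.

```python
from math import sqrt

# Hypothetical directed hits: (query, hit) -> rank of the hit in the query's results.
RANKS = {
    ("Scc3", "Ycg1"): 1, ("Ycg1", "Scc3"): 4,
    ("Scc3", "Scc2"): 2, ("Scc2", "Scc3"): 2,
    ("Ycg1", "Scc2"): 3, ("Scc2", "Ycg1"): 3,
    ("Scc2", "Xyz1"): 9,
}

def simplify(ranks):
    """Drop degree-1 nodes, then merge reciprocal edges by the geometric mean of ranks."""
    # Degree = number of distinct partners, ignoring edge direction.
    partners = {}
    for a, b in ranks:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    keep = {node for node, others in partners.items() if len(others) > 1}
    # Collapse each reciprocal pair of directed edges into one undirected edge.
    collapsed = {}
    for (a, b), rank in ranks.items():
        if a in keep and b in keep and a < b and (b, a) in ranks:
            collapsed[(a, b)] = sqrt(rank * ranks[(b, a)])
    return collapsed

network = simplify(RANKS)
```

The resulting dictionary is an undirected, weighted edge list that can be written out for a clustering program; lower values correspond to pairs that rank each other highly in both directions.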
3 Discussion
4 Notes
References
13. Söding J, Remmert M, Biegert A, Lupas AN (2006) HHsenser: exhaustive transitive profile search using HMM-HMM comparison. Nucleic Acids Res 34:374–378. https://doi.org/10.1093/nar/gkl195
14. Newman AM, Cooper JB (2007) XSTREAM: a practical algorithm for identification and architecture modeling of tandem repeats in protein sequences. BMC Bioinformatics 8:382. https://doi.org/10.1186/1471-2105-8-382
15. Vo A, Nguyen N, Huang H (2010) Solenoid and non-solenoid protein recognition using stationary wavelet packet transform. Bioinformatics 26:i467–i473. https://doi.org/10.1093/bioinformatics/btq371
16. Szalkowski AM, Anisimova M (2013) Graph-based modeling of tandem repeats improves global multiple sequence alignment. Nucleic Acids Res 41:e162–e162. https://doi.org/10.1093/nar/gkt628
17. Schaper E, Kajava AV, Hauser A, Anisimova M (2012) Repeat or not repeat?--Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Res 40:10005–10017. https://doi.org/10.1093/nar/gks726
18. Söding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21:951–960. https://doi.org/10.1093/bioinformatics/bti125
19. Remmert M, Biegert A, Hauser A, Söding J (2011) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9:173–175. https://doi.org/10.1038/nmeth.1818
20. Van Dongen S (2000) A cluster algorithm for graphs. Rep Inf Syst 10:1–40
21. Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30:1575–1584
22. Wells JN, Gligoris TG, Nasmyth KA, Marsh JA (2017) Evolution of condensin and cohesin complexes driven by replacement of kite by hawk proteins. Curr Biol 27:R17–R18. https://doi.org/10.1016/j.cub.2016.11.050
23. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755–763
24. Viterbi A (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 13:260–269. https://doi.org/10.1109/TIT.1967.1054010
25. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410. https://doi.org/10.1016/S0022-2836(05)80360-2
26. Altschul SF, Madden TL, Schäffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
27. Cline MS, Smoot M, Cerami E et al (2007) Integration of biological networks and gene expression data using Cytoscape. Nat Protoc 2:2366–2382. https://doi.org/10.1038/nprot.2007.324
28. Chavali S, Chavali PL, Chalancon G et al (2017) Constraints and consequences of the emergence of amino acid repeats in eukaryotic proteins. Nat Struct Mol Biol 24:765–777. https://doi.org/10.1038/nsmb.3441
Chapter 14
Abstract
The goal of our research is to increase our understanding of how biology works at the molecular level, with
a particular focus on how enzymes evolve their functions through adaptations to generate new specificities
and mechanisms. FunTree (Sillitoe and Furnham, Nucleic Acids Res 44:D317–D323, 2016) is a resource
that brings together sequence, structure, phylogenetic, and chemical and mechanistic information for 2340
CATH superfamilies (Sillitoe et al., Nucleic Acids Res 43:D376–D381, 2015) (which all contain at least
one enzyme) to allow evolution to be investigated within a structurally defined superfamily.
We will give an overview of FunTree’s use of sequence and structural alignments to cluster proteins
within a superfamily into structurally similar groups (SSGs) and generate phylogenetic trees augmented by
ancestral character estimations (ACE). This core information is supplemented with new measures of
functional similarity (Rahman et al., Nat Methods 11:171–174, 2014) to compare enzyme reactions
based on overall bond changes, reaction centers (the local environment atoms involved in the reaction),
and the structural similarities of the metabolites involved in the reaction. These trees are also decorated with
taxonomic and Enzyme Commission (EC) code and GO annotations, forming the basis of a comprehensive
web interface that can be found at http://www.funtree.info. In this chapter, we will discuss the various
analyses and supporting computational tools in more detail, describing the steps required to extract
information.
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_14, © Springer Science+Business Media, LLC, part of Springer Nature 2019
264 Jonathan D. Tyzack et al.
2 Methods
[Figure 1 diagram. Panel (a): CATH, Cluster Domains, Align Sequences, Display & Visualise. Panel (b):
Data Collection (M-CSA, CATH-Gene3D, PDBSum, ArchSchema, UniProtKB), Data Processing (collate all
structure/sequence/functional annotations, resolving ambiguities), Visualisation (Annotated
Phylogenetic Tree, ArchSchema Graph, Ancestral Character Tree, Ligand Similarity Tree, Annotated
Alignment).]
Fig. 1 The FunTree pipeline. (a) An overview of the workflow for collecting and processing sequence,
structure, and functional information in FunTree. (b) A detailed schematic representation of the various
steps in data collection, processing, and visualization in FunTree
2.1 CATH Superfamily Results Gateway
This page is the gateway for results at the CATH superfamily level for the selected domain (Fig. 2),
where each thumbnail provides a link to a detailed analysis of the selected results. The SSGs within
the superfamily are shown in Clusters with a link to lower level results for that SSG.
2.1.1 Domain Architectures
This page shows an interactive force-directed graph generated by ArchSchema [6] of the multidomain
architectures (MDAs) associated with the current search, with the current domain shown at the center
connected to increasingly more complicated architectures.
Fig. 2 Superfamily gateway. CATH superfamily results for CATH 3.20.20.120 Enolase. Each thumbnail
provides a link to a detailed analysis of the selected results. The SSGs within the superfamily are shown in
Clusters with a link to lower level results for that SSG
1. The colored graph nodes represent the different MDAs and can
be dragged to reorganize the graph. Hovering over the graph
nodes shows the following information for that MDA:
(a) Number of sequences
(b) Number of structures
(c) List of EC codes (annotated by UniProtKB [7])
(d) List of structures
2. The colored domain bars show the domain composition, where
hovering over the bar reveals the domain name and clicking
opens the webpage for that CATH superfamily.
2.1.2 Overview Stats
This page contains a dynamic, interactive plot allowing various properties of CATH superfamilies to
be plotted on two axes (Fig. 3). The properties, which can be plotted on either a linear or log
scale and also used to scale and color the data points, are as follows:
1. Alphabetical order (x-axis only)
2. Average conservation score for each position in the alignment
(ScoreCons) [8] for SSGs
FunTree: Exploring Enzyme Evolution 267
Fig. 3 CATH superfamily Overview Stats. The plot shows the number of sequences on the y-axis against the
number of structures on the x-axis with color representing the number of EC codes and size representing the
number of partial EC codes
2.1.3 EC Wheel
This page shows the EC hierarchy as an unrooted tree with EC codes within the superfamily labelled
outside the wheel. Nodes/leaves for class, sub-class, sub-subclass, and numerical identifier are
highlighted for the enzymes found in the superfamily.
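The four-level EC hierarchy that the wheel highlights can be illustrated with a short Python sketch. This is not FunTree code: the function names are hypothetical, and the handling of partial codes (stopping at the first "-") is an assumption made for illustration.

```python
def ec_levels(ec_code):
    """Expand an EC code into its hierarchy levels.

    '4.2.1.11' gives the class, sub-class, sub-subclass and full code.
    Partial codes such as '3.1.4.-' stop at the first undefined level.
    """
    cleaned = ec_code.replace("EC", "").replace(":", "").strip()
    parts = cleaned.split(".")
    levels = []
    for part in parts:
        if part in ("-", ""):
            break
        levels.append(".".join(parts[:len(levels) + 1]))
    return levels


def highlighted_nodes(ec_codes):
    """Collect every node of the EC hierarchy touched by a set of EC codes."""
    nodes = set()
    for code in ec_codes:
        nodes.update(ec_levels(code))
    return sorted(nodes)
```

Calling `highlighted_nodes` with all EC codes of a superfamily returns the class, sub-class, and sub-subclass nodes that would be marked in addition to the leaves themselves.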
2.1.4 EC-Blast
This page shows the EC classification rendered as a circular rooted tree.
1. Leaves represent EC codes and are colored by primary EC class.
EC codes that are found in the superfamily are pushed out of the
circle and are colored blue.
Fig. 4 SSG (structurally similar group) Gateway: SSG results for CATH 3.20.20.120 SSG1. Each thumbnail
provides a link to a detailed analysis of the selected results. The SSGs within the superfamily are shown in
Clusters with a link to lower level results for that SSG
2.1.5 CATH
This is a link to the CATH page [2] for that superfamily containing further information on structure
and function.
2.2 Structurally Similar Group (SSG) Results Gateway
This page is the gateway for results at the SSG level for the selected domain (Fig. 4). Each
thumbnail provides a link to a detailed analysis of the selected results.
2.2.1 FunTree: Rooted Phylogenetic Tree
This page contains a rooted phylogenetic tree for the SSG selected, with annotations and links
embedded in the nodes and leaves (Fig. 5).
1. Navigation is implemented using the mouse wheel to zoom,
dragging the image to pan, clicking on a node to collapse/
expand that node, clicking on text for links to data sources,
and hovering over text/images for more information.
2. A confidence score is shown at each node of the tree. This is the bootstrap confidence score
provided by TreeBest for the bifurcation at that node. Because these trees are generated
automatically, some bifurcations may have low confidence scores and should be interpreted with
caution.
3. The annotations at the end of each leaf are as follows:
(a) The first number/text section is the node name (internal to
FunTree) made up of a reference and the taxonomic code.
Fig. 5 Rooted phylogenetic tree for SSG1 in CATH 3.20.20.120 Enolase. Each node contains a score that
measures the confidence in the bifurcation. Each leaf contains labels for reaction similarity represented as
green circles, EC code/function (where available), UniProtKB sequence, representative PDB domain (where
available), and a domain bar representing the multidomain architecture (MDA). See Subheading 2.2.1 for
further details
(b) If the leaf represents an enzyme, the next three circles show the similarity between reactions
in the EC code on a bond change, reaction center, and sub-structure basis, respectively. Coloring
is based on the degree of similarity as calculated by EC-Blast [3].
(c) Primary EC code, containing a hyperlink to the IntEnz
database.
(d) UniProtKB identifier, containing a link to the UniProtKB
record.
(e) If the sequence represents a known structural superfamily,
then the PDB (linked to PDBe entry [9]) and CATH
domain (linked to the CATH superfamily page) are shown.
(f) The MDA of the protein at each leaf is depicted showing the domains as uniquely colored bars
along a line, the position and length of which are proportional to the total sequence. Hovering
over each bar shows the CATH superfamily, and clicking navigates to the CATH superfamily page.
2.2.2 Taxa Distribution
The taxa distribution shows the distribution of taxonomic classes within the SSG tree.
1. Hovering over the band reveals the taxonomic lineage (shown
top left) as well as the percentage of sequences in the tree that
belong to that group.
2.2.4 Reaction Clustering
This page shows a tree representing the similarities between reactions based on bond changes
calculated by EC-Blast [3], where the clustering is made using the PVClust [10] methods as
implemented in R (Fig. 6).
1. The tree can be zoomed using the mouse wheel or moved/
panned by dragging the image.
2. Each leaf shows a schematic of the reaction with color coding
highlighting the atoms that are involved in the reaction.
Fig. 6 Reaction Clustering for SSG1 in CATH 3.20.20.120 Enolase. A tree representing the similarities between
reactions based on bond changes calculated by EC-Blast, where the clustering is made using the PVClust
methods as implemented in R
2.2.5 GO Clustering
This page shows a tree representing the similarities between GO annotations using a semantic
similarity score.
1. The tree can be zoomed using the mouse wheel or moved/
panned by dragging the image.
2.2.6 Ligand Clustering
This page shows a similarity tree of all the small molecules found in all the reactions in the SSG.
The similarities are calculated using SMSD [11], and the clustering is made using the PVClust
methods as implemented in R.
1. The tree can be zoomed using the mouse wheel or moved/
panned by dragging the image.
2. By hovering over the leaves of the tree, the reaction is displayed,
and the other ligands in the reaction are highlighted.
2.2.7 EC Wheel
The functionality is as described in Subheading 2.1.3 but for data at the SSG level.
2.2.8 Annotated Alignment
This page shows the multiple sequence alignment that was used to build the phylogenetic tree,
displayed using the BioJS [12] module (Fig. 7). The sequences in the alignment are annotated with
secondary structure where available and with catalytic residues as catalogued in the M-CSA [13]
(bright red if from the curated M-CSA, light red if from the predicted M-CSA).
2.2.9 Overview Stats
The functionality is as described in Subheading 2.1.2 but for data at the SSG level.
2.3 Examples of the Application of FunTree
As FunTree [1] holds data across many domain superfamilies, it is possible to use FunTree to make
large-scale general observations about how enzymes have evolved their function [14]. These
observations can be made at the domain and residue level, exploring how function is modulated via
the addition/removal of domains within a multi-domain architecture or adaptations of the
catalytic/binding pocket. This allows analyses to be prepared comparing the number and types of
evolutionary steps observed within domain superfamilies [15].
Furthermore, detailed analysis within a single superfamily or for a specific enzyme can be
undertaken. An example of this is the evolution of functionality within the
phosphatidylinositol-phosphodiesterase superfamily (CATH 3.20.20.190), which is summarized briefly
here but discussed more comprehensively in reference [16]. This superfamily shows relatively high
structural conservation, presenting just one structurally similar group, but the phylogenetic tree
generated within FunTree reveals three clades (see Fig. 8). Clades C1 and C3 show hydrolase
activity (EC: 3.1.4) using a metal cofactor, whereas Clade C2 exhibits a transition to lyase
activity (EC: 4.6.1). The structure-informed sequence alignment reveals that none of the three
metal-chelating residues are conserved in Clade C2, so that a metal is no longer bound. As a
result, the cyclic intermediate leaves the active site prior to hydrolysis, giving the change from
hydrolase to lyase functionality. The mechanistic changes that give rise to this change in
functionality can be explored further using the Mechanism and Catalytic Site Atlas (M-CSA [13],
formerly called MACiE [17] and CSA [18]).
FunTree is an important resource providing a comprehensive analysis of the evolution of enzyme
functionality within structurally similar subdivisions of CATH superfamilies. Not only will this
improve our understanding of the link between enzyme structure and function but, coupled with
FunTree’s various supporting analyses such as structural alignments and measures of molecular
Fig. 8 Summary of phylogenetic, functional, metabolite, and multidomain architectures for the
phosphatidylinositol-phosphodiesterase superfamily (3.20.20.190) [16]. This shows a diagrammatic
representation of the FunTree phylogenetic tree with associated functional data and multidomain
architectures. Domain 3.20.20.190 performs all molecular functionality and is represented in green
in the multidomain architecture analysis. Three major clades (C1–C3) are highlighted. Within the
first group, a number of functional sub-groups can be observed, with differences in function
defined by changes in substrate or product formed
References
Abstract
Evolutionary domains are protein regions with observable sequence similarity to other known domains.
Here we describe how to use common sequence and profile alignment algorithms (i.e., BLAST, HHsearch)
to delineate putative domains in novel protein sequences, given a reference library of protein domains. In
this case, we use our database of evolutionary domains (ECOD) as a reference, but other domain sequence
libraries could be used (e.g., SCOP, CATH). We describe our domain partition algorithm along with
specific notes on how to avoid domain indexing errors when working with multiple data sources and
software algorithms with differing outputs.
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_15, © Springer Science+Business Media, LLC, part of Springer Nature 2019
278 Dustin Schaeffer and Nick V. Grishin
2 Materials
3 Methods
3.1 Preparation of Domain and Protein Sequence Databases
Given a set of domain sequence ranges, generate a set of domain sequences in FASTA format. In this
case we will translate modified or unnatural residues to unknown residues, although it is possible
in some cases to identify parent amino acids and translate accordingly (see Note 1). If prepared
sequence databases are used, this step can be skipped. This protocol assumes basic scripting
knowledge (e.g., Python, Perl, or Ruby) and the ability to parse and write structured data formats
(e.g., XML, mmCIF) [12]. We will use a sample workspace as illustrated in Fig. 1; directory
structures can be adapted for individual computing needs and infrastructures. The overall workflow
of this domain partition is illustrated in Fig. 2. The workflow presented here assumes the sequence
inputs are sourced from PDB structure depositions.
1. Download the PDBML no-atom headers and place them into the workspace (see Note 2):
(a) /data/pdb
2. Download the ECOD domain description file and place it into the workspace:
(a) /work/domain_search
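The generation of domain sequences from sequence ranges can be sketched in Python as follows. This is a minimal illustration, not the ECOD pipeline: the range syntax ("1-95,140-180", read as 1-based inclusive positions in the chain sequence rather than PDB residue numbers) and all function names are assumptions, and any residue outside the 20 standard amino acids is simply written as the unknown residue X.

```python
# Assumption: the 20 standard amino acids; anything else becomes "X".
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def parse_ranges(range_text):
    """Parse '1-95,140-180' into [(1, 95), (140, 180)] (1-based, inclusive)."""
    ranges = []
    for part in range_text.split(","):
        start, end = part.strip().split("-")
        ranges.append((int(start), int(end)))
    return ranges


def domain_sequence(chain_sequence, range_text):
    """Cut the domain out of the chain and mask non-standard residues as X."""
    pieces = [chain_sequence[start - 1:end] for start, end in parse_ranges(range_text)]
    joined = "".join(pieces).upper()
    return "".join(res if res in STANDARD_RESIDUES else "X" for res in joined)


def to_fasta(identifier, sequence, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [">" + identifier]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

A discontinuous domain is handled by concatenating its segments in the order given, which is one reason why residue indices must be tracked carefully in the later alignment steps.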
Iterative Domain Sequence Alignment 279
Fig. 1 A sample workspace for domain partition. We delineate directories for storage of external databases
(top left), the reference domain database against which we search (top right), necessary downloadable
software programs (bottom left), and the contents of the search directory for a chain A found in the PDB
deposition 5XCT
Fig. 2 Workflow for domain partition by iterative sequence alignment. Briefly, the workflow can be split into
three large components. The search databases and the query workspace are prepared based on your domain
definitions and external protein database (left). Alignments are generated for the query proteins against the
reference databases, and the subsequent alignments are post-processed into structured data files containing
only well-covered hits (middle). Well-covered hits are used to iteratively assign and partition domain
boundaries using protein-protein sequence hits with the highest precedence and domain-domain profile
hits with the lowest (right)
3.3 Preparation of HHsearch Reference Profile Database
This step can be omitted if you have chosen to use a pre-generated profile database. Select a subset
of your original sequence database that is more sparsely populated. We will use the ECOD F40
representatives in our example (see Note 4).
1. For each reference domain sequence, generate a reference sequence profile using HHblits queried
against a non-redundant protein sequence database (e.g., UniRef30, nr, RefSeq, etc.):
(a) Use PSIPRED secondary structure prediction to aid with HHsearch alignments.
(b) HHblits can be allocated multiple CPUs using the -cpu switch; select a value that is
appropriate for your local computing infrastructure.
(c) We find that three iterations (-n 3) are sufficient to locate close sequence homologs.
(d) /programs/hhsuite/bin/hhblits -i /data/domain_data/e1mppA2/ef1ooA1.fasta \
    -d /data/hhsuite/lib/UniRef30.fa -ohhm /data/domain_data/e1mppA2/e1mppA2.hhm \
    -n 3 -cpu 8 -addss -psipred /programs/psipred -psipred_data /data/psipred
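When profiles are needed for many reference domains, the HHblits call can be driven from a script. The sketch below only assembles the command with the switches shown in this section; the layout of one directory per domain containing <domain>.fasta, as well as the function names, are assumptions for illustration and should be adapted to the local installation.

```python
import subprocess
from pathlib import PurePosixPath


def hhblits_command(domain_id, domain_root="/data/domain_data",
                    database="/data/hhsuite/lib/UniRef30.fa", iterations=3, cpus=8):
    """Assemble the HHblits argument list for one reference domain."""
    # Assumption: each domain lives in <domain_root>/<domain_id>/<domain_id>.fasta
    domain_dir = PurePosixPath(domain_root) / domain_id
    return [
        "/programs/hhsuite/bin/hhblits",
        "-i", str(domain_dir / (domain_id + ".fasta")),
        "-d", database,
        "-ohhm", str(domain_dir / (domain_id + ".hhm")),
        "-n", str(iterations),
        "-cpu", str(cpus),
        "-addss",
        "-psipred", "/programs/psipred",
        "-psipred_data", "/data/psipred",
    ]


def build_profiles(domain_ids, dry_run=True):
    """Print (dry run) or execute the HHblits command for each domain."""
    commands = [hhblits_command(domain_id) for domain_id in domain_ids]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands
```

Running with `dry_run=True` first makes it easy to inspect the generated commands before committing CPU time to a full library build.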
3.4 Prepare Query Workspace
1. Download or prepare the set of query sequences. We will demonstrate with a recent week of PDB
depositions (20170929) and a single protein (5xct_B). If a set of query sequences is highly
redundant, it is appropriate to cluster the set using CD-HIT or blastclust to reduce the size of
the search set [13].
2. Create a subdirectory for each query sequence:
(a) /work/domain_sequences/20170929/5xct_B
3. Distribute a FASTA file for each query sequence into each subdirectory:
(a) /work/domain_sequences/20170929/5xct_B/5xct_B.fa
4. Create a sequence profile for each query sequence using HHblits:
(a) /work/domain_sequences/20170929/5xct_B/5xct_B.hhm
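Steps 2 and 3 above can be sketched as a small Python routine that splits a multi-sequence FASTA text into one subdirectory and one FASTA file per query. The function names are hypothetical, and the sketch assumes that the first word of each FASTA header is a usable identifier such as 5xct_B.

```python
from pathlib import Path


def read_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the first word of the header is the query identifier.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def distribute_queries(fasta_text, week_dir):
    """Create <week_dir>/<query>/<query>.fa for every query sequence."""
    week_dir = Path(week_dir)
    written = []
    for name, sequence in read_fasta(fasta_text):
        query_dir = week_dir / name
        query_dir.mkdir(parents=True, exist_ok=True)
        fasta_path = query_dir / (name + ".fa")
        fasta_path.write_text(">" + name + "\n" + sequence + "\n")
        written.append(fasta_path)
    return written
```

For the example in this section, `week_dir` would be /work/domain_sequences/20170929.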
3.5 Performing BLAST and HHsearch Queries
1. For each query FASTA sequence, perform a BLAST search against both reference sequence BLAST
databases (protein and domain) using the XML output format (-outfmt 5):
(a) /programs/blast/bin/blastp -query /work/domain_sequences/20170929/5xct_B/5xct_B.fa \
    -db /work/domain_search/protein_ref_seq.fa -outfmt 5 -num_alignments 5000 -evalue 0.002 \
    > /work/domain_sequences/20170929/5xct_B/5xct_B.protein_blast.xml
(b) /programs/blast/bin/blastp -query /work/domain_sequences/20170929/5xct_B/5xct_B.fa \
    -db /work/domain_search/domain_ref_seq.fa -outfmt 5 -num_alignments 5000 -evalue 0.002 \
    > /work/domain_sequences/20170929/5xct_B/5xct_B.domain_blast.xml
2. For each query HMM, perform an HHsearch against the reference sequence profile database:
(a) /programs/hhsuite/bin/hhsearch -i /work/domain_sequences/20170929/5xct_B/5xct_B.hhm \
    -d /work/domain_search/domain_ref_seq_40.hhm \
    -o /work/domain_sequences/20170929/5xct_B/5xct_B.hh_result -cpu 8
3.6 Collate HHsearch Output to Structured Data Format
To better work with HHsearch results, it is convenient to parse them to a structured data format,
so that inconsistencies and errors in batch jobs can be identified early in the process. It is
possible, but not recommended, to interpret results directly from the standard HHsearch result
format.
1. Locate the completed HHsearch outputs from the query
sequence workspace.
2. Record the hit number, HH probability of homology (%), and
HH E-value from the HH summary result block.
3. Using the original domain reference ranges, map aligned positions in hit alignments to the
original domain residue indices (see Note 3).
4. Convert alignments into ranges of aligned residue indices from
both the query sequence and the reference sequence.
5. Calculate the residue coverage of the reference alignment over
the reference domain sequence.
6. For hits with more than 70% of the reference domain aligned to the query, deposit the query
aligned range, the reference aligned range, the HH probability of homology, the HH E-value, and
the coverage of the template sequence into a structured data format in the query workspace
directory.
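Steps 4 to 6 can be sketched as follows. The sketch starts from lists of aligned residue indices, so it deliberately leaves out the parsing of the HHsearch result file itself; the function names and the dictionary keys are assumptions chosen for illustration.

```python
def indices_to_ranges(indices):
    """Collapse residue indices into inclusive ranges, e.g. [1, 2, 3, 7, 8] -> [(1, 3), (7, 8)]."""
    ranges = []
    for index in sorted(set(indices)):
        if ranges and index == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], index)
        else:
            ranges.append((index, index))
    return ranges


def reference_coverage(aligned_reference_indices, reference_length):
    """Fraction of the reference domain that is covered by the alignment."""
    return len(set(aligned_reference_indices)) / reference_length


def collate_hit(hit_num, hh_prob, hh_evalue, query_indices, reference_indices,
                reference_length, min_coverage=0.7):
    """Return a structured record for a well-covered hit, or None otherwise."""
    coverage = reference_coverage(reference_indices, reference_length)
    if coverage <= min_coverage:  # keep only hits with MORE than 70% coverage
        return None
    return {
        "hit_num": hit_num,
        "hh_prob": hh_prob,
        "hh_evalue": hh_evalue,
        "query_range": indices_to_ranges(query_indices),
        "reference_range": indices_to_ranges(reference_indices),
        "reference_coverage": coverage,
    }
```

The returned dictionaries can be serialized to whichever structured format (for example XML or JSON) is used in the query workspace.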
3.7 Collate Protein BLAST Hits
In Subheading 3.5, we chose an XML format for BLAST output. Data locations for BLAST results are
presented below as XPath expressions.
1. For each protein BLAST query result, record the following:
(a) Database used (//BlastOutput/BlastOutput_db)
(b) Query submitted (//BlastOutput/BlastOutput_query-def)
(c) Query length (//BlastOutput/BlastOutput_query-len)
2. Iterate over the protein BLAST hits and their high-scoring segment pairs (HSPs) and determine
whether the hit is of sufficient quality for further consideration.
3. For each hit, record the hit number (Hit/Hit_num), the hit length (Hit/Hit_len), and the hit
definition (Hit/Hit_def). For protein queries conducted against a protein reference database
containing a set of PDB chains, the hit definition is a four-character PDB identifier and a chain
identifier of up to four characters.
4. For each high-scoring segment pair (Hit_hsps/Hsp), record the HSP E-value (Hsp/Hsp_evalue) and
generate the aligned range for the query sequence (Hsp_query-from .. Hsp_query-to) as well as the
aligned range for the reference sequence (Hsp_hit-from .. Hsp_hit-to).
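A minimal sketch of this collation with the Python standard library is shown below. Element names are written as they appear in NCBI BLAST+ XML output (for example Hit_num and Hsp_query-from); the function name and the keys of the returned dictionaries are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def parse_blast_xml(xml_text):
    """Extract query information, hits and HSP ranges from BLAST XML (-outfmt 5)."""
    root = ET.fromstring(xml_text)  # root element is <BlastOutput>
    result = {
        "db": root.findtext("BlastOutput_db"),
        "query_def": root.findtext("BlastOutput_query-def"),
        "query_len": int(root.findtext("BlastOutput_query-len")),
        "hits": [],
    }
    for hit in root.iter("Hit"):
        hsps = []
        for hsp in hit.iter("Hsp"):
            hsps.append({
                "evalue": float(hsp.findtext("Hsp_evalue")),
                "query_range": (int(hsp.findtext("Hsp_query-from")),
                                int(hsp.findtext("Hsp_query-to"))),
                "hit_range": (int(hsp.findtext("Hsp_hit-from")),
                              int(hsp.findtext("Hsp_hit-to"))),
            })
        result["hits"].append({
            "num": int(hit.findtext("Hit_num")),
            "def": hit.findtext("Hit_def"),
            "len": int(hit.findtext("Hit_len")),
            "hsps": hsps,
        })
    return result
```

For a result file on disk, the text can be read first and passed to `parse_blast_xml`; very large result files may be better served by an incremental parser.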
3.8 Collate Domain BLAST Hits
Collation of domain BLAST hits is similar to that of protein BLAST, with some small modifications
to tighten constraints on HSP overlaps arising from discontinuous domains.
1. As in protein BLAST, record the database used, the query
submitted, and the query length.
2. For each domain hit, record the reference domain identifier (Hit/Hit_def), the hit length
(Hit/Hit_len), and the hit number (Hit/Hit_num).
3. For each HSP in a hit, allow no more than five residues of overlap between query aligned
residues and no more than ten residues of overlap between reference aligned residues. HSPs must
have an E-value lower than 2e-3 to be accepted. The total coverage of aligned reference residues
over the reference sequence must be 70% or greater for the hit to be accepted.
4. Collect the protein BLAST, domain BLAST, and domain HHsearch results into a single structured
data format, where each method contains a list of hits with the respective query aligned range,
reference aligned range, reference aligned coverage, and quality score (E-value for BLAST, HH
probability of homology for HHsearch) associated with each hit.
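The HSP constraints in step 3 can be sketched as below. The thresholds are those given in the text; taking HSPs greedily in order of increasing E-value, as well as the function and key names, are assumptions of this sketch rather than a description of the authors' implementation.

```python
def overlap(range_a, range_b):
    """Number of positions shared by two inclusive ranges."""
    return max(0, min(range_a[1], range_b[1]) - max(range_a[0], range_b[0]) + 1)


def accept_domain_hit(hsps, reference_length, max_evalue=2e-3,
                      max_query_overlap=5, max_reference_overlap=10,
                      min_coverage=0.7):
    """Filter the HSPs of one domain hit and test the reference coverage.

    Each HSP is a dict with 'evalue', 'query' and 'reference' (inclusive ranges).
    Returns (accepted_hsps, coverage); accepted_hsps is empty if the hit fails.
    """
    accepted = []
    # Assumption: consider the best-scoring (lowest E-value) HSPs first.
    for hsp in sorted(hsps, key=lambda h: h["evalue"]):
        if hsp["evalue"] >= max_evalue:
            continue
        clashes = any(
            overlap(hsp["query"], kept["query"]) > max_query_overlap
            or overlap(hsp["reference"], kept["reference"]) > max_reference_overlap
            for kept in accepted
        )
        if not clashes:
            accepted.append(hsp)
    covered = set()
    for hsp in accepted:
        covered.update(range(hsp["reference"][0], hsp["reference"][1] + 1))
    coverage = len(covered) / reference_length
    if coverage < min_coverage:
        return [], coverage
    return accepted, coverage
```

Allowing several non-clashing HSPs per hit is what lets a discontinuous domain reach the 70% coverage requirement even though no single HSP spans it.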
3.9 Domain Partition by Iteration Over Alignments
Given a set of well-scoring hits to protein sequences, domain sequences, and domain profiles, we
are prepared to partition the query sequence into domains.
1. For a query protein, process alignments in the following order: protein sequence hits, domain
sequence hits, and domain profile hits. If either less than 5% of the query sequence is unassigned
or fewer than ten residues remain unassigned, the partition is complete.
2. For each protein sequence alignment, if at least 90% of the query aligned residues have not been
assigned and less than 5% of the query aligned residues have already been assigned to a previous
putative domain, then define domains based on this protein sequence alignment.
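The iteration order and the stopping rule can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each hit carries a single inclusive query range, a hit is used when at least 90% of its query aligned residues are still unassigned, and one hit yields one putative domain (for protein hits, the real procedure would transfer the domain definitions of the reference chain through the alignment). The method keys and function name are hypothetical.

```python
# Precedence of the alignment methods, highest first.
METHOD_ORDER = ("protein_blast", "domain_blast", "hhsearch")


def partition_query(query_length, hits_by_method, min_new_fraction=0.9):
    """Assign query residues to putative domains by iterating over hits."""
    assigned = set()
    domains = []
    for method in METHOD_ORDER:
        for hit in hits_by_method.get(method, []):
            unassigned = query_length - len(assigned)
            # Stop when less than 5% of the query or fewer than ten residues remain.
            if unassigned < 0.05 * query_length or unassigned < 10:
                return domains
            start, end = hit["query_range"]
            residues = set(range(start, end + 1))
            new_residues = residues - assigned
            # Use the hit only if it mostly covers residues not yet assigned.
            if len(new_residues) >= min_new_fraction * len(residues):
                domains.append({
                    "method": method,
                    "reference": hit["reference"],
                    "residues": sorted(new_residues),
                })
                assigned |= new_residues
    return domains
```

Because hits are consumed in order of method precedence, a region already explained by a protein or domain BLAST hit is never reassigned by a weaker profile hit.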
Fig. 3 An example domain partition using multiple aligners. A domain partition using iterative sequence
alignment of a fusion structure of Fv and MST1 coiled coil (5xct_B). This novel domain architecture (a) was
partitioned into its components by hits against a Fv Ig beta-sandwich domain (b) (1mfa_L) and a coiled-coil
domain (c) from MST1 kinase (2jo8_B)
4 Notes
Acknowledgments
References
1. Söding J, Lupas AN (2003) More than the sum of their parts: on the evolution of proteins from
peptides. BioEssays 25(9):837–846
2. Leipe DD, Aravind L, Grishin NV, Koonin EV (2000) The bacterial replicative helicase DnaB
evolved from a RecA duplication. Genome Res 10(1):5–16
3. Tyzack JD, Furnham N, Sillitoe I, Orengo CM, Thornton JM (2017) Understanding enzyme function
evolution from a computational perspective. Curr Opin Struct Biol 47(Suppl C):131–139.
https://doi.org/10.1016/j.sbi.2017.08.003
4. Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, Shi S, Kim BH, Grishin NV (2014) ECOD: an
evolutionary classification of protein domains. PLoS Comput Biol 10(12):e1003926.
https://doi.org/10.1371/journal.pcbi.1003926
5. Song N, Sedgewick RD, Durand D (2007) Domain architecture comparison for multidomain homology
identification. J Comput Biol 14(4):496–516. https://doi.org/10.1089/cmb.2007.A009
6. Holland TA, Veretnik S, Shindyalov IN, Bourne PE (2006) Partitioning protein structures into
domains: why is it so difficult? J Mol Biol 361(3):562–590.
https://doi.org/10.1016/j.jmb.2006.05.060
7. Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG (2008) Data
growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36(Database
issue):D419–D425
8. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST
and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res
25(17):3389–3402
9. Söding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21(7):951–960.
https://doi.org/10.1093/bioinformatics/bti125
10. Remmert M, Biegert A, Hauser A, Söding J (2011) HHblits: lightning-fast iterative protein
sequence searching by HMM-HMM alignment. Nat Methods 9:173. https://doi.org/10.1038/nmeth.1818
11. Cheng H, Liao Y, Schaeffer RD, Grishin NV (2015) Manual classification strategies in the ECOD
database. Proteins 83(7):1238–1251. https://doi.org/10.1002/prot.24818
12. Westbrook J, Ito N, Nakamura H, Henrick K, Berman HM (2005) PDBML: the representation of
archival macromolecular structure data in XML. Bioinformatics 21(7):988–992.
https://doi.org/10.1093/bioinformatics/bti082
13. Fu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: accelerated for clustering the next-generation
sequencing data. Bioinformatics 28(23):3150–3152. https://doi.org/10.1093/bioinformatics/bts565
Chapter 16
Abstract
Protein domains are reusable segments of proteins and play an important role in protein evolution. By
combining the elements from a relatively small set of domains into unique arrangements, a large number of
distinct proteins can be generated. Since domains often have specific functions, changes in their
arrangement usually affect the overall protein function. Furthermore, domains are well amenable to computational
representations, e.g., by Hidden Markov Models (HMMs), and these HMMs are widely represented in
various databases. Therefore, domains can be efficiently used for proteomic analyses. Here, we describe how
domains are annotated using different domain databases and then how to assess the annotation quality of
proteomes. We next show how functional annotations of domains in large-scale data such as whole genomes
or transcriptomes can be used to analyze molecular differences between species. Furthermore, we describe
methods to analyze the changes in domain content of proteins which significantly helps to characterize and
reconstruct the modular evolution of proteins. Altogether, domain-based methods offer a computationally
highly effective approach to analyze large amounts of proteomic data in an evolutionary setting.
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_16, © Springer Science+Business Media, LLC, part of Springer Nature 2019
288 Carsten Kemena and Erich Bornberg-Bauer
2 Materials
All tools used in the Methods section are freely available. Below is a list of the programs used,
with a short description of their purpose.
A Roadmap to Domain Based Proteomics 289
2.1 Databases
• Gene Ontology: Gene Ontology (GO) [9] is an effort to provide a vocabulary to represent
biological functions. Website: http://geneontology.org.
• InterPro: InterPro [12] is a meta-domain database. It contains domains from 14 databases and
groups identical domains from different databases into the same InterPro ID. Website:
http://www.ebi.ac.uk/interpro/.
• Pfam: Pfam [13] is a database of domains, with about 16,700 domains (version 31). The domains
are based on sequence conservation and are clustered into clans based on similarity of either
sequence or structure. For each domain family an e-value threshold is defined to separate random
hits from real domain instance occurrences. Website: http://pfam.xfam.org/.
Besides databases, several programs are needed to analyze the data. An overview can be found in
Table 1.
Table 1
List of software programs needed

Program        Description
BioBundle      A small collection of programs we use to prepare the data.
               Website: https://github.com/CarstenK/BioBundle
DAMA           DAMA [14] annotates sequences with Pfam domains. The results are based on an
               existing (e.g., HMMER) annotation that is then improved by using different filter
               criteria. Website: http://www.lcqb.upmc.fr/DAMA/
DOGMA          DOGMA can assess the quality of proteomes and transcriptomes based on the
               occurrences of domains.
               Website: http://domainworld.uni-muenster.de/programs/dogma/
DomRates       Program to trace evolutionary changes in domain arrangements.
               Website: http://domainworld.uni-muenster.de/programs/domrates/
gffread        gffread is part of the cufflinks package [15] and is used to extract protein
               sequences from a genome based on a GFF file.
               Website: http://cole-trapnell-lab.github.io/cufflinks/install/
HMMER          HMMER is a program suite containing programs to construct sequence HMMs of, e.g.,
               domains. These HMMs can then be used in searches for further matches in other
               sequences. Website: http://hmmer.org/
InterProScan   Program to annotate proteins with domains contained in the InterPro domain
               database. Website: http://www.ebi.ac.uk/interpro/download.html
PfamScan       The database Pfam [13] provides software to annotate sequences with Pfam domains.
               The software as well as the domain database are needed to annotate sequences. It
               uses the HMMER program suite to find domain matches and then uses the Pfam e-value
               thresholds to filter out overlaps and spurious hits.
               Website: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/
RADIANT        A fast domain annotation program. Used here together with DOGMA for a fast quality
               assessment. Website: http://domainworld.uni-muenster.de/programs/radiant/
3 Methods
[Figure 1 diagram. Inputs: proteome, genome + GFF, genes of interest. Data preparation: protein
extraction (gffread), stop and pseudogene removal (stopCleaner), quality check (DOGMA + RADIANT).
Domain annotation: hmmscan, PfamScan, InterProScan (+ DAMA).]
Fig. 1 Workflow of a domain-based proteome analysis. The steps “data preparation,” “domain annotation,”
and the analysis itself are covered
3.1 Preparing a Data set for Domain Annotation and Subsequent Domain-Based Analyses
Proteomes for the species to be analyzed can be found in publicly available databases, e.g., on
general portals (e.g., NCBI [16] or Ensembl [17]) or on more specialized websites for certain
species groups (e.g., Hymenoptera genomes [18]) or single species. The simplest way to obtain a
proteome set is to download the proteome directly, but sometimes only a genome and a GFF file are
available. In this case gffread can extract the mRNAs from the genome and translate them into
proteins.
It is important to make sure that the gene annotation version matches the genome
version. Otherwise the protein extraction might fail or, in the worst case, extract
incorrect proteins due to shifts in coordinates. Even if the versions match, problems
can occur. For example, the same gene annotation may be present twice (with the same ID),
or the same ID may have been used for two different genes. In these cases the gene
annotation needs to be fixed manually, either by removing the duplicate annotation (first
case) or by changing the gene ID (second case).
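Both cases described above show up as a gene ID that occurs on more than one gene line. The following is a minimal sketch of such a check, assuming a GFF3-style attribute column with `ID=...` entries and the feature type `gene`; the function name `duplicate_gene_ids` is hypothetical and not part of gffread.

```python
from collections import Counter


def duplicate_gene_ids(gff_lines):
    """Return the sorted gene IDs that occur on more than one 'gene' line."""
    counts = Counter()
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # only well-formed gene features are counted
        # Assumption: GFF3 attributes of the form key=value;key=value
        attrs = dict(
            field.strip().split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        gene_id = attrs.get("ID")
        if gene_id is not None:
            counts[gene_id] += 1
    return sorted(gene_id for gene_id, n in counts.items() if n > 1)
```

Whether a reported ID is an identical duplicate or a reused ID for a different gene still has to be decided by looking at the coordinates of the affected lines.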
In other cases, e.g., if scaffolds are missing from the genome file, the providers of
the GFF/genome need to be contacted and asked for a correction. Sometimes the scaffold
names carry a prefix in either the GFF or the genome sequence file but not in both. The
solution is simply to remove the prefix (or add it to the other file).
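Missing scaffolds and prefix mismatches can both be spotted by comparing the sequence names of the two files before running the extraction. The sketch below assumes that FASTA names end at the first whitespace of the header line and that the first GFF column holds the scaffold name; the function names are hypothetical, and the suffix-based hint is only a heuristic.

```python
def fasta_names(fasta_lines):
    """Sequence names from FASTA headers (text up to the first whitespace)."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def gff_seqids(gff_lines):
    """Scaffold names used in the first column of a GFF file."""
    return {
        line.split("\t", 1)[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }


def compare_scaffold_names(fasta_lines, gff_lines):
    """Return (GFF scaffolds absent from the genome, likely prefix pairs)."""
    genome = fasta_names(fasta_lines)
    annotation = gff_seqids(gff_lines)
    missing = sorted(annotation - genome)
    hints = []
    for name in missing:
        for candidate in genome:
            # Heuristic: one name is the other plus a leading prefix
            if name != candidate and (
                name.endswith(candidate) or candidate.endswith(name)
            ):
                hints.append((name, candidate))
    return missing, sorted(hints)
```

Names that are reported as missing without any hint are candidates for scaffolds that are truly absent from the genome file.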
The first time gffread is run on a genome file, it creates an index file
(<genome file>.fai) that contains the names and positions of the scaffolds for faster
access. This file is not regenerated automatically when the genome file is changed. It is
therefore important to delete the index file after manually changing the genome, as
otherwise gffread will not recalculate the index.
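Deleting the outdated index can be automated. The following sketch removes `<genome file>.fai` whenever it is older than the genome file itself; it relies only on file modification times, and the function name `remove_stale_index` is hypothetical.

```python
import os


def remove_stale_index(genome_path):
    """Delete '<genome_path>.fai' if it is older than the genome file.

    Returns True if an index file was removed, False otherwise.
    """
    index_path = genome_path + ".fai"
    if not os.path.exists(index_path):
        return False
    # A modified genome has a newer timestamp than the index built from it
    if os.path.getmtime(index_path) < os.path.getmtime(genome_path):
        os.remove(index_path)
        return True
    return False
```

Calling this function before each gffread run leaves an up-to-date index untouched and forces a rebuild only after the genome has been edited.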
3.1.1 Annotating Sequences with Domains
The first step in annotating the sequences with domains is to decide which database to
use, as many different ones exist. They differ in the number of domains they contain and
in the way they define them (e.g., more structure based or more sequence based). Here, for
demonstration purposes, we use the Pfam and the InterPro databases and apply them to the
prepared file from the previous section. It contains 17,146 sequences that will be
annotated with domains such that we can perform a domain-based functional enrichment or
rearrangement analysis in the next step.
3.2 DomRates: Analyzing Domain Arrangement Changes Along a Phylogenetic Tree
The analysis of domain arrangement changes can provide insights into the kind of events
that were important for a new species. We trace back domain arrangement changes using the
DomRates program. Based on a domain annotation and a given phylogeny, it is able to
reconstruct the events that led to the extant species in the data set.
We will analyze a small set of hymenopterans. For each species we prepare the proteome
as described above and put the final domain annotations together in one folder. The set
additionally includes an outgroup (Drosophila melanogaster) to reconstruct the ancient
state at the root of the hymenopteran branch. Furthermore, a phylogenetic tree in Newick
format is needed. The labels in the tree correspond to the domain file names without a
fixed suffix. For later visualization we will produce a statistics file which will be
used in a subsequent step. With only six species the tree is very short. We therefore use
the “-l” option to adjust the legend.
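Because the tree labels have to match the annotation file names, it can help to verify this correspondence before starting the analysis. The sketch below extracts leaf labels from a Newick string with a simple regular expression (unquoted labels only) and compares them with a list of file names; the suffix `.dom` and the function names are assumptions for illustration, not DomRates conventions.

```python
import re


def newick_leaf_labels(newick):
    """Leaf labels of a Newick tree (unquoted labels only).

    A leaf label directly follows '(' or ','; labels after ')' belong to
    internal nodes and are ignored.
    """
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))


def check_tree_against_files(newick, file_names, suffix):
    """Return (leaves without a file, files without a leaf)."""
    leaves = newick_leaf_labels(newick)
    stems = {
        name[: len(name) - len(suffix)]
        for name in file_names
        if name.endswith(suffix)
    }
    return sorted(leaves - stems), sorted(stems - leaves)
```

Two empty lists mean that every species in the tree has exactly one annotation file and vice versa.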
3.3 Functional Enrichment Analysis Based on Domain Annotations
It is often of great interest whether gene sets (e.g., genes under positive selection)
have a common function, as this can help to find biological explanations. Domains, as
known functional units, combined with a defined biological vocabulary (e.g., Gene
Ontology) can be used to characterize genes with respect to the molecular functions or
biological processes they are involved in or the cellular component they are active in.
The Gene Ontology (GO) consortium provides mappings of GO terms to domains of different
databases. Here, we will use the pfam2go mapping that assigns GO terms to numerous Pfam
domains. The combination of the domain assignments is then used to identify the function
of a protein and to perform enrichment analyses of certain terms in a set of genes. The R
package topGO [23] provides several algorithms and statistical tests that can be used for
the enrichment analysis. It compares the GO terms
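[Figure: phylogenetic tree of the analyzed species (Linepithema humile, Atta cephalotes,
Harpegnathos saltator, Apis mellifera, Nasonia vitripennis, and the outgroup Drosophila
melanogaster) with numeric annotations (395, 133, 457, 143, 164, 327, 427, 706); scale
bar 0.50.]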
The next step is to load the data from the files and prepare it for
the following GO term enrichment analysis.
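The loading step amounts to reading the pfam2go mapping and translating each protein's domain list into GO terms. As an illustration only, here is a minimal Python sketch of that step (the chapter's own workflow uses R and topGO). It assumes pfam2go lines of the form `Pfam:PF00001 7tm_1 > GO:... ; GO:0004930` with comment lines starting with `!`; the function names and the input dictionary layout are hypothetical.

```python
from collections import defaultdict


def parse_pfam2go(lines):
    """Map Pfam accessions to sets of GO identifiers."""
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # header and comment lines
        left, _, right = line.partition(" > ")
        if not right or not left.startswith("Pfam:"):
            continue
        parts = left[len("Pfam:"):].split()
        if not parts:
            continue
        # Assumption: the GO identifier is the text after the last ';'
        go_id = right.rsplit(";", 1)[-1].strip()
        if go_id.startswith("GO:"):
            mapping[parts[0]].add(go_id)
    return mapping


def gene_to_go(gene_domains, pfam2go):
    """Collect the GO terms of all domains found in each gene."""
    result = {}
    for gene, domains in gene_domains.items():
        terms = set()
        for domain in domains:
            accession = domain.split(".")[0]  # drop a version suffix such as '.21'
            terms.update(pfam2go.get(accession, ()))
        result[gene] = sorted(terms)
    return result
```

Genes whose domains have no GO mapping end up with an empty list, which reflects that not every domain has a known function.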
Table 2
Top 10 enriched GO terms based on the “parentChild” algorithm with a Fisher test
The “results.csv” file will now contain a list of all GO terms that
are enriched in the gene set of interest and have a p-value ≤ 0.05.
An example output is shown in Table 2.
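To make the meaning of such a p-value concrete, the sketch below computes a one-sided Fisher (hypergeometric) test for a single GO term. This is the classic per-term test and a simplification: the “parentChild” algorithm additionally takes the parent terms in the GO graph into account, which is not reproduced here. The function name and argument names are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided p-value P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the background, K: background genes with the GO term,
    n: genes in the set of interest, k: genes of interest with the term.
    """
    total = comb(N, n)
    upper = min(n, K)
    # Sum the probabilities of observing k or more annotated genes
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total
```

For example, if 5 of 10 background genes carry a term and all 5 genes of interest carry it, `fisher_enrichment_p(5, 5, 5, 10)` evaluates to 1/252, well below the 0.05 cutoff.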
In this chapter, we gave a basic overview of why the analysis of domains is important.
Additionally, we described the basic methods to prepare and analyze data within a protein
domain context. Domains allow a fast evolutionary analysis of large data sets and, through
GO term assignments, make functional analyses possible as well. However, it is important
to remember that not all domains have a known function and that not all proteins contain
a domain, which might influence the analysis. Additionally, the database used might have a
species bias (e.g., contain a disproportionate number of eukaryotic domains), which will
influence coverage and functional depth of analyses based on such annotations.
Acknowledgements
References
1. Vogel C, Bashton M, Kerrison ND, Chothia C, Teichmann SA (2004) Structure, function and evolution of multidomain proteins. Curr Opin Struct Biol 14(2):208–216
2. Moore AD, Asa KB, Ekman D, Bornberg-Bauer E, Elofsson A (2008) Arrangements in the modular evolution of proteins. Trends Biochem Sci 33(9):444–451
3. Lees JG, Dawson NL, Sillitoe I, Orengo CA (2016) Functional innovation from changes in protein domains and their combinations. Curr Opin Struct Biol 38:44–52
4. Levitt M (2009) Nature of the protein universe. Proc Natl Acad Sci USA 106(27):11079–11084
5. Remmert M, Biegert A, Hauser A, Soding J (2011) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9(2):173–175
6. Moore AD, Grath S, Schüler A, Huylmans AK, Bornberg-Bauer E (2013) Quantification and functional analysis of modular protein evolution in a dense phylogenetic tree. Biochim Biophys Acta Proteins Proteomics 1834(5):898–907
7. Moore AD, Bornberg-Bauer E (2012) The dynamics and evolutionary potential of domain loss and emergence. Mol Biol Evol 29(2):787–796
8. Kersting AR, Bornberg-Bauer E, Moore AD, Grath S (2012) Dynamics and adaptive benefits of protein domain emergence and arrangements during plant genome evolution. Genome Biol Evol 4(3):316–329
9. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–29
10. Sigrist CJA, Castro E, de Cerutti L, Cuche BA, Hulo N, Bridge A, Lydie B, Xenarios I (2013) New and continuing developments at PROSITE. Nucleic Acids Res 41(Database-Issue):344–347
11. Bitard-Feildel T, Heberlein M, Bornberg-Bauer E, Callebaut I (2015) Detection of orphan domains in Drosophila using “hydrophobic cluster analysis”. Biochimie 119:244–253
12. Finn RD, Attwood TK, Babbitt PC, Bateman A, Bork P, Bridge AJ, Chang HY, Dosztanyi Z, El-Gebali S, Fraser M, Gough J, Haft D, Holliday GL, Huang H, Huang X, Letunic I, Lopez R, Lu S, Marchler-Bauer A, Mi H, Mistry J, Natale DA, Necci M, Nuka G, Orengo CA, Park Y, Pesseat S, Piovesan D, Potter SC, Rawlings ND, Redaschi N, Richardson L, Rivoire C, Sangrador-Vegas A, Sigrist C, Sillitoe I, Smithers B, Squizzato S, Sutton G, Thanki N, Thomas PD, Tosatto SC, Wu CH, Xenarios I, Yeh LS, Young SY, Mitchell AL (2017) InterPro in 2017–beyond protein family and domain annotations. Nucleic Acids Res 45(D1):D190–D199
13. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44(D1):D279–D285
14. Bernardes JS, Vieira FR, Zaverucha G, Carbone A (2016) A multi-objective optimization approach accurately resolves protein domain architectures. Bioinformatics 32(3):345–353
15. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28(5):511–515
16. NCBI Resource Coordinators (2017) Database Resources of the National Center for Biotechnology Information. Nucleic Acids Res 45(D1):D12–D17
17. Yates A, Akanni W, Amode MR, Barrell D, Billis K, Carvalho-Silva D, Cummins C, Clapham P, Fitzgerald S, Gil L, Giron CG, Gordon L, Hourlier T, Hunt SE, Janacek SH, Johnson N, Juettemann T, Keenan S, Lavidas I, Martin FJ, Maurel T, McLaren W, Murphy DN, Nag R, Nuhn M, Parker A, Patricio M, Pignatelli M, Rahtz M, Riat HS, Sheppard D, Taylor K, Thormann A, Vullo A, Wilder SP, Zadissa A, Birney E, Harrow J, Muffato M, Perry E, Ruffier M, Spudich G, Trevanion SJ, Cunningham F, Aken BL, Zerbino DR, Flicek P (2016) Ensembl 2016. Nucleic Acids Res 44(D1):D710–D716
18. Elsik CG, Tayal A, Diesh CM, Unni DR, Emery ML, Nguyen HN, Hagen DE (2016) Hymenoptera Genome Database: integrating genome annotations in HymenopteraMine. Nucleic Acids Res 44(D1):793–800
19. Labunskyy VM, Hatfield DL, Gladyshev VN (2014) Selenoproteins: molecular pathways and physiological roles. Physiol Rev 94(3):739–777
20. Dohmen E, Kremer LPM, Bornberg-Bauer E, Kemena C (2016) DOGMA: domain-based transcriptome and proteome quality assessment. Bioinformatics 32(17):2577–2581
21. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31(19):3210–3212
22. Terrapon N, Gascuel O, Marechal E, Breehelin L (2009) Detection of new protein domains using co-occurrence: application to Plasmodium falciparum. Bioinformatics 25(23):3077–3083
23. Alexa A, Rahnenführer J (2016) topGO: enrichment analysis for gene ontology. R package version 2.26.0
Chapter 17
Abstract
Proteins are subject to evolutionary forces that shape their three-dimensional structure to meet specific
functional demands. The knowledge of the structure of a protein is therefore instrumental to gain
information about the molecular basis of its function. However, experimental structure determination is
inherently time consuming and expensive, making it impossible to follow the explosion of sequence data
deriving from genome-scale projects. As a consequence, computational structural modeling techniques
have received much attention and established themselves as a valuable complement to experimental
structural biology efforts. Among these, comparative modeling remains the method of choice to model
the three-dimensional structure of a protein when homology to a protein of known structure can be
detected.
The general strategy consists of using experimentally determined structures of proteins as templates for
the generation of three-dimensional models of related family members (targets) of which the structure is
unknown. This chapter provides a description of the individual steps needed to obtain a comparative model
using SWISS-MODEL, one of the most widely used automated servers for protein structure homology
modeling.
Key words Homology modeling, Oligomeric proteins, Quaternary structure, Protein structure pre-
diction, Model quality assessment, Model quality estimates, SWISS-MODEL
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_17, © Springer Science+Business Media, LLC, part of Springer Nature 2019
302 Gabriel Studer et al.
1.1 Building an Initial Model
In this step, structural information from template residues is transferred to
corresponding target residues as defined by the target-template alignment. Several
algorithms have been developed to accomplish this task, based on different approaches
which are reviewed elsewhere [19]. In SWISS-MODEL this is done by transferring the atomic
coordinates from the corresponding template residues in Cartesian space. ProMod3 aims at
inferring as many atom positions as possible from template structures, depending on the
conservation of the corresponding residues between target and template. This usually
results in an incomplete model with missing side-chain coordinates and gaps originating
from amino acid insertions/deletions.
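The idea of alignment-guided coordinate transfer can be illustrated with a toy sketch. It is not ProMod3's implementation; it only assumes a simple rule in which identical aligned residues inherit all template atoms, substituted residues inherit the backbone atoms only, and inserted target residues receive no coordinates. The data layout (one dictionary of atom name to coordinates per template residue) and the function name are hypothetical.

```python
BACKBONE = ("N", "CA", "C", "O")


def transfer_coordinates(target_aln, template_aln, template_atoms):
    """Copy template coordinates onto target residues along an alignment.

    target_aln, template_aln: gapped sequences of equal length ('-' = gap).
    template_atoms: one {atom_name: (x, y, z)} dict per template residue.
    Returns a list of (target_residue, atoms_or_None) tuples.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    model = []
    template_index = 0
    for tgt, tpl in zip(target_aln, template_aln):
        atoms = None
        if tpl != "-":
            atoms = template_atoms[template_index]
            template_index += 1
        if tgt == "-":
            continue  # deletion in the target: template residue is skipped
        if atoms is None:
            model.append((tgt, None))  # insertion: gap to be modeled later
        elif tgt == tpl:
            model.append((tgt, dict(atoms)))  # conserved: keep all atoms
        else:
            backbone = {a: xyz for a, xyz in atoms.items() if a in BACKBONE}
            model.append((tgt, backbone))  # substituted: backbone only
    return model
```

The entries with `None` and the residues reduced to backbone atoms correspond to the gaps and missing side chains that the subsequent loop and side-chain modeling steps have to fill.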
Modeling Tertiary and Quaternary Structures 305
[Figure 1: flowchart. Nodes: “Antibody?”, “Use automated mode?”, “Inspect templates”,
“Determine quaternary structure”, “Inspect models”, “Quality ok?” (Yes/No),
“Re-evaluate choices”.]
Fig. 1 Flowchart of the comparative protein structure modeling pipeline implemented in SWISS-MODEL
1.2 Loop Modeling
With the possible exception of antibody loops, as discussed later in this chapter,
modeling protein loops is a challenging task and often a major source of modeling errors.
Loop modeling methods can be categorized into two main groups: ab initio and database
approaches [19–22]. ProMod3 uses geometric criteria to query a database containing
high-resolution X-ray structures for matching loop candidates. Candidate loops are fitted
to the loop stems using the cyclic coordinate descent algorithm [23] and scored based on
statistical potentials of mean force [24]. The best candidate is then selected according
to its score and inserted into the model.
1.3 Modeling of Side Chains
To model non-conserved side chains, ProMod3 extracts side-chain conformations from the
Dunbrack rotamer library [25] and determines their optimal conformation by minimizing the
SCWRL4 energy function [26] using a graph-based approach [27].
1.5 Model Quality Estimation
Quality estimation tools aim to quantify modeling errors and give estimates of the
expected model accuracy on both a global and a per-residue scale. From a modeling
perspective, such estimates are useful for selecting the best model from a set of
alternatives or for detecting local errors. But, even more importantly, they aim to
determine the usefulness of a model for a specific application at hand [30, 31]. Various
tools assessing physical plausibility are routinely applied to models based on
experimental data [32]. However, while correct stereochemistry is a necessary condition
for a high-quality model, it is not a sufficient criterion to indicate similarity of a
theoretical model to the native target structure. Knowledge-based approaches with
statistical potentials of mean force [24] constitute a valid complement for estimating
the expected accuracy of a theoretical model. SWISS-MODEL relies on QMEAN [33, 34] to
assign global and per-residue quality estimates. QMEAN linearly combines four statistical
potentials of mean force. Two of them evaluate pairwise distances, the first between all
chemically distinguishable heavy atoms and the second between Cβ atoms. Two more
potentials evaluate backbone torsion angles and packing of the model.
The accuracy of models generated by SWISS-MODEL is continuously assessed by the CAMEO
project [35] based on weekly blind prediction of proteins from the upcoming PDB release.
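The linear combination of potential terms mentioned for QMEAN can be written down in a few lines. The sketch below shows only the general form of such a combined score; the term names, the weights, and the intercept are placeholders chosen for illustration and are not the published QMEAN parameters.

```python
def combined_score(terms, weights, intercept=0.0):
    """Weighted linear combination of individual potential terms.

    terms:   {term_name: value of the statistical potential for a model}
    weights: {term_name: weight}; every weighted term must be present.
    """
    missing = sorted(set(weights) - set(terms))
    if missing:
        raise KeyError("missing potential terms: " + ", ".join(missing))
    return intercept + sum(weights[name] * terms[name] for name in weights)


# Placeholder weights (assumption): equal contribution of the four terms
EXAMPLE_WEIGHTS = {
    "all_atom": 0.25,  # pairwise distances between heavy atoms
    "cbeta": 0.25,     # pairwise distances between C-beta atoms
    "torsion": 0.25,   # backbone torsion angles
    "packing": 0.25,   # packing of the model
}
```

In practice the weights of such a scoring function are fitted on models of known accuracy, so the equal weights above only serve to show the mechanics.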
1.6 Concluding Remarks
The availability of robust, fully automated workflows for protein structure modeling has
made homology modeling the method of choice to reliably generate three-dimensional models
for proteins when experimental structures are not available. Easy-to-use interactive web
servers and reliable model quality estimation tools also allow nonspecialists to
successfully use protein models in structure-based applications in biomedical research.
2 Materials
3 Methods
3.2 Start a New Modeling Project
Depending on the type of information at hand, there are different modes to start a new
modeling project. If only the sequence of the target protein to be modeled is available,
a first step is to search for templates, as described in Subheading 3.2.1 (Sequence
mode). Alternatively, if the model should be based on a specific template
3.2.1 Sequence Mode: Starting from the Sequence of the Target Protein
1. Insert the amino acid sequence of the target protein into the main input box of the
homepage. The sequence can be provided either as plain text or in FASTA format.
Alternatively, the UniProtKB identifier can be used.
2. Press the “Validate” button or the return key. A sequence
validation step is performed to check for nonstandard amino
acid codes and to reformat the input sequence. If the target
UniProtKB identifier is provided as input, the protein sequence
is automatically retrieved and validated. After validation, a
non-editable wrapped view of the target sequence is displayed.
3. If the target protein is heteromeric, i.e., it consists of different
protein chains as subunits, it is possible to enter an additional
amino acid sequence by clicking the “Add Hetero Target”
button. Repeat this step until all subunit sequences have been
entered.
4. The next step is to identify reliable templates to be used for
modeling. Two options are available to perform this task:
(a) Manual template selection: This option allows the user to
inspect the template search results before selecting one or
more template structures for modeling, taking into
account information such as quality of the experimental
structure, oligomeric state, bound ligands, or crystalliza-
tion conditions. To use this option, click the “Search for
Templates” button and proceed to step 3.3.
(b) Automatic template selection: Using this option, when
the template search is complete, templates are ranked
according to the expected quality of the resulting models
and a number of templates are selected automatically. This
option is especially useful for well-characterized protein
families where target-template sequence similarity is
expected to be sufficiently high to automatically generate
unambiguous alignments and high-quality models. Note
that also in this option, the full template search results can
be inspected to select additional template structures in
case the automated modeling results are not satisfactory.
To use this option, click the “Build Model” button and
proceed to step 3.4 to access the modeling results.
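The kind of check performed in the validation step (step 2) can be tried out locally before submitting a sequence. The sketch below accepts plain text or a single FASTA record, reports characters outside the 20 standard one-letter amino acid codes, and returns a wrapped view; the exact rules applied by the server are not described here, so this behavior, the line width, and the function name are assumptions.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def validate_sequence(text, width=60):
    """Clean a plain-text or single-record FASTA sequence.

    Returns (sequence, invalid_characters, wrapped_lines).
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    if lines and lines[0].startswith(">"):
        lines = lines[1:]  # drop the FASTA header
    # Remove all whitespace and normalize to upper case
    sequence = "".join("".join(line.split()) for line in lines).upper()
    invalid = sorted({c for c in sequence if c not in STANDARD_AA})
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return sequence, invalid, wrapped
```

An empty list of invalid characters indicates that the input contains only standard residues; for heteromeric targets the function would be called once per subunit sequence.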
In both cases, as soon as the template search starts, an auto-
matic scanning of the target sequence is performed to verify
whether any immunoglobulin variable domain is present in the
input. If this is the case, the user is provided with a link to the
3.2.4 Project Mode: Using a Specific Three-Dimensional Structure as Template by Manually Adjusting the Target-Template Alignment in the DeepView Desktop Application
The desktop application DeepView (available for Windows and Mac OS) [1] allows for
visualization of one or more template structures, and manual editing of the
target-template alignment. Projects generated in DeepView, or obtained in step 3.4, can
be submitted for modeling after manual manipulation of the target-template alignment.
1. Click the “DeepView Project” button.
2. Upload the DeepView project file.
3. Click the “Build Model” button and proceed to step 3.4.
3.3 Template Identification
The Template Results page provides an overview of the available templates as well as
interactive views and selection tools.
1. From the views below, select one or more template structures
for modeling.
(a) Templates. The main table displays the list of top 50 tem-
plates, ranked according to the expected quality of the
resulting model. The complete list of templates is accessi-
ble by links at the bottom of the Template Results page.
Features such as coverage, model quality estimates
(GMQE and QSQE), oligomeric state, and bound ligands
are shown in a condensed tabular form. Each of the table
rows can be expanded to display additional information
3.4 Accessing Modeling Results
After completion of the modeling process, a detailed report of the modeling project is
generated and can be accessed from the workspace. Model coordinates can be downloaded as
either PDB or DeepView project files.
The generated model(s) can be inspected in the model results
page using the embedded structural viewer. The target-template
alignment used for modeling is also shown and linked to the
structure visualization such that hovering the mouse over the
alignment highlights the corresponding residue in the viewer and
vice versa (Fig. 2).
Fig. 2 Modeling results for the superoxide dismutase [Cu-Zn] protein from S. pombe (SOD1, UniProtKB AC:
P28758) generated in automated mode in SWISS-MODEL. SpSOD1 is predicted as homo-2-mer including 1 Zn
ion and 1 Cu ion per subunit as cofactors based on the experimental structure of the deep-sea yeast
Cryptococcus liquefaciens homologue (SMTL: 3ce1.1.A; [36]) as template
3.5 Model Quality Estimation
By default, models and alignments are colored based on the per-residue quality estimates
from the QMEAN scoring function
[37]. The color gradient ranges from red to blue, indicating low to
high estimated per-residue quality. The same information is also
available for every model in the form of a Local Quality plot, as well
as in the B-factor column of the downloadable PDB file. The
Global Quality plot gives an estimate of the overall model quality,
based on four individual terms: Cβ, all atom, solvation, and torsion.
The QMEAN score is also compared to what one would expect
from experimentally determined protein structures of similar size
using a Z-score scheme (hence 0.0 would be the optimal score).
This is illustrated in the Comparison plot (Fig. 2).
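Since the per-residue estimates are stored in the B-factor column of the downloaded PDB file, they can be read back with a few lines of code. The sketch below relies on the fixed-column layout of PDB ATOM records (atom name in columns 13–16, chain in column 22, residue number in columns 23–26, B-factor in columns 61–66) and takes one value per residue from its Cα atom, assuming that all atoms of a residue carry the same estimate. The function name is hypothetical.

```python
def per_residue_quality(pdb_lines):
    """Read per-residue values from the B-factor column of ATOM records.

    Returns {(chain_id, residue_number): value}, taken from C-alpha atoms.
    """
    quality = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":
            continue  # one value per residue is enough
        chain_id = line[21]
        residue_number = int(line[22:26])
        quality[(chain_id, residue_number)] = float(line[60:66])
    return quality
```

Sorting the resulting dictionary by value is a quick way to list the residues with the lowest estimated local quality.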
4 Notes
References
1. Guex N, Peitsch MC, Schwede T (2009) Automated comparative protein structure modeling with SWISS-MODEL and Swiss-PdbViewer: a historical perspective. Electrophoresis 30 Suppl 1:S162–S173
2. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234:779–815
3. Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5:823–826
4. Arnold K, Bordoli L, Kopp J et al (2006) The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22:195–201
5. Biasini M, Bienert S, Waterhouse A et al (2014) SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res 42:W252–W258
6. Kiefer F, Arnold K, Kunzli M et al (2009) The SWISS-MODEL repository and associated resources. Nucleic Acids Res 37:D387–D392
7. Waterhouse A, Bertoni M, Bienert S et al (2018) SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res 46(W1):W296–W303
8. Kryshtafovych A, Venclovas C, Fidelis K et al (2005) Progress over the first decade of CASP experiments. Proteins 61(Suppl 7):225–236
9. Berman H, Henrick K, Nakamura H et al (2007) The worldwide protein data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res 35:D301–D303
10. Altschul SF, Madden TL, Schaffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
11. Remmert M, Biegert A, Hauser A et al (2011) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9:173–175
12. Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195–202
13. Sillitoe I, Cuff AL, Dessailly BH et al (2013) New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Res 41:D490–D498
14. Aloy P, Ceulemans H, Stark A et al (2003) The relationship between sequence and interaction divergence in proteins. J Mol Biol 332:989–998
15. Bertoni M, Kiefer F, Biasini M et al (2017) Modeling protein quaternary structure of homo- and hetero-oligomers beyond binary interactions by homology. Sci Rep 7:10480
16. Marcatili P, Olimpieri PP, Chailyan A et al (2014) Antibody modeling using the prediction of immunoglobulin structure (PIGS) web server [corrected]. Nat Protoc 9:2771–2783
17. Lepore R, Olimpieri PP, Messih MA et al (2017) PIGSPro: prediction of immunoGlobulin structures v2. Nucleic Acids Res 45:W17
18. Biasini M, Schmidt T, Bienert S et al (2013) OpenStructure: an integrated software framework for computational structural biology. Acta Crystallogr D Biol Crystallogr 69:701–709
19. Fiser A (2010) Template-based protein structure modeling. Methods Mol Biol 673:73–94
20. Choi Y, Deane CM (2010) FREAD revisited: accurate loop structure prediction using a database search algorithm. Proteins 78:1431–1440
21. Liang S, Zhang C, Zhou Y (2014) LEAP: highly accurate prediction of protein loop conformations by integrating coarse-grained sampling and optimized energy scores with all-atom refinement of backbone and side chains. J Comput Chem 35:335–341
22. Messih MA, Lepore R, Tramontano A (2015) LoopIng: a template-based tool for predicting the structure of protein loops. Bioinformatics 31:3767–3772
23. Canutescu AA, Dunbrack RL Jr (2003) Cyclic coordinate descent: a robotics algorithm for protein loop closure. Protein Sci 12:963–972
24. Sippl MJ (1990) Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol 213:859–883
25. Shapovalov MV, Dunbrack RL Jr (2011) A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure 19:844–858
26. Krivov GG, Shapovalov MV, Dunbrack RL Jr (2009) Improved prediction of protein side-chain conformations with SCWRL4. Proteins 77:778–795
27. Xu J (2005) Rapid protein side-chain packing via tree decomposition. In: Miyano S, Mesirov J, Kasif S, Istrail S, Pevzner PA, Waterman M (eds) Research in computational molecular biology: 9th Annual International Conference, RECOMB 2005, Cambridge, MA, USA, May 14–18, 2005. Proceedings. Springer Berlin, Heidelberg, pp 423–439
28. Mackerell AD Jr, Feig M, Brooks CL 3rd (2004) Extending the treatment of backbone energetics in protein force fields: limitations of gas-phase quantum mechanics in reproducing protein conformational distributions in molecular dynamics simulations. J Comput Chem 25:1400–1415
29. Eastman P, Swails J, Chodera JD et al (2017) OpenMM 7: rapid development of high performance algorithms for molecular dynamics. PLoS Comput Biol 13:e1005659
30. Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294:93–96
31. Schwede T, Sali A, Honig B et al (2009) Outcome of a workshop on applications of protein models in biomedical research. Structure 17:151–159
32. Read RJ, Adams PD, Arendall WB 3rd et al (2011) A new generation of crystallographic validation tools for the protein data bank. Structure 19:1395–1412
33. Benkert P, Biasini M, Schwede T (2011) Toward the estimation of the absolute quality of individual protein structure models. Bioinformatics 27:343–350
34. Benkert P, Kunzli M, Schwede T (2009) QMEAN server for protein model quality estimation. Nucleic Acids Res 37:W510–W514
35. Haas J, Roth S, Arnold K et al (2013) The protein model portal: a comprehensive resource for protein structure and model information. Database 2013:bat031
36. Teh AH, Kanamasa S, Kajiwara S et al (2008) Structure of Cu/Zn superoxide dismutase from the heavy-metal-tolerant yeast Cryptococcus liquefaciens strain N6. Biochem Biophys Res Commun 374:475–478
37. Benkert P, Tosatto SC, Schomburg D (2008) QMEAN: a comprehensive scoring function for model quality assessment. Proteins 71:261–277
38. Chothia C, Lesk AM (1987) Canonical structures for the hypervariable regions of immunoglobulins. J Mol Biol 196:901–917
39. Morea V, Tramontano A, Rustici M et al (1998) Conformations of the third hypervariable region in the VH domain of immunoglobulins. J Mol Biol 275:269–294
40. Tramontano A, Chothia C, Lesk AM (1990) Framework residue 71 is a major determinant of the position and conformation of the second hypervariable region in the VH domains of immunoglobulins. J Mol Biol 215:175–182
41. Messih MA, Lepore R, Marcatili P et al (2014) Improving the accuracy of the structure prediction of the third hypervariable loop of the heavy chains of antibodies. Bioinformatics 30:2733–2740
42. Almagro JC, Teplyakov A, Luo J et al (2014) Second antibody modeling assessment (AMA-II). Proteins 82:1553–1562
43. Moult J (2005) A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 15:285–289
44. Tai CH, Bai H, Taylor TJ et al (2014) Assessment of template-free modeling in CASP10 and ROLL. Proteins 82(Suppl 2):57–83
45. Meier A, Soding J (2015) Automatic prediction of protein 3D structures by probabilistic multi-template homology modeling. PLoS Comput Biol 11:e1004343
46. Larsson P, Wallner B, Lindahl E et al (2008) Using multiple templates to improve quality of homology models in automated homology modeling. Protein Sci 17:990–1002
47. Cheng J (2008) A multi-template combination algorithm for protein comparative modeling. BMC Struct Biol 8:18
48. Webb B, Sali A (2014) Comparative protein structure modeling using MODELLER. Curr Protoc Bioinformatics 47:5.6.1–5.6.32
49. Grosdidier A, Zoete V, Michielin O (2011) Fast docking using the CHARMM force field with EADock DSS. J Comput Chem 32:2149–2159
50. Grosdidier A, Zoete V, Michielin O (2011) SwissDock, a protein-small molecule docking web service based on EADock DSS. Nucleic Acids Res 39:W270–W277
51. Lensink MF, Velankar S, Wodak SJ (2017) Modeling protein-protein and protein-peptide complexes: CAPRI 6th edition. Proteins 85:359–377
52. Esquivel-Rodriguez J, Filos-Gonzalez V, Li B et al (2014) Pairwise and multimeric protein-protein docking using the LZerD program suite. Methods Mol Biol 1137:209–234
53. Pierce B, Tong W, Weng Z (2005) M-ZDOCK: a grid-based approach for Cn symmetric multimer docking. Bioinformatics 21:1472–1478
54. De Vries SJ, Van Dijk M, Bonvin AM (2010) The HADDOCK web server for data-driven biomolecular docking. Nat Protoc 5:883–897
55. Leaver-Fay A, Tyka M, Lewis SM et al (2011) ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487:545–574
56. Russel D, Lasker K, Webb B et al (2012) Putting the pieces together: integrative modeling platform software for structure determination of macromolecular assemblies. PLoS Biol 10:e1001244
57. Simons KT, Kooperberg C, Huang E et al (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268:209–225
58. Yang J, Yan R, Roy A et al (2015) The I-TASSER suite: protein structure and function prediction. Nat Methods 12:7–8
59. Maghrabi AHA, McGuffin LJ (2017) ModFOLD6: an accurate web server for the global and local quality estimation of 3D protein models. Nucleic Acids Res 45(W1):W416–W421
60. Heo L, Feig M (2018) What makes it difficult to refine protein models further via molecular dynamics simulations? Proteins 86(Suppl 1):177–188
Chapter 18
Abstract
An estimated 20% of cancer cases worldwide are associated with infections.
However, the molecular mechanisms by which infections contribute to host tumorigenesis are still
unknown. To evade host defense, pathogens hijack host proteins at different levels: sequence, structure,
motif, and binding surface, i.e., interface. Interface similarity allows pathogen proteins to compete with
host counterparts for binding to a target protein and to rewire physiological signaling, which can result in persistent
infections as well as cancer. Identification of host-pathogen interactions (HPIs), along with their structural
details at atomic resolution, may provide mechanistic insight into pathogen-driven cancers and
inform new therapeutic interventions. HPI data that include structural details are scarce, and large-scale experimental
detection is challenging. Therefore, there is an urgent need for efficient and robust
computational approaches to predict HPIs and their complex (bound) structures. In this chapter, we review
the first and currently only interface-based computational approach to identify novel HPIs. The concept of
interface mimicry promises to identify more HPIs than complete sequence or structural similarity. We
illustrate this concept with a case study on Kaposi’s sarcoma herpesvirus (KSHV) to elucidate how it subverts
host immunity and contributes to malignant transformation of the host cells.
1 Introduction
1.1 Molecular Mimicry
Signaling pathways shape and convey the cell’s responses to stimuli
from its environment; however, pathogens can circumvent this
response by “repurposing” host signaling. Pathogens can interact
with the host through proteins, metabolites, small molecules, and
nucleic acids [1]. Direct protein-protein interactions are the most
common interaction type (see Note 1). By interfering with key
pathways pathogens can reshape physiological signaling, subverting
the immune system, altering the cytoskeletal organization [2, 3],
modifying membrane and vesicular trafficking [2, 4, 5], boosting
pathogen entry into the host cell, changing the cell cycle regulation
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_18, © Springer Science+Business Media, LLC, part of Springer Nature 2019
317
318 Emine Guven-Maiorov et al.
1.2 Review of Available Computational Tools to Identify HPIs
Several HPI databases have been developed for experimentally
identified HPIs, including PHISTO [36], HPIDB [37], Proteopathogen [38],
PATRIC [39], PHI-base [40], PHIDIAS [41],
HoPaCI-DB [42], VirHostNet [43], ViRBase [44], VirusMentha
[45], and HCVpro [46]. These databases comprise only a limited
number of pathogens. Given that at least hundreds of different
species can infect the host, thousands of HPIs are still unknown.
Enrichment of the host-pathogen interactome and construction of
comprehensive HPI networks will still mostly rely on computational
models in the near future [47]. Numerous studies have computationally
identified large-scale HPIs and built HPI networks for
viruses and bacteria [20, 24, 48–56].
Although prediction of human PPIs is a well-established area,
modeling of interspecies interactions is comparatively new. Still, several
studies have developed computational approaches to identify
HPIs [34], most of which rely on sequence homology [49, 52,
54, 57–63]. Homology-based approaches are successful only if the
sequence similarity is high, but not all virulence factors have homologs
in humans. For instance, a secreted protein of H. pylori, VacA,
does not have sequence similarity with any other known viral,
bacterial, or eukaryotic proteins [64], but it alters signaling
through several host pathways [65]. Thus, sequence-based meth-
ods cannot detect VacA’s HPIs, highlighting the importance of
considering the 3D structures of proteins in predicting HPIs.
There are also sequence-based comparative methods that consider
structure [48, 55, 56, 61, 62, 66–70]; interologs (interacting
homologs/conserved interactions) [71, 72]; and transcriptome
data [73]. Available structure-based techniques often depend on
global structural similarity rather than interface mimicry
[55, 69]. One method combines interface data with sequence
homology and gene expression, but the predicted interacting host
and pathogenic proteins should satisfy a minimum of 80% sequence
identity over at least 50% of template host PPI complexes [66]. To
the best of our knowledge, none of the current approaches utilizes
2 Methods
2.1 Modeling HPIs
Here, we review the first and to date only computational approach
that utilizes solely interface mimicry to predict putative HPIs and
their 3D structures as complexes [74]. Local structural resemblance
is sufficient; there is no need for sequence similarity. This approach
reveals not only targets of pathogenic proteins and how they inter-
act, but also the host endogenous PPIs which may be disrupted by
these potential HPIs. Figure 1 displays our workflow. In docking studies, the
interacting protein partners are generally known, and
the main purpose is to discern how they interact structurally.
Therefore, inputs of the docking algorithms are structures of the
two monomeric target proteins to be docked to each other. How-
ever, when dealing with HPIs, the main aim is to identify the
interacting partners, as well as how they interact. Normally, the
pathogenic proteins (one of the targets in a docking study) are
known but not their partners in the host (second target). Hence,
before performing docking, we need to identify those potential
host interactors.
To accomplish this, we generate all known human interfaces—
including endogenous and exogenous—in the PiFace interface
database, as described in [14]. Each interface has two chains (part-
ners/sides). There are 26,236 human interfaces in our template set.
Then, we structurally align these interfaces with the pathogenic
proteins by MultiProt [80]. The structural alignment thresholds
for the number of matching interface residues and the hot spots
follow the PRISM algorithm [81–84]. If the pathogenic protein is
aligned with one side of the human interface, it may interact with
the complementary side. Thus, the pathogenic protein can compete
with the first side of the interface—with which it is structurally
aligned—to bind to the second side, thereby abrogating the endog-
enous binary interaction in the template PPI (Fig. 1). Structural
complementarity does not necessarily guarantee chemical comple-
mentarity and favorable interaction energy. For instance, 8 KSHV
Interface-Based Structural Prediction of Novel Host-Pathogen Interactions 321
Fig. 1 Workflow of our interface-based HPI modeling approach. In the first step, we extract human interfaces
from the PDB. Then, we obtain the structures of the pathogenic proteins from the PDB. Before docking, we
need to identify the potential HPI pairs since docking programs require two target proteins. To do that, we
structurally align the pathogenic proteins with the human interfaces in our template set. If the pathogenic
protein is aligned with the B-side of the interface, it can interact with the complementary A-side. After
determining potential HPI pairs, we perform docking of these pairs with PRISM [81–84] and Rosetta (local
refinement) [85–87] to select the energetically favorable ones. We further assess the likelihood that the HPI
models take place in the cell based on the percent match of the interface residues with the template interface
and probability of the template interface being a real biological interface. In the final optional step, we filter our
energetically favorable HPI results according to tissue expression of the human proteins by checking whether
the interactors of the pathogenic proteins are expressed in the same tissue where the pathogen resides
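The pairing step of the workflow (a pathogenic protein aligned with one side of a human interface becomes a candidate binder of the complementary side) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `TemplateInterface` type, the function name, and the tuple layout of the alignment hits are all hypothetical, and the assignment of IL12B and IL23A to sides A and B of the template is chosen only for the example.

```python
from typing import NamedTuple


class TemplateInterface(NamedTuple):
    template_id: str  # label of the human template interface
    side_a: str       # human protein on side A of the interface
    side_b: str       # human protein on side B of the interface


def candidate_hpi_pairs(alignments, interfaces):
    """Turn structural-alignment hits into candidate HPI pairs.

    alignments: iterable of (pathogen_protein, template_id, aligned_side),
                where aligned_side is "A" or "B" (hypothetical hit format).
    Returns a list of (pathogen_protein, target, mimicked, template_id).
    """
    by_id = {iface.template_id: iface for iface in interfaces}
    pairs = []
    for pathogen_protein, template_id, aligned_side in alignments:
        iface = by_id[template_id]
        if aligned_side == "A":
            mimicked, target = iface.side_a, iface.side_b
        elif aligned_side == "B":
            mimicked, target = iface.side_b, iface.side_a
        else:
            raise ValueError(f"unknown interface side: {aligned_side!r}")
        # The pathogen protein may bind the complementary side and thereby
        # compete with the human protein it structurally resembles.
        pairs.append((pathogen_protein, target, mimicked, template_id))
    return pairs


# Illustrative input: side labels are assumed for the example.
templates = [TemplateInterface("3duhBD", side_a="IL12B", side_b="IL23A")]
hits = [("vIL6", "3duhBD", "B")]
print(candidate_hpi_pairs(hits, templates))
```

Each returned tuple records both the proposed host target and the endogenous partner that would be displaced, so the same record can feed the docking step and the list of potentially disrupted PPIs.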
host protein with a higher affinity. For some template PPIs, Rosetta
gives unrealistically low I_sc values because of intermolecular disulfide
bonds. To correct for this, we calculate the Rosetta I_sc both with and
without the disulfide bonds. We consider HPIs to be
favorable interactions if they have an I_sc below −5 in both Rosetta
scorings. Note that the Rosetta I_sc has no units and does not reflect the
real binding free energy. It only indicates whether an interaction
is favorable or not.
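The two-score acceptance rule can be written as a small predicate. This is a sketch under stated assumptions: the function name is hypothetical, the two I_sc values are assumed to have been computed beforehand (one scoring that includes disulfide bonds, one that disregards them), and the cutoff is passed in explicitly rather than hard-coded. The value used in the example assumes that lower (more negative) I_sc means a more favorable interface.

```python
def is_favorable(isc_with_disulfides, isc_without_disulfides, cutoff):
    """Accept a model only if BOTH interface scores fall below the cutoff.

    Requiring both scores guards against models that look favorable only
    because intermolecular disulfide bonds dominate one of the scorings.
    """
    return isc_with_disulfides < cutoff and isc_without_disulfides < cutoff


# Illustrative values only; they are not results from the chapter.
CUTOFF = -5.0  # assumed sign convention: lower I_sc is more favorable
print(is_favorable(-12.3, -7.1, CUTOFF))  # both scores pass
print(is_favorable(-40.0, -2.0, CUTOFF))  # passes only with disulfides: rejected
```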
To further evaluate the likelihoods of our HPI models, we
calculate the “percent match” of the interfaces by taking the ratio
of the number of interface residues that are aligned with the pathogenic
protein to the number of interface residues in the endogenous
template PPI. Each template interface is assigned a
weight based on the size of the endogenous template interface
such that larger interfaces have higher weights. If the template
interface has fewer than 30 residues (n < 30), the weight is 0.5; if
30 < n < 50, the weight is 1; if 50 < n < 80, the weight is 1.5; and if n > 80
(very large interface), the weight is 2. Score1 given in Table 1 is the
product of the interface percent match and the corresponding
interface weight.
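As a worked illustration of Score1, the sketch below combines the percent match with the size-based weight. Two points are assumptions rather than statements from the text: the match is expressed as a fraction between 0 and 1 (multiply by 100 for a percentage), and the boundary sizes n = 30, 50, and 80, which the ranges above leave unassigned, are placed in the upper bin. The function names are hypothetical.

```python
def interface_weight(n_residues):
    """Size-based weight of a template interface.

    Assumption: n = 30, 50 and 80 fall into the higher-weight bin, because
    the text specifies only strict inequalities.
    """
    if n_residues < 30:
        return 0.5
    if n_residues < 50:
        return 1.0
    if n_residues < 80:
        return 1.5
    return 2.0


def score1(n_aligned, n_template):
    """Fraction of template interface residues matched, times the weight."""
    if n_template <= 0 or not 0 <= n_aligned <= n_template:
        raise ValueError("need 0 <= n_aligned <= n_template and n_template > 0")
    match = n_aligned / n_template
    return match * interface_weight(n_template)


# 30 of 40 template residues aligned: 0.75 match, weight 1.0
print(score1(30, 40))
# 45 of 90 template residues aligned: 0.5 match, weight 2.0
print(score1(45, 90))
```

The second call shows the intended effect of the weighting: a half-matched very large interface scores higher than a three-quarters match on a mid-sized one.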
We employ EPPIC (Evolutionary Protein-Protein Interface
Classifier) [88] to evaluate whether the template interfaces are real
biological interfaces or crystal artifacts. The EPPIC server gives the
probability of a particular interface being biological. Score2 in
Table 1 is the product of Score1 and the probability of being a
biological interface. The higher the Score2, the more confidence
we have that a particular HPI model would take place in the cell, as
they are better mimics of real biological endogenous interfaces (see
Note 3).
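Score2 and the resulting ranking can be sketched as below. The dictionary keys, function names, and numeric values are hypothetical and serve only to show the arithmetic; the probability is assumed to be supplied as a number between 0 and 1.

```python
def score2(score1_value, p_biological):
    """Score1 scaled by the probability that the template interface is biological."""
    if not 0.0 <= p_biological <= 1.0:
        raise ValueError("p_biological must lie in [0, 1]")
    return score1_value * p_biological


def rank_models(models):
    """Sort HPI models from highest to lowest Score2.

    models: list of dicts with keys "name", "score1", "p_biological"
            (a hypothetical record layout).
    """
    return sorted(
        models,
        key=lambda m: score2(m["score1"], m["p_biological"]),
        reverse=True,
    )


# Illustrative numbers only; they are not values from Table 1.
models = [
    {"name": "model_1", "score1": 1.2, "p_biological": 0.5},
    {"name": "model_2", "score1": 0.8, "p_biological": 1.0},
]
print([m["name"] for m in rank_models(models)])
```

In this toy input the model with the lower Score1 ranks first because its template interface is more likely to be biological.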
Finally, as an optional step, the results can be filtered according
to tissue expression, checking whether the host partners of the
pathogenic proteins are expressed in the same tissue where the
pathogen resides. We take the tissue expression data from the
Human Protein Atlas, which includes 19,709 human proteins,
mapping to 7106 human PDBs [89, 90]. If the pathogen is a
bacterial species, it resides in only certain tissues. For instance,
Helicobacter pylori is mainly in the stomach and gastrointestinal
tract, making it reasonable to focus on human proteins that are
expressed in these tissues. However, if the pathogen is a virus, it can
infect several different—if not all—tissues. Therefore, filtering
according to tissue expression is an optional step depending on
the pathogen type (see Note 4).
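The optional tissue filter amounts to a set intersection between the tissues in which a human protein is expressed and the tissues in which the pathogen resides. The sketch below assumes the expression data have already been loaded into a dictionary; the function name, the protein placeholders, and the tissue labels are hypothetical, and dropping proteins without expression data is a design choice of this sketch, not something the text specifies.

```python
def filter_by_tissue(hpi_pairs, expression, pathogen_tissues):
    """Keep HPIs whose human partner is expressed where the pathogen resides.

    hpi_pairs:        list of (pathogen_protein, human_protein)
    expression:       dict mapping human_protein -> set of tissue names
    pathogen_tissues: set of tissue names colonized by the pathogen
    Human proteins with no expression record are dropped (assumption).
    """
    kept = []
    for pathogen_protein, human_protein in hpi_pairs:
        tissues = expression.get(human_protein, set())
        if tissues & pathogen_tissues:
            kept.append((pathogen_protein, human_protein))
    return kept


# Placeholder data for illustration only.
expression = {
    "HUMAN_1": {"stomach", "colon"},
    "HUMAN_2": {"brain"},
}
pairs = [("PATH_1", "HUMAN_1"), ("PATH_1", "HUMAN_2"), ("PATH_2", "HUMAN_3")]
print(filter_by_tissue(pairs, expression, {"stomach"}))
```

For a virus that infects many tissues, the same function can simply be skipped, which matches the optional character of this step.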
2.2 Constructing the Structural Superorganism Network
As we have the complex (bound) structures of the predicted HPIs,
it is possible to construct the structural interspecies interaction
network. Our template set serves as the human endogenous binary
interactions. 26,236 interfaces map to 3366 distinct human PPIs.
The predicted HPIs serve as exogenous interactions. So, all
Table 1
HPIs for KSHV proteins
2.3 Case Study
Our interface-based HPI modeling method has previously been
applied successfully to H. pylori and can be applied to any commensal
or pathogenic microorganism. As a case study to illustrate the utility
of the concept, here we applied it to KSHV, infection with which is
associated with a blood/lymph vessel cancer (Kaposi’s sarcoma)
and lymphoma [95]. We modeled its HPIs and constructed its
structural superorganism network. We analyzed eight KSHV proteins:
vCyclin, vFLIP, vBCL2, vIL6, vIRF1, vIRF2, and the viral chemokines
K4 and K6. We found 96 putative HPIs. All our HPI
models have 3D structures as complexes (see Note 5). Table 1
shows some examples from these 96 HPIs and Table 2 displays
the human PPIs that are potentially disrupted by these HPIs.
Our HPI candidates may elucidate the roles of KSHV in mod-
ulation of host signaling and contribution to malignant transfor-
mation. For instance, we found that KSHV chemokines and
cytokines, like K4, K6, and vIL6, target many human chemokine
and cytokine receptors (Fig. 2). Signaling through the cytokine and
chemokine receptors is critical for T-cell recruitment to the infected
Table 2
Potentially disrupted endogenous host PPIs due to predicted KSHV HPIs
KSHV protein | Human PPI disrupted by KSHV protein | PDB for the human PPI disrupted
K4 | CCL4-CCL4 | 2x6lBD
K4 | CXCR4-SDF1 | 2k03CD
K6 | CCL5-CCL5 | 1u4lAB
vCyclin | CDK4-CCND3 | 3g33AD
vCyclin | CDK2-CCNE1 | 1w98AB
vIL6 | IL12B-IL23A | 3duhBD
vIL6 | INAR1-IFNW1 | 3se4AB
vIRF1 | UBP21-RL40 | 3i3tGH
vFLIP | TNR6-FADD | 3ezqIJ
vBCL2 | ITA2B-ITB3 | 2vdkAB
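The identifiers in the last column of Table 2 appear to combine a four-character PDB code with two chain letters, one per side of the interface. Assuming that reading is correct (the chapter text shown here does not define the format), a small helper can split them; the function name is hypothetical.

```python
def parse_template_id(template_id):
    """Split an identifier such as '3duhBD' into (pdb_id, chain_1, chain_2).

    Assumes a four-character PDB code followed by two one-letter chain IDs.
    """
    if len(template_id) != 6:
        raise ValueError(f"expected 6 characters, got {template_id!r}")
    return template_id[:4], template_id[4], template_id[5]


print(parse_template_id("3duhBD"))
print(parse_template_id("2x6lBD"))
```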
Fig. 2 KSHV proteins mimic the human protein-protein interfaces, blocking human PPIs. (a) Endogenous
human PPI between IL12B and IL23A. (b) Our HPI model between vIL6 and IL12B. (c) Superimposed view of
PPI and HPI shows that vIL6 almost perfectly mimics the interface on IL23A to bind to IL12B. (d) through (l)
also show the superimposed structures of endogenous human PPIs and modeled HPIs. Human proteins are
shown in cyan and pink; and KSHV proteins are shown in gray. Gray proteins bind to pink proteins by hijacking
the interface on cyan proteins (only the interface similarity is enough, no need for global structural similarity).
Thus, they may block the pink-cyan protein interactions
Fig. 3 KSHV proteins mimic not only host interactions, but also other HPIs from other species (a), (b), and (c).
Figures show the superimposed structures of our HPI models for KSHV with the known exogenous interactions
with proteins from other species. Pink proteins are from human, greens are proteins from other pathogens,
and gray proteins are KSHV proteins. Gray proteins bind to pink proteins by hijacking the interfaces on green
proteins
Fig. 4 Structural superorganism network for KSHV and human, where all binary interactions have structures as
complexes. Endogenous human interactions (black edges) are obtained from crystal structures in PDB
(template interface set), where human proteins are shown as gray circular nodes. Exogenous interactions
(red edges) are our HPI models for 8 KSHV proteins, where viral proteins are shown as blue diamond nodes. (a)
KSHV proteins target the highly connected part of the human PPI network. (b) Structural HPI network without
the endogenous template interactions. Most targets of individual KSHV proteins are distinct, but some are
shared across different KSHV proteins
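A network of this kind, with endogenous and exogenous edges kept distinguishable, can be represented with plain dictionaries. The sketch below is illustrative only: the function names are hypothetical, and apart from the vIL6/IL12B/IL23A relationship shown in Fig. 2, the node names are placeholders rather than predictions from the chapter.

```python
from collections import defaultdict


def build_network(endogenous, exogenous):
    """Undirected graph stored as node -> {neighbor: edge_kind}."""
    graph = defaultdict(dict)
    for a, b in endogenous:
        graph[a][b] = graph[b][a] = "endogenous"
    for viral, human in exogenous:
        graph[viral][human] = graph[human][viral] = "exogenous"
    return graph


def endogenous_degree(graph, protein):
    """Number of human-human interactions a protein takes part in."""
    return sum(kind == "endogenous" for kind in graph.get(protein, {}).values())


def shared_targets(graph, viral_proteins):
    """Human proteins targeted by more than one viral protein."""
    hit_by = defaultdict(set)
    for viral in viral_proteins:
        for partner, kind in graph.get(viral, {}).items():
            if kind == "exogenous":
                hit_by[partner].add(viral)
    return {human: viruses for human, viruses in hit_by.items() if len(viruses) > 1}


# H1, H2, vX and vY are placeholders, not results from the chapter.
graph = build_network(
    endogenous=[("IL12B", "IL23A"), ("H1", "H2")],
    exogenous=[("vIL6", "IL12B"), ("vX", "H1"), ("vY", "H1")],
)
print(endogenous_degree(graph, "IL12B"))
print(shared_targets(graph, ["vIL6", "vX", "vY"]))
```

Keeping the edge kind on every edge makes both views in Fig. 4 cheap to derive: the full superorganism network uses all edges, while the HPI-only view keeps just the exogenous ones.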
3 Concluding Remarks
Table 3
Functional enrichment of KSHV-targeted human proteins by DAVID [93, 94]
KEGG pathways enriched | Number of genes | % | P value | KSHV-targeted human proteins
Cytokine-cytokine receptor interaction | 10 | 13.9 | 7.20E-05 | CCL3, CCL2, CCL13, TNR6, CCL4, ACVR1, CCL5, CXCR4, INAR1, IL12B
Chemokine signaling pathway | 9 | 12.5 | 9.80E-05 | RHOA, CCL3, CCL2, CCL13, CCL4, CCL5, JAK2, CCL14, CXCR4
Herpes simplex infection | 8 | 11.1 | 5.60E-04 | CCL2, C1QBP, TNR6, CDK2, CCL5, JAK2, INAR1, IL12B
Measles | 7 | 9.7 | 6.10E-04 | TNR6, CCND3, CDK4, CDK2, JAK2, INAR1, IL12B
p53 signaling pathway | 5 | 6.9 | 1.90E-03 | TNR6, CCND3, CASP9, CDK4, CDK2
Influenza A | 7 | 9.7 | 2.40E-03 | CCL2, TNR6, CASP9, CCL5, JAK2, INAR1, IL12B
Pathways in cancer | 10 | 13.9 | 3.50E-03 | RHOA, ITA2B, FGFR2, TNR6, CASP9, CDK4, CDK2, CXCR4, ARHGB, BMP2
Hepatitis B | 6 | 8.3 | 5.70E-03 | CCNA2, TNR6, CASP9, CDK4, CDK2, INAR1
Chagas disease (American trypanosomiasis) | 5 | 6.9 | 9.20E-03 | CCL3, CCL2, TNR6, CCL5, IL12B
Toll-like receptor signaling pathway | 5 | 6.9 | 9.80E-03 | CCL3, CCL4, CCL5, INAR1, IL12B
PI3K-Akt signaling pathway | 8 | 11.1 | 1.90E-02 | ITA2B, FGFR2, CCND3, CASP9, CDK4, CDK2, JAK2, INAR1
African trypanosomiasis | 3 | 4.2 | 2.80E-02 | TNR6, HBA, IL12B
Small-cell lung cancer | 4 | 5.6 | 3.00E-02 | ITA2B, CASP9, CDK4, CDK2
Glutathione metabolism | 3 | 4.2 | 6.20E-02 | GSTA4, GSTP1, GSTM2
Cell cycle | 4 | 5.6 | 7.60E-02 | CCNA2, CCND3, CDK4, CDK2
Viral carcinogenesis | 5 | 6.9 | 8.00E-02 | CCNA2, RHOA, CCND3, CDK4, CDK2
Signaling pathways regulating pluripotency of stem cells | 4 | 5.6 | 1.00E-01 | FGFR2, ACVR1, JAK2, BMP2
4 Notes
Acknowledgments
This project has been funded in whole or in part with federal funds
from the National Cancer Institute, National Institutes of Health,
under contract number HHSN261200800001E. The content of
this publication does not necessarily reflect the views or policies of
the Department of Health and Human Services, nor does mention
of trade names, commercial products, or organizations imply
endorsement by the US Government. This research was supported
(in part) by the Intramural Research Program of the NIH, National
Cancer Institute, Center for Cancer Research. This study utilized
the high-performance computational capabilities of the Biowulf
PC/Linux cluster at the National Institutes of Health (NIH),
Bethesda, MD (http://biowulf.nih.gov).
References
1. Durmus S, Cakir T, Ozgur A, Guthke R (2015) A review on computational systems biology of pathogen-host interactions. Front Microbiol 6:235. https://doi.org/10.3389/fmicb.2015.00235
2. Stebbins CE, Galan JE (2001) Structural mimicry in bacterial virulence. Nature 412(6848):701–705. https://doi.org/10.1038/35089000
3. Sal-Man N, Biemans-Oldehinkel E, Finlay BB (2009) Structural microengineers: pathogenic Escherichia coli redesigns the actin cytoskeleton in host cells. Structure 17(1):15–19. https://doi.org/10.1016/j.str.2008.12.001
4. Kahn RA, Fu H, Roy CR (2002) Cellular hijacking: a common strategy for microbial infection. Trends Biochem Sci 27(6):308–314. https://doi.org/10.1016/S0968-0004(02)02108-4
5. Finlay BB, McFadden G (2006) Anti-immunology: evasion of the host immune system by bacterial and viral pathogens. Cell 124(4):767–782. https://doi.org/10.1016/j.cell.2006.01.034
6. Moody CA, Laimins LA (2010) Human papillomavirus oncoproteins: pathways to transformation. Nat Rev Cancer 10(8):550–560. https://doi.org/10.1038/nrc2886
7. Filippova M, Song H, Connolly JL, Dermody TS, Duerksen-Hughes PJ (2002) The human papillomavirus 16 E6 protein binds to tumor necrosis factor (TNF) R1 and protects cells from TNF-induced apoptosis. J Biol Chem 277(24):21730–21739. https://doi.org/10.1074/jbc.M200113200
8. Shirin H, Sordillo EM, Kolevska TK, Hibshoosh H, Kawabata Y, Oh SH, Kuebler JF, Delohery T, Weghorst CM, Weinstein IB, Moss SF (2000) Chronic helicobacter pylori infection induces an apoptosis-resistant phenotype associated with decreased expression of p27(kip1). Infect Immun 68(9):5321–5328
9. Guven-Maiorov E, Tsai CJ, Nussinov R (2016) Pathogen mimicry of host protein-protein interfaces modulates immunity. Semin Cell Dev Biol 58:136–145. https://doi.org/10.1016/j.semcdb.2016.06.004
10. Tsai CJ, Lin SL, Wolfson HJ, Nussinov R (1996) A dataset of protein-protein interfaces generated with a sequence-order-independent comparison technique. J Mol Biol 260(4):604–620. https://doi.org/10.1006/jmbi.1996.0424
11. Tsai CJ, Lin SL, Wolfson HJ, Nussinov R (1996) Protein-protein interfaces: architectures and interactions in protein-protein interfaces and in protein cores. Their similarities and differences. Crit Rev Biochem Mol Biol 31(2):127–152. https://doi.org/10.3109/10409239609106582
12. Keskin O, Nussinov R (2005) Favorable scaffolds: proteins with different sequence, structure and function may associate in similar ways. Protein Eng Des Sel 18(1):11–24. https://doi.org/10.1093/protein/gzh095
13. Keskin O, Nussinov R (2007) Similar binding sites and different partners: implications to shared proteins in cellular pathways. Structure 15(3):341–354. https://doi.org/10.1016/j.str.2007.01.007
14. Cukuroglu E, Gursoy A, Nussinov R, Keskin O (2014) Non-redundant unique interface structures as templates for modeling protein
interactions. PLoS One 9(1):e86738. https://doi.org/10.1371/journal.pone.0086738
15. Muratcioglu S, Guven-Maiorov E, Keskin O, Gursoy A (2015) Advances in template-based protein docking by utilizing interfaces towards completing structural interactome. Curr Opin Struct Biol 35:87–92. https://doi.org/10.1016/j.sbi.2015.10.001
16. Franzosa EA, Garamszegi S, Xia Y (2012) Toward a three-dimensional view of protein networks between species. Front Microbiol 3:428. https://doi.org/10.3389/fmicb.2012.00428
17. Franzosa EA, Xia Y (2011) Structural principles within the human-virus protein-protein interaction network. Proc Natl Acad Sci U S A 108(26):10538–10543. https://doi.org/10.1073/pnas.1101440108
18. Guven-Maiorov E, Tsai CJ, Nussinov R (2017) Structural host-microbiota interaction networks. PLoS Comput Biol 13(10):e1005579. https://doi.org/10.1371/journal.pcbi.1005579
19. Bhavsar AP, Guttman JA, Finlay BB (2007) Manipulation of host-cell pathways by bacterial pathogens. Nature 449(7164):827–834. https://doi.org/10.1038/nature06247
20. Uetz P, Dong YA, Zeretzke C, Atzler C, Baiker A, Berger B, Rajagopala SV, Roupelieva M, Rose D, Fossum E, Haas J (2006) Herpesviral protein networks and their interaction with the human proteome. Science 311(5758):239–242. https://doi.org/10.1126/science.1116804
21. von Schwedler UK, Stuchell M, Muller B, Ward DM, Chung HY, Morita E, Wang HE, Davis T, He GP, Cimbora DM, Scott A, Krausslich HG, Kaplan J, Morham SG, Sundquist WI (2003) The protein network of HIV budding. Cell 114(6):701–713
22. Calderwood MA, Venkatesan K, Xing L, Chase MR, Vazquez A, Holthaus AM, Ewence AE, Li N, Hirozane-Kishikawa T, Hill DE, Vidal M, Kieff E, Johannsen E (2007) Epstein-Barr virus and virus human protein interaction maps. Proc Natl Acad Sci U S A 104(18):7606–7611. https://doi.org/10.1073/pnas.0702332104
23. de Chassey B, Navratil V, Tafforeau L, Hiet MS, Aublin-Gex A, Agaugue S, Meiffren G, Pradezynski F, Faria BF, Chantier T, Le Breton M, Pellet J, Davoust N, Mangeot PE, Chaboud A, Penin F, Jacob Y, Vidalain PO, Vidal M, Andre P, Rabourdin-Combe C, Lotteau V (2008) Hepatitis C virus infection protein network. Mol Syst Biol 4:230. https://doi.org/10.1038/msb.2008.66
24. Shapira SD, Gat-Viks I, Shum BO, Dricot A, de Grace MM, Wu L, Gupta PB, Hao T, Silver SJ, Root DE, Hill DE, Regev A, Hacohen N (2009) A physical and regulatory map of host-influenza interactions reveals pathways in H1N1 infection. Cell 139(7):1255–1267. https://doi.org/10.1016/j.cell.2009.12.018
25. Zhang L, Villa NY, Rahman MM, Smallwood S, Shattuck D, Neff C, Dufford M, Lanchbury JS, Labaer J, McFadden G (2009) Analysis of vaccinia virus-host protein-protein interactions: validations of yeast two-hybrid screenings. J Proteome Res 8(9):4311–4318. https://doi.org/10.1021/pr900491n
26. Khadka S, Vangeloff AD, Zhang C, Siddavatam P, Heaton NS, Wang L, Sengupta R, Sahasrabudhe S, Randall G, Gribskov M, Kuhn RJ, Perera R, LaCount DJ (2011) A physical interaction network of dengue virus and human proteins. Mol Cell Proteomics 10(12):M111.012187. https://doi.org/10.1074/mcp.M111.012187
27. Jager S, Cimermancic P, Gulbahce N, Johnson JR, McGovern KE, Clarke SC, Shales M, Mercenne G, Pache L, Li K, Hernandez H, Jang GM, Roth SL, Akiva E, Marlett J, Stephens M, D’Orso I, Fernandes J, Fahey M, Mahon C, O’Donoghue AJ, Todorovic A, Morris JH, Maltby DA, Alber T, Cagney G, Bushman FD, Young JA, Chanda SK, Sundquist WI, Kortemme T, Hernandez RD, Craik CS, Burlingame A, Sali A, Frankel AD, Krogan NJ (2011) Global landscape of HIV-human protein complexes. Nature 481(7381):365–370. https://doi.org/10.1038/nature10719
28. Pichlmair A, Kandasamy K, Alvisi G, Mulhern O, Sacco R, Habjan M, Binder M, Stefanovic A, Eberle CA, Goncalves A, Burckstummer T, Muller AC, Fauster A, Holze C, Lindsten K, Goodbourn S, Kochs G, Weber F, Bartenschlager R, Bowie AG, Bennett KL, Colinge J, Superti-Furga G (2012) Viral immune modulators perturb the human molecular network by common and unique strategies. Nature 487(7408):486–490. https://doi.org/10.1038/nature11289
29. Rozenblatt-Rosen O, Deo RC, Padi M, Adelmant G, Calderwood MA, Rolland T, Grace M, Dricot A, Askenazi M, Tavares M, Pevzner SJ, Abderazzaq F, Byrdsong D, Carvunis AR, Chen AA, Cheng J, Correll M, Duarte M, Fan C, Feltkamp MC, Ficarro SB, Franchi R, Garg BK, Gulbahce N, Hao T, Holthaus AM, James R, Korkhin A, Litovchick L, Mar JC, Pak TR, Rabello S,
Rubio R, Shen Y, Singh S, Spangle JM, Tasan M, Wanamaker S, Webber JT, Roecklein-Canfield J, Johannsen E, Barabasi AL, Beroukhim R, Kieff E, Cusick ME, Hill DE, Munger K, Marto JA, Quackenbush J, Roth FP, DeCaprio JA, Vidal M (2012) Interpreting cancer genomes using systematic host network perturbations by tumour virus proteins. Nature 487(7408):491–495. https://doi.org/10.1038/nature11288
30. Guven Maiorov E, Keskin O, Gursoy A, Nussinov R (2013) The structural network of inflammation and cancer: merits and challenges. Semin Cancer Biol 23(4):243–251. https://doi.org/10.1016/j.semcancer.2013.05.003
31. Guven-Maiorov E, Keskin O, Gursoy A, VanWaes C, Chen Z, Tsai CJ, Nussinov R (2015) The architecture of the TIR domain signalosome in the toll-like Receptor-4 signaling pathway. Sci Rep 5:13128. https://doi.org/10.1038/srep13128
32. Guven-Maiorov E, Keskin O, Gursoy A, Nussinov R (2015) A structural view of negative regulation of the toll-like receptor-mediated inflammatory pathway. Biophys J 109(6):1214–1226. https://doi.org/10.1016/j.bpj.2015.06.048
33. Acuner-Ozbabacan ES, Engin BH, Guven-Maiorov E, Kuzu G, Muratcioglu S, Baspinar A, Chen Z, Van Waes C, Gursoy A, Keskin O, Nussinov R (2014) The structural network of Interleukin-10 and its implications in inflammation and cancer. BMC Genomics 15(Suppl 4):S2. https://doi.org/10.1186/1471-2164-15-S4-S2
34. Nourani E, Khunjush F, Durmus S (2015) Computational approaches for prediction of pathogen-host protein-protein interactions. Front Microbiol 6:94. https://doi.org/10.3389/fmicb.2015.00094
35. Brito AF, Pinney JW (2017) Protein-protein interactions in virus-host systems. Front Microbiol 8:1557. https://doi.org/10.3389/fmicb.2017.01557
36. Durmus Tekir S, Cakir T, Ardic E, Sayilirbas AS, Konuk G, Konuk M, Sariyer H, Ugurlu A, Karadeniz I, Ozgur A, Sevilgen FE, Ulgen KO (2013) PHISTO: pathogen-host interaction search tool. Bioinformatics 29(10):1357–1358. https://doi.org/10.1093/bioinformatics/btt137
37. Kumar R, Nanduri B (2010) HPIDB--a unified resource for host-pathogen interactions. BMC Bioinformatics 11(Suppl 6):S16. https://doi.org/10.1186/1471-2105-11-S6-S16
38. Vialas V, Nogales-Cadenas R, Nombela C, Pascual-Montano A, Gil C (2009) Proteopathogen, a protein database for studying Candida albicans--host interaction. Proteomics 9(20):4664–4668. https://doi.org/10.1002/pmic.200900023
39. Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, Gillespie JJ, Gough R, Hix D, Kenyon R, Machi D, Mao C, Nordberg EK, Olson R, Overbeek R, Pusch GD, Shukla M, Schulman J, Stevens RL, Sullivan DE, Vonstein V, Warren A, Will R, Wilson MJ, Yoo HS, Zhang C, Zhang Y, Sobral BW (2014) PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res 42(Database issue):D581–D591. https://doi.org/10.1093/nar/gkt1099
40. Urban M, Pant R, Raghunath A, Irvine AG, Pedro H, Hammond-Kosack KE (2015) The Pathogen-Host Interactions database (PHI-base): additions and future developments. Nucleic Acids Res 43(Database issue):D645–D655. https://doi.org/10.1093/nar/gku1165
41. Xiang Z, Tian Y, He Y (2007) PHIDIAS: a pathogen-host interaction data integration and analysis system. Genome Biol 8(7):R150. https://doi.org/10.1186/gb-2007-8-7-r150
42. Bleves S, Dunger I, Walter MC, Frangoulidis D, Kastenmuller G, Voulhoux R, Ruepp A (2014) HoPaCI-DB: host-Pseudomonas and Coxiella interaction database. Nucleic Acids Res 42(Database issue):D671–D676. https://doi.org/10.1093/nar/gkt925
43. Guirimand T, Delmotte S, Navratil V (2015) VirHostNet 2.0: surfing on the web of virus/host molecular interactions data. Nucleic Acids Res 43(Database issue):D583–D587. https://doi.org/10.1093/nar/gku1121
44. Li Y, Wang C, Miao Z, Bi X, Wu D, Jin N, Wang L, Wu H, Qian K, Li C, Zhang T, Zhang C, Yi Y, Lai H, Hu Y, Cheng L, Leung KS, Li X, Zhang F, Li K, Li X, Wang D (2015) ViRBase: a resource for virus-host ncRNA-associated interactions. Nucleic Acids Res 43(Database issue):D578–D582. https://doi.org/10.1093/nar/gku903
45. Calderone A, Licata L, Cesareni G (2015) VirusMentha: a new resource for virus-host protein interactions. Nucleic Acids Res 43(Database issue):D588–D592. https://doi.org/10.1093/nar/gku830
46. Kwofie SK, Schaefer U, Sundararajan VS, Bajic VB, Christoffels A (2011) HCVpro: hepatitis C virus protein interaction database. Infect Genet Evol 11(8):1971–1977. https://doi.org/10.1016/j.meegid.2011.09.001
47. Arnold R, Boonen K, Sun MG, Kim PM (2012) Computational analysis of
88. Duarte JM, Srebniak A, Scharer MA, Capitani G (2012) Protein interface classification by evolutionary analysis. BMC Bioinformatics 13:334. https://doi.org/10.1186/1471-2105-13-334
89. Uhlen M, Bjorling E, Agaton C, Szigyarto CA, Amini B, Andersen E, Andersson AC, Angelidou P, Asplund A, Asplund C, Berglund L, Bergstrom K, Brumer H, Cerjan D, Ekstrom M, Elobeid A, Eriksson C, Fagerberg L, Falk R, Fall J, Forsberg M, Bjorklund MG, Gumbel K, Halimi A, Hallin I, Hamsten C, Hansson M, Hedhammar M, Hercules G, Kampf C, Larsson K, Lindskog M, Lodewyckx W, Lund J, Lundeberg J, Magnusson K, Malm E, Nilsson P, Odling J, Oksvold P, Olsson I, Oster E, Ottosson J, Paavilainen L, Persson A, Rimini R, Rockberg J, Runeson M, Sivertsson A, Skollermo A, Steen J, Stenvall M, Sterky F, Stromberg S, Sundberg M, Tegel H, Tourle S, Wahlund E, Walden A, Wan J, Wernerus H, Westberg J, Wester K, Wrethagen U, Xu LL, Hober S, Ponten F (2005) A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol Cell Proteomics 4(12):1920–1932. https://doi.org/10.1074/mcp.M500279-MCP200
90. Uhlen M, Fagerberg L, Hallstrom BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson A, Kampf C, Sjostedt E, Asplund A, Olsson I, Edlund K, Lundberg E, Navani S, Szigyarto CA, Odeberg J, Djureinovic D, Takanen JO, Hober S, Alm T, Edqvist PH, Berling H, Tegel H, Mulder J, Rockberg J, Nilsson P, Schwenk JM, Hamsten M, von Feilitzen K, Forsberg M, Persson L, Johansson F, Zwahlen M, von Heijne G, Nielsen J, Ponten F (2015) Proteomics. Tissue-based map of the human proteome. Science 347(6220):1260419. https://doi.org/10.1126/science.1260419
91. Yang H, Ke Y, Wang J, Tan Y, Myeni SK, Li D, Shi Q, Yan Y, Chen H, Guo Z, Yuan Y, Yang X, Yang R, Du Z (2011) Insight into bacterial virulence mechanisms against host immune response via the Yersinia pestis-human protein-protein interaction network. Infect Immun 79(11):4413–4424. https://doi.org/10.1128/IAI.05622-11
92. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504. https://doi.org/10.1101/gr.1239303
93. Huang d W, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37(1):1–13. https://doi.org/10.1093/nar/gkn923
94. Huang d W, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4(1):44–57. https://doi.org/10.1038/nprot.2008.211
95. Dissinger NJ, Damania B (2016) Recent advances in understanding Kaposi’s sarcoma-associated herpesvirus. F1000Res 5:F1000. https://doi.org/10.12688/f1000research.7612.1
96. Luther SA, Cyster JG (2001) Chemokines as regulators of T cell differentiation. Nat Immunol 2(2):102–107. https://doi.org/10.1038/84205
Chapter 19
Abstract
Intrinsically disordered proteins and regions are involved in a wide range of cellular functions, and they
often facilitate protein-protein interactions. Molecular recognition features (MoRFs) are segments of
intrinsically disordered regions that bind to partner proteins, where binding is concomitant with a transition
to a structured conformation. MoRFs facilitate translation, transport, signaling, and regulatory processes
and are found across all domains of life. A popular computational tool, MoRFpred, accurately predicts
MoRFs in protein sequences. MoRFpred is implemented as a user-friendly web server that is freely available
at http://biomine.cs.vcu.edu/servers/MoRFpred/. We describe this predictor, explain how to run the
web server, and show how to interpret the results it generates. We also demonstrate the utility of this web
server based on two case studies, focusing on the relevance of evolutionary conservation of MoRF regions.
Key words Intrinsic disorder, Prediction, Molecular recognition features, MoRFs, Protein-protein
interactions, MoRFpred
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_19, © Springer Science+Business Media, LLC, part of Springer Nature 2019
338 Christopher J. Oldfield et al.
2.1 Datasets
For training of MoRFpred, a set of MoRFs was constructed beginning
with known binding regions from the Protein Data Bank (PDB)
[28]. Bound peptides from the PDB were carefully filtered for clear
binding to a longer protein chain and mapped back to their source
proteins. This procedure resulted in a dataset of 842 MoRFs. To
avoid training and testing on similar proteins, these MoRFs were
grouped into 427 clusters and divided into testing and training sets.
This gave training and testing sets with 421 and 419 MoRFs,
respectively, with no protein more than 30% identical between the
two sets (see Note 1).
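The cluster-level split described above can be illustrated with a minimal sketch. It assumes that cluster assignments are already available (the clustering itself is not shown); the greedy balancing rule, the function name, and the toy identifiers are illustrative assumptions, not the authors' procedure.

```python
from collections import defaultdict

def split_by_cluster(cluster_of):
    """Assign whole clusters to 'train' or 'test' so no cluster spans both sets.

    cluster_of: dict mapping a MoRF identifier to its cluster identifier.
    Returns (train_ids, test_ids) as sorted lists.
    """
    members = defaultdict(list)
    for morf_id, cluster_id in cluster_of.items():
        members[cluster_id].append(morf_id)

    train, test = [], []
    # Place the largest clusters first; each goes to the currently smaller set
    # (assumed balancing rule, chosen only to keep the two sets similar in size).
    for cluster_id in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        target = train if len(train) <= len(test) else test
        target.extend(members[cluster_id])
    return sorted(train), sorted(test)

# Toy example with hypothetical identifiers.
toy = {"m1": "c1", "m2": "c1", "m3": "c2", "m4": "c3", "m5": "c3", "m6": "c4"}
train_ids, test_ids = split_by_cluster(toy)
```

Because clusters are never divided, similar MoRFs cannot end up on both sides of the split, which is the property the text relies on.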
A set of negative examples that, with near certainty, do not contain
MoRFs was constructed from protein chains that have been
completely structurally characterized by X-ray crystallography at
high resolution. The chance of intrinsic disorder in the negative set
was minimized by selecting only monomeric proteins without large
cofactors and with no residues missing due to lack of electron
density. Further, any protein with a significant amount of predicted
intrinsic disorder (>30% of residues) was discarded. Filtering for
proteins with less than 30% identity resulted in a set of 28 proteins.
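As a small illustration of the disorder filter applied to the negative set, the sketch below assumes that a per-residue binary disorder prediction is already available for each candidate protein. The function names are hypothetical, and which disorder predictor was used for this filter is not specified here.

```python
def disorder_fraction(disorder_flags):
    """Fraction of residues predicted as disordered (flags are booleans or 0/1)."""
    if not disorder_flags:
        raise ValueError("empty prediction")
    return sum(1 for flag in disorder_flags if flag) / len(disorder_flags)

def keep_as_negative(disorder_flags, max_fraction=0.30):
    """Keep a protein only if its predicted disorder does not exceed the cutoff."""
    return disorder_fraction(disorder_flags) <= max_fraction

mostly_ordered = [0] * 8 + [1] * 2   # 20% predicted disorder
too_disordered = [0] * 6 + [1] * 4   # 40% predicted disorder
```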
2.2 Architecture
MoRFpred is a support vector machine (SVM) over a rich feature
space merged with a sequence similarity-based prediction (Fig. 1).
Features considered for the linear-kernel SVM predictor included
five disorder prediction methods [29–32], relative solvent-accessible
surface prediction [33], B-factor prediction [34], PSI-BLAST-generated
position-specific scoring matrices (PSSMs), and amino
acid propensity scales from AAindex [35]. Two broad sets of features
were used from each of these methods: (1) per-residue values over a
window of 25 residues and (2) values aggregated over a window.
Aggregation methods included taking the difference between a window
of 25 residues and a smaller window, which captures the features
found to be useful for previous MoRF predictors. For example,
previous MoRF predictors relied on elevated predicted disorder
surrounding a predicted MoRF, but depressed values for the
MoRF region itself. Indeed, the corresponding difference-based
aggregation was found to be one of the strongest MoRF features.
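The difference-based aggregation can be sketched as follows. This is a minimal illustration under stated assumptions: the small window size (5), the truncation of windows at the sequence termini, and the sign convention (large window minus small window) are choices made for the example, not details taken from MoRFpred.

```python
def window_mean(values, center, width):
    """Mean of values in a window of `width` residues centred on `center`,
    truncated at the sequence termini."""
    half = width // 2
    lo = max(0, center - half)
    hi = min(len(values), center + half + 1)
    segment = values[lo:hi]
    return sum(segment) / len(segment)

def difference_feature(values, large=25, small=5):
    """Per-residue difference: mean over the large window minus mean over the small one."""
    return [window_mean(values, i, large) - window_mean(values, i, small)
            for i in range(len(values))]

# Hypothetical disorder profile: a dip of low disorder flanked by high disorder.
profile = [0.9] * 20 + [0.2] * 5 + [0.9] * 20
feature = difference_feature(profile)
```

With this convention, a residue in the middle of the dip receives a large positive value, while a flat profile gives zero everywhere, which mirrors the "elevated flanks, depressed core" pattern described above.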
Feature selection for the SVM predictor was based on a best-
first iterative addition of ranked features. Features were ranked
based on a combination of biserial correlations [36] and single-
feature predictive performance, where poorly correlated or poorly
performing features were removed from consideration. Iterative
addition of features was based on a modified fivefold cross-
validation procedure, where a feature was only added if it improved
prediction performance by at least 1%.
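The best-first iterative addition can be sketched generically. In this illustration, `evaluate` stands in for the cross-validated performance of a model trained on a given feature list; treating the 1% threshold as an absolute gain on a 0 to 1 scale, the zero baseline, and the toy scoring function are assumptions.

```python
def forward_select(ranked_features, evaluate, min_gain=0.01, baseline=0.0):
    """Walk features in rank order and keep one only if it raises the score
    returned by `evaluate` by at least `min_gain` over the current best."""
    selected = []
    best = baseline
    for feature in ranked_features:
        score = evaluate(selected + [feature])
        if score - best >= min_gain:
            selected.append(feature)
            best = score
    return selected, best

# Hypothetical additive contributions, used only to exercise the loop.
toy_gains = {"a": 0.60, "b": 0.005, "c": 0.03}

def toy_evaluate(features):
    return sum(toy_gains[f] for f in features)

chosen, final_score = forward_select(["a", "b", "c"], toy_evaluate)
```

Here feature "b" is rejected because it improves the score by less than the threshold, while "c" is still considered and accepted afterwards.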
Fig. 1 Architecture of MoRFpred. The input sequence is used to generate sequence properties, from which
input features are derived by windowed averaging. A support vector machine predicts MoRFs based on these
input features. The SVM prediction is merged with similarity-based predictions to produce the final MoRFpred
score, where scores above 0.5 are predicted MoRFs (M) and those less than 0.5 are predicted non-MoRFs (n)
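The thresholding step in the caption can be expressed in a few lines. This sketch assumes a list of per-residue scores; how a score of exactly 0.5 is labeled is not stated, so assigning it to "n" is an assumption, as are the function names.

```python
def label_residues(scores, threshold=0.5):
    """Return 'M' for predicted MoRF residues and 'n' otherwise."""
    return "".join("M" if s > threshold else "n" for s in scores)

def morf_regions(labels):
    """Return 1-based inclusive (start, end) spans of consecutive 'M' labels."""
    regions, start = [], None
    for i, label in enumerate(labels, start=1):
        if label == "M" and start is None:
            start = i
        elif label != "M" and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(labels)))
    return regions

labels = label_residues([0.1, 0.7, 0.9, 0.3, 0.6])
regions = morf_regions(labels)
```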
2.4 Web Server
The MoRFpred web server is freely available at
http://biomine.cs.vcu.edu/servers/MoRFpred/. The server can be accessed
with an Internet connection and any modern web browser. All
computations that are needed to complete predictions are performed on
the server side.
On our web server, sequences submitted for prediction will be
returned within 20 min of submission (see Note 2). The runtime of
MoRFpred is dominated by the PSI-BLAST search, whose
duration varies with protein length and database similarity.
The main server page is where proteins are submitted for
prediction. The web server only requires FASTA sequences of the
proteins of interest to perform MoRFpred predictions. Up to five
FASTA-formatted protein sequences may be entered into the large
text entry field per submission. An e-mail address is required for
each submission. All required programs for generating prediction
features, including PSI-BLAST, and disorder, RSA, and B-factor
predictions, are run automatically by scripts on the server. Upon
completion of predictions for each submission, the server will send
an e-mail notification with links to the prediction results.
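Before pasting sequences into the submission form, it can be useful to check locally that the input is valid FASTA and respects the five-sequence limit. The sketch below is a hypothetical pre-submission check; it does not contact the server, and the function names are not part of MoRFpred.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

MAX_SEQUENCES = 5  # per-submission limit stated in the text

def check_submission(text):
    """Return the parsed records, or raise ValueError if the input is unsuitable."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    if len(records) > MAX_SEQUENCES:
        raise ValueError(
            f"{len(records)} sequences given; at most {MAX_SEQUENCES} per submission")
    return records

example = check_submission(">p1\nMKV\nLLA\n>p2\nGG\n")
```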
2.5 Running MoRFpred
From the main server page, three steps are required to submit
sequences and obtain MoRFpred predictions (Fig. 2; the steps are
highlighted with red numbers corresponding to each step):
Fig. 2 Primary MoRFpred page, for submission of sequences for prediction. Red numbers indicate the
sequence of steps required to submit a prediction
2.6 MoRFpred Results
The results page includes a link to the raw results (Fig. 3, red 1) as
well as a color-coded text display of MoRFpred results (Fig. 3, red
2). The raw results file (results.csv) gives results for each submitted
sequence, each on three comma-delimited lines:
1. The input sequence: the FASTA header followed by each resi-
due of the input sequence.
Predicting Functions of Disordered Proteins 343
Fig. 3 MoRFpred prediction results page. Red numbers correspond to the primary features of the results page
3 Case Studies
3.1 Case Study: p53
Because of its crucial biological roles in regulation of apoptosis,
genomic stability, and inhibition of angiogenesis, as well as many
Fig. 5 Case study: p53. The correspondence between intrinsic disorder predictions (red line), sequence
conservation (blue line), binding regions (orange boxes), and predicted MoRF regions (green boxes) is shown.
Binding regions are discussed in the text. Sequence conservation is calculated from a set of p53 orthologs
(OrthoDB) as the relative profile entropy over maximum entropy-weighted sequence (large values indicate
greater conservation)
3.2 Case Study: RNase E
Endoribonucleases are hydrolytic enzymes that catalyze the
endonucleolytic cleavage of RNA, have various specificities, are
present in all organisms, and typically operate under tight
cellular regulation. Endoribonucleases are involved in the maturation,
modification, and degradation of different RNAs [69]. There
are at least five endoribonucleases in E. coli (RNases I*, III, E, G,
P). Among the various activities attributed to RNase E are processing
of transfer RNA, 9S ribosomal RNA, the catalytic RNA of RNase P, and
transfer/messenger RNA (t/mRNA) that rescues stalled ribosomes
[70–72], as well as general mRNA decay [73].
Being one of the larger E. coli proteins, RNase E consists of
1061 amino acid residues [74, 75]. There are two functionally
different domains in this protein, the catalytic N-terminal domain
(NTD; residues 1–498) and the regulatory C-terminal domain
(CTD; residues 499–1061) [76–78]. Although the NTD is rela-
tively conserved and has numerous homologues [79], there is little
sequence conservation in the CTD [80], which is also characterized
by low sequence complexity. The purified CTD was shown to be
mostly disordered by a set of biophysical techniques, such as limited
proteolysis, SDS–PAGE, SAXS, and far-UV CD [81]. Despite
being highly disordered, the CTD was shown to interact with
other degradosome components and with structured RNA
[81]. In agreement with these experimental data, computational
analysis clearly indicated that the NTD of RNase E was expected to
be mostly structured, whereas the CTD had characteristics of a
highly disordered protein [81].
The CTD is highly disordered, which is in agreement with the
high values of the putative propensities for disorder generated for
this protein with VSL2B [68] (see Fig. 6, red line). The CTD is also
characterized by the presence of four regions of increased structural
propensity (labeled as segments A, B, C, and D, respectively),
which correspond to MoRFs. The four MoRFs were correctly
identified by the MoRFpred method (green boxes). Importantly,
all these segments are related to various biological activities of
RNase E, such as membrane targeting and CTD self-association
(segment A corresponding to residues 565–585) or interactions
with the components of the RNA degradosome, helicase
(segment B, which is a portion of the arginine-rich domain (resi-
dues 628–843)) [78, 82], enolase (segment C (residues 833–850))
[81], and polynucleotide phosphorylase PNPase (segment D,
RNase E residues 1021–1061) [81]. As in the case of p53,
some of the MoRF regions (see Fig. 6, segments C and D) are
concomitant with a substantial decrease in the putative propensity
for disorder (red line), but the remaining two regions do not
register these dips. However, MoRFpred is still capable of identify-
ing these MoRF regions, in spite of their high propensity for
disorder and lack of conservation (blue line).
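A comparison of predicted MoRF regions with annotated binding segments of this kind can be sketched as a simple interval-overlap check. The segment boundaries below are those quoted in the text; the predicted regions are hypothetical placeholders for illustration only and are not MoRFpred output for RNase E.

```python
def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

# Segment boundaries quoted in the text. Segment B is omitted because only its
# enclosing arginine-rich domain (residues 628-843) is given.
known_segments = {"A": (565, 585), "C": (833, 850), "D": (1021, 1061)}

def segments_hit(predicted_regions, segments=known_segments):
    """Names of annotated segments overlapped by at least one predicted region."""
    return sorted(name for name, seg in segments.items()
                  if any(overlap(seg, region) > 0 for region in predicted_regions))

# Hypothetical predicted regions (illustration only).
example_predictions = [(570, 580), (900, 910), (1050, 1061)]
hits = segments_hit(example_predictions)
```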
[Fig. 6 plot: tracks for RNA binding, binding regions (segments A, B, C, D), and predicted MoRFs; vertical axes: disorder score and conservation score; horizontal axis: residue index, 450–1,050]
Fig. 6 Case study: RNase E. The correspondence between intrinsic disorder predictions (red line), sequence
conservation (blue line), binding regions (orange boxes), and predicted MoRF regions (green boxes) is shown.
Binding regions are discussed in the text. Sequence conservation is calculated from a set of RNase E orthologs
(OrthoDB) as the relative profile entropy over maximum entropy-weighted sequence (large values indicate
greater conservation)
4 Notes
References
1. Wang C, Uversky VN, Kurgan L (2016) Disor- Characterization of molecular recognition fea-
dered nucleiome: abundance of intrinsic disor- tures, MoRFs, and their binding partners. J
der in the DNA- and RNA-binding proteins in Proteome Res 6(6):2351–2366
1121 species from Eukaryota, Bacteria and 13. Oldfield CJ, Cheng Y, Cortese MS, Romero P,
Archaea. Proteomics 16(10):1486–1498 Uversky VN, Dunker AK (2005) Coupled
2. Peng Z, Yan J, Fan X, Mizianty MJ, Xue B, folding and binding with alpha-helix-forming
Wang K, Hu G, Uversky VN, Kurgan L molecular recognition elements. Biochemistry
(2015) Exceptionally abundant exceptions: 44(37):12454–12470
comprehensive characterization of intrinsic dis- 14. Yan J, Dunker AK, Uversky VN, Kurgan L
order in all domains of life. Cell Mol Life Sci 72 (2016) Molecular recognition features
(1):137–151 (MoRFs) in three domains of life. Mol BioSyst
3. Habchi J, Tompa P, Longhi S, Uversky VN 12(3):697–710
(2014) Introducing protein intrinsic disorder. 15. Cheng Y, Oldfield CJ, Meng J, Romero P,
Chem Rev 114(13):6561–6588 Uversky VN, Dunker AK (2007) Mining
4. Dunker AK, Babu MM, Barbar E, α-helix-forming molecular recognition features
Blackledge M, Bondos SE, Dosztányi Z, with cross species sequence alignments. Bio-
Dyson HJ, Forman-Kay J, Fuxreiter M, chemistry 46(47):13468–13477
Gsponer J, Han K-H, Jones DT, Longhi S, 16. Malhis N, Gsponer J (2015) Computational
Metallo SJ, Nishikawa K, Nussinov R, identification of MoRFs in protein sequences.
Obradovic Z, Pappu RV, Rost B, Selenko P, Bioinformatics 31(11):1738–1744
Subramaniam V, Sussman JL, Tompa P, 17. Disfani FM, Hsu WL, Mizianty MJ, Oldfield
Uversky VN (2013) What’s in a name? Why CJ, Xue B, Dunker AK, Uversky VN, Kurgan L
these proteins are intrinsically disordered. (2012) MoRFpred, a computational tool for
Intrinsically Disord Proteins 1(1):e24157 sequence-based prediction and characteriza-
5. Brown CJ, Takayama S, Campen AM, Vise P, tion of short disorder-to-order transitioning
Marshall TW, Oldfield CJ (2002) Evolutionary binding regions in proteins. Bioinformatics 28
rate heterogeneity in proteins with long disor- (12):i75–i83
dered regions. J Mol Evol 55:104 18. Malhis N, Jacobson M, Gsponer J (2016)
6. Meszaros B, Tompa P, Simon I, Dosztanyi Z MoRFchibi SYSTEM: software tools for the
(2007) Molecular principles of the interactions identification of MoRFs in protein sequences.
of disordered proteins. J Mol Biol 372 Nucleic Acids Res 44:W488
(2):549–561 19. Jones DT, Cozzetto D (2015) DISOPRED3:
7. Trudeau T, Nassar R, Cumberworth A, Wong precise disordered region predictions with
ET, Woollard G, Gsponer J (2013) Structure annotated protein-binding activity. Bioinfor-
and intrinsic disorder in protein autoinhibition. matics 31(6):857–863
Structure 21(3):332–341 20. Fang C, Noguchi T, Tominaga D, Yamana H
8. Varadi M, Guharoy M, Zsolyomi F, Tompa P (2013) MFSPSSMpred: identifying short
(2015) DisCons: a novel tool to quantify and disorder-to-order binding regions in disor-
classify evolutionary conservation of intrinsic dered proteins based on contextual local evolu-
protein disorder. BMC Bioinformatics 16 tionary conservation. BMC Bioinformatics
(1):153 14:300
9. Ait-Bara S, Carpousis AJ, Quentin Y (2015) 21. Xue B, Dunker AK, Uversky VN (2010) Retro-
RNase E in the gamma-Proteobacteria: conser- MoRFs: identifying protein binding sites by
vation of intrinsically disordered noncatalytic normal and reverse alignment and intrinsic dis-
region and molecular evolution of microdo- order prediction. Int J Mol Sci 11
mains. Mol Genet Genomics 290(3):847–862 (10):3725–3747
10. Davey NE, Cyert MS, Moses AM (2015) Short 22. Puntervoll P, Linding R, Gemünd C,
linear motifs – ex nihilo evolution of protein Chabanis-Davidson S, Mattingsdal M,
regulation. Cell Commun Signal 13(1):43 Cameron S, Martin DMA, Ausiello G,
11. Mohan A, Oldfield CJ, Radivojac P, Vacic V, Brannetti B, Costantini A, Ferrè F, Maselli V,
Cortese MS, Dunker AK, Uversky VN (2006) Via A, Cesareni G, Diella F, Superti-Furga G,
Analysis of molecular recognition features Wyrwicz L, Ramu C, McGuigan C,
(MoRFs). J Mol Biol 362(5):1043–1059 Gudavalli R, Letunic I, Bork P, Rychlewski L,
12. Vacic V, Oldfield CJ, Mohan A, Radivojac P, Küster B, Helmer-Citterich M, Hunter WN,
Cortese MS, Uversky VN, Dunker AK (2007) Aasland R, Gibson TJ (2003) ELM server: a
new resource for investigating short functional 35. Kawashima S, Pokarowski P, Pokarowska M,
sites in modular eukaryotic proteins. Nucleic Kolinski A, Katayama T, Kanehisa M (2008)
Acids Res 31(13):3625–3630 AAindex: amino acid index database, progress
23. Meszaros B, Dosztanyi Z, Simon I (2012) Dis- report 2008. Nucleic Acids Res 36(Database
ordered binding regions and linear motifs-- issue):D202–D205
bridging the gap between two models of 36. Tate RF (1954) Correlation between a discrete
molecular recognition. PLoS One 7(10): and a continuous variable. Point-Biserial corre-
e46829 lation. Ann Math Statist 25(3):603–607
24. Peng Z, Wang C, Uversky VN, Kurgan L 37. Zhao R, Gish K, Murphy M, Yin Y,
(2017) Prediction of disordered RNA, DNA, Notterman D, Hoffman WH, Tom E, Mack
and protein binding regions using DisoRDP- DH, Levine AJ (2000) Analysis of
bind. Methods Mol Biol 1484:187–203 p53-regulated gene expression patterns using
25. Meszaros B, Simon I, Dosztanyi Z (2009) Pre- oligonucleotide arrays. Genes Dev 14
diction of protein binding regions in disor- (8):981–993
dered proteins. PLoS Comput Biol 5(5): 38. Balint EE, Vousden KH (2001) Activation and
e1000376 activities of the p53 tumour suppressor pro-
26. Dosztanyi Z, Meszaros B, Simon I (2009) tein. Br J Cancer 85(12):1813–1823
ANCHOR: web server for predicting protein 39. el-Deiry WS (1998) Regulation of p53 down-
binding regions in disordered proteins. Bioin- stream genes. Semin Cancer Biol 8
formatics 25(20):2745–2746 (5):345–357
27. Khan W, Duffy F, Pollastri G, Shields DC, 40. Yu J, Zhang L, Hwang PM, Rago C, Kinzler
Mooney C (2013) Predicting binding within KW, Vogelstein B (1999) Identification and
disordered protein regions to structurally char- classification of p53-regulated genes. Proc
acterised peptide-binding domains. PLoS One Natl Acad Sci U S A 96(25):14517–14522
8(9):e72838 41. Sax JK, El-Deiry WS (2003) p53-induced gene
28. Berman HM, Westbrook J, Feng Z, expression analysis. Methods Mol Biol
Gilliland G, Bhat TN, Weissig H, Shindyalov 234:65–71
IN, Bourne PE (2000) The protein data bank. 42. Fridman JS, Lowe SW (2003) Control of apo-
Nucleic Acids Res 28(1):235–242 ptosis by p53. Oncogene 22(56):9030–9040
29. Dosztanyi Z, Csizmok V, Tompa P, Simon I 43. Anderson CW, Appella E (2004) Signaling to
(2005) IUPred: web server for the prediction the p53 tumor suppressor through pathways
of intrinsically unstructured regions of proteins activated by genotoxic and nongenotoxic
based on estimated energy content. Bioinfor- stress. In: Bradshaw RA, Dennis EA (eds)
matics 21(16):3433–3434 Handbook of cell signaling. Academic Press,
30. Ward JJ, McGuffin LJ, Bryson K, Buxton BF, New York, pp 237–247
Jones DT (2004) The DISOPRED server for 44. Gottlieb TM, Leal JF, Seger R, Taya Y, Oren M
the prediction of protein disorder. Bioinfor- (2002) Cross-talk between Akt, p53 and
matics 20(13):2138–2139 Mdm2: possible implications for the regulation
31. McGuffin LJ (2008) Intrinsic disorder predic- of apoptosis. Oncogene 21(8):1299–1303
tion from the analysis of multiple protein fold 45. Nicholson KM, Anderson NG (2002) The pro-
recognition models. Bioinformatics 24 tein kinase B/Akt signalling pathway in human
(16):1798–1804 malignancy. Cell Signal 14(5):381–395
32. Mizianty MJ, Stach W, Chen K, Kedarisetti 46. Abraham AG, O’Neill E (2014) PI3K/Akt-
KD, Disfani FM, Kurgan L (2010) Improved mediated regulation of p53 in cancer. Biochem
sequence-based prediction of disordered Soc Trans 42(4):798–803
regions with multilayer fusion of multiple 47. Muller PA, Vousden KH (2013) p53 mutations
information sources. Bioinformatics 26(18): in cancer. Nat Cell Biol 15(1):2–8
i489–i496
48. Soussi T, Beroud C (2001) Assessing TP53
33. Faraggi E, Xue B, Zhou Y (2009) Improving status in human tumours to evaluate clinical
the prediction accuracy of residue solvent outcome. Nat Rev Cancer 1(3):233–240
accessibility and real-value backbone torsion
angles of proteins by guided-learning through 49. Bookstein R (1994) Tumor suppressor genes
a two-layer neural network. Proteins 74 in prostatic oncogenesis. J Cell Biochem Suppl
(4):847–856 19:217–223
34. Schlessinger A, Yachdav G, Rost B (2006) 50. Pencik J, Wiebringhaus R, Susani M, Culig Z,
PROFbval: predict flexible and rigid residues Kenner L (2015) IL-6/STAT3/ARF: the
in proteins. Bioinformatics 22(7):891–893 guardians of senescence, cancer progression
and metastasis in prostate cancer. Swiss Med 63. Lowe ED, Tews I, Cheng KY, Brown NR,
Wkly 145:w14215 Gul S, Noble ME, Gamblin SJ, Johnson LN
51. Wolff JM, Stephenson RN, Jakse G, Habib FK (2002) Specificity determinants of recruitment
(1994) Retinoblastoma and p53 genes as prog- peptides bound to phospho-CDK2/cyclin
nostic indicators in urological oncology. Urol A. Biochemistry 41(52):15625–15634
Int 53(1):1–5 64. Avalos JL, Celic I, Muhammad S, Cosgrove
52. Joerger AC, Ang HC, Veprintsev DB, Blair MS, Boeke JD, Wolberger C (2002) Structure
CM, Fersht AR (2005) Structures of p53 can- of a Sir2 enzyme bound to an acetylated p53
cer mutants and mechanism of rescue by peptide. Mol Cell 10(3):523–535
second-site suppressor mutations. J Biol 65. Mujtaba S, He Y, Zeng L, Yan S, Plotnikova O,
Chem 280(16):16030–16037 Sachchidanand SR, Zeleznik-Le NJ, Ronai Z,
53. Canadillas JM, Tidow H, Freund SM, Ruther- Zhou MM (2004) Structural mechanism of the
ford TJ, Ang HC, Fersht AR (2006) Solution bromodomain of the coactivator CBP in p53
structure of p53 core domain: structural basis transcriptional activation. Mol Cell 13
for its instability. Proc Natl Acad Sci U S A 103 (2):251–263
(7):2109–2114 66. Rustandi RR, Baldisseri DM, Weber DJ (2000)
54. Wang Y, Rosengarth A, Luecke H (2007) Structure of the negative regulatory domain of
Structure of the human p53 core domain in p53 bound to S100B(betabeta). Nat Struct
the absence of DNA. Acta Crystallogr D Biol Biol 7(7):570–574
Crystallogr 63(Pt 3):276–281 67. Oldfield CJ, Meng J, Yang JY, Yang MQ,
55. Joerger AC, Fersht AR (2008) Structural biol- Uversky VN, Dunker AK (2008) Flexible
ogy of the tumor suppressor p53. Annu Rev nets: disorder and induced fit in the associa-
Biochem 77:557–582 tions of p53 and 14-3-3 with their partners.
56. Uversky VN, Oldfield CJ, Midic U, Xie H, BMC Genomics 9(Suppl 1):S1
Xue B, Vucetic S, Iakoucheva LM, 68. Peng K, Radivojac P, Vucetic S, Dunker AK,
Obradovic Z, Dunker AK (2009) Unfoldomics Obradovic Z (2006) Length-dependent pre-
of human diseases: linking protein intrinsic dis- diction of protein intrinsic disorder. BMC Bio-
order with diseases. BMC Genomics 10(Suppl informatics 7:208
1):S7 69. Ehretsmann CP, Carpousis AJ, Krisch HM
57. Bianco R, Ciardiello F, Tortora G (2005) Che- (1992) Specificity of Escherichia coli endoribo-
mosensitization by antisense oligonucleotides nuclease RNase E: in vivo and in vitro analysis
targeting MDM2. Curr Cancer Drug Targets of mutants in a bacteriophage T4 mRNA pro-
5(1):51–56 cessing site. Genes Dev 6(1):149–159
58. Moll UM, Petrenko O (2003) The MDM2- 70. Huang H, Liao J, Cohen SN (1998) Poly(A)-
p53 interaction. Mol Cancer Res 1 and poly(U)-specific RNA 3′ tail shortening by
(14):1001–1008 E. coli ribonuclease E. Nature 391
59. Nag S, Qin J, Srivenugopal KS, Wang M, (6662):99–102
Zhang R (2013) The MDM2-p53 pathway 71. Kushner SR (2002) mRNA decay in Escheri-
revisited. J Biomed Res 27(4):254–271 chia coli comes of age. J Bacteriol 184
60. Kussie PH, Gorina S, Marechal V, Elenbaas B, (17):4658–4665 discussion 4657
Moreau J, Levine AJ, Pavletich NP (1996) 72. Ow MC, Kushner SR (2002) Initiation of
Structure of the MDM2 oncoprotein bound tRNA maturation by RNase E is essential for
to the p53 tumor suppressor transactivation cell viability in E. coli. Genes Dev 16
domain. Science 274(5289):948–953 (9):1102–1115
61. Bochkareva E, Kaustov L, Ayed A, Yi GS, Lu Y, 73. Steege DA (2000) Emerging features of
Pineda-Lucena A, Liao JC, Okorokov AL, mRNA decay in bacteria. RNA 6
Milner J, Arrowsmith CH, Bochkarev A (8):1079–1090
(2005) Single-stranded DNA mimicry in the 74. Casaregola S, Jacq A, Laoudj D, McGurk G,
p53 transactivation domain interaction with Margarson S, Tempete M, Norris V, Holland
replication protein A. Proc Natl Acad Sci U S IB (1992) Cloning and analysis of the entire
A 102(43):15412–15417 Escherichia coli ams gene. ams is identical to
62. Mora P, Carbajo RJ, Pineda-Lucena A, Sanchez hmp1 and encodes a 114 kDa protein that
del Pino MM, Perez-Paya E (2008) Solvent- migrates as a 180 kDa protein. J Mol Biol 228
exposed residues located in the beta-sheet (1):30–40
modulate the stability of the tetramerization 75. Claverie-Martin F, Diaz-Torres MR, Yancey
domain of p53--a structural and combinatorial SD, Kushner SR (1991) Analysis of the altered
approach. Proteins 71(4):1670–1685 mRNA stability (ams) gene from Escherichia
coli. Nucleotide sequence, transcriptional anal- of 16S rRNA. Biochem Biophys Res Commun
ysis, and homology of its product to MRP3, a 259(2):483–488
mitochondrial ribosomal protein from Neuros- 80. Kaberdin VR, Miczak A, Jakobsen JS,
pora crassa. J Biol Chem 266(5):2843–2851 Lin-Chao S, McDowall KJ, von Gabain A
76. Lopez PJ, Marchand I, Joyce SA, Dreyfus M (1998) The endoribonucleolytic N-terminal
(1999) The C-terminal half of RNase E, which half of Escherichia coli RNase E is evolution-
organizes the Escherichia coli degradosome, arily conserved in Synechocystis sp. and other
participates in mRNA degradation but not bacteria but not the C-terminal half, which is
rRNA processing in vivo. Mol Microbiol 33 sufficient for degradosome assembly. Proc Natl
(1):188–199 Acad Sci U S A 95(20):11637–11642
77. Cohen SN, McDowall KJ (1997) RNase E: still 81. Callaghan AJ, Aurikko JP, Ilag LL, Gunter
a wonderfully mysterious enzyme. Mol Micro- Grossmann J, Chandran V, Kuhnel K,
biol 23(6):1099–1106 Poljak L, Carpousis AJ, Robinson CV, Sym-
78. McDowall KJ, Cohen SN (1996) The mons MF, Luisi BF (2004) Studies of the
N-terminal domain of the rne gene product RNA degradosome-organizing domain of the
has RNase E activity and is non-overlapping Escherichia coli ribonuclease RNase E. J Mol
with the arginine-rich RNA-binding site. J Biol 340(5):965–979
Mol Biol 255(3):349–355 82. Taraseviciene L, Bjork GR, Uhlin BE (1995)
79. Wachi M, Umitsuki G, Shimizu M, Takada A, Evidence for an RNA binding region in the
Nagai K (1999) Escherichia coli cafA gene Escherichia coli processing endoribonuclease
encodes a novel RNase, designated as RNase E. J Biol Chem 270(44):26391–26398
RNase G, involved in processing of the 5′ end
Chapter 20
Abstract
The native state of proteins is composed of conformers in dynamical equilibrium. In this chapter, different
issues related to conformational diversity are explored using a curated and experimentally based database
called CoDNaS (Conformational Diversity in the Native State). This database is a collection of redundant
structures for the same sequence. CoDNaS estimates the degree of conformational diversity using different
global and local structural similarity measures. It allows the user to explore how structural differences
among conformers change as a function of several structural features providing further biological informa-
tion. This chapter explores the measurement of conformational diversity and its relationship with sequence
divergence. Also, it discusses how proteins with high conformational diversity could affect homology
modeling techniques.
Key words Conformational diversity, CoDNaS database, Conformers, Native state, Protein dynam-
ics, Protein evolution
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_20, © Springer Science+Business Media, LLC, part of Springer Nature 2019
354 Alexander Miguel Monzon et al.
2 Methods
Table 1
Databases of protein conformational diversity
2.2.1 Database Implementation, Biological Annotation, and External Links
Different conformers for each protein were identified and extracted
from the PDB using the following protocol:
– BLASTClust [25] was run against all protein chains deposited in
the PDB to obtain all available clusters at 95% local sequence
identity with a minimum coverage of 0.90 between all the
sequences in the cluster. The 95% limit was set to accommodate
putative sequence variations for a given protein. However, to
avoid the inclusion of homologous structures in a given CoDNaS
entry, UniProt accession numbers were used to check that
all conformers belong to the same protein.
– Only clusters with at least two structures, and with a resolution
better than 4.00 Å for each of the crystallographic structures,
were considered.
– To estimate the structural dissimilarity between conformers in
each cluster, the C-alpha root mean square deviation (RMSD) was
calculated with MAMMOTH (see Note 3) [26] for all possible
pairs of conformers of each protein. The maximum C-alpha
RMSD value for each protein entry was recorded as a measure
of the extent of its conformational diversity.
– Additionally, all conformers for a given protein were clustered
using a hierarchical procedure according to the RMSD values
between them. This enables users to identify different confor-
mational substates present in the native state of the protein.
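The clustering step in the protocol above can be sketched with a small agglomerative procedure. This is an illustration under stated assumptions: the pairwise RMSD values are taken as given (for example, as reported by a structural alignment program), average linkage and the distance cutoff are choices made for the example, and the conformer names and values are hypothetical. The linkage criterion actually used by CoDNaS is not specified in the text.

```python
from itertools import combinations

def average_linkage(names, rmsd, cutoff):
    """Agglomerative clustering of conformers.

    rmsd: dict mapping frozenset({a, b}) to the RMSD (in Å) between a and b.
    Merging stops when the two closest clusters are farther apart than
    `cutoff` (average RMSD between their members).
    """
    clusters = [[name] for name in names]

    def dist(c1, c2):
        total = sum(rmsd[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        if dist(clusters[i], clusters[j]) > cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c) for c in clusters)

# Hypothetical conformers and RMSD values (Å), for illustration only.
toy_rmsd = {
    frozenset(("x1", "x2")): 0.5, frozenset(("n1", "n2")): 0.8,
    frozenset(("x1", "n1")): 3.0, frozenset(("x1", "n2")): 3.1,
    frozenset(("x2", "n1")): 2.9, frozenset(("x2", "n2")): 3.2,
}
substates = average_linkage(["x1", "x2", "n1", "n2"], toy_rmsd, cutoff=1.5)
```

Each resulting group plays the role of a candidate conformational substate; lowering the cutoff splits the ensemble into finer groups.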
2.2.2 Working Case: Conformational Diversity of the Ephrin Type-A Receptor 4
Human ephrin type-A receptor 4 (EphA4) is a tyrosine kinase
receptor. Eph receptors and their ephrin ligands are both anchored
onto the plasma membrane and are subdivided into two subclasses
(A and B) based on their sequence conservation and binding
preferences [32]. In general, type-A receptors bind to ephrin, but
EphA4 in particular is the only receptor capable of binding to all nine
ephrins, as well as other small molecules, through overlapping
interfaces. The binding pattern of EphA4 can be explained by exploring
its ensemble of conformers. EphA4 has two groups of conformers, closed
and open forms, which have been biologically characterized and
identified by molecular dynamics simulations and NMR studies [33]. Hence,
the open and closed conformations of EphA4 can be easily explored
in CoDNaS using the information provided by the hierarchical
clustering based on the RMSD values between all pairs of conformers
(Fig. 1). It is interesting to note the differences between the
29 conformers available in CoDNaS: ten were obtained by
nuclear magnetic resonance (NMR) and 19 by X-ray diffraction. The
protein can be found in CoDNaS by searching for its UniProt
accession number “P54764”, which leads to the entry page (the protein
pool identifier in CoDNaS is “2WO1_A”). The entry page includes
a set of boxes with different information about the protein, such as
protein overview, structural information, conformers, clusters of
conformational states, and information about the pair of maximum
conformational diversity. EphA4 has a maximum conformational
diversity of RMSD = 3.23 Å between the structures 2WO3 chain A
and 2LW8 chain A, model 7.
Fig. 1 Dendrogram of the EphA4 conformations. We can observe different conformational substates due to the
experimental method used and transitions between open (red) and closed (blue) conformations. Filled nodes
indicate that the conformer has a bound ligand
Exploring Protein Conformational Diversity 357
Fig. 2 Comparison between conformers of the EphA4 based on clustering information. (a) Superimposition of
ten conformers from the NMR ensemble (PDB code = 2LW8). (b) Superimposition of 16 closed conformations
(blue) of the EphA4. (c) Superimposition of three open conformations (red) of the EphA4
Figure 1 shows two main groups at the top, one containing all
NMR conformers and the other containing all X-ray conformers
(see also Note 1). Among the group of X-ray conformers, we can
observe two branches which separate open and closed conforma-
tions of the EphA4. Filled nodes indicate conformers in complex
with the ligand. Superimposition of these three different groups
(NMR, X-ray closed, and X-ray open) reveals a high conformational
variability in the regions of the B–C, D–E, G–H, and J–K loops (see
Fig. 2) [34]. In particular, the flexibility of the D–E and J–K loops,
which move upon binding to ephrin ligands, may be directly asso-
ciated with EphA4 function and binding pattern.
2.3 Practical Issues Concerning Conformational Diversity
2.3.1 How Large Are Conformational Changes in Known Structural Space?
The extent of conformational diversity was studied in a curated
dataset (see Note 2) of ~5000 proteins with more than 5 conformers
(see Note 6) per protein [35]. This study found three protein classes
based on their dynamical behavior: rigid, malleable, and partially
disordered proteins. Approximately 60% of the analyzed proteins
belong to the first group, the rigid proteins. The conformational
diversity of each protein was measured as the maximum RMSD (see
Note 5) over all-versus-all pairwise comparisons of its conformers.
The RMSD distribution of rigid proteins peaks at 0.8 Å, a value close
to the crystallographic error, which is near 0.5 Å (see Fig. 3). This
result agrees with earlier studies that found a positively skewed
distribution of RMSD [19, 36]. It also agrees with previous work that
found an average RMSD of 0.5 Å between structures of the same protein
in unbound states, a value slightly different from that observed
between apo and substrate-bound forms [37]. Apparently, large-scale
protein motions are not necessary to sustain biological function in
the majority of the studied proteins. This observation is supported by
the finding that even small changes between conformers can greatly
affect the catalytic parameters and biological behavior of
enzymes [38, 39].
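The measure described above (maximum RMSD over all conformer pairs) can be sketched in a few lines. This is a minimal illustration, not the CoDNaS implementation: it assumes every conformer is given as an equal-length list of coordinates for the same residues and that the conformers have already been superimposed by a structural alignment step. The conformer names and coordinates are invented toy data.

```python
import math
from itertools import combinations

def rmsd(a, b):
    """RMSD between two equal-length lists of (x, y, z) coordinates.
    Assumes the two conformers are already superimposed."""
    if len(a) != len(b) or not a:
        raise ValueError("conformers must cover the same non-empty set of positions")
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(squared / len(a))

def conformational_diversity(conformers):
    """Return (maximum RMSD, pair of conformer ids) over all-versus-all pairs."""
    best_value, best_pair = 0.0, None
    for (id1, c1), (id2, c2) in combinations(conformers.items(), 2):
        value = rmsd(c1, c2)
        if value > best_value:
            best_value, best_pair = value, (id1, id2)
    return best_value, best_pair

# Toy data: three hypothetical conformers with two atoms each.
toy = {
    "conf_A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "conf_B": [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)],
    "conf_C": [(0.0, 0.0, 0.0), (1.0, 3.0, 0.0)],
}
max_rmsd, max_pair = conformational_diversity(toy)
print(max_pair, round(max_rmsd, 2))
```

The pair returned alongside the value corresponds to what the entry page reports as the pair of maximum conformational diversity.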
358 Alexander Miguel Monzon et al.
2.3.2 Which Kind of Proteins Have Larger Conformational Changes?
The tail of the distribution shown in Fig. 3 consists mainly of IDPs,
in particular malleable and partially disordered proteins, with a
minor proportion of globular or ordered proteins [35]. It is important
to note that IDPs contain very flexible regions which often appear as
missing residues in the structures derived from crystallographic
studies. Almost half of these IDPs show order-disorder transitions;
that is, they have regions that are disordered in one group of
conformers but ordered in alternative conformations. Surprisingly,
regions gaining order upon ligand binding are almost as common as
regions that become disordered upon binding. IDPs showing
order-disorder transitions reach the highest RMSD values in their
aligned ordered regions [17]. The high RMSD values between conformers
are related to the increase of structural differences in the globular
or ordered region of IDPs. These differences can be high due to very
flexible loops or regions adopting variable conformations (e.g.,
malleable and partially disordered proteins).
In reference to globular or ordered proteins, large conformational
movements have been previously described by M. Gerstein in
2.3.3 Importance of the Conformational Diversity in Homology Modeling
Template-based modeling (TBM) is based on the fact that homologous
proteins with detectable sequence similarity possess similar 3D
structures. Pioneering work by Chothia and Lesk found that structural
divergence increases with evolutionary distance, measured as identity
percentage, following a nonlinear relationship [51]. Very similar
sequences show modest structural differences, which suddenly increase
when the percentage of sequence identity drops below 30%. Their
results and conclusions have been verified by numerous studies
[52–56]. These studies found moderate-to-high correlation coefficients
between different parameters related to structural and sequence
similarity, i.e., RMSD versus identity percentage and evolutionary
distance. They also found linear and nonlinear behavior, and an
invariably low structural variation at 100% identity (~0.5 Å).
However, when conformational diversity is taken into account, the
relationship between sequence and structural divergence is more
complex [57]. Figure 4 shows how RMSDs between homologous proteins
change as a function of identity percentage. This figure was derived
from an all-versus-all pairwise alignment between all the conformers
for 2024 proteins from 524 families. At around 100% identity (the
conformational diversity of the protein) (see Note 4), several
proteins show RMSDs as high as those reached by sequence divergence
during evolution (say about 30–40% identity). This means that
structural divergence is a complex process, since a given sequence (at
100% identity) could reach several angstroms of conformational
(structural) diversity. Interestingly, if we split the population of
proteins according to their degree of conformational diversity (rigid
and highly dynamic proteins), we can observe in Fig. 5 that rigid
proteins could certainly be more suitable for TBM methodologies than
highly dynamic ones. The rigid proteins show an average RMSD of 0.39 Å
at 100% identity, meaning that more similar sequences have more
similar structures. This last statement, basic to TBM reliability,
apparently does not hold for highly dynamic proteins (average RMSD at
100% identity of 1.17 Å).
Fig. 4 RMSD versus percent of sequence identity. RMSD values were obtained from an all-versus-all
comparison between two homologous proteins considering all their conformers. The figure contains about
3.5 million comparisons
Fig. 5 Maximum RMSD versus sequence percent identity. Points refer to the maximum RMSD obtained from
an all-versus-all comparison between conformers from two homologous proteins. Red dots are pairs of highly
dynamic homologous proteins (conformational diversity >0.5 Å) and blue dots are pairs of rigid proteins
(conformational diversity ≤0.5 Å)
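The split into rigid and highly dynamic proteins can be illustrated with a short sketch. Only the 0.5 Å cutoff comes from the Fig. 5 caption; the record layout (the keys `identity`, `diversity`, and `max_rmsd`) and all values are hypothetical and serve only to show the grouping and averaging.

```python
from statistics import mean

THRESHOLD = 0.5  # Å, the cutoff quoted in the Fig. 5 caption

def split_by_diversity(pairs, threshold=THRESHOLD):
    """Group records into 'rigid' (<= threshold) and 'dynamic' (> threshold)."""
    groups = {"rigid": [], "dynamic": []}
    for pair in pairs:
        key = "rigid" if pair["diversity"] <= threshold else "dynamic"
        groups[key].append(pair)
    return groups

def mean_rmsd_at_full_identity(pairs):
    """Average the maximum RMSD over records at 100% sequence identity."""
    values = [p["max_rmsd"] for p in pairs if p["identity"] >= 100.0]
    return mean(values) if values else None

# Invented toy records, not data from the study.
toy_pairs = [
    {"identity": 100.0, "diversity": 0.3, "max_rmsd": 0.4},
    {"identity": 100.0, "diversity": 0.5, "max_rmsd": 0.2},
    {"identity": 100.0, "diversity": 1.8, "max_rmsd": 1.5},
    {"identity": 35.0, "diversity": 0.2, "max_rmsd": 2.1},
]
groups = split_by_diversity(toy_pairs)
rigid_mean = mean_rmsd_at_full_identity(groups["rigid"])
dynamic_mean = mean_rmsd_at_full_identity(groups["dynamic"])
```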
3 Notes
Acknowledgments
References
1. Gerstein M, Lesk AM, Chothia C (1994) Structural mechanisms for domain movements in proteins. Biochemistry 33:6739–6749
2. Gerstein M, Krebs W (1998) A database of macromolecular motions. Nucleic Acids Res 26:4280–4290
3. Gu Y, Li D-W, Brüschweiler R (2015) Decoding the mobility and time scales of protein loops. J Chem Theory Comput 11:1308–1314
4. Gora A, Brezovsky J, Damborsky J (2013) Gates of enzymes. Chem Rev 113:5871–5923
5. Perutz MF, Bolton W, Diamond R et al (1964) Structure of haemoglobin. An X-ray examination of reduced horse haemoglobin. Nature 203:687–690
6. Popovych N, Sun S, Ebright RH et al (2006) Dynamically driven protein allostery. Nat Struct Mol Biol 13:831–838
7. Dunker AK, Keith Dunker A, Silman I et al (2008) Function and structure of inherently disordered proteins. Curr Opin Struct Biol 18:756–764
8. Boehr DD, McElheny D, Dyson HJ et al (2006) The dynamic energy landscape of dihydrofolate reductase catalysis. Science 313:1638–1642
9. Tsai CJ, Del Sol A, Nussinov R (2009) Protein allostery, signal transmission and dynamics: a classification scheme of allosteric mechanisms. Mol BioSyst 5:207–216
10. Hilser VJ (2010) Biochemistry. An ensemble view of allostery. Science 327:653–654
11. James LC, Roversi P, Tawfik DS (2003) Antibody multispecificity mediated by conformational diversity. Science 299:1362–1367
12. Smock RG, Gierasch LM (2009) Sending signals dynamically. Science 324:198–203
13. Yogurtcu ON, Bora Erdemli S, Nussinov R et al (2008) Restricted mobility of conserved residues in protein-protein interfaces in molecular simulations. Biophys J 94:3475–3485
14. Lynch TJ, Bell DW, Sordella R et al (2004) Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J Med 350:2129–2139
15. Tokuriki N, Stricher F, Serrano L et al (2008) How protein stability and new functions trade off. PLoS Comput Biol 4:e1000002
16. Zea DJ, Miguel Monzon A, Fornasari MS et al (2013) Protein conformational diversity correlates with evolutionary rate. Mol Biol Evol 30:1500–1503
17. Zea DJ, Monzon AM, Gonzalez C et al (2016) Disorder transitions and conformational diversity cooperatively modulate biological function in proteins. Protein Sci 25:1138–1146
18. Best RB, Lindorff-Larsen K, DePristo MA et al (2006) Relation between native ensembles and experimental structures of proteins. Proc Natl Acad Sci U S A 103:10901–10906
19. Burra PV, Zhang Y, Godzik A et al (2009) Global distribution of conformational states derived from redundant models in the PDB points to non-uniqueness of the protein structure. Proc Natl Acad Sci U S A 106:10505–10510
20. Berman HM, Westbrook J, Feng Z et al (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242
21. Wei G, Xi W, Nussinov R et al (2016) Protein ensembles: how does nature harness thermodynamic fluctuations for life? The diverse functional roles of conformational ensembles in the cell. Chem Rev 116:6516. https://doi.org/10.1021/acs.chemrev.5b00562
22. Marino-Buslje C, Monzon AM, Zea DJ et al (2017) On the dynamical incompleteness of the Protein Data Bank. Brief Bioinform. https://doi.org/10.1093/bib/bbx084
23. Monzon AM, Juritz E, Fornasari MS et al (2013) CoDNaS: a database of conformational diversity in the native state of proteins. Bioinformatics 29:2512–2514
24. Monzon AM, Rohr CO, Fornasari MS et al (2016) CoDNaS 2.0: a comprehensive database of protein conformational diversity in the native state. Database 2016:baw038
25. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410
26. Ortiz AR, Strauss CEM, Olmea O (2002) MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 11:2606–2621
27. The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res 45:D158–D169
28. Sillitoe I, Lewis TE, Cuff A et al (2015) CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res 43:D376–D381
29. Bairoch A (2000) The ENZYME database in 2000. Nucleic Acids Res 28:304–305
30. Potenza E, Di Domenico T, Walsh I et al (2015) MobiDB 2.0: an improved database of intrinsically disordered and mobile proteins. Nucleic Acids Res 43:D315–D320
31. Ashburner M, Ball CA, Blake JA et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29
32. Qin H, Shi J, Noberini R et al (2008) Crystal structure and NMR binding reveal that two small molecule antagonists target the high affinity ephrin-binding channel of the EphA4 receptor. J Biol Chem 283:29473–29484
33. Qin H, Lim L, Song J (2012) Protein dynamics at Eph receptor-ligand interfaces as revealed by crystallography, NMR and MD simulations. BMC Biophys 5:2
34. Bowden TA, Aricescu AR, Nettleship JE et al (2009) Structural plasticity of eph receptor A4 facilitates cross-class ephrin signaling. Structure 17:1386–1397
35. Monzon AM, Zea DJ, Fornasari MS et al (2017) Conformational diversity analysis reveals three functional mechanisms in proteins. PLoS Comput Biol 13:1–29
36. Parisi G, Zea DJ, Monzon AM et al (2015) Conformational diversity and the emergence of sequence signatures during evolution. Curr Opin Struct Biol 32:58–65
37. Gutteridge A, Thornton J (2005) Conformational changes observed in enzyme crystal structures upon substrate binding. J Mol Biol 346:21–28
38. Mesecar AD, Stoddard BL, Koshland DE Jr (1997) Orbital steering in the catalytic power of enzymes: small structural changes with large catalytic consequences. Science 277:202
39. Koshland DE (1998) Conformational changes: how small is big enough? Nat Med 4:1112–1114
40. Rashin AA, Rashin AHL, Jernigan RL (2010) Diversity of function-related conformational changes in proteins: coordinate uncertainty, fragment rigidity, and stability. Biochemistry 49:5683–5704
41. Juritz E, Palopoli N, Fornasari S et al (2013) Protein conformational diversity modulates sequence divergence. Mol Biol Evol 30:79–87
42. Liu Y, Bahar I (2012) Sequence evolution correlates with structural dynamics. Mol Biol Evol 29:2253–2263
43. Saldaño TE, Monzon AM, Parisi G et al (2016) Evolutionary conserved positions define protein conformational diversity. PLoS Comput Biol 12:e1004775
44. Jeon J, Nam H-J, Choi YS et al (2011) Molecular evolution of protein conformational changes revealed by a network of evolutionarily coupled residues. Mol Biol Evol 28:2675–2685
45. Codoñer FM, Fares MA (2008) Why should we care about molecular coevolution? Evol Bioinformatics Online 4:29–38
46. de Oliveira SHP, Shi J, Deane CM (2017) Comparing co-evolution methods and their application to template-free protein structure prediction. Bioinformatics 33:373–381
47. Morcos F, Jana B, Hwa T et al (2013) Coevolutionary signals across protein lineages help capture multiple protein conformations. Proc Natl Acad Sci U S A 110:20533–20538
48. Rodriguez-Rivas J, Marsili S, Juan D et al (2016) Conservation of coevolving protein interfaces bridges prokaryote–eukaryote homologies in the twilight zone. Proc Natl Acad Sci U S A 113:15018–15023
49. Zea DJ, Monzon AM, Parisi G et al (2018) How is structural divergence related to evolutionary information? Mol Phylogenet Evol. https://doi.org/10.1016/j.ympev.2018.06.033
50. Sfriso P, Duran-Frigola M, Mosca R et al (2016) Residues coevolution guides the systematic identification of alternative functional conformations in proteins. Structure 24:116–126
51. Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5:823–826
52. Koehl P, Levitt M (2002) Sequence variations within protein families are linearly related to structural variations. J Mol Biol 2836:551–562
53. Hubbard TJ, Blundell TL (1987) Comparison of solvent-inaccessible cores of homologous proteins: definitions useful for protein modelling. Protein Eng 1:159–171
54. Russell RB, Barton GJ (1994) Structural features can be unconserved in proteins with similar folds. An analysis of side-chain to side-chain contacts secondary structure and accessibility. J Mol Biol 244:332. https://doi.org/10.1006/jmbi.1994.1733
55. Wen B, Lampe JN, Roberts AG et al (2005) Evolutionary plasticity of protein families: coupling between sequence and structure variation. Proteins 61:535–544
56. Illergård K, Ardell DH, Elofsson A (2009) Structure is three to ten times more conserved than sequence--a study of structural response in protein cores. Proteins 77:499–508
57. Monzon AM, Zea DJ, Marino-Buslje C et al (2017) Homology modeling in a dynamical world. Protein Sci 26:2195
58. Sikic K, Tomic S, Carugo O (2010) Systematic comparison of crystal and NMR protein structures deposited in the protein data bank. Open Biochem J 4:83–95
59. Kufareva I, Abagyan R (2012) Methods of protein structure comparison. In: Orry AJW, Abagyan R (eds) Homology modeling: methods and protocols. Humana Press, Totowa, NJ, pp 231–257
60. Siew N, Elofsson A, Rychlewski L et al (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics 16:776–785
61. Velankar S, Dana JM, Jacobsen J et al (2013) SIFTS: structure integration with function, taxonomy and sequences resource. Nucleic Acids Res 41:D483–D489
62. Zea DJ, Anfossi D, Nielsen M et al (2016) MIToS.jl: Mutual information tools for protein sequence analysis in the Julia language. Bioinformatics 33(4):564–565
63. Zoete V, Michielin O, Karplus M (2002) Relation between sequence and structure of HIV-1 protease inhibitor complexes: a model system for the analysis of protein flexibility. J Mol Biol 315:21–52
64. Hrabe T, Li Z, Sedova M et al (2016) PDBFlex: exploring flexibility in protein structures. Nucleic Acids Res 44:D423–D428
65. Maguid S, Fernández-Alberti S, Parisi G et al (2006) Evolutionary conservation of protein backbone flexibility. J Mol Evol 63:448–457
66. Pettersen EF, Goddard TD, Huang CC et al (2004) UCSF chimera--a visualization system for exploratory research and analysis. J Comput Chem 25:1605–1612
67. Lee RA, Razaz M, Hayward S (2003) The DynDom database of protein domain motions. Bioinformatics 19:1290–1291
68. Amemiya T, Koike R, Kidera A et al (2012) PSCDB: a database for protein structural change upon ligand binding. Nucleic Acids Res 40:D554–D558
Chapter 21
Abstract
Antibodies are proteins of the adaptive immune system; they can be designed to bind almost any molecule,
and are increasingly being used as biotherapeutics. Experimental antibody design is an expensive and time-
consuming process, and computational antibody design methods can now be used to help develop new
therapeutics and diagnostics. Within the design pipeline, accurate antibody structure modeling is essential,
as it provides the basis for antibody-antigen docking, binding affinity prediction, and estimating thermal
stability. Ideally, models should be rapidly generated, allowing the exploration of the breadth of antibody
space. This allows methods to replicate the natural processes of antibody diversification (e.g., V(D)J
recombination and somatic hypermutation), and cope with large volumes of data that are typical of next-
generation sequencing datasets. Here we describe ABodyBuilder and PEARS, algorithms that build and
mutate antibody model structures. These methods take ~30 s to generate a model antibody structure.
Key words Antibody structure prediction, Side-chain prediction, Accuracy estimation, Developability
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_21, © Springer Science+Business Media, LLC, part of Springer Nature 2019
368 Jinwoo Leem and Charlotte M. Deane
[Figure 1 labels: heavy chain and light chain; immunoglobulin domains (V: variable, C: constant) VH, VL, CH1, CL, CH2, CH3; CDR loops CDRH1, CDRH2, CDRH3 on the VH domain and CDRL1, CDRL2, CDRL3 on the VL domain]
Fig. 1 Structure of an antibody molecule. (a) Antibodies are formed from two pairs of two protein chains: the
heavy chains (green) and the light chains (cyan). Each chain has a series of immunoglobulin domains, known
as the variable (V) or constant (C) regions. The two variable domains combine to form the variable fragment
(Fv), and at the tip of the Fv are the CDR loops, which form the majority of the antigen-binding site. (b) The
variable fragment has six CDR loops: CDRH1, CDRH2, and CDRH3 from the VH domain, and CDRL1, CDRL2,
and CDRL3 on the VL
[Figure 2 labels: target antigen of interest; sequence design]
Fig. 2 Starting from an initial target sequence, it is imperative to build a model structure of the antibody
[24–26]. Next, the model antibody structure is tested for a particular function; for example, it is docked to
the target antigen to predict binding affinity [27, 28]. From this newly formed complex, the antibody
structure is allowed to mutate, leading to a new antibody sequence. This cycle is repeated, leading to multiple
possible designs
1.1 Antibody Structural Modeling
Antibody structure prediction can cover a broad range of problems,
such as CDR loop prediction [36–39] and predicting the orientation
between the variable domains [40, 41]. This chapter specifically
focuses on predicting the structure of the Fv [24, 25, 28], as
this is the domain that is primarily responsible for antigen binding.
Antibody structure prediction is usually undertaken in a
template-based manner as the frameworks of antibody structures
are highly conserved. Most protocols follow a similar procedure,
with minor variations; as an example, Fig. 3 shows an overview of
the ABodyBuilder algorithm.
For a target antibody sequence, modeling programs first iden-
tify one or more template structure(s) to model the framework
region. Templates can be selected from the Protein Data Bank
(PDB) [44], or from a curated database, such as the Structural
Antibody Database (SAbDab) [2]. The coordinates of the template
structure(s) are copied and used as a scaffold for subsequent steps.
Next, the orientation between the VH and VL domains is predicted.
This can be done by using the VH–VL orientation of the template
structure [24, 25, 45], machine learning techniques [40], or
computational docking algorithms [26].
In the third stage, the CDR loops are modeled. This is often
done by knowledge-based methods, such as FREAD [36, 46,
47]. Using a database of previously observed structural fragments,
FREAD predicts the CDR loop structure based on sequence simi-
larity to the target CDR sequence and anchor geometry [47]. If a
suitable fragment is not available, CDR loops can be predicted by
ab initio methods, such as MODELLER [48] and Rosetta
[28, 37]. Programs such as Sphinx use both fragment-based and
ab initio techniques for predicting the CDR loops, which is partic-
ularly useful for the CDRH3 loop [38]. For the CDRH1, CDRH2,
CDRL1, CDRL2, and CDRL3 loops, it is possible to predict the
canonical form of the loop based on sequence [10, 49].
Finally, the torsion angles of the side chains, known as the χ
angles, are predicted using only the backbone information.
Some modeling methods, such as ABodyBuilder, rely on dedicated
side-chain prediction tools [24, 25, 42]. Other pipelines use a built-
in side-chain prediction algorithm [43, 50], or a solvation model
[51]. Following side-chain prediction, the model structure in some
protocols undergoes energy minimization [42, 43].
Antibody Structure Modeling 371
1.2 Directed Evolution by Side-Chain Prediction
Side-chain prediction methods can be used to predict all the side
chains on a model structure, or they can be used to introduce
mutations [e.g., 33, 34]. It is assumed that in most cases, changing
a single amino acid residue has little impact on the overall structure
of a protein [33]. Thus, in silico mutation can be considered a
specialized case of side-chain prediction.
In the traditional side-chain prediction problem, every resi-
due’s χ angle(s) must be predicted. In order to simplify the confor-
mational search space, the χ angles are described in discrete forms,
known as rotamers. Side-chain prediction methods generate pre-
dictions by sampling rotamers from rotamer libraries, which
describe the probability of a rotamer for a given structural property.
The most common structural property is the ϕ/ψ angles of the
backbone [53, 54]. Other properties such as secondary structure
[55] or an amino acid’s position in a protein fragment [56] have
also been used. For PEARS, our antibody-specific side-chain pre-
dictor, rotamer probabilities are dependent on their IMGT position
[57]. Numbering schemes such as the IMGT scheme provide a
method for comparing the amino acid sequences of two or more
antibodies. In theory, a given position should represent a specific
part of the immunoglobulin domain, and capture features such as
the distribution of amino acids. While there are various schemes
available [8, 58], the IMGT scheme is often preferred as it has a
clear correlation to structure [57, 59].
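To make the idea of a position-dependent rotamer library concrete, the sketch below bins χ1 angles into three discrete states and tabulates their frequencies per numbered position and residue type. It is an illustration only, not the PEARS library: the three 120° wide bins are a common convention assumed here, and the positions, residues, and angles are invented.

```python
from collections import Counter, defaultdict

def chi1_bin(angle_deg):
    """Map a chi1 angle in degrees to one of three 120-degree-wide bins
    (an assumed convention; bin labels give the approximate bin center)."""
    a = angle_deg % 360.0
    if a < 120.0:
        return "near_60"
    if a < 240.0:
        return "near_180"
    return "near_300"

def build_library(observations):
    """observations: iterable of (position, residue, chi1_deg).
    Returns {(position, residue): {bin: probability}}."""
    counts = defaultdict(Counter)
    for position, residue, chi1 in observations:
        counts[(position, residue)][chi1_bin(chi1)] += 1
    return {
        key: {b: n / sum(c.values()) for b, n in c.items()}
        for key, c in counts.items()
    }

# Invented observations for two hypothetical numbered positions.
toy_observations = [
    ("L36", "TYR", -65.0), ("L36", "TYR", -58.0), ("L36", "TYR", -70.0),
    ("H50", "SER", 62.0), ("H50", "SER", 178.0),
]
library = build_library(toy_observations)
print(library[("L36", "TYR")])
```

A position whose table contains a single bin, as for the hypothetical L36 tyrosine here, is the kind of case that a predictor can treat as having a unimodal χ1 distribution.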
2 Materials
2.2 Additional Software Requirements
To view the model structures locally, users are recommended to use
PyMOL (https://sourceforge.net/projects/pymol/), which is
available for Linux, Macintosh, and Windows. Users can download
3 Methods
3.1 ABodyBuilder
3.1.1 Sequence Annotation
In the sequence submission form, submit the amino acid sequence
of the target antibody. In order to model a paired antibody
(including single-chain Fvs), submit sequences for both the heavy and
light chains, while for single-domain antibodies (for example, VHH
antibodies), submit the sequence for one chain (see Note 1). In
the text below, we describe the procedure for paired antibodies.
1. The submitted target sequence is numbered by ANARCI [31],
which uses a database of hidden Markov models (HMMs; see
Note 1) to number antibody sequences. During this process,
the antibody’s framework region and CDR loops are identified
using the definitions from [9].
3.1.2 Framework Template Selection and Orientation Prediction
Once the sequence has been annotated by ANARCI, ABodyBuilder
searches for a template framework structure from SAbDab [2].
1. ABodyBuilder identifies the template with the highest
sequence identity to the target sequence across the framework
region. If there is an antibody structure that is at least 80%
sequence-identical for both chains, ABodyBuilder uses this
structure as a single “global” template. Otherwise, it uses a
“hybrid” template, in which two templates are used, one for the VH
and one for the VL. See Note 2 for examples of template selection.
2. If ABodyBuilder finds a global template, its orientation is used.
For hybrid templates, the orientation of the antibody with the
highest global sequence identity is used. See Note 2 for
examples of orientation selection.
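The two steps above can be sketched as a small selection routine. This is a simplified illustration rather than the ABodyBuilder code: it assumes the framework sequences are already aligned to equal length by the numbering step, it ranks "global" identity by the sum of the two chain identities (an assumption, since the text does not define it), and the template names and sequences are invented.

```python
def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length framework sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned to the same non-zero length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def choose_templates(target_h, target_l, candidates, cutoff=80.0):
    """candidates: list of dicts with keys 'id', 'H', 'L' (framework sequences)."""
    scored = [
        (c["id"], percent_identity(target_h, c["H"]), percent_identity(target_l, c["L"]))
        for c in candidates
    ]
    # A global template needs both chains at or above the cutoff.
    global_hits = [s for s in scored if s[1] >= cutoff and s[2] >= cutoff]
    if global_hits:
        best = max(global_hits, key=lambda s: s[1] + s[2])
        return {"mode": "global", "VH": best[0], "VL": best[0], "orientation": best[0]}
    # Otherwise pick the best template per chain (hybrid) and take the
    # orientation from the structure with the highest overall identity.
    best_h = max(scored, key=lambda s: s[1])
    best_l = max(scored, key=lambda s: s[2])
    best_overall = max(scored, key=lambda s: s[1] + s[2])
    return {"mode": "hybrid", "VH": best_h[0], "VL": best_l[0],
            "orientation": best_overall[0]}

# Invented toy data (10-residue stand-ins for framework regions).
target_h, target_l = "EVQLVESGGG", "DIQMTQSPSS"
toy_library = [
    {"id": "t1", "H": "EVQLVESGGG", "L": "DIVMTQAPAA"},
    {"id": "t2", "H": "QVQLQQSGAE", "L": "DIQMTQSPSS"},
]
hybrid_choice = choose_templates(target_h, target_l, toy_library)
global_choice = choose_templates(
    target_h, target_l,
    toy_library + [{"id": "t3", "H": "EVQLVESGGA", "L": "DIQMTQSPSA"}],
)
```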
3.1.3 Prediction of the CDR Loops
The CDR loops are predicted by a combination of FREAD [46, 47]
and MODELLER [48]. The loops are predicted in the order of
CDRL2, CDRH2, CDRL1, CDRH1, CDRL3, and then CDRH3.
The ordering is based on our ability to predict each CDR loop
individually, and the frequency of Cβ-Cβ contacts between CDR
loops. The CDRL2 and CDRH2 loops are predicted first because
they are usually modeled with the highest accuracy and there are no
contacts between them. This is followed by CDRL1 and CDRH1
as they are the next best predicted loops, and then the CDRL3 and
CDRH3.
1. FREAD is a database method; a CDR loop-specific database is
used to predict each loop, i.e., a CDRL3-specific database is
used to predict the CDRL3 loop. FREAD selects loops using
an environment-specific substitution score and anchor RMSD, and
checks for clashes with the scaffold (i.e., the framework region
and existing CDR loops). If there are no suitable fragments in
the CDR-specific database, FREAD uses an antibody-specific
database, which includes fragments from all six CDR loops.
2. If FREAD does not find a suitable prediction, ABodyBuilder
searches for a length-matched loop with the highest BLO-
SUM62 score to the target CDR loop sequence.
3. If a length-matched sequence-similar loop is not available,
MODELLER is used to model the loop ab initio.
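The fallback order in steps 1–3 can be sketched as a simple cascade. The code below is a hedged illustration, not ABodyBuilder itself: the FREAD and ab initio steps are represented by hypothetical callables supplied by the caller, and the substitution score is pluggable. The toy run uses a plain identity score as a stand-in; a real run would pass a BLOSUM62 lookup instead.

```python
def pick_length_matched(target, known_loops, score):
    """Return the same-length loop with the highest summed substitution
    score against the target, or None if no loop has the target's length."""
    same_length = [seq for seq in known_loops if len(seq) == len(target)]
    if not same_length:
        return None
    return max(same_length,
               key=lambda seq: sum(score(a, b) for a, b in zip(target, seq)))

def model_loop(target, fread_specific, fread_general, known_loops, score, ab_initio):
    """Try each strategy in turn; report which one produced the loop."""
    for name, method in (("FREAD (CDR-specific)", fread_specific),
                         ("FREAD (antibody-wide)", fread_general)):
        hit = method(target)
        if hit is not None:
            return name, hit
    hit = pick_length_matched(target, known_loops, score)
    if hit is not None:
        return "length-matched", hit
    return "ab initio", ab_initio(target)

# Hypothetical stand-ins: both database searches fail in this toy run.
no_hit = lambda target: None
identity_score = lambda a, b: 1 if a == b else 0   # stand-in, not BLOSUM62
fake_ab_initio = lambda target: "ab-initio:" + target

matched = model_loop("ARDYF", no_hit, no_hit,
                     ["ARDY", "ARDYW", "GGGYF"], identity_score, fake_ab_initio)
fallback = model_loop("ARDYF", no_hit, no_hit,
                      ["ARDY"], identity_score, fake_ab_initio)
```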
3.1.4 Side-Chain Prediction
Once the CDR loops are predicted on the template framework
structure, the side chains of the model are predicted using
PEARS. PEARS uses an IMGT position-dependent distribution
of amino acid rotamers in antibody structures.
1. PEARS first builds the disulfide bridges in the antibody struc-
ture, typically between IMGT positions H23-H104 and
L23-L104.
2. Next, PEARS identifies side chain types that are known to have
a unimodal χ1 angle distribution (e.g., L116 tyrosine). The
side chains at these positions are predicted first using rotamers
with the same χ1 angle bin. If there are no suitable predictions,
these positions are predicted in the next step.
3. The remaining side chains are predicted by dead-end elimina-
tion [60] and then graph decomposition, similar to other side-
chain prediction methods [33, 61]. If no suitable predictions
can be made, only a Cβ atom is placed.
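Dead-end elimination in step 3 can be illustrated on a toy problem. The sketch below applies the original dead-end elimination criterion: a rotamer is discarded when even its best-case energy is worse than the worst-case energy of a competing rotamer at the same position. Whether PEARS uses this exact criterion or a later variant is not stated in the text, the graph decomposition step is not shown, and the energies are invented.

```python
def pair_energy(pair, i, r, j, s):
    """Look up a symmetric pairwise energy stored once per position pair (i < j)."""
    return pair[(i, j)][r][s] if i < j else pair[(j, i)][s][r]

def dead_end_elimination(self_e, pair):
    """self_e[i][r]: self energy of rotamer r at position i.
    pair[(i, j)][r][s]: interaction energy for i < j.
    Returns the surviving rotamer indices for each position."""
    n = len(self_e)
    alive = [list(range(len(e))) for e in self_e]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for r in list(alive[i]):
                # Best case for r: lowest interaction with every other position.
                best_r = self_e[i][r] + sum(
                    min(pair_energy(pair, i, r, j, s) for s in alive[j])
                    for j in others)
                for t in alive[i]:
                    if t == r:
                        continue
                    # Worst case for the competitor t.
                    worst_t = self_e[i][t] + sum(
                        max(pair_energy(pair, i, t, j, s) for s in alive[j])
                        for j in others)
                    if best_r > worst_t:
                        alive[i].remove(r)  # r cannot be in the global minimum
                        changed = True
                        break
    return alive

# Invented energies: two positions with two and three rotamers.
self_e = [[0.0, 5.0], [1.0, 0.0, 4.0]]
pair = {(0, 1): [[0.5, 0.0, 0.2], [0.1, 0.3, 0.0]]}
surviving = dead_end_elimination(self_e, pair)
print(surviving)
```

On realistic inputs several rotamers usually survive at some positions, which is why a second search step is still needed after elimination.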
3.1.5 Annotation of Model Structure and Download Links
Once ABodyBuilder completes the modeling process (~30 s), the
user is immediately redirected to the results page, summarizing the
templates that were used for the framework and the CDRs (Fig. 4).
In addition, sequence alignments of the model and target
sequences are provided. Users can choose to submit the structure
for paratope prediction (Antibody i-Patch) [27] or epitope
prediction (EpiPred) [32], or view the model structure for model
accuracy and sequence liabilities (Fig. 4).
3.2 PEARS
3.2.1 Structure Input Form
The first step requires the user to upload the structure of the
antibody, with or without the antigen (Fig. 5), and specify the
antibody chains, for example, “HL.” To mutate residues in the
antibody structure, the desired amino acid sequence of the antibody
is then submitted (see Note 3). PEARS generates the mutated
structure and directs the user to a results page, where the antibody
structure, renumbered in the IMGT scheme, is available along with
a text file listing all the predicted χ angles.
Fig. 4 Screenshots of the ABodyBuilder results and viewer pages. Once a model is built, users are directed to
the results page (top) that lists the template structures that were used to model different regions of the
antibody. The viewer page (bottom) shows the model using BioPV [62]
Fig. 5 Screenshot of the PEARS input and results pages. In the input page (top), users can submit a modified
sequence of the antibody (see Note 3). The output page (bottom) shows the final prediction, and users can
download a tab-separated file that lists the χ angles in the final model
3.2.2 Mutation of the Input Structure
When mutating the antibody structure, PEARS aligns the submitted
sequence to the amino acid sequence in the structure (see Note
3). In the single-mutation case, PEARS simply uses the lowest
energy rotamer to fit into the target position. Otherwise, it runs
dead-end elimination and graph decomposition.
3.2.3 Resolving Clashes in the Structure
When PEARS predicts the side-chain structure of a target position,
it uses a KD-tree algorithm to check for clashes. Two atoms are
considered to clash if they are closer than 63% of the sum of their
van der Waals radii, which is similar to previously established
cutoffs [34]. If clashes are detected, PEARS first adds Gaussian noise
to the χ angles; if this does not resolve the clashes, no predictions
are made, and the position is left with only a Cβ atom.
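The clash criterion can be written down directly. The sketch below is an illustration under stated assumptions: the radii are the widely used Bondi values (the text does not say which radii PEARS uses), the atoms are invented, and a brute-force double loop stands in for the KD-tree neighbor search that PEARS uses to make the same check efficient on full structures.

```python
import math

# Bondi van der Waals radii in Å (an assumption; not taken from the text).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def clashes(atom_a, atom_b, fraction=0.63):
    """atom = (element, (x, y, z)). True if the two atoms are closer
    than `fraction` times the sum of their van der Waals radii."""
    (el_a, xyz_a), (el_b, xyz_b) = atom_a, atom_b
    limit = fraction * (VDW_RADII[el_a] + VDW_RADII[el_b])
    return math.dist(xyz_a, xyz_b) < limit

def find_clashes(side_chain, environment):
    """Brute-force search over all atom pairs; returns index pairs that clash."""
    return [(i, j)
            for i, a in enumerate(side_chain)
            for j, b in enumerate(environment)
            if clashes(a, b)]

# Invented atoms: one side-chain carbon and two environment atoms.
toy_side_chain = [("C", (0.0, 0.0, 0.0))]
toy_environment = [("C", (2.0, 0.0, 0.0)), ("O", (3.0, 0.0, 0.0))]
toy_clashes = find_clashes(toy_side_chain, toy_environment)
print(toy_clashes)
```

For two carbons the cutoff is 0.63 × 3.40 Å, about 2.14 Å, so the carbon placed 2.0 Å away is flagged while the oxygen at 3.0 Å is not.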
4 Notes
References
1. Georgiou G, Ippolito GC, Beausang J, Busse CE, Wardemann H, Quake SR (2014) The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat Biotechnol 32:158–168
2. Dunbar J, Krawczyk K, Leem J, Baker T, Fuchs A, Georges G, Shi J, Deane CM (2014) SAbDab: the structural antibody database. Nucleic Acids Res 42:D1140–D1146
3. Chames P, Van Regenmortel M, Weiss E, Baty D (2009) Therapeutic antibodies: successes, limitations and hopes for the future. Br J Pharmacol 157:220–233
4. Kuroda D, Shirai H, Jacobson MP, Nakamura H (2012) Computer-aided antibody design. Protein Eng Des Sel 25:507–521
5. Reichert JM (2017) Antibodies to watch in 2017. MAbs 9:167–181
6. Weiner GJ (2015) Building better monoclonal antibody-based therapeutics. Nat Rev Cancer 15:361–370
7. Schroeder HW, Cavacini L (2010) Structure and function of immunoglobulins. J Allergy Clin Immunol 125:41–52
8. Chothia C, Lesk A (1987) Canonical structures for the hypervariable regions of immunoglobulins. J Mol Biol 196:901–917
9. North B, Lehmann A, Dunbrack RL (2011) A new clustering of antibody CDR loop conformations. J Mol Biol 406:228–256
10. Nowak J, Baker T, Georges G, Kelm S, Klostermann S, Shi J, Sridharan S, Deane CM (2016) Length-independent structural similarities enrich the antibody CDR canonical class model. MAbs 8:751–760
11. Dunbar J, Fuchs A, Shi J, Deane CM (2013) ABangle: Characterising the VH-VL orientation in antibodies. Protein Eng Des Sel 26:611–620
12. Foote J, Winter G (1992) Antibody framework residues affecting the conformation of the hypervariable loops. J Mol Biol 224:487–499
13. McCafferty J, Griffiths AD, Winter G, Chiswell DJ (1990) Phage antibodies: filamentous phage displaying antibody variable domains. Nature 348:552–554
14. Lee E-C, Liang Q, Ali H, Bayliss L, Beasley A, Bloomfield-Gerdes T, Bonoli L, Brown R, Campbell J, Carpenter A, Chalk S, Davis A, England N, Fane-Dremucheva A, Franz B, Germaschewski V, Holmes H, Holmes S, Kirby I, Kosmac M, Legent A, Lui H, Manin A, O’Leary S, Paterson J, Sciarrillo R, Speak A, Spensberger D, Tuffery L, Waddell N, Wang W, Wells S, Wong V, Wood A, Owen MJ, Friedrich GA, Bradley A (2014) Complete humanization of the mouse immunoglobulin loci enables efficient therapeutic antibody discovery. Nat Biotechnol 32:356–363
15. Liu X, Taylor RD, Griffin L, Coker S-F, Adams R, Ceska T, Shi J, Lawson ADG, Baker T (2017) Computational design of an epitope-specific Keap1 binding antibody using hotspot residues grafting and CDR loop swapping. Sci Rep 7:41306
16. Lippow SM, Wittrup KD, Tidor B (2007) Computational design of antibody-affinity improvement beyond in vivo maturation. Nat Biotechnol 25:1171–1176
17. Choi Y, Hua C, Sentman CL, Ackerman ME, Bailey-Kellogg C (2015) Antibody humanization by structure-based computational protein design. MAbs 7:1045–1057
18. Miklos AE, Kluwe C, Der BS, Pai S, Sircar A, Hughes RA, Berrondo M, Xu J, Codrea V, Buckley PE, Calm AM, Welsh HS, Warner CR, Zacharko MA, Carney JP, Gray JJ, Georgiou G, Kuhlman B, Ellington AD (2012) Structure-based design of supercharged, highly thermoresistant antibodies. Chem Biol 19:449–455
19. Olimpieri PP, Marcatili P, Tramontano A (2015) Tabhu: tools for antibody humanization. Bioinformatics 31:434–435
20. Lewis SM, Wu X, Pustilnik A, Sereno A, Huang F, Rick HL, Guntas G, Leaver-Fay A, Smith EM, Ho C, Hansen-Estruch C, Chamberlain AK, Truhlar SM, Conner EM, Atwell S, Kuhlman B, Demarest SJ (2014) Generation of bispecific IgG antibodies by structure-based design of an orthogonal Fab interface. Nat Biotechnol 32:191–198
21. Dunbar J, Knapp B, Fuchs A, Shi J, Deane CM (2014) Examining variable domain orientations in antigen receptors gives insight into TCR-like antibody design. PLoS Comput Biol 10:1–10
22. Lapidoth GD, Baran D, Pszolla GM, Norn C, Alon A, Tyka MD, Fleishman SJ (2015) AbDesign: an algorithm for combinatorial backbone design guided by natural conformations and sequences. Proteins 83:1385–1406
23. Li T, Pantazes RJ, Maranas CD (2014) OptMAVEn – a new framework for the de novo design of antibody variable region models targeting specific antigen epitopes. PLoS One 9:1–17
24. Leem J, Dunbar J, Georges G, Shi J, Deane CM (2016) ABodyBuilder: automated antibody structure prediction with data-driven accuracy estimation. MAbs 8:1259–1268
25. Marcatili P, Olimpieri PP, Chailyan A, Tramontano A (2014) Antibody structural modeling
38. Marks C, Nowak J, Klostermann S, Georges G, Dunbar J, Shi J, Kelm S, Deane CM (2017) Sphinx: merging knowledge-based and ab initio approaches to improve protein loop pre-
with prediction of immunoglobulin structure diction. Bioinformatics 33:1346–1353
(PIGS). Nat Protoc 9:2771–2783 39. Messih MA, Lepore R, Marcatili P, Tramon-
26. Sivasubramanian A, Sircar A, Chaudhury S, tano A (2014) Improving the accuracy of the
Gray JJ (2009) Toward high-resolution structure prediction of the third hypervariable
homology modeling of antibody Fv regions loop of the heavy chains of antibodies. Bioin-
and application to antibody-antigen docking. formatics 30:2733–2740
Proteins 74:497–514 40. Bujotzek A, Dunbar J, Lipsmeier F, Sch€afer W,
27. Krawczyk K, Baker T, Shi J, Deane CM (2013) Antes I, Deane CM, Georges G (2015a) Pre-
Antibody i-Patch prediction of the antibody diction of VH-VL domain orientation for anti-
binding site improves rigid local antibody- body variable domain modeling. Proteins
antigen docking. Protein Eng Des Sel 83:681–695
26:621–629 41. Marze NA, Lyskov S, Gray JJ (2016) Improved
28. Weitzner BD, Jeliazkov JR, Lyskov S, Marze N, prediction of antibody VL-VH orientation.
Kuroda D, Frick R, Adolf-Bryfogle J, Biswas N, Protein Eng Des Sel 29:409–418
Dunbrack RL Jr, Gray JJ (2017) Modeling and 42. Yamashita K, Ikeda K, Amada K, Liang S,
docking of antibody structures with Rosetta. Tsuchiya Y, Nakamura H, Shirai H, Standley
Nat Protoc 12:401–416 DM (2014) Kotai antibody builder: automated
29. Huang P-S, Boyken SE, Baker D (2016) The high-resolution structural modeling of antibo-
coming of age of de novo protein design. dies. Bioinformatics 30:3279–3280
Nature 537:320–327 43. Bujotzek A, Fuchs A, Qu C, Benz J,
30. Khoury GA, Smadbeck J, Kieslich CA, Floudas Klostermann S, Antes I, Georges G (2015b)
CA (2014) Protein folding and de novo pro- MoFvAb: modeling the Fv region of antibo-
tein design for biotechnological applications. dies. MAbs 7:838–852
Trends Biotechnol 32:99–109 44. Berman HM, Westbrook J, Feng Z,
31. Dunbar J, Deane CM (2016) ANARCI: anti- Gilliland G, Bhat TN, Weissig H, Shindyalov
gen receptor numbering and receptor classifi- IN, Bourne PE (2000) The Protein Data Bank.
cation. Bioinformatics 32:298–300 Nucleic Acids Res 28:235–242
32. Krawczyk K, Liu X, Baker T, Shi J, Deane CM 45. Maier JKX, Labute P (2014) Assessment of
(2014) Improving B-cell epitope prediction fully automated antibody homology modeling
and its application to global antibody-antigen protocols in molecular operating environment.
docking. Bioinformatics 30:2288–2294 Proteins 82:1599–1610
33. Krivov GG, Shapovalov MV, Dunbrack RL 46. Choi Y, Deane CM (2010) FREAD revisited:
(2009) Improved prediction of protein side- accurate loop structure prediction using a data-
chain conformations with SCWRL4. Proteins base search algorithm. Proteins 78:1431–1440
77:778–795 47. Deane CM, Blundell TL (2001) CODA: a
34. Nagata K, Randall A, Baldi P (2012) SIDEpro: combined algorithm for predicting the struc-
a novel machine learning approach for the fast turally variable regions of protein models. Pro-
and accurate prediction of side-chain confor- tein Sci 10:599–612
mations. Proteins 80:142–153 48. Šali A, Blundell TL (1993) Comparative pro-
35. Almagro JC, Teplyakov A, Luo J, Sweet RW, tein modelling by satisfaction of spatial
Kodangattil S, Hernandez-Guzman F, Gilli- restraints. J Mol Biol 234:779–815
land GL (2014) Second antibody modeling 49. Adolf-Bryfogle J, Xu Q, North B, Lehmann A,
assessment (AMA-II). Proteins 82:1553–1562 Dunbrack RL Jr (2015) PyIgClassify: a data-
36. Choi Y, Deane CM (2011) Predicting antibody base of antibody CDR structural classifications.
complementarity determining region struc- Nucleic Acids Res 43:D432–D438
tures without classification. Mol BioSyst 50. Berrondo M, Kaufmann S, Berrondo M (2014)
7:3327–3334 Automated aufbau of antibody structures from
37. Finn JA, Koehler Leman J, Willis JR, given sequences using Macromoltek’s
Cisneros A, Crowe JE, Meiler J (2016) SmrtMolAntibody. Proteins 82:1636–1645
Improving loop modeling of the antibody 51. Zhu K, Day T, Warshaviak D, Murrett C,
complementarity-determining region 3 using Friesner R, Pearlman D (2014) Antibody
knowledge-based restraints. PLoS One 11: structure determination using a combination
e0154811 of homology modeling, energy-based
380 Jinwoo Leem and Charlotte M. Deane
refinement, and loop prediction. Proteins 57. Lefranc M-P, Pommié C, Ruiz M, Giudicelli V,
82:1646–1655 Foulquier E, Truong L, Thouvenin-Contet V,
52. Jarasch A, Koll H, Regula JT, Bader M, Lefranc G (2003) IMGT unique numbering
Papadimitriou A, Kettenberger H (2015) for immunoglobulin and T cell receptor vari-
Developability assessment during the selection able domains and Ig superfamily V-like
of novel therapeutic antibodies. J Pharm Sci domains. Dev Comp Immunol 27:55–77
104:1885–1898 58. Kabat EA, Wu TT, Bilofsky H, Reid-Miller M,
53. Shapovalov MV, Dunbrack RL (2011) A Perry HM (1983) Sequences of proteins of
smoothed backbone-dependent rotamer immunological interest, 3rd edn. National
library for proteins derived from adaptive ker- Institutes of Health, Bethesda
nel density estimates and regressions. Structure 59. Lefranc M-P (2014) Immunoglobulin and T
19:844–858 cell receptor genes: IMGT and the birth and
54. Towse C-L, Rysavy S, Vulovic I, Daggett V rise of Immunoinformatics. Front Immunol
(2016) New dynamic rotamer libraries: data- 5:22
driven analysis of side-chain conformational 60. Desmet J, Maeyer MD, Hazes B, Lasters I
propensities. Structure 24:187–199 (1992) The dead-end elimination theorem
55. Lovell SC, Word JM, Richardson JS, Richard- and its use in protein side-chain positioning.
son DC (2000) The penultimate rotamer Nature 356:539–542
library. Proteins 40:389–408 61. Miao Z, Cao Y, Jiang T (2011) RASP: rapid
56. Chinea G, Padron G, Hooft RWW, Sander C, modeling of protein side chain conformations.
Vriend G (1995) The use of position-specific Bioinformatics 27:3117–3122
rotamers in model building by homology. Pro- 62. Biasini M (2015) pv: v1.8.1
teins 23:415–421
Chapter 22
Abstract
Recent years have seen an explosion of interest in both sequence- and structure-based approaches toward in
silico-directed evolution. We recently developed a novel computational toolkit, CADEE, which facilitates
the computer-aided directed evolution of enzymes. Our initial work (Amrein et al., IUCrJ 4:50–64, 2017)
presented a pedagogical example of the application of CADEE to triosephosphate isomerase, to illustrate
the CADEE workflow. In this contribution, we describe this workflow in detail, including code input/
output snippets, in order to allow users to set up and execute CADEE simulations on any system of interest.
Key words Enzyme design, Directed evolution, Computational enzymology, Computational enzyme
design, Empirical valence bond
1 Introduction
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_22, © Springer Science+Business Media, LLC, part of Springer Nature 2019
382 Beat Anton Amrein et al.
Fig. 1 Examples of various enzymes that have been studied with the EVB approach. The experimental
activation free energies (ΔG‡exp) are shown in dark blue, and the calculated activation free energies (ΔG‡calc)
are shown in sky blue. DHFR, Lys, AR, CM, Try, PAS, DhlA, TIM, RlPMH, AchE, ODC, CA, ATP, and KSI denote
dihydrofolate reductase, lysozyme, aldose reductase, chorismate mutase, trypsin, a bacterial arylsulfatase,
haloalkane dehalogenase, triosephosphate isomerase, a bacterial phosphonate monoester hydrolase, acetyl-
choline esterase, orotidine monophosphate decarboxylase, carbonic anhydrase, F1-ATPase, and ketosteroid
isomerase, respectively. CC-BY adapted from Ref. [14], based on data originally presented in Refs. [16–19]
2.2 Speed and Computing Resources
As CADEE is a distributed computing framework that runs a large
number of individual tasks at once, it is important to keep computational
overhead low. In CADEE, this overhead minimization is
achieved with the following tricks:
1. In the case of multistep reactions, only the rate-limiting step is
simulated as an initial approximation; the other steps are
simulated only once a smaller selection of hits has been
identified, to focus resources.
2. We have included standard settings intended for the best use of
resources:
(a) Hysteresis tends to be reduced in the runs due to equili-
bration at the approximate transition state along the reac-
tion coordinate.
(b) Short thermalization and 8 ns of tandem equilibration and
EVB phases.
(c) Four replicas each with 8 EVB snapshots = 32 data points
for statistics.
(d) The EVB calculations are initiated from structural snap-
shots collected every nanosecond of the initial equilibra-
tion; this allows post-calculation assignment of when the
system has sufficiently equilibrated so that stable energet-
ics are reached.
(e) By default, each enzyme variant is simulated for a total of
50 ns.
3. In order to decrease the simulation time, we advise the user to
start the EVB simulations at the transition state, propagating
the trajectories to the reactant and product complexes. This
both accelerates convergence (as the user is starting from a state
with partial bonds to reacting atoms) and, provided that suffi-
cient computational resources are available, reduces the real
time of the simulations as trajectories can be propagated in
both directions at once.
4. CADEE efficiency is high, thanks to a pleasingly parallel imple-
mentation, allowing hundreds to thousands of simulations to
be performed in parallel.
5. While these high-efficiency defaults are best suited for produc-
tion simulations, they are inconvenient for test simulations and
the initial system setup. To overcome this, a special script can be
employed, as described in Subheading 4.3.2.
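The default sampling budget implied by these settings can be checked with a few lines of arithmetic. The following Python sketch is an illustration only and is not part of CADEE: the function and argument names are hypothetical, the 520 ps length of each EVB run is taken from Subheading 2.3, and the short thermalization phase is ignored.

```python
# Illustrative arithmetic only; names are hypothetical, numbers come from the text.

def sampling_budget(replicas=4, equilibration_ns=8.0,
                    evb_runs_per_replica=8, evb_run_ps=520.0):
    """Return (EVB data points, total simulated time in ns) for one variant."""
    data_points = replicas * evb_runs_per_replica
    evb_ns_per_replica = evb_runs_per_replica * evb_run_ps / 1000.0
    total_ns = replicas * (equilibration_ns + evb_ns_per_replica)
    return data_points, total_ns


points, total_ns = sampling_budget()
print(points)              # 32
print(round(total_ns, 2))  # 48.64
```

The result of roughly 48.6 ns per variant is consistent with the default of about 50 ns quoted in item 2(e).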
In Silico Directed Evolution 385
2.3 Simulation Packages (Simpacks)
A simpack as used by CADEE is a tarball, which contains the input
files for a CADEE simulation. Once a simulation has started, all the
results are appended to the simpack. It is therefore not only an
input, but also an output file. Once a simulation is started (“cadee
dyn”), the contents of the simpack are copied to a temporary folder.
The simulation is then spooled to the right position (skipping steps
that have been computed previously) and then molecular dynamics
simulations are performed. Once a simulation step has been com-
pleted, it will be compressed, collected, and appended to the sim-
pack in intervals to reduce strain on the file system. A default
CADEE simpack contains a total of 8 ns of equilibration time.
Every 1000 ps thereof, a snapshot is used to perform a medium-
length EVB simulation (each 520 ps total length).
By default, we suggest the user primarily relies on the medium-
length EVB simulations for estimating the likely activation free
energies for the constructs being tested (the longer the simulation
time, the more likely the simulations have converged). As each
simpack contains eight medium-length EVB runs, it is possible to
allow retro-actively for additional N ns initial equilibration, by
removing the data points of the first N medium-length EVB runs.
The rationale for this more elaborate internal setup (in comparison to
traditional Q inputs and workflows) is that CADEE hides this
complexity from the user and hence does not cripple productivity,
but rather empowers the user during the analysis. For example, in
our model system below, we have discarded the first two data points
from each simpack, in order to give the
system an additional 2 ns of initial equilibration time, resulting in a
total equilibration time of 3 ns.
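The retroactive extension of the equilibration time can be pictured as simple list slicing. The sketch below is a hypothetical post-processing helper, not a CADEE function; the activation free energies are made-up placeholder values, and the 1 ns snapshot interval is taken from the description above.

```python
from statistics import mean, stdev


def discard_initial_runs(barriers_by_replica, n_discard):
    """Drop the first n_discard EVB runs of every replica and pool the rest."""
    pooled = []
    for runs in barriers_by_replica:
        pooled.extend(runs[n_discard:])
    return pooled


def effective_equilibration_ns(n_discard, snapshot_interval_ns=1.0):
    """Equilibration time preceding the first EVB run that is kept."""
    return (n_discard + 1) * snapshot_interval_ns


# Placeholder data (kcal/mol): 4 replicas x 8 medium-length EVB runs.
example = [[15.0 + 0.1 * run + 0.05 * rep for run in range(8)]
           for rep in range(4)]

kept = discard_initial_runs(example, n_discard=2)
print(len(kept))                      # 24
print(effective_equilibration_ns(2))  # 3.0
print(round(mean(kept), 2), round(stdev(kept), 2))
```

Discarding two runs per replica leaves 4 × 6 = 24 profiles, which is the number used for the working example.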
Technical details: Simpacks use the following nomenclature
protocol: [variant-name]_[replica].tar, for example for the wild-
type protein: “wt_0.tar, wt_1.tar, wt_2.tar, wt_3.tar” and for a
histidine 104 to alanine variant: “H104A_0.tar, H104A_1.tar,
H104A_2.tar, H104A_3.tar”, etc. For our working example,
CADEE creates four independent replicas (seeds) for each enzyme
variant, leading to a total of 4 × 8 = 32 medium EVB runs. In
addition, we have decided to manually remove the first 2 EVB
simulations from each simpack, to allow for longer initial equilibration
without increasing real simulation time, effectively yielding
4 × 6 = 24 medium EVB energy profiles (see also the previous
paragraph). As a baseline, all simpacks must contain a topology file
(mutant.top), a simulation-ready PDB file (mutant.pdb), and the
FEP file (mutant.fep) which contains the EVB parameters for the
different reacting states for the reaction being studied. If molecular
dynamics simulations should be performed, the simpacks must
contain numbered input files (*.inp), as per the following scheme:
01_* to 09_*: initialization and thermalization, 1000_eq.inp to
4660_fep.inp: equilibration and FEP files. Files containing the
string (“_eq”) are 50 ps equilibration runs (the reason for the
3 CADEE Installation
3.1 The Wild-Type Enzyme Reaction
CADEE relies on the user to have already characterized and validated
the reaction mechanism of the enzyme of interest, and to
have calibrated the EVB coupling and gas-phase shift parameters
against relevant experimental or computational data (usually
corresponding to the energetics of the reaction catalyzed by the
wild-type enzyme, or the corresponding uncatalyzed reaction in
aqueous solution), as described in Ref. [14]. It is therefore crucial
that the system is carefully prepared, as the quality of data obtained
from all subsequent steps builds on the correct modeling of the
baseline reaction (i.e., for the EVB calibration).
3.2 Installation and System Requirements
CADEE has been written and tested on Linux machines. The
parallel computing has been tested on a variety of Intel as well as
AMD clusters, as these systems were accessible through the Swed-
ish National Infrastructure for Computing (SNIC) at various sites
in Linköping (NSC/Triolith https://www.nsc.liu.se/), Uppsala
(UPPMAX/Tintin and Rackham https://www.uppmax.uu.se/),
and Umeå (HPC2N/Akka, Abisko, and Kebnekaise https://www.
hpc2n.umu.se/). We note that while the software was written to
run on all SNIC-provided resources, we have not, as yet, tested it
on other SNIC clusters. In addition, we have not used other
resource managers than SLURM, as this is the primary resource
manager on SNIC systems. For simplicity, we assume that the user
will be using a Debian- or Ubuntu-based system, as those systems
are in widespread use.
The CADEE installer may be downloaded from our official
GitHub repository, which is located at
http://www.github.com/kamerlinlab/cadee. For new users, we recommend following the
CADEE installation instructions described in the following sec-
tions in order to get started.
3.3 How to Read this Chapter
Throughout this section, we assume that a modern implementation
of the Bourne again shell (bash) is installed and used by the user.
Code that needs to be typed into a terminal emulator is explicitly
identified as a Code Input Snippet:
Note that a line ending with a backslash (\) implies that the command
continues on the next line (therefore “Enter” should not be pressed at
that point, or the command might not work as intended). Similarly, the
corresponding output is explicitly identified as a Code Output
Snippet:
3.4.2 Licensing and Downloading Q
Q [39] needs to be licensed, downloaded, and installed, as CADEE
relies on the functional capabilities of this molecular simulation
package (see http://xray.bmc.uu.se/~aqwww/q/ for further
3.4.3 Licensing and Downloading SCWRL4
SCWRL4 [24] needs to be licensed, downloaded, and installed, as
CADEE utilizes the functional capabilities of this package to rapidly
predict a likely side-chain orientation (rotamer). Users are advised
to visit http://dunbrack.fccc.edu/scwrl4/ for instructions on the
licensing, download, and installation of SCWRL4.
3.5.2 Q and SCWRL Installation
First, SCWRL4 should be installed to a folder in $PATH, as also
mentioned in the download instructions. Next, a copy of the Q
executables has to be placed in the folder prepared for them
($CADEE_DIR/cadee/executables/q): Once Q has been compiled,
qfep5, qdyn5, qprep5, and qcalc5 should be copied to
$CADEE_DIR/cadee/executables/q/. Alternatively, the setup.py
script will search in the $PATH for the Q executables.
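Before running the installer, it can be useful to confirm which of these executables are visible on the search path. The following Python sketch is not part of CADEE or its setup.py; it only uses the standard-library function shutil.which, and the binary name "Scwrl4" for SCWRL4 is an assumption that may need to be adjusted to the local installation.

```python
import shutil

# Q executables named in the text above.
Q_EXECUTABLES = ("qfep5", "qdyn5", "qprep5", "qcalc5")


def missing_executables(names, search_path=None):
    """Return the names that cannot be found on the given search path.

    With search_path=None, the PATH environment variable is used.
    """
    return [name for name in names
            if shutil.which(name, path=search_path) is None]


# "Scwrl4" is an assumed binary name for SCWRL4.
print(missing_executables(Q_EXECUTABLES + ("Scwrl4",)))
```

An empty list means that all five programs were found on $PATH. Executables that were copied to $CADEE_DIR/cadee/executables/q/ instead are not covered by this check.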
3.5.3 CADEE Installation
Once all required dependencies are installed (see Subheading 3.4),
one may proceed to install CADEE:
4.1 First Start
Once CADEE has been installed successfully, the CADEE wrapper
script (“cadee”) will be available on the command line. This script
has been written for the ease of users familiar with Q, as the syntax
is maintained between the two programs.
(C) Copyright 2017 Beat Anton Amrein & Shina Caroline Lynn Kamerlin
Usage:
cadee [ prep(p) | dyn(d) | ana(a) | tool(t) ]
Multi Core Tasks:
mpirun -n X cadee dyn
mpiexec -n X cadee dyn
X == Number of cores to use; 2+.
In case the output does not resemble the above output (e.g.,
“cadee: command not found”), the installation has failed (or is
incomplete), and we suggest users refer to the troubleshooting
described in Subheading 8.2.
4.2 Preparing a CADEE Simulation
As described in the introduction, to use CADEE, the valence bond
states describing different reacting species for the reaction of interest
need to be pre-parameterized and calibrated for running the
EVB simulations that underlie CADEE (for details about the EVB
approach, see, e.g., Refs. [15, 17]). To simplify CADEE usage and
understanding, and to allow for easy CADEE testing, we have
included a set of sample EVB input files for the user (see
$CADEE_DIR/example). For more information about the theo-
retical background, we refer the user to earlier publications [14, 15,
17].
In order to run properly, CADEE requires the following files:
1. A structure file in PDB format ($CADEE_DIR/example/wt.pdb),
comprising the wild-type enzyme with correct ionization
states for ionizable residues, solvated in a water droplet (generated
by qprep5): The initial coordinates are typically obtained
from the Protein Data Bank [40, 41] and then adjusted to be
compatible with Q.
2. A “FEP file” ($CADEE_DIR/example/wt.fep), i.e., a file containing
the force field parameters for the different EVB states,
for the purposes of the simulation setup: Note that Qdyn does
not distinguish between general free energy perturbation and
specific EVB calculations when reading input.
3. The qprep5 input file ($CADEE_DIR/example/wt.qpinp),
which was used to generate the initial simulation-ready PDB
file: CADEE will use this file to check the system configuration
and to prepare the relevant topologies.
4. The full path to the folder containing all topology and parameter
files needed to perform the simulation
($CADEE_DIR/example/libraries/).
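The four inputs listed above can be verified before a preparation run is attempted. The Python sketch below is a hypothetical pre-flight check and not a CADEE feature; it assumes the file names of the shipped example (wt.pdb, wt.fep, wt.qpinp, and a libraries folder) and only tests for their presence, not their contents.

```python
from pathlib import Path


def missing_prep_inputs(example_dir):
    """Return the required inputs that are absent from example_dir."""
    root = Path(example_dir)
    required_files = ["wt.pdb", "wt.fep", "wt.qpinp"]
    missing = [name for name in required_files
               if not (root / name).is_file()]
    if not (root / "libraries").is_dir():
        missing.append("libraries/")
    return missing
```

For example, calling missing_prep_inputs on the expanded value of $CADEE_DIR/example should return an empty list for an intact installation.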
4.2.1 Preparing a CADEE Simulation
To begin with our working example, a simulation may be prepared
with CADEE’s “prep” keyword:
In case the subfolder “wt” exists, CADEE will warn the user
about this. In many cases, the wt.qpinp files then need to be
adapted to a CADEE-specific format, using absolute filenames
and coordinates: The “cadee prep” command will try to automati-
cally perform these changes and create a new file, inserting “.new”
before the file extension; for example, in the example used here, this
would be “wt.new.qpinp” (if this file already exists, “cadee prep”
stops and asks the user to remove “wt.new.qpinp”). Only then are
the wild-type simpacks created and finally packed. The very last line
indicates that the simpacks are ready to use. Caution: The simpacks
have been prepared, but not yet computed. Instructions for the
computation are provided in Subheading 4.3.1. Note that instead
of deleting the old “wt.new.qpinp” the input line may be adjusted,
and instead of wt.qpinp, wt.new.qpinp may be used:
ls $CADEE_DIR/testing_example/wt
wt_0.tar wt_1.tar wt_2.tar wt_3.tar
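The listing above follows the [variant-name]_[replica].tar nomenclature described in Subheading 2.3. As an illustration of how such names can be grouped in a user's own scripts, the following Python sketch (hypothetical helper, not shipped with CADEE) parses a list of file names into variants and replica indices.

```python
import re
from collections import defaultdict

# [variant-name]_[replica].tar, e.g. "wt_0.tar" or "H104A_3.tar"
SIMPACK_NAME = re.compile(r"^(?P<variant>.+)_(?P<replica>\d+)\.tar$")


def group_simpacks(filenames):
    """Map variant name -> sorted replica indices, ignoring other files."""
    groups = defaultdict(list)
    for name in filenames:
        match = SIMPACK_NAME.match(name)
        if match:
            groups[match.group("variant")].append(int(match.group("replica")))
    return {variant: sorted(replicas) for variant, replicas in groups.items()}


listing = ["wt_0.tar", "wt_1.tar", "wt_2.tar", "wt_3.tar", "H104A_0.tar"]
print(group_simpacks(listing))
# {'wt': [0, 1, 2, 3], 'H104A': [0]}
```

A variant with fewer than four replica indices indicates that one or more simpacks are missing from the folder.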
4.2.2 Preparing for Molecular Dynamics Simulations
Put simply, a simpack contains input files required to run qdyn5 and
qfep5, both of which are utilities that are needed in order to
perform and analyze EVB simulations. When CADEE is computing
simpacks, no new files are generated in the simpack folder, but the
simpacks simply increase in size from a couple of megabytes to
gigabytes. It is therefore crucial that the folder containing the
simpacks holds enough free storage to accommodate this. A sim-
pack contains all restart information needed, and if a run is inter-
rupted and later restarted, the simpack alone is enough to restart
the CADEE simulation. Simpacks should in principle not be cor-
rupted, except if CADEE has stopped ungracefully, for example if a
simulation runs out of storage. Clearly, however, in the event that
simpacks are corrupted, then the corrupted simpacks need to be
repaired before proceeding (see the troubleshooting description in
Subheading 8.3).
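Because a simpack is an ordinary tarball, its contents can be inspected with standard tools. The Python sketch below is a hypothetical check, not part of CADEE, that looks for the three baseline files named in Subheading 2.3 (mutant.top, mutant.pdb, and mutant.fep). The internal folder layout of a simpack is not assumed here, so files are matched by base name only.

```python
import posixpath
import tarfile

BASELINE_FILES = ("mutant.top", "mutant.pdb", "mutant.fep")


def missing_baseline_files(simpack_path):
    """Return the baseline files that are not found in the simpack.

    tarfile.open raises tarfile.ReadError if the file is not a readable
    tar archive.
    """
    with tarfile.open(simpack_path) as archive:
        present = {posixpath.basename(member.name)
                   for member in archive.getmembers() if member.isfile()}
    return [name for name in BASELINE_FILES if name not in present]
```

An empty return value means that all three baseline files are present; it does not guarantee that the remaining input files are complete.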
This command will launch “cadee dyn” with four working tasks
(plus one for input/output). Note that the log file will be only
written to standard out (the console) by default, and when using
the “| tee cadee.log” part of above command, the log is additionally
written to cadee.log. We note that the resulting simpack includes all
input files and output files; that is, “cadee dyn” will not generate
special output files, but instead the simpack files will become larger
(for more about simpacks, see Subheading 2.3). Depending on the
MPI implementation used, the “mpirun.mpich” command needs to
be adjusted (possible commands include “srun,” “mpiexec,” or
“mpirun”). The command above should create output similar to:
4.3.2 Saving Wall-Clock Time
In certain scenarios, it is important to get results fast, and to use the
available resources for speed rather than for efficiency, such as when a
wild-type reaction needs to be prototyped for a certain enzyme. For such
cases, CADEE ships scripts that need to be adjusted to the user’s
machine. These scripts are located in $CADEE_DIR/cadee/tools/pcadee.sh
and $CADEE_DIR/cadee/tools/srunq.sh.
Once adapted to the computer system of interest,
they can be launched by:
5 Pedagogical Examples
5.1 Example: Alanine Scan
Preparing simpacks for an alanine scan is straightforward once the
system has been correctly prepared and benchmarked. The argument
needed to run an alanine scan using CADEE is “--alascan.”
Optional parameters are “--radius” (mutate all residues within a
certain radius around the center of the simulation sphere) and
“--nummuts” (prepare alanine scan inputs for the N innermost
residues, increasing the radius around the simulation center).
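The two selection modes can be pictured with a small, self-contained Python sketch. This is an illustration of the idea rather than CADEE's implementation: the function name is hypothetical, each residue is reduced to one representative coordinate (an assumption, since the atoms CADEE uses for the distance are not specified here), and alanine and glycine residues are skipped as described for the working example.

```python
import math


def innermost_residues(residues, center, nummuts=None, radius=None):
    """Pick non-Ala/Gly residues nearest to the sphere center.

    residues: iterable of (residue_number, three_letter_name, (x, y, z)).
    radius:   keep only residues within this distance of the center.
    nummuts:  keep only the N closest residues.
    """
    candidates = []
    for number, name, xyz in residues:
        if name in ("ALA", "GLY"):
            continue
        distance = math.dist(xyz, center)
        if radius is None or distance <= radius:
            candidates.append((distance, number, name))
    candidates.sort()
    if nummuts is not None:
        candidates = candidates[:nummuts]
    return [(number, name) for _, number, name in candidates]


# Toy coordinates, for illustration only.
toy = [(92, "LEU", (1.0, 0.0, 0.0)), (93, "ALA", (0.5, 0.0, 0.0)),
       (163, "TYR", (3.0, 0.0, 0.0)), (171, "THR", (2.0, 0.0, 0.0)),
       (200, "GLY", (0.1, 0.0, 0.0))]
print(innermost_residues(toy, (0.0, 0.0, 0.0), nummuts=2))
# [(92, 'LEU'), (171, 'THR')]
```

With radius=2.5 instead of nummuts=2, the same two residues are returned for this toy input, since the tyrosine lies 3.0 units from the center.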
5.3 Example: Manual Analysis of CADEE Simulations
In some cases, the user might desire the raw data from a CADEE
simulation to perform analysis with their own post-processing
scripts, and avoid information overload from the cadee.db files
(which are in sqlite3 format). For those cases, we provide a script
to convert the activation energies or the free energies to the
comma-separated value (csv) file format. The corresponding data
can then be opened with any relevant spreadsheet software (see also
Fig. 3).
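Since cadee.db files are ordinary sqlite3 databases, they can also be read directly with the Python standard library. The sketch below is a generic export and not the conversion script provided with CADEE: it makes no assumption about the table or column names in cadee.db and simply writes every table it finds to its own csv file.

```python
import csv
import sqlite3


def export_tables_to_csv(db_path, out_prefix):
    """Write every table in the SQLite file to '<out_prefix>_<table>.csv'.

    Returns the list of csv paths that were written.
    """
    written = []
    connection = sqlite3.connect(db_path)
    try:
        tables = [row[0] for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
        for table in tables:
            cursor = connection.execute(f'SELECT * FROM "{table}"')
            header = [column[0] for column in cursor.description]
            out_path = f"{out_prefix}_{table}.csv"
            with open(out_path, "w", newline="") as handle:
                writer = csv.writer(handle)
                writer.writerow(header)   # column names first
                writer.writerows(cursor)  # then all rows
            written.append(out_path)
    finally:
        connection.close()
    return written
```

A call such as export_tables_to_csv("cadee.db", "cadee") would then produce one csv file per table, each with a header row of column names.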
Fig. 2 A screenshot of the web user interface for the analysis of alanine scans. CC-BY adapted from Ref. [14]
Fig. 3 The initial alanine scan performed on triosephosphate isomerase (TIM). The alanine scan was prepared
with the --nummuts argument as follows: After the initial system setup, 48 residues distributed radially around
the center of the simulation sphere that were neither alanine nor glycine were selected, and hydrogen atoms
and heavy atoms other than backbone atoms and Cβ were removed. Each variant was subsequently
re-solvated and the CADEE simulation was started. From the displayed data, three positions were selected
to start the next simulations: L93, Y164, and T172, respectively, in positions 92, 163, and 171 in CADEE (see
main text). CC-BY adapted from Ref. [14]
5.4 Example: Concatenating cadee.db Files
In some cases, the user might desire to merge cadee.db files (which
are in sqlite3 format). For those cases, we provide a script to
concatenate two or more cadee.db files.
5.5 Example: Point Saturation Mutagenesis
Once a promising hotspot site has been identified, one way to
continue the CADEE analysis is to perform computational point
saturation mutagenesis on it (Fig. 4). This can be simplified
by using a reduced set of amino acids for the calculations (see Reetz
[43]), and CADEE supports different amino acid libraries for this
purpose. A list of the amino acid libraries implemented into
CADEE is shown in Table 1.
In the current working example, we have saturated three posi-
tions to all 20 natural amino acids, which were initiated as follows:
Fig. 4 The pedagogical example and point saturation mutagenesis of residues 93, 164, and 172, compared to
the wild-type simulation on the left. The data has been sorted by residue number. The free energy profiles of
the L93W, L93Q, L93R, Y164R, T172R, L93K, Y164K, and T172K variants did not converge. CC-BY adapted
from Ref. [14]
Table 1
We present here CADEE’s built-in amino acid libraries, with both the
associated shortcut and the one-letter amino acid codes for each library
residue, respectively
163:SATURATE (+ native/wt)
[...]
INFO:root:Working on $CADEE_DIR/pedagogical_example/libmut/Y163W.
INFO:prep.pyscwrl:Clash-Score was: 0.218, will now re-run and allow Scwrl4 to modify
residues [163, 183, 667]
INFO:prep.pyscwrl:Clash-Score new: 0.460648886108 ==> Rollback!
[...]
In this case, as the clash was minor, allowing SCWRL4 to
realign residues 163, 183, and 667 did not improve the Clash-Score
(it rose from 0.218 to about 0.461), so CADEE reverts to the original
alignment (“Rollback!”).
INFO:root:Working on $CADEE_DIR/pedagogical_example/libmut/L092D.
INFO:prep.pyscwrl:No clashes detected.
[...]
171:SATURATE (+ native/wt)
[...]
INFO:prep.create_inputs:Creating input files for Y163R.
[...]
INFO:prep.create_inputs:Creating input files for T171H.
INFO:root:Packing L092V:
INFO:root:Pack # 0, Seed: 446064
INFO:root:Pack # 1, Seed: 900050
INFO:root:Pack # 2, Seed: 641446
INFO:root:Pack # 3, Seed: 151899
INFO:root:Packing Y163M:
INFO:root:Pack # 0, Seed: 76740
INFO:root:Pack # 1, Seed: 404917
INFO:root:Pack # 2, Seed: 499416
INFO:root:Pack # 3, Seed: 751226
[...]
Success! You find your simpacks in $CADEE_DIR/pedagogical_example/libmut.
Fig. 5 Data obtained through partial combinatorial saturation mutagenesis at positions 93(A/G/H), 164(S/P/H/
E/C/A), and 172(W/S/R/L/D). ΔG‡ denotes the calculated activation free energies for each variant, and the
error bars denote the standard deviation over 4 × 6 = 24 EVB trajectories per variant. As displayed, some
variants have very large uncertainty in the calculated values. These instabilities can be caused by different
factors or combinations of factors (for example structural instabilities caused by the insertion of the new
residue or insufficient equilibration time). To improve the equilibration, the trend within data collection could
be studied and longer simulations conducted, or the current ones extended, or additional simulations
performed. CC-BY adapted from Ref. [14]
5.6 Example: Combinatorial Saturation Mutagenesis
After a reduced set of interesting amino acids has been selected by
the user, combinatorial saturation mutagenesis can be performed
(Fig. 5) to screen whether the mutations are additive or, when
introduced at the same time, have a larger effect than the sum of the
individual mutations (epistasis). Note that CADEE was written with the
aim of testing the saturation mutagenesis of several different residues
together with a single command: “cadee prep . . . --libmut.”
We have therefore decided to use the results obtained from individual
point saturation at each of the three positions, choosing a subset
of amino acids to be tested at each hot spot:
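To give a sense of the size of the resulting search space, the following Python sketch enumerates the combinations for the amino acid subsets listed in the caption of Fig. 5. It is an illustration only and does not reproduce the CADEE command: the variable and function names are hypothetical, positions are given in the paper numbering (CADEE itself uses positions 92, 163, and 171), and the way CADEE names combined variants is not assumed.

```python
from itertools import product

# Subsets from the caption of Fig. 5 (paper numbering).
choices = {
    93: "AGH",
    164: "SPHECA",
    172: "WSRLD",
}


def enumerate_combinations(choices):
    """Return every combination as a tuple of (position, amino_acid) pairs."""
    positions = sorted(choices)
    return [tuple(zip(positions, combo))
            for combo in product(*(choices[p] for p in positions))]


def nonadditivity(ddg_combined, ddg_singles):
    """Deviation of a combined variant from the sum of its single mutations.

    A value of zero corresponds to purely additive effects.
    """
    return ddg_combined - sum(ddg_singles)


variants = enumerate_combinations(choices)
print(len(variants))  # 90
print(variants[0])    # ((93, 'A'), (164, 'S'), (172, 'W'))
```

With four replicas per variant, the 3 × 6 × 5 = 90 combinations correspond to 360 simpacks, before counting the wild type.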
5.7 CADEE Customization
CADEE provides a straightforward and fast way to generate and
test hundreds to thousands of mutants of a well-parameterized
EVB reaction. To generate simpacks, CADEE relies on “simpack-templates”:
Currently, one simpack-template is included as a
default, and a second one has been used in Subheading 4.2. We
strongly recommend that users examine the existing templates and
adjust them as per user requirements: Additional templates and
documentation (readme.md) are available in
$CADEE_DIR/simpack_templates/.
6 Limitations of CADEE
In the case of multistep reaction profiles where only the initial rate-
limiting step was subjected to CADEE evolution, we recommend
taking the best CADEE hits and running EVB on all other reaction
steps to ensure that the proposed residue substitutions do not cause
a change in rate-limiting step. We also recommend that additional
(and longer) simulations should be run for the best hits identified
to both improve the quality of the predictions obtained and reduce
the risk of false positives due to too short sampling time.
8 Troubleshooting
8.2 CADEE First Start
If the first start of the CADEE script fails, two things could have
gone wrong:
1. Problem: Setup failed. The following script can be used to check
if CADEE is installed properly.
8.3 Simpacks
For some computer architectures, the compute time needed to
perform one simulation exceeds the wall-clock limit. CADEE is
hence able to restart and continue simulations, and the user can
simply resubmit the original submission file, to continue the simulation
with the same command, as there is no special restart flag. To
detect unfinished simpacks, the most straightforward way is to
compare the simpack sizes (/bin/ls -lS). Sometimes, however, a
node may have crashed, or a hard disk quota may have been hit, and
hence a simpack may be faulty and not finish even with enough
wall-clock time available. In those cases, it is advisable to untar the
simpack and repack it. A script to do this is:
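The two operations described above, comparing simpack sizes and
repacking a faulty simpack, can be sketched in Python as follows.
This is a hedged illustration rather than the script shipped with
CADEE. It assumes that simpacks are tar archives carrying a ".tar"
suffix and that writing the repacked archive uncompressed is
acceptable. The function names simpacks_by_size and repack_simpack
are hypothetical.

```python
import os
import shutil
import tarfile
import tempfile


def simpacks_by_size(directory, suffix=".tar"):
    """List (size_in_bytes, filename) pairs, largest first, like `ls -lS`."""
    pairs = []
    for name in os.listdir(directory):
        if name.endswith(suffix):  # assumption: simpacks end in ".tar"
            size = os.path.getsize(os.path.join(directory, name))
            pairs.append((size, name))
    return sorted(pairs, reverse=True)


def repack_simpack(path):
    """Unpack a tar archive into a scratch directory and write it back in place."""
    scratch = tempfile.mkdtemp(prefix="simpack_")
    repacked = path + ".repacked"
    try:
        # "r:*" lets tarfile detect the compression of the existing archive.
        with tarfile.open(path, "r:*") as src:
            src.extractall(scratch)
        # Assumption: an uncompressed tar archive is an acceptable result.
        with tarfile.open(repacked, "w") as dst:
            for name in sorted(os.listdir(scratch)):
                dst.add(os.path.join(scratch, name), arcname=name)
        # Replace the original only after the new archive is complete.
        os.replace(repacked, path)
    finally:
        shutil.rmtree(scratch, ignore_errors=True)
```

Simpacks that are markedly smaller than the rest in the output of
simpacks_by_size are candidates for being unfinished. If extraction
fails, repack_simpack raises the error and leaves the original archive
untouched.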
8.3.1 Simpack Customization
A simpack contains all files necessary to perform an EVB simulation
with Q. A minimal simpack hence contains files for the (1)
initialization, (2) thermalization/heat-up, (3) equilibration, (4)
free energy perturbation/empirical valence bond computation, and (5)
empirical valence bond free energy mapping. More detailed information
about simpacks and how they can be customized can be found in
$CADEE_DIR/simpack_templates/readme.md.
References
1. Bornscheuer UT (1998) Directed evolution of improved enzymes: how to escape from local
enzymes. Angew Chem Int Ed 37:3105–3108 minima. ChemBioChem 13:1060–1066
2. Bull AT, Ward AC, Goodfellow M (2000) 11. Barrozo A, Borstnar R, Marloie G, Kamerlin
Search and discovery strategies for biotechnol- SCL (2012) Computational protein engineer-
ogy: the paradigm shift. Microbiol Mol Biol ing: bridging the gap between rational design
Rev 64:573–606 and laboratory evolution. Int J Mol Sci
3. Tao H, Cornish VW (2002) Milestones in 13:12428–12460
directed enzyme evolution. Curr Opin Chem 12. Kiss G, Çelebi-Ölçum N, Moretti R, Baker D,
Biol 6:858–864 Houk KN (2012) Computational enzyme
4. Currin A, Swainston N, Day PJ, Kell DB design. Angew Chem Int Ed 52:5700–5725
(2015) Synthetic biology for the directed evo- 13. Romero-Rivera A, Garcia-Borràs M, Osuna S
lution of biocatalysts: navigating sequence (2017) Computational tools for the evaluation
space intelligently. Chem Soc Rev of laboratory-engineered biocatalysts. Chem
44:1172–1239 Commun 53:284–297
5. Packer MS, Liu DR (2015) Methods for the 14. Amrein BA, Steffen-Munsberg F, Szeler I,
directed evolution of proteins. Nat Rev Genet Purg M, Kulkarni Y, Kamerlin SCL (2017)
16:79–394 CADEE: computer-aided directed evolution
6. Arnold FH, Volkov AA (1999) Directed evolu- of enzymes. IUCrJ 4:50–64
tion of biocatalysts. Curr Opin Chem Biol 15. Warshel A, Weiss RM (1980) An empirical
3:54–59 valence bond approach for comparing reactions
7. J€ackel C, Kast P, Hilvert D (2008) Protein in solutions and in enzymes. J Am Chem Soc
design by directed evolution. Annu Rev Bio- 102:6218–6226
phys 37:153–173 16. Warshel A, Sharma PK, Kato M, Xiang Y,
8. Currin A, Swainston N, Day PJ, Kell DB Liu H, Olsson MHM (2006) Electrostatic
(2015) Synthetic biology for the directed evo- basis for enzyme catalysis. Chem Rev
lution of protein biocatalysts: navigating 106:320–3235
sequence space intelligently. Chem Soc Rev 17. Kamerlin SCL, Warshel A (2010) The EVB as a
44:1172–1239 quantitative tool for formulating simulations
9. Romero PA, Arnold FH (2009) Exploring pro- and analyzing biological and chemical reac-
tein fitness landscapes by directed evolution. tions. Faraday Discuss 145:71–106
Nat Rev Mol Cell Biol 10:866–876 18. Luo J, van Loo B, Kamerlin SCL (2012) Exam-
10. Gumulya Y, Sanchis J, Reetz MT (2012) Many ining the promiscuous phosphatase activity of
pathways in laboratory evolution can lead to Pseudomonas aeruginosa arylsulfatase: a
414 Beat Anton Amrein et al.
comparison to analogous phosphatases. Pro- 33. King G, Warshel A (1989) A surface con-
teins Struct Funct Bioinf 80:1211–1226 strained all-atom solvent model for effective
19. Barrozo A, Duarte F, Bauer P, Carvalho ATP, simulations of polar solutions. J Chem Phys
Kamerlin SCL (2015) Cooperative electro- 91:3647–3661
static interactions drive functional evolution in 34. Lee FS, Warshel A (1992) A local reaction field
the alkaline phosphatase superfamily. J Am method for fast evaluation of long-range elec-
Chem Soc 137:9061–9076 trostatic interactions in molecular simulations.
20. Q Official Website. http://xray.bmc.uu.se/ J Chem Phys 97:3100–3107
~aqwww/q 35. Stallman RM (2009) GCC developer commu-
21. Manual for the molecular Dynamics package nity, using the Gnu compiler collection: A Gnu
Q. http://xray.bmc.uu.se/~aqwww/q/ manual for Gcc version 4.3.3. CreateSpace.
documents/qman5.pdf p 636
22. MPI4Py. https://pypi.python.org/pypi/ 36. Gabriel E, Fagg GE, Bosilca G, Angskun T,
mpi4py Dongarra JJ, Squyres JM, Sahay V,
23. O’Boyle NM, Banck M, James CA, Morley C, Kambadur P, Barrett B, Lumsdaine A, Castain
Vandermeersch T, Hutchison GR (2011) RH, Daniel DJ, Graham RL, Woodall TS
Open babel: an open chemical toolbox. J Che- (2004) Open MPI: Goals, concept, and design
minform 3:33–33 of a next generation MPI implementation. In:
Kranzlmüller D, Kacsuk P, Dongarra J (eds)
24. Krivov GG, Shapovalov MV, Dunbrack RL Recent Advances in Parallel Virtual Machine
(2009) Improved prediction of protein side- and Message Passing Interface: 11th
chain conformations with SCWRL4. Proteins European PVM/MPI Users’ Group Meeting
Struct Funct Bioinf 77:778–795 Budapest, Hungary, September 19–22, 2004.
25. Frushicheva MP, Cao J, Chu ZT, Warshel A Proceedings. Springer Berlin Heidelberg, Ber-
(2010) Exploring challenges in rational lin, Heidelberg, pp 97–104
enzyme design by simulating the catalysis in 37. Gropp W (2002) MPICH2: A New Start for
artificial Kemp eliminase. Proc Natl Acad Sci MPI Implementations. In: Proceedings of the
107:16869–16874 9th European PVM/MPI Users’ Group
26. Frushicheva MP, Cao J, Warshel A (2011) Meeting on recent advances in parallel virtual
Challenges and advances in validating enzyme machine and message passing interface,
design proposals: the case of Kemp eliminase Springer-Verlag, p 7
catalysis. Biochemistry 50:3849–3858 38. Python Software Foundation. Python Lan-
27. Kamerlin SCL, Warshel A (2011) The empiri- guage Reference, version 2.7. http://www.
cal valence bond model: theory and applica- python.org/
tions. WIREs Comput Mol Sci 1:30–45 39. Marelius J, Kolmodin K, Feierberg I, Åqvist J
28. Amrein BA, Bauer P, Duarte F, Janfalk Carls- (1998) Q: A molecular dynamics program for
son Å, Naworyta A, Mowbray SL, free energy calculations and empirical valence
Widersten M, Kamerlin SCL (2015) Expand- bond simulations in biomolecular systems. J
ing the catalytic triad in epoxide hydrolases and Mol Graph Model 16:213–225
related enzymes. ACS Catal 5:5702–5713 40. Berman HM, Westbrook J, Feng Z,
29. Ben-David M, Sussman JL, Maxwell CI, Gilliland G, Bhat TN, Weissig H, Shindyalov
Szeler K, Kamerlin SCL, Tawfik DS (2015) IN, Bourne PE (2000) The Protein Data Bank.
Catalytic stimulation by restrained active-site Nucleic Acids Res 28:235–242
floppiness—the case of high density 41. Berman HM, Henrick K, Nakamura H (2003)
lipoprotein-bound serum paraoxonase-1. J Announcing the worldwide Protein Data Bank.
Mol Biol 427:1359–1374 Nat Struct Mol Biol 10:980–980
30. Roca M, Vardi-Kilshtain A, Warshel A (2009) 42. HPC2N. http://www.hpc2n.umu.se/
Toward accurate screening in computer-aided
enzyme design. Biochemistry 48:3046–3056 43. Reetz MT, Wu S (2008) Greatly reduced
amino acid alphabets in directed evolution:
31. Frushicheva MP, Mills MJL, Schopf P, Singh making the right choice for saturation muta-
MK, Prasad RB, Warshel A (2014) Computer genesis at homologous enzyme positions.
aided enzyme design and catalytic concepts. Chem Commun 21:5499–5501
Curr Opin Chem Biol 21:56–62
44. Murzin AG, Brenner SE, Hubbart T, Chothia
32. Carvalho ATP, Barrozo A, Doron D, Kilshtain C (1995) SCOP: a structural classification of
AV, Major DT, Kamerlin SCL (2014) Chal- proteins database for the investigation of
lenges in computational studies of enzyme sequences and structures. J Mol Biol
structure, function and dynamics. J Mol 247:536–540
Graph Model 54:62–79
In Silico Directed Evolution 415
45. Cheng H, Schaeffer RD, Liao Y, Kinch LN, for the functional annotation of proteins.
Pei J, Shi S, Kim BH, Grishin NV (2014) Nucleic Acids Res 39(Database):D225–D229
ECOD: an evolutionary classification of pro- 48. Ponting CP, Schultz J, Milpetz F, Bork P
tein domains. PLoS Comput Biol 10: (1999) SMART: identification and annotation
e1003926 of domains from signalling and extracellular
46. Finn RD, Bateman A, Clements J, Coggill P, protein sequences. Nucleic Acids Res
Eberhardt RY, Eddy SR, Heger A, 27:229–232
Hetherington K, Holm L, Mistry J, Sonnham- 49. Haft DH, Selengut JD, White O (2003) The
mer ELL, Tate J, Punta M (2014) Pfam: the TIGRFAMs database of protein families.
protein families database. Nucleic Acids Res Nucleic Acids Res 31:371–373
42:D222–D230 50. Jones DT (1999) Protein secondary structure
47. Marchler-Bauer A, Lu S, Anderson JB, prediction based on position-specific scoring
Chitsaz F, Derbyshire MK, DeWeese-Scott C, matrices. J Mol Biol 292:195–202
Fong JH, Geer LY, Geer RC, Gonzales NR, 51. Buchan DWA, Minneci F, Nugent TCO,
Gwadz M, Hurwitz DI, Jackson JD, Ke Z, Bryson K, Jones DT (2013) Scalable web ser-
Lanczycki J, Lu F, Marchler GH, vices for the PSIPRED Protein Analysis Work-
Mullokandov M, Omelchenko MV, Robertson bench. Nucleic Acids Res 41(W1):
CL, Song JS, Thanki N, Yamashita RA, W340–W348
Zhang D, Zhang N, Zheng C, Bryant SH
(2011) CDD: a conserved domain database
Correction to: Enhancing Statistical Multiple Sequence
Alignment and Tree Inference Using Structural Information
Joseph L. Herman
Correction to:
Chapter 10 in: Tobias Sikosek (ed.),
Computational Methods in Protein Evolution,
Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_10
The published version of this book included errors in code listings in Chapter 10. These
code listings have been corrected and text has been updated.
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_23, © Springer Science+Business Media, LLC, part of Springer Nature 2019
Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8, © Springer Science+Business Media, LLC, part of Springer Nature 2019