Computational Methods in Protein Evolution 2019

Methods in Molecular Biology 1851

Tobias Sikosek, Editor

METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life and Medical Sciences,
University of Hertfordshire,
Hatfield, Hertfordshire AL10 9AB, UK
For further volumes:

http://www.springer.com/series/7651
Computational Methods in Protein
Evolution
Edited by
Tobias Sikosek
GlaxoSmithKline, Cellzome - a GSK company, Meyerhofstrasse 1,
Heidelberg, Baden-Württemberg, Germany
ISSN 1064-3745 ISSN 1940-6029 (electronic)

Methods in Molecular Biology
ISBN 978-1-4939-8735-1 ISBN 978-1-4939-8736-8 (eBook)
https://doi.org/10.1007/978-1-4939-8736-8
Library of Congress Control Number: 2018954227
© Springer Science+Business Media, LLC, part of Springer Nature 2019, corrected publication 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to
be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty,
express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Humana Press imprint is published by the registered company Springer Science+Business Media, LLC, part of
Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface
Proteins are the most versatile kind of molecule that we know and the result of a long
evolutionary process. During this process, countless rearranging, mutating, and replicating
strands of DNA have managed to both encode and conserve proteins that would allow them
to replicate and stay intact and on the other hand have allowed their proteins to change and
ultimately help them replicate more than other strands of DNA. All cells make proteins in
their protein factories called ribosomes, where the RNA transcript of a gene is translated according to
the ancient genetic code into strings of amino acids which follow the laws of thermodynam-
ics and molecular forces to fold up into specific wobbly three-dimensional shapes. Protein
evolution happens whenever an accidental “typo”—or mutation—in the gene is translated
into a modified protein, and that protein is released into the busy commotion within the cell,
packed within a dense soup of other molecules in water. Whatever this new protein does
differently than its predecessor can determine the fate of that mutation, making it either an
essential innovation, a terrible mistake that gets erased, or something that just stays around
for a while without being noticed, maybe to play a role in the distant future.
This book is a compilation of methods that can be applied to various problems related to
protein sequence and structure. It is a diverse collection of approaches ranging from broad
conceptual (“protein space”) to very specific applications (“antibody modeling”). The term
“evolution” is used slightly differently in various fields of science. While evolutionary
biologists think about the natural process of Darwinian evolution (and other post-
Darwinian forms of evolution of organisms living in populations and environments), bio-
chemists take a more design-oriented approach to evolution, using the evolutionary process
in vitro or in silico to make proteins with certain desired properties. Physicists, on the other
hand, use the term evolution to describe a continuous process in time that changes a system
from one state to another. While physics plays a significant role in this book, it is the first two
notions of evolution that are described in the following chapters.
Evolutionary research has made extensive use of computers. While the result of evolution
can be readily studied at the macroscopic, phenotypic level, evolutionary biology has
always had a strong theoretical component, since for a long time the actual process could
rarely be observed directly. The underlying patterns of inheritance and the interplay between
geography and population dynamics have been described in mathematical terms and have
always accompanied the progress made in the Molecular Biology of cells that eventually
elucidated the core mechanisms of inheritance: the information stored in DNA and how it is
replicated and passed on—imperfectly—to future generations. The field of Bioinformatics
was born as soon as the first sequences of genes and proteins had been published at a large
enough quantity to be amenable to direct sequence-to-sequence comparisons. The fields of
Molecular Evolution and Phylogenetics were close companions of this development where
mathematical models and computational algorithms were combined to reconstruct the most
likely evolutionary history given the observed DNA sequences. Protein sequences came
essentially for free, because the amino acid sequence can be readily translated from DNA
using the almost universal genetic code. DNA sequences became the main source material of
molecular evolution research for quite a while, further spurred by the Human Genome
Project and later the advent of the next-generation sequencing data explosion. Evolutionary
relationships within populations and among species were revealed in ever greater detail.
Still, no matter how much genetic sequence data has become available, many aspects of
how genetics translates to observable (phenotypic) changes cannot be understood
at that level of description. Network science is another toolkit rooted
in math and computation that is used to study evolution at the genotypic to phenotypic
interface. There are networks representing physical and chemical molecular interactions
within a cell, the flow of information and cell-level “computation” and communication, as
well as more abstract networks describing the relationships and similarities between gene
and protein sequences, including the entire “universe” of known proteins. While biological
network science, often called systems biology, comes close to providing a working model
of the cellular phenotype, the real “gap” in understanding how a mutation in the DNA
sequence affects the survival and fitness of an entire organism lies in how physical
interactions, the “edges” or connections in systems biology networks, result from the
biophysical properties of proteins, which can be altered by mutations. It is this point—
where changes of DNA translate into altered protein structure and function—that most of
the methods in this book are focused on.
While Molecular Evolution was a backward-facing, almost historical, discipline in
its early days, it has increasingly matured into an “applicable” science due to its intersections
with Biochemistry and Biophysics. Protein evolution is therefore much more than just the
description of evolutionary relationships based on sequence differences. It has become a
powerful tool for interfering with the evolution of pathogens, for devising therapies against
mutation-based diseases such as cancers, and for designing novel enzymes with properties
that can go beyond naturally evolved functions. Methods from evolution can be easily
applied whenever genetic variation is at play, and this variation is what makes all humans
unique and sometimes even determines why diseases and infections affect each of us
differently.
While each chapter in this book is the unique work of its authors and there is no
predefined “narrative” to this book, some common themes become apparent.
The first theme is that of mutations of single amino acids, i.e., point mutations. Predicting
their effect on the physical structure of a protein is an important capability that links the
abundance of sequence information with the comparatively few known structures (Chapters
1 and 2). Other mutational mechanisms lead to gene duplication (Chapter 3) and even de
novo emergence of new genes (Chapter 4).
Likewise, the understanding of pairwise correlated mutations can be used to reveal
structure information where none is available because the fates of spatially close (and
physically interacting) amino acids are evolutionarily linked and coevolve (Chapters 5, 6
and 7).
Going back into evolutionary history, the structure and function of proteins can be
reconstructed and used productively, since these may bear similar functions to their extant
descendants yet also may have some new functional properties (Chapters 8 and 9). Many
formerly sequence-based methods such as sequence alignments and phylogenies can be
improved by applying a more structural and biophysical viewpoint (Chapters 10 and 11).
Instead of exploring similar proteins along evolutionary time, one can of course also
compare existing proteins based on their similarity in sequence and structure. A number of
classification schemes for organizing all known proteins exist, and it is possible to explore an
entire “protein universe,” often by breaking full proteins into even smaller building units
called domains (Chapters 12, 13, 14, 15 and 16). Homology modeling makes use of these
similarities by fitting the sequences of proteins without known structure to those known
structures of proteins with similar sequence (Chapter 17). This structure prediction can also
be extended to protein-protein interactions (Chapter 18), and even some structural properties
of proteins lacking a fixed structure, i.e., disordered/unstructured proteins, can be
predicted (Chapter 19). Another important aspect related to disorder is the intrinsic
dynamic nature of folded proteins that always exist as an ensemble of conformations,
some of which become favored or disfavored with evolutionary changes (Chapter 20).
Finally, evolutionary principles are at work in shaping such versatile proteins as anti-
bodies or enzymes, which can also be designed to have certain properties in silico by
applying directed evolution, i.e., where the evolutionary endpoint, but not its path, is
determined by the researcher (Chapters 21 and 22).
The book covers a wide range of computational approaches, including the dynamic
programming techniques of sequence alignments, the clustering methods of phylogenies,
physics-based approaches such as molecular dynamics simulations, and a range of statistical,
graph-based, and machine learning methods. While the authors take the time to give some
background and references in the introductory sections, this book is not a textbook, and
more detailed descriptions of underlying theory and algorithms may have to be found
elsewhere. Nevertheless, I think that there is a lot to be learned from this book for an
interdisciplinary readership.
I sincerely hope that this book offers many useful workflows and techniques that help
many researchers and students working with proteins computationally. I also strongly
encourage the reader to go beyond the individual protocol and mix and match the different
methods to come up with new innovative solutions. That’s what evolution would do.
Heidelberg, Germany Tobias Sikosek

Contents
Preface
Contributors
1 Predicting the Effect of Mutations on Protein Folding and Protein-Protein Interactions
Alexey Strokach, Carles Corbi-Verge, Joan Teyra, and Philip M. Kim
2 Accurate Calculation of Free Energy Changes upon Amino Acid Mutation
Matteo Aldeghi, Bert L. de Groot, and Vytautas Gapsys
3 Protocols for the Molecular Evolutionary Analysis of Membrane Protein Gene Duplicates
Laurel R. Yohe, Liang Liu, Liliana M. Dávalos, and David A. Liberles
4 Computational Prediction of De Novo Emerged Protein-Coding Genes
Nikolaos Vakirlis and Aoife McLysaght
5 Coevolutionary Signals and Structure-Based Models for the Prediction of Protein Native Conformations
Ricardo Nascimento dos Santos, Xianli Jiang, Leandro Martínez, and Faruck Morcos
6 Detecting Amino Acid Coevolution with Bayesian Graphical Models
Mariano Avino and Art F. Y. Poon
7 Context-Dependent Mutation Effects in Proteins
Frank J. Poelwijk
8 High-Throughput Reconstruction of Ancestral Protein Sequence, Structure, and Molecular Function
Kelsey Aadland, Charles Pugh, and Bryan Kolaczkowski
9 Ancestral Sequence Reconstruction as a Tool for the Elucidation of a Stepwise Evolutionary Adaptation
Kristina Straub and Rainer Merkl
10 Enhancing Statistical Multiple Sequence Alignment and Tree Inference Using Structural Information
Joseph L. Herman
11 The Influence of Protein Stability on Sequence Evolution: Applications to Phylogenetic Inference
Ugo Bastolla and Miguel Arenas
12 Navigating Among Known Structures in Protein Space
Aya Narunsky, Nir Ben-Tal, and Rachel Kolodny
13 A Graph-Based Approach for Detecting Sequence Homology in Highly Diverged Repeat Protein Families
Jonathan N. Wells and Joseph A. Marsh
14 Exploring Enzyme Evolution from Changes in Sequence, Structure, and Function
Jonathan D. Tyzack, Nicholas Furnham, Ian Sillitoe, Christine M. Orengo, and Janet M. Thornton
15 Identification of Protein Homologs and Domain Boundaries by Iterative Sequence Alignment
Dustin Schaeffer and Nick V. Grishin
16 A Roadmap to Domain Based Proteomics
Carsten Kemena and Erich Bornberg-Bauer
17 Modeling of Protein Tertiary and Quaternary Structures Based on Evolutionary Information
Gabriel Studer, Gerardo Tauriello, Stefan Bienert, Andrew Mark Waterhouse, Martino Bertoni, Lorenza Bordoli, Torsten Schwede, and Rosalba Lepore
18 Interface-Based Structural Prediction of Novel Host-Pathogen Interactions
Emine Guven-Maiorov, Chung-Jung Tsai, Buyong Ma, and Ruth Nussinov
19 Predicting Functions of Disordered Proteins with MoRFpred
Christopher J. Oldfield, Vladimir N. Uversky, and Lukasz Kurgan
20 Exploring Protein Conformational Diversity
Alexander Miguel Monzon, Maria Silvina Fornasari, Diego Javier Zea, and Gustavo Parisi
21 High-Throughput Antibody Structure Modeling and Design Using ABodyBuilder
Jinwoo Leem and Charlotte M. Deane
22 In Silico-Directed Evolution Using CADEE
Beat Anton Amrein, Ashish Runthala, and Shina Caroline Lynn Kamerlin
Correction to: Enhancing Statistical Multiple Sequence Alignment and Tree Inference Using Structural Information
Index
Contributors
KELSEY AADLAND Department of Microbiology & Cell Science, Institute for Food and
Agricultural Sciences, University of Florida, Gainesville, FL, USA
MATTEO ALDEGHI Max Planck Institute for Biophysical Chemistry, Computational
Biomolecular Dynamics Group, Göttingen, Germany
BEAT ANTON AMREIN Associate Scientist, Tecan Schweiz AG, Männedorf, Switzerland
MIGUEL ARENAS Department of Biochemistry, Genetics and Immunology, University of
Vigo, Vigo, Spain
MARIANO AVINO Department of Pathology and Laboratory Medicine, Western University,
London, Canada
UGO BASTOLLA Centre for Molecular Biology, Severo Ochoa (CSIC-UAM), Madrid, Spain
NIR BEN-TAL Department of Biochemistry and Molecular Biology, George S. Wise Faculty of
Life Sciences, Tel Aviv University, Tel Aviv, Israel
MARTINO BERTONI Biozentrum, University of Basel and SIB Swiss Institute of
Bioinformatics, Basel, Switzerland
STEFAN BIENERT Biozentrum, University of Basel and SIB Swiss Institute of Bioinformatics,
Basel, Switzerland
LORENZA BORDOLI Biozentrum, University of Basel and SIB Swiss Institute of
Bioinformatics, Basel, Switzerland
ERICH BORNBERG-BAUER Institute for Evolution and Biodiversity, University of Münster,
Münster, Germany
CARLES CORBI-VERGE Terrence Donnelly Centre for Cellular and Biomolecular Research,
University of Toronto, Toronto, ON, Canada
LILIANA M. DÁVALOS Department of Ecology and Evolution, Stony Brook University, Stony
Brook, NY, USA
CHARLOTTE M. DEANE Department of Statistics, University of Oxford, Oxford, UK
MARIA SILVINA FORNASARI Departamento de Ciencia y Tecnología, Universidad Nacional
de Quilmes, CONICET, Bernal, Argentina
NICHOLAS FURNHAM London School of Hygiene and Tropical Medicine, London, UK
VYTAUTAS GAPSYS Max Planck Institute for Biophysical Chemistry, Computational
Biomolecular Dynamics Group, Göttingen, Germany
NICK V. GRISHIN Department of Biophysics, University of Texas Southwestern Medical
Center, Dallas, TX, USA; Howard Hughes Medical Institute, University of Texas
Southwestern Medical Center, Dallas, TX, USA
BERT L. DE GROOT Max Planck Institute for Biophysical Chemistry, Computational
Biomolecular Dynamics Group, Göttingen, Germany
EMINE GUVEN-MAIOROV Cancer and Inflammation Program, Leidos Biomedical Research,
Inc., Frederick National Laboratory for Cancer Research, National Cancer Institute,
Frederick, MD, USA
JOSEPH L. HERMAN Department of Biomedical Informatics, Harvard Medical School,
Boston, MA, USA
KRISTINA STRAUB Institute of Biophysics and Physical Biochemistry, University of
Regensburg, Regensburg, Germany
XIANLI JIANG Department of Biological Sciences, University of Texas at Dallas, Richardson,
TX, USA
SHINA CAROLINE LYNN KAMERLIN Department of Chemistry, BMC, Uppsala University,
Uppsala, Sweden
CARSTEN KEMENA Institute for Evolution and Biodiversity, University of Münster, Münster,
Germany
PHILIP M. KIM Department of Computer Science, University of Toronto, Toronto, ON,
Canada; Terrence Donnelly Centre for Cellular and Biomolecular Research, University of
Toronto, Toronto, ON, Canada; Department of Molecular Genetics, University of Toronto,
Toronto, ON, Canada
BRYAN KOLACZKOWSKI Department of Microbiology & Cell Science, Institute for Food and
Agricultural Sciences, University of Florida, Gainesville, FL, USA; Genetics Institute,
University of Florida, Gainesville, FL, USA
RACHEL KOLODNY Department of Computer Science, University of Haifa, Haifa, Israel
LUKASZ KURGAN Department of Computer Science, Virginia Commonwealth University,
Richmond, VA, USA
JINWOO LEEM Department of Statistics, University of Oxford, Oxford, UK
ROSALBA LEPORE Biozentrum, University of Basel and SIB Swiss Institute of Bioinformatics,
Basel, Switzerland
DAVID A. LIBERLES Department of Biology and Center for Computational Genetics and
Genomics, Temple University, Philadelphia, PA, USA
LIANG LIU Department of Statistics and Institute of Bioinformatics, University of Georgia,
Athens, GA, USA
BUYONG MA Cancer and Inflammation Program, Leidos Biomedical Research, Inc.,
Frederick National Laboratory for Cancer Research, National Cancer Institute, Frederick,
MD, USA
JOSEPH A. MARSH MRC Human Genetics Unit, MRC Institute of Genetics and Molecular
Medicine, University of Edinburgh, Edinburgh, UK
LEANDRO MARTÍNEZ Institute of Chemistry, University of Campinas (UNICAMP),
Campinas, SP, Brazil
AOIFE MCLYSAGHT Department of Genetics, Trinity College Dublin, Smurfit Institute of
Genetics, University of Dublin, Dublin, Ireland
RAINER MERKL Institute of Biophysics and Physical Biochemistry, University of Regensburg,
Regensburg, Germany
ALEXANDER MIGUEL MONZON Departamento de Ciencia y Tecnología, Universidad
Nacional de Quilmes, CONICET, Bernal, Argentina
FARUCK MORCOS Department of Biological Sciences, University of Texas at Dallas,
Richardson, TX, USA; Center for Systems Biology, University of Texas at Dallas,
Richardson, TX, USA
AYA NARUNSKY Department of Biochemistry and Molecular Biology, George S. Wise Faculty
of Life Sciences, Tel Aviv University, Tel Aviv, Israel
RUTH NUSSINOV Cancer and Inflammation Program, Leidos Biomedical Research, Inc.,
Frederick National Laboratory for Cancer Research, National Cancer Institute, Frederick,
MD, USA; Department of Human Genetics and Molecular Medicine, Sackler Institute of
Molecular Medicine, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel
CHRISTOPHER J. OLDFIELD Department of Computer Science, Virginia Commonwealth
University, Richmond, VA, USA
CHRISTINE M. ORENGO Institute of Structural and Molecular Biology, University College
London, London, UK
GUSTAVO PARISI Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes,
CONICET, Bernal, Argentina
FRANK J. POELWIJK cBio Center, Department of Biostatistics and Computational Biology,
Boston, MA, USA
ART F. Y. POON Department of Pathology and Laboratory Medicine, Western University,
London, Canada
CHARLES PUGH Department of Microbiology & Cell Science, Institute for Food and
Agricultural Sciences, University of Florida, Gainesville, FL, USA
ASHISH RUNTHALA Indian Institute of Science, Bangalore, India
RICARDO NASCIMENTO DOS SANTOS Institute of Chemistry, University of Campinas
(UNICAMP), Campinas, SP, Brazil
DUSTIN SCHAEFFER Department of Biophysics, University of Texas Southwestern Medical
Center, Dallas, TX, USA
TORSTEN SCHWEDE Biozentrum, University of Basel and SIB Swiss Institute of
Bioinformatics, Basel, Switzerland
IAN SILLITOE Institute of Structural and Molecular Biology, University College London,
London, UK
ALEXEY STROKACH Department of Computer Science, University of Toronto, Toronto, ON,
Canada; Terrence Donnelly Centre for Cellular and Biomolecular Research, University of
Toronto, Toronto, ON, Canada
GABRIEL STUDER Biozentrum, University of Basel and SIB Swiss Institute of Bioinformatics,
Basel, Switzerland
GERARDO TAURIELLO Biozentrum, University of Basel and SIB Swiss Institute of
Bioinformatics, Basel, Switzerland
JOAN TEYRA Terrence Donnelly Centre for Cellular and Biomolecular Research, University
of Toronto, Toronto, ON, Canada
JANET M. THORNTON EMBL-EBI, Wellcome Genome Campus, Hinxton, Cambridge, UK
CHUNG-JUNG TSAI Cancer and Inflammation Program, Leidos Biomedical Research, Inc.,
Frederick National Laboratory for Cancer Research, National Cancer Institute, Frederick,
MD, USA
JONATHAN D. TYZACK EMBL-EBI, Wellcome Genome Campus, Hinxton, Cambridge, UK
VLADIMIR N. UVERSKY Department of Molecular Medicine and USF Health Byrd
Alzheimer’s Research Institute, Morsani College of Medicine, University of South Florida,
Tampa, FL, USA; Institute for Biological Instrumentation, Russian Academy of Sciences,
Moscow Region, Russia
NIKOLAOS VAKIRLIS Department of Genetics, Trinity College Dublin, Smurfit Institute of
Genetics, University of Dublin, Dublin, Ireland
ANDREW MARK WATERHOUSE Biozentrum, University of Basel and SIB Swiss Institute of
Bioinformatics, Basel, Switzerland
JONATHAN N. WELLS MRC Human Genetics Unit, MRC Institute of Genetics and
Molecular Medicine, University of Edinburgh, Edinburgh, UK
LAUREL R. YOHE Department of Geology & Geophysics, Yale University, New Haven, CT,
USA
DIEGO JAVIER ZEA Structural Bioinformatics Unit, Fundacion Instituto Leloir,
CONICET, Buenos Aires, Argentina
Chapter 1
Predicting the Effect of Mutations on Protein Folding
and Protein-Protein Interactions
Alexey Strokach, Carles Corbi-Verge, Joan Teyra, and Philip M. Kim
Abstract
The function of a protein is largely determined by its three-dimensional structure and its interactions with
other proteins. Changes to a protein’s amino acid sequence can alter its function by perturbing the energy
landscapes of protein folding and binding. Many tools have been developed to predict the energetic effect
of amino acid changes, utilizing features describing the sequence of a protein, the structure of a protein, or
both. Those tools can have many applications, such as distinguishing between deleterious and benign
mutations and designing proteins and peptides with attractive properties. In this chapter, we describe how
to use one such tool, ELASPIC, to predict the effect of mutations on the stability of proteins and the
affinity between proteins, in the context of a human protein-protein interaction network. ELASPIC uses a
wide range of sequential and structural features to predict the change in the Gibbs free energy for protein
folding and protein-protein interactions. It can be used both through a web server and as a stand-alone
application. Since ELASPIC was trained using homology models and not crystal structures, it can be
applied to a much broader range of proteins than traditional methods. It can leverage precalculated
sequence alignments, homology models, and other features, in order to drastically lower the amount of
time required to evaluate individual mutations and make tractable the analysis of millions of mutations
affecting the majority of proteins in a genome.
Key words Computational biology, Structural biology, Bioinformatics, Protein stability, Mutations,
Protein engineering
1 Introduction
Proteins usually fold into specific, stable, three-dimensional structures,
which allow them to interact with other molecules and carry
out their biological function. Protein-protein interactions play a
critical role in the regulation of numerous biological processes.
The functionality of the protein interaction network is grounded
in the specificity that proteins have for each other, which arises
from the complementarity in shape and the physicochemical composition
of the protein interaction interface. Mutations can perturb
the interaction profile of a protein by changing the shape or the
composition of the protein interaction interface, leading to a loss
of existing interaction partners or a gain of new ones. Mutations can also change
the stability of a protein, causing changes in protein conformation,
solubility, or other attributes that dictate protein function.

Tobias Sikosek (ed.), Computational Methods in Protein Evolution, Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_1, © Springer Science+Business Media, LLC, part of Springer Nature 2019

Although most mutations are destabilizing and detrimentally
affect the activity of a protein, this effect is usually modest, and only
a small fraction of all possible mutations entirely abolish protein
function [1]. This allows for a neutral drift in the amino acid
sequence of a protein, which can eventually, under selective pres-
sure, evolve into new functionality [2]. An immediate consequence
of the accumulated variation in the human genome is the diversity
in responses that people can have to drugs or other environmental
stimuli. For instance, different patients with the same disease often
do not react in a homogeneous manner to the same drug. A
treatment usually is more effective in a subpopulation of patients,
and it may be completely ineffective or even detrimental to others
[3]. A better understanding of the effect that the variation in the
human genome has on protein folding and protein-protein interac-
tion would help us to predict how a patient might respond to a
specific treatment. It would also improve our ability to detect and
treat genetic diseases and to develop targeted therapies against
cancer and drug-resistant pathogens [4].
Experimental approaches for evaluating the effect of mutations
on protein folding and protein-protein interaction, such as isother-
mal titration calorimetry (ITC) [5], luminescence-based mamma-
lian interactome mapping (LUMIER) [6], phage display [7], and
deep mutational scanning [8], are laborious, time-consuming, and
expensive. Accordingly, many computational methods have been
developed to predict the thermodynamic and phenotypic conse-
quences of mutations. Those computational tools can broadly be
categorized as sequence-based tools, structure-based tools, and
tools which use both sequential and structural information in
order to make their predictions.
1.1 Sequence-Based Tools
Sequence-based tools usually rely on some form of a conservation
score, describing the frequency with which a particular nucleotide
or amino acid is found at the given position in domain-, protein-, or
genome-level alignments, in order to make their prediction
[9–14]. Due to their speed and scalability, sequence-based tools
are the de facto standard for annotating newly discovered variants.
However, they remain limited in their accuracy and the type of
information that they can provide [15]. In particular, they only
predict whether or not a particular mutation is likely to be deleteri-
ous and provide no information as to why that mutation may be
deleterious. This makes it difficult to act upon those predictions, for
example, by designing drugs that would curtail the effect of disease-
causing mutations or would take advantage of mutations found in
cancer.
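As a rough illustration of the conservation scores mentioned above, the following Python sketch computes the residue frequencies in one column of a multiple sequence alignment and converts them into a score between 0 and 1. The function names, the entropy-based normalization, and the toy alignment are assumptions made for this example; they are not taken from any of the cited tools, which use their own, more elaborate scoring schemes.

```python
import math
from collections import Counter


def column_frequencies(alignment, position):
    """Relative frequency of each residue in one alignment column (gaps ignored)."""
    residues = [seq[position] for seq in alignment if seq[position] != "-"]
    if not residues:
        raise ValueError("column contains only gaps")
    counts = Counter(residues)
    return {res: n / len(residues) for res, n in counts.items()}


def conservation(alignment, position, alphabet_size=20):
    """1 minus the normalized Shannon entropy of the column.

    Returns 1.0 for a fully conserved column and 0.0 for a column in which
    all `alphabet_size` residue types are equally frequent.
    """
    freqs = column_frequencies(alignment, position)
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    return 1.0 - entropy / math.log2(alphabet_size)


# Hypothetical toy alignment, invented for this example only.
toy_alignment = ["MKVLA", "MKILA", "MRVLG", "MKVL-"]

for pos in range(len(toy_alignment[0])):
    print(pos + 1, column_frequencies(toy_alignment, pos),
          round(conservation(toy_alignment, pos), 3))
```

In a scheme of this kind, a substitution at a highly conserved position, or one that introduces a residue rarely seen in that column, would be scored as more likely to be deleterious. As noted above, such a score says nothing about why the substitution is harmful.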
1.2 Structure-Based Tools
Structure-based tools predict the effect of mutations on protein
structure and/or function using features describing the three-dimensional
structure of the protein. They range from accurate but
computationally expensive alchemical free energy calculations, which
involve modeling the structural transition from the wild type to the
mutant protein and using different integration techniques to calcu-
late the energy of the transition [16], to quicker but more approxi-
mate techniques, which use semiempirical or statistical potentials and
assume that the backbone of the protein remains fixed [17–21]. In
theory, structure-based tools should be able to offer more insight
into the effect of missense mutations than sequence-based tools,
since the effect is directly caused by changes in protein structure
and function and not by changes in the DNA sequence. However,
since existing structure-based tools require manual setup and a crys-
tal structure of the protein being mutated, they are not being used
systematically to evaluate the effect of newly discovered mutations.
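The quantity that both classes of structure-based methods aim to estimate can be written compactly. The block below is a sketch under one common sign convention (folding free energy taken as folded minus unfolded, so that a positive value of the difference indicates a destabilizing mutation); the symbols are introduced here for illustration, and some tools and papers use the opposite sign.

```latex
\begin{align}
\Delta G_{\mathrm{fold}} &= G_{\mathrm{folded}} - G_{\mathrm{unfolded}} \\
\Delta\Delta G_{\mathrm{fold}} &= \Delta G_{\mathrm{fold}}^{\mathrm{mut}} - \Delta G_{\mathrm{fold}}^{\mathrm{wt}} \\
&= \Delta G_{\mathrm{wt}\to\mathrm{mut}}^{\mathrm{folded}} - \Delta G_{\mathrm{wt}\to\mathrm{mut}}^{\mathrm{unfolded}}
\end{align}
```

The last line is the thermodynamic cycle that alchemical calculations rely on: instead of simulating the folding process itself, they compute the free energy of transforming the wild type into the mutant once in the folded state and once in the unfolded state. The quicker, more approximate methods estimate the same difference from a scoring function evaluated on a fixed backbone.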
1.3 Combination of Sequence and Structure
Several tools have been developed that attempt to combine sequence- and structure-based information in order to make more accurate predictions about the deleteriousness [22] and the structural impact [21, 23, 24] of mutations. Those tools generally are “meta-predictors” which integrate the results of several sequence- and structure-based tools using machine learning algorithms trained on an appropriate dataset [25, 26]. Most of those tools remain limited in their coverage because only a small fraction of all proteins and protein-protein interactions have an experimentally determined structure [27]. ELASPIC, developed by Berliner et al. [23], overcomes this limitation by using homology models, instead of crystal structures, to evaluate the structural impact of mutations. ELASPIC still achieves relatively high accuracy in predicting the effect of mutations on protein stability and protein-protein interaction affinity, but it has much higher coverage, including the majority of proteins in the human proteome and hundreds of thousands of protein-protein interactions.
In this protocol, we describe how to set up and run ELASPIC
on a local machine. We describe how precalculated homology
models and other data can be downloaded and installed in order
to greatly reduce the time taken by ELASPIC to evaluate new
mutations. Finally, we show how to use ELASPIC to perform
alanine scanning of a protein-protein interaction interface and
how to evaluate the structural effect of several thousand mutations
that have been implicated in cancer.
2 Materials
This tutorial requires basic knowledge of the Linux command line environment. While we made every effort to make the installation of ELASPIC as simple as possible, the process remains reasonably involved and may take several hours. If you do not wish to make changes to the ELASPIC source code and are planning to evaluate fewer than a few thousand mutations, using the ELASPIC web server [28], available at http://elaspic.kimlab.org, is encouraged. The web server may also be used to verify the results obtained using a local installation of ELASPIC.
The source code for ELASPIC is available at https://gitlab.com/kimlab/elaspic/ and is provided under an MIT license. The documentation for ELASPIC is available at https://kimlab.gitlab.io/elaspic/. ELASPIC should work on any Linux distribution with a version of glibc 2.14 (e.g., CentOS 6 or newer, Ubuntu 12.04 or newer). At the moment, it does not work on Windows or MacOS (although see Notes 1 and 2).
ELASPIC can be run using two different “pipelines,” the database pipeline and the local pipeline, as shown in Fig. 1. The database pipeline allows us to evaluate the thermodynamic impact of mutations on a proteome-wide scale, without having to specify a structural template for each protein. This pipeline takes as input the UniProt ID of the protein being mutated and one or more mutations affecting that protein. At each decision node, the pipeline queries the database to check whether or not the required information has already been calculated. If the required data has not been calculated, the pipeline executes the appropriate code and stores the results in the database for later retrieval. The pipeline proceeds until homology models of all domains in the protein, and all domain-domain interactions involving the protein, have been calculated and the ΔΔG has been predicted for every specified mutation. The local pipeline can be used without downloading and installing a local copy of the ELASPIC databases but requires a PDB structure or template to be provided for every protein. The output from this pipeline is saved as JSON files inside the working directory, rather than being uploaded to the database, as in the case of the database pipeline. Both pipelines use the same internal libraries to perform the majority of the computation.
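The "check the database, compute only if missing, then store" behavior described for each decision node of the database pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not ELASPIC's actual code: the function get_or_compute, the dictionary standing in for the database, and the placeholder build_model step are all hypothetical.

```python
def get_or_compute(cache, key, compute):
    """Return cache[key], running compute() and storing the result only on a miss."""
    if key not in cache:
        cache[key] = compute()
    return cache[key]


calls = []  # records how many times the expensive step actually ran


def build_model():
    # Hypothetical stand-in for an expensive step, such as building a homology model.
    calls.append("model")
    return {"model": "placeholder"}


cache = {}  # stands in for the ELASPIC database
first = get_or_compute(cache, ("P22681", "model"), build_model)
second = get_or_compute(cache, ("P22681", "model"), build_model)
print(len(calls))  # the expensive step ran once; the second call hit the cache
```

This pattern is why loading precalculated data into the database shortens later runs: every key that is already present skips its compute step.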
The ELASPIC database, required by the database pipeline,
includes many external datasets, which are listed in Table 1. The
use of the external datasets is made transparent to the ELASPIC
user, who simply has to load the data from the ELASPIC download
page (http://elaspic.kimlab.org/static/download) into their local
ELASPIC database using the elaspic database load-basic or elaspic
database load-complete commands. The only exception is the
BLAST nr database, which is required by both the database pipeline
and the local pipeline and has to be downloaded separately from the
NCBI website (although see Note 4). This is described in detail in
Subheading 3.
[Figure 1: flowchart with three panels.
Database pipeline (input: UniProt ID + mutation(s)): (1) if no Provean MSA is available for the protein, run Provean to construct a multiple sequence alignment; (2) if homology models are not available for all domains in the protein, run Modeller to create them; (3) if homology models are not available for all interactions involving the protein, run Modeller to create homology models of all pairs of domains mediating those interactions; (4) if the specified mutation does not fall inside a domain for which we have a structural template, return None (ELASPIC only works for mutations that fall inside domains); (5) if the features and ΔΔG values have not been calculated for the specified mutation(s), run FoldX and other programs and internal scripts to calculate all the features required by the machine learning classifier, and run the classifier to predict a value of ΔΔG for every mutation and every domain or domain pair. Each step reads from and writes to the database (DB). On success, the predicted ΔΔG caused by the mutation is returned for all domains and domain-domain interactions.
ELASPIC internals: elaspic_sequence.py (input: FASTA file with domain sequence; output: Provean supporting set; .mutate(mutation) computes sequence-based features of a mutation), elaspic_model.py (input: FASTA file with target sequences and PDB file of the template; output: homology model and model properties; .mutate(mutation) computes features of a mutation), elaspic_predictor.py (input: DataFrame of all features, with one mutation per row, as if pulled out from the database; output: ΔΔG predictions).
Local pipeline (input: PDB [+ target sequences] + mutations): create and mutate sequence objects, create and mutate model objects, compute ΔΔG, results.]
Fig. 1 Schematic providing a general outline of ELASPIC. ELASPIC provides two different pipelines: a database
pipeline and a local pipeline. The database pipeline takes as input the UniProt ID of a protein and a mutation
and constructs homology models of the domains and domain-domain interactions involving the protein
automatically. The local pipeline takes as input the structure of a protein, or the sequence of a protein and
a structural template, and a mutation. It requires no precalculated data and can run in the absence of the
ELASPIC database. Both pipelines use the same code to perform the majority of the calculation
ELASPIC also uses many external programs, which are listed in Table 2. All external dependencies, except for FoldX, are installed automatically when ELASPIC is installed using the conda package manager. Due to licensing restrictions, FoldX has to be downloaded and installed manually, as described in Subheading 3.
Table 1
External databases that were used in the construction of the ELASPIC database

Database: UniProt [33]
Description: UniProt is a comprehensive collection of protein sequences and their annotations. ELASPIC uses UniProt names to identify all proteins. Homology models constructed by ELASPIC use the UniProt canonical protein sequence
URL: http://www.uniprot.org/

Database: Mentha [34]
Description: Mentha is a database of protein-protein interactions. Mentha combines data from several databases in the IMEx consortium and converts gene names to UniProt identifiers
URL: http://mentha.uniroma2.it/

Database: Profs [28]
Description: Profs is a database of protein domain definitions, which are obtained by integrating data from CATH and PFam (see Note 3)
URL: https://bitbucket.org/afgiraldofo/profs

Database: BLAST nr database [35]
Description: BLAST nr database includes a collection of unique protein sequences from a variety of sources, including GenPept, Swissprot, PDB, PRF, PIR, and RefSeq (see Note 4)
URL: ftp://ftp.ncbi.nlm.nih.gov/blast/db/

3 Methods
3.1 Installing ELASPIC
1. First, we should set the environment variables, which are required for installing and using ELASPIC, in our ~/.bashrc file. This way, those environment variables will be set whenever we start a new bash shell. The required environment variables, with reasonable default values, are shown below. The KEY_MODELLER environment variable should contain our Modeller license key. If we do not already have a Modeller license, we should register for a license on the Modeller website: https://salilab.org/modeller/registration.html. Modeller is free for noncommercial use.
# Add the following lines to your ~/.bashrc file
export CONDA_DIR="${HOME}/miniconda3"
export LOCAL_BIN_DIR="${HOME}/.local/bin"
export PATH="${CONDA_DIR}/bin:${LOCAL_BIN_DIR}:${PATH}"
export KEY_MODELLER="Put our modeller license here!"
export BLAST_DB_DIR="${HOME}/blast"
export ELASPIC_DB_STRING="sqlite:///${HOME}/elaspic.db"
export ELASPIC_ARCHIVE_DIR="${HOME}/elaspic"
$ source ~/.bashrc
2. Download and extract the foldx executable into a folder that has been added to the PATH environment variable. In order to download FoldX, we first need to create an account and register for an academic license at http://foldxsuite.crg.eu/academic-license-info. After registering, we will receive an email with the link to the FoldX download page. We can either download FoldX using a web browser or using wget. If we use wget, we first need to obtain a cookie that will permit us to download the FoldX archive.
$ mkdir -p "${LOCAL_BIN_DIR}" && cd "${LOCAL_BIN_DIR}"
$ wget --save-cookies cookies.txt --keep-session-cookies --delete-after \
    http://foldxsuite.crg.eu/node/732/download/{unique_token_from_email}
$ wget --load-cookies cookies.txt \
    http://foldxsuite.crg.eu/system/files/foldxLinux64.tar__0.gz
$ tar xf foldxLinux64.tar__0.gz

Table 2
External software that ELASPIC uses to construct homology models and to calculate some of the sequential and structural features

Software: MODELLER [36]
Description: MODELLER is a popular tool for generating homology models of proteins. It requires, as input, the sequence of the protein that we wish to model and the structure of a protein that will be used as a template (see Notes 5 and 6)
URL: https://salilab.org/modeller/
License: Closed source. Free for academic use only

Software: Provean [37]
Description: Provean uses a residue conservation score to predict whether a mutation is likely to be deleterious. One advantage of Provean over similar tools is its use of a small “supporting set” of diverse proteins to compute the conservation score. This supporting set can be precalculated for each protein, and evaluating a mutation takes under a second when the supporting set is available
URL: http://provean.jcvi.org/index.php
License: GPL

Software: FoldX [38]
Description: FoldX is a tool for predicting the thermodynamic impact of mutations on protein folding and protein-protein interaction. ELASPIC uses a number of structural features that are calculated by FoldX
URL: http://foldxsuite.crg.eu
License: Closed source. Free for academic use only

Software: MSMS [39]
Description: Maximal Speed Molecular Surface (MSMS) is a tool for calculating the solvent-accessible surface area of proteins. It also can be used to quickly calculate the surface mesh of a protein (see Note 7)
URL: http://mgltools.scripps.edu/packages/MSMS
License: Closed source. Free for academic use only

Software: STRIDE [40]
Description: STRIDE is a tool for predicting the secondary structure of amino acids in a protein (see Note 7)
URL: http://webclu.bio.wzw.tum.de/stride/
License: Open source. Free for academic use only

3. Download and install either Miniconda or the Anaconda Python distribution, if we do not have them installed already.
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh -b -p "${CONDA_DIR}"
4. Add the conda channels required for installing ELASPIC to our conda settings.
$ conda config --add channels conda-forge
$ conda config --append channels bioconda
$ conda config --append channels salilab
$ conda config --append channels kimlab
5. Install ELASPIC, including all its dependencies, into a new conda environment. Activate the new environment, and check that elaspic is available by running elaspic --help.
$ conda create -n elaspic 'elaspic=0.1' parallel ipython p7zip
$ source activate elaspic
$ elaspic --help
usage: elaspic [-h] {run,database,train} ...
optional arguments:
-h, --help show this help message and exit
command:
{run,database,train}
run Run ELASPIC
database Perform database maintenance tasks
train Train the ELASPIC classifiers
6. Download the BLAST nonredundant database, check downloaded files for consistency, and uncompress files into our BLAST_DB_DIR folder.
$ mkdir -p "${BLAST_DB_DIR}" && cd "${BLAST_DB_DIR}"
$ wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr*'
$ md5sum -c *.md5
$ for file in *.gz ; do echo "Uncompressing ${file}..." ; tar xf "${file}" ; done
7. (Optional) Create an ELASPIC database, which will contain information about protein domains and domain interactions, structural templates, homology models, and mutation results, and an ELASPIC archive folder, which will store the Provean supporting sets and homology model files. Keeping previously calculated supporting sets and homology models drastically speeds up the evaluation of all subsequent mutations in the same protein.
$ elaspic database --connection-string ${ELASPIC_DB_STRING} create
$ mkdir -p "${ELASPIC_ARCHIVE_DIR}"
8. (Optional) Load precalculated data into the ELASPIC database.
(Option A) If we do not wish to download precalculated Provean supporting sets and homology models, for example, if we are planning to run ELASPIC for only a handful of human proteins, we should load only the basic dataset.
$ elaspic database --connection-string ${ELASPIC_DB_STRING} load-basic \
http://elaspic.kimlab.org/static/download/latest/homo_sapiens/
(Option B) If we would like to use the Provean supporting sets and homology models that have been calculated previously for the human proteome, we should load the complete dataset into our ELASPIC database, and we should also download and extract a file containing the supporting sets and homology models into our ELASPIC archive folder.
$ elaspic database --connection-string ${ELASPIC_DB_STRING} load-complete \
http://elaspic.kimlab.org/static/download/latest/homo_sapiens/
$ cd ${ELASPIC_ARCHIVE_DIR}
$ wget http://elaspic.kimlab.org/static/download/latest/homo_sapiens/archive.7z
$ 7z x archive.7z
Steps 7 and 8 above are required if we want to use ELASPIC to evaluate the effect of mutations on homology models constructed for proteins specified using a UniProt ID (i.e., database pipeline in Fig. 1). Those steps can be skipped if we will only use ELASPIC to evaluate the effect of mutations on provided protein structures (local pipeline in Fig. 1).
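Before running either pipeline, it can help to confirm that the settings from the installation steps are in place. The sketch below is a hypothetical helper, not part of ELASPIC: it checks the environment variables named in step 1 and looks for an executable called foldx on the PATH (the executable name is taken from the wording of step 2 and is an assumption). The split into variables needed by both pipelines and variables needed only by the database pipeline follows the text above.

```python
import os
import shutil

# Variables from step 1 that both pipelines are assumed to need
REQUIRED_VARS = ["KEY_MODELLER", "BLAST_DB_DIR"]
# Variables only needed when using the database pipeline (steps 7 and 8)
DATABASE_VARS = ["ELASPIC_DB_STRING", "ELASPIC_ARCHIVE_DIR"]


def missing_settings(environ, use_database=True, which=shutil.which):
    """Return a list of human-readable problems with the current settings."""
    problems = []
    names = REQUIRED_VARS + (DATABASE_VARS if use_database else [])
    for name in names:
        if not environ.get(name):
            problems.append(f"environment variable {name} is not set")
    # Assumes the FoldX executable is named 'foldx', as in step 2
    if which("foldx") is None:
        problems.append("foldx executable not found on PATH")
    return problems


for problem in missing_settings(os.environ):
    print(problem)
```

Passing use_database=False skips the database-related variables, which matches the case where only the local pipeline will be used.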
ELASPIC can access data stored in a SQLite, MySQL, or PostgreSQL database. SQLite is recommended if ELASPIC will only be used on a single machine. MySQL or PostgreSQL is recommended if ELASPIC will be used on a cluster of machines, with a single database acting as a centralized store of information. In the steps outlined above, the database can be changed from SQLite to MySQL or PostgreSQL by changing the ELASPIC_DB_STRING environment variable from sqlite:///${HOME}/elaspic.db (which stores the data in a single SQLite database file called elaspic.db, located in our home directory) to {mysql|postgresql}://{username}:{password}@{database_ip}:{database_port}/elaspic, where the words inside curly brackets are replaced by database-specific values.
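As an illustration of filling in the connection string template above, the following sketch assembles the string from its parts and percent-encodes the credentials so that characters such as "@" or ":" in a password do not break the URL. The helper name make_db_string and the example user, password, and IP address are hypothetical.

```python
from urllib.parse import quote


def make_db_string(backend, username, password, host, port, database="elaspic"):
    """Build a {backend}://{username}:{password}@{host}:{port}/{database} string."""
    if backend not in ("mysql", "postgresql"):
        raise ValueError(f"unsupported backend: {backend}")
    # Percent-encode credentials so reserved characters survive URL parsing
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{backend}://{user}:{secret}@{host}:{port}/{database}"


# Hypothetical values, for illustration only
db_string = make_db_string("postgresql", "elaspic", "p@ss", "192.168.0.10", 5432)
print(db_string)
```

The resulting value can then be exported as ELASPIC_DB_STRING in ~/.bashrc in place of the SQLite default.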
Protein domain definitions and structural template data are available for a number of organisms from the ELASPIC downloads page: http://elaspic.kimlab.org/static/download/latest/. Note, however, that protein-protein interaction data and precalculated Provean supporting sets and homology models are only available for the human proteome. In the steps outlined above, we can change the organism for which the data is downloaded by replacing all occurrences of the string homo_sapiens with the string for the desired organism. For more information on how this data was calculated, see the supplemental material in the ELASPIC web server paper [28].
3.2 Running ELASPIC
3.2.1 Evaluating the Effect of Mutations on a Single Protein (Local Pipeline)
The first use case for ELASPIC is to predict the thermodynamic effect of mutations on a protein or a protein-protein interaction for which a crystal structure is available (local pipeline in Fig. 1). In this case, the crystal structure of the protein can be provided to ELASPIC directly, and no homology model needs to be created. In the following example, we will show how to use ELASPIC to perform alanine scanning of the dimerization interface of glutathione S-transferase.
1. Make sure that the environment variables that we set in step 1
of the ELASPIC installation are available.
$ [[ -z ${BLAST_DB_DIR} ]] && source ~/.bashrc
2. Download a structure of glutathione S-transferase epsilon, which we will be using for this example.
$ wget https://files.rcsb.org/download/3zml.pdb
3. Run ELASPIC to calculate Provean supporting sets and optimized structural models for the protein of interest. We can see a description of every argument accepted by the elaspic run command by running elaspic run --help.
$ elaspic run -p 3zml.pdb -t sequence.model
4. Run ELASPIC to evaluate the structural impact of each individual mutation. We should specify mutations using the {chain_id}_{residue_wt}{resnum}{residue_mut} format and separate different mutations using a comma or a colon.
$ elaspic run -p 3zml.pdb -m \
A_M47A:A_P51A:A_Q52A:A_H53A:A_T63A:A_I65A:A_T66A:A_E67A:A_H69A:A_I73A:A_Y74A:\
A_T77A:A_Y86A:A_P90A:A_V91A:A_Q93A:A_N97A:A_L100A:A_H101A:A_F102A:A_S104A:A_G105A:\
A_R110A:A_R112A:A_F113A:A_E116A:A_R117A:A_Y121A:A_D129A:A_R130A:A_Y133A:A_K136A:\
A_L140A:A_D143A:A_T144A -n 1 -vvv
Alternatively, we can use GNU parallel to process multiple mutations in parallel.
$ parallel --res mutation_logs --joblog mutation_logs.txt \
    elaspic run -p 3zml.pdb -m {1} -t mutation -n 1 -vvv ::: \
    A_M47A A_P51A A_Q52A A_H53A A_T63A A_I65A A_T66A A_E67A A_H69A A_I73A A_Y74A \
    A_T77A A_Y86A A_P90A A_V91A A_Q93A A_N97A A_L100A A_H101A A_F102A A_S104A A_G105A \
    A_R110A A_R112A A_F113A A_E116A A_R117A A_Y121A A_D129A A_R130A A_Y133A A_K136A \
    A_L140A A_D143A A_T144A
5. Once the above commands have finished running, we can read ELASPIC results from the .elaspic/sequence.json, .elaspic/model.json, and .elaspic/mutation_{mutation}.json files using the pandas.read_json function in Python (here {mutation} should be replaced with the mutation of interest). All other files generated by ELASPIC, including Provean supporting sets and structures containing each mutation, are also stored inside the .elaspic folder. The idxs column contains the indexes of the chains that make up the interface for which the ΔΔG was calculated; it is NaN if the ΔΔG is calculated for protein folding. Note that ELASPIC renumbers the PDB resnum field to start from 1, and therefore the mutations listed in the JSON files have a different residue number than what was provided as input.
$ ipython
>>> import pandas as pd
>>> mutations = ['A_M47A', 'A_P51A', 'A_Q52A', 'A_H53A', 'A_T63A', 'A_I65A',
...     'A_T66A', 'A_E67A', 'A_H69A', 'A_I73A', 'A_Y74A', 'A_T77A', 'A_Y86A', 'A_P90A',
...     'A_V91A', 'A_Q93A', 'A_N97A', 'A_L100A', 'A_H101A', 'A_F102A', 'A_S104A',
...     'A_G105A', 'A_R110A', 'A_R112A', 'A_F113A', 'A_E116A', 'A_R117A', 'A_Y121A',
...     'A_D129A', 'A_R130A', 'A_Y133A', 'A_K136A', 'A_L140A', 'A_D143A', 'A_T144A']
>>> results = []
>>> for mutation in mutations:
...     results.append(
...         pd.read_json('.elaspic/mutation_{}.json'.format(mutation)))
>>> mutation_df = pd.concat(results)
>>> mutation_df[['chain_modeller', 'mutation', 'ddg', 'idxs']].head()
chain_modeller mutation ddg idxs
0 A M44A 0.862108 NaN
1 A M44A 1.475214 [0, 1]
0 A P48A 0.936352 NaN
1 A P48A 2.066441 [0, 1]
0 A Q49A 0.353452 NaN
6. We can compare the results that we obtained by running ELASPIC locally with the results that are obtained using the ELASPIC web server by going to http://elaspic.kimlab.org/result/3zml00/.
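The mutation format used in step 4, {chain_id}_{residue_wt}{resnum}{residue_mut}, is easy to generate and validate programmatically. The sketch below is a hypothetical helper, not part of ELASPIC: it parses a single mutation string with a regular expression and builds a colon-separated alanine-scanning list from (wild-type residue, residue number) pairs. Skipping positions that are already alanine is an assumption made for this example.

```python
import re

# {chain_id}_{residue_wt}{resnum}{residue_mut}, e.g. A_M47A
MUTATION_RE = re.compile(
    r"^(?P<chain>[^_]+)_(?P<wt>[A-Z])(?P<resnum>-?\d+)(?P<mut>[A-Z])$"
)


def parse_mutation(text):
    """Split a mutation string into (chain_id, residue_wt, resnum, residue_mut)."""
    match = MUTATION_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid mutation: {text!r}")
    return match["chain"], match["wt"], int(match["resnum"]), match["mut"]


def alanine_scan(chain_id, residues):
    """Build a colon-separated alanine-scanning list for the given residues.

    `residues` is an iterable of (wild-type residue, residue number) pairs;
    positions that are already alanine are skipped.
    """
    return ":".join(
        f"{chain_id}_{wt}{resnum}A" for wt, resnum in residues if wt != "A"
    )


scan = alanine_scan("A", [("M", 47), ("P", 51), ("Q", 52)])
print(scan)
print(parse_mutation("A_M47A"))
```

The string returned by alanine_scan can be passed directly to the -m option of elaspic run. Residue numbers here refer to the input PDB file; as noted in step 5, the JSON output uses renumbered positions.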
3.2.2 Evaluating the Effect of Mutations Proteome Wide (Database Pipeline)
A second use case for ELASPIC is to evaluate the effect of mutations in a large number of proteins and protein-protein interactions for which a crystal structure may not be available (database pipeline in Fig. 1). In the following example, we will show how to use ELASPIC to predict the effect of missense mutations found in the OncoKB database [29] on protein stability and protein-protein interaction affinity. OncoKB is a database of mutations in known cancer genes with well-established clinical ramifications.
1. Make sure that the environment variables that we set in step 1
of the ELASPIC installation are available.
$ [[ -z ${BLAST_DB_DIR} || -z ${ELASPIC_DB_STRING} || -z ${ELASPIC_ARCHIVE_DIR} ]] && source ~/.bashrc
2. Download the oncokb.tsv file from the ELASPIC downloads page. This file is derived from the allAnnotatedVariants.txt file obtained from OncoKB [29]. We processed this file to convert HGNC gene identifiers to UniProt accession numbers and to exclude non-missense mutations.
$ wget http://elaspic.kimlab.org/static/download/protocol/oncokb.tsv
3. Process oncokb.tsv to create a file containing only unique UniProt protein identifiers (uniprot_ids.txt) and a file containing only UniProt identifiers and mutations (uniprot_ids_and_mutations.txt).
$ tail -n +2 oncokb.tsv | awk '{print $2}' | sort -u > uniprot_ids.txt
$ tail -n +2 oncokb.tsv | awk '{print $2 "\t" $3}' | sort -u > uniprot_ids_and_mutations.txt
4. Run ELASPIC to calculate Provean supporting sets and homology models for each mutated protein. We can see a description of every argument accepted by the elaspic run command by running elaspic run --help. Note that calculating Provean supporting sets and homology models is not necessary if we downloaded precalculated data for homo_sapiens from the ELASPIC downloads page (see step 8 in Installing ELASPIC).
$ while read uniprot_id ; do
    echo $uniprot_id
    elaspic run -u $uniprot_id -t sequence.model -vvv
  done < uniprot_ids.txt
Alternatively, we can use GNU parallel to process multiple proteins in parallel.
$ parallel --res log_sequence_model --joblog log_sequence_model.txt \
    elaspic run -u {1} -t sequence.model -vvv :::: uniprot_ids.txt
5. Run ELASPIC to evaluate each individual mutation.
$ while read uniprot_id mutation ; do
    echo $uniprot_id $mutation
    elaspic run -u $uniprot_id -m $mutation -t mutation -vvv
  done < uniprot_ids_and_mutations.txt
Alternatively, we can use GNU parallel to process multiple mutations in parallel.
$ parallel --colsep '\t' --res log_mutation --joblog log_mutation.txt \
    elaspic run -u {1} -m {2} -t mutation -vvv :::: uniprot_ids_and_mutations.txt
6. The results are stored in the uniprot_domain_mutation and uniprot_domain_pair_mutation tables inside the ELASPIC database, which is specified with the ELASPIC_DB_STRING environment variable. We can extract all calculated mutations using the following script.
$ ipython
>>> import os
>>> import pandas as pd
>>> import sqlalchemy as sa
>>> engine = sa.create_engine(os.environ['ELASPIC_DB_STRING'])
>>> # Show all core mutations
>>> core_df = pd.read_sql_query("""\
SELECT uniprot_id, mutation, ddg core_ddg
FROM uniprot_domain_mutation
""", engine)
>>> core_df.head()
uniprot_id mutation core_ddg
0 P22681 C396R 0.282069
1 P22681 D390Y -0.848413
2 P22681 H398Q 1.543390
3 P22681 H398Y 2.162050
4 P22681 K382E -0.856646
>>> # Show all interface mutations
>>> interface_df = pd.read_sql_query("""\
SELECT uniprot_id, mutation, CASE WHEN uniprot_id = uniprot_id_1 THEN
uniprot_id_2 ELSE uniprot_id_1 END partner_uniprot_id, ddg interface_ddg

FROM uniprot_domain_pair udp
JOIN uniprot_domain_pair_mutation udpm USING (uniprot_domain_pair_id);
""", engine)
>>> interface_df.head()
uniprot_id mutation partner_uniprot_id interface_ddg
0 P22681 H398Q P60604 1.718820
1 P22681 H398Y P60604 0.992272
2 P22681 K382E P62253 0.496983
3 P22681 K382E P60604 0.169106
4 P22681 K382E P51668 1.109610
7. We can compare the results that we obtained by running ELASPIC locally with the results that are obtained using the ELASPIC web server by going to http://elaspic.kimlab.org/result/oncokb/.
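The preprocessing in step 3 of this protocol (extracting unique UniProt identifiers and unique identifier/mutation pairs from oncokb.tsv) can also be done in Python. The sketch below mirrors the awk commands under the same assumptions they make: a header row, the UniProt identifier in the second column, and the mutation in the third. The column names and the first-column values in the example text are hypothetical.

```python
import csv
import io


def unique_ids_and_mutations(tsv_text):
    """Return (sorted unique UniProt IDs, sorted unique (ID, mutation) pairs).

    Assumes a header row, with the UniProt ID in column 2 and the mutation
    in column 3, matching the awk commands in step 3.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row, like `tail -n +2`
    pairs = sorted({(row[1], row[2]) for row in reader if len(row) >= 3})
    ids = sorted({uniprot_id for uniprot_id, _ in pairs})
    return ids, pairs


# Hypothetical file content, for illustration only
example = (
    "gene\tuniprot_id\tmutation\n"
    "gene_x\tP22681\tC396R\n"
    "gene_x\tP22681\tH398Q\n"
    "gene_x\tP22681\tC396R\n"
)
ids, pairs = unique_ids_and_mutations(example)
print(ids)
print(pairs)
```

To process the real file, the text can be read with open("oncokb.tsv").read() and the two lists written out one entry per line, with a tab between the identifier and the mutation.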
4 Notes
1. It is likely that ELASPIC would work on Windows 10 subsystem for Linux (which is based on Ubuntu 14.04), although this has not been tested.
2. ELASPIC should work on MacOS if the external dependencies, such as MSMS and Stride, are recompiled on the MacOS platform.
3. Profs domain definitions are no longer being updated and may be replaced with Gene3D domain definitions in a future release of ELASPIC.
4. A Provean supporting set includes a FASTA file with the sequence of every protein in that supporting set. However, due to a bug in Provean (up to, and including, version 1.1.5), even if a previously calculated supporting set is available, Provean still requires the BLAST nr database to obtain the sequences of the proteins in that supporting set. If this bug were fixed, downloading the BLAST nr database would no longer be required when the Provean supporting sets are available for all proteins that are mutated.
5. If MODELLER finds a gap in the alignment of the protein sequence to the structural template, it will fill this gap by constructing a loop. However, the loop modeling capabilities of MODELLER are limited. If more plausible loops are desired, Rosetta loop modeling [30] may be used instead.
6. I-TASSER can be used instead of Modeller to construct homology models. According to recent CASP competitions [31], I-TASSER is able to construct homology models with the best accuracy relative to the reference crystal structures.
7. MSMS and Stride can be replaced with MDTraj [32], which is distributed under a more permissive LGPL license, and also can calculate secondary structure and solvent-accessible surface area.
8. ELASPIC has a comprehensive test suite, which is run using GitLab continuous integration for every commit that is pushed to the repository (see https://gitlab.com/kimlab/elaspic/pipelines). The integration tests test-standalone and test-database are similar to the protocols described above (although they are run on a test database that is much smaller than the full ELASPIC database).
Acknowledgments
Funding: P.M.K. acknowledges support from an NSERC Discovery Grant (RGPIN-2017-064).
References
1. Rockah-Shmuel L, Tóth-Petróczy Á, Tawfik DS (2015) Systematic mapping of protein mutational space by prolonged drift reveals the deleterious effects of seemingly neutral mutations. PLoS Comput Biol 11:e1004421
2. Huber CD, Kim BY, Marsden CD, Lohmueller KE (2017) Determining the factors driving selective effects of new nonsynonymous mutations. Proc Natl Acad Sci U S A 114:4465–4470
3. Brender JR, Zhang Y (2015) Predicting the effect of mutations on protein-protein binding interactions through structure-based interface profiles. PLoS Comput Biol 11:e1004494
4. Albanaz ATS, Rodrigues CHM, Pires DEV, Ascher DB (2017) Combating mutations in genetic disease and drug resistance: understanding molecular mechanisms to guide drug design. Expert Opin Drug Discov 12:553–563
5. Jelesarov I, Bosshard HR (1999) Isothermal titration calorimetry and differential scanning calorimetry as complementary tools to investigate the energetics of biomolecular recognition. J Mol Recognit 12:3–18
6. Sahni N, Yi S, Taipale M et al (2015) Widespread macromolecular interaction perturbations in human genetic disorders. Cell 161:647–660
7. Sun MGF, Seo M-H, Nim S et al (2016) Protein engineering by highly parallel screening of computationally designed variants. Sci Adv 2:e1600692
8. Weile J, Sun S, Cote AG et al (2017) Expanding the atlas of functional missense variation for human genes. BioRxiv 166595
9. Ng PC, Henikoff S (2003) SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res 31:3812–3814
10. Adzhubei I, Jordan DM, Sunyaev SR (2013) Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet Chapter 7: Unit 7.20
11. Li B, Krishnan VG, Mort ME et al (2009) Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics 25:2744–2750
12. Kircher M, Witten DM, Jain P et al (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46:310–315
13. Shihab HA, Gough J, Mort M et al (2014) Ranking non-synonymous single nucleotide polymorphisms based on disease concepts. Hum Genomics 8:11
14. Choi Y, Sims GE, Murphy S et al (2012) Predicting the functional effect of amino acid substitutions and indels. PLoS One 7:e46688
15. Dorfman R, Nalpathamkalam T, Taylor C et al (2010) Do common in silico tools predict the clinical consequences of amino-acid substitutions in the CFTR gene? Clin Genet 77:464–473
16. Shirts M, Mobley D (2013) An introduction to best practices in free energy calculations. In: Monticelli L, Salonen E (eds) Biomolecular simulations, Methods in molecular biology. Humana Press, Totowa, NJ, pp 271–311
17. Benedix A, Becker CM, de Groot BL et al (2009) Predicting free energy changes using structural ensembles. Nat Methods 6:3–4
18. Pires DEV, Ascher DB, Blundell TL (2014) mCSM: predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics 30:335–342
19. Laimer J, Hofer H, Fritz M et al (2015) MAESTRO - multi agent stability prediction upon point mutations. BMC Bioinformatics 16:116
20. Petukh M, Li M, Alexov E (2015) Predicting binding free energy change caused by point mutations with knowledge-modified MM/PBSA method. PLoS Comput Biol 11:e1004276
21. Dehouck Y, Grosfils A, Folch B et al (2009) Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics 25:2537–2543
22. Baugh EH, Simmons-Edler R, Müller CL et al (2016) Robust classification of protein variation using structural modelling and large-scale data integration. Nucleic Acids Res 44:2501–2513
23. Berliner N, Teyra J, Çolak R et al (2014) Combining structural modeling with ensemble machine learning to accurately predict protein fold stability and binding affinity effects upon mutation. PLoS One 9:e107353
24. Li M, Simonetti FL, Goncearenco A, Panchenko AR (2016) MutaBind estimates and interprets the effects of sequence variants on protein-protein interactions. Nucleic Acids Res 44:W494–W501
25. Kumar MDS, Bava KA, Gromiha MM et al (2006) ProTherm and ProNIT: thermodynamic databases for proteins and protein–nucleic acid interactions. Nucleic Acids Res 34:D204–D206
26. Moal IH, Fernández-Recio J (2012) SKEMPI: a structural kinetic and energetic database of mutant protein interactions and its use in empirical models. Bioinformatics 28:2600–2607
27. Rose PW, Prlić A, Altunkaya A et al (2017) The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res 45:D271–D281
28. Witvliet DK, Strokach A, Giraldo-Forero AF et al (2016) ELASPIC web-server: proteome-wide structure-based prediction of mutation effects on protein stability and binding affinity. Bioinformatics 32:1589–1591
29. Chakravarty D, Gao J, Phillips SM et al (2017) OncoKB: a precision oncology knowledge base. JCO Precis Oncol 2017. https://doi.org/10.1200/PO.17.00011
30. Das R, Baker D (2008) Macromolecular modeling with rosetta. Annu Rev Biochem 77:363–382
31. Moult J, Fidelis K, Kryshtafovych A et al (2014) Critical assessment of methods of protein structure prediction (CASP)--round x. Proteins 82(Suppl 2):1–6
32. McGibbon RT, Beauchamp KA, Harrigan MP et al (2015) MDTraj: a modern open library for the analysis of molecular dynamics trajectories. Biophys J 109:1528–1532
33. Consortium TU (2015) UniProt: a hub for protein information. Nucleic Acids Res 43:D204–D212
34. Calderone A, Castagnoli L, Cesareni G (2013) mentha: a resource for browsing integrated protein-interaction networks. Nat Methods 10:690–691
35. McGinnis S, Madden TL (2004) BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 32:W20–W25
36. Webb B, Sali A (2016) Comparative protein structure modeling using MODELLER. Curr Protoc Bioinformatics 54:5.6.1–5.6.37
37. Choi Y (2012) A fast computation of pairwise sequence alignment scores between a protein and a set of single-locus variants of another protein. In: Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine - BCB ’12. ACM, New York, NY.
38. Schymkowitz J, Borg J, Stricher F et al (2005) The FoldX web server: an online force field. Nucleic Acids Res 33:W382–W388
39. Sanner MF, Olson AJ, Spehner J (1996) Reduced surface: an efficient way to compute molecular surfaces. Biopolymers 38:305–320
40. Heinig M, Frishman D (2004) STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res 32:W500–W502
Chapter 2
Accurate Calculation of Free Energy Changes upon Amino Acid Mutation
Matteo Aldeghi, Bert L. de Groot, and Vytautas Gapsys
Abstract
Molecular dynamics based free energy calculations allow for a robust and accurate evaluation of free energy
changes upon amino acid mutation in proteins. In this chapter we cover the basic theoretical concepts
important for the use of calculations utilizing the non-equilibrium alchemical switching methodology. We
further provide a detailed step-by-step protocol for estimating the effect of a single amino acid mutation on
protein thermostability. In addition, the potential caveats and solutions to some frequently encountered
issues concerning the non-equilibrium alchemical free energy calculations are discussed. The protocol
comprises details for the hybrid structure/topology generation required for alchemical transitions, equilib-
rium simulation setup, and description of the fast non-equilibrium switching. Subsequently, the analysis of
the obtained results is described. The steps in the protocol are complemented with an illustrative practical
application: a destabilizing mutation in the Trp cage mini protein. The concepts that are described are
generally applicable. The shown example makes use of the pmx software package for the free energy
calculations using Gromacs as a molecular dynamics engine. Finally, we discuss how the current protocol
can readily be adapted to carry out charge-changing or multiple mutations at once, as well as large-scale
mutational scans.
Key words Molecular dynamics, free energy calculations, alchemistry, amino acid mutation, pmx,
hybrid structure, hybrid topology, non-equilibrium transitions
1 Introduction
Due to the central role of the free energy in thermodynamics and

kinetics, the accurate prediction of free energy changes upon amino
acid mutation is one of the central goals in computer-aided molec-
ular design, with potential applications ranging from the engineer-
ing of thermostable proteins [1] to that of biosensors [2, 3],
sequestrants [4], and protein–protein interactions [5–7]. Predicting
mutation effects also helps to explain the causes of drug resistance
[8, 9]. Engineered stable proteins with high affinity and specificity
toward their binding targets may also serve as biopharmaceuticals
[10, 11]. Accurate and robust estimation of the free energy differences
between protein sequence variants is thus crucial to the successful
design of proteins with the desired thermodynamic features.
Electronic supplementary material: The online version of this chapter
(https://doi.org/10.1007/978-1-4939-8736-8_2) contains supplementary
material, which is available to authorized users.
Different approaches have thus been developed that can return
an estimate of free energy changes that relate to the different
stabilities or binding affinities of wild-type and mutant proteins.
These include fast scoring methods [12–16], implicit-solvent
approaches based on the post-processing of molecular dynamics
(MD) simulations [17–19], and the computationally more expen-
sive but theoretically rigorous (from a statistical mechanics view-
point) alchemical free energy methods [1, 20]. In this chapter, we
focus on the latter category of calculations, which are based on
all-atom computer simulations that correctly sample the Boltz-
mann distribution of microstates and inherently take into account
entropic and discrete solvent effects.
In alchemical free energy calculations, an amino acid can be
transformed into another one via a non-physical path, hence the
name that is reminiscent of the ancient practice that aimed at the
transmutation of lead into gold. The amino acid transformation can
be carried out reversibly, in what are referred to as equilibrium free
energy calculations, or irreversibly, in non-equilibrium calculations
[21]. In both cases, the amount of work needed for the transfor-
mation and free energy difference between the initial and final states
can be recovered. However, the setup of the calculations differs. In
this chapter, we discuss non-equilibrium approaches that carry out
this transformation irreversibly and describe protocols that can be
used for the accurate estimation of free energy changes upon amino
acid mutation. In the text, we use the prediction of protein stability
changes upon an amino acid mutation as an example application.
The methodology and protocol presented here are of generic char-
acter and can be applied to study other biophysical processes,
assuming a suitable thermodynamic cycle can be built, e.g., changes
in protein–protein, protein–DNA, or protein–ligand binding
affinities.
In this chapter, we first provide some background concepts that
are at the foundation of the non-equilibrium alchemical free energy
method; for a more detailed description we give references to more
specialized literature sources. Further, we concentrate on the
description of the practical steps involved in preparing and subse-
quently carrying out the free energy calculations following a gen-
eral protocol. As an example, we use a Trp cage mini protein [22]
that provides a real case on which we illustrate setting up and
running alchemical free energy calculations of protein mutation.
We assume the reader is familiar with the general principles of
molecular dynamics simulations. Throughout this chapter, we
discuss the potential caveats and solutions for some of the fre-
quently encountered issues. In the last section of the chapter, we
describe how the protocol can be easily modified and expanded to
perform large-scale mutational scans or to calculate other free
energy changes of interest, such as changes in protein–protein or
protein–ligand affinities upon protein mutation. Finally, in the
Notes section, we provide a few technical remarks that may prove
helpful when setting up alchemical free energy calculations using
Gromacs 2016 [23] and the pmx Python library with its specialized
set of scripts [24].
2 Theory
In this section, we briefly review some of the central concepts that

allow the estimation of free energy differences from physics-based
computer simulations, like Monte Carlo or molecular dynamics
(MD) simulations. We place particular focus on the theoretical
foundations of non-equilibrium work (NEW) calculations and
how they can be used to estimate free energy differences along
alchemical (i.e., non-physical) paths. The interested reader can
find a broader appraisal of theoretical aspects, also including equi-
librium free energy calculations and geometrical transformations, in
the numerous excellent reviews that have been written on the
subject [21, 25–29], as well as in the publications by Jarzynski
[30, 31], Crooks [32–34], and Hummer [35–37].
2.1 Definition of Free Energy and Irreversible Work
The free energy surface of a system determines its thermodynamic
and kinetic properties and, as such, it provides access to understanding
biophysical processes, including protein folding, ligand binding,
protein–protein association, etc. For instance, a polypeptide chain in
solution may be found in many disordered conformations, or in ordered
conformations with well-defined secondary and
tertiary structure. We can define the set of disordered conforma-
tions as the unfolded state of the system (state A), and the set of
ordered conformations as the folded state (state B). It is rarely
possible to sample the whole phase space of a protein, which
would require observing all the folded and unfolded conformations
multiple times. However, in practice free energy differences rather
than free energies are typically of interest. The difference between
the free energies of states A and B alone gives the relative equilibrium
probability of finding the protein in its unfolded form with respect to
the folded form; i.e., the free energy difference ΔG is proportional to
the logarithm of the ratio of the probabilities of finding the system in
state A or B:
$$\frac{p_A}{p_B} = \frac{e^{-\beta G_A}}{e^{-\beta G_B}} = e^{-\beta (G_A - G_B)} \tag{1}$$
$$\Delta G = G_A - G_B = -k_B T \ln \frac{p_A}{p_B} \tag{2}$$
where G is the free energy of the whole phase space of the system for
an ensemble with a fixed number of particles, constant pressure, and
constant temperature (T), i.e., isothermal–isobaric conditions. $G_A$ is
the free energy of the unfolded state, $G_B$ is the free energy of the
folded state, and $\beta = 1/(k_B T)$, where $k_B$ is the Boltzmann
constant and T is the absolute temperature.
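As a quick numerical illustration of Eqs. 1 and 2, the following minimal sketch converts a free energy difference into a probability ratio and back. The function names, the use of kJ/mol (so that the molar gas constant takes the place of the Boltzmann constant), and the example value of 10 kJ/mol at 298 K are assumptions made for illustration only.

```python
import math

# Molar gas constant in kJ/(mol K); it replaces k_B when energies are per mole.
R_KJ_PER_MOL_K = 0.0083144626


def population_ratio(delta_g, temperature=298.0):
    """Return p_A / p_B for delta_g = G_A - G_B in kJ/mol (Eq. 1)."""
    beta = 1.0 / (R_KJ_PER_MOL_K * temperature)
    return math.exp(-beta * delta_g)


def free_energy_difference(p_a, p_b, temperature=298.0):
    """Return G_A - G_B in kJ/mol from the two state probabilities (Eq. 2)."""
    return -R_KJ_PER_MOL_K * temperature * math.log(p_a / p_b)


# Illustrative value only: state A lies 10 kJ/mol above state B.
ratio = population_ratio(10.0)
print(f"p_A/p_B = {ratio:.4f}")
print(f"recovered dG = {free_energy_difference(ratio, 1.0):.3f} kJ/mol")
```

Because of the exponential relation, a difference of only a few kJ/mol already corresponds to a strongly shifted population ratio.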
This free energy difference also determines the maximum
amount of work that can be extracted from the closed system
during a thermodynamic process, which can only be achieved in
the limit of reversibility. During a reversible process, the system is
always in thermodynamic equilibrium, which implies that only
infinitesimal changes are applied to it and the transformation is
infinitely slow. However, for any finite time interval τ, the system
will be driven out of equilibrium, resulting in heat dissipation and
hysteresis effects, so that the process will be irreversible. In fact, in
accordance with the second law of thermodynamics, the work done
during a process is on average equal to or larger than the free energy
difference between the initial and final state, the excess being
dissipative work:
$$\langle W(\tau) \rangle \geq \Delta G \tag{3}$$
The equality holds only in the limiting case of a reversible
process (τ → ∞), whereas for finite τ, the difference between
⟨W(τ)⟩ and ΔG is caused by dissipative work and its magnitude will
also depend on the chosen thermodynamic path.
If we use a parameter λ to drive a non-equilibrium process
along a certain path, such that the process is started at λ = 0 and
it is concluded at λ = 1, with λ being constantly modified at each
time step, one can calculate the work performed on the system by
integrating the energetic cost required to modify it:
$$W(\tau) = \int_{\lambda=0}^{\lambda=1} \frac{\partial H(x, v, \lambda)}{\partial \lambda} \, d\lambda \tag{4}$$
where H is the Hamiltonian of the system, which depends on the
phase space coordinates x and velocities v of the system and the
coupling parameter λ.
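In practice the integral in Eq. 4 is evaluated numerically from the instantaneous ∂H/∂λ values recorded along a transition. The following minimal sketch uses the trapezoidal rule; the function name and the toy ∂H/∂λ profile are hypothetical and only serve to show the bookkeeping, not the output format of any particular MD engine.

```python
def work_from_dhdl(dhdl, lambdas):
    """Integrate instantaneous dH/dlambda values over lambda (Eq. 4).

    dhdl and lambdas are equally long sequences recorded along one
    non-equilibrium transition; the trapezoidal rule is used.
    """
    if len(dhdl) != len(lambdas) or len(dhdl) < 2:
        raise ValueError("need two equally long sequences with >= 2 points")
    work = 0.0
    for k in range(1, len(lambdas)):
        mean_slope = 0.5 * (dhdl[k] + dhdl[k - 1])
        work += mean_slope * (lambdas[k] - lambdas[k - 1])
    return work


# Toy profile (not simulation data): dH/dlambda = 2 * lambda on 101 points.
lambdas = [k / 100 for k in range(101)]
dhdl = [2.0 * lam for lam in lambdas]
print(f"W = {work_from_dhdl(dhdl, lambdas):.6f}")
```

Running the same function on a λ series that goes from 1 to 0 yields the work of the reverse transition, with the sign following from the negative λ increments.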
2.2 Estimating Free Energy Differences from Non-equilibrium Simulations
From the considerations above, it is possible to derive estimators
that allow calculating free energy differences from equilibrium and
non-equilibrium simulations. Both Zwanzig’s formula [38],
which lies at the basis of free energy perturbation (FEP)
approaches, and thermodynamic integration (TI) [39] make use
of ensemble averages obtained from equilibrium simulations for the
estimation of free energy differences. More recently, Jarzynski has
shown how one can derive an identity from the inequality in Eq. 3,
such that a free energy difference can also be obtained from an
ensemble of non-equilibrium simulations in which the system is
driven irreversibly from one state to another [30, 31]. In fact, it is
possible to show that both FEP and TI are limiting cases of Jarzynski’s
equality, in which the non-equilibrium transformation is performed
instantaneously (infinitely fast: τ → 0) or reversibly (infinitely
slowly: τ → ∞), respectively [21]. The Crooks Fluctuation Theorem
(CFT) [32–34] has further generalized Jarzynski’s equality
by relating the equilibrium free energy difference to the ratio of
non-equilibrium work distributions collected by performing the
process in the forward and reverse directions. In the following, we
focus on non-equilibrium work (NEW) approaches. More specifically,
we review the free energy estimators based on Jarzynski’s
equality and on the Crooks Fluctuation Theorem (Crooks Gaussian
Intersection and Bennett’s Acceptance Ratio).
2.2.1 Jarzynski’s Equality
The equality derived by Jarzynski in 1997 [30, 40] relates the
uni-directional non-equilibrium work average to the equilibrium
free energy difference:
$$\langle e^{-\beta W(\tau)} \rangle = e^{-\beta \Delta G} \tag{5}$$
The work W depends on the chosen path connecting the initial
(λ = 0) and final (λ = 1) states. The parameter λ controls the time
evolution of a system with a time-dependent Hamiltonian. The
average on the left-hand side of the equation is an ensemble average
over both equilibrium initial conditions and non-equilibrium
transformations. In fact, the above equality requires the non-equilibrium
transitions to be started from an equilibrium ensemble; on the
other hand, there is no such requirement for the final state of the
system at the end of the transition [21, 30]. The non-equilibrium
trajectories are then weighted with the Boltzmann factor of the
external work done on the system. The work W can be calculated
from Eq. 4 by numerical integration; note that instantaneous
∂H/∂λ values, rather than ensemble averages (as in TI), are
evaluated. In the limit of an infinitely fast (τ → 0) or slow (τ → ∞)
transformation, Eq. 5 reduces to the Zwanzig equation and TI,
respectively [21]. In fact, if the system is brought from λ = 0 to
λ = 1 instantaneously, its configurations at both end states are the same
and W simply corresponds to the change in Hamiltonian (which,
for transformations that conserve the kinetic energy of the system,
corresponds to the change in potential energy). On the other hand,
for an infinitely slow transformation, the system is always in
equilibrium so that ⟨W⟩ = ΔG.
From Eq. 5 one can directly estimate the free energy difference
as follows, with N being the number of non-equilibrium trajectories
sampled:
$$\widehat{\Delta G} = -k_B T \ln \left[ \frac{1}{N} \sum_{i=1}^{N} e^{-\beta W_i} \right] \tag{6}$$
However, in practice, this exponential estimator is affected by
statistical and systematic errors. In fact, due to the exponential
weight, the average will mostly depend on values at the low-work tail
of the work distribution. This means that rare events where little work
is dissipated will dominate the estimate; consequently, the free energy
will converge slowly to the true value given that rare events are most
likely poorly sampled. Furthermore, it has been shown that this
estimator is biased [41, 42], i.e., it introduces a systematic error in
the free energy estimate for any finite number of trajectories N.
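The exponential average of Eq. 6 can be sketched in a few lines. The function name, the kJ/mol units, and the default temperature below are assumptions; the shift by the largest exponent is a standard log-sum-exp device that we add to avoid numerical overflow and is not part of the estimator itself.

```python
import math

KB = 0.0083144626  # molar gas constant in kJ/(mol K), used in place of k_B


def jarzynski_estimate(work, temperature=298.0):
    """Exponential-average free energy estimate from work values (Eq. 6)."""
    beta = 1.0 / (KB * temperature)
    exponents = [-beta * w for w in work]
    # Shift by the largest exponent so that exp() never overflows.
    shift = max(exponents)
    mean_exp = sum(math.exp(x - shift) for x in exponents) / len(exponents)
    return -(shift + math.log(mean_exp)) / beta


# Toy work values in kJ/mol (not simulation data).
toy_work = [12.0, 9.5, 15.2, 8.1, 11.3]
print(f"mean work     = {sum(toy_work) / len(toy_work):.2f}")
print(f"Jarzynski dG  = {jarzynski_estimate(toy_work):.2f}")
```

The estimate is always at most the mean work and at least the smallest sampled work value, which illustrates how strongly it is controlled by the low-work tail of the distribution.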
2.2.2 Crooks Fluctuation Theorem
Jarzynski’s equality considers the transitions in one direction only,
e.g., from λ = 0 to λ = 1. The Crooks Fluctuation Theorem (CFT)
takes into account the work values obtained from performing the
process in both forward (λ: 0 → 1) and reverse (λ: 1 → 0) directions.
According to the CFT, the forward and reverse work distributions
relate to the free energy difference as follows:
$$\frac{P_f(W)}{P_r(-W)} = e^{\beta (W - \Delta G)} \tag{7}$$
where $P_f(W)$ and $P_r(-W)$ are the normalized probability
distributions of work values obtained from the forward and reverse
transformation paths. Note that Jarzynski’s equality can be derived
from Eq. 7 by integration over W [21]. With enough overlap
between the forward and reverse work distributions, the free energy
difference can be estimated directly from Eq. 7 as follows:
$$\widehat{\Delta G} = W - k_B T \ln \frac{P_f(W)}{P_r(-W)} \tag{8}$$
with $\widehat{\Delta G} = W$ at the intersection of the work distributions.
However, this approach has known limitations: firstly, for certain paths it
might be difficult to obtain substantial overlap between $P_f(W)$ and
$P_r(-W)$. Secondly, mainly the tails of the distributions, which are
defined by rare events of low work dissipation, will contribute to
the free energy difference.
To partly alleviate these problems, one can approximate the
work distributions with an analytical function [43]. One such strategy,
which leads to accurate free energy estimates, was proposed by
Goette and Grubmüller [29]. By using a Gaussian approximation, a
Crooks Gaussian Intersection (CGI) estimator was derived:
$$\widehat{\Delta G} = \frac{\dfrac{\langle W_f \rangle}{\sigma_f^2} - \dfrac{\langle -W_r \rangle}{\sigma_r^2} \pm \sqrt{\dfrac{1}{\sigma_f^2 \sigma_r^2} \left( \langle W_f \rangle + \langle W_r \rangle \right)^2 + 2 \left( \dfrac{1}{\sigma_f^2} - \dfrac{1}{\sigma_r^2} \right) \ln \dfrac{\sigma_r}{\sigma_f}}}{\dfrac{1}{\sigma_f^2} - \dfrac{1}{\sigma_r^2}} \tag{9}$$
where $\langle W_f \rangle$ and $\langle W_r \rangle$ are the mean work
values of the forward and reverse transitions, and $\sigma_f^2$ and
$\sigma_r^2$ are the variances of the forward and reverse work
distributions. Note that the accuracy of this estimator relies on the
Gaussian approximation. Thus, it is advised to check this assumption
by, for instance, using a statistical test like the Kolmogorov–Smirnov
test [44]. The CGI estimator does not have an analytical error
estimate, but the error can be estimated by the bootstrap
approach [45].
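The intersection of two Gaussian work distributions and a bootstrap error estimate can be sketched as follows. This is a minimal illustration rather than the pmx implementation: the function names are hypothetical, the reverse work values are assumed to be passed as measured for the λ: 1 → 0 transitions (their sign is inverted inside the function), and the rule of keeping the root closest to the midpoint between the two means is a choice made for this sketch.

```python
import math
import random
import statistics


def cgi_estimate(w_forward, w_reverse):
    """Intersection of Gaussians fitted to W_f and to the sign-inverted W_r."""
    wr_inv = [-w for w in w_reverse]
    mf, mr = statistics.mean(w_forward), statistics.mean(wr_inv)
    sf, sr = statistics.stdev(w_forward), statistics.stdev(wr_inv)
    midpoint = 0.5 * (mf + mr)
    if math.isclose(sf, sr, rel_tol=1e-6):
        # Equal widths: the two Gaussians cross midway between their means.
        return midpoint
    a, b = 1.0 / sf**2, 1.0 / sr**2
    root = math.sqrt(a * b * (mf - mr) ** 2 + 2.0 * (a - b) * math.log(sr / sf))
    lead = a * mf - b * mr
    candidates = ((lead + root) / (a - b), (lead - root) / (a - b))
    # Two intersections exist; keep the one between the distributions.
    return min(candidates, key=lambda x: abs(x - midpoint))


def cgi_bootstrap_error(w_forward, w_reverse, n_boot=200, seed=0):
    """Standard deviation of CGI estimates over resampled work sets."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        wf = rng.choices(w_forward, k=len(w_forward))
        wr = rng.choices(w_reverse, k=len(w_reverse))
        estimates.append(cgi_estimate(wf, wr))
    return statistics.stdev(estimates)


# Toy work values in kJ/mol (not simulation data).
forward = [10.0 + 0.1 * k for k in range(40)]
reverse = [-(8.0 + 0.05 * k) for k in range(40)]
print(f"CGI dG = {cgi_estimate(forward, reverse):.3f}")
print(f"bootstrap error = {cgi_bootstrap_error(forward, reverse):.3f}")
```

Note that no temperature enters this estimator: once the Gaussian parameters are fixed, the intersection point is determined by the work values alone.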
Another ΔG estimator, termed BAR (Bennett’s Acceptance
Ratio), does not require an analytical approximation for the work
distributions. Originally, the BAR relation was derived in 1976 by
Bennett for a system sampling two states at equilibrium and
performing instantaneous transformations between the states.
Bennett showed that the information from the forward and reverse
distributions of the potential energy difference (ΔU) could be
combined in order to obtain an optimal estimate of the free energy
difference [46]. For a non-equilibrium process carried out during a
finite amount of time, the same derivation holds by substituting
ΔU with the non-equilibrium work W. In 2003, Shirts and coworkers
showed how the same estimator can be derived starting from
the Crooks Fluctuation Theorem using maximum-likelihood arguments
[47]. The BAR estimates the free energy difference by
satisfying the following relation:
$$\sum_{i=1}^{N_f} \frac{1}{1 + \frac{N_f}{N_r} e^{\beta (W_i - \widehat{\Delta G})}} = \sum_{j=1}^{N_r} \frac{1}{1 + \frac{N_r}{N_f} e^{-\beta (W_j - \widehat{\Delta G})}} \tag{10}$$
where $N_f$ and $N_r$ are the numbers of forward and reverse
trajectories, $W_i$ are the forward work values, and $W_j$ are the
reverse work values taken with inverted sign, consistent with the
argument of $P_r$ in Eq. 7. The BAR equation needs to be solved
numerically, for instance, by using a Newton–Raphson or Nelder–Mead solver
[48]. This estimator is asymptotically unbiased and an analytical
expression for its variance is available [47]. Furthermore, a convergence
criterion for BAR has been proposed [49].
The main assumption in the BAR derivation is that the work
values are statistically independent [46, 47]. It is thus important to
bear this in mind, because if initial configurations are selected from
an equilibrium simulation with high frequency, the resulting work
values may be correlated [21].
2.3 Free Energy Differences Upon Protein Mutation: The Alchemical Path
To calculate a free energy difference, firstly we need to define the
initial and final states of interest, and secondly the path connecting
them. If we consider the folding example already used, then the
initial state would be the unfolded protein and the final state would
be the folded protein, with the free energy difference we want to
calculate being the protein folding free energy. If the structure of
the folded protein is known, we can then transform state B into
state A (via a reversible or irreversible process) by, for instance,
pulling the N- and C-termini apart and measuring the work needed
to unfold the protein. Although this is in principle possible, such a
large perturbation of the system will likely require a lot of
computation in order to achieve convergence. However, if the interest is
in evaluating changes in folding free energy upon protein mutation
($\Delta\Delta G_{\mathrm{Folding}}$), it is possible to build a thermodynamic cycle
(Fig. 1) that allows this quantity to be calculated via alchemical (i.e.,
non-physical) paths that introduce smaller perturbations in the
system, and which are easier to converge. Thus, thanks to the fact
that computationally we have control over the topology and potential
energy function describing the system, we can take full advantage
of the better convergence properties of the non-physical
transformations over physical ones; i.e., it is easier to obtain accurate
results by alchemically mutating a wild-type protein into its mutant,
rather than (un)folding both of them. As the free energy is a
state variable, the obtained free energy changes are path-independent.
It is therefore unproblematic to choose unphysical pathways.
Fig. 1 Schematic representation of a thermodynamic cycle to calculate changes
in protein folding free energy upon mutation ($\Delta\Delta G^{\mathrm{Mutation}}_{\mathrm{Folding}}$). The left column
shows the folding process of a wild-type protein, with the associated folding free
energy $\Delta G^{\mathrm{WT}}_{\mathrm{Folding}}$; the right column shows the same folding reaction but for a
mutated protein, resulting in the folding free energy $\Delta G^{\mathrm{Mut}}_{\mathrm{Folding}}$. The process
depicted in the bottom row corresponds to the alchemical transformation of the
wild-type unfolded protein into the mutant with the associated free energy
difference $\Delta G^{\mathrm{Mutation}}_{\mathrm{Unfolded}}$. The reaction in the top row corresponds to the same
alchemical transformation but done on the folded protein, so that the free energy
difference between the two variants is $\Delta G^{\mathrm{Mutation}}_{\mathrm{Folded}}$. The free energy differences for
the vertical processes are computationally demanding to compute, but those for
the horizontal transformations are more accessible. Thus, $\Delta\Delta G^{\mathrm{Mutation}}_{\mathrm{Folding}}$ can be
calculated from the difference between $\Delta G^{\mathrm{Mutation}}_{\mathrm{Folded}}$ and $\Delta G^{\mathrm{Mutation}}_{\mathrm{Unfolded}}$
2.3.1 The Thermodynamic Cycle
As shown in Fig. 1, one can define a cycle where for both the initial
(unfolded) and final (folded) states the wild-type protein is
transformed into a mutant of interest via a non-physical path. The change
in folding free energy upon an amino acid mutation
($\Delta\Delta G^{\mathrm{Mutation}}_{\mathrm{Folding}}$) can be recovered by following either the physical
paths of folding the WT and mutant protein
($\Delta G^{\mathrm{Mut}}_{\mathrm{Folding}} - \Delta G^{\mathrm{WT}}_{\mathrm{Folding}}$)
or the alchemical paths of morphing the amino acids in the folded
and unfolded states
($\Delta G^{\mathrm{Mutation}}_{\mathrm{Folded}} - \Delta G^{\mathrm{Mutation}}_{\mathrm{Unfolded}}$).
From the thermodynamic cycle in Fig. 1 it is clear that in order
to calculate $\Delta\Delta G^{\mathrm{Mutation}}_{\mathrm{Folding}}$ we need to be able to simulate the protein’s
unfolded state. However, the unfolded state of the full-length
protein is by its nature poorly defined and would be challenging
to simulate [50, 51]. Therefore, short protein fragments have
typically been used [52–54]. In particular, it has been observed that
capped, sequence-context-independent tripeptides (GXG, where X
is the mutated residue) serve as a good approximation of the
unfolded state for estimating changes in protein thermostability
[20]. In practice, the context-independent tripeptides are convenient
to use, as they allow all possible residue mutations to be precomputed
systematically. In such a way, one only needs to calculate
$\Delta G^{\mathrm{Mutation}}_{\mathrm{Folded}}$, while $\Delta G^{\mathrm{Mutation}}_{\mathrm{Unfolded}}$ can be found in a precomputed table.
Although here we take protein folding as an example, the same
alchemical approach can easily be used to build other thermody-
namic cycles by changing the end states; for instance, differences in
ligand–protein, protein–protein, or protein–DNA/RNA binding
free energy can be calculated by using the apo protein as the initial
state and the complex as the final one. Note that while the
$\Delta G^{\mathrm{Mutation}}$ values refer to non-physical transformations, the final
ΔΔG value obtained from such cycles is that of a physical process
(e.g., folding, association, etc.) and can be directly compared to
experimental values that measure the same free energy differences.
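The bookkeeping of the thermodynamic cycle amounts to a single subtraction. The sketch below combines the two alchemical free energies into a ΔΔG value; the function name and the numbers in the example are hypothetical placeholders (not computed or published results), and adding the two uncertainties in quadrature assumes that the folded and unfolded calculations are statistically independent.

```python
import math


def ddg_from_cycle(dg_folded, dg_unfolded, err_folded=0.0, err_unfolded=0.0):
    """Close the thermodynamic cycle of Fig. 1.

    dg_folded   : alchemical free energy of the mutation in the folded state
    dg_unfolded : alchemical free energy of the mutation in the unfolded state
                  (e.g., taken from a precomputed tripeptide table)
    Returns the change in folding free energy and its propagated error.
    """
    ddg = dg_folded - dg_unfolded
    # Independent errors are combined in quadrature (assumption of this sketch).
    err = math.sqrt(err_folded**2 + err_unfolded**2)
    return ddg, err


# Placeholder numbers in kJ/mol, for illustration only.
ddg, err = ddg_from_cycle(30.0, 20.0, err_folded=0.3, err_unfolded=0.4)
print(f"ddG = {ddg:.1f} +/- {err:.1f} kJ/mol")
```

With the sign convention of Fig. 1, a positive value means that the mutation makes the folding free energy less favorable, i.e., it destabilizes the folded state.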
2.3.2 Single and Dual Topology
We have described how alchemical transformations can be used to
build thermodynamic cycles that allow one to calculate changes in
free energy differences upon an amino acid mutation. However,
how can one alchemically mutate one residue into another during a
simulation? Given the separate Hamiltonians at the two end states,
it is necessary to define a hybrid topology that contains both
physical states. In the specific case of mutating an amino acid into
another one, the residue being mutated must be able to represent
both the wild-type and mutant residue. This is typically achieved
using the single or dual topology approach [55–57].
Fig. 2 Example of the single and dual topology setup for the mutation of valine
into serine. Dummy atoms in the three-dimensional rendering are shown as
transparent balls and sticks, whereas in the chemical structure drawings they
are shown in grey. In the single topology approach, a methyl part of valine’s side
chain is transformed into serine’s hydroxyl group, with a carbon becoming an
oxygen, while two hydrogens are turned into non-interacting dummy particles;
all hydrogens of the second methyl are decoupled as well, while the carbon
becomes a Cβ hydrogen. In the dual topology approach, no element mutation
occurs, because both valine and serine side chains are present in both states,
where, however, only one of the two is coupled to the system, with the other one
being non-interacting
In the single topology approach (Fig. 2), a number of atoms of

state A is mapped onto the atoms of state B. This means that not
only a particle’s partial charge, but also its chemical element (i.e.,
atom type) can change according to the λ parameter. For instance,
in the example shown in Fig. 2 for a valine being mutated into a
serine, one of valine’s carbon atoms at λ = 0 becomes serine’s
oxygen at λ = 1. Effectively, this means that along the alchemical
path controlled by λ, the Lennard-Jones and bonded parameters for
that particle are interpolated between those of a carbon atom and
those of an oxygen atom. Note that such change in chemical
identity implies that also the associated equilibrium bond lengths
will be modified (e.g., a C–H bond will be shorter than a C–C
bond) [55, 58, 59]. Often, the number of atoms in the two end
states is not equal, thus not all atoms of the states A and B can be
matched. Therefore, non-interacting particles are used either in
state A or B. These dummy atoms do not have electrostatic and
van der Waals (vdW) interactions with the system; however, they
maintain their bonded interactions, so that they effectively are in a
vacuum-like state. In the example in Fig. 2, five of valine’s hydro-
gen atoms are turned into dummy atoms.
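The idea of a λ-dependent hybrid atom can be illustrated with a simple interpolation of its parameters between the two end states. The sketch below assumes plain linear interpolation, which is a simplification: how a given MD engine interpolates each type of parameter may differ. The parameter names and values are hypothetical and do not come from any force field.

```python
def interpolate(param_a, param_b, lam):
    """Linearly mix a state A and a state B parameter for 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie between 0 and 1")
    return (1.0 - lam) * param_a + lam * param_b


# Hypothetical parameters of one morphed atom (carbon-like -> oxygen-like).
state_a = {"charge": -0.30, "bond_length_nm": 0.153}
state_b = {"charge": -0.65, "bond_length_nm": 0.141}

for lam in (0.0, 0.5, 1.0):
    hybrid = {key: interpolate(state_a[key], state_b[key], lam) for key in state_a}
    print(lam, hybrid)
```

At λ = 0 and λ = 1 the hybrid atom carries exactly the parameters of the physical end states; only intermediate λ values correspond to non-physical mixtures.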
In the dual topology approach, atoms that are different
between the two end states are not morphed directly, but rather
transformed into dummy particles [26, 56, 57]. For amino acids,
this effectively means that the side chains of both residues are
present at the same time. However, at λ = 0 the side chain of the
initial state is interacting with the system and the side chain of the
final state is present as non-interacting particles. On the other hand,
at λ = 1 the side chain of the final state is interacting and that of the
initial state is turned into non-interacting dummy atoms. This can
be seen in Fig. 2: in the initial state, the hydroxymethyl side chain of
serine is decoupled, whereas in the final state it is the isopropyl side
chain of valine that is turned off.
In practice, there does not need to be a clear separation
between a single and dual topology setup. While some atoms may
be morphed between the states following a single topology
approach, other atoms in the same system may be turned into
dummies according to a dual topology approach.
It is important to bear in mind that the free energy change (ΔG)
of the mutation differs depending on whether the single or dual
topology approach is used. This is due to the fact that the end states
are effectively different due to different dummy atom construc-
tions. In addition, in the single topology approach there is a con-
tribution to the free energy difference from the change in bond
lengths. However, the contributions to the free energy difference
resulting from the details of the atom mapping between the end
states cancel out in a thermodynamic cycle like the one in Fig. 1,
such that the final ΔΔG value is independent of how the hybrid
topology is implemented [57, 59].
Using dummy particles in alchemical transitions requires particles
to be introduced into and annihilated from the system. Such
transformations impose a large perturbation; e.g., creating a particle
that interacts with the environment in place of a non-interacting
dummy atom results in strong van der Waals repulsions and Coulombic
interactions. In turn, large forces are exerted on the atoms,
which leads to instabilities in the dynamics and integration artifacts. To
circumvent these issues, it is common practice to modify
(“soften”) the non-bonded interactions during the alchemical
transformations. A number of functional forms and parameter sets for
such soft-core interactions have been proposed [60–64]. Altering
the non-bonded interactions along the alchemical pathway does
not affect the final free energy estimates, because the physical end
states are still described by the correct unmodified Hamiltonian.
The official release of Gromacs 2016 implements a soft-core variant
[60] that allows both the van der Waals and Coulombic
interactions to be modified (see Note 1).
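To illustrate why softening removes the singularity, the sketch below evaluates a Lennard-Jones interaction for a particle that is a dummy at λ = 0 and fully interacting at λ = 1, using a Beutler-type soft-core form in which the distance r is replaced by an effective distance that stays finite at r = 0. Whether this is exactly the variant and parameterization referred to above is an assumption; the function names and the default α and power values are illustrative choices only.

```python
def lj(r, sigma, epsilon):
    """Plain 12-6 Lennard-Jones energy; diverges as r approaches zero."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def softcore_lj_appearing(r, lam, sigma, epsilon, alpha=0.5, power=1):
    """Soft-core LJ energy of a particle that appears as lam goes 0 -> 1.

    The distance is replaced by an effective distance r_eff that stays
    finite at r = 0 for lam < 1 and reduces to r at lam = 1.
    """
    r_eff = (alpha * sigma**6 * (1.0 - lam) ** power + r**6) ** (1.0 / 6.0)
    return lam * lj(r_eff, sigma, epsilon)


# Illustrative parameters: sigma in nm, epsilon in kJ/mol.
sigma, epsilon = 0.3, 1.0
for lam in (0.0, 0.5, 1.0):
    energy = softcore_lj_appearing(0.2, lam, sigma, epsilon)
    print(f"lambda = {lam:.1f}: V(r = 0.2 nm) = {energy:.3f} kJ/mol")
```

At λ = 1 the expression reduces to the unmodified Lennard-Jones potential, in line with the statement that the physical end states are described by the correct Hamiltonian.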
3 Alchemical Amino Acid Mutations
In this section we use a Trp cage mini protein [22] as a model

system to illustrate the process of performing a single amino acid
mutation. The pmx [20, 26] software package will be used to
introduce a point mutation in this 20 amino acid peptide. pmx
provides a single topology-based setup of the alchemical calcula-
tions allowing for an automated generation of hybrid amino acid
structures and topologies compatible with the Gromacs [23] MD
simulation engine.
In this example we will describe in detail the steps needed to
prepare the alchemical simulations (Fig. 3) and calculate the free
energy difference upon a tryptophan, W6, to phenylalanine muta-
tion (W6F) in the Trp cage protein. W6 is the key residue in the
hydrophobic core of this mini protein providing stability to its fold.
Fig. 3 A schematic depiction of the main steps in generating hybrid structures

and topologies for the alchemical simulations using pmx. Firstly, pmx is used to
introduce a mutation into the protein. Afterwards, the Gromacs tool pdb2gmx
generates a topology for the protein with the hybrid residue in a user chosen
molecular mechanics force field. In the last step, pmx is used again to add the
B-state parameters to the topology file
We will assess the change in the thermodynamic stability by calculating
the double free energy difference (ΔΔG) for the W6F
mutation in the folded Trp cage and its unfolded variant approximated
by a capped tripeptide (Fig. 1).
For the results of more alchemical mutations in the Trp cage
protein, see [65].
3.1 Setting Up pmx
pmx is a Python library that allows the convenient manipulation of
biomolecular structure and topology files. Within the framework of
pmx, a number of scripts have been developed and specifically
designed to prepare and analyze alchemical free energy calculations.
pmx generates topology files that are compatible with the Gromacs
simulation engine.
Mutations in a number of contemporary molecular mechanics
force fields are supported. This is achieved by means of
pre-generated mutation libraries compatible with the Gromacs
force field organization. After installing Gromacs and pmx, the
GMXLIB environment variable needs to be set to specify the
path to the mutation libraries that come with the pmx package (see
Note 2).
3.2 Hybrid Structure
The first step in the setup comprises the generation of the hybrid
structure for the amino acid to be mutated (Fig. 3). The only file
required for this step is the protein structure in .pdb or .gro format.
The protein structure needs to be complete, i.e. all heavy and
hydrogen atoms need to be present. In order to add missing
heavy atoms, external software needs to be used, e.g., Rosetta
[15], Modeller [66], or PyMol [67]. Furthermore, given that
structures resolved by means of X-ray crystallography usually con-
tain no hydrogen atoms, these need to be added as well. Various
software packages, like WhatIf [68] or Rosetta, offer assignment of
hydrogen coordinates for protein structures. The Gromacs tool
pdb2gmx can do this too. In fact, it is convenient to pre-process
a .pdb file with pdb2gmx because it produces a structure file with
atom names already compatible with the Gromacs internal atom
naming given the selected force field. pdb2gmx also identifies
whether any heavy atoms in a protein are missing, so that the tool
can be used to identify incomplete residues. While pdb2gmx will not
model missing heavy atoms, it will inform about such deficiencies.
Note that pdb2gmx will fail if the input structure contains molecules
that are not readily recognized by Gromacs. Therefore, molecules
that are not present in the force field file have to be removed from
the structure at this stage and processed independently.
For the Trp cage model system we use an NMR structure
(PDB-ID 1L2Y) [22] that was deposited with 38 conformers.
After manually extracting conformer #2, we pre-process the struc-
ture by running it through pdb2gmx:
gmx pdb2gmx -f 1l2y_conf2.pdb -o 1l2y_conf2_pdb2gmx.pdb -ff amber99sb-star-ildn-mut -water none -ignh
In this example we have selected an updated version of the
Amber99sb*ILDN force field [69–71] for which the mutation
library has been pre-generated. No water model needs to be chosen
at this stage, because with this step we only want to obtain a
pre-processed structure file with added hydrogens and Gromacs
compatible atom names. The “-ignh” flag ignores the hydrogen
atoms already present in the structure and adds them again using
the pdb2gmx logic, ensuring the names of the hydrogen atoms are
compatible with Gromacs and the selected force field (see Note 3).
The output structure file obtained as described above is then
used as an input for the pmx script mutate.py:
python mutate.py -f 1l2y_conf2_pdb2gmx.pdb -o mut.pdb -ff amber99sb-star-ildn-mut
Upon execution, the command prompts for an interactive
selection of a residue to mutate (W6) and a target amino acid
(Phe or F). When developing a workflow for a large-scale mutation
scan, it may be convenient to provide the information about the
amino acid mutations as a text file. For this purpose a “-script” flag
in mutate.py is available: this option expects a text file with an amino
acid number and the name of the residue to mutate into. In the case
of the Trp cage example: 6 Phe.
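For a non-interactive workflow, the mutation file and the matching command line can be assembled programmatically. The following is a minimal Python sketch, assuming that mutate.py is reachable under the name given and that the file passed to -script holds one "residue number, target residue" pair per line, as in the Trp cage example. The helper names write_mutation_file and build_mutate_command are hypothetical and are not part of pmx; only the flags -f, -o, -ff, and -script are taken from the text.

```python
from pathlib import Path


def write_mutation_file(path, mutations):
    """Write one '<residue number> <target residue>' pair per line.

    `mutations` is a list of (residue_number, target_name) tuples,
    e.g. [(6, "Phe")] for the W6F example.
    """
    lines = [f"{resid} {target}" for resid, target in mutations]
    out = Path(path)
    out.write_text("\n".join(lines) + "\n")
    return out


def build_mutate_command(structure, output, force_field, script_file,
                         mutate_script="mutate.py"):
    """Assemble the mutate.py call as an argument list (hypothetical helper).

    The location of mutate.py is an assumption; adjust `mutate_script`
    to wherever the pmx scripts are installed.
    """
    return ["python", mutate_script,
            "-f", str(structure),
            "-o", str(output),
            "-ff", force_field,
            "-script", str(script_file)]
```

The returned list can be handed to subprocess.run(cmd, check=True) once the mutation file has been written; building the command separately from running it makes the workflow easy to inspect before execution.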
3.3 Topology
At this point we use the hybrid structure from the previous step
(“mut.pdb”) as an input to pdb2gmx (Fig. 3). This time we want to
obtain the topology file containing all the information needed by
Gromacs to run the simulations. The topology file will also include
the description of the hybrid mutated residue; however, the output
topology file defines parameters for only one physical state (state A).
It is also important to note that at this step the
“-ignh” flag should not be set, since the hydrogen atoms have
already been added in the previous step.
gmx pdb2gmx -f mut.pdb -o mut_pdb2gmx.pdb -ff amber99sb-star-ildn-mut -water tip3p -p topol.top
If one wants to include a ligand that has been parameterized
separately, this can be added to the structure
(“mut_pdb2gmx.pdb”) and topology file (“topol.top”) at this
stage.
3.4 Hybrid Topology
The generated topology file (“topol.top”) has the hybrid residue
W2F incorporated. However, it is a non-standard hybrid amino
acid with two physical states (A and B). While state A is included
in the topology, state B still needs to be included explicitly. The
required topology parameters for state B can be added by the pmx
script generate_hybrid_topology.py (Fig. 3):
python generate_hybrid_topology.py -p topol.top -o hybrid.top -ff amber99sb-star-ildn-mut
3.5 Webserver
The procedure detailed above (and summarized in Fig. 3) can also
be executed via a webserver interface: http://pmx.mpibpc.mpg.de.
Provided with a protein structure file, the pmx webserver will
perform a user-selected mutation in one of the supported molecu-
lar mechanics force fields.
The webserver runs a number of additional structure
pre-processing steps that simplify the setup procedure. While bro-
ken or incomplete proteins will not be repaired, a number of other
useful modifications are applied: residue and atom names are
matched to the force field nomenclature, terminal residues are
dealt with, and if needed hydrogen atoms may be added via
pdb2gmx. Optionally, the structure may be checked before the
mutation is performed, so that the user is informed about any
potential deficiencies in the input file. In addition, the setup offered
by the webserver is not limited to single amino acid mutations, but
also allows files to be prepared for mutation scans over selected protein
chains.
3.6 Alchemical Simulations
The hybrid structures and topologies we just obtained can readily
be used for MD simulations and to calculate free energy differences.
Numerous protocols for relative alchemical free energy calculations
are currently available: equilibrium approaches (TI, FEP) as well as
non-equilibrium methods. Here, we employ non-equilibrium cal-
culations based on the Crooks Fluctuation Theorem.
3.6.1 System Preparation
Firstly, the hybrid structure and topology are used in preparing the
system for molecular dynamics simulations following a standard
procedure. The protein needs to be placed in a simulation box
and solvated. Then ions need to be added to neutralize the system
and, optionally, reach a desired salt concentration. These are con-
ventional steps used to prepare an ordinary MD simulation: for a
more detailed description of this procedure in Gromacs we refer the
reader to a specialized protocol [72].
3.6.2 Equilibrium Simulations
Next, we set up two equilibrium simulations: one for the WT Trp
cage (W6, state A, λ = 0) and another for the mutated protein (F6,
state B, λ = 1) (Fig. 4). We start with an energy minimization
performed on both states separately. The parameters for the energy
minimization (.mdp) are the same as those used in non-alchemical
simulations, with the exception of two flags. The free-energy
flag has to be set to yes. This indicates that the free energy code in
Gromacs will be activated for those interactions that have two sets
of parameters (states A and B) in the topology file. In addition, the
init-lambda flag has to be set to 0 for the simulation in state
A (WT Trp cage) and to 1 for state B (mutated Trp cage).

Fig. 4 The procedure of non-equilibrium alchemical simulations for one leg of the
thermodynamic cycle: mutation in the folded state of a protein. Two independent
equilibrium simulations are performed by keeping the system in its physical
states: WT (λ = 0) and mutant (λ = 1). These simulations need to sufficiently
sample the end state ensembles, as the accuracy of the free energy estimate will
depend on the convergence of the equilibrium sampling. Typically, the equilibrium
simulations are in the nanosecond to microsecond time range. From the
generated trajectories, snapshots are selected to start fast (typically 10–200 ps)
transitions driving the system in the forward (λ: 0 → 1) and reverse (λ: 1 → 0)
directions. The work values required to perform these transitions are collected
and the Crooks Fluctuation Theorem is used to calculate the free energy
difference between the two states

After the energy minimization runs for the A and B states are
complete, MD simulations can be started from the energy mini-
mized conformations. Similarly to the energy minimization, the
simulation parameters are identical to the conventional MD runs,
except for setting the free-energy and init-lambda flags for
the simulations in state A and B, respectively (Fig. 4). These equi-
librium runs are used to sample the relevant phase space volumes,
i.e., the conformational changes in the WT and mutated variant of
Trp cage. Therefore, the ensembles generated during the equilib-
rium runs will define how accurately the free energy difference will
be estimated. This consideration dictates the sampling time: the
simulation time should be sufficient to sample the transitions that
are considered to be relevant. For example, if a protein is known to
undergo large-scale conformational changes and the introduced
mutation may be affecting the populations of these conformers,
the simulation time has to be long enough to properly sample such
transitions. Equilibrium simulations in this case could require
microseconds or longer to converge. On the other hand, it is
often important to estimate the free energy difference for a struc-
ture that would remain close to its experimentally resolved struc-
ture. In this scenario, it is sufficient to sample smaller changes in
rotameric states of the side chains and minor backbone motions. In
previous large-scale amino acid scans investigating protein thermo-
dynamic stabilities, we have observed good agreement with experi-
mental data when using 10–20 ns of equilibrium sampling [1, 20].
Another issue to consider when choosing the sampling time is
the definition of states for which the free energy difference will be
calculated. In the Trp cage example, we are aiming to estimate the
mutation-induced change in the folding free energy.
This implies that one of the end states that we need to simulate
needs to be the folded state, while the other needs to be the
unfolded state. If we were to introduce a destabilizing mutation
(in fact W6F has been shown to strongly destabilize Trp cage
[73, 74]), over a longer simulation time the protein would unfold.
Thus, the definition of the folded state used in the free energy
calculation would be violated, rendering the calculated free energy
differences inaccurate. For the Trp cage W6F mutation example,
we will use equilibrium simulations of 10 ns: short enough such
that no spontaneous unfolding occurs.
3.6.3 Non-equilibrium Transitions
Once the equilibrium simulations are completed, we can proceed to
the non-equilibrium part of the simulation protocol. Fast
non-equilibrium transitions serve the purpose of connecting the
two physical states (A and B) and allow obtaining the free energy
difference between them. These transitions are started from snap-
shots extracted from the two equilibrium trajectories. From each
equilibrium trajectory, we discard the first 2 ns as equilibration
time. Then we use the last 8 ns to extract 100 frames equidistant in
time (1 frame every 80 ps) representing an equilibrium ensemble of
starting conformations for the non-equilibrium transitions. These
frames can be conveniently extracted using the Gromacs tool gmx
trjconv (see Note 4). Note that the number of non-equilibrium
transitions performed will influence the accuracy of the free energy
estimate. More transitions allow acquiring more work values, which
in turn allow for a more accurate ΔG estimate. In previous
investigations we have observed that 50–100 transitions are gener-
ally sufficient to obtain accurate estimates of folding free energy
changes upon protein mutation [1, 20].
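The frame selection described above can be checked with a few lines of Python. This is a minimal sketch, assuming that the 10 ns equilibrium trajectory was written every 80 ps starting at t = 0 and includes the final frame at 10,000 ps; the function name starting_frame_times is hypothetical.

```python
def starting_frame_times(total_ps, discard_ps, stride_ps):
    """Return the times (in ps) of the saved frames after the discarded part.

    Assumes frames were written at t = 0, stride_ps, 2*stride_ps, ...
    up to and including total_ps.
    """
    saved = range(0, total_ps + stride_ps, stride_ps)
    return [t for t in saved if discard_ps < t <= total_ps]


frames = starting_frame_times(total_ps=10_000, discard_ps=2_000, stride_ps=80)
print(len(frames), frames[0], frames[-1])  # 100 2080 10000
```

Under these assumptions, discarding the first 2 ns leaves exactly 100 starting structures, the first at 2080 ps and the last at 10,000 ps.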
Another parameter influencing the accuracy of the ΔG estimate
is the transition time. Over the course of a shorter transition, the
system is driven further away from equilibrium, and more work is
dissipated along the alchemical path. In turn, the free energy esti-
mate becomes less accurate. The optimal transition time depends
on the size of the perturbation and the nature of the system, e.g.,
replacing a small residue with a large one represents a larger pertur-
bation than a small-to-small residue mutation. Larger perturba-
tions may require slower transitions to obtain free energies at the
desired level of accuracy. Transition times ranging from tens to
hundreds of picoseconds are usually enough to return accurate
results [1, 20].
For the Trp cage example considered here, we use
non-equilibrium transitions of 50 ps. Using a 2 fs time step, this
means running 25,000 integration steps. Therefore, the λ value will
be changed at a rate of 1/25,000 = 4e-5 per step. For the transition
simulations in the forward (state A to B) direction, we set the
following parameters in the .mdp file:
nsteps = 25000
nstcalcenergy = 1
nstdhdl = 1
free-energy = yes
init-lambda = 0
delta-lambda = 4e-5
sc-alpha = 0.3
sc-sigma = 0.25
sc-power = 1
sc-coul = yes
nsteps defines the number of steps. The system starts at
init-lambda = 0 and is morphed into the λ = 1 state over the
course of a transition in nsteps number of steps. The energy and
∂H/∂λ are calculated at every integration step (nstcalcenergy
and nstdhdl set to 1). The parameters starting with the prefix sc-
control the soft-core interactions. In this non-equilibrium protocol
we soften both the van der Waals and Coulombic interactions:
sc-coul = yes (see Note 1).
For the transitions in the reverse direction (state B to A), the
.mdp parameters are the same, with the exception of the starting
state and the direction of the transition:
init-lambda = 1
delta-lambda = -4e-5
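The relation between transition time, time step, nsteps, and delta-lambda can be made explicit in a short script. The following is a minimal Python sketch that reproduces the parameter values listed above for either direction; the function names transition_mdp and format_mdp are hypothetical, and the soft-core values are simply copied from the forward-transition example rather than derived.

```python
def transition_mdp(transition_ps, dt_ps, forward=True):
    """Return the alchemical .mdp settings for one non-equilibrium transition.

    lambda is changed linearly, so delta-lambda = 1 / nsteps
    (negative for the reverse, state B to A, direction).
    """
    nsteps = round(transition_ps / dt_ps)
    delta = 1.0 / nsteps
    return {
        "nsteps": nsteps,
        "nstcalcenergy": 1,
        "nstdhdl": 1,
        "free-energy": "yes",
        "init-lambda": 0 if forward else 1,
        "delta-lambda": delta if forward else -delta,
        # Soft-core settings copied from the example in the text.
        "sc-alpha": 0.3,
        "sc-sigma": 0.25,
        "sc-power": 1,
        "sc-coul": "yes",
    }


def format_mdp(params):
    """Render the settings as 'key = value' lines for an .mdp file."""
    return "\n".join(f"{key:<14} = {value}" for key, value in params.items())


forward = transition_mdp(transition_ps=50, dt_ps=0.002, forward=True)
reverse = transition_mdp(transition_ps=50, dt_ps=0.002, forward=False)
print(format_mdp(forward))
```

For slower transitions only transition_ps needs to change; nsteps and delta-lambda then stay consistent with each other, so that λ reaches the opposite end state exactly at the last step.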
For each of the 100 transitions in both directions, the data we
need to collect are the ∂H/∂λ values, which are stored in the
“.dhdl.xvg” files. Integration over these values gives the work
performed during the non-equilibrium transitions in the forward
and reverse directions (Fig. 4).
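The work for a single transition is the integral W = ∫ (∂H/∂λ) dλ along the alchemical path. As an illustration, the sketch below integrates a ∂H/∂λ curve with the trapezoidal rule in plain Python. It assumes that the λ value belonging to each recorded ∂H/∂λ value has already been reconstructed (e.g., from the step number and delta-lambda); the function name work_from_dhdl is hypothetical, and the pmx analysis script may use a different quadrature.

```python
def work_from_dhdl(lambdas, dhdl):
    """Trapezoidal estimate of W = integral of dH/dlambda over lambda.

    `lambdas` and `dhdl` are equally long sequences ordered as they were
    recorded. For a reverse transition lambda runs from 1 to 0, so the
    negative d(lambda) gives the reverse work its sign automatically.
    """
    if len(lambdas) != len(dhdl) or len(lambdas) < 2:
        raise ValueError("need at least two matching (lambda, dH/dlambda) points")
    work = 0.0
    for i in range(1, len(lambdas)):
        mean_dhdl = 0.5 * (dhdl[i] + dhdl[i - 1])
        work += mean_dhdl * (lambdas[i] - lambdas[i - 1])
    return work
```

With ∂H/∂λ in kJ/mol, the returned work is in kJ/mol as well, and one work value is obtained per transition.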
3.7 Analysis
The integration over the ∂H/∂λ curves and the free energy difference
estimation can be performed with the pmx script analyze_dhdl.py:
python analyze_dhdl.py -fA stateA/dhdl*.xvg -fB stateB/dhdl*.xvg
The script will output the summary of results in a text file
containing the estimate of the free energy difference using three
estimators: Crooks Gaussian Intersection (CGI), Bennett’s Acceptance
Ratio (BAR), and Jarzynski’s equality. While CGI and BAR
use the work distributions generated in both the forward and reverse
directions, Jarzynski’s estimator is one-directional. We recommend
using the BAR estimation for the ΔG value, as it utilizes all the
available work values from both directions and makes no assumptions
about the shape of the work distributions. Conveniently, the
script also generates plots of the work values over time and of their
distributions (Fig. 5), which are useful for detecting insufficient
sampling or a lack of overlap between the work distributions.
The convergence of the results can be assessed in various ways.
Firstly, if a systematic drift of the work values over time is observed,
it usually indicates lack of convergence during the equilibrium
sampling stage. The work values are likely to drift due to a confor-
mational change and it may be important to thoroughly sample the
significant conformational motions in the protein. Lack of conver-
gence may also be deduced from the error values provided together
with the free energy estimates. The uncertainties of the CGI and
BAR estimators are sensitive to the lack of the overlap between the
forward and reverse work distributions (see Note 5). A large uncer-
tainty in the ΔG estimate indicates that the overlap between the
work distributions might be insufficient. Slower transitions keep the
system closer to equilibrium, so that less work is dissipated along the
path and the overlap between work distributions generally increases.
Running more non-equilibrium transitions increases the probability
of observing work values with low dissipation, which also contri-
butes toward good overlap of the work distributions.

Fig. 5 A standard output generated by the pmx analyze_dhdl.py script. On the
left, the work values for the forward and reverse transitions are depicted for
every starting structure. On the right, the distributions of these work values are
shown as histograms, the intersection of which allows obtaining the free energy
difference. In the current case, the BAR estimator yields a ΔG value of
−4.29 ± 0.63 kJ/mol for the W6F mutation in the folded state of the Trp cage
protein

The most reliable way to assess the precision of the free energy
estimates obtained is to repeat the whole procedure, including
equilibrium and non-equilibrium simulations, multiple times. The
calculated ΔG values and their spread obtained from multiple
independent calculations more accurately capture under-sampling
issues. For the Trp cage W6F mutation, we have obtained a ΔG
value of −4.29 ± 0.63 kJ/mol (Fig. 5) from a single calculation.
Then, we repeated the whole calculation five times, from the system
preparation to the equilibrium and non-equilibrium simulations.
The average free energy value we obtained was −3.73 kJ/mol
with a standard error of 0.88 kJ/mol. This result confirms that the
ΔG estimate obtained can be considered to be reliable.
3.8 Double Free Energy Difference
So far we have calculated the free energy difference for one leg of
the thermodynamic cycle (Fig. 1): mutation in a folded protein. To
obtain the final double free energy difference the same procedure
needs to be performed for the unfolded Trp cage peptide. It has
been demonstrated that in the context of the alchemical free energy
calculations the unfolded state can be approximated by a capped
tripeptide with the residue of interest surrounded by two
glycines [20].
Given the tripeptide approximation, the ΔG values can be
pre-calculated for every amino acid combination of interest. The
tabulated tripeptide ΔG values can be found on the pmx webserver:
http://pmx.mpibpc.mpg.de (see Note 6). The tripeptide’s W2F
mutation in the Amber99sb*ILDN force field has an associated
free energy change ($\Delta G^{\mathrm{Mutation}}_{\mathrm{Unfolded}}$) of −17.96 ± 0.32 kJ/mol. The
ΔΔG of interest can thus be calculated as
$\Delta\Delta G = \Delta G^{\mathrm{Mutation}}_{\mathrm{Folded}} - \Delta G^{\mathrm{Mutation}}_{\mathrm{Unfolded}}$.
Thus, in our Trp cage example, the ΔΔG of folding for the W6F
mutation is estimated to be 13.67 ± 0.71 kJ/mol. This calculated
estimate closely matches the experimentally measured destabilization
of 12.5 ± 0.6 kJ/mol [73, 74]. A previous large-scale study
compared calculated and experimental ΔΔG values for protein
thermostability changes upon mutation for the proteins barnase
and Staphylococcal nuclease [1]. It was found that the mean
unsigned error in the predictions was approximately 4 kJ/mol,
with the uncertainty due to finite sampling, the force field, and the
experimental error equally contributing to the discrepancy between
calculated and experimental ΔΔG values. Therefore, the calculated
ΔΔG value for the Trp cage W6F mutation falls well within the range
of the expected accuracy.
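The arithmetic behind the final estimate can be reproduced in a few lines. This is a minimal sketch that takes the folded and unfolded ΔG values as −4.29 and −17.96 kJ/mol (the signs that reproduce the reported ΔΔG of 13.67 kJ/mol) and assumes that the two uncertainties are independent and can be combined in quadrature; the latter is our assumption, although it does reproduce the reported 0.71 kJ/mol. The function name ddg_folding is hypothetical.

```python
import math


def ddg_folding(dg_folded, err_folded, dg_unfolded, err_unfolded):
    """Return (ddG, uncertainty) for ddG = dG_folded - dG_unfolded.

    The uncertainty is propagated in quadrature, assuming the two
    legs of the thermodynamic cycle are statistically independent.
    """
    ddg = dg_folded - dg_unfolded
    err = math.sqrt(err_folded ** 2 + err_unfolded ** 2)
    return ddg, err


ddg, err = ddg_folding(-4.29, 0.63, -17.96, 0.32)
print(f"{ddg:.2f} ± {err:.2f} kJ/mol")  # 13.67 ± 0.71 kJ/mol
```

A positive ΔΔG in this convention corresponds to a destabilizing mutation.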
4 Miscellaneous Applications of the Protocol
In the previous section we have outlined a general protocol for the
calculation of free energy changes upon an amino acid mutation.
However, the protocol can and in some cases should be expanded
or modified in order to fit the specific needs of the problem at hand.
In particular, charge-changing mutations (e.g., Ala to Asp, Trp to
Arg, etc.) need special care as artifacts that affect the accuracy of the
calculations are otherwise introduced when applying Ewald sum-
mation for long range electrostatics. In addition, one is not limited
to a single amino acid mutation per calculation. However, when
mutating more residues at the same time, more attention should be
paid to possible convergence and work overlap issues. Finally, for
protein design applications, large mutational scans need to be set
up and run, and with a few precautions the protocol can be exe-
cuted in a more efficient and convenient manner. All these addi-
tional points are discussed in the following paragraphs.
4.1 Charge-Changing Mutations
It is often of interest to calculate free energy differences upon
amino acid mutations that cause a net charge change. In principle,
there is no fundamental difference whether an alchemical mutation
is to be charge-changing or charge-conserving. The issue, however,
is of a technical nature and has to do with the treatment of long
range electrostatic interactions in molecular dynamics simulations.
The state-of-the-art long range Coulombic interaction calculations
utilize Particle Mesh Ewald (PME) summation [75, 76]. Due to
the specifics of the PME algorithm, for a non-neutral system any
extra charge is neutralized by the implicit introduction of a uniform
background charge. Taking into account the contributions of this
effect to the free energy difference requires additional, and techni-
cally involved, correction schemes [77, 78]. If neglected, significant
artifacts may occur [79]. Therefore, conserving a system’s charge
during the alchemical transition is preferred.
To accomplish this, we suggest using a double-system/single-
box setup [24, 26]. In this approach both legs of a thermodynamic
cycle, e.g., mutation in the folded and unfolded states in Fig. 1, are
placed in the same simulation box. The systems of the separate legs
in the thermodynamic cycle are set in different physical states: if the
folded state protein has a WT residue at λ ¼ 0, then the unfolded
state must be in its mutated form at λ ¼ 0. During the alchemical
transition, the system goes from λ ¼ 0 to 1, and the folded state is
transformed into the mutant while the unfolded (mutant) state is
simultaneously transformed into the WT. In this way, the charge of
the system will be conserved during the transformation, and the
free energy difference calculated already refers to the ΔΔG across
the thermodynamic cycle of interest.
The assumption underlying the double-system/single-box
approach is that both legs of the thermodynamic cycle are indepen-
dent when placed in a single simulation box. Therefore, it is impor-
tant to place the systems (e.g., the folded protein and unfolded
peptide) sufficiently far apart. The distance between the molecules
needs to be larger than the short range electrostatic cutoff. We have
obtained reliable free energy estimates by setting the distance to be
at least 3 nm between any atoms of the molecules from the separate
legs of the cycle [1, 24]. To ensure that the proteins do not diffuse
and come closer to one another during the course of simulations,
position restraints on a single backbone atom close to the center of
mass of each protein ought to be used. These position restraints
affect only the translational degrees of freedom of the proteins. The
contribution of such position restraints to the translational parti-
tion function will be the same in both legs of the thermodynamic
cycle and will cancel from the ΔΔG estimate.
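Whether the two molecules are far enough apart can be verified before starting the simulations. The sketch below computes the smallest atom-to-atom distance under the minimum image convention and compares it with the 3 nm criterion. It assumes a rectangular box and coordinates in nm; triclinic or dodecahedral boxes would need a more general treatment. All function names are hypothetical.

```python
import math


def min_image_distance(a, b, box):
    """Distance between points a and b in a rectangular periodic box."""
    d2 = 0.0
    for ai, bi, length in zip(a, b, box):
        d = ai - bi
        d -= length * round(d / length)  # wrap to the nearest periodic image
        d2 += d * d
    return math.sqrt(d2)


def min_separation(coords_a, coords_b, box):
    """Smallest distance between any atom of molecule A and any atom of B."""
    return min(min_image_distance(a, b, box)
               for a in coords_a for b in coords_b)


def sufficiently_separated(coords_a, coords_b, box, threshold_nm=3.0):
    """True if the molecules are at least threshold_nm apart."""
    return min_separation(coords_a, coords_b, box) >= threshold_nm
```

The double loop scales with the product of the atom counts, which is acceptable for a one-off check of a protein and a tripeptide but would need a neighbor search for much larger systems.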
4.2 More Than One Mutation at Once
The protocol in this chapter described an example of a single amino
acid mutation in a protein. pmx, however, also allows introducing
multiple mutations at once. This can be done either by
interactively selecting more than one mutation to be applied or by
providing a text file with each mutation defined on a separate line.
The pmx webserver also provides the option to
introduce multiple mutations.
The caveat of performing an alchemical transformation for
several amino acid mutations at once is a slower convergence of
the free energy estimate. Having more mutations imposes a larger
perturbation to the system. Hence, more work will be dissipated
along the path and the free energy estimate will become less accurate.
In such a case, performing the non-equilibrium transitions
more slowly may be necessary.
Another way to calculate the effect of multiple mutations is to
perform the mutations sequentially. For example, the free energy
difference of introducing the mutations X and Y at once is equal to
the combined ΔG of performing the mutation X first and in a
separate setup calculating ΔG for the Y mutation in a system where
the X mutation is already present. In fact, since free energy is a state
function, the sequence of introducing the mutations does not
influence the final ΔG estimate, thus the mutation Y can be
performed first and then the mutation X can follow. The free
energy differences calculated in all three scenarios (X and Y at
once, first X then Y, first Y then X) ought to yield the same
estimate. Therefore, the spread of these three ΔG values could
serve as an indicator of the uncertainty in the calculations.
4.3 Mutation Scan
In protein design studies or large-scale mutation investigations, it is
common to perform mutation scans by replacing every amino acid
in a protein sequence with another residue; e.g., alanine scans are
often employed. The command line scripts described in the current
protocol make it easy to build workflows allowing for any number
of mutations to be introduced. The pmx webserver also allows for
an automated generation of the hybrid structures and topologies
for a mutation scan with the user-selected amino acids.
When a protein needs to be mutated multiple times, the end
state representing the wild-type does not need to be simulated at
equilibrium multiple times. The same WT equilibrium simulation
can be reused for all mutants to increase the efficiency of the
protocol. However, in order to be able to reuse the same equilib-
rium run, the generation of the hybrid structure and topology files
has to be postponed to the step after the equilibrium simulations.
Thus, effectively, the steps described in Subheadings 3.2–3.4 need
to be performed after the equilibrium simulations have been carried
out (Subheading 3.6.2). In this scenario, the equilibrium simula-
tions would be performed without invoking the free energy code
and using standard non-hybrid structures and topologies for both
the WT and mutant proteins. The hybrid topology can be gener-
ated only once, using one of the many frames extracted from the
equilibrium simulations. On the other hand, the hybrid coordinates
need to be built by the mutate.py script for all the frames extracted
from the equilibrium trajectories. For example, a custom bash script
with a for loop, calling mutate.py for each of the extracted frames,
would add the atoms for state B to all the snapshots extracted. In
such a way, the same trajectory can be reused for all of the muta-
tions of interest; for each mutation, one needs to create the topol-
ogy with the hybrid residues of interest, and generate the
corresponding hybrid structures.
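The loop over extracted frames does not have to be written in bash. As an alternative illustration, the Python sketch below collects one mutate.py call per frame, using the -script option so that no interactive input is required. It assumes the frames are named frame_n.pdb and sit in a single directory; the function name, the output naming scheme (prefix mut_), and the location of mutate.py are hypothetical.

```python
from pathlib import Path


def mutate_commands(frame_dir, script_file, force_field,
                    pattern="frame_*.pdb", mutate_script="mutate.py"):
    """Build one mutate.py command per extracted frame (hypothetical helper).

    Each hybrid structure is written next to its input frame with the
    prefix 'mut_'; the same mutation file is reused for every frame.
    """
    commands = []
    for frame in sorted(Path(frame_dir).glob(pattern)):
        output = frame.with_name(f"mut_{frame.name}")
        commands.append(["python", mutate_script,
                         "-f", str(frame),
                         "-o", str(output),
                         "-ff", force_field,
                         "-script", str(script_file)])
    return commands
```

Each command in the returned list can then be executed with subprocess.run(cmd, check=True), so that a failure on any frame stops the workflow instead of passing unnoticed.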
5 Summary
In summary, we have presented a step-by-step protocol for the
calculation of free energy changes upon amino acid mutation. As
an example, we have shown how these calculations can be used to
estimate the destabilizing effect of the W6F mutation on the fold of
the Trp cage protein using the pmx software. Throughout the text,
we have also pointed out common issues that may be encountered
in alchemical non-equilibrium free energy calculations, as well as
their solutions. Furthermore, we discussed how the protocol can be
automated and scaled up in order to better fit the requirements of
applications that involve large mutational scans, such as protein
design. Finally, we remind the reader that while here we focussed
on amino acid mutations, pmx also allows one to set up, run, and
analyze alchemical free energy calculations that involve the mutation
of nucleic acids, and development is in progress to support
arbitrary organic molecules (ligands). Thus, overall, pmx and the
free energy calculation protocol presented here are flexible tools
that can find broad application in various fields of computational
biophysics and chemistry.
6 Notes
1. When using equilibrium alchemical free energy calculation
protocols (equilibrium TI or FEP) it is usually recommended
to perform transformations of the van der Waals interactions
after turning off the charges on the morphed atoms. In this
scenario, only the van der Waals interactions need to be soft-
cored, while the Coulombic interactions may be calculated
using the unmodified Hamiltonian. In principle, the same
procedure could be applied for the non-equilibrium transitions
as well; however, it is more convenient to perform the fast
alchemical transitions by morphing the Lennard-Jones parameters
and charges simultaneously. In this case, both the van
der Waals and Coulombic interactions have to be modified
using the soft-core potential. If the default Gromacs 2016
soft-core implementation leads to an erratic behavior of the
∂H/∂λ curves (e.g., unreproducible spikes orders of magni-
tude larger than the average values), an alternative soft-core
implementation can be found on the pmx webserver’s down-
load section, which is more suitable for the non-equilibrium
protocol described in this chapter [63].
Depending on the software version, Gromacs may issue a
warning during grompp execution regarding the use of the
soft-core interactions when van der Waals interactions are not
decoupled. In the context of the non-equilibrium free energy
calculations it is safe to override this warning with the grompp
flag -maxwarn.
2. In order to be able to use the hybrid/alchemical force fields
available in pmx, the environment variable $GMXLIB needs to
be set. This is required as Gromacs uses the path specified in
$GMXLIB to locate additional force field libraries. In pmx, all
available hybrid force fields can be found in $PMXHOME/data/
mutff45, where $PMXHOME is the absolute path to the pmx
source folder. Thus, to allow Gromacs to find the pmx force
fields, you should run the following command (in bash shell):
export GMXLIB=$PMXHOME/data/mutff45
3. The “-ignh” flag tells pdb2gmx to ignore the hydrogen atoms
present in the input structure. In this way, the tool adds the
hydrogen atoms again using its own logic. This can be useful
when there are hydrogen atoms in the input structure with
atom names that are not recognized by Gromacs and/or not
present in the force field of choice. If the flag is not set,
pdb2gmx will keep the hydrogen atoms present in the structure,
which can be useful if external programs were used to deter-
mine the protonation states of the protein’s residues, or if it is
preferred to keep the protonation states determined experi-
mentally (e.g., via neutron diffraction). However, in this case
one needs to make sure the names of the hydrogen atoms
conform to the naming used in the selected force field, other-
wise pdb2gmx will raise an error. An alternative is to expand the
aminoacids.arn file in the force field library of interest to
introduce a mapping between the hydrogen atom names in the
input structure and in the force field.
4. The Gromacs command trjconv allows the user to convert
and manipulate trajectory files, and comes in handy when one
wants to extract single frames to be used as starting points for
the non-equilibrium simulations. In particular, the flag -b
discards all frames before a given time, defined in
picoseconds. The flag -sep tells the program to
write each snapshot as a separate indexed coordinate file. The
flag -skip tells the program to extract only every n-th frame.
The flags -ur and -pbc keep molecules intact across the
periodic boundaries. For instance, in the example with Trp
cage, we ran the following trjconv command:
gmx trjconv -f equilibrium_sim.xtc -s equilibrium_sim.tpr -o frame_.pdb -sep -b 2001 -skip 1 -ur compact -pbc mol
In this way, as we saved coordinates to the trajectory file every
80 ps, we obtained 100 .pdb files called frame_n.pdb, with n
from 0 to 99.
Here, we used an .xtc file to store the trajectory data and .pdb
files to extract the snapshots. These file formats contain only
the atom coordinates, but no velocities; therefore, when
generating non-equilibrium runs, the flag gen-vel = yes in the
.mdp file needs to be set, together with a reference temperature,
to generate velocities from a Maxwell distribution. Another
option is to use the .trr files for storing the trajectory data
from equilibrium simulations. The .trr files can also store the
velocities along with the coordinates. If the .trr files are used,
the starting snapshots should be extracted as .gro files instead
of .pdb, as the .gro file format allows storing both coordinates
and velocities. Non-equilibrium runs started from initial struc-
tures generated in this way will use velocities as obtained from
the equilibrium sampling; thus, the gen-vel flag in the .mdp
file can be set to no.
5. The analytical error for the Bennett acceptance ratio (BAR)
estimator grows very rapidly even for a minor lack of overlap
between the work distributions. The rate of growth for the
analytical error often does not match the bootstrapped error
estimate, which warrants further investigation into BAR uncer-
tainty estimators. Nevertheless, a large value of the analytical
estimator may serve as a good indicator for the lack of conver-
gence during the non-equilibrium transitions.
6. The current implementation of the pmx mutation libraries follows
the single topology formalism and the bond lengths are
allowed to change between the two end states. When bond
length constraints are used during the simulations, the contri-
bution of the constraints (upon changes in bond lengths) to
δH/δλ is taken into account by Gromacs. Therefore, TI-based
approaches and the non-equilibrium free energy calculations in
Gromacs properly account for the changes in the bond length.
However, Gromacs currently does not incorporate this contri-
bution into the data used by FEP approaches to estimate free
energy differences. This means that while equilibrium FEP and
non-equilibrium approaches should return the same ΔG values
for the same mutations in theory, in practice this is not the case.
Since the mutation libraries have been generated using the
non-equilibrium free energy protocol, the tabulated values for
the tripeptide mutations should be used only in combination
with free energy calculations that make use of the δH/δλ curve
integration. For FEP-based approaches, the tripeptide muta-
tions need to be calculated separately.
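The frame extraction described in Note 4 can be checked with a short Python sketch. It only assembles the trjconv argument list and counts the snapshots to expect; it does not run Gromacs. The helper names are hypothetical, and the 10 ns length of the equilibrium run is an assumption chosen so that the count agrees with the 100 frames quoted in Note 4.

```python
import shlex

def trjconv_command(xtc, tpr, out, begin_ps, skip=1):
    """Build the trjconv argument list from Note 4 (nothing is executed)."""
    return ["gmx", "trjconv", "-f", xtc, "-s", tpr, "-o", out,
            "-sep", "-b", str(begin_ps), "-skip", str(skip),
            "-ur", "compact", "-pbc", "mol"]

def expected_frames(total_ps, save_every_ps, begin_ps, skip=1):
    """Count saved frames with time >= begin_ps, keeping every skip-th one."""
    times = range(0, total_ps + 1, save_every_ps)  # frames at 0, 80, 160, ...
    kept = [t for t in times if t >= begin_ps]
    return len(kept[::skip])

cmd = trjconv_command("equilibrium_sim.xtc", "equilibrium_sim.tpr",
                      "frame_.pdb", 2001)
print(shlex.join(cmd))

# Assumed run length: 10 ns (10000 ps), coordinates saved every 80 ps.
print(expected_frames(10000, 80, 2001))  # prints 100
```

Building the command as a list rather than a single string avoids quoting problems if it is later passed to a process launcher.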
References
1. Gapsys V, Michielssens S, Seeliger D, de Groot (2014) Bioluminescent sensor proteins for

BL (2016) Accurate and rigorous prediction of point-of-care therapeutic drug monitoring.
the changes in protein free energies in a large- Nat Chem Biol 10(7):598–603
scale mutation scan. Angew Chem Int Ed Engl 3. Feng J, Jester BW, Tinberg CE, Mandell DJ,
55(26):7364–7368 Antunes MS, Chari R, Morey KJ, Rios X,
2. Griss R, Schena A, Reymond L, Patiny L, Medford JI, Church GM, Fields S, Baker D
Werner D, Tinberg CE, Baker D, Johnsson K (2015) A general strategy to construct small
molecule biosensors in eukaryotes. eLife 13. Pires DEV, Ascher DB, Blundell TL (2014)
4:323–329 mCSM: predicting the effects of mutations in
4. Zhou L, Bosscher M, Zhang C, Özçubukçu S, proteins using graph-based signatures. Bioin-
Zhang L, Zhang W, Li CJ, Liu J, Jensen MP, formatics 30(3):335–342
Lai L, He C (2014) A protein engineered to 14. Schymkowitz J, Borg J, Stricher F, Nys R,
bind uranyl selectively and with femtomolar Rousseau F, Serrano L (2005) The FoldX web
affinity. Nat Chem 6(3):236–241 server: an online force field. Nucleic Acids Res
5. Correia BE, Bates JT, Loomis RJ, Baneyx G, 33(Suppl 2):W382–W388
Carrico C, Jardine JG, Rupert P, Correnti C, 15. Kortemme T, Baker D (2002) A simple physi-
Kalyuzhniy O, Vittal V, Connell MJ, Ste- cal model for binding energy hot spots in
vens E, Schroeter A, Chen M, MacPherson S, protein-protein complexes. Proc Natl Acad Sci
Serra AM, Adachi Y, Holmes MA, Li Y, Klevit USA 99(22):14116–14121
RE, Graham BS, Wyatt RT, Baker D, Strong 16. Leaver-Fay A, Tyka M, Lewis SM, Lange OF,
RK, Crowe JE, Johnson PR, Schief WR (2014) Thompson J, Jacak R, Kaufman K, Renfrew
Proof of principle for epitope-focused vaccine PD, Smith CA, Sheffler W, Davis IW, Coop-
design. Nature 507(7491):201–206 er S, Treuille A, Mandell DJ, Richter F, Ban
6. Koday MT, Nelson J, Chevalier A, Koday M, YEA, Fleishman SJ, Corn JE, Kim DE, Lys-
Kalinoski H, Stewart L, Carter L, Nieusma T, kov S, Berrondo M, Mentzer S, Popović Z,
Lee PS, Ward AB, Wilson IA, Dagley A, Smee Havranek JJ, Karanicolas J, Das R, Meiler J,
DF, Baker D, Fuller DH (2016) A computa- Kortemme T, Gray JJ, Kuhlman B, Baker D,
tionally designed hemagglutinin stem-binding Bradley P (2011) Rosetta3: an object-oriented
protein provides in vivo protection from influ- software suite for the simulation and design of
enza independent of a host immune response. macromolecules. Methods Enzymol 487
PLoS Pathog 12(2):e1005409 (C):545–574
7. Clark AJ, Gindin T, Zhang B, Wang L, 17. Petukh M, Li M, Alexov E (2015) Predicting
Abel R, Murret CS, Xu F, Bao A, Lu NJ, binding free energy change caused by point
Zhou T, Kwong PD, Shapiro L, Honig B, mutations with knowledge-modified
Friesner RA (2017) Free energy perturbation MM/PBSA method. PLoS Comput Biol 11
calculation of relative binding free energy (7):e1004276
between broadly neutralizing antibodies and 18. Beard H, Cholleti A, Pearlman D, Sherman W,
the gp120 glycoprotein of HIV-1. J Mol Biol Loving KA (2013) Applying physics-based
429(7):930–947 scoring to calculate free energies of binding
8. Fowler PW, Cole K, Gordon NC, Kearns AM, for single amino acid mutations in protein-
Llewelyn MJ, Peto TEA, Crook DW, Walker protein complexes. PLoS ONE 8(12):e82849
AS (2018) Robust prediction of resistance to 19. Moreira IS, Fernandes PA, Ramos MJ (2007)
trimethoprim in Staphylococcus aureus. Cell Computational alanine scanning mutagenesis -
Chem Biol 25:339–349 An improved methodological approach. J
9. Hauser K, Negron C, Albanese SK, Ray S, Comput Chem 28(3):644–654
Steinbrecher T, Abel R, Chodera JD, Wang L 20. Seeliger D, de Groot BL (2010) Protein ther-
(2018) Predicting resistance of clinical Abl mostability calculations using alchemical free
mutations to targeted kinase inhibitors using energy simulations. Biophys J 98
alchemical free-energy calculations. Commun (10):2309–2316
Biol 1:70 21. Chipot C, Pohorille A (eds) (2007) Free
10. Tinberg CE, Khare SD, Dou J, Doyle L, energy calculations: theory and applications in
Nelson JW, Schena A, Jankowski W, Kalodi- chemistry and biology, vol 86. Springer, Berlin
mos CG, Johnsson K, Stoddard BL, Baker D 22. Neidigh JW, Fesinmeyer RM, Andersen NH
(2013) Computational design of ligand- (2002) Designing a 20-residue protein. Nat
binding proteins with high affinity and selectiv- Struct Mol Biol 9(6):425–430
ity. Nature 501(7466):212
23. Abraham MJ, Murtola T, Schulz R, Páll S,
11. Yang W, Lai L (2017) Computational design Smith JC, Hess B, Lindahl E (2015) GRO-
of ligand-binding proteins. Curr Opin Struct MACS: high performance molecular simula-
Biol 45:67–73 tions through multi-level parallelism from
12. Brender JR, Zhang Y (2015) Predicting the laptops to supercomputers. SoftwareX 2:1–7
effect of mutations on protein-protein binding 24. Gapsys V, Michielssens S, Seeliger D, de Groot
interactions through structure-based interface BL (2015) pmx: automated protein structure
profiles. PLoS Comput Biol 11(10):e1004494 and topology generation for alchemical pertur-
bations. J Comput Chem 36(5):348–354
25. Chipot C (2014) Frontiers in free-energy cal- 41. Wood RH, Mühlbauer WCF, Thompson PT
culations of biological systems. Wiley Interdis- (1991) Systematic errors in free energy pertur-
cip Rev Comput Mol Sci 4(1):71–89 bation calculations due to a finite sample of
26. Gapsys V, Michielssens S, Peters JH, de Groot configuration space: sample-size hysteresis. J
BL, Leonov H (2015) Molecular modeling of Phys Chem 95(17):6670–6675
proteins, vol 1215. Humana Press, New York 42. Gore J, Ritort F, Bustamante C (2003) Bias
27. Pohorille A, Jarzynski C, Chipot C (2010) and error in estimates of equilibrium free-
Good practices in free-energy calculations. J energy differences from nonequilibrium mea-
Phys Chem B 114(32):10235–10253 surements. Proc Natl Acad Sci USA 100
28. Hansen N, van Gunsteren WF (2014) Practical (22):12564–12569
aspects of free-energy calculations: a review. J 43. Nanda H, Lu N, Woolf TB (2005) Using
Chem Theory Comput 10(7):2632–2647 non-Gaussian density functional fits to improve
29. Goette M, Grubmüller H (2009) Accuracy relative free energy calculations. J Chem Phys
and convergence of free energy differences cal- 122(13):134110
culated from nonequilibrium switching pro- 44. Massey FJ Jr (1951) Kolmogorov-Smirnov test
cesses. J Comput Chem 30(3):447–456 for goodness of fit. Test 46(253):68– 78
30. Jarzynski C (1997) Nonequilibrium equality 45. Efron B, Tibshirani RJ (1994) An introduction
for free energy differences. Phys Rev Lett 78 to the bootstrap, vol 5, 1st edn. Chapman and
(14):2690–2693 Hall/CRC, London/West Palm Beach
31. Jarzynski C (1997) Equilibrium free-energy 46. Bennett CH (1976) Efficient estimation of free
differences from nonequilibrium measure- energy differences from Monte Carlo data. J
ments: A master-equation approach. Phys Rev Comput Phys 22(2):245–268
E 56:5018–5035 47. Shirts MR, Bair E, Hooker G, Pande VS
32. Crooks GE (1998) Nonequilibrium measure- (2003) Equilibrium free energies from non-
ments of free energy differences for microscop- equilibrium measurements using maximum-
ically reversible Markovian systems. J Stat Phys likelihood methods. Phys Rev Lett 91
90(5/6):1481–1487 (14):140601
33. Crooks GE (1999) Entropy production fluctu- 48. Nelder JA, Mead R (1964) A simplex method
ation theorem and the nonequilibrium work for function minimization. Comput J 7
relation for free energy differences. Phys Rev (4):308–313
E 60(3):2721–2726 49. Hahn AM, Then H (2010) Measuring the
34. Crooks GE (2000) Path-ensemble averages in convergence of Monte Carlo free-energy calcu-
systems driven far from equilibrium. Phys Rev lations. Phys Rev E Stat Nonlinear Soft Matter
E 61(3):2361–2366 Phys 81(4):041117
35. Hummer G, Szabo A (2001) Free energy 50. Lindorff-Larsen K, Trbovic N, Maragakis P,
reconstruction from nonequilibrium single- Piana S, Shaw DE (2012) Structure and
molecule pulling experiments. Proc Natl Acad dynamics of an unfolded protein examined by
Sci USA 98(7):3658–3661 molecular dynamics simulation. J Am Chem
36. Hummer G (2001) Fast-growth thermody- Soc 134(8):3787–3791
namic integration: error and efficiency analysis. 51. Rauscher S, Gapsys V, Gajda MJ, Zweckstet-
J Chem Phys 114(17):7330–7337 ter M, de Groot BL, Grubmüller H (2015)
37. Hummer G, Szabo A (2005) Free energy sur- Structural ensembles of intrinsically disordered
faces from single-molecule force spectroscopy. proteins depend strongly on force field: a com-
Acc Chem Res 38(7):504–513 parison to experiment. J Chem Theory Com-
put 11(11):5513–5524
38. Zwanzig RW (1954) High-temperature equa-
tion of state by a perturbation method. 52. Prevost M, Wodak SJ, Tidor B, Karplus M
I. nonpolar gases. J Chem Phys 22 (1991) Contribution of the hydrophobic effect
(8):1420–1426 to protein stability: analysis based on simula-
tions of the Ile-96 ! Ala mutation in barnase.
39. Kirkwood JG (1935) Statistical mechanics of Proc Natl Acad Sci USA 88(23):10880–10884
fluid mixtures. J Chem Phys 3(5):300–313
53. Sneddon SF, Tobias DJ (1992) The role of
40. Cuendet MA (2006) The Jarzynski identity packing interactions in stabilizing folded pro-
derived from general Hamiltonian or teins. Biochemistry 31(10):2842–2846
non-Hamiltonian dynamics reproducing NVT
or NPT ensembles. J Chem Phys 125 54. Pitera JW, Kollman PA (2000) Exhaustive
(14):144109 mutagenesis in silico: multicoordinate free
energy calculations on proteins and peptides. 68. Vriend G (1990) WHAT IF: a molecular mod-
Proteins Struct Funct Bioinf 41(3):385–397 eling and drug design program. J Mol Graph 8
55. Pearlman DA, Kollman PA (1991) The over- (1):52–56
looked bond-stretching contribution in free 69. Hornak V, Abel R, Okur A, Strockbine B,
energy perturbation calculations. J Chem Roitberg A, Simmerling C (2006) Compari-
Phys 94(6):4532 son of multiple amber force fields and develop-
56. Pearlman DA (1994) A comparison of alterna- ment of improved protein backbone
tive approaches to free energy calculations. J parameters. Proteins Struct Funct Bioinf 65
Phys Chem 98(5):1487–1493 (3):712–725
57. Boresch S, Karplus M (1999) The role of 70. Best RB, Hummer G (2009) Optimized
bonded terms in free energy simulations: molecular dynamics force fields applied to the
1. Theoretical analysis. J Phys Chem A 103 helix-coil transition of polypeptides. J Phys
(1):103–118 Chem B 113(26):9004–9015
58. Boresch S, Karplus M (1996) The Jacobian 71. Lindorff-Larsen K, Piana S, Palmo K, Mar-
factor in free energy simulations. J Chem Phys agakis P, Klepeis JL, Dror RO, Shaw DE
105(12):5145–5154 (2010) Improved side-chain torsion potentials
59. Boresch S, Karplus M (1999) The role of for the Amber ff99SB protein force field. Pro-
bonded terms in free energy simulations. teins Struct Funct Bioinf 78(8):1950–1958
2. Calculation of their influence on free energy 72. Lindahl E (2015) Molecular dynamics simula-
differences of solvation. J Phys Chem A 103 tions. In: Molecular modeling of proteins.
(1):119–136 Springer, Berlin, pp 3–26
60. Beutler TC, Mark AE, van Schaik RC, Gerber 73. Barua B, Andersen NH (2001) Determinants
PR, van Gunsteren WF (1994) Avoiding sin- of miniprotein stability: can anything replace a
gularities and numerical instabilities in free buried H-bonded Trp sidechain? Lett Pept Sci
energy calculations based on molecular simula- 8(3–5):221–226
tions. Chem Phys Lett 222(6):529–539 74. Barua B, Lin JC, Williams VD, Kummler P,
61. Zacharias M, Straatsma TP, McCammon JA Neidigh JW, Andersen NH (2008) The
(1994) Separation-shifted scaling, a new scal- Trp-cage: optimizing the stability of a globular
ing method for Lennard-Jones interactions in miniprotein. Protein Eng Des Sel 21
thermodynamic integration. J Chem Phys (3):171–185
100:9025–9031 75. Darden T, York D, Pedersen L (1993) Particle
62. Pham TT, Shirts MR (2011) Identifying low mesh Ewald: an Nlog(N) method for Ewald
variance pathways for free energy calculations sums in large systems. J Chem Phys 98
of molecular transformations in solution phase. (12):10089–10092
J Chem Phys 135(3):034114 76. Essmann U, Perera L, Berkowitz ML, Dar-
63. Gapsys V, Seeliger D, de Groot BL (2012) den T, Lee H, Pedersen LG (1995) A smooth
New soft-core potential function for molecular particle mesh Ewald method. J Chem Phys 103
dynamics based alchemical free energy calcula- (19):8577–8593
tions. J Chem Theory Comput 8 77. Rocklin GJ, Mobley DL, Dill KA, Hünenber-
(7):2373–2382 ger PH (2013) Calculating the binding free
64. Buelens FP, Grubmüller H (2012) Linear- energies of charged species based on explicit-
scaling soft-core scheme for alchemical free solvent simulations employing lattice-sum
energy calculations. J Comput Chem 33 methods: an accurate correction scheme for
(1):25–33 electrostatic finite-size effects. J Chem Phys
65. Gapsys V, de Groot BL (2017) pmx Webserver: 139(18):184103
a user friendly interface for alchemistry. J Chem 78. Lin Y-L, Aleksandrov A, Simonson T, Roux B
Inf Model 57(2):109–114 (2014) An overview of electrostatic free energy
66. Šali A, Blundell TL (1993) Comparative pro- computations for solutions and proteins. J
tein modelling by satisfaction of spatial Chem Theory Comput 10(7):2690–2709
restraints. J Mol Biol 234(3):779–815 79. Hub JS, de Groot BL, Grubmüller H, Groen-
67. Schrödinger, LLC (2015) The PyMOL molec- hof G (2014) Quantifying artifacts in Ewald
ular graphics system, version 1.8, November simulations of inhomogeneous systems with a
2015 net charge. J Chem Theory Comput 10
(1):381–390
Chapter 3
Protocols for the Molecular Evolutionary Analysis of Membrane Protein Gene Duplicates
Laurel R. Yohe, Liang Liu, Liliana M. Dávalos, and David A. Liberles
Abstract
Gene duplication is an important process in the evolution of gene content in eukaryotic genomes.
Understanding when gene duplicates contribute new molecular functions to genomes through molecular
adaptation is one important goal in comparative genomics. In large gene families, however, characterizing
adaptation and neofunctionalization across species is challenging, as models have traditionally quantified
the timing of duplications without considering underlying gene trees. This protocol combines multiple
approaches to detect adaptation in protein duplicates at a phylogenetic scale. We include a description of
models for gene tree-species tree reconciliation that enable different types of inference, as well as a practical
guide to their use. Although simulation-based approaches successfully detect shifts in the rate of duplica-
tion/retention, the conflation between the duplication and retention processes, the distinct trajectories of
duplicates under non-, sub-, and neofunctionalization, as well as dosage effects offer hitherto unexplored
analytical avenues. We introduce mathematical descriptions of these probabilities and offer a road map to
computational implementation whose starting point is parsimony reconciliation. Sequence evolution
information based on the ratio of nonsynonymous to synonymous nucleotide substitution rates (dN/dS)
can be combined with duplicate survival probabilities to better predict the emergence of new molecular
functions in retained duplicates. Together, these methods enable characterization of potentially
adaptive candidate duplicates whose neofunctionalization may contribute to phenotypic divergence across
species.
Key words Gene duplication, Gene tree, Birth-death models, Molecular evolution, dN/dS
1 Introduction
1.1 Gene Duplication and Membrane Proteins
The evolutionary mechanisms for generating novelty are key to
understanding variation in phenotypic and taxonomic diversity
across the Tree of Life. Identifying the genetic mechanisms behind
the origin and maintenance of phenotypic diversity is therefore a
fundamental objective of evolutionary genetics. While base pair
substitutions provide a means for understanding the novel function
of existing genes, the duplication of entire genes and genomes
offers a source of new variation for functional diversification. Dupli-
cations are primary sources of innovation, from large-scale
whole-genome duplications that may prompt speciation, seen in notable
examples of teleost fish [1–3] or extraordinary polyploidy observed
across plants [4–6], to duplications of a single gene, such as the
expansion of multiple ion channels associated with the evolution of
neural system complexity [7].
[Figure 1. Top panel: population-level model (complete duplication followed by pseudogenization, dosage effect, subfunctionalization, or neofunctionalization). Bottom panel: species-level model (Species 1, Species 2, Species 3).]
Fig. 1 Theoretical model of single-copy gene duplication and mechanisms for how a duplicate is fixed or lost in
a population (top). Different patterns indicate different fixed amino acid differences. Grayed genes indicate
loss of function. Note changes can also happen in regulatory regions, but are not shown here. The species-
level model (bottom) is a cartoon of hypothetical scenarios that may be observed across species and their
potential mechanisms. Figure adapted from [11]
Just as new species evolve from ancestral lineages, new genes
can evolve from those already present in the genome, and gene
duplication is a primary molecular mechanism for the evolution of
novel genes [8]. However, testing whether gene duplication is
adaptive remains an unresolved challenge in evolutionary biology.
In this chapter, we present an overview of the current methods used
for studying gene duplication across species, and we describe a
theoretical approach that integrates across several methodologies.
We focus specifically on detecting adaptation in small-scale duplica-
tions from a single gene. Our primary emphasis is on membrane
proteins, as many of these proteins are encoded by genes that evolve
through a birth-death process that is a central mechanism to the
model we propose.
A new gene may follow one of several trajectories after duplica-
tion (Fig. 1). Most probably, the duplicate is deleterious or neutral,
does not fix in the population, and is lost [9–12]. It may also be
retained, either because it is adaptive or because of drift. In the
adaptive scenario, the duplicate either takes on a novel coding
sequence or expression function, or keeps the same coding
and expression domain functions as its ancestor while increasing the
expression of the gene product from redundant gene copies, a
phenomenon known as the dosage effect [8–10, 13]. In a nonadaptive
scenario, the copy may fix but will likely pseudogenize after many
generations unless subfunctionalization occurs. In each of these
outcomes, the probability of gene retention and loss can be modeled
as a function of time. The rates of amino acid-changing and
silent substitutions that occur in each of these outcomes differ and
can be informative in determining the fate of a gene duplicate. The
overarching objective of this chapter is to quantify these distinct
processes, present methods of simulation for different models, and
synthesize the outcomes into biologically relevant interpretations
of adaptation and loss.
The domain of a protein is the coding sequence that encodes
the amino acid residues, and proteins can be composed of a single
domain or several. These domains are the evolutionary unit of a
protein, as part or all of the domain may undergo duplication or
recombination or accumulate mutations that may affect protein
function [14]. Membrane proteins are critical to several indispens-
able cellular functions including signal recognition, signal trans-
duction, and transportation of materials into and out of the cell. In
addition to these functions, membrane proteins are constrained to
maintaining domains that enable the insertion, and that preserve
the orientation, of the protein in the lipid bilayer of the cell mem-
brane [15]. Membrane proteins also show a preference for posi-
tively charged residues that interact with the cytoplasmic side of the
membrane [16]. With these constraints in mind, membrane pro-
teins that respond to extracellular signals from the environment
must also have binding sites for their respective ligands. Chemo-
sensory receptors and immune-related membrane proteins involved
in pathogen recognition encounter natural selection to detect ever-
changing environmental cues. Many genes that encode these pro-
teins evolve in a concerted birth-death fashion, in which genes
duplicate, and duplicates may evolve a new function or pseudogen-
ize [17]. This mechanism leads to a pattern of many closely related
genes with similar and divergent function that can be classified as a
multigene family.
2 Methods
2.1 Approaches and Limitations to Studying Gene Duplication
There are two major approaches to investigating the evolutionary
process of gene duplication among species: birth-death models fit to
a species tree and gene tree-species tree reconciliation. Several
methodologies have been published using gene tree-species tree
reconciliation [18–22]. This approach allows detection of branches in
which duplications and losses of particular gene copies
occur, modeling the history of gene copies as a function of
speciation events. However, currently available methods are either
parsimony-based or do not estimate rates of gene retention
[18, 23]. Importantly, any computed rate of loss is a homogeneous
function of time along branches of the species tree, instead of a
function relating loss to the age of the duplicate. This is a problem
because the loss rate should not be constant through time. Instead,
the probability of gene retention decays with duplicate age, making
the loss rate a function of the time since duplication. Current inter-
specific models also conflate mutation and fixation, overlooking the
time between these events. Future work could include the develop-
ment of mutation-selection style models for gene duplication.
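The contrast between a constant loss rate and a loss rate that depends on duplicate age can be made concrete with a small sketch. The Weibull form used here is only an illustrative assumption (a shape parameter below 1 gives a loss rate that declines with age); it is not the model developed in this chapter, and all parameter values are hypothetical.

```python
import math

def survival_constant(t, loss_rate):
    """Retention probability when the loss rate is the same at every age."""
    return math.exp(-loss_rate * t)

def hazard_weibull(t, scale, shape):
    """Instantaneous loss rate at duplicate age t (illustrative Weibull form)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def survival_weibull(t, scale, shape):
    """Retention probability implied by the Weibull hazard above."""
    return math.exp(-((t / scale) ** shape))

# Hypothetical parameters; time units are arbitrary.
for age in (1.0, 5.0, 20.0):
    print(age,
          round(survival_constant(age, 0.2), 4),
          round(hazard_weibull(age, 5.0, 0.5), 4),
          round(survival_weibull(age, 5.0, 0.5), 4))
```

With a constant rate the hazard is the same for young and old duplicates, whereas in the age-dependent form old duplicates are lost more slowly than young ones.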
The second approach estimates rates of birth (duplication) and
death (pseudogenization/loss) and tests if there are increased rates
of either in different parts of the tree [24–26]. These methods
calculate the likelihood of gene family data based on a birth and
death rate while also considering branch lengths of species diver-
gence times [24, 25]. This framework allows for explicit hypothesis
testing of different birth-death rates in different parts of the tree
but is subject to several assumptions, discussed below.
We provide an overview of these methods used for studying
adaptation of gene duplication. Our examples provide a conceptual
framework on how to define biologically meaningful questions in
gene duplication analyses in a way that enables quantitative tests.
Our examples also demonstrate strong caveats and ever-present
assumptions in gene duplication analyses at the phylogenetic scale.
First, we demonstrate a gene tree-species tree reconciliation method
using parsimony. Second, we show how to test if the number of
inferred duplications and losses is significantly higher or lower than
expected under a null birth-death process through simulations.
Third, we present the theory for developing a more integrated
approach to characterize the different fates of gene duplicates.
2.1.1 Parsimony-Based Reconciliation
One early and common approach to gene tree-species tree
reconciliation is to use the principle of parsimony to minimize either the
duplication or the loss cost associated with mapping lineages of
gene trees to branches of the species tree. This approach provides a
valuable preliminary analysis for identifying discordance between
the gene tree topology and the species tree (when the species tree
relationship is not recovered within the gene family). Early
approaches required the gene and species trees to be fully resolved
with binary nodes, but subsequent approaches relaxed this assump-
tion (see [27] for a review). As in parsimony-based tree reconstruc-
tion, the insensitivity of parsimony to duplication rates on branches
with different lengths is a potential problem. A previous study has
evaluated the relationship of different costs of accounting for gene
tree discordance to each other in a parsimony context, which
represents a starting point for comparing these with model-based
reconciliations under different models [28].
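As a minimal sketch of the idea behind parsimony reconciliation, the following Python code maps each node of a binary gene tree to the most recent common ancestor of its species in the species tree and counts a duplication whenever a node maps to the same species-tree node as one of its children. This is the textbook mapping approach, written under simplifying assumptions (binary trees, losses not counted); it is not the algorithm implemented in any particular program, and the toy trees and function names are hypothetical.

```python
def clades(tree):
    """Return the leaf set of every node in a nested-tuple species tree."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    left, right = clades(tree[0]), clades(tree[1])
    # The last element of each child list is that child's own leaf set.
    return left + right + [left[-1] | right[-1]]

def lca(species_clades, species):
    """Smallest species-tree clade containing all of `species`."""
    return min((c for c in species_clades if species <= c), key=len)

def count_duplications(gene_tree, species_clades):
    """Return (species under this gene node, its mapping, duplications below)."""
    if isinstance(gene_tree, str):
        leaf = frozenset([gene_tree])
        return leaf, leaf, 0
    sp_l, map_l, dup_l = count_duplications(gene_tree[0], species_clades)
    sp_r, map_r, dup_r = count_duplications(gene_tree[1], species_clades)
    species = sp_l | sp_r
    mapping = lca(species_clades, species)
    # A node is a duplication if it maps to the same species node as a child.
    is_dup = mapping in (map_l, map_r)
    return species, mapping, dup_l + dup_r + int(is_dup)

# Toy example: gene-tree tips are labeled with the species they come from.
species_tree = (("aphid", "whitefly"), "louse")
gene_tree = ((("aphid", "whitefly"), ("aphid", "whitefly")), "louse")
print(count_duplications(gene_tree, clades(species_tree))[2])  # prints 1
```

In the toy example, the two (aphid, whitefly) subtrees both map to the aphid-whitefly ancestor, so their common parent is inferred as a single duplication on that ancestral branch.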
Here we provide an example of the amino acid transport pro-
tein gene family known as the amino acid-polyamine-organocation
(APC) transporters in the sap-feeding insect suborder Sternor-
rhyncha. These insects have evolved a tight symbiotic relationship
with gut bacteria that provides essential amino acids to supplement
a nutrient-poor diet of phloem. Amino acid transport proteins
facilitate the exchange of amino acids between the symbiont and
its host across the bacteriocytes. It was known that some species of
sap-feeding insects had multiple gene copies of APC transporters
[29], but whether these duplications occurred prior to the radia-
tion of sap-feeding insects was unclear. If an expansion of the
number of APC transport proteins had occurred within this clade,
it might be related to the increased reliance on nutrient supple-
ments from gut symbionts. To answer this question, a published
study implemented several reconciliation and birth-death methods
to model the evolutionary history of the gene tree [30]. We first
present the reconciliation of the APC gene tree with the Hemiptera
species tree to demonstrate parsimony inference of duplications and
losses (Fig. 2). Parsimony reconciliation was inferred using Notung
[18]. Reconciliation can also be performed using a likelihood-
based method (in this case, DupliPHY-ML [31]) that yields similar
results (Fig. 3a).
Figure 2a shows that several lineages within Sternorrhyncha
have experienced an expansion in the number of copies of APC
transporters, as well as an expansion at the base of the group.
However, in addition to the statistical inconsistency of parsimony
inference when many changes accumulate, there is no hypothesis
testing involved in describing whether any of these duplications or
losses differ from what is expected under a null evolutionary
model of birth-death.
2.1.2 Birth-Death Models of Gene Duplication
Early models for gene duplication were traditional birth and death
models. In these, the number of duplicate copies evolves through a
stochastic birth-death process in which retention and loss are
modeled with an exponential distribution [32]. Key parameters
estimated in birth-death models are the birth and death rates of
the genes, as well as the number of gene copies at each internal
node. These models set up a statistical framework that describes
how rates of gene duplication and loss may vary in different parts of
the tree.
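A birth-death process of this kind can be sketched in a few lines of Python. The code below simulates gene copy number along a single branch with a Gillespie-style algorithm, tallies duplications and losses, and compares a hypothetical observed count against the simulated null distribution. The rates, branch duration, starting copy number, and observed value are all illustrative assumptions, and the code is not a substitute for the dedicated simulation software cited in this chapter.

```python
import random

def simulate_branch(n0, birth, death, duration, rng):
    """Gillespie simulation of a linear birth-death process along one branch.

    Returns (final copy number, duplications, losses)."""
    n, t, dups, losses = n0, 0.0, 0, 0
    while n > 0 and birth + death > 0:
        total_rate = n * (birth + death)
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t > duration:
            break
        if rng.random() < birth / (birth + death):
            n, dups = n + 1, dups + 1      # one copy duplicates
        else:
            n, losses = n - 1, losses + 1  # one copy is lost
    return n, dups, losses

def empirical_p_upper(observed, null_values):
    """Fraction of null replicates at least as large as the observed value."""
    hits = sum(1 for v in null_values if v >= observed)
    return (hits + 1) / (len(null_values) + 1)  # add-one correction

rng = random.Random(1)
# Hypothetical values: 10 ancestral copies, equal rates, 100 time units.
null_dups = [simulate_branch(10, 0.002, 0.002, 100.0, rng)[1]
             for _ in range(1000)]
print(empirical_p_upper(8, null_dups))
```

The add-one correction keeps the empirical p-value strictly above zero, which is a common convention when the number of replicates is finite.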
[Figure 2. Panel A: species tree with a time axis in Ma (200 to 0). Panel B: gene tree.]
Fig. 2 (a) Species tree for Hemiptera insect order, denoted with the Sternorrhyncha sap-feeding insect
suborder. The human body louse is an outgroup. The fruit fly (Drosophila melanogaster) was omitted from the
species tree for clarity. Gray boxes indicate the number of gene copies inferred for each species and at each
ancestral node. Branch labels indicate the number of duplications (+) or losses (-) inferred to have occurred
at each respective branch as inferred using parsimony. (b) Gene tree of the APC amino acid transporter family.
Each tip is a unique gene copy belonging to the species labeled at the tip
In the context of our example with the APC transporters in
hemipteran insects, the parsimony inference suggests there may be
an increased rate of gene duplication in Sternorrhyncha compared
to other insects in the order. Likelihood-based birth-death models
explicitly test whether multiple birth rates in different parts of the
tree (in this case Sternorrhyncha v. background branches) better fit
the data than a single birth rate for the entire phylogeny. A previous
study estimated the birth rate (b) for different parts of the tree and
found that a model with a single b for the entire phylogeny was a
better fit than a model with a separate estimate of b for the Sternor-
rhyncha suborder (Table 1) [30]. Thus, from this approach, evi-
dence does not support increased rates of duplication in sap-feeding
insects.
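The model comparison behind this conclusion is a likelihood ratio test. The sketch below shows the arithmetic under the assumption that the test statistic is compared with a chi-square distribution with one degree of freedom (the two-rate model has one more parameter than the single-rate model); the source does not state the reference distribution that was used, so this is an illustration and not a reproduction of the published analysis.

```python
import math

def likelihood_ratio(loglik_null, loglik_alt):
    """LR statistic: twice the log-likelihood gain of the richer model."""
    return 2.0 * (loglik_alt - loglik_null)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# LR reported in Table 1 for the multiple-b versus single-b comparison.
p = chi2_sf_1df(1.52)
print(round(p, 3))
```

Under this assumption the p-value comes out at roughly 0.2, well above conventional significance thresholds, which agrees with the conclusion that the two-rate model is not supported. Small differences from the tabulated p-value may reflect rounding or a different reference distribution.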
While this approach can identify the species tree branches in
which increased rates of duplication events occurred, it ignores the
gene tree. Unlike reconciliation approaches, phyletic birth-death
models simply fit parameters to numbers of gene copies, instead of
actually considering if particular orthologs or paralogs are observed
across species. Simulations of gene trees under similar birth and
death rates estimated from one’s data can provide a more thorough
understanding of a null model of birth and death rate estimates
[Figure 3 image not reproduced. Panel (a): species tree with a time
axis from 200 to 0 Ma, labeled with gene copy numbers and the inferred
duplications (+) and losses (−) per branch. Panels (b) and (c):
histograms of simulated replicates (y axis: Replicate; x axes:
Duplication and Loss), annotated with p < 10^−4 and p = 0.24,
respectively.]
Fig. 3 Likelihood-based reconciliation of the APC transport proteins in Hemiptera. (a) Duplications and losses
labeled on branches were inferred from reconciliation analyses in DupliPHY-ML v. 1.2 [31]. Gray boxes are
number of APC transporter gene copies in each species or inferred at the ancestral node. (b) Simulation of
expected number of duplications for Sternorrhyncha under a null birth-death process. The dotted line is the
cumulative number of duplications observed from the DupliPHY-ML results. (c) Simulation of expected number
of losses for Sternorrhyncha under a null birth-death process. The dotted line is the cumulative number of
losses observed from the DupliPHY-ML results. P-values test whether the observed value is significantly
different than the null distribution. Simulations were performed using GenPhyloData within the JPrIME v. 0.3.6
software [21]. Code for simulations is available in the supplementary material of [30]
Table 1
Hemiptera APC transporter gene family parameter estimates of likelihood-based birth-death model
and likelihood ratio test results of model comparisons between a null model of a single birth rate (b)
for the entire tree or two rates of b, one for the background branches and one for sternorrhynchans

Model        b (background)   b (Sternorrhyncha)   ML      np   LR     p-value
Single b     1.22 × 10^−3     –                    −34.3   1    –      –
Multiple b   0.73 × 10^−3     2.50 × 10^−3         −33.5   2    1.52   0.20

ML is the log-likelihood. np is the number of parameters. LR is the likelihood ratio. Inferences were made using CAFE
v. 3.1 [40]. This model assumed the rate of birth to be equal to the rate of death. Analysis derived from [30].
under a neutral process. If the number of observed fixed duplicates
or losses differs significantly from what is estimated from simulated
data, the probability of fixation might be higher or lower than is
expected by the null birth-death process. In our example with the
APC transporter genes, the study used 1000 birth-death simulations
based upon a birth rate estimated from the single b model in
Table 1 [30]. From these gene trees, the expected number of
duplications and losses could be estimated for each node of the
tree. The study compared the observed values from the
likelihood-based reconciliation (Fig. 3a) and found that sternorrhynchan
insects did indeed have a significantly higher number of
duplications (but no difference in losses) compared to what was
expected under a null birth-death scenario (Fig. 3b, c). While this
approach is still subject to assumptions made by birth-death models,
simulation experiments can provide a useful insight into null
expectations for the underlying evolutionary process.
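The logic of such a simulation test can be sketched in a few lines.
This is a simplified illustration, not the GenPhyloData/JPrIME
procedure used in [30]: it simulates a linear birth-death process along
a single lineage rather than gene trees within a species tree. The
function names, the starting copy number, the duration, and the
observed count are all hypothetical; only the rate of 1.22 × 10^−3
(with birth equal to death) follows Table 1.

```python
import random

def simulate_events(n0, birth, death, duration, rng):
    """Gillespie simulation of a linear birth-death process.

    Returns (duplications, losses) accumulated over `duration`,
    starting from n0 gene copies with per-copy rates birth and death.
    """
    n, t, dups, losses = n0, 0.0, 0, 0
    per_copy = birth + death
    while n > 0 and per_copy > 0:
        # Waiting time to the next event is exponential with rate n * (b + d).
        t += rng.expovariate(n * per_copy)
        if t > duration:
            break
        if rng.random() < birth / per_copy:
            n, dups = n + 1, dups + 1
        else:
            n, losses = n - 1, losses + 1
    return dups, losses

def empirical_p_value(observed, simulated):
    """One-sided p-value: share of replicates >= observed, with a +1 correction."""
    hits = sum(1 for value in simulated if value >= observed)
    return (hits + 1) / (len(simulated) + 1)

rng = random.Random(42)  # fixed seed for reproducibility
# Hypothetical inputs: 10 starting copies, 200 time units, 8 observed duplications.
null_dups = [simulate_events(10, 1.22e-3, 1.22e-3, 200.0, rng)[0]
             for _ in range(1000)]
print(empirical_p_value(8, null_dups))
```

The +1 correction keeps the estimated p-value away from zero, which a
finite number of replicates cannot support.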
We argue, however, that these methods may be asking the
wrong question. All models discussed so far conflate an increased
rate of birth, which is a Poisson process similar to mutation events,
with an increased rate of gene retention. In other words, instead of
testing for an increased “birth rate,” which should be intrinsically
stochastic and homogeneous throughout long time scales, it would
be ideal to measure an increased rate of gene retention. In the case
of increased rates of gene retention, duplicates may be subject to
selection and may indicate adaptation. Different processes lead to
gene retention (Fig. 1), and these processes can be modeled. We
propose an integrated framework to quantitatively differentiate
among different gene retention scenarios that may lead to more
biologically meaningful interpretations of adaptation that result
from gene duplication.
2.2 Modeling Different Fates of Gene Duplicates: Integrating Reconciliation and Birth-Death
Several biological models have been proposed to depict the
mechanisms that lead to different evolutionary fates for a gene
duplicate (Fig. 1), including pseudogenization, neofunctionalization,
subfunctionalization, or dosage effect. These mechanisms
give rise to quite different retention dynamics that can lead to a
time-dependent loss rate of gene duplicates, expressed as a function
λ(t). For nonfunctionalization, the loss rate is constant over time.
In contrast, the loss rates of neofunctionalization and
subfunctionalization decline over time and have been described with a Weibull
hazard function [8]. For dosage effect, the rate of loss increases
over time unless dosage effects are combined with subsequent
neofunctionalization or subfunctionalization [33]. Alternative
formulations with very similar dynamics have also been proposed
[13]. Figure 4 depicts the shapes of these hazard functions under
different scenarios.
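The qualitative shapes described above can be reproduced with simple
stand-in functions. The sketch below is an assumption-laden
illustration: it uses a plain two-parameter Weibull hazard, which
declines for shape < 1 and rises for shape > 1, rather than the exact
parameterizations of [8] or [33], and all parameter values are
hypothetical.

```python
import math

def constant_hazard(t, rate):
    """Nonfunctionalization: the loss rate does not depend on time."""
    return rate

def weibull_hazard(t, shape, scale):
    """Plain Weibull hazard: declining for shape < 1, rising for shape > 1."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

def survival(hazard, age, steps=10000):
    """S(age) = exp(-integral_0^age hazard(t) dt), using the midpoint rule."""
    if age <= 0:
        return 1.0
    width = age / steps
    cumulative = sum(hazard((i + 0.5) * width) for i in range(steps)) * width
    return math.exp(-cumulative)

# Hypothetical parameter values, chosen only to show the three shapes.
scenarios = {
    "nonfunctionalization": lambda t: constant_hazard(t, 0.02),
    "neo/subfunctionalization": lambda t: weibull_hazard(t, 0.5, 50.0),
    "dosage": lambda t: weibull_hazard(t, 2.0, 50.0),
}
for name, hazard in scenarios.items():
    print(name, [round(survival(hazard, age), 3) for age in (10.0, 50.0, 100.0)])
```

The midpoint rule is used so that the hazard is never evaluated at
t = 0, where a declining Weibull hazard is unbounded.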
From Reconciliation Probabilities to Birth-Death Models
In most birth-death model frameworks, the time-dependent
loss rates have been incorporated in a generalized birth-death
process to model the fate of gene duplicates. This means the
evolution of the gene copies in a gene family is modeled as a pure
birth process with a time-dependent birth rate, which is a function
of the loss and birth rates in the original birth-death process. Since
the loss rate characterizes the underlying retention mechanisms, the
inference of the loss rates can identify either nonfunctionalization,
subfunctionalization, dosage, or neofunctionalization as responsi-
ble for the observed site patterns of gene family data. However, an
important caveat of all time-dependent models is that any rate of
loss that is computed is a function of time along branches of the
[Figure 4 image not reproduced. Curves of the loss rate λ(t) against
time for nonfunctionalization, neofunctionalization,
subfunctionalization, and dosage.]
Fig. 4 Shape of the hazard function through time representing the rate of gene
loss under the four different gene retention scenarios. Figure modified from [39]
species tree, instead of relating to the age of the gene duplicate.
This is a problem because the loss rate should not be constant
through time but instead be a function of the time since duplication,
as the probability of gene retention decays with duplicate age.
Hence, it is more realistic to treat the loss rate as a function of
the ages of gene copies. We propose a theoretical solution. Let λ(t∗)
be the loss rate of a gene copy at age t∗. The age-dependent model
assumes the number of gene copies increases or decreases by 1 or
remains the same during an infinitesimal interval (t, t + Δt) with
probabilities described as follows [12]:
the probability of a gene duplication
$$P(n_{t+\Delta t} = n_t + 1) = n_t b \, \Delta t + o(\Delta t),$$
the probability of a gene loss
$$P(n_{t+\Delta t} = n_t - 1) = \sum_{i=1}^{n_t} \lambda(t_i^*) \, \Delta t + o(\Delta t),$$
and the probability that the number of copies stays the same
$$P(n_{t+\Delta t} = n_t) = 1 - \left( n_t b + \sum_{i=1}^{n_t} \lambda(t_i^*) \right) \Delta t + o(\Delta t).$$
The parameter $b$ is the birth rate; $n_t$ is the number of gene copies
at the present time $t$; $\lambda(t_i^*)$ is the loss rate of gene copy $i$
at age $t_i^*$.
The three equations lead to a stochastic differential equation
characterizing the age-dependent birth-death process. When the loss
rate is constant (nonfunctionalization), the age-dependent
birth-death model is identical to the time-dependent birth-death model
derived from the reconstructed process (see [34] for derivation).
For neofunctionalization and subfunctionalization, it has been
demonstrated by simulation that the likelihood function of the
age-dependent model differs from that of the time-dependent
model [12], and presumably for dosage as well. However, at the
present time, there is no analytic solution to the stochastic
differential equation when subfunctionalization, neofunctionalization,
and dosage are the underlying mechanisms governing the
age-dependent birth and loss rates. Research on the
age-dependent model will provide indispensable insights on the
evolution of gene duplicates.
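In the absence of an analytic solution, the age-dependent process can
still be explored by simulation. The following is a minimal
discrete-time sketch of the three transition probabilities, not the
simulation code of [12]. It assumes that a new copy starts at age 0
while its parent keeps its age, that the time step is small enough for
b Δt and λ Δt to be valid probabilities, and it uses a hypothetical
bounded, declining hazard.

```python
import math
import random

def simulate_copy_ages(n0, birth, hazard, duration, dt, rng, initial_age=0.0):
    """Simulate an age-dependent birth-death process in steps of dt.

    Each copy is lost with probability hazard(age) * dt and otherwise
    duplicates with probability birth * dt. Returns the ages of the
    copies that are present at the end.
    """
    ages = [initial_age] * n0
    for _ in range(int(round(duration / dt))):
        survivors, newborns = [], 0
        for age in ages:
            # Loss is tested first, so a lost copy cannot also duplicate.
            if rng.random() < hazard(age) * dt:
                continue
            if rng.random() < birth * dt:
                newborns += 1
            survivors.append(age + dt)
        # Assumption: the new copy has age 0, the parent keeps its age.
        ages = survivors + [0.0] * newborns
    return ages

def declining_hazard(age):
    """Hypothetical loss rate that decays from 0.022 toward 0.002 with age."""
    return 0.002 + 0.02 * math.exp(-age / 10.0)

rng = random.Random(7)  # fixed seed for reproducibility
final_ages = simulate_copy_ages(10, 0.005, declining_hazard, 200.0, 0.1, rng)
print(len(final_ages), "copies remain")
```

Repeating the simulation many times gives an empirical distribution of
copy numbers that can be compared across hazard shapes.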
The model we propose differs from existing approaches, as it
constrains the inference of duplication events with speciation events
while also calculating an age-dependent survival probability of gene
copies. If a speciation event occurs at ti, the probability of gene copy
retention is a survival probability Ej calculated from the hazard
function λ(t), which represents the instantaneous rate of loss at time t.
Instead of modeling the time associated with retention or loss as
constant through time, it will actually be calculated from the
moment the duplication occurred, which can be denoted as t∗,
reflecting the age-related duplicate notation described in the equa-
tions above [8, 9, 13].
We present a simple example to demonstrate how these prob-
abilities may be calculated and how these probabilities can then be
integrated with a gene tree-species tree reconciliation framework.
Figure 5 shows an example gene tree with one specific reconcilia-
tion solution that may have occurred throughout the history of the
gene family and species phylogeny. The solution is shown based on
parsimony. In this scenario, two duplications and one loss have
occurred in the phylogeny. The probability of retention is the
product of all survival probabilities of different events in Table 2.
The hazard function λ(t) and its corresponding survival function
are different for each outcome in Fig. 5 [8], including nonfunctio-
nalization, neofunctionalization, subfunctionalization, and dosage
effect. The product of all survival probabilities occurring for each
event (e.g., Table 2) will reflect the survival probability of all
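[Figure 5 image not reproduced. Cartoon of a gene tree (tips D, C, B1,
A1, B2, A2, A3) reconciled within a species tree (tips D, C, B, A).
Duplication and loss events are marked, speciation times t0 to t3 and
gene divergence times g1 and g2 label the nodes, and events E1 to E5
label the branches.]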
Fig. 5 Cartoon of species tree-gene tree reconciliation. Speciation times (ti) and
gene divergence times (gi) are noted on nodes. E4 is squared because it is
counting both branches from time t1. Event probabilities are listed in Table 2
Table 2
Events and probabilities of Fig. 5

Event   Description                 Probability
E1      Duplication and retention   $e^{-\int_{0}^{g_1 - t_2} \lambda(t)\,dt}$
E2      Retain duplicate            $e^{-\int_{g_1 - t_2}^{g_1 - t_1} \lambda(t)\,dt}$
E3      Lose duplicate              $1 - e^{-\int_{g_1 - t_2}^{g_1} \lambda(t)\,dt}$
E4      Retain duplicate            $e^{-\int_{g_1 - t_1}^{g_1} \lambda(t)\,dt}$
E5      Duplication and retention   $e^{-\int_{0}^{g_2} \lambda(t)\,dt}$

The probability of the reconciled tree in Fig. 5 is the product of all event probabilities.
Gray arrows indicate probabilities that do not include a speciation event. The branch
length-dependent birth rate can also be incorporated, when relevant.
duplicates in the gene tree. The best-fit hazard function model can
be determined by model selection using the Akaike or Bayesian
Information Criterion.
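The calculation for the example of Fig. 5 can be sketched numerically.
This is an illustrative sketch under stated assumptions, not an
implementation of the proposed framework: the event times, the two
hazard functions, and their parameter counts are hypothetical, the
integrals are evaluated with the midpoint rule, and E4 enters squared
as noted in the caption of Fig. 5. A real analysis would maximize the
likelihood over the hazard parameters before computing the AIC.

```python
import math

def cumulative_hazard(hazard, start, end, steps=2000):
    """Midpoint-rule approximation of the integral of hazard(t) from start to end."""
    width = (end - start) / steps
    return sum(hazard(start + (i + 0.5) * width) for i in range(steps)) * width

def retention(hazard, start, end):
    """Probability that a duplicate survives from age start to age end."""
    return math.exp(-cumulative_hazard(hazard, start, end))

def reconciliation_probability(hazard, g1, g2, t1, t2):
    """Product of the event probabilities E1 to E5 of Table 2."""
    e1 = retention(hazard, 0.0, g1 - t2)
    e2 = retention(hazard, g1 - t2, g1 - t1)
    e3 = 1.0 - retention(hazard, g1 - t2, g1)
    e4 = retention(hazard, g1 - t1, g1)
    e5 = retention(hazard, 0.0, g2)
    return e1 * e2 * e3 * e4 ** 2 * e5

def aic(log_likelihood, n_params):
    """Akaike Information Criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical times before present and hand-picked (not fitted) hazards.
g1, g2, t1, t2 = 150.0, 20.0, 50.0, 100.0
models = {
    "constant": (lambda t: 0.01, 1),
    "declining": (lambda t: 0.002 + 0.03 * math.exp(-t / 20.0), 3),
}
for name, (hazard, n_params) in models.items():
    log_p = math.log(reconciliation_probability(hazard, g1, g2, t1, t2))
    print(name, round(log_p, 3), round(aic(log_p, n_params), 3))
```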
It should be emphasized that this example only accounts for a
single set of events for one proposed reconciliation solution, as
opposed to multiple hidden events that may have also occurred.
Integrating over all possible reconciliation histories is, in theory, the
only way to account for all possible hidden events. However, this is
not a feasible solution given the possible number of hidden events
that may have occurred. A more tractable solution is to begin with a
parsimonious reconciliation and iteratively consider hidden events
and alternative reconciliations according to a branch and bound-
style approach. In this regard, a finite set of events (such as those
shown in Table 2) for each reconciliation history can be compared
with one another, and the most probable solution among this finite
set of specific histories can be calculated.
2.3 Combining Survival Probabilities with dN/dS
For each outcome in Fig. 1, there is an expected behavior of the
ratio of the rates of nonsynonymous (dN) to synonymous (dS)
substitutions (dN/dS or ω) for the gene copy (Fig. 6). The behaviors of this
ratio can reveal biologically meaningful interpretations relevant to
molecular adaptation. For example, analyses of mammalian olfactory
receptors, a hyperdiverse gene family that encodes
G-protein-coupled chemosensory receptors, have shown that some particular
orthologous gene groups have undergone rapid expansions and
have high dN/dS relative to the median, suggesting functional
diversification of these receptor types [35]. However, dN/dS is
not currently modeled in any methodology used to study gene
duplication, despite its predictable behavior under different gene
retention scenarios. When genes are initially redundant following
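[Figure 6 image not reproduced. Curves of dN/dS (y axis, with the value
1 marked) against time since birth (x axis) for nonfunctionalization,
neofunctionalization, subfunctionalization, and dosage.]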
Fig. 6 Expected dN/dS of duplicated copy after gene duplication
duplication, they are expected to show neutral evolution or at least
relaxation from purifying selection. Genes that nonfunctionalize
should continue to evolve with dN/dS = 1, whereas duplicates
that are retained through either the neofunctionalization or
subfunctionalization process should see dN/dS decay toward a rate
consistent with non-duplicated genes as an asymptote (Fig. 6).
Indeed, it has been empirically shown that accelerated rates of
dN/dS occur after duplication and then subsequently decline
[36]. There may be little information to differentiate between
neofunctionalization and subfunctionalization with these data,
although it might be anticipated that neofunctionalizing genes at
some early point have dN/dS > 1 (depending upon several factors,
including the starting value of purifying selection and the strength
of positive selection), something not expected for
subfunctionalization. For subfunctionalization, dN/dS may not initially be as
high as with neofunctionalization, as part of the gene is still under
strong purifying selection to maintain ancestral function. In the
case of selection for increased dosage, strong purifying selection is
expected from the moment of duplication, as duplicates are
functionally the same (dN/dS << 1 and constant).
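These qualitative expectations can be written down as simple curves.
The sketch below is purely illustrative: the exponential decay toward
an asymptote is an assumed functional form, not one proposed in the
text, and every numerical value (starting ω, asymptote, decay rate) is
hypothetical.

```python
import math

def omega_decay(t, omega_start, omega_asymptote, rate):
    """dN/dS that relaxes exponentially from omega_start to omega_asymptote."""
    return omega_asymptote + (omega_start - omega_asymptote) * math.exp(-rate * t)

# Assumed trajectories of dN/dS as a function of time since duplication.
EXPECTED_OMEGA = {
    "nonfunctionalization": lambda t: 1.0,                              # stays neutral
    "neofunctionalization": lambda t: omega_decay(t, 1.5, 0.2, 0.05),   # starts above 1
    "subfunctionalization": lambda t: omega_decay(t, 0.8, 0.2, 0.05),   # starts below 1
    "dosage": lambda t: 0.1,                                            # constant, << 1
}

for name, curve in EXPECTED_OMEGA.items():
    print(name, [round(curve(t), 2) for t in (0.0, 20.0, 100.0)])
```

Curves of this kind could serve as the expected values against which
branch-specific dN/dS estimates are compared.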
One previously used approach is to approximate the age of the
duplication event by building a histogram of pairwise dN/dS values
of duplicates related to dS values [9]. Across collections of genes
from a genome, each empirical frequency distribution is a sample of
an underlying duplication process. When a gene family is known, an
alternative is to examine branch-specific changes in dN/dS in
lineages downstream from a duplication event. In this scenario,
the onset of selection post-duplication in individual lineages can
be evaluated.
The dN/dS statistic is one of the most commonly used
approaches to measure the strength of selection among species,
but it can be susceptible to false positives if there is purifying
selection on synonymous mutations [37]. Meaningful dN/dS
estimates may also be problematic for recent duplicates in closely
related lineages [38]. Mutation-selection models can offer a
complementary set of tools to estimate the strength of selection and
should also be considered in this framework [37].
3 Concluding Thoughts
Gene duplication is a fundamental mechanism underlying novel
protein function. However, the fate of a gene duplicate is complex,
and it can be challenging to determine whether or not gene dupli-
cation events are adaptive at phylogenetic time scales. Reconciling
the evolutionary history of the gene family with the species tree and
estimating rates of duplication and loss are the two most common
approaches to analyzing gene duplication, but current methods are
prone to assumptions that hinder a meaningful biological interpre-
tation of parameter estimates. We proposed an approach that inte-
grates both reconciliation and birth-death models to estimate the
probabilities of different gene retention scenarios. Future research
on the implementation of such an approach will bridge theory to
practical application for a more comprehensive understanding of
adaptive gene duplication, a key process in protein evolution.
Acknowledgements
This research was supported in part by DEB-1442142 to L.M.D.,
DEB-1701414 to L.M.D., D.A.L., and L.R.Y., and DBI-1222940
to D.A.L. and L.L.
References
1. Hoegg S, Brinkmann H, Taylor JS et al (2004) Phylogenetic timing of the fish-specific genome duplication correlates with the diversification of teleost fish. J Mol Evol 59:190–203
2. Jaillon O, Aury J-M, Brunet F et al (2004) Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431:946–957
3. Lien S, Koop BF, Sandve SR et al (2016) The Atlantic salmon genome provides insights into rediploidization. Nature 533:200–205
4. The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815
5. De Bodt S, Maere S, Van De Peer Y (2005) Genome duplication and the origin of angiosperms. Trends Ecol Evol 20:591–597
6. Hollister JD (2015) Polyploidy: adaptation to the genomic environment. New Phytol 205:1034–1039
7. Liebeskind BJ, Hillis DM, Zakon HH (2015) Convergence of ion channel genome content in early animal evolution. Proc Natl Acad Sci U S A 112:E846–E851
8. Konrad A, Teufel AI, Grahnen JA et al (2011) Toward a general model for the evolutionary dynamics of gene duplicates. Genome Biol Evol 3:1197–1209
9. Hughes T, Liberles DA (2007) The pattern of evolution of smaller-scale gene duplicates in mammalian genomes is more consistent with neo- than subfunctionalisation. J Mol Evol 65:574–588
10. Hahn MW (2009) Distinguishing among evolutionary models for the maintenance of gene duplicates. J Hered 100:605–617
11. Sikosek T, Bornberg-Bauer E (2010) Evolution after and before gene duplication? In: Dittmar K, Liberles D (eds) Evolution after gene duplication. Wiley-Blackwell, Hoboken, NJ, pp 105–131
12. Zhao J, Teufel AI, Liberles DA et al (2015) A generalized birth and death process for modeling the fates of gene duplication. BMC Evol Biol 15:275
13. Teufel A, Zhao J, O’Reilly M et al (2014) On mechanistic modeling of gene content evolution: Birth-death models and mechanisms of gene birth and gene retention. Computation 2:112–130
14. Chothia C, Gough J, Vogel C et al (2003) Evolution of the protein repertoire. Science 300:1701–1703
15. von Heijne G (2006) Membrane-protein topology. Nat Rev Mol Cell Biol 7:909–918
16. Poolman B, Geertsma ER, Slotboom D-J (2007) A missing link in membrane protein evolution. Science 315:1229–1231
17. Nei M, Rooney AP (2005) Concerted and birth-and-death evolution of multigene families. Annu Rev Genet 39:121–152
18. Chen K, Durand D, Farach-colton M (2000) NOTUNG: a program for dating gene duplications. J Comput Biol 7:429–447
19. Berglund-Sonnhammer AC, Steffansson P, Betts MJ et al (2006) Optimal gene trees from sequences and species trees using a soft interpretation of parsimony. J Mol Evol 63:240–250
20. Doyon JP, Ranwez V, Daubin V et al (2011) Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform 12:392–400
21. Sjöstrand J, Sennblad B, Arvestad L et al (2012) DLRS: gene tree evolution in light of a species tree. Bioinformatics 28:2994–2995
22. Hermansen RA, Hvidsten TR, Sandve SR et al (2016) Extracting functional trends from whole genome duplication events using comparative genomics. Biol Proced Online 18:11
23. Bielawski JP, Yang Z (2003) Maximum likelihood methods for detecting adaptive evolution after gene duplication. J Struct Funct Genom 3:201–212
24. Hahn MW, De Bie T, Stajich JE et al (2005) Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res 15:1153–1160
25. Liu L, Yu L, Kalavacharla V et al (2011) A Bayesian model for gene family evolution. BMC Bioinformatics 12:426
26. Han MV, Thomas GWC, Lugo-Martinez J et al (2013) Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Mol Biol Evol 30:1987–1997
27. Eulenstein O, Huzurbazar S, Liberles DA (2010) Reconciling phylogenetic trees. In: Dittmar K, Liberles D (eds) Evolution after gene duplication. Wiley-Blackwell, Hoboken, NJ, pp 185–206
28. Górecki P, Eulenstein O (2014) Refining discordant gene trees. BMC Bioinformatics 15:S3
29. Duncan RP, Husnik F, Van LJT et al (2014) Dynamic recruitment of amino acid transporters to the insect/symbiont interface. Mol Ecol 23:1608–1623
30. Dahan RA, Duncan RP, Wilson AC et al (2015) Amino acid transporter expansions associated with the evolution of obligate endosymbiosis in sap-feeding insects (Hemiptera: Sternorrhyncha). BMC Evol Biol 15:52
31. Ames RM, Money D, Ghatge VP et al (2012) Determining the evolutionary history of gene families. Bioinformatics 28:48–55
32. Arvestad L, Lagergren J, Sennblad B (2009) The gene evolution model and computing its associated probabilities. J ACM 56(7):44
33. Teufel AI, Liu L, Liberles DA (2016) Models for gene duplication when dosage balance works as a transition state to subsequent neo- or sub-functionalization. BMC Evol Biol 16:45
34. Nee S, May RM, Harvey PH (1994) The reconstructed evolutionary process. Philos Trans R Soc Lond Ser B Biol Sci 344:305–311
35. Niimura Y, Matsui A, Touhara K (2014) Extreme expansion of the olfactory receptor gene repertoire in African elephants and evolutionary dynamics of orthologous gene groups in 13 placental mammals. Genome Res 24:1485–1496
36. Pegueroles C, Laurie S, Albà MM (2013) Accelerated evolution after gene duplication: a time-dependent process affecting just one copy. Mol Biol Evol 30:1830–1842
37. Spielman SJ, Wilke CO (2015) The relationship between dN/dS and scaled selection coefficients. Mol Biol Evol 32:1097–1108
38. Mugal CF, Wolf JBW, Kaj I (2014) Why time matters: codon evolution and the temporal dynamics of dN/dS. Mol Biol Evol 31:212–231
39. Liberles DA, Teufel AI, Liu L et al (2013) On the need for mechanistic models in computational genomics and metagenomics. Genome Biol Evol 5:2008–2018
40. De Bie T, Cristianini N, Demuth JP et al (2006) CAFE: A computational tool for the study of gene family evolution. Bioinformatics 22:1269–1271
Chapter 4
Computational Prediction of De Novo Emerged Protein-Coding Genes
Nikolaos Vakirlis and Aoife McLysaght
Abstract
De novo genes, that is, protein-coding genes originating from previously noncoding sequence, have gone
from being considered impossibly unlikely to being recognized as an important source of genetic novelty in
eukaryotic genomes. It is clear that de novo gene evolution is a rare but consistent feature of eukaryotic
genomes, being detected in every genome studied. However, different studies often use different compu-
tational methods, and the numbers and identities of the detected genes vary greatly. Here we present a
coherent protocol for the computational identification of de novo genes by comparative genomics. The
method described uses homology searches, identification of syntenic regions, and ancestral sequence
reconstruction to produce high-confidence candidates with robust evidence of de novo emergence. It is
designed to be easily applicable given the basic knowledge of bioinformatic tools and scalable so that it can
be applied on large and small datasets.
Key words De novo genes, Gene birth, New gene evolution, Novel genes, ORF formation, Protein-
coding genes, Genome-wide detection, Genome evolution
1 Introduction
New genes and protein functions are essential to the evolution of
novel phenotypes, to the adaptation to new environments, and to
the process of speciation [1]. Novel genes arise by reuse and
recombination of pre-existing ones but can also originate from
genomic sequences that were previously noncoding [2]. In the
latter scenario, a new gene, either in its entirety or in part, emerges
“de novo” (see [3] for detailed definitions). De novo gene emer-
gence can be thought of as true “gene birth” and has the greatest
potential to result in an entirely novel protein function, since the
novel protein will be free of constraints present in pre-existing,
already functional sequences [4]. Once considered so improbable
as to be impossible, the origin of new protein-coding genes de novo
has now been demonstrated in every eukaryotic lineage studied,
and these new genes have been shown to integrate into central
cellular functions (see [5] for a complete review). More and more,
researchers are coming to recognize de novo emergence as a uni-
versal evolutionary phenomenon and to appreciate its potential as a
mechanism of rapid phenotypic innovation [6] and as a genome-
shaping force [7]. As the interest around de novo genes grows, so
does the need for their accurate identification. This, however, is not
a trivial task. De novo genes are a subset of “orphan genes” also
known as “ORFans” or species-specific genes. These are genes that
are found only in a single genome (or in a closely related group of
genomes, in which case the term taxonomically restricted gene is
used) and lack homologues in any other organism. Disentangling
the evolutionary origins of orphan genes can be challenging [8],
and the results highly depend upon the employed methodology.
The initial de novo gene studies necessarily followed a stringent,
painstaking approach involving a substantial amount of manual
curation and multiple lines of evidence [9–12]. The goal was to
provide solid proof that a functional species-specific gene had
emerged from an ancestrally noncoding or nonfunctional region.
Since then, multiple studies have adopted a different type of
approach, with more relaxed criteria, but its advantages and pitfalls
are still a matter of debate [13–17].
In this chapter, we will present what can be considered as a
stringent best practice for the identification of all protein-coding
genes that have emerged entirely de novo in a single genome.
Conceptually this is the same as identifying genes that have origi-
nated de novo on a particular branch of a tree with the only
difference being that the novel gene will be present in the organ-
isms descended from that branch, and not in any outgroups. The
methodology described here can be easily adapted to that type of
study. The evolution of a novel gene requires, at the very minimum,
the origin of an open reading frame (ORF) and regulatory signals
for transcription and translation. Here we will mostly focus on the
emergence of the ORF. Starting with the complete set of annotated
protein-coding genes, we remove the ones with significant similar-
ity to genes in other genomes, resulting in a set of species-specific
genes. This set is then further reduced to the ones with identifiable
sequence similarity to their orthologous noncoding regions in
closely related outgroup genomes, from which an ancestral
sequence can be inferred. Finally, the ones for which the inferred
ancestral sequence can be shown to lack coding potential are
retained as de novo gene candidates (see Fig. 1 for a complete
outline). The approach is designed to (1) err on the side of caution
(i.e., we endeavor to avoid false positives), (2) be applicable as
widely and easily as possible, and (3) be scalable so that it can
work for large and small datasets. It is for this reason that the
method we are describing here is command-line oriented and
specific commands are provided. Nevertheless, the choice of para-
meters that one has to set throughout the application of this
Fig. 1 An outline of the different steps of the methodology described here

method will depend on the context, the biological questions asked,
and the choice of genomes. We encourage the reader to carefully
study the different parameters and options of the tools before using
them, regardless of whether these are mentioned in the examples.
Application of the computational procedure requires familiarity
with the command-line of UNIX-based systems (Bash) and some
basic knowledge of a high-level scripting language such as Python
or Perl.
2 Materials
1. Genome sequence for the focal genome (within which we want
to identify de novo genes) and at least two but ideally more
outgroup genomes (those phylogenetically closest to the
focal genome), in FASTA format.
2. The annotations of the protein-coding genes in the focal and
outgroup genomes in any of the commonly used formats
(GenBank, EMBL, GFF, etc.).
3. Nucleotide and protein sequences of all annotated
protein-coding genes of the focal and outgroup genomes, in FASTA
format.
4. The topology of the phylogenetic tree of the species in
question, in Newick format.
5. Pairs of orthologous genes between the focal genome and each
of the neighboring genomes. Orthology resources can be
found at the Quest for Orthologs website
(https://questfororthologs.org/orthology_databases).
6. The preformatted files of the NCBI NR database available from
ftp.ncbi.nlm.nih.gov/blast/db/.
7. The BLAST [18] stand-alone programs available from
ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.6.0/ (choose
according to your machine’s OS).
8. The fasta [19] programs available from
http://fasta.bioch.virginia.edu/fasta_www2/fasta_down.shtml.
9. An alignment viewer such as Jalview
(http://www.jalview.org/download) or SeaView
(http://doua.prabi.fr/software/seaview).
10. The MAFFT [20] sequence alignment program available from
http://mafft.cbrc.jp/alignment/software/ and the PRANK
[21] program available from
http://wasabiapp.org/software/prank/.
11. The genblast [22] program available from
http://genome.sfu.ca/genblast/download.html.
12. The GNU parallel command line tool available from
https://www.gnu.org/software/parallel/.
13. The faSomeRecords and faSize tools available from
http://hgdownload.cse.ucsc.edu/admin/exe/.
14. The SAMtools suite of programs available from
http://samtools.sourceforge.net/.
15. The EMBOSS suite of programs available from
http://emboss.sourceforge.net/download/.
16. The phyml [23] phylogenetic reconstruction program available
from http://www.atgc-montpellier.fr/phyml/binaries.php.
17. The tantan [24] tool for low-complexity masking in biological
sequences.
3 Methods
Note: The choice of genomes to be studied is an initial point to be
taken into account. As we will see later on, in order to conclusively
show that a gene has emerged de novo, we need to be comparing
genomes that have not diverged too much so that detectable simi-
larity exists between intergenic regions and that some degree of
synteny is conserved. It is also important to note that the pipeline
described here is based on existing annotations, and the results will
thus be heavily dependent upon the annotation methodology.
High-quality annotation performed according to similar criteria
for each genome under comparison is crucial for the robustness of
the outcome. Homology to known, already annotated genes in
other genomes is frequently used in annotation pipelines thereby
biasing the annotation away from detection of species-specific
genes [8]. Furthermore, short (usually <300 nt) open reading
frames (ORFs) will often be ignored during annotation, and since
young de novo genes are predicted to be short, this can lead to
further underestimation of their true numbers. A more complete
but significantly more time-consuming approach can be to include
all ORFs above a certain length threshold, regardless of the
annotation.
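Enumerating ORFs independently of the annotation can be sketched as
follows. This is a minimal illustration, not part of the pipeline
described in this chapter: it assumes that an ORF runs from the first
ATG in a reading frame to the next in-frame stop codon of the standard
genetic code, scans all six frames, and uses a hypothetical default
length threshold of 90 nt.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_len=90):
    """Yield (strand, frame, start, end, orf) for ATG-to-stop ORFs.

    Coordinates are 0-based and end-exclusive on the scanned strand;
    the length includes the stop codon. min_len is a hypothetical default.
    """
    seq = seq.upper()
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        yield strand, frame, start, i + 3, strand_seq[start:i + 3]
                    start = None  # look for the next ORF in this frame

for orf in find_orfs("CCATGAAATTTTAGCC", min_len=12):
    print(orf)
```

Lowering the threshold increases sensitivity for short, young genes at
the cost of many more spurious candidates to filter in later steps.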
3.1 Retrieve the Data
The first step is to download the necessary data to a local machine,
where all the subsequent computations will take place. To start the
analysis, we need the genomic sequences in FASTA format,
annotation files in one of the commonly used formats (GenBank,
EMBL, GFF), and the amino-acid and coding DNA sequences
(CDS) for all annotated protein-coding genes for the focal genome
and the outgroup genomes. There exist multiple sources for
genome data, and the choice depends on the genome being
investigated. NCBI’s Genome Resource
(https://www.ncbi.nlm.nih.gov/genome/) is one of the richest.
Ensembl is useful when studying vertebrate species
(http://www.ensembl.org/index.html). Other organism-specific
resources are sometimes simpler to navigate, such as FlyBase
(http://www.flybase.org/) or the Saccharomyces Genome Database
(https://www.yeastgenome.org/).
It is good practice to always work on the most recent releases of
genomes and annotations. In order to download the actual files,
one may have to access an FTP server, for which clients such as
FileZilla are particularly useful (https://filezilla-project.org/).
Let us now assume that we have downloaded the necessary files
for the focal species and four phylogenetically closest outgroup
species. We have created a main working directory, and we have
moved the following files there:
1. The genome sequences of the outgroup species. Here we will
assume that we are dealing with a chromosome-level assembly
(one FASTA file per genome, with one record per chromosome):
outgen1_chrom.fsa, outgen2_chrom.fsa, outgen3_chrom.fsa, and
outgen4_chrom.fsa.
2. The annotation files in GFF format: focal.gff, outgen1.gff,
outgen2.gff, outgen3.gff, and outgen4.gff.
3. The CDS files: focal_cds.fsa, outgen1_cds.fsa, outgen2_cds.fsa,
outgen3_cds.fsa, and outgen4_cds.fsa.
4. The amino-acid sequence files: focal_aa.fsa, outgen1_aa.fsa,
outgen2_aa.fsa, outgen3_aa.fsa, and outgen4_aa.fsa.
For practical purposes, it is a good idea to move the different
types of files into separate directories. Let’s assume that we have
changed directory (cd) to where we have just placed our files. To
create the sub-directories and move the files, we can execute the
following commands:
$ mkdir ./chrom/ ./annot/ ./prot/ ./cds/
$ mv *_chrom.fsa ./chrom/ ; mv *gff ./annot/ ; mv *_cds.fsa ./cds/ ; mv *_aa.fsa ./prot/
We are now ready to start the analysis.

Note: Throughout the chapter, we will assume that the headers of
the FASTA files are formatted as follows:
>SEQUENCE_ID[SPACE]OTHER_INFORMATION
and that the SEQUENCE_ID part of the header is formatted as
follows:
[SPECIES][underscore][GENE_NAME]
For example: Hsap_SOMEGENENAME
We invite the reader to adjust the individual commands provided
to accommodate the formatting of their FASTA headers.
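Before proceeding, it can be useful to check that the headers actually follow this convention. The following is a minimal sketch and not part of the original pipeline: the file toy_aa.fsa and its records are hypothetical, and the pattern assumes that the species tag consists only of letters and digits.

```shell
# Hypothetical toy FASTA file; the third record breaks the assumed convention.
workdir=$(mktemp -d)
cat > "$workdir/toy_aa.fsa" <<'EOF'
>Hsap_GENE1 some description
MKT
>Hsap_GENE2 another description
MAA
>badheader
MCC
EOF

# Extract the SEQUENCE_ID of each record (text between ">" and the first space)
# and report the ones that do not look like [SPECIES]_[GENE_NAME].
nonconforming=$(grep '^>' "$workdir/toy_aa.fsa" | cut -d ' ' -f 1 | tr -d '>' \
  | grep -v -E '^[A-Za-z0-9]+_.+$')
echo "$nonconforming"
```

Any identifier printed by this check would need to be renamed, or the downstream commands adjusted, before continuing.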
3.2 Identify All Species-Specific Genes
By definition, de novo genes are derived from previously noncoding
sequence. If we are considering recently evolved de novo genes,
then they will be a subset of species-specific genes, having no
homologs outside the focal genome. Thus the first step is to identify
all species-specific genes in our focal genome. De novo genes
can be categorized according to whether or not they contain any
genetic material that is copied or descended from a pre-existing
gene [3]. The most intuitive cases are type I de novo genes, which
are completely derived from noncoding sequence. However,
depending on the purpose of the study, one may also be interested
in de novo genes with small or large portions derived from
sequences previously under selection (type II and type III de
novo genes, respectively), and the similarity search criteria can be
adjusted accordingly.
3.2.1 Similarity Search in NCBI’s NR Database
We will first perform a similarity search of all the protein-coding
genes in the focal genome against NCBI’s NR database using the
blastp executable from the BLAST suite of programs. A commonly
used E-value threshold is 10^-3, generally accepted to result in a
good trade-off between sensitivity and specificity. Using more
permissive thresholds than 10^-3 is very likely to produce many false
hits. The NR database is relatively large; the total size of the
uncompressed files of the preformatted version, as of this writing,
is 106 GB. To speed up the search, we can use BLAST’s multithreading
option with an appropriate number of threads
(-num_threads X) to parallelize the alignment step, as well as the
GNU parallel command line tool to further accelerate the search.
Let’s assume that we have downloaded the NR database preformatted
files along with the taxonomy information file (taxdb.tar.gz)
and we have uncompressed them and placed them in the
NR_DIRECTORY directory. The command to execute is the
following:
$ cat focal_aa.fsa | parallel --GNU --block 100k --recstart '>' --pipe 'blastp -query - -db [NR_DIRECTORY]/nr -outfmt "6 std slen qlen stitle staxids sscinames" -max_target_seqs 500 -num_threads [NUM_OF_CPUS] -evalue 0.001' > focal_nr_out.txt
The time that this command will take to complete will depend
on the number and size of the query protein sequences. The
“-max_target_seqs 500” argument is used to limit the number of
target sequences reported, since we are only interested in the
general presence or absence thereof, and not in the sequences
themselves.
We then need to parse the blastp output and store in a file a list
of all the genes with hits, not counting self-hits (matches to proteins
of the focal species itself). This list will then be compared to the
full list of genes to extract the genes without any BLAST hit outside
the focal species. We then select the protein sequences of
these genes and store them in a separate file; the same is also done
for the nucleotide sequences. Before proceeding, make sure that
the FASTA files do not contain duplicate records.
First, we select all the lines of the output file that do not
represent a match to a protein sequence of the focal genome itself.
If, for example, the focal species’ scientific name is “Saccharomyces
cerevisiae,” we execute the following:
$ grep -v "Saccharomyces cerevisiae" focal_nr_out.txt | cut -f 1 | sort -u > focal_found_genes.txt
All the genes with at least one significant match outside of the
focal taxon are now stored in the file “focal_found_genes.txt.”
Then we remove them from the list of all the genes:
$ grep ">" focal_aa.fsa | cut -f 1 -d ' ' | tr -d '>' | sort -u > all_focal_genes.txt
$ comm -23 all_focal_genes.txt focal_found_genes.txt > focal_ss_names.txt
Our species-specific genes are now stored in the file “focal_ss_names.txt.”
We now need to extract their sequences from the initial
FASTA files; this can be done using the utility faSomeRecords:
$ faSomeRecords focal_aa.fsa focal_ss_names.txt focal_ss_aa.fsa
$ faSomeRecords focal_cds.fsa focal_ss_names.txt focal_ss_cds.fsa
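The logic of this subsection can be illustrated on toy data. The sketch below is only an illustration under stated assumptions: the hit table is a simplified, hypothetical stand-in for focal_nr_out.txt (three columns instead of the full tabular output), and the gene names are invented. It uses comm -23, which prints only the lines unique to the first file.

```shell
export LC_ALL=C   # make sort and comm agree on the ordering
workdir=$(mktemp -d)

# Simplified stand-in for the blastp output: query, subject, scientific name.
printf 'Scer_A\tprot1\tSaccharomyces cerevisiae\n'  > "$workdir/hits.txt"
printf 'Scer_B\tprot2\tSaccharomyces cerevisiae\n' >> "$workdir/hits.txt"
printf 'Scer_B\tprot3\tCandida albicans\n'         >> "$workdir/hits.txt"

# Full list of focal genes (Scer_C has no hit at all).
printf 'Scer_A\nScer_B\nScer_C\n' | sort -u > "$workdir/all.txt"

# Genes with at least one hit outside the focal species.
grep -v "Saccharomyces cerevisiae" "$workdir/hits.txt" | cut -f 1 | sort -u > "$workdir/found.txt"

# Set difference: genes in the full list that are absent from the "found" list.
ss_names=$(comm -23 "$workdir/all.txt" "$workdir/found.txt")
echo "$ss_names"
```

In this toy case, Scer_A (self-hit only) and Scer_C (no hit) are retained as species-specific, while Scer_B is removed because of its match in another species.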
3.2.2 Similarity Search in Outgroup Genomes
Using the species-specific genes, we will perform a similarity search
in the outgroup genomes’ sequences. This is needed to ensure that
no homologous, unannotated genes exist in the outgroup species
and is also required for other downstream steps. First, it is a good
idea to mask low-complexity segments on the chromosome
sequences so that we avoid spurious matches. This can be achieved
with the tantan program:
$ ls ./chrom/*chrom.fsa | parallel --GNU 'tantan -x N {} > {}.masked'
Next, we use the genblast program to identify gene models
based on homology searches. First we need to set a variable to let
the program know where to look for the auxiliary legacy BLAST
programs, installed during genblast installation. Replace the part
after the “=” by the path of the install directory of the genblast
program on your machine.
$ export GBLAST_PATH="/users/User/Documents/tools/genBlast_v138_mac_os_X/"
We also need to copy the auxiliary file alignscore.txt from the
genblast directory to our working directory.
Next, we execute the program once for every species that we
want to search in.
$ ls ./chrom/*chrom.fsa.masked | parallel --GNU 'genblasta -p genblastg -q focal_ss_aa.fsa -t {} -e 0.001 -c 0.5 -cdna -gff -pid -o {}.gb'
We are next going to parse the “.gff” files generated by genblast
and apply identity percentage (60%) and coverage percentage (50%)
thresholds (for coverage, this is also possible via the -c option of
genblast, set to 0.5 above). You can adjust these thresholds, the
values 60 and 50 in the awk condition of the following command,
according to the parameters of your study:
$ for i in ./chrom/*gff ; do grep -v "^#" $i | grep "transcript" | cut -f 9 | tr ';' '\t' | cut -f 2-4 | sed 's/[A-z]*\=//g' | awk '{ if ($2>60 && $3>50) { print } }' | cut -f 1 ; done | sort -u > ss_missing_homologs.txt
The names of the species-specific genes for which a homologous
gene model can be retrieved are now stored in the file
“ss_missing_homologs.txt.” We next need to remove these sequences from
the FASTA files and from our list:
$ faSomeRecords -exclude focal_ss_aa.fsa ss_missing_homologs.txt focal_ss_aa_final.fsa
$ faSomeRecords -exclude focal_ss_cds.fsa ss_missing_homologs.txt focal_ss_cds_final.fsa
$ comm -23 focal_ss_names.txt ss_missing_homologs.txt > focal_ss_names_final.txt
At this stage, we will also use the tfasty executable from the
fasta suite of programs, to do a similarity search using as query the
protein sequences of the species-specific genes and the masked
chromosome sequences from the four outgroup species as subject.
The command is run twice to get two different output formats
(controlled by the “-m” argument): the tabular one which is useful
for parsing and the detailed one which is useful for visual inspection
and manual curation (see Note 1):
$ ls ./chrom/*chrom.fsa.masked | parallel --GNU 'tfasty36 -E 0.00001 -m 9C -p -s BP62 focal_ss_aa_final.fsa {} > {}.against.trgs.detailed.txt'
$ ls ./chrom/*chrom.fsa.masked | parallel --GNU 'tfasty36 -E 0.00001 -m 8 -p -s BP62 focal_ss_aa_final.fsa {} > {}.against.trgs.tabular.txt'
Next, we need to filter out low identity and low coverage hits.
These thresholds can vary, but here we will apply a percentage
identity threshold of 50% and a protein coverage threshold of 50%.
To apply a coverage percentage threshold, we first need to use
the faSize utility to calculate the length of each species-specific
sequence:
$ faSize -detailed focal_ss_aa_final.fsa | sort -k1 > focal_ss_aa_final_lengths.txt
Then, we need to append the corresponding sequence length
to each line of our tfasty tabular output files:
$ for i in ./chrom/*against.trgs.tabular.txt ; do sort -k1 -o $i $i ; join -1 1 -2 1 $i focal_ss_aa_final_lengths.txt | tr ' ' '\t' > ${i%\.*}_with_lengths.txt ; done
Finally, we filter the files:
$ for i in ./chrom/*with_lengths.txt ; do awk '{ if ($3 > 50.0 && sqrt(($8-$7)*($8-$7))/$13 > 0.5) { print } }' $i > ${i%.*}_filtered.txt ; done
The two thresholds in the awk condition are, in that order, the
percentage identity (50.0) and the query coverage (0.5), and they can
be adjusted at will.
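To make the filtering step concrete, here is a small sketch on invented data. It assumes the 13-column layout produced above (12 tabular columns followed by the query length) and applies the same two conditions; the rows, names, and values are hypothetical, and an explicit abs() function replaces the sqrt-of-square idiom purely for readability.

```shell
workdir=$(mktemp -d)

# Columns: query, subject, %identity, aln length, mismatches, gaps,
# q.start, q.end, s.start, s.end, E-value, bit score, query length.
printf 'Scer_A\tchrI\t72.0\t40\t11\t0\t1\t40\t900\t1019\t1e-08\t55.0\t50\n'   > "$workdir/hits_with_lengths.txt"
printf 'Scer_B\tchrII\t80.0\t20\t4\t0\t5\t24\t300\t359\t1e-06\t40.0\t100\n'  >> "$workdir/hits_with_lengths.txt"
printf 'Scer_C\tchrIV\t45.0\t60\t33\t0\t1\t60\t5000\t5179\t1e-05\t38.0\t70\n' >> "$workdir/hits_with_lengths.txt"

# Keep hits with identity above 50% and aligned query span above 50% of its length.
kept=$(awk -F '\t' 'function abs(x) { return x < 0 ? -x : x }
  $3 > 50.0 && abs($8 - $7) / $13 > 0.5 { print $1 }' "$workdir/hits_with_lengths.txt")
echo "$kept"
```

Only Scer_A passes: Scer_B covers too small a fraction of the query (19/100), and Scer_C falls below the identity threshold.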
3.3 Showing that the Ancestral Sequence Lacked Protein-Coding Potential
The most robust evidence that a gene emerged de novo is provided
when its ancestral sequence can be shown to have lacked
protein-coding potential. In order to achieve that, one needs to detect the
candidate gene’s orthologous genomic sequences in at least two
closely related outgroup species. At this point it is crucial to note
that while this step is relatively straightforward for single-exon
genes and genes with simple gene structures, it necessitates signifi-
cantly more manual curation and involves more uncertainty in the
case of complex gene structures. For simplicity and because in the
majority of cases young genes are short and have very simple gene
structures, we will consider only the single-exon gene case. To
extend to multiple exons, each exon would have to be searched
separately during the tfasty step described in Subheading 3.2.2.
That would then allow the manual “stitching together” of the
orthologous region of each exon into a single putative CDS
which can then be aligned and inspected as described in the follow-
ing subsections (see Note 2).
3.3.1 Retrieving the Orthologous Regions in Outgroup Genomes
Check if orthology or synteny information between the focal
genome and the neighboring genomes exists in comparative genomic
resource databases, a list of which can be found at the Quest for
Orthologs website (https://questfororthologs.org/orthology_databases).
The goal at this step is to locate, if possible, the orthologous
region of the candidate genes in closely related outgroup
genomes (see Fig. 2), and for this we will need lists of orthologous
pairs of genes, which can be extracted from one of the aforemen-
tioned databases. If orthology information is not available, the
orthologous pairs need to be computed from scratch using a dedi-
cated tool.
By combining the results of Subheading 3.2.2 and the orthol-
ogy information, we can build multiple alignments of each candi-
date gene and its orthologous sequences. These MSAs will be used
in the next step.
[Figure 2 labels: Species-specific gene; Focal genome; Conserved orthologs]
Fig. 2 Graphical representation of the configuration described in Subheading 3.1, in a scenario with four
closely related outgroup species. The regions of interest are highlighted in green. Note that in actual cases,
neighboring genes and regions might overlap, and so the region of interest might not be as clearly defined as
in the example here
If the number of candidates is not excessive, one can opt to do
it manually by grouping each candidate with all of its matches in the
outgroup genomes and then removing the ones that do not fall in
the predicted orthologous region. Here’s a short bash script that
accomplishes the initial, grouping part:
#!/bin/bash
# For each candidate, collect its CDS and all of its filtered genomic hits.
cat focal_ss_names_final.txt |\
while read f_line ;
do
  echo $f_line > faIn_temp ;
  faSomeRecords focal_ss_cds_final.fsa faIn_temp ${f_line}_ortho.fas ;
  grep $f_line chrom/*tabular_with_lengths_filtered.txt |\
  while read line ;
  do
    # Fields 9 and 10 hold the hit coordinates on the chromosome.
    START=$(echo $line | cut -f 9 -d ' ') ;
    END=$(echo $line | cut -f 10 -d ' ') ;
    REV_FLAG=0 ;
    # A start larger than the end indicates a hit on the reverse strand.
    if [ $START -gt $END ] ;
    then
      read START END <<< "$END $START" ;
      REV_FLAG=1 ;
    fi ;
    FILE_NAME=$(echo $line | cut -f 1 -d ':') ;
    CHROM_NAME=$(echo $line | cut -f 2 -d ' ') ;
    samtools faidx ${FILE_NAME%%\.*}.fsa.masked ${CHROM_NAME}:$START-$END | tr ':' '_' > temp ;
    if [ $REV_FLAG -gt 0 ] ;
    then
      revseq -tag -sequence temp -outseq temp_rev ;
      mv temp_rev temp ;
    fi ;
    # Tag the header of each genomic hit as an outgroup sequence.
    cat temp | sed 's/^>\(.*\)$/>\1 OUTGR/' | tr -d ':' >> ${f_line}_ortho.fas ;
  done ;
done
To execute, after replacing with the correct file names, copy and
paste this code into a file called parse_candidate_gen_hits.sh. Then,
permit that file to be executed by running the following:
$ chmod +x parse_candidate_gen_hits.sh
Then execute the script. By doing so, we will have generated a
file for each candidate (with the extension “_ortho.fas”), containing
its sequence and all its genomic hits. Then, by manually opening
and inspecting each file, we can cross-check the coordinates of
each hit with the orthologous regions that have been inferred
previously. At the same time, we can also verify that no sequencing
gaps exist in the orthologous regions (see Note 3).
3.3.2 Reconstruction of the Ancestral Sequence
Next, we infer the state of the ancestral sequence. Almost always in
the literature, this step is performed manually, by “walking
along” the alignment and trying to identify “shared disablers.”
Simply put, this consists of manually inspecting the alignments
and identifying common ORF-disrupting mutations in the out-
group orthologous sequences. The ancestral state of these positions
can then be parsimoniously inferred, revealing whether the ances-
tral sequence had an intact ORF. We consider the ORF as ances-
trally not “intact” when it’s shorter than 70% of the de novo
candidate gene’s length.
The first step is to align the sequences. The alignments can be
generated using the linsi executable of MAFFT, as follows
(as before, replace [CPU_NUM] by the number of cores in your
machine):
$ ls *_ortho.fas | parallel --GNU 'linsi --quiet --thread [CPU_NUM] {} > {.}.aln'
The aligned sequences are now stored in the files with the
extension “_ortho.aln.” By opening the alignment files with an
alignment viewer, we can detect frameshift mutations (see Fig. 3a)
and stop codons (Fig. 3b) in the orthologous sequences. It is
important that we follow any change of reading frame that occurs
in the alignment, either by getting the correct reading frame
translation from our viewer or by searching the match of the specific
region in the tfasty detailed output files that we generated in
Subheading 3.2.2. Note that in the literature, even if no specific
“shared disablers” are detected, de novo emergence may still be
assumed if all outgroup orthologous sequences have at least one
mutation that disrupts the ORF and leaves it shorter than 70% of the
candidate’s length.
Fig. 3 Two hypothetical examples of shared disablers. (a) A single-nucleotide deletion (highlighted in yellow)
that occurred along the terminal branch of the focal genome results in a frameshift, making available a
different potential translation of the sequence that avoids the TGA stop codon that is in frame for the potential
ORF in other species. (b) Two base substitutions in the focal genome lineage lead to the removal of two stop
codons (nonsense-to-sense mutations, highlighted in blue) leading to the formation of a longer ORF
Cases of de novo emergence inferred manually can be corro-
borated by performing true ancestral sequence reconstruction. This
is especially useful when the alignment contains combinations of
multiple frameshifts and nonsense-to-sense mutations. It can be
achieved using the PRANK multiple sequence alignment program,
but before it can be done, we need to infer phylogenetic trees to use
as guide trees in the reconstruction.
First, we must convert our FASTA multiple alignment files into
PHYLIP format using the seqret tool from the EMBOSS package:
$ ls *_ortho.aln | parallel --GNU 'seqret -sequence FASTA::{} -outseq PHYLIP::{.}.phylip'
Then, we need to save the topology of the species tree (assumed
to be known and robust) in Newick format in a file, here called
“species_tree.nwk.” For our hypothetical example, that tree would
be the following:
$ echo "(relgen4,(relgen3,(relgen2,(relgen1,focal))));" > species_tree.nwk
Then, we can build trees, one for each alignment, that follow
the species topology by executing the following command:
$ ls *phylip | parallel --GNU 'phyml -i {} -d nt -v e -o lr -c 4 -a e -b 0 -f e -u species_tree.nwk'
Finally, we can run PRANK with the following command (see
Note 4):
$ ls *_ortho.fas | parallel --GNU 'prank -d={} -showanc -showevents -F -once -t={.}.phylip_phyml_tree.txt -o={.}'
Once PRANK has successfully finished running, we can inspect
the ancestral sequences for intact ORFs. The ancestral sequences
can be found aligned to the extant ones in the files with the
extension “anc.fas” and have numbers as identifiers. You can see
which ancestor corresponds to which branch in the Newick tree
files ending with “.anc.dnd,” which you can open with a phyloge-
netic tree viewer such as SeaView. See Fig. 4 for the reconstruction
of a toy example.
When the number of candidates is prohibitive for one-by-one
manual investigation, we can automate the previous step. To do
this, we first need to “de-align” the ancestral sequences that are
found in the files produced by PRANK ending with “anc.fas”:
$ for i in *anc.fas ; do cat $i | sed "/^[^>]/s/-//g" > $i.daln ; done
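As a quick check of what this sed expression does, the sketch below applies it to a tiny, hypothetical alignment file (toy.anc.fas, invented here). Gap characters are removed from sequence lines only; header lines, which start with “>”, are left untouched even if they contain a hyphen.

```shell
workdir=$(mktemp -d)

# Hypothetical two-record alignment with gaps in the ancestral sequence.
cat > "$workdir/toy.anc.fas" <<'EOF'
>#1#
ATG--AAA-TGA
>focal-species
ATGCCAAAATGA
EOF

# Remove "-" on every line that does not start with ">".
dealigned=$(sed '/^[^>]/s/-//g' "$workdir/toy.anc.fas")
echo "$dealigned"
```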
Next, we need to check that no intact ORF existed ancestrally.
In the PRANK output files, the names of the ancestors follow the
format #NUMBER# (#1#, #2#, etc.). The ORFs we are interested
in are in the reading frame of the candidate, so all we need to do is
translate the ancestral sequences and look for stop codons in the
relevant ancestors, meaning the ones that are part of the focal
genome’s lineage. In the example of Fig. 3, that would be all of
them (1,2,3,4), but in other cases, some ancestors might be irrele-
vant (i.e., not in the lineage of interest).
First we translate the sequences using transeq from EMBOSS:
$ for i in *anc.fas.daln ; do transeq -sequence $i -outseq $i.prt ; done

Fig. 4 Inferring de novo emergence for a hypothetical example alignment combining both frameshifts and
nonsense-to-sense mutations of Fig. 3, using ancestral reconstruction. (a) The phylogenetic tree generated by
PRANK, the same as the input guide tree but with the assigned ancestor identifiers. (b) The PRANK alignment
containing the extant and ancestral sequences. The positions of interest are highlighted as in Fig. 3. The
ancestral states at these positions confirm the results of the manual inference. (c) The “de-aligned,”
translated ancestral proteins and the focal extant one (see below for relevant commands), allowing to verify
that no intact ORF existed before the focal leaf of the tree
Then, we extract the ancestors’ sequences, measure the size of
the ORF, and apply the filter (see Note 5):
$ for i in *prt ; do
    CAND=${i%_*} ;
    CAND_LEN=$(grep $CAND focal_ss_aa_final_lengths.txt | cut -f 2) ;
    cat $i | tr '#' '_' > temp ; mv temp $i ;
    echo -n $CAND$'\t' ;
    RESULT=$(grep ">_[1234]_" $i | tr -d '>' | while read line ; do
        echo $line > temp_names.txt ;
        faSomeRecords $i temp_names.txt temp_seq.fasta ;
        # Length of the translation up to the first stop codon ("*")
        ORF_LEN=$(grep -v ">" temp_seq.fasta | tr -d '\n' | grep -o -E '^[^*]*' | head -1 | tr -d '\n' | wc -c) ;
        echo $CAND_LEN$'\t'$line$'\t'$ORF_LEN ;
      done | awk '{ if ($3/$1 > 0.7) { print } }') ;
    if [ -z "$RESULT" ] ; then echo "Keep" ; else echo "Discard" ; fi ;
  done > results.txt
The final results are stored in the file results.txt. The candidates
tagged with “Keep” are the ones that do not have an intact ORF
longer than 70% of the length of the candidate in any of the
ancestors 1, 2, 3, and 4. To change the length threshold, change
the value 0.7 in the previous command. Adjust the
grep matching pattern (">_[1234]_") accordingly to select the desired
ancestors.
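The decision rule itself can be illustrated without any of the external tools. The following sketch is an illustration only: the helper names orf_len and classify are hypothetical, the protein strings are invented, and “*” is assumed to mark a stop codon as in transeq output. It measures the translation up to the first stop and compares it with the candidate length for a single ancestor.

```shell
# Length of a translated sequence up to (not including) the first stop "*".
orf_len() {
  printf '%s' "$1" | grep -o -E '^[^*]*' | head -n 1 | tr -d '\n' | wc -c | tr -d ' '
}

# Usage: classify PROTEIN CANDIDATE_LENGTH [THRESHOLD]
# Prints "Discard" when the ancestral ORF exceeds the threshold fraction
# of the candidate length, and "Keep" otherwise.
classify() {
  awk -v orf="$(orf_len "$1")" -v len="$2" -v thr="${3:-0.7}" \
    'BEGIN { if (orf / len > thr) print "Discard"; else print "Keep" }'
}

classify 'MKT*LLVAGR' 10   # early stop: ORF of 3 residues out of 10
classify 'MKTLLVAG*R' 10   # late stop: ORF of 8 residues out of 10
```

In the full procedure a candidate is kept only if every relevant ancestor yields “Keep”; this sketch shows the test for one ancestor at a time.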
3.4 Showing that the Candidate Genes Are Protein-Coding/Functional
By default, young de novo genes are only present in a single
genome. They therefore lack one of the main lines of evidence
that is put forth to prove that a piece of DNA is indeed
protein-coding or functional, namely, conservation due to purifying
selection. It is thus necessary to provide some evidence that the putative
de novo gene expresses a functional protein and is not simply a
spurious result. This is a difficult issue which depends on the very
definition of a functional protein-coding gene, complicated further
by pervasive transcription [25] and pervasive translation [26]. Con-
sequently, what constitutes sufficient evidence of “coding-ness”
and functionality will depend on the context and the assumptions
of the study.
In the absence of specific functional annotation or identifica-
tion of the protein by other means, experimental evidence for its
expression can be provided if proteomics or ribosome profiling data
are available. Major repositories of results from mass spectrometry
proteomic experiments include PRIDE (https://www.ebi.ac.uk/pride/archive/)
and PeptideAtlas (http://www.peptideatlas.org/),
where one can check whether peptides matching a de novo candidate
have been experimentally detected (see [27] for additional
information on mass spectrometry proteomic resources). Ribosome
profiling result databases include GWIPS (http://gwips.ucc.ie/)
and RPFdb (http://sysbio.sysu.edu.cn/rpfdb/index.html)
(see [28] for a more complete list of resources). Alternatively, one
can calculate what are referred to as “coding scores” based on the
intrinsic sequence composition of the candidates [17, 29]. This is
sometimes done as part of the initial genome annotation, as is the
case, for example, in Saccharomyces [30]. One possible solution
is the CPAT tool [31], which was developed to be applied to
entire transcripts but can work on single ORFs as well. It involves
training the model on data of known coding and noncoding
sequences first (unless your genome is one of human, fly, mouse,
or zebrafish, for which models are already available at
http://lilab.research.bcm.edu/cpat/) and so will not be covered in
detail here. At any rate, a
sequence annotated as coding remains at the very least a candidate,
even if no other evidence exists of its functionality.
4 Notes
1. If this command crashes, replace parallel with a normal for loop.
2. Here we will describe, in a general fashion, how to look for the
aforementioned evidence. However, some of this work is already
done for some genomes, especially the ones from model organisms.
For example, in human, we can directly extract the orthologous
genomic sequences in vertebrate genomes for any part of
the genome. To do this, all we need is to download the multiple
alignments of mammalian genomes to human, available from
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz20way/
in .MAF files (according to our needs, we can
choose files with more or fewer genomes; see
http://hgdownload.soe.ucsc.edu/downloads.html). Then by converting the human
genome coordinates of our species-specific genes to BED format
and feeding them to “mafsInRegion” (downloadable from
http://hgdownload.cse.ucsc.edu/admin/exe/), we can directly
generate the desired alignments. A similar shortcut is possible in
genomes like S. cerevisiae, D. melanogaster, and other model
organisms.
3. If the number of candidates is high and/or manual curation is
not desirable, we will need to write a script to perform the task.
The resulting files are assumed to be the same. Here’s what a
straightforward algorithm might look like in pseudocode, but
any effective solution is acceptable:
for candidate in species_specific_gene_list :
    find gene_neighbors immediately downstream (candidate_-1) and upstream (candidate_+1)
    for genome in outgroup_species_list :
        find orthologs for candidate_-1 (ortho_candidate_-1) and candidate_+1 (ortho_candidate_+1)
        if both ortho_candidate_-1 and ortho_candidate_+1 exist :
            Extract their coordinates, calculate the interval
            if they are contiguous on their chromosome :
                for hit in genomic_hits_of_candidate_in_genome :
                    if hit within the interval :
                        Extract the genomic sequence
                        Store in file
4. In order for the last two commands to work, the names at the
leaves of the tree files and the names of the sequences in the
FASTA files must match. That means we first must remove the
part of the header following the species name in the de novo
candidate’s record in the FASTA file and accordingly remove any
extra information from the header of the orthologous matches.
Moreover, if, for example, we have a candidate for which orthologous
matches are found in only three of the four outgroup
species but we still want to proceed with the reconstruction, we
must adjust the file “species_tree.nwk” by removing the extra
species.
5. faSomeRecords has trouble with sequence identifiers that start
with “#,” so we replace “#” by “_.”
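As an illustration of the innermost test in the pseudocode of Note 3 (“if hit within the interval”), the sketch below selects the hits that fall inside a syntenic interval. Everything in it is hypothetical: the four-column hit table (candidate, chromosome, start, end), the interval chrI:1000-2000 standing in for the region between the two flanking orthologs, and the file name hits.tsv.

```shell
workdir=$(mktemp -d)

# Hypothetical genomic hits of one candidate in one outgroup genome.
printf 'Scer_A\tchrI\t1350\t1200\n'   > "$workdir/hits.tsv"   # reverse-strand hit
printf 'Scer_A\tchrI\t9000\t9150\n'  >> "$workdir/hits.tsv"   # outside the interval
printf 'Scer_A\tchrII\t1250\t1300\n' >> "$workdir/hits.tsv"   # wrong chromosome

# Keep hits lying entirely within the interval delimited by the flanking orthologs.
in_interval=$(awk -F '\t' -v chr="chrI" -v lo=1000 -v hi=2000 '
  {
    s = ($3 + 0 < $4 + 0) ? $3 + 0 : $4 + 0   # leftmost coordinate
    e = ($3 + 0 < $4 + 0) ? $4 + 0 : $3 + 0   # rightmost coordinate
  }
  $2 == chr && s >= lo + 0 && e <= hi + 0 { print $1 "\t" $2 "\t" s "\t" e }
' "$workdir/hits.tsv")
echo "$in_interval"
```

A complete script would loop over candidates and outgroup genomes and derive the interval from the orthology table; only the coordinate test is sketched here.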
References
1. Long M, Betrán E, Thornton K et al (2003) The origin of new genes: glimpses from the young and old. Nat Rev Genet 4:865–875
2. Andersson DI, Jerlström-Hultqvist J, Näsvall J (2015) Evolution of new functions de novo and from preexisting genes. Cold Spring Harb Perspect Biol 7:a017996
3. McLysaght A, Hurst LD (2016) Open questions in the study of de novo genes: what, how and why. Nat Rev Genet 17:567–578
4. Schlötterer C (2015) Genes from scratch: the evolutionary fate of de novo genes. Trends Genet 31:215–219
5. McLysaght A, Guerzoni D (2015) New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation. Philos Trans R Soc Lond B Biol Sci 370:20140332
6. Li D, Dong Y, Jiang Y et al (2010) A de novo originated gene depresses budding yeast mating pathway and is repressed by the protein encoded by its antisense strand. Cell Res 20:408–420
7. Vakirlis N, Sarilar V, Drillon G et al (2016) Reconstruction of ancestral chromosome architecture and gene repertoire reveals principles of genome evolution in a model yeast genus. Genome Res 26:918–932
8. Tautz D, Domazet-Lošo T (2011) The evolutionary origin of orphan genes. Nat Rev Genet 12:692–702
9. Cai J, Zhao R, Jiang H et al (2008) De novo origination of a new protein-coding gene in Saccharomyces cerevisiae. Genetics 179:487–496
10. Heinen TJAJ, Staubach F, Häming D et al (2009) Emergence of a new gene from an intergenic region. Curr Biol 19:1527–1531
11. Knowles DG, McLysaght A (2009) Recent de novo origin of human protein-coding genes. Genome Res 19:1752–1759
12. Levine MT, Jones CD, Kern AD et al (2006) Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression. Proc Natl Acad Sci 103:9935–9939
13. Carvunis A-R, Rolland T, Wapinski I et al (2012) Proto-genes and de novo gene birth. Nature 487:370–374
14. Domazet-Lošo T, Carvunis A-R, Albà MM et al (2017) No evidence for phylostratigraphic bias impacting inferences on patterns of gene emergence and evolution. Mol Biol Evol 34:843–856
15. Moyers BA, Zhang J (2014) Phylostratigraphic bias creates spurious patterns of genome evolution. Mol Biol Evol 32:258–267
16. Moyers BA, Zhang J (2016) Evaluating phylostratigraphic evidence for widespread de novo gene birth in genome evolution. Mol Biol Evol 33:1245–1256
17. Vakirlis N, Hebert AS, Opulente DA et al (2018) A molecular portrait of de novo genes in yeast. Mol Biol Evol 35:631–645
18. Altschul SF, Madden TL, Schäffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
19. Pearson WR, Wood T, Zhang Z et al (1997) Comparison of DNA sequences with protein sequences. Genomics 46:24–36
20. Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30:772–780
21. Löytynoja A, Goldman N (2008) Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320:1632–1635
22. She R, Chu JS-C, Wang K et al (2009) GenBlastA: enabling BLAST to identify homologous gene sequences. Genome Res 19:143–149
23. Guindon S, Delsuc F, Dufayard J-F et al (2009) Estimating maximum likelihood phylogenies with PhyML. Methods Mol Biol 537:113–137
24. Frith MC (2011) A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res 39:e23
25. Clark MB, Amaral PP, Schlesinger FJ et al (2011) The reality of pervasive transcription. PLoS Biol 9:e1000625
26. Ingolia NT, Lareau LF, Weissman JS (2011) Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 147:789–802
27. Chen T, Zhao J, Ma J et al (2015) Web resources for mass spectrometry-based proteomics. Genomics Proteomics Bioinformatics 13:36–39
28. Wang H, Wang Y, Xie Z (2017) Computational resources for ribosome profiling: from database to Web server and software. Brief Bioinform. https://doi.org/10.1093/bib/bbx093
29. Ruiz-Orera J, Messeguer X, Subirana JA et al (2014) Long non-coding RNAs as a source of new peptides. Elife 3:e03523
30. Scannell DR, Zill OA, Rokas A et al (2011) The awesome power of yeast evolutionary genetics: new genome sequences and strain resources for the Saccharomyces sensu stricto genus. G3 (Bethesda) 1:11–25
31. Wang L, Park HJ, Dasari S et al (2013) CPAT: coding-potential assessment tool using an alignment-free logistic regression model. Nucleic Acids Res 41:e74
Chapter 5
Coevolutionary Signals and Structure-Based Models for the Prediction of Protein Native Conformations
Ricardo Nascimento dos Santos, Xianli Jiang, Leandro Martínez, and Faruck Morcos
Abstract
The analysis of coevolutionary signals from families of evolutionarily related sequences is a recent concep-
tual framework that provides valuable information about unique intramolecular interactions and, therefore,
can assist in the elucidation of biomolecular conformations. It is based on the idea that compensatory
mutations at specific residue positions in a sequence help preserve stability of protein architecture and
function and leave a statistical signature related to residue-residue interactions in the 3D structure of the
protein. Consequently, statistical analysis of these correlated mutations in subsets of protein sequence
alignments can be used to predict which residue pairs should be in spatial proximity in the native functional
protein fold. These predicted signals can be then used to guide molecular dynamics (MD) simulations to
predict the three-dimensional coordinates of a functional amino acid chain. In this chapter, we introduce a
general and efficient methodology to perform coevolutionary analysis on protein sequences and to use this
information in combination with computational physical models to predict the native 3D conformation of
functional polypeptides. We present a step-by-step methodology that includes the description and applica-
tion of software tools and databases required to infer tertiary structures of a protein fold. The general
pipeline includes instructions on (1) how to obtain direct amino acid couplings from protein sequences
using direct coupling analysis (DCA), (2) how to incorporate such signals as interaction potentials in Cα
structure-based models (SBMs) to drive protein-folding MD simulations, (3) a procedure to estimate
secondary structure and how to include such estimates in the topology files required in the MD simulations,
and (4) how to build full atomic models based on the top Cα candidates selected in the pipeline. The
information presented in this chapter is self-contained and sufficient to allow a computational scientist to
predict structures of proteins using publicly available algorithms and databases.
Key words Coevolution, Structure-based model, Energy landscapes, Molecular dynamics, Protein folding, Structure prediction
1 Introduction
The knowledge of the three-dimensional fold of proteins is funda-

mental for the goal of understanding biological function. There-
fore, a major goal of structural biology is the determination of the
tertiary protein structures, either by experimental or computational
methods. Computational methods have for years been able to
confidently predict protein folds when experimental models of
similar proteins are available. More recently, the
development of the theory and software to elucidate contact infor-
mation from a collection of related amino acid sequences has
enhanced the prediction of protein folds to a more prominent
level [1, 2]. One idea behind coevolutionary coupled amino acid
sites is that during evolution proteins experience mutations which
are only partially deleterious for function or retain marginal stability
and, thus, are propagated through generations. However, compen-
satory mutations can appear in the population of individuals, restor-
ing or even enhancing the fitness of the protein, and the associated
pair of mutations might become the dominant genotype. Many of
these compensatory mutations have structural origins, that is, they
counteract the loss of favorable interactions between amino acid
side chains which are structurally in contact [3]. The emergence
and propagation of compensatory mutations between structurally
related residues can be studied through statistical methods to identify
which pairs of residues coevolve and to provide clues about physical
residue contacts from collections of evolutionarily related
sequences [4].
The use of coevolutionary-derived contact information has
been successful in improving the quality of 3D-fold predictions to
a precision which was not attainable for proteins for which no
similar experimental structure was available [1, 5, 6]. The identifi-
cation of potential contacts between pairs of residues contributes to
the prediction of protein structure [7–16] and protein interactions
[8, 17–19] and even assists the study of protein folding [20, 21],
conformational changes [22, 23], and complex-forming mechan-
isms [24, 25].
Here we present a practical guide on how to predict protein fold
structures using contact information obtained from coevolutionary
signals at amino acid positions in combination with coarse-grained
physical models of proteins. Initially, we describe how to infer
residues that are physically interacting from sequence alignments
along a protein family using direct coupling analysis (DCA)
[1]. DCA is a global probabilistic model used to calculate an
estimate of the amino acid pairs that are directly coupled in
sequence families. We provide the instructions and references for
the implementation of DCA, starting from data collection to the
final computation of amino acid couplings. Then we show how to
combine secondary structure prediction with residue contacts pre-
dicted by DCA to define a force field to simulate folding using
structure-based models (SBMs) [26, 27]. Finally, we describe how
to visualize the simulated model and evaluate the structural similar-
ity between the predicted and experimental coordinates. The pro-
cedures described here are expected to facilitate and generalize
research focused on the prediction of tertiary and quaternary
molecular structures of proteins.
The combination of coevolutionary information for amino
acids with structure-based models yields reproducible model
predictions with fold-level accuracy. The prediction performance
obtained with raw predicted secondary structure as simulation input
suggests that the method is applicable to any kind of secondary
structure. The methodology and the tools we provide aim to enable
the scientific community to solve problems in structural
bioinformatics.
2 Materials
In this section we describe all computational tools and web-based
resources that will be necessary to generate inputs, run molecular
simulations for folding, and evaluate and visualize results.
2.1 UniProt Server
UniProt is a comprehensive sequence and annotation resource
containing a large dataset of protein sequences, accompanied by
diverse annotations of biological function, domain composition,
subcellular location, and possible molecular interactions [28].
This database is freely accessible at www.uniprot.org.
2.2 Pfam Server
Pfam is a public database of protein families (groups of
evolutionarily related proteins) that features thousands of annotated
entries and is constantly being curated and updated [29, 30]. This
database provides multiple sequence alignments (MSAs) generated
using Hidden Markov Models, a statistical modeling paradigm
based on dynamic Bayesian networks. Entries for identified families
and their respective MSAs can be accessed at http://pfam.xfam.org.
2.3 Direct Coupling Analysis (DCA) Server
Correlations in amino acid mutations can be identified by applying
statistical inference to MSAs [31]. Several algorithms and
refinements to perform this analysis have been developed by our group
and others, with good performance [1, 4, 11]. Direct coupling
analysis is an efficient technique to compute coevolutionary signals
that is able to disentangle indirect and direct correlations, which
are hard to distinguish by usual correlation analyses such as mutual
information [1, 5]. Identification of direct couplings is especially
important in structure prediction, since they can be interpreted as
regions that are physically interacting in the functional state of
macromolecules. A web server for estimation of direct correlations
in residue pairs using the DCA approach is available at
http://dca.rice.edu/ and http://morcoslab.org. A standalone version
of DCA is also available on the same web page. A more detailed
description of DCA usage can be found in another publication [5].
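To make the contrast with pairwise correlation measures concrete, the following minimal sketch computes the mutual information between two alignment columns, the kind of score that DCA is designed to improve on. It is not part of the DCA server or its standalone code; the function name `mutual_information` and the toy columns are illustrative assumptions.

```python
from collections import Counter
from math import log


def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two equally long MSA columns."""
    if len(col_i) != len(col_j):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_i)
    count_i = Counter(col_i)              # single-site counts, column i
    count_j = Counter(col_j)              # single-site counts, column j
    count_ij = Counter(zip(col_i, col_j))  # joint counts for the pair
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        # p_ab / (p_a * p_b) written with raw counts
        mi += p_ab * log(n_ab * n / (count_i[a] * count_j[b]))
    return mi


# Toy columns (hypothetical data): columns 1 and 2 covary perfectly,
# column 3 varies independently of column 1.
col1 = "AAAACCCC"
col2 = "DDDDEEEE"
col3 = "GHGHGHGH"
print(mutual_information(col1, col2))
print(mutual_information(col1, col3))
```

A high score of this kind only says that two columns covary; it cannot tell whether the covariation is direct or mediated by a third position, which is the distinction DCA addresses.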
2.4 HMMER Profiling
HMMER is a software framework developed for the identification
of homologous sequences in biological databases and for efficient
multiple sequence alignment. Its methodology employs
probabilistic Hidden Markov Models for pattern recognition.
HMMER comprises several tools, such as hmmscan, which searches a
query sequence against a database of profile HMMs, and hmmbuild,
which builds a profile HMM from an MSA. This software is freely
accessible and can be downloaded at http://hmmer.org/.
2.5 SBM Generation Script
Structure-based models (SBMs) are simplified representations of
molecular interactions that accelerate the conformational
search of biomolecular systems in molecular dynamics simulations
[25, 26, 32]. They are based on the idea of minimally frustrated
energy landscapes, which describe realistic protein-folding
mechanisms [33]. Due to the SBM formulation, interaction data such as
coevolutionary couplings can be added to SBMs to drive the
conformational search [6, 19]. We provide an efficient script to
automatically generate coordinates and SBM models based solely on the
primary sequence and predicted secondary structure of a protein.
This script incorporates coevolution data as energy potentials with
optimized parameters for protein-folding simulations. This procedure
provides an initial unfolded model and the topology and
force-field files necessary to run folding simulations. The scripts
can be downloaded at http://morcoslab.org,
under the tab Research -> Software Tools. Python scripts and
parameter files to assist the generation and visualization of
protein models from coevolution data are also provided.
2.6 Gnuplot
Gnuplot is an open source and portable software for graphical
visualization of data and mathematical functions. It is very intuitive
and available in most Linux distributions, in addition to running on
all major operating systems. To check if Gnuplot is already included in
your operating system, type gnuplot in a terminal. Details about
download and usage can be found at http://gnuplot.sourceforge.net/.
2.7 Jpred Server
While recognition of tertiary structure is still challenging and
methodologies are under development, tools for estimation of local
secondary motifs are very mature and display good accuracy
[34, 35]. One of the most recent implementations of secondary
structure prediction methods, which was selected for this protocol,
is Jpred version 4 [36]. This approach uses multilayered neural
networks that are trained to identify secondary motif patterns
from primary sequence as input. Sequence queries can be submitted
to the Jpred server at
http://www.compbio.dundee.ac.uk/jpred/index.html. Alternatively,
other prediction methodologies can also be used, such as PSIPRED,
Spider2, and PredictProtein [37–39].
2.8 GROMACS with Support for Gaussian Potentials
GROningen MAchine for Chemical Simulations (GROMACS) is a
molecular dynamics platform designed for computational simulations
of biomolecular systems, such as proteins, nucleic acids, lipids,
and small organic molecules [40, 41]. It is very fast, compatible
with many force-field representations, and supports parallelization
through central and graphical processing units (CPUs and GPUs,
respectively). As it is an open-source package, versions with diverse
additional features can be independently developed. In the protocol
described herein, a GROMACS version with support for SBMs
containing Gaussian potentials for representation of distance
constraints will be used. This modified version can be downloaded at
http://smog-server.org/extension/gromacs-4.5.4_sbm1.0.tar.gz.
Details about the GROMACS installation procedure for a diverse set of
operating systems can be found at
http://www.gromacs.org/Documentation/Installation_Instructions_4.5.
This protocol will assume a single precision GROMACS installation.
2.9 Protein Data Bank
Protein Data Bank (PDB) is a structural database containing the 3D
relative coordinates of thousands of biological macromolecules
obtained from experimental techniques such as X-ray and electron
crystallography, nuclear magnetic resonance (NMR), and cryo-electron
microscopy (cryo-EM). Most of the database comprises
protein structures, but nucleic acid structures are also found
[42, 43]. This source of structural genomic information is in
continuous growth, with more than 130,000 entries as of August 2017.
It can be accessed at https://www.rcsb.org/pdb/home/home.do.
2.10 UCSF Chimera
UCSF Chimera is a free-of-charge software for visualization and
analysis of molecular structures. It supports a wide range of file
formats and provides many tools for data analysis,
such as sequence alignment, charge distribution and energy
optimizations, interpolation of structure conformations, and solvation.
Moreover, Chimera also provides a platform to generate
high-quality scientific images and animations. It can be downloaded at
https://www.cgl.ucsf.edu/chimera/download.html.
2.11 LovoAlign
Structural similarity of two protein conformations can be computed
by rigid-body structural alignment. Here we use LovoAlign, a free
and fast package for structural alignment [44]. It supports several
well-established similarity measurements, such as the
root-mean-square deviation (RMSD), template modeling score (TM-score),
and global distance test (GDT). Directions for downloading and
installing the latest version of LovoAlign can be found at
http://www.ime.unicamp.br/~martinez/lovoalign.
2.12 REMO Server
REconstruct atomic MOdel (REMO) is a program that generates
full atomic coordinates of proteins from Cα models. This
reconstruction process employs an algorithm to optimize the network of
hydrogen bonding in backbone and side chains [45]. A server and a
stand-alone version of REMO are available at
https://zhanglab.ccmb.med.umich.edu/REMO/. Moreover, alternative
approaches to reconstruct all-atom models can also be used in this
protocol [46, 47].
3 Methods
We describe a step-by-step protocol to run folding simulations
driven by coevolutionary data. The protocol is divided into
several main steps, which should be executed in succession (see
Note 1). We assume the reader is working in a Unix-like
environment or operating system. Further details about software
compatibility and installation on specific versions of operating
systems can be found in the list of links provided in the Materials
section.
3.1 Protein Sequence and Family
As a first example of protein folding prediction, we will work with
the human transmembrane protein aquaporin-1. This macromolecule
is part of a large family of transport proteins that controls the
flow of water through the cell [48–50]. The presence of water
channels allows cells to increase and decrease intracellular water
content at a faster rate than diffusion through membranes, and
functional defects are related to diseases [48]. Moreover, structural
characterization of transmembrane proteins such as aquaporin-1 is
particularly challenging for experimental studies, due to limitations
of available techniques. Therefore, computational methodologies
like the one presented herein are particularly useful. Also,
while many full-atom ab initio methods have been successful in
predicting the conformation of small proteins (<100 residues),
the study of larger systems such as aquaporin-1 remains intractable
even when using high-performance computing. Together, these
attributes make aquaporin-1 an excellent example for folding
studies, comparable in complexity to real-case research
problems.
In order to infer coevolutionary signals for a given system, we
need the primary amino acid sequence of the target. To obtain it,
we will access the UniProt server (Subheading 2.1) and
type human aquaporin 1 in the main query field at the top of the
page, with the option UniProtKB selected. A list of entries for
aquaporins organized by relevance will be provided. We should
select the most relevant entry that corresponds to our target of
study (entry P29972, AQP1_HUMAN). Clicking on the link of
this entry opens another page that displays all information already
annotated for this molecule (such as biological function and
molecular interactions). In order to perform coevolutionary analysis,
we need the amino acid sequence and the Pfam family of this protein.
In the section Sequences of the aquaporin-1 UniProt page, we can
download the amino acid sequence corresponding to isoform 1
using the Fasta button (next to entry code P29972-1, 269 residues;
right-click and choose Save link as...). Open
the downloaded .fasta file with any text editor, and
check the entry code and length.
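The check of entry code and length can also be scripted. The sketch below is a minimal FASTA reader written for illustration only; the function name `read_fasta` and the miniature record are assumptions, and with the real download one would pass the contents of the saved .fasta file and expect a length of 269 for isoform 1.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical miniature record used only to demonstrate the check.
example = ">sp|P00000|TOY_HUMAN toy entry\nMASEF\nKKKLF\n"
for header, sequence in read_fasta(example):
    accession = header.split("|")[1]  # UniProt-style header: db|accession|name
    print(accession, len(sequence))
```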
Next we should identify the family to which aquaporin-1 belongs
and obtain a curated MSA containing many representative proteins
of this family. Still on the page resulting from our query, we can
identify in the section Family and Domains the aquaporin-1
families annotated in distinct databases. For the Pfam annotation,
we observe a unique hit for a family named MIP (major intrinsic
proteins), with Pfam accession code PF00230. By clicking on this
link, we are redirected to the Pfam page for the MIP family
(Subheading 2.2). An initial page provides a summary of functional
details about the members of this family. In the Alignments tab, we
see all MSAs available considering distinct proteome databases (see
Note 2). We will use the full MSA already curated by Pfam with the
maximum number of sequences. In the section Format an alignment
of this tab, select the options Alignment = Full, format = FASTA,
order = Tree, Sequence = Inserts lower case,
gaps = mixed (Gaps as “.” or “-”), and the download option. Save the
MSA file by clicking on the Generate button. Figure 1 depicts the
settings used to generate this MSA file.
The MSA file generated by Pfam contains many insert positions
that are not relevant to the coevolution analysis and that, in some
cases, result in intractable file sizes. To reduce the size of
these working files while keeping the information needed for the
analysis, a Python script named msa_clean.py is provided
(Subheading 2.5). Run this script with the following command in the
terminal to generate a new MSA with reduced file size.
python msa_clean.py PF00230_full.txt
Fig. 1 Parameters to download MSA for a specific family in Pfam server. The generated format is compatible
with DCA
This step will remove insertions from the original MSA file
PF00230_full.txt downloaded from Pfam and rewrite this informa-
tion in a file named PF00230_full_clean. Furthermore, in order to
identify the region and specific residues from the human aquaporin-
1 sequence that are part of MIP family, we also need to know the
profile used to generate the MSA (Pfam employs Hidden Markov
Models for multiple alignments). We can download the model
named MIP.hmm at the Curation & model tab using the download
link provided at the bottom.
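For readers who want to see what removing insertions amounts to, the following sketch strips insert states from aligned sequences. It assumes the Pfam export options chosen above, in which inserts appear as lowercase letters and "." characters while match columns use uppercase letters and "-". It is an illustration only and is not the code of msa_clean.py; the function names and toy records are hypothetical.

```python
def strip_inserts(aligned_seq):
    """Drop insert states: lowercase residues and '.' padding characters."""
    return "".join(c for c in aligned_seq if not (c.islower() or c == "."))


def clean_msa(records):
    """Apply strip_inserts to every (header, aligned_sequence) pair."""
    cleaned = [(header, strip_inserts(seq)) for header, seq in records]
    # After cleaning, every sequence should span the same match columns.
    lengths = {len(seq) for _, seq in cleaned}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length after cleaning")
    return cleaned


# Hypothetical two-sequence alignment with one insert region.
toy = [("seq1", "MK.aL-V"), ("seq2", "MRg.LAV")]
print(clean_msa(toy))
```

The length check at the end is a cheap sanity test: if the cleaned sequences do not all have the same length, the input was probably not exported with the insert convention assumed here.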
3.2 Coevolutionary Information from Direct Coupling Analysis
From the MSA of the MIP family, we can now perform a statistical
analysis over many sequences using direct coupling analysis (DCA).
This analysis will provide a quantitative measure of the direct
couplings of each possible pair of positions in the MSA, which can be
used to infer physical contacts in aquaporin-1. To perform this step,
go to the DCA server website (see Subheading 2.3), and use the full
MSA obtained for the MIP family (PF00230_full_clean) as input. Use
the DCA button on the Workbench tab. Choose a job name in the
first blank field, and upload the MSA file in the FASTA_IN option. Set
the relative_pseudo_count option to 1.0 and the homolog_radius
parameter to 0.8. Details about the influence of these parameters
on DCA performance are provided on the same submission page as
well as in [1]. Finally, run the coevolutionary analysis with the
button Start Job.
After finishing the calculations, the DCA server will provide a
preliminary heatmap with an overview of the distribution of highly
ranked correlations in the MSA. Furthermore, a file containing the
list of direct correlations for each pair of positions in the family
can be downloaded from the link named DI_values.DI. Save this file to
your computer, and open it using any text editor of choice. Check that
three columns are provided on each line, corresponding to the two
positions of the pair in the MSA and to the direct correlation level,
named direct information (DI), respectively. Notice that lines in the
DCA list are sorted by pair numbering starting at 0, but we want to
single out the pairs with the highest DI values (third column).
Moreover, since adjacent positions in the MSA usually correspond to
neighboring residues in the backbone, we expect an intrinsic and
trivial high correlation among those residues. In order to sort this
list by DI value and to filter out neighboring positions with local
correlations, we will open a terminal in the folder containing the
downloaded files and execute the following command to generate a
new list of pairs (see Note 3):
awk '{if($2-$1>4)print $1+1,$2+1,$3}' DI_values.DI | sort -g -k 3 -r > PF00230_full_ranked.DI
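The same filtering and ranking can be written in Python, which may be easier to adapt than the one-line shell pipeline. The sketch below mirrors the command above: it keeps pairs whose positions are more than 4 apart, shifts the 0-based indices to 1-based numbering, and sorts by DI in descending order. It assumes that the first two columns are integers; the function name `rank_pairs` and the toy lines are hypothetical.

```python
def rank_pairs(lines, min_separation=4):
    """Keep pairs with j - i > min_separation, shift to 1-based, sort by DI."""
    ranked = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        i, j, di = int(fields[0]), int(fields[1]), float(fields[2])
        if j - i > min_separation:
            ranked.append((i + 1, j + 1, di))
    # Highest DI first, as with `sort -g -k 3 -r`.
    ranked.sort(key=lambda pair: pair[2], reverse=True)
    return ranked


# Hypothetical lines in the three-column layout described above.
toy_lines = ["0 1 0.90", "0 5 0.10", "2 9 0.35", "3 7 0.50"]
for i, j, di in rank_pairs(toy_lines):
    print(i, j, di)
```

In the toy input, the pair with the highest DI (0 1) is discarded because it is a sequence neighbor, which is exactly the trivial correlation the filter is meant to remove.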
Open the file just created (PF00230_full_ranked.DI), and
check that this new list has the desired properties (descending
DI values for nonadjacent pairs). It should be noted that
aquaporin-1 is only one representative member of the MIP family. Now
that we have the top coevolving pairs in the MIP family, we need to
identify the residue pairs of our target aquaporin-1 that match
these MSA pairs. In other words, we need to map the
positions of the MIP family onto the human aquaporin-1 sequence. For
this step, we will use the HMMER software (Subheading 2.4). Still in
the same terminal, in the folder where the files are located, use the
hmmpress tool from HMMER to compile the MIP HMM profile:
hmmpress MIP.hmm
Next, use the hmmscan tool to locate the MIP domain
in the aquaporin-1 sequence:
hmmscan -o P29972_MIP_scan --notextw MIP.hmm P29972.fasta
Check the file P29972_MIP_scan using a text editor to verify
the correspondence between MSA positions and the residues in the
protein sequence. Now, we should use the generated mapping to
rewrite the top correlated MSA pairs as residue pairs in the
aquaporin-1 sequence. To complete this task, copy to the current
folder the python script map_dca.py provided in the link of
Subheading 2.5, and run the following command:
python map_dca.py P29972_MIP_scan PF00230_full_ranked.DI
This script will generate two files: (i) a new list of top
coevolutionary couplings corresponding to residue pairs in the target
protein sequence and (ii) a reference table for each matching position
in the MSA and the protein sequence (see Note 4). You can check
the creation of these files by typing the command “ls” in the
terminal. Open the first file generated (named
P29972_MIP_scan_ranked_matched.DI), and compare it with the previous
MSA list obtained from DCA (PF00230_full_ranked.DI). Next, check the
generated reference file for mapping (P29972_MIP_scan_reference.txt),
and observe that some insertions can occur in the map list
(represented by “-”) that are not in the original MSA of the assigned
family and vice versa. These instances demonstrate why a detailed
pair matching is necessary instead of only knowing the
absolute beginning and end of the location of a family in a protein
(see Note 5).
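The need for position-by-position matching can be illustrated with a small sketch. It assumes the alignment convention used in HMMER output, where the model line shows "." at columns in which the target has an insertion and the target line shows "-" where the target has a deletion, and it assumes that MSA columns correspond to model match positions. This is not the code of map_dca.py; the function names and the toy alignment are hypothetical.

```python
def map_positions(model_line, target_line, hmm_start, seq_start):
    """Return {model_position: residue_number} for aligned match columns."""
    mapping = {}
    hmm_pos, seq_pos = hmm_start, seq_start
    for m, t in zip(model_line, target_line):
        if m == ".":      # insertion in the target: no model position
            seq_pos += 1
        elif t == "-":    # deletion in the target: no residue
            hmm_pos += 1
        else:             # match column: record it and advance both
            mapping[hmm_pos] = seq_pos
            hmm_pos += 1
            seq_pos += 1
    return mapping


def remap_pairs(pairs, mapping):
    """Translate (i, j, DI) from model positions to residue numbers."""
    return [(mapping[i], mapping[j], di)
            for i, j, di in pairs
            if i in mapping and j in mapping]


# Hypothetical alignment: one insertion ('x') and one deletion ('-').
mapping = map_positions("ab.cde", "ABxC-E", hmm_start=2, seq_start=10)
print(mapping)
print(remap_pairs([(2, 6, 0.4), (3, 5, 0.2)], mapping))
```

In the toy case, model position 5 has no residue in the target, so any pair involving it is dropped, and the insertion shifts all later residue numbers by one relative to a naive offset. Both effects are lost if only the start and end of the domain are used.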
Moreover, to get an idea about the level of interaction diversity
in our predicted coevolutionary data, we can visualize the
top-ranked couplings using residue-residue plots. To do this, we
should open a Gnuplot terminal (Subheading 2.6) by typing gnuplot
in the working shell terminal and plot the first 200 residue
couplings with top DI values using the following command
in the new terminal:
plot 'P29972_MIP_scan_ranked_matched.DI' every::::200 pt 7 notitle
Fig. 2 Representation of top 200 DCA contacts for human aquaporin-1 (UniProt
code: P29972). The lower right distribution corresponds to the native contacts
from the reference structure PDB: 4CSK
This step will generate a residue-residue map pattern similar
(but probably not identical, since Pfam MSAs are constantly being
updated) to the one shown in Fig. 2. If we have abundant information
in the MSA, this pattern of highly coevolving pairs should
ideally recover the physical contact map of the target protein in its
functional state [1, 5, 51]. As is well known in statistics, the
accuracy of inference increases with the amount of nonredundant
information available [52, 53]. Therefore, the larger the number
of distinct but still homologous sequence representatives that are
available in our MSA, the more accurate the DCA predictions tend
to be. In order to evaluate the variation of the coevolution signal
distribution, try plotting different numbers of top-ranked pairs by
replacing the 200 in the last command executed in Gnuplot, and
see how the map changes in comparison to the top DCA pairs shown in
Fig. 2. The lower triangular part of Fig. 2 shows the native contacts
from the reference structure PDB: 4CSK.
DCA uses a statistical estimation method to compare the
strength of direct couplings among positions in the MSA. Since this
estimation depends on intrinsic features of the MSA (length,
evolutionary pressure, sequence variation, and number of instances),
the estimated DI value of a residue pair is relative to its
counterparts, rather than an absolute value. As a consequence, the
accuracy of DCA predictions (rate of true positive interactions)
follows a continuously descending profile from the top predicted
coupling, which allows internal differentiation. Furthermore,
interaction couplings from distinct phenomena can contribute and be
mixed in DCA signals, such as signals from folding, oligomerization,
and allosteric movements. Therefore, estimating the ideal set of
coevolutionary information is not trivial. An initial guess suggested
here, based on practical folding examples, is to select a number of
top-ranked pairs equal to the length of the MSA of the protein family
(i.e., the number of DCA pairs equals the MSA length). If we check the
file generated by hmmscan for the MIP family search in aquaporin-1
(P29972_MIP_scan), we can observe a match of 226 positions of
this family to the aquaporin-1 sequence (from position 2 to 227 in
the MSA of the MIP family). Therefore, we will select the top 226 DCA
pairs for folding prediction. This can be done using the following
command in the working terminal:
head -n 226 P29972_MIP_scan_ranked_matched.DI > DCA_top226
where the file DCA_top226 has the top 226 coupled residue pairs
separated by more than 4 residues in the amino acid chain.
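The arithmetic behind the choice of 226 pairs, and the notion of a true positive rate for the top-ranked pairs, can be sketched as follows. The function names, the toy ranked list, and the toy set of native contacts are all hypothetical; in practice the native contacts would have to be derived from a reference structure, a step this sketch does not cover.

```python
def n_pairs_from_match(first_pos, last_pos):
    """Number of matched positions, used here as the number of DCA pairs."""
    return last_pos - first_pos + 1


def true_positive_rate(pairs, native_contacts):
    """Fraction of predicted (i, j, DI) pairs found in native_contacts."""
    if not pairs:
        return 0.0
    hits = sum(1 for i, j, _ in pairs if (i, j) in native_contacts)
    return hits / len(pairs)


# Positions 2 to 227 of the family match the target sequence.
n = n_pairs_from_match(2, 227)
print(n)

# Hypothetical ranked pairs and native contacts, for illustration only.
ranked = [(10, 50, 0.4), (12, 80, 0.3), (20, 30, 0.2), (5, 90, 0.1)]
native = {(10, 50), (20, 30), (40, 60)}
for k in (2, 4):
    print(k, true_positive_rate(ranked[:k], native))
```

Evaluating the rate for increasing values of k on a protein with a known structure gives a direct view of how quickly prediction quality decays down the ranked list.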
3.3 Secondary Structure Prediction
After quantifying coevolutionary couplings for our protein of study,
aquaporin-1, we can now use this information for folding prediction.
But first, we need to describe how neighboring residues are
locally organized and packed. Clues about local information
(secondary structure) can also be inferred from coevolution analysis;
however, several mature and more accurate methodologies are
available. These approaches use statistical, knowledge-based, or
machine-learning techniques (or combinations of them) to
predict packing patterns. One robust tool for secondary structure
prediction that was selected for this methodology is Jpred
(Subheading 2.7).
In order to predict the secondary structure of aquaporin-1, go
to the Jpred server website (Subheading 2.7), delete the sample
sequence in the query field, and paste the amino acid sequence of the
protein obtained in Subheading 3.1 and saved in a .fasta file (copy
the text inside P29972.fasta and paste it in the query field). Run the
analysis with the button Make prediction. On the page
returned after job submission, a message should appear showing
validated structures matching this query sequence. Since we are
performing this practice as a validation example for the method,
ignore this page by selecting the continue button. Note that this
situation does not arise when the structure of the protein under
study is unknown.
Finally, after the job is complete, a new page displaying an
overview of the predicted secondary structure will be provided.
Select the option View simple results in HTML, copy the entire
two lines corresponding to the sequence and respective secondary
structure, and save this information in a new text file as
P29972_Jpred.txt.
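Before using the saved prediction, it can help to summarize it as a list of secondary-structure segments. The sketch below assumes the two-line layout described above (sequence, then prediction) and the usual Jpred symbols H for helix, E for strand, and "-" for coil. The function name and the toy strings are hypothetical.

```python
from itertools import groupby


def secondary_structure_segments(sequence, prediction):
    """Return (state, start, end) runs with 1-based inclusive numbering."""
    if len(sequence) != len(prediction):
        raise ValueError("sequence and prediction must have the same length")
    segments = []
    position = 1
    # groupby yields one group per run of identical consecutive symbols.
    for state, run in groupby(prediction):
        length = len(list(run))
        segments.append((state, position, position + length - 1))
        position += length
    return segments


# Hypothetical ten-residue example.
seq = "MASEFKKKLF"
pred = "--HHHHH-EE"
for state, start, end in secondary_structure_segments(seq, pred):
    print(state, start, end)
```

The length check catches the most common copy-and-paste error, namely a prediction line that was truncated or wrapped when saved to the text file.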
3.4 Structure-Based Models from Coevolution
Once we have obtained secondary structure and coevolutionary
couplings as proxies for the secondary- and tertiary-fold
levels, respectively, we can merge these data and use them as input
for structure prediction. First, in order to run folding with
molecular dynamics simulations using structure-based models, we need
to generate (i) an initial unfolded model for aquaporin-1 and (ii) a
topology file containing all details about the physical properties of
the system, such as the mass of atoms, covalent bonds, interaction
potentials between specific pairs obtained from DCA, and energies
involved in conformational movements (variations in bond lengths,
angles, and dihedrals). Further general information about physical
models and the approach of molecular dynamics simulations can be
found elsewhere [26, 54–56]. To generate these files, we should
use a python script provided at the link described in Subheading 2.5
(file named dcasbm.tar.gz). Extract the file provided in a folder
inside the working directory using the following command (use the
file name exactly as downloaded, since Unix file names are
case-sensitive):
tar -xzvf DCASBM.tar.gz
Now, run the script to generate the protein model and topol-
ogy files including DCA signals using the following command:
python dcasbm/dcasbm.py P29972_Jpred.txt DCA_top226
A value for the maximum force factor of DCA interactions will be
requested. This value controls the strength of the pairwise
potentials that drive folding and should be set based on the number
of coevolutionary restraints. For a small set of restraints,
stronger force factors should be chosen to ensure high
conformational convergence, and vice versa.
A good estimate for the strength of coevolutionary interactions
(a suggestion based on empirical data) is to set the force factor
to the ratio of the family length (L) to the number of
DCA pairs (factor = L/|DCA|). In our case of aquaporin-1, both
values are the same (226), and therefore the force factor
should be set to 1. After this parameter is set, the script
will generate two new files (P29972_Jpred_calpha.gro and
P29972_Jpred_calpha.top). Open these new files with any text
editor, and check their organization. The first one
(P29972_Jpred_calpha.gro) corresponds to the coordinates of the
unfolded model of aquaporin-1 that will be used as an initial
conformation for the folding simulation. The second file
(P29972_Jpred_calpha.top) comprises the physical parameters and DCA
interactions that will be used to drive the conformational search.
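The force factor rule of thumb is simple enough to compute by hand, but a short worked example makes the scaling explicit. The function name `force_factor` is hypothetical and the second call uses an invented pair count purely to show how the factor grows when fewer pairs are used.

```python
def force_factor(family_length, n_dca_pairs):
    """Suggested force factor: family length over number of DCA pairs."""
    if n_dca_pairs <= 0:
        raise ValueError("at least one DCA pair is required")
    return family_length / n_dca_pairs


# Aquaporin-1 case from the text: 226 matched positions, 226 pairs.
print(force_factor(226, 226))

# Hypothetical case with half as many pairs: each pair is weighted twice
# as strongly.
print(force_factor(226, 113))
```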
3.5 Folding Simulations
With the initial coordinates and topological description files for
aquaporin-1 at hand, we can now run folding simulations. This
procedure will drive the association of protein residue pairs
identified as highly coevolving by applying a combination of
repulsive and attractive energy potentials (SBMs) in molecular
simulations [19, 25, 26, 32]. In this process, the system
temperature is gradually reduced, in an approach known as simulated
annealing, until the total conformational energy of the system
reaches a minimum where most predicted interactions are satisfied,
yielding a native-like folded conformation [6, 25, 33].
A GROMACS version with support for Gaussian potentials
(see Subheading 2.8) is required. As input for the simulation, we
use the topology and coordinate files for aquaporin-1 generated
in the last section together with the simulation parameter file
downloaded as part of the package provided
in Subheading 2.5 (file named sbm_calpha_SA.mdp) (see Note 6).
This can be done using the following command:
grompp -f sbm_calpha_SA.mdp -c P29972_Jpred_calpha.gro -p P29972_Jpred_calpha.top -o run.tpr
Completion of this step will generate a binary compiled file
(run.tpr) that can be used for simulation in GROMACS. Next, a
folding simulation can be carried out in GROMACS using the
following command:
mdrun -deffnm run -pd
This simulation takes about a couple of hours to complete on a
Linux-based desktop PC (tested on an AMD FX-8370E CPU running
Ubuntu 16.04). When a multi-core CPU is available, the run time
can be reduced through parallelization: append the option “-nt x”
to the previous command, where x is the number of available
threads. When the folding simulation finishes, several new files
will have been generated. Among those we can find the
conformation with the final folded protein coordinates (run.gro).
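For readers who prefer to script the two steps above, the following sketch assembles the same grompp and mdrun commands in Python and runs them with the standard subprocess module. The function names and default arguments are hypothetical. The sketch assumes that the executables are called grompp and mdrun and are on the PATH (see Note 6 for double-precision installations).

```python
import subprocess


def folding_commands(prefix="P29972_Jpred_calpha", mdp="sbm_calpha_SA.mdp",
                     n_threads=None):
    """Return the grompp and mdrun command lines used in this section."""
    grompp = ["grompp", "-f", mdp, "-c", prefix + ".gro",
              "-p", prefix + ".top", "-o", "run.tpr"]
    mdrun = ["mdrun", "-deffnm", "run", "-pd"]
    if n_threads is not None:
        # Optional parallelization, equivalent to appending "-nt x"
        mdrun += ["-nt", str(n_threads)]
    return [grompp, mdrun]


def run_folding(**kwargs):
    """Run both steps in order, stopping if either command fails."""
    for cmd in folding_commands(**kwargs):
        subprocess.run(cmd, check=True)


# Print the commands without executing them
for cmd in folding_commands(n_threads=4):
    print(" ".join(cmd))
```

Calling run_folding(n_threads=4) in the working directory would execute both commands; check=True makes the script stop if grompp reports an error before mdrun is launched.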
3.6 Analysis of Predicted Models
After running the folding simulation for aquaporin-1, we can now
check the predicted conformation and compare it with the
corresponding experimental models available in the literature. We
can look for experimental models for aquaporin-1 in the Protein
Data Bank (PDB, Subheading 2.9). At the PDB web page, type the code
4CSK into the search field. You should be redirected to an entry
for a human aquaporin-1 X-ray structure model. The initial
Summary tab for this entry shows many details, such as quality
measurements of the experimental data and related literature
[57]. To verify that this protein corresponds exactly to the
aquaporin-1 we have been working with, go to the Sequence tab
and look for the annotated UniProt reference code (UniProtKB
P29972), which matches the UniProt code of the sequence for our
system. Having verified this information, we can now download the
coordinates of the X-ray model for human aquaporin-1 by selecting
the Download Files button located on the right of the page and the
option PDB file. Save the generated file in pdb format (4csk.pdb)
to the current working directory of the shell terminal. For an
initial visual analysis, open this experimental model together with
the final folded conformation for aquaporin-1 using the UCSF
Chimera package (Subheading 2.10) with the following command:
chimera run.gro 4csk.pdb
Alternatively, this step can be completed by opening Chimera
and selecting the files for both models with the option File > Open
inside the Chimera environment. Set a ribbon representation for
both models by selecting the following option in the Chimera panel:
Presets > Publication 4 (depth-cued, licorice). Next, we can align
the orientation of both structures using the MatchMaker tool from
Chimera. To do that, go to Tools > Structure Comparison >
MatchMaker, and select the experimental and predicted models as
reference and structure to match, respectively, in the new window
that is displayed (see Fig. 3). Keep the default options of all
secondary parameters and align both structures by clicking on the
OK button.
Fig. 3 Visual comparison of predicted and experimental conformations of human aquaporin-1 using UCSF
Chimera
As a last step in our study, we will evaluate the structural
similarity of predicted and experimental aquaporin-1 models
using quantitative parameters. The selection of a robust method
for structural comparison is crucial to evaluate the performance and
identify the correct predictions from molecular modeling methods.
For example, although the root-mean-square deviation (RMSD) is
a well-established measure to estimate relative coordinate
differences in molecules, it is not the best similarity measure to
compare protein folds. Since RMSD computes the absolute differences
in atom coordinates between models, this approach fails dramatically
to identify equivalent folding patterns when the system displays
some intrinsic flexibility such as that observed in hinge and
shearing mechanisms [58]; in other words, RMSD is too sensitive
to discrepancies in subsets of the structures. A more appropriate
measure of the similarity between conformations of protein models
is the template modeling score (TM-score) [59]. This parameter
indicates the similarity between two protein models by a value
ranging between 0 and 1. As a general rule, values below 0.2
represent uncorrelated structures, which could be obtained by a
random search. TM-score values greater than 0.5 indicate that the
models share the same overall fold, while a value of 1.0 is achieved
when both models converge to the very same 3D structure [59].
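The difference in sensitivity between the two measures can be sketched numerically. The code below assumes that the two models have already been superimposed and that the per-residue Cα distances (in Å) are known; the full TM-score additionally searches for the superposition that maximizes the score, which is omitted here. The distance values are hypothetical, and the length-dependent scale d0 follows the definition in [59].

```python
import math


def rmsd(distances):
    """Root-mean-square deviation over aligned residue pairs (in Angstrom)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def tm_score(distances, target_length):
    """TM-score for a fixed superposition; meaningful for lengths well above 15."""
    # Length-dependent distance scale d0 from Zhang and Skolnick
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Hypothetical 100-residue protein: a well-predicted core of 90 residues (1 A)
# plus a flexible 10-residue tail that is displaced by 30 A.
distances = [1.0] * 90 + [30.0] * 10

print("RMSD    :", round(rmsd(distances), 2))
print("TM-score:", round(tm_score(distances, 100), 2))
```

In this example the displaced tail dominates the RMSD, whereas each residue can contribute at most 1/L to the TM-score, so the score remains well above the 0.5 threshold for a shared fold.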
In this chapter, we will use the software LovoAlign (Subhead-
ing 2.11) to calculate TM-score values [44]. Since LovoAlign
requires input models to be represented in the standard .pdb for-
mat, we first need to transform our predicted model generated by
GROMACS from .gro to .pdb format. This step can be completed
using a GROMACS tool called editconf, with the command:
editconf -f run.gro -o run.pdb
Now, with LovoAlign installed, one can compute the
TM-score between the full-length predicted and experimental
models of aquaporin-1 (run.pdb and 4csk.pdb, respectively) by
using the following command in the working terminal:
lovoalign -p1 4csk.pdb -p2 run.pdb -seqnum
The -p1 and -p2 options define the two structures to be aligned,
and the -seqnum option ensures that the sequence alignment
matches exactly the corresponding residue numbering in the PDB
files, which is the case here since the two proteins share the same
sequence except for possible deletions. After running this
command, a log message will be displayed on the terminal, providing
the TM-score between models (in the field named FINAL SCORE:)
and specific details about the calculation (sequence matching,
number of residues considered, and alignment cycles). This log
can be saved by appending “> name.log” to the command (name can be
chosen as desired). If the coevolutionary information obtained is
sufficient (which is true in our case) and the folding simulation
converged to a configuration where the DCA geometrical restraints
are maximally satisfied, the predicted and experimental models
should be very similar and result in a TM-score greater than 0.5
(see Note 7).
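When many models are compared, the score can be extracted from the saved log automatically. The sketch below is an assumption about the log layout: it only relies on the field name FINAL SCORE: mentioned above being followed by a number on the same line. The function name and the example log line are hypothetical and should be checked against a real LovoAlign log.

```python
import re


def final_score(log_text):
    """Return the number following 'FINAL SCORE:' in a log, or None if absent."""
    match = re.search(r"FINAL SCORE:\s*([-+]?\d*\.?\d+)", log_text)
    return float(match.group(1)) if match else None


# Hypothetical log fragment, used only to exercise the parser
example_log = "  FINAL SCORE:    0.68000\n"
print(final_score(example_log))
```

For a saved log, the call would be final_score(open("name.log").read()).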
In addition, we can restrict the comparison of folding confor-
mations only to the protein regions that are covered by the family
and have assigned coevolutionary restraints. To calculate the
TM-score of a specific region of protein sequence, first we need to
provide to LovoAlign the fragment that should be analyzed in each
model. In order to generate the aquaporin-1 models containing
only the region corresponding to MIP family, use the python script
getregion.py provided in Subheading 2.5 and the information
obtained from hmmscan analysis (Subheading 3.2) by running
the following commands:
python getregion.py P29972_MIP_scan run.pdb
python getregion.py P29972_MIP_scan 4csk.pdb
This procedure will generate two new pruned models containing
only the family region of the protein. Finally, we can compute
the TM-score along the conserved family region of aquaporin-1
with the following command:
lovoalign -p1 4csk_pruned.pdb -p2 run_pruned.pdb -seqnum
Since we are restricting the analysis only to the sequence region
where coevolutionary information exists, the final TM-score should
be higher than the one observed when considering the full model.
Differences between the TM-scores of full models and those of their
fragments restricted to the family region should be consistent
with the relative size of the fragments outside family coverage.
In the case of aquaporin-1, since the domain encompasses almost the
full sequence, the best TM-score obtained in our tests considering
only the domain is very close to the one for the full
model (TM-scores of 0.70 and 0.68, respectively).
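The pruning step can be pictured with the following sketch, which keeps only the ATOM records whose residue number falls inside a given range. It is not the getregion.py script distributed with the chapter, whose internals are not shown here; the function names and the five-residue example are hypothetical, and the residue range would in practice come from the hmmscan output.

```python
def prune_pdb(pdb_text, first_residue, last_residue):
    """Keep ATOM records with residue numbers in [first_residue, last_residue]."""
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # Residue sequence number occupies columns 23-26 of a PDB ATOM record
            resseq = int(line[22:26])
            if first_residue <= resseq <= last_residue:
                kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"


def ca_line(serial, resname, resseq, x):
    """Format one C-alpha ATOM record using PDB fixed-width columns."""
    return "ATOM  %5d  CA  %3s A%4d    %8.3f%8.3f%8.3f" % (
        serial, resname, resseq, x, 0.0, 0.0)


# Hypothetical five-residue C-alpha trace
toy_pdb = "\n".join(ca_line(i, "ALA", i, 3.8 * i) for i in range(1, 6)) + "\nEND\n"
print(prune_pdb(toy_pdb, 2, 4))
```

Applying the same residue range to both the predicted and the experimental model is what allows the -seqnum option to pair the residues correctly (see Note 7).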
3.7 Rebuilding All-Atom Protein Structures
In this step, we will generate an all-atom protein structure from our
predicted Cα folded model. To perform this step, we will make use
of the REMO server [45]. Go to the web page of REMO
(Subheading 2.12), and upload the final folded model obtained for
aquaporin-1 converted to pdb format (run.pdb) using the
Browse... button. Fill in the e-mail form, and submit the job
using the run REMO button at the bottom of the page. After a
couple of minutes, you should receive a link to the REMO results
at the e-mail address that you provided. Download the generated
model and visualize it using Chimera (Fig. 4).
Fig. 4 All-atom model for aquaporin-1 generated by REMO using the coarse-
grained predicted fold model. The full protein TM-score between predicted
(purple) and experimental (white) models is 0.68, considering Cα carbons
3.8 Additional Examples
For additional test cases, we suggest that the reader try the same
protocol for folding prediction using other interesting biological
systems. Some suggestions are provided:
1. The human small G protein RAP2A. UniProt code P10114 and
PDB code 1KAO.
2. The bacterial ABC transporter, a larger transmembrane protein.
UniProt code P06609 and PDB code 1L7V.
3. The receiver domain of DesR from Bacillus subtilis, a transcrip-
tional regulatory protein. UniProt code O34723 and PDB
code 4LE1.
In this chapter we discussed a convenient methodology to
predict 3D coordinates of folded protein structures based on
coevolutionary information and molecular dynamics. We provide
resources and software tools that are free to access and instructions
on how to use them to elucidate structures with high TM-scores.
The information contained in this chapter is general and can be
used to study and infer structures of many proteins for which no
structural information is available. With the advent of sequencing
technologies, we expect that the applicability of these techniques
will become more relevant and will have a positive impact in the
quest to accurately characterize and uncover conformations and
functional properties of biomolecules.
4 Notes
1. A folder containing all input files necessary to run the protocols
described here can be found at the link provided in Subheading
2.5 under the name example_aqp1.tar.gz.
2. The described protocol is limited to multiple sequence
alignments (MSAs) with available Pfam annotations. Nevertheless,
sequences with no assigned Pfam domains can also be used to
generate customized MSAs from which to infer coevolution. In this
case, MSAs can be built using the Jackhmmer search tool inside
the HMMER server.
3. If direct coupling analysis is performed using the local Matlab
implementation also available in the DCA server, please keep in
mind that the output DI file has indexing starting at 1 instead of
0 as in the web server. The Matlab implementation also gives
mutual information (MI) values as output; therefore the DI
values are shown in the 4th column as opposed to the third as
in the DCA server.
4. The python script provided for sequence mapping will only
consider the first family found by hmmscan in the target protein
sequence. For systems with multiple families, mapping for con-
secutive families can be done manually or using the same proce-
dure described in Subheading 3.2 by deleting the data of
previously found families in the output of hmmscan analysis.
5. Mapping an MSA to a protein sequence is not only a matter of
finding the start and end positions of the domain in the protein.
Since insertions in domains and proteins can occur, the
correspondence between each MSA position and a sequence residue
must be established.
6. When using GROMACS in a double-precision installation ver-
sion, the same protocol for running folding simulations (Sub-
heading 3.5) can be done by replacing the tag grompp by
grompp_d in commands.
7. When comparing a predicted protein structure with its respec-
tive experimental model, make sure that the amino acid
sequences used for prediction and the experimental structure
use the same residue numbering to avoid indexing errors.
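The indexing and column differences described in Note 3 can be handled when the DI file is read. The sketch below assumes a whitespace-delimited file without a header line, with columns i, j, DI for the web server output (0-based indices) and i, j, MI, DI for the Matlab output (1-based indices). The function name is hypothetical, and returning 1-based indices is our own choice for the illustration.

```python
def read_di(text, source="server"):
    """Return (i, j, DI) tuples with 1-based indices for either DCA output."""
    if source == "server":
        offset, di_col = 1, 2   # 0-based indices, DI in the third column
    elif source == "matlab":
        offset, di_col = 0, 3   # 1-based indices, DI in the fourth column
    else:
        raise ValueError("source must be 'server' or 'matlab'")
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) <= di_col:
            continue  # skip blank or incomplete lines
        i = int(fields[0]) + offset
        j = int(fields[1]) + offset
        pairs.append((i, j, float(fields[di_col])))
    return pairs


# The same hypothetical residue pair written in both conventions
print(read_di("0 5 0.12\n", source="server"))
print(read_di("1 6 0.30 0.12\n", source="matlab"))
```

Both calls describe the same pair of positions, which helps avoid the off-by-one errors that Notes 3 and 7 warn about.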
Acknowledgments
The authors acknowledge financial support from the São Paulo Research
Foundation (FAPESP) (Grants 2015/13667-9, 2010/16947-9,
2013/05475-7, and 2013/08293-7) and funding from the
University of Texas at Dallas.
References
1. Morcos F, Pagnani A, Lunt B et al (2011) Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A 108:E1293–E1301
2. Hamilton N, Burrage K, Ragan MA, Huber T (2004) Protein contact prediction using patterns of correlation. Proteins 56:679–684
3. Ivankov DN, Finkelstein AV, Kondrashov FA (2014) A structural perspective of compensatory evolution. Curr Opin Struct Biol 26:104–112
4. de Juan D, Pazos F, Valencia A (2013) Emerging methods in protein co-evolution. Nat Rev Genet 14:249–261
5. Morcos F, Hwa T, Onuchic JN, Weigt M (2014) Direct coupling analysis for protein contact prediction. Methods Mol Biol 1137:55–70
6. Sulkowska JI, Morcos F, Weigt M et al (2012) Genomics-aided structure prediction. Proc Natl Acad Sci U S A 109:10340–10345
7. Hopf TA, Colwell LJ, Sheridan R et al (2012) Three-dimensional structures of membrane proteins from genomic sequencing. Cell 149:1607–1621
8. Ovchinnikov S, Kamisetty H, Baker D (2014) Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. Elife 3:e02030
9. Kamisetty H, Ovchinnikov S, Baker D (2013) Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci U S A 110:15674–15679
10. Skwark MJ, Abdel-Rehim A, Elofsson A (2013) PconsC: combination of direct information methods and alignments improves contact prediction. Bioinformatics 29:1815–1816
11. Ekeberg M, Lövkvist C, Lan Y et al (2013) Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys Rev E Stat Nonlinear Soft Matter Phys 87:012707
12. Hayat S, Sander C, Marks DS, Elofsson A (2015) All-atom 3D structure prediction of transmembrane β-barrel proteins from sequences. Proc Natl Acad Sci U S A 112:5413–5418
13. Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Nat Biotechnol 30:1072–1080
14. Jones DT, Singh T, Kosciolek T, Tetchner S (2015) MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31:999–1006
15. Sadowski MI, Taylor WR (2013) Prediction of protein contacts from correlated sequence substitutions. Sci Prog 96:33–42
16. Hopf TA, Morinaga S, Ihara S et al (2015) Amino acid coevolution reveals three-dimensional structure and functional domains of insect odorant receptors. Nat Commun 6:6077
17. Schug A, Weigt M, Onuchic JN et al (2009) High-resolution protein complexes from integrating genomic information with molecular simulation. Proc Natl Acad Sci U S A 106:22124–22129
18. Tamir S, Rotem-Bamberger S, Katz C et al (2014) Integrated strategy reveals the protein interface between cancer targets Bcl-2 and NAF-1. Proc Natl Acad Sci U S A 111:5177–5182
19. dos Santos RN, Morcos F, Jana B et al (2015) Dimeric interactions and complex formation using direct coevolutionary couplings. Sci Rep 5:13652
20. Morcos F, Schafer NP, Cheng RR et al (2014) Coevolutionary information, protein folding landscapes, and the thermodynamics of natural selection. Proc Natl Acad Sci U S A 111:12408–12413
21. Mallik S, Kundu S (2015) Co-evolutionary constraints of globular proteins correlate with their folding rates. FEBS Lett 589:2179–2185
22. Morcos F, Jana B, Hwa T, Onuchic JN (2013) Coevolutionary signals across protein lineages help capture multiple protein conformations. Proc Natl Acad Sci U S A 110:20533–20538
23. Sfriso P, Duran-Frigola M, Mosca R et al (2016) Residues coevolution guides the systematic identification of alternative functional conformations in proteins. Structure 24:116–126
24. Cheng RR, Morcos F, Levine H, Onuchic JN (2014) Toward rationally redesigning bacterial two-component signaling systems using coevolutionary information. Proc Natl Acad Sci U S A 111:E563–E571
25. Jana B, Morcos F, Onuchic JN (2014) From structure to function: the convergence of structure based models and co-evolutionary information. Phys Chem Chem Phys 16:6496–6507
26. Noel JK, Levi M, Raghunathan M et al (2016) SMOG 2: a versatile software package for generating structure-based models. PLoS Comput Biol 12:e1004794
27. Noel JK, Whitford PC, Sanbonmatsu KY, Onuchic JN (2010) SMOG@ctbp: simplified deployment of structure-based models in GROMACS. Nucleic Acids Res 38:W657–W661
28. UniProt Consortium (2015) UniProt: a hub for protein information. Nucleic Acids Res 43:D204–D212
29. Bateman A (2000) The Pfam protein families database. Nucleic Acids Res 28:263–266
30. Finn RD, Coggill P, Eberhardt RY et al (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44:D279–D285
31. Göbel U, Sander C, Schneider R, Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins Struct Funct Genet 18:309–317
32. Lammert H, Schug A, Onuchic JN (2009) Robustness and generalization of structure-based models for protein folding and function. Proteins 77:881–891
33. Onuchic JN, Luthey-Schulten Z, Wolynes PG (1997) Theory of protein folding: the energy landscape perspective. Annu Rev Phys Chem 48:545–600
34. Pirovano W, Heringa J (2010) Protein secondary structure prediction. Methods Mol Biol 609:327–348
35. Yang Y, Gao J, Wang J et al (2018) Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Brief Bioinform 19:482–494. https://doi.org/10.1093/bib/bbw129
36. Drozdetskiy A, Cole C, Procter J, Barton GJ (2015) JPred4: a protein secondary structure prediction server. Nucleic Acids Res 43:W389–W394
37. Yachdav G, Kloppmann E, Kajan L et al (2014) PredictProtein—an open resource for online prediction of protein structural and functional features. Nucleic Acids Res 42:W337–W343
38. Buchan DWA, Minneci F, Nugent TCO et al (2013) Scalable web services for the PSIPRED Protein Analysis Workbench. Nucleic Acids Res 41:W349–W357
39. Heffernan R, Paliwal K, Lyons J et al (2015) Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Sci Rep 5:11476
40. Pronk S, Páll S, Schulz R et al (2013) GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics 29:845–854
41. Kutzner C, Páll S, Fechner M et al (2015) Best bang for your buck: GPU nodes for GROMACS biomolecular simulations. J Comput Chem 36:1990–2008
42. Meyer EE (1997) The first years of the Protein Data Bank. Protein Sci 6:1591–1597
43. Young J, RCSB PDBj PDBe Protein Data Bank (2009) Annotation and curation of the Protein Data Bank. Nat Preced. https://doi.org/10.1038/npre.2009.3379.1
44. Martínez L, Andreani R, Martínez JM (2007) Convergent algorithms for protein structural alignment. BMC Bioinformatics 8:306
45. Li Y, Zhang Y (2009) REMO: a new protocol to refine full atomic protein models from C-alpha traces by optimizing hydrogen-bonding networks. Proteins 76:665–676
46. Maupetit J, Gautier R, Tufféry P (2006) SABBAC: online Structural Alphabet-based protein BackBone reconstruction from Alpha-Carbon trace. Nucleic Acids Res 34:W147–W151
47. Rotkiewicz P, Skolnick J (2008) Fast procedure for reconstruction of full-atom protein models from reduced representations. J Comput Chem 29:1460–1465
48. Agre P (2006) The aquaporin water channels. Proc Am Thorac Soc 3:5–13
49. Ishibashi K, Sasaki S (1997) Aquaporin water channels in mammals. Clin Exp Nephrol 1:247–253
50. Agre P, Kozono D (2003) Aquaporin water channels: molecular mechanisms for human diseases. FEBS Lett 555:72–78
51. Marks DS, Colwell LJ, Sheridan R et al (2011) Protein 3D structure computed from evolutionary sequence variation. PLoS One 6:e28766
52. Ash RB (2012) Information theory. Courier Corporation, Dover Publications Inc, Mineola, NY
53. Freedman D, Pisani R, Purves R (2007) Statistics: fourth international student edition. W. W. Norton & Company, New York, NY
54. Rapaport DC (2004) The art of molecular dynamics simulation. Cambridge University Press, New York, NY
55. Karplus M, Kuriyan J (2005) Molecular dynamics and protein function. Proc Natl Acad Sci U S A 102:6679–6685
56. Scheraga HA, Khalili M, Liwo A (2007) Protein-folding dynamics: overview of molecular simulation techniques. Annu Rev Phys Chem 58:57–83
57. Ruiz Carrillo D, To Yiu Ying J, Darwis D et al (2014) Crystallization and preliminary crystallographic analysis of human aquaporin 1 at a resolution of 3.28 Å. Acta Crystallogr F Struct Biol Commun 70:1657–1663
58. Subbiah S (1996) Protein motions. Springer, Berlin
59. Zhang Y, Skolnick J (2004) Scoring function for automated assessment of protein structure template quality. Proteins 57:702–710
Chapter 6
Detecting Amino Acid Coevolution with Bayesian Graphical Models
Mariano Avino and Art F. Y. Poon
Abstract
The comparative study of homologous proteins can provide abundant information about the functional and
structural constraints on protein evolution. For example, an amino acid substitution that is deleterious may
become permissive in the presence of another substitution at a second site of the protein. A popular
approach for detecting coevolving residues is by looking for correlated substitution events on branches of
the molecular phylogeny relating the protein-coding sequences. Here we describe a machine learning
method (Bayesian graphical models) implemented in the open-source phylogenetic software package
HyPhy, http://hyphy.org, for extracting a network of coevolving residues from a sequence alignment.
Key words amino acid coevolution, Bayesian graphical model, hepatitis C virus, HyPhy, epistasis
1 Introduction
Genomes encode an enormous number of components that need
to interact in order to function properly. At a low level, for instance,
nucleotides within a codon triplet define each other’s context:
whether a substitution from A to G at the second codon position
alters the encoded amino acid depends on what nucleotides occupy
the other positions, given the organism’s repertoire of transfer
RNAs that determine the genetic code. Different amino acids in
the same protein may interact through direct contacts or long-
range interactions through energetic, structural, or allosteric
mechanisms [1]. Interactions between different proteins can also
be modulated by amino acid sequence variation [2]. Thus, we
cannot understand the biological significance of a genetic polymor-
phism without accounting for at least some of its genomic context.
Conversely, identifying interactions between amino acids may con-
fer information about the higher-order structure and function of a
protein and its relation to other components of the genome.
There is an enormous literature on detecting interactions
between residues within a protein from genetic sequence variation.
Although there have been many efforts to review this literature, it is
difficult to produce a comprehensive overview because similar
methods have been developed repeatedly and independently in
diverse disciplines such as biochemistry, computer science, and
evolutionary biology [3–6]. A common motivation for the com-
parative study of amino acid sequences is that covariation among
residues may reveal information about the structure of a folded
protein [7]. In the medical sciences, sequence covariation provides
information about the structural and functional constraints of pro-
teins of viral or bacterial pathogens, and can thereby inform the
development of vaccines [8]. We tend to attribute disease-associated
phenotypes to missense point mutations, and in many cases
there is strong evidence that such mutations are independently
sufficient to cause disease [9]. However, the effect of these muta-
tions can also be modulated through their interactions with muta-
tions at other specific sites in the genome [10]. Finally,
characterizing the interactions among amino acids can quantify
the type and prevalence of epistasis and how it shapes the evolu-
tionary trajectories of populations adapting to new environments
[11, 12].
1.1 Correlated Substitutions
The simplest approach to detect interactions between amino acids
from the comparative study of protein sequences is to look for
correlations between different positions in the protein with respect
to the biochemical properties of residues [13], empirical substitu-
tion rates [7, 14] or the occurrence of specific amino acids
[15]. Correlations are often measured using mutual information
[16–18] or extensions of this approach that incorporate other
information [19, 20]. One of the major confounding factors affect-
ing the comparative study of protein sequences is that the amino
acids in different sequences are not independent observations;
instead, they are copies that descend from a common ancestor
such that shared genotypes may be due to identity by descent. In
the worst case scenario, a comparative method may predict a false
interaction between residues at different positions of a protein
because of two ancestral substitution events that have been propa-
gated to the observed descendants with no further evolution
[21]. This confounding due to common ancestry has been inde-
pendently recognized across fields and a large number of techni-
ques have been proposed to resolve it, e.g., [22–25]. One common
approach is to change the focus from the observed characteristics of
residues at different sites to the amino acid substitutions that have
accumulated in the evolutionary history of the protein sequences
[22]. Hence, we are looking for substitutions at different sites to
occur on the same branch of a phylogenetic tree, which implies
causality by proximity in time. This transformation of the data can
substantially reduce the number of false positives caused by identity
by descent [26]. However, it can also result in a loss of statistical
power to detect true associations because the number of inferred
substitutions tends to be much smaller than the number of differ-
ences [27], and this approach may also be biased by variation in
rates of evolution [28]. It is likely that none of these methods will
ever attain a high level of prediction accuracy [29]. Even so, the
comparative study of protein variation is a cost-effective means to
identify putative interactions for further empirical study.
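As a concrete illustration of the correlation-based approach discussed above, the following sketch computes the mutual information (in bits) between two alignment columns from observed residue frequencies. The function name and the four-sequence columns are hypothetical. The calculation treats sequences as independent observations, so it is subject to the confounding by common ancestry described in this section.

```python
import math
from collections import Counter


def mutual_information(col_a, col_b):
    """Mutual information in bits between two equally long alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_joint = count / n
        # Compare the joint frequency with the product of the marginals
        mi += p_joint * math.log2(p_joint / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


# Hypothetical columns from a four-sequence alignment
print(mutual_information("AAVV", "LLII"))  # residues vary together
print(mutual_information("AAVV", "LILI"))  # residues vary independently
```

Two sites that changed once each on the same deep branch would produce the first pattern in every descendant, which is why a high score alone does not demonstrate repeated coevolution.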
1.2 Bayesian Graphical Models
The challenge of detecting interactions among residues in a protein
is highly susceptible to the “curse of dimensionality”: the number
of possible interaction networks grows far more rapidly than the
amount of data. Many of the previously described methods use false
discovery rate methods to account for multiple comparisons. One
of the problems of this approach is that each interaction is evaluated
from a narrow subset (i.e., pairs) of variables while excluding the
rest of the data. Consequently, the end result for this type of
analysis is a list of potential pairwise interactions with no framework
for assembling them into a coherent whole. Instead, we have previ-
ously proposed [27] to examine the joint distribution of all poten-
tial interactions, borrowing a class of methods from the field of
artificial intelligence known as Bayesian graphical models (BGMs)
or Bayesian networks [30].
Each node in a BGM represents a variable, such as an amino
acid position in a protein alignment. A directed edge (arrow) is
drawn from node A to another node B to indicate that B is condi-
tionally dependent on A. Thus, a BGM concisely and visually
expresses the structure of a joint probability distribution for a set
of variables as a graph. For example, P(A, B, C) = P(B | A, C) P(A)
P(C) is represented by the graph A → B ← C. The challenge is to
learn the structure of this distribution from the data. The number
of structures increases faster than exponentially with the number of
variables. Hence, we employed a Markov chain Monte Carlo sam-
pling algorithm proposed by Daphne Koller and Nir Friedman [31]
that collapses the enormous space of all possible graph structures
into a more manageable space. This dimensionality reduction
results in a severe loss of granularity—instead of sampling over
individual graph structures, we are sampling permutations that
define hierarchies of nodes (whether A can be a parent of B, or
vice versa). Even so, the posterior probability of a specific edge can
be calculated from the number of structures consistent with a given
node hierarchy that contain this edge.
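To give a sense of how quickly the number of graph structures grows, the sketch below counts labeled directed acyclic graphs with Robinson's recurrence and compares the result with the number of node orderings (n!). This is an illustrative aside rather than part of the HyPhy implementation, and the function name is hypothetical.

```python
from math import comb, factorial


def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    counts = [1]  # one (empty) graph on zero nodes
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            # Inclusion-exclusion over the k nodes that have no parents
            total += ((-1) ** (k + 1) * comb(m, k)
                      * 2 ** (k * (m - k)) * counts[m - k])
        counts.append(total)
    return counts[n]


for n in range(1, 8):
    print(n, "nodes:", count_dags(n), "graphs,", factorial(n), "orderings")
```

Even for a handful of sites the number of graphs dwarfs the number of orderings, which is the motivation for sampling node hierarchies instead of individual structures.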
2 Program Usage
HyPhy is an open-source software package for phylogenetic
sequence analysis [32] that is written in C++ and compiled into
binary distributions for Mac OS, Windows, and Linux (see footnote 1). It features
both a command-line and graphical user interface (GUI) and a
custom batch language for writing scripts that can be executed
from the command line. Most of the standard analyses in HyPhy
are implemented in the batch language. In addition, many of these
standard analyses are made available as web applications on the
datamonkey.org server [33]. One of these methods is a BGM
analysis of coevolving sites in codon sequences, which we dubbed
Spidermonkey [34]. Because the server is a shared public resource,
every analysis imposes a set of restrictions on the number and
length of sequences that can be processed. Spidermonkey limits
users to alignments of no more than 500 sequences, and the
number of codons to no more than 1000. Furthermore, it limits
the number of conditional dependencies (parents) to one or two
per node and sets a hard limit on the number of steps to run the
Markov chain Monte Carlo sampler (1.1 × 10^5 steps with 10,000
steps discarded as “burn-in”). Researchers will often want to ana-
lyze more than 500 sequences and/or to customize various aspects
of the Bayesian network analysis, such as focusing on a subset of
codon sites or analyzing a protein sequence alignment. Hence, our
purpose here is to provide the resources and methods for running
Spidermonkey as a standalone analysis pipeline that can be custo-
mized to a specific problem.
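Before deciding between the web server and a standalone run, it can be useful to check an alignment against the server limits quoted above (500 sequences and 1000 codons). The sketch below assumes a FASTA-formatted codon alignment; the function names and the toy alignment are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a dictionary of name -> sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def exceeds_server_limits(records, max_seqs=500, max_codons=1000):
    """True if the alignment is too large for the limits quoted in the text."""
    n_codons = max(len(seq) for seq in records.values()) // 3
    return len(records) > max_seqs or n_codons > max_codons


# Hypothetical two-sequence codon alignment
toy = ">seq_1\nATGGCC\nAAA\n>seq_2\nATGGCT\nAAA\n"
records = read_fasta(toy)
print(len(records), "sequences; exceeds limits:", exceeds_server_limits(records))
```

An alignment for which this check returns True needs the standalone pipeline described in the remainder of this section.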
2.1 Obtaining HyPhy and Scripts
HyPhy binaries can be downloaded for free at http://hyphy.org. If you
want to compile the software package, the source code can be
obtained at http://github.com/veg/hyphy. Alternatively, a
POSIX-threaded HyPhy binary (hyphymp) for Linux can be obtained with a
package manager; for example, in Ubuntu: sudo apt install
hyphy-pt. The scripts and data used here are available at
http://github.com/PoonLab/comet-prot. If you are running HyPhy from
the command line, then all commands should specify the path to your
local installation, e.g.: HYPHYMP BASEPATH=/usr/local/lib/hyphy
<path to script>.
1
The scripts in this chapter were tested with HyPhy version 2.220170201beta and release 2.2.7. HyPhy is a large
and complex software package that is constantly undergoing development by a small team of researchers and
programmers, and some of the more specialized features such as BGMs may temporarily break as newer versions
are released. If you compiled HyPhy from source, make sure that you are using a single-threaded (HYPHYSP) or
multiprocessing-enabled (HYPHYMP) build and not a message passing interface (MPI)-enabled (HYPHYMPI)
build; at the time of writing, there were residual issues in the source code related to MPI processing. If you
encounter any other problems, please submit an issue at https://github.com/veg/hyphy/issues and we will
attend to it as soon as possible.
Otherwise, you can run the scripts through the graphical user
interface by opening the file through the file selection dialog (⌘-O
on macOS, Ctrl-E on Windows, or File > Open > Open Batch
File...).
2.2 Preparing Input Data
To run this analysis, you need to have a codon sequence alignment
and a phylogenetic tree relating these sequences. A codon sequence
has a single reading frame, excluding any frameshifts or stop
codons. In other words, the first three bases should map to a
codon, and so on. It does not have to cover the entire gene. Any
stop codons need to be replaced with gaps (interpreted as missing
data); otherwise, the entire codon site will be stripped from the
alignment, throwing out useful data and making it difficult to inter-
pret the end result of the analysis. HyPhy also has strict requirements
on sequence names, which cannot contain any characters other than
the alphanumeric characters and the underscore character “_”. This
name restriction also applies to tip labels in the tree, so it is often
more convenient to reconstruct a tree after the following step.
A convenient tool for adjusting sequence names and simultaneously
replacing stop codons with gap characters is provided in the HyPhy
standard library. In the GUI, open the Standard Analysis menu by pressing
⌘-E (macOS) or Ctrl-E (Windows), or by selecting Analysis > Standard
Analyses...; then expand the “Data File Tools” tab and select
CleanStopCodons.bf. From the command line, you can launch an interactive
menu by calling the HyPhy executable (HYPHYMP, or hyphymp if you used a
package manager) and then selecting option (4) Data File Tools followed
by option (6) to run the same script. You will be prompted to specify a
genetic code and codon data file (see below). The last query is whether to
discard duplicate sequences and/or codon sites that are entirely gaps.
Duplicate sequences cannot be separated in a phylogeny, so unless you will
be using a tree relating these sequences based on additional information,
there is no reason to retain all copies for the analysis. Similarly,
entirely gapped codon positions are not phylogenetically informative and
may be dropped unless they are needed to preserve the coordinate system of
the alignment.
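The cleaning step described above can be sketched in a few lines of Python. This is only an illustration of the three operations (renaming, masking stop codons, dropping duplicates), not the code of CleanStopCodons.bf; the function names are hypothetical, and the stop codon set assumes the universal genetic code.

```python
import re

# Stop codons under the universal genetic code (assumption: other codes differ).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def clean_name(name):
    """Replace every character that is not alphanumeric or '_' with '_'."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name)


def mask_stop_codons(seq):
    """Replace in-frame stop codons with gaps, reading from the first base."""
    seq = seq.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join("---" if codon in STOP_CODONS else codon for codon in codons)


def clean_alignment(records):
    """records: list of (name, sequence) pairs; exact duplicate sequences are dropped."""
    seen = set()
    cleaned = []
    for name, seq in records:
        masked = mask_stop_codons(seq)
        if masked in seen:
            continue  # keep only the first copy of each distinct sequence
        seen.add(masked)
        cleaned.append((clean_name(name), masked))
    return cleaned
```

Because names are rewritten here, any tree built beforehand would no longer match the alignment, which is why the tree is best reconstructed after this step.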
A phylogenetic tree can be reconstructed from the sequence alignment using
any standard maximum likelihood program such as RAxML [35]
(https://github.com/stamatak/standard-RAxML) or PhyML [36]
(https://github.com/stephaneguindon/phyml).2 In addition, the tree should
not contain bootstrap support values [38]. HyPhy interprets these values
as internal node labels and does not allow duplicate labels.3
2
For this type of analysis, we prefer using maximum likelihood (ML) methods to reconstruct trees. If it is not
feasible to use ML methods due to excessive numbers of sequences and/or sequence lengths, we suggest using the
approximate ML program FastTree 2 [37], which can be orders of magnitude faster than the standard ML
programs. Neighbor-joining (NJ) methods also scale favorably with larger alignments, but tend to be less accurate
for reconstructing branch lengths. While there are NJ and ML tree reconstruction methods implemented in HyPhy,
they are not as efficient as these specialized programs and we do not recommend using them for larger data sets.
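As an illustration of the requirement that the tree carry no bootstrap support values, the following Python sketch strips numeric labels that directly follow a closing parenthesis in a Newick string. The function name is hypothetical, and the sketch assumes that internal nodes carry only numeric support values (no text labels or bracketed comments).

```python
import re


def strip_support_values(newick):
    """Remove numeric labels that directly follow a closing parenthesis."""
    # The lookahead keeps the character after the label (':', ',', ')' or ';').
    return re.sub(r"\)[0-9.]+(?=[:;,)])", ")", newick)
```

Branch lengths are left untouched because they follow a colon rather than a closing parenthesis.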
2.3 Fit Codon Model
The first step in our analysis pipeline is to fit a codon substitution
model [39] to the sequence alignment by running the script
fit_codon_model.bf (which depends on the utility file
fit_codon_model.ibf).4 Although there are standard methods for this task in
the default HyPhy menu, we implemented a customized method that constrains
the branch lengths in the input tree to be rescaled by a global factor.5
This yields significant savings in computing time, since we don’t need to
re-estimate the length of every branch in the tree.
2.3.1 Choose a Genetic Code
In most cases, we will select option 1, the universal genetic code.
However, there is a large selection of genetic codes available in HyPhy,
and selecting an appropriate code is important for this analysis because
it will determine how nucleotide substitutions are interpreted as
missense, nonsense, or silent mutations.
2.3.2 Specify a Codon Data File
Enter a relative or absolute path6 to the file containing the cleaned
sequence alignment, or if using the GUI, use the filesystem dialog to
navigate to the file. Again, we assume that this alignment comprises codon
sequences with a consistent reading frame. The presence of frameshifts due
to alignment errors or actual sequence insertions/deletions will prevent
HyPhy from correctly reconstructing non-synonymous and synonymous
substitutions.
2.3.3 Model Options
This option determines how the model parameters are distributed across
branches in the tree. The “Local” option assigns an instance of each
parameter to every branch in the tree. For example, if we are fitting a
model with a transition/transversion bias parameter, then this bias will be
estimated independently for every branch. While this results in a more
flexible model, there is a greater danger of overfitting your data and we
seldom use this approach. The “Global” option means that each model
parameter is estimated using the information from all branches in the tree.
This is the other extreme that results in a simpler but less flexible
model. The remaining two options represent a compromise between these two
extremes by allowing rates of evolution to vary across sites in the
alignment.7 The “Global w/variation” option models this rate variation
using one of many parametric distributions, such as the prototypical gamma
distribution that was first proposed by Ziheng Yang [40]. The “Global
w/variation+HM” option uses a hidden Markov model to smooth the assignment
of rate categories along the length of the sequence alignment [41], such
that adjacent sites will tend to be assigned to the same rate category.
3
A bootstrap support value is an empirical measure of confidence in a specific clade given the data. Most
phylogeny reconstruction programs should have an option to omit these values. If you already have a Newick
tree file and you just need to remove the support values, you can use the following UNIX command:
sed -E 's/\)[0-9.]+:/):/g' [input] > [output].
4
From this point onward, we assume that you are using the command-line interface. Unfortunately, this script
may not work properly with the GUI because of how HyPhy handles file paths. Even on the command line, this is
not straightforward. For example, we used the following invocation in the macOS Terminal:
HYPHYMP BASEPATH=/usr/local/lib/hyphy/ `pwd`/fit_codon_model.bf
If you want to take advantage of a multi-core CPU, you can add the argument CPU=[number of cores] immediately
after HYPHYMP. Note that not all steps in this analysis are able to utilize multiple threads.
5
If you want to examine this scaling factor, you can find it in the serialized likelihood function generated by this
script by searching for the parameter name scalingB.
6
If you’re using an operating system with a desktop environment, it’s often easier to drag the icon representing
your file into the terminal window instead of typing out the corresponding path. This works when running HyPhy
on the command line, but you need to use backspace to remove the space that is automatically appended to the end
of the path. HyPhy won’t be able to locate the file otherwise.
2.3.4 Nucleotide Model
The codon substitution model implemented in these scripts has a nested
model of nucleotide substitution8 that needs to be specified by the user.
This step uses the 6-digit PAUP*-style model specification string [42],
which defines equality constraints for the six symmetric substitution rates
in alphabetical order: A ↔ C, A ↔ G, A ↔ T, C ↔ G, C ↔ T, and G ↔ T. For
example, the Tamura-Nei model [43] is specified by the string 010020: all
the nucleotide transversions share a single rate identifier (0), while the
two kinds of transition (A ↔ G and C ↔ T) each receive their own
identifier. The most appropriate nucleotide model can be determined using a
model selection method such as ModelTest [44].
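To make the specification string concrete, the sketch below groups the six substitution rates by their identifier. It is an illustrative helper with hypothetical names, not part of HyPhy or PAUP*; it only encodes the alphabetical ordering of rates stated above.

```python
# Order of the six symmetric rates in a PAUP*-style specification string.
RATE_ORDER = ["AC", "AG", "AT", "CG", "CT", "GT"]


def parse_model_string(spec):
    """Group the six substitution rates by the identifier assigned in spec."""
    if len(spec) != 6 or not spec.isdigit():
        raise ValueError("expected a string of six digits, e.g. 010020")
    classes = {}
    for pair, label in zip(RATE_ORDER, spec):
        classes.setdefault(label, []).append(pair)
    return classes


if __name__ == "__main__":
    # Tamura-Nei: one shared transversion rate, two separate transition rates.
    print(parse_model_string("010020"))
```

Rates that share an identifier are constrained to be equal, so the number of distinct identifiers gives the number of rate classes in the nucleotide model.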
2.3.5 Specify a Tree File
At the prompt, enter a relative or absolute path to the file containing the
reconstructed phylogeny in a Newick tree string format. The tip labels in
this tree need to correspond one-to-one with the sequence labels in your
alignment file.9
2.3.6 Fit a Likelihood Function
Finally, you are prompted to specify a relative or absolute path to a file
to write a serialized likelihood function, which encodes the data, model,
and parameter estimates.10 After providing a file path, the analysis will
run and eventually converge to the maximum likelihood estimates of the
model parameters. It is usually a good idea to open the likelihood function
output in a text editor and inspect the parameter estimates. This output is
written in a NEXUS format11 [45]. The parameter estimates can be found at
the top of the HYPHY block. One useful diagnostic is to examine the
estimate of the non-synonymous/synonymous rate ratio R. An R value of less
than 1 indicates that most of the alignment has undergone purifying
selection. In addition, we often look for the transversion rate estimates
to be less than 1 (where the reference rate of A ↔ G transitions is fixed
to 1).
7
Prior to version 2.3.4, the text in HyPhy implied that these options allow rates to vary among branches, not sites:
“...branch lengths come from a user-chosen distribution.” We have revised this help text as of version 2.3.4 to
indicate that the distributions are used to model rate variation across sites, not branches.
8
A standard codon model is described by a 61-by-61 transition rate matrix and a single parameter R that
corresponds to the ratio of non-synonymous to synonymous substitution rates. The model assumes that the
system moves from one codon to another by single nucleotide substitutions; codon substitutions that require
more than one nucleotide change are not allowed.
9
Some phylogeny reconstruction programs truncate sequence labels and cause an error at this stage; for
example, neither RAxML nor FastTree 2 will read sequence labels beyond a whitespace character. A quick fix in
this situation is to replace all whitespace characters with underscores in a text editor or with sed.
10
By convention, we use the file extension .lf and keep the same basename as the codon data file. This makes it
easier to track files that belong to the same workflow.
2.4 Map Substitutions to the Tree
The next step in our pipeline is to reconstruct ancestral sequences in the
tree based on the maximum likelihood parameter estimates of the model [46].
If the descendant sequence has a different codon than its ancestor, then we
infer that at least one substitution has occurred along the intervening
branch in the tree [47]. This step is implemented by the script
MapMutationsToTree.bf. Upon running the script, the user is prompted to
provide a relative or absolute path to the file containing the serialized
likelihood function from the previous step.
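The comparison of ancestral and descendant codons along one branch can be sketched as follows. This is not the code of MapMutationsToTree.bf, which also performs the ancestral reconstruction itself; the function names are hypothetical, the two sequences are assumed to be already reconstructed and aligned in frame, and the translation table is the universal genetic code.

```python
from itertools import product

# Universal genetic code in the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_change(ancestor, descendant):
    """Return None, 'synonymous' or 'non-synonymous' for one codon on one branch."""
    if ancestor == descendant:
        return None
    if ancestor not in CODON_TABLE or descendant not in CODON_TABLE:
        return None  # gaps or ambiguous bases are treated as missing data
    if CODON_TABLE[ancestor] == CODON_TABLE[descendant]:
        return "synonymous"
    return "non-synonymous"


def map_branch(ancestor_seq, descendant_seq):
    """Binary row for one branch: 1 where a non-synonymous change is inferred."""
    row = []
    for i in range(0, len(ancestor_seq), 3):
        kind = classify_change(ancestor_seq[i:i + 3], descendant_seq[i:i + 3])
        row.append(1 if kind == "non-synonymous" else 0)
    return row
```

Applying map_branch to every ancestor-descendant pair in the tree would give one row per branch and one column per codon site.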
2.4.1 Select Reconstruction Option
HyPhy implements the fast joint ancestral reconstruction algorithm
formulated by Tal Pupko and colleagues [48]. Our script prompts the user to
decide whether to sample ancestral sequences from the posterior
distributions at each node of the tree. Sampling enables us to accommodate
the uncertainty in reconstructing ancestral states, which is exacerbated
for ancestral nodes that are further back in time relative to the observed
sequences. On the other hand, each sample will comprise a set of ancestral
sequences that compounds the number of replicate analyses to be performed
further along the pipeline. We recommend using your discretion for this
step: if it is likely that the most recent common ancestor is separated
from all the observed sequences by excessive amounts of evolutionary time,
then it may be important to sample ancestral states for a more robust but
time-consuming analysis.
2.4.2 Output Options
This script was designed to generate two different kinds of outputs. The
first option is to generate a binary matrix where each row corresponds to a
branch of the tree, and each column corresponds to a codon site in the
alignment. This matrix is written to the output file in a comma-separated
tabular (CSV) format. A 1 indicates that a non-synonymous substitution was
mapped to the respective branch and site. This matrix output is the raw
material for a BGM analysis, where each codon site is a potential node in
the graph and the branches represent independent observations. The second
option is to output a tab-delimited tabular file where each row corresponds
to an inferred non-synonymous substitution, and the columns correspond to
the replicate number (if sampling), branch label, site, and the ancestral
and derived amino acids.12
If the first option is selected, then the user is prompted to specify if
they want the CSV to begin with a header row. Each entry of the header row
corresponds to the amino acid at each position of the ancestral sequence
reconstructed at the root of the tree.13
11
NEXUS is a widespread format with known issues with standardization and usability, and has been implemented
in diverse and often incompatible ways by multiple programs.
2.5 BGM Analysis
A Bayesian graphical model (BGM) analysis is implemented in the script
bayesgraph.bf, which depends on the utility (“include”) file
bayesgraph.ibf. These scripts were designed to emulate the workflow
provided by the Spidermonkey application on our datamonkey.org webserver.
We note here, however, that BGM analyses in HyPhy are more versatile than
our example demonstrates.14
2.5.1 Input Data Matrix
First, the user is prompted for a relative or absolute path to the CSV file
containing the substitution map matrix that was produced by the
MapMutationsToTree.bf script. If the CSV does not contain a header row with
column labels that indicate what each variable represents, then the
variables will be assigned integer labels. It is preferable to use the
ancestral residue labels generated by the MapMutationsToTree.bf script
because we are going to filter out columns based on the number of inferred
non-synonymous substitutions, and these labels keep each remaining column
tied to its original codon site.
2.5.2 Filter Sites
Next, the program will prompt you for the minimum number of substitutions
for a site to be included in the BGM model. This cutoff cannot be less than
1 because sites without any non-synonymous substitutions contain no
information for inferring conditional dependencies. The script
automatically determines the maximum cutoff based on the largest number of
substitutions mapped at any single codon site. Once the user selects a
number in this range, the script will filter sites that do not meet this
cutoff from the data set and populate a BGM model with the remaining
variables.15
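The filtering rule can be illustrated with a short Python sketch that operates on the binary matrix held in memory as a list of rows. It is not the bayesgraph.bf implementation; the function name and the example labels are hypothetical.

```python
def filter_sites(labels, matrix, min_subs):
    """Keep the columns (codon sites) with at least min_subs substitutions.

    labels: one label per column; matrix: list of rows of 0/1 values,
    one row per branch. Returns the kept labels, the reduced matrix and
    the largest column total (the upper bound for the cutoff).
    """
    if min_subs < 1:
        raise ValueError("the cutoff cannot be less than 1")
    counts = [sum(column) for column in zip(*matrix)]
    keep = [j for j, n in enumerate(counts) if n >= min_subs]
    kept_labels = [labels[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_labels, kept_matrix, max(counts, default=0)


if __name__ == "__main__":
    labels = ["M1", "S2", "Y3"]  # hypothetical ancestral residue labels
    matrix = [[1, 0, 1], [0, 0, 1], [1, 0, 0]]
    print(filter_sites(labels, matrix, min_subs=2))
```

Raising the cutoff shrinks the number of variables, which helps keep the number of nodes in the model from greatly exceeding the number of observations.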
12
We have previously found this list output to be a more convenient format for debugging the script. It’s usually a
good idea to manually compare entries in this list against your sequence alignment to make sure that things make
sense.
13
Most phylogenetic tree reconstruction methods, such as maximum likelihood or neighbor-joining, will output
an unrooted tree. For an unrooted tree, the labels will be generated for the deepest internal node.
14
For example, you can customize on a node-by-node basis the number of “parental” nodes on which a given
node can be conditionally dependent. You can also load a serialized BGM from an XML Bayesian Interchange
format file and use this model to simulate additional data sets. For more details, please refer to the file
bayesgraph.ibf and the batch file tests/hbltests/BayesianGraphicalModels/TestBGM.bf in the HyPhy source code
distribution.
15
As a general rule of thumb, we try to not build a BGM model that has many more nodes than observations. The
number of substitutions provides a meaningful criterion for reducing the dimensionality of our data.
2.5.3 MCMC Settings
There are four settings that the user needs to specify for running a Markov
chain Monte Carlo (MCMC) sample. First, the user has to specify the maximum
number of parents that will be allowed per node. This determines the
complexity of the analysis. An analysis with a one-parent maximum per node
will run very fast and scales easily with large numbers of variables, but
loses the sensitivity to detect complex interactions among nodes.
Conversely, an analysis that allows many more parents per node is far more
computationally complex.16 Second, the user needs to indicate the number of
steps to discard as a “burn-in” period. This budgets an amount of time that
one estimates it will require for this random walk to travel from its
initial point to a “reasonable” area of model space. Third, the user needs
to specify the number of steps to run the chain sample following the
“burn-in” period. This length sets an upper limit to the effective sample
size, which will almost surely be much smaller because of the highly
autocorrelated nature of MCMC. Lastly, the user must specify the number of
steps to extract from this chain sample. Because of autocorrelation in the
chain, there is usually no benefit in retaining every step. To reduce the
output file sizes and increase the efficiency of post-processing, it is
standard practice to reduce the chain by sub-sampling at regular intervals.
The script defaults to a sub-sample of 100 steps, which results in gaps of
1000 steps (see Fig. 1). The user should adjust the size of the sub-sample
roughly in proportion to the length of the post-burn-in chain sample.
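The relationship between chain length, sub-sample size, and the gap between retained steps can be written out as a small Python sketch. The helper names are hypothetical, and exactly which steps bayesgraph.bf retains is an assumption here; the sketch keeps the last step of each interval.

```python
def thinning_interval(chain_length, n_samples):
    """Number of MCMC steps between consecutive retained samples."""
    if n_samples < 1 or n_samples > chain_length:
        raise ValueError("n_samples must be between 1 and chain_length")
    return chain_length // n_samples


def thin(chain, n_samples):
    """Retain n_samples evenly spaced steps from a post-burn-in chain."""
    step = thinning_interval(len(chain), n_samples)
    return chain[step - 1::step][:n_samples]


if __name__ == "__main__":
    # A chain of 10^5 steps thinned to 100 samples keeps every 1000th step.
    print(thinning_interval(10**5, 100))
```

Doubling the length of the post-burn-in chain while doubling the sub-sample size leaves the interval, and hence the residual autocorrelation between retained steps, unchanged.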
2.5.4 Output Settings
The bayesgraph.bf script generally produces three kinds of outputs. The
script will prompt the user for only one relative or absolute path for an
output file, and paths for the other output files will automatically be
generated based on this first path. First, the script will output the
marginal posterior probabilities for directed edges as a CSV formatted
file. This is the raw material for assembling the consensus BGM. Next, the
script will write this consensus BGM using the network visualization
language DOT, which can be converted into an image by several programs such
as GraphViz [49], Cytoscape [50], and Gephi [51]. Finally, the script will
record the posterior probability trace for all steps sub-sampled from the
original MCMC sample. This is important information for assessing the
convergence of the chain sample to the posterior distribution (e.g.,
Fig. 1).
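To show what a consensus graph in DOT looks like, the sketch below writes one directed edge per pair of sites whose marginal posterior probability exceeds a cutoff. It is not the output routine of bayesgraph.bf: the function name, the (parent, child, probability) tuple layout, and the 0.9 default cutoff are assumptions made for illustration.

```python
def consensus_dot(edges, cutoff=0.9):
    """Build a DOT digraph from (parent, child, probability) tuples.

    Only edges whose marginal posterior probability exceeds cutoff are
    written; each edge is labelled with its probability as a percentage.
    """
    lines = ["digraph consensus {"]
    for parent, child, mpp in edges:
        if mpp > cutoff:
            label = round(mpp * 100)
            lines.append(f'    "{parent}" -> "{child}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical edge list; real values come from the edge CSV output.
    print(consensus_dot([("A442", "A450", 0.97), ("M1", "S2", 0.50)]))
```

The resulting text can be saved to a file and rendered by any program that reads DOT.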
16
This is where the ability to customize the analysis implemented in the bayesgraph.bf script can be very
useful. If you have prior information that a subset of codon sites are involved in a large number of interactions, the
computational complexity of increasing the number of parents can be greatly reduced by modifying this parameter
for only these sites.
[Figure 1 plot: autocorrelation (y-axis) against lag (x-axis, 0 to 1000); inset: posterior probability against MCMC step (6000 to 8000)]
Fig. 1 Autocorrelation in a Markov chain Monte Carlo (MCMC) sample from a
BGM analysis. The x-axis represents the number of steps separating
observations in the sub-sample (lag), and the y-axis measures the
autocorrelation between steps in the sub-sampled chain that are separated
by that lag. We observe strong autocorrelation when the lag is short
because the MCMC random walk takes short steps in model space. On average,
we need to have taken at least 1000 steps until the posterior probability
at the current step is not informed by the previous state. The inset figure
displays an interval of the MCMC sample, where the autocorrelation between
steps is more apparent.18
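The quantity plotted in Fig. 1 can be computed from a posterior probability trace with the standard sample autocorrelation estimator, sketched below in plain Python. The function name is hypothetical, and this is a generic estimator rather than the code used to draw the figure.

```python
def autocorrelation(trace, lag):
    """Sample autocorrelation of a numeric trace at the given lag."""
    n = len(trace)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(trace)")
    mean = sum(trace) / n
    total = sum((x - mean) ** 2 for x in trace)
    if total == 0:
        raise ValueError("autocorrelation is undefined for a constant trace")
    # Sum of products of deviations for pairs of steps separated by lag.
    cross = sum((trace[i] - mean) * (trace[i + lag] - mean) for i in range(n - lag))
    return cross / total
```

Evaluating this function over a range of lags and looking for the lag at which the value settles near zero gives a rough guide to a suitable thinning interval.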
3 Example
Hepatitis C virus (HCV) is a rapidly evolving RNA virus that can establish
persistent chronic infections in human hosts [52]. Over 70 million people
worldwide are estimated to be infected with actively replicating HCV, as
diagnosed by the presence of viral RNA in the bloodstream [53]. Comparative
methods to detect amino acid coevolution in HCV proteins have been utilized
to find potential therapeutic targets [54], to associate this variation
with treatment outcomes [55], and to explain the global prevalence of drug
resistance-associated mutations [56], to name a few applications. In this
example, we work through a BGM analysis of amino acid coevolution in HCV
nonstructural (NS) protein 5b, an RNA-dependent RNA polymerase that is a
major target for the new generation of direct-acting antivirals [57].
18
In an MCMC run, we observe autocorrelation when we sample parameter values that are very close in the
parameter space and unrepresentative of the true underlying posterior distribution. Therefore, we try to decrease
autocorrelation so that the MCMC sample provides a more precise estimate of the posterior distribution. One way
to accomplish this is by down-sampling to every n-th step.
We first queried the euHCVdb database [58]
(https://euhcvdb.ibcp.fr/euHCVdb/) to retrieve GenBank accession numbers
for HCV subtype 1b nucleotide sequences with at least partial coverage of
the NS5b gene. Next, we used the Entrez batch interface to retrieve the
corresponding FASTA records from GenBank given the list of accession
numbers (HCV1b-NS5b.fasta).19 We used a fast pairwise alignment method to
extract the NS5b-encoding region from each sequence relative to the H77
reference (H77-NS5b.txt), then generated a multiple sequence alignment
using MAFFT v7.305b [59] (HCV1b-NS5b.mafft.fa) and manually inspected and
adjusted the resulting alignment with AliView v1.19-beta-3 [60]
(HCV1b-NS5b.aliview.fa). The remaining sequences in this data set
(n = 536) ranged from 1043 to 1776 in nucleotide length. We used the
built-in method in HyPhy (see Subheading 2.2) to clean sequence names and
remove stop codons and shortened the FASTA identifiers to facilitate
subsequent phylogenetic analyses (HCV1b-NS5b.cleaned.fa). To quickly screen
for potential alignment errors or subtype misclassifications, a preliminary
phylogenetic tree was reconstructed by approximate maximum likelihood with
FastTree 2 [37] and examined visually.
Next, a maximum likelihood phylogenetic tree (Fig. 2) was reconstructed
using PhyML (v20160207, [36]) with a bootstrap analysis
(HCV1b-NS5b.phyml.nwk). Using jModeltest v2.1.10, we selected a TPM2 model
(PAUP* specification 010212) incorporating invariant sites and a gamma
distribution (TPM2+I+G) based on the Akaike information criterion [61, 62].
The likelihood function was constructed and optimized in HyPhy using the
script fit_codon_model.bf with the universal genetic code option. We
selected for parameters to be globally constrained and for rate variation
across sites to be modeled by the Gamma+Invariant model with 4 rate
classes. We then entered the 6-character model designation (010212)
corresponding to the TPM2 model. Lastly, we input the file containing the
PhyML tree reconstruction and specified the output path to a new file
HCV1b-NS5b.lf.
To extract the map of non-synonymous substitution events to branches of the
phylogenetic tree (MapMutationsToTree.bf), we input the serialized
likelihood function from the last step and selected maximum likelihood for
the ancestral reconstruction option. Next, we selected a binary matrix CSV
file output with automatically generated column labels and indicated that
this output should be written to the file HCV1b-NS5b.csv.
19
We have provided most of the data files in this example on our GitHub repository at https://github.com/
PoonLab/comet-prot/tree/master/data.
Fig. 2 Excerpt of phylogenetic tree reconstructed from the HCV sequence data using PhyML. The tree was
rooted at the midpoint (the halfway point on the longest path separating two tips in the tree). This image was
generated using the R package ggtree [63]. Branches are colored with respect to the number of
non-synonymous substitutions (increasing from blue to red). The shape of the tree is generally consistent
with the sequences belonging to a single HCV subtype. However, the tree also contains some clusters (inset) of
highly related sequences that may represent multiple sequences from the same individuals, or recent
transmission outbreaks of HCV
The resulting matrix was used as the input file for our BGM analysis script
bayesgraph.bf. Since we used the previous script to generate column labels
from the reconstructed ancestral amino acid sequence, we specified that the
input file contained a header row. To verify that these column labels
corresponded to the protein NS5b of HCV subtype 1b, we concatenated these
labels into an amino acid sequence using a search-and-replace method20 and
submitted the resulting sequence to a BLASTP search
(http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins). The
highest scoring match was an HCV subtype 1b NS5b sequence
(GenBank accession number BAW81715) with a 97% (573/591)
amino acid identity.
The bayesgraph.bf script indicated that our data matrix contained 1069
cases (rows) and 592 variables (columns). We set the minimum number of
non-synonymous substitutions per codon site to 2, which resulted in 231
variables after filtering conserved sites. Next, we set the maximum number
of parents per node to 2. We ran two replicate chain samples under the
default settings (a 10^4-step “burn-in” followed by 10^5 steps thinned to
100). To assess convergence to the posterior distribution, we compared the
log-likelihood traces associated with each chain and obtained a potential
scale reduction factor of 0.996 (upper 95% confidence limit = 0.998) using
the coda package in R [64],21 where factors in excess of 1 imply a lack of
convergence [65].
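For readers who want to see what the potential scale reduction factor measures, the following Python sketch implements the basic Gelman-Rubin statistic from within-chain and between-chain variances. The function name is hypothetical, and the sketch omits the sampling-variability correction that coda's gelman.diag applies, so its values will not exactly match the R output.

```python
from statistics import mean, variance


def psrf(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    # W: average within-chain sample variance.
    w = mean(variance(chain) for chain in chains)
    # B/n: sample variance of the chain means.
    b_over_n = variance([mean(chain) for chain in chains])
    # Pooled estimate of the posterior variance.
    var_hat = (n - 1) / n * w + b_over_n
    return (var_hat / w) ** 0.5
```

When the chains agree closely, the between-chain term is negligible and the statistic can fall slightly below 1, as with the value of 0.996 reported above; chains exploring different regions give values well above 1.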
A consensus Bayesian graphical model of the two chains is shown in Fig. 3.
This network visualization was generated using the script make_dotfile.py.
The HCV NS5b structure presents discernible fingers, palm and thumb
subdomains, and a biologically significant C-terminal region that lines the
RNA binding cleft in the active site. Our analysis identified 32 coevolving
pairs of sites, and most of them involve amino acid pairs located in the
same subdomain (23, of which 12 lie in the fingers, 4 in the palm, 3 in the
thumb, and 4 in the C-terminal region). Five of the pairs involve sites
belonging to the fingers and palm regions, while none involved fingers and
thumb; this is quite unexpected given the extensive interactions between
these regions, which are responsible for the formation of an encircled
active site [66]. One pair involves a flanking amino acid of the β-hairpin
region (A442) and a site included in the same region (A450); the β-hairpin
plays an important role in positioning the 3′ terminus of the viral RNA
genome [67]. In general, our results support the finding that coevolving
sites are not restricted to residues that are in physical contact and
likely responsible for the structural stability of the protein. We found a
few examples of coevolving pairs of residues involving the β-hairpin and
C-terminal regions that may be functionally important for the NS5b protein,
or that lie in proximity to sites that have a biological function.
20
To generate an amino acid sequence from the column labels, we used the regular expression “[0-9]+,*” to
replace all instances with an empty string. In Python, this can be achieved with the re module:
seq = re.sub('[0-9]+,*', '', header.strip()), where header is a string variable containing the first line of the
CSV file.
21
This can be accomplished with the following R commands:
require(coda)
chain1 <- read.csv("chain1.trace.csv", header=FALSE)
chain2 <- read.csv("chain2.trace.csv", header=FALSE)
chains <- mcmc.list(mcmc(chain1$V1), mcmc(chain2$V1))
gelman.diag(chains, autoburnin=FALSE)
where the file names may be different for your run.

Fig. 3 Visualization of a consensus Bayesian graphical model of residue–residue interactions in HCV1b NS5b
proteins. Each node represents a codon site in the NS5b protein, labelled with the ancestral residue and
position. The size of the node is scaled to the log-transformed number of non-synonymous substitutions
reconstructed at the respective codon sites. Nodes are colored with respect to the NS5b domains: fingers
(red), palm (green), and thumb (blue); nodes representing residues in the C-terminal tail are left uncolored.
Arrows (edges) are drawn to represent inferred coevolution between the respective codon positions. The
edges are annotated with the marginal posterior probabilities (MPP, %). Only edges with an MPP value
exceeding 90% were included in this graph
Acknowledgements
This study was supported in part by the Government of Canada through Genome
Canada and the Ontario Genomics Institute (OGI-131), and by grants from the
Canadian Institutes of Health Research (PJT-153391 and BOP-149562). AFYP
was supported by a CIHR New Investigator Award (FRN-130609).
References
1. Kihara D (2005) The effect of long-range interactions on the secondary structure formation of proteins. Protein Sci 14(8):1955–1963
2. Sprinzak E, Margalit H (2001) Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 311(4):681–692
3. Horner DS, Pirovano W, Pesole G (2007) Correlated substitution analysis and the prediction of amino acid structural contacts. Brief Bioinform 9(1):46–56
4. Taylor WR, Hamilton RS, Sadowski MI (2013) Prediction of contacts from correlated sequence substitutions. Curr Opin Struct Biol 23(3):473–479
5. Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Nat Biotechnol 30(11):1072–1080
6. De Juan D, Pazos F, Valencia A (2013) Emerging methods in protein co-evolution. Nat Rev Genet 14(4):249
7. Göbel U, Sander C, Schneider R, Valencia A (1994) Correlated mutations and residue contacts in proteins. Proteins Struct Funct Bioinf 18(4):309–317
8. Korber B, Farber RM, Wolpert DH, Lapedes AS (1993) Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc Natl Acad Sci 90(15):7176–7180
9. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K (2002) A comprehensive review of genetic association studies. Genet Med 4(2):45–61
10. Kowarsch A, Fuchs A, Frishman D, Pagel P (2010) Correlated mutations: a hallmark of phenotypic amino acid substitutions. PLoS Comput Biol 6(9):e1000923
11. Weinreich DM, Delaney NF, DePristo MA, Hartl DL (2006) Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312(5770):111–114
12. Ivankov DN, Finkelstein AV, Kondrashov FA (2014) A structural perspective of compensatory evolution. Curr Opin Struct Biol 26:104–112
13. Neher E (1994) How frequent are correlated changes in families of protein sequences? Proc Natl Acad Sci 91(1):98–102
14. Olmea O, Rost B, Valencia A (1999) Effective use of sequence correlation and conservation in fold recognition. J Mol Biol 293(5):1221–1239
15. Atchley WR, Wollenberg KR, Fitch WM, Terhalle W, Dress AW (2000) Correlations among amino acid sites in bHLH protein domains: an information theoretic analysis. Mol Biol Evol 17(1):164–178
16. Tillier ER, Lui TW (2003) Using multiple interdependency to separate functional from phylogenetic correlations in protein alignments. Bioinformatics 19(6):750–755
17. Martin L, Gloor GB, Dunn S, Wahl LM (2005) Using information theory to search for co-evolving residues in proteins. Bioinformatics 21(22):4116–4124
18. Gouveia-Oliveira R, Pedersen AG (2007) Finding coevolving amino acid residues using row and column weighting of mutual information and multi-dimensional amino acid representation. Algorithms Mol Biol 2(1):12
19. Fernandes AD, Gloor GB (2010) Mutual information is critically dependent on prior assumptions: would the correct estimate of mutual information please identify itself? Bioinformatics 26(9):1135–1139
20. Jeong CS, Kim D (2012) Reliable and robust detection of coevolving protein residues. Protein Eng Des Sel 25(11):705–713
21. Felsenstein J (1985) Phylogenies and the comparative method. Am Nat 125(1):1–15
22. Shindyalov IN, Kolchanov NA, Sander C (1994) Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng 7(3):349–358
23. Wollenberg KR, Atchley WR (2000) Separa- maximum-likelihood phylogenies: assessing

tion of phylogenetic and functional associa- the performance of PhyML 3.0. Syst Biol 59
tions in biological sequences by using the (3):307–321
parametric bootstrap. Proc Natl Acad Sci 97 37. Price MN, Dehal PS, Arkin AP (2010) Fas-
(7):3288–3291 tTree 2–approximately maximum-likelihood
24. Gloor GB, Martin LC, Wahl LM, Dunn SD trees for large alignments. PLoS ONE 5(3):
(2005) Mutual information in protein multiple e9490
sequence alignments reveals two classes of coe- 38. Holmes S (2003) Bootstrapping phylogenetic
volving positions. Biochemistry 44 trees: theory and methods. Stat Sci
(19):7156–7165 18:241–255
25. Pollock DD, Taylor WR, Goldman N (1999) 39. Muse SV, Gaut BS (1994) A likelihood
Coevolving protein residues: maximum likeli- approach for comparing synonymous and non-
hood identification and relationship to struc- synonymous nucleotide substitution rates, with
ture. J Mol Biol 287(1):187–198 application to the chloroplast genome. Mol
26. Tuff P, Darlu P (2000) Exploring a phyloge- Biol Evol 11(5):715–724
netic approach for the detection of correlated 40. Yang Z (1993) Maximum-likelihood estima-
substitutions in proteins. Mol Biol Evol 17 tion of phylogeny from DNA sequences when
(11):1753–1759 substitution rates differ over sites. Mol Biol
27. Poon AFY, Lewis FI, Pond SLK, Frost SDW Evol 10(6):1396–1401
(2007) An evolutionary-network model reveals 41. Felsenstein J, Churchill GA (1996) A hidden
stratified interactions in the V3 loop of the Markov model approach to variation among
HIV-1 envelope. PLoS Comput Biol 3(11): sites in rate of evolution. Mol Biol Evol 13
e231 (1):93–104
28. Talavera D, Lovell SC, Whelan S (2015) 42. Swofford D, Begle DP (1993) PAUP: Phylo-
Covariation is a poor measure of molecular genetic analysis using parsimony, Version 3.1,
coevolution. Mol Biol Evol 32(9):2456–2468 March 1993. Center for Biodiversity, Illinois
29. Fodor AA, Aldrich RW (2004) Influence of Natural History Survey
conservation on calculations of amino acid 43. Tamura K, Nei M (1993) Estimation of the
covariance in multiple sequence alignments. number of nucleotide substitutions in the con-
Proteins Struct Funct Bioinf 56(2):211–221 trol region of mitochondrial DNA in humans
30. Pearl J (1986) Fusion, propagation, and struc- and chimpanzees. Mol Biol Evol 10
turing in belief networks. Artif Intell 29 (3):512–526
(3):241–288 44. Posada D (2003) Using MODELTEST and
31. Friedman N, Koller D (2003) Being Bayesian PAUP* to select a model of nucleotide substi-
about network structure. A Bayesian approach tution. Curr Protoc Bioinformatics 6–5.
to structure discovery in Bayesian networks. https://doi.org/10.1002/0471250953.
Mach Learn 50(1–2):95–125 bi0605s00
32. Pond SLK, Frost SDW, Muse SV (2005) 45. Maddison DR, Swofford DL, Maddison WP
HyPhy: hypothesis testing using phylogenies. (1997) NEXUS: an extensible file format for
Bioinformatics 21(5):676–679 systematic information. Syst Biol 46
33. Delport W, Poon AFY, Frost SDW, Kosa- (4):590–621
kovsky Pond SL (2010) Datamonkey 2010: a 46. Joy JB, Liang RH, McCloskey RM, Nguyen T,
suite of phylogenetic analysis tools for evolu- Poon AFY (2016) Ancestral reconstruction.
tionary biology. Bioinformatics 26 PLoS Comput Biol 12(7):e1004763
(19):2455–2457 47. Nielsen R (2002) Mapping mutations on phy-
34. Poon AFY, Lewis FI, Frost SDW, Kosa- logenies. Syst Biol 51(5):729–739
kovsky Pond SL (2008) Spidermonkey: rapid 48. Pupko T, Pe I, Shamir R, Graur D (2000) A
detection of co-evolving sites using Bayesian fast algorithm for joint reconstruction of ances-
graphical models. Bioinformatics 24 tral amino acid sequences. Mol Biol Evol 17
(17):1949–1950 (6):890–896
35. Stamatakis A (2014) RAxML version 8: a tool 49. Ellson J, Gansner E, Koutsofios L, North SC,
for phylogenetic analysis and post-analysis of Woodhull G (2001) Graphviz—open source
large phylogenies. Bioinformatics 30 graph drawing tools. In: International sympo-
(9):1312–1313 sium on graph drawing. Springer, Berlin, pp
36. Guindon S, Dufayard JF, Lefort V, 483–484
Anisimova M, Hordijk W, Gascuel O (2010) 50. Shannon P, Markiel A, Ozier O, Baliga NS,
New algorithms and methods to estimate Wang JT, Ramage D, Amin N,
Schwikowski B, Ideker T (2003) Cytoscape: a euHCVdb: the European hepatitis C virus

software environment for integrated models of database. Nucleic Acids Res 35(Suppl_1):
biomolecular interaction networks. Genome D363–D366
Res 13(11):2498–2504 59. Katoh K, Standley DM (2013) MAFFT multi-
51. Bastian M, Heymann S, Jacomy M et al (2009) ple sequence alignment software version 7:
Gephi: an open source software for exploring improvements in performance and usability.
and manipulating networks. In: Proceedings of Mol Biol Evol 30(4):772–780
the third international ICWSM conference, vol 60. Larsson A (2014) AliView: a fast and light-
8, pp 361–362 weight alignment viewer and editor for large
52. Simmonds P (2004) Genetic diversity and evo- datasets. Bioinformatics 30(22):3276–3278
lution of hepatitis C virus–15 years on. J Gen 61. Darriba D, Taboada GL, Doallo R, Posada D
Virol 85(11):3173–3188 (2012) jModelTest 2: more models, new heur-
53. Blach S, Zeuzem S, Manns M, Altraif I, istics and parallel computing. Nat Methods 9
Duberg AS, Muljono DH, Waked I, Alavian (8):772
SM, Lee MH, Negro F et al (2017) Global 62. Guindon S, Gascuel O (2003) A simple, fast,
prevalence and genotype distribution of hepa- and accurate algorithm to estimate large phy-
titis C virus infection in 2015: a modelling logenies by maximum likelihood. Syst Biol 52
study. Lancet Gastroenterol Hepatol 2 (5):696–704
(3):161–176 63. Yu G, Smith DK, Zhu H, Guan Y, Lam TTY
54. Campo D, Dimitrova Z, Mitchell RJ, Lara J, (2017) ggtree: an R package for visualization
Khudyakov Y (2008) Coordinated evolution of and annotation of phylogenetic trees with their
the hepatitis C virus. Proc Natl Acad Sci 105 covariates and other associated data. Methods
(28):9685–9690 Ecol Evol 8(1):28–36
55. Aurora R, Donlin MJ, Cannon NA, Tavis JE 64. Plummer M, Best N, Cowles K, Vines K
(2009) Genome-wide hepatitis C virus amino (2006) CODA: convergence diagnosis and
acid covariance networks can predict response output analysis for MCMC. R News 6(1):7–11
to antiviral therapy in humans. J Clin Invest 65. Gelman A, Rubin DB (1992) Inference from
119(1):225–236 iterative simulation using multiple sequences.
56. McCloskey RM, Liang RH, Joy JB, Krajden M, Stat Sci 7:457–472
Montaner JS, Harrigan PR, Poon AF (2014) 66. Ranjith-Kumar C, Kao CC (2006) Biochemical
Global origin and transmission of hepatitis C activities of the HCV NS5B RNA-dependent
virus nonstructural protein 3 Q80K polymor- RNA polymerase. In: Tan S (ed) Hepatitis C
phism. J Infect Dis 211(8):1288–1295 viruses: genomes and molecular biology. Hori-
57. Poveda E, Wyles DL, Mena Á, Pedreira JD, zon Bioscience, Norfolk, pp 293–310
Castro-Iglesias Á, Cachay E (2014) Update 67. Hong Z, Cameron CE, Walker MP, Castro C,
on hepatitis C virus resistance to direct-acting Yao N, Lau JY, Zhong W (2001) A novel
antiviral agents. Antivir Res 108:181–191 mechanism to ensure terminal initiation by
58. Combet C, Garnier N, Charavay C, Grando D, hepatitis C virus NS5B polymerase. Virology
Crisan D, Lopez J, Dehne-Garcia A, 285(1):6–11
Geourjon C, Bettler E, Hulo C et al (2006)
Chapter 7
Context-Dependent Mutation Effects in Proteins

Frank J. Poelwijk
Abstract
Defining the extent of epistasis—the nonindependence of the effects of mutations—is essential for under-
standing the relationship of genotype, phenotype, and fitness in biological systems. The applications cover
many areas of biological research, including biochemistry, genomics, protein and systems engineering,
medicine, and evolutionary biology. However, the quantitative definitions of epistasis vary among fields,
and the analysis beyond just pairwise effects can be problematic. Here, we demonstrate the application of a
particular mathematical formalism, the weighted Walsh-Hadamard transform, which unifies a number of
different definitions of epistasis. We provide a computational implementation of such analysis using a
computer-generated higher-order mutational dataset. We discuss general considerations regarding the
null hypothesis for independent mutational effects, which then allows a quantitative identification of
epistasis in an experimental dataset.
Key words Epistasis, Higher-order epistasis, Context-dependent mutations, Amino acid interactions,
Evolutionary biology, Fitness, Combinatorial mutagenesis
1 Introduction
From the lowest to the highest level of biological organization,
from biomolecules to ecosystems, the world is shaped by interactions,
so that a biological system as a whole is not simply the sum of its
parts. Interactions may lead to unexpected behavior
and complex dynamics, which represent both experimental and
conceptual challenges for the development of predictive models.
To unravel these complexities, a general strategy is to separate the
consequences of modifying individual components into direct/
independent effects and effects that are dependent on the configu-
ration of other components. In genetics, the context-dependent
effects are referred to as “epistasis,” a term coined by William
Bateson in 1907 to indicate a genetic interaction in which the state
of one allele “masks” the effect of an allele at another locus [1]. This
definition was later formalized by Ronald Fisher to refer to a
statistical interaction that causes a deviation from additivity [2].
Since then, many alternative measures for epistasis have been
developed (see, e.g., [3, 4]), which have been used with various
levels of quantitativeness. One aspect that often remains under-
exposed is the establishment of an explicit null model: what does
it mean quantitatively for two mutations to act independently? In
most cases independence is equated with additivity or multiplica-
tivity of effects; however, without consideration of the underlying
system, such choices are arbitrary. Additionally, empirical datasets
usually exhibit overall nonlinearities, for example, due to a limited
linear range of the measurement or to a saturating organismal
fitness, which, if ignored, lead to an overestimate of the prevalence
of epistasis.
In principle, quantitative information on epistasis should help
make meaningful descriptions of proteins and, potentially, more
complex biological systems by capturing the complexity without
an overabundance of parameters. For example, in ref. 5, it was
shown how epistatic analysis can identify the cooperative unit in a
PDZ-binding domain and a potassium ion channel. Many fields of
study should benefit from a firmer grasp of the “typical” level and
distribution of epistatic interactions. In molecular evolution, as well
as in laboratory evolution experiments, such knowledge could
guide our expectation for the repeatability and variability of adap-
tation, owing to the direct link between epistasis and the accessibil-
ity of evolutionary trajectories under selection [6–9]. From the
opposite end, knowing the prevalence of epistasis in genes and
noncoding sequences may improve phylogenetic reconstruction
methods by explicitly incorporating context-dependent effects
(see, e.g., [10]).
The current protocol illustrates the analysis of a combinatorially
complete set of mutations in a protein, which are phenotypically
assessed by means of an assay with some inherent nonlinearities. In
this dataset, statistically significant two-way and multi-way
interactions between the mutations are identified.
1.1 Walsh-Hadamard Transform and Epistasis
In this protocol we will calculate the epistasis present in a complete
combinatorial mutant dataset, i.e., a set that contains all mutants
that can be made by recombination of two parental protein sequences
that differ at N positions. More precisely, we start with a vector y
containing the phenotypic measurements for all $2^N$ combinations of
mutations that can be generated at N positions, where each position has
two states. Calculating epistasis in such a dataset consists of a linear
mapping of the vector of $2^N$ phenotypes y onto a vector of $2^N$
epistatic coefficients ω, using an epistasis operator Ω of dimension
$2^N \times 2^N$:
$$\omega = \Omega\, y \tag{1}$$
The specific choice of the matrix elements for operator Ω determines
what quantitative definition of epistasis is being calculated [5]. Here
we focus on background-averaged epistasis. In this definition, each
epistatic coefficient is averaged over all states of the positions not
involved in that term. For example, two-way epistasis is calculated as
the differential effect of mutating a to A in the presence of either b
or B, averaged over all states of the remaining positions C, D, E, F,
etc. Three-way epistasis between A, B, and C is averaged over all
backgrounds involving positions D, E, F, etc. This definition of
epistasis does not require a particular genotype as a reference, unlike
the more traditional definition of epistasis where this averaging does
not take place [5]. For this reason, the traditional definition of
epistasis is also referred to as “local” epistasis (local in sequence
space) and background-averaged epistasis as “global.” The elements in
the operator for global epistasis are defined by the Hadamard matrix
[11], H, weighted by the entries in a diagonal matrix V specifying how
many genetic backgrounds a certain epistatic term is averaged over.
$$\omega = V H\, y \tag{2}$$
The two matrices comprising the operator for background-averaged
epistasis can be generated with a recursive definition:
$$V_{n+1} = \begin{pmatrix} \tfrac{1}{2} V_n & 0 \\ 0 & -V_n \end{pmatrix} \quad \text{and} \quad H_{n+1} = \begin{pmatrix} H_n & H_n \\ H_n & -H_n \end{pmatrix} \tag{3}$$
where $V_0 = 1$, $H_0 = 1$, and $n \in \{0, \ldots, N-1\}$. I mention in
passing that calculating epistasis using the Hadamard formalism can be
seen as a decomposition of the fitness landscape into features of
different (genotypic) length scales, analogous to a Fourier
decomposition of some temporal signal in its frequency components (see,
e.g., [12–14]).
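As a minimal sketch of Eqs. 1 to 3, the following Python code (an illustrative translation, not the accompanying MATLAB script) builds H and the diagonal of V by the recursion and applies ω = VHy to a hypothetical two-position dataset. The function names, the example phenotypes, and the genotype ordering (the vector index read as a binary number: 00, 01, 10, 11) are assumptions made here for illustration.

```python
def hadamard(n):
    """Recursion of Eq. 3: H_{n+1} = [[H_n, H_n], [H_n, -H_n]], with H_0 = [1]."""
    H = [[1]]
    for _ in range(n):
        top = [row + row for row in H]
        bottom = [row + [-v for v in row] for row in H]
        H = top + bottom
    return H


def v_diagonal(n):
    """Diagonal of V from Eq. 3: V_{n+1} = diag(V_n / 2, -V_n), with V_0 = 1."""
    v = [1.0]
    for _ in range(n):
        v = [x / 2 for x in v] + [-x for x in v]
    return v


def epistasis(y):
    """Background-averaged epistatic coefficients, omega = V H y (Eq. 2)."""
    n = len(y).bit_length() - 1
    assert len(y) == 2 ** n, "y must hold all 2^N phenotypes"
    H, v = hadamard(n), v_diagonal(n)
    return [v[i] * sum(h * yj for h, yj in zip(H[i], y)) for i in range(len(y))]


# Hypothetical phenotypes for N = 2, genotypes ordered 00, 01, 10, 11.
y = [1.0, 2.0, 3.0, 6.0]
omega = epistasis(y)
# omega[0]: mean phenotype; omega[1], omega[2]: background-averaged effects
# of the two single mutations; omega[3]: pairwise epistasis.
print(omega)
```

In this example the last coefficient equals y11 - y10 - y01 + y00, the familiar pairwise deviation from additivity, while each first-order coefficient is the mutation effect averaged over both states of the other position.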
The inverse transformation, which reconstructs the data points
y from complete knowledge of the epistatic coefficients ω in the
system, is given by
$$y = H^{-1} V^{-1}\, \omega \tag{4}$$
In this protocol the inverse will be used at several points, for
example, to generate an initial dataset of mutant phenotypes from a
computationally generated vector of epistatic coefficients.
Two remarks are in order here. First, throughout this protocol, it is
assumed we have a complete combinatorial dataset, so that calculating
epistasis from phenotypic data (and vice versa) is a one-to-one
mapping where no information is lost. In general, since the number
of possible combinations grows exponentially with the number of
mutable positions, N, the set of observed mutant phenotypes will
not be complete. Powerful methods can be applied to estimate
epistatic coefficients for incomplete data (see, e.g., [15]), and in
fact, a main reason for performing epistatic analysis is to be able to
predict missing phenotypes from a measured subset of all possibi-
lities. Second, here, every position is assumed to have two possible
states, which, for a protein, implies two possible amino acids per
position. This assumption is made for simplicity because an exten-
sion of the Hadamard formalism to an arbitrary number of states
per position is not straightforward. In this way, this protocol can
focus on two key parts of epistatic analysis: overall nonlinearities in
the dataset and identifying significant epistatic terms.
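The one-to-one character of the mapping can be checked with a short round trip. The sketch below is a hedged Python illustration (not the chapter's MATLAB script) that avoids building explicit matrices by using a fast Walsh-Hadamard transform; it assumes the closed forms $V_{ii} = (-1)^k\, 2^{k-N}$ for a term of order k and $H^{-1} = H/2^N$, which follow from the recursive definitions. All function names are hypothetical.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (same ordering as the recursion for H)."""
    a = list(values)
    h = 1
    while h < len(a):
        for start in range(0, len(a), 2 * h):
            for j in range(start, start + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a


def v_diagonal(n):
    """V_ii = (-1)^k * 2^(k - n), where k is the order (number of set bits) of term i."""
    return [(-1) ** bin(i).count("1") * 2.0 ** (bin(i).count("1") - n)
            for i in range(2 ** n)]


def to_epistasis(y):
    """omega = V H y."""
    n = len(y).bit_length() - 1
    return [v * t for v, t in zip(v_diagonal(n), fwht(y))]


def from_epistasis(omega):
    """y = H^{-1} V^{-1} omega, with H^{-1} = H / 2^N and V_ii^{-1} = 1 / V_ii."""
    n = len(omega).bit_length() - 1
    scaled = [w / v for w, v in zip(omega, v_diagonal(n))]
    return [t / len(omega) for t in fwht(scaled)]


y = [1.0, 2.0, 3.0, 6.0]          # hypothetical phenotypes, N = 2
omega = to_epistasis(y)
y_back = from_epistasis(omega)    # recovers y up to floating-point error
```

The transform costs on the order of $N\,2^N$ operations rather than $2^{2N}$, which matters once N grows beyond a handful of positions.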
1.2 Overall Nonlinearities and the Null Hypothesis
Most experimental assays will exhibit overall nonlinearities, meaning
that our observation of a quantity of interest x is “distorted”
according to some nonlinear function f(x) (Fig. 1a, b). This can be
specific to the measurement, for example, a limited linear range of
fluorescence detection in a flow cytometer or a limited linear
concentration range in a binding assay due to non-specific binding.
Additionally, it can be inherent to the biological system, for example,
a saturating dependence of protein expression on an activator’s binding
affinity. If such nonlinearities are not taken into account, the
empirical dataset may appear more epistatic than it actually is
(Fig. 1b). To meaningfully quantify epistasis, an explicit null
hypothesis needs to be expressed, defining what it means for mutations
to act independently. Note that the null model also addresses the
question of whether mutational independence implies additivity or
multiplicativity of effects: in fact, additivity in a quantity of
interest φ can appear in the dataset as multiplicativity if the assay
measures a quantity with φ in the exponent, for example, when we
measure equilibrium dissociation constants but are interested in
epistasis with respect to the binding free energy. If we have
sufficient knowledge about the system, we can directly choose the
applicable nonlinear scaling (in the case of dissociation constants,
this would be their logarithm) and define independence as additivity.
In general, especially when the system is more complex, we can remove
(part of) the overall nonlinearities using a linear-nonlinear
optimization (see [16, 17] for similar approaches). Here, the vector y
containing the observables is transformed using a nonlinear function
g(y) that only has a small number of free parameters, after which we
attempt to optimize those parameters by maximizing the variance
captured with first-order or low-order epistatic coefficients.

[Figure 1. (a) Schematic: latent variables ω, x → instrument function /
assay f(x) → observables f(x) + η = y. (b) Plot of f(x) versus x with
points x_a, x_b, and x_{a+b} = x_a + x_b (no epistasis in x), for which
f_{a+b} ≠ f_a + f_b (epistasis in f).]
Fig. 1 Latent variables and observables. (a) Here the biological system
of interest is represented by variable x that can be decomposed into its
epistatic components ω. However, x is latent, and we can only observe
its effects after some nonlinear transformation f(x) has occurred, which
may reflect saturation in the experimental assay or an instrument
function. Experimental noise is modeled through a random variable η, so
that the observed phenotypic data are given by y = f(x) + η.
(b) Calculating epistasis requires an explicit null model. Trivial
nonlinearities in the transfer function f(x) can result in apparent
epistasis in the observable y without epistasis in the underlying latent
variable x

Mathematically, reconstituting a dataset g(y) from first-order
epistatic coefficients is achieved by the transformation
$$g_1(y) = H^{-1} V^{-1} S_1 V H\, g(y) \tag{5}$$
Here, from right to left after the equal sign, first, all epistatic
coefficients are calculated by multiplication with VH. Then this vector
is multiplied by matrix $S_1$, the identity matrix with entries
$S_{1,ii} = 0$ at positions that do not pertain to first-order (linear)
terms, so that first-order epistatic terms are kept intact, but
everything else is set to zero. Lastly, the inverse transformation
$H^{-1} V^{-1}$ reconstructs the data using only the information
contained in the first-order terms. The linear-nonlinear optimization
procedure now consists of finding the values for the parameters in
function g that minimize the quantity
$$h = \frac{\mathrm{var}\left(g(y) - g_1(y)\right)}{\mathrm{var}\left(g(y)\right)} \tag{6}$$
which is the sum of squares of the residuals divided by the total sum
of squares. For this approach to be successful, there are a number of
requirements for the form of the nonlinear function, which will be
discussed in the protocol steps and the Notes.
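The first-order reconstruction and the quantity h can be sketched in a few lines of Python. This is a hedged illustration rather than the accompanying script: because V and $S_1$ are both diagonal, $V^{-1} S_1 V = S_1$, so the sketch applies the first-order mask directly to the Hadamard coefficients. The function names and the two toy datasets are assumptions made for illustration.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform."""
    a = list(values)
    h = 1
    while h < len(a):
        for start in range(0, len(a), 2 * h):
            for j in range(start, start + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def unexplained_fraction(g):
    """h = var(g - g_1) / var(g), with g_1 rebuilt from first-order terms only."""
    coeffs = fwht(g)
    # S_1: keep terms whose index has exactly one set bit (first order).
    kept = [c if bin(i).count("1") == 1 else 0.0 for i, c in enumerate(coeffs)]
    g1 = [t / len(g) for t in fwht(kept)]          # H^{-1} = H / 2^N
    residuals = [a - b for a, b in zip(g, g1)]
    return variance(residuals) / variance(g)


# Hypothetical N = 2 datasets, genotypes ordered 00, 01, 10, 11.
h_additive = unexplained_fraction([0.0, 1.0, 2.0, 3.0])   # purely additive effects
h_epistatic = unexplained_fraction([0.0, 1.0, 2.0, 6.0])  # double mutant deviates
```

For the additive dataset h vanishes (up to rounding), whereas the second dataset leaves a nonzero fraction of the variance unexplained by first-order terms.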
1.3 Error Propagation and Significant Terms
After finding the nonlinear transformation that optimally removes the
overall nonlinearities, epistatic coefficients can be determined
according to
$$\omega = V H\, g(y) \tag{7}$$
Since this is a one-to-one mapping from measurements to epistasis, the
full set of $2^N$ epistatic coefficients in ω also captures any
measurement noise that is present in the transformed dataset g(y). As
the error for epistatic terms propagates exponentially, with
a factor two for each increasing order [5], the effects of noise are
more pronounced for higher-order terms. To prevent overfitting,
this protocol illustrates a self-consistent approach that determines
the noise contributions and establishes a significance threshold for
epistatic coefficients of each order. We will show at the end of the
protocol that these significant terms allow reconstruction of our
original computer-generated model data at high accuracy and with
little modeled measurement noise. Not surprisingly, when mea-
surement noise is too large, this approach will break down.
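The factor of two per order can be made explicit with a short calculation, sketched here under the assumption (not stated in the text) that the noise on each of the $2^N$ transformed data points is independent with a common variance $\sigma^2$. For a coefficient $\omega_i$ of order k, the entries of H are $\pm 1$ and the recursion of Eq. 3 gives $|V_{ii}| = 2^{-(N-k)}$, so that

$$\mathrm{Var}(\omega_i) = V_{ii}^2 \sum_{j=1}^{2^N} H_{ij}^2\, \sigma^2 = 2^{-2(N-k)}\, 2^N\, \sigma^2, \qquad \mathrm{sd}(\omega_i) = 2^{\,k - N/2}\, \sigma .$$

Under this assumption the noise width doubles with each increase in order, and the zeroth-order term carries $\sigma / \sqrt{2^N}$, the error on the mean of all data points.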
2 Materials
The accompanying computer script is written in MATLAB, but the
implementation is kept as simple as possible for straightforward
translation to other languages. Most programming languages con-
tain explicit implementations of the Hadamard matrix, but when
lacking, the matrices can be easily generated using the recursive
definitions provided.
3 Methods
Steps correspond to sections in the accompanying MATLAB script.
3.1 Generate a Combinatorial Mutant Dataset
1. Initialize the parameters and matrices used for the calculation of
the epistasis operator (see Note 1) and its inverse. We use auxiliary
variables A and B to indicate which positions are involved in an
epistatic term and what the order of that term is.
2. Generate a vector ω of length $2^N$ containing the epistatic
contributions. The entries are generated randomly according to a model
of preferential attachment, where higher-order terms are more likely to
be non-zero if they involve positions with non-zero lower-order terms
(see Note 2). The generating function PrefAttach() contains two
variables, frac and dExp, setting the fraction of non-zero first-order
terms and the decay rate of this fraction for higher-order terms.
3. Generate the phenotypic data by the inverse transform,
$x = H^{-1} V^{-1} \omega$ (see Note 3). This, in an experimental
setting, is the latent variable (Fig. 1a) prior to potential nonlinear
scaling due to the measurement instrument or the assay conditions and
without measurement noise. This is the quantity for which we aim to map
the epistatic contributions.
4. Apply a nonlinear transfer function, mimicking a limited linear
range of the measurement instrument or assay, $x \to f(x)$. Here we have
chosen a function that saturates for high values of x:
$$f(x) = \frac{c \log(1 + x)}{1 + \log(1 + x)}$$
with c as a free parameter. Since the linear-nonlinear optimization
procedure below is invariant under scaling and translation
($x \to \frac{x - a}{b}$), without loss of generality, we can move the
original range of values of x into an unsaturated part of the transfer
function by choosing values for the variables scale and shift (see also
Note 4).
5. Add Gaussian “measurement noise,” $f(x) \to f(x) + \eta$ (see
Note 5). The generated noisy data plays the role of the observables,
designated by y (Fig. 1a). Plot the observables y as a function of the
latent variable x (Fig. 2a). This completes the generation of the
phenotypic dataset.
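Steps 2 to 5 can be sketched as follows in Python. This is a hedged illustration only: the sparse vector ω is written by hand instead of being drawn with the preferential-attachment function PrefAttach(), the scale and shift step is omitted, and the values of N, c, the noise level, and the random seed are arbitrary assumptions. The function names are hypothetical.

```python
import math
import random


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform."""
    a = list(values)
    h = 1
    while h < len(a):
        for start in range(0, len(a), 2 * h):
            for j in range(start, start + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a


def latent_from_epistasis(omega):
    """x = H^{-1} V^{-1} omega, with H^{-1} = H / 2^N and V_ii^{-1} = 1 / V_ii."""
    size = len(omega)
    n = size.bit_length() - 1
    scaled = []
    for i, w in enumerate(omega):
        k = bin(i).count("1")                       # order of term i
        scaled.append(w / ((-1) ** k * 2.0 ** (k - n)))
    return [t / size for t in fwht(scaled)]


def transfer(x, c):
    """Saturating transfer function f(x) = c log(1 + x) / (1 + log(1 + x))."""
    log_term = math.log(1.0 + x)
    return c * log_term / (1.0 + log_term)


N = 3
omega = [0.0] * 2 ** N
omega[0] = 1.5                                      # zeroth order: mean of x
omega[0b001], omega[0b010], omega[0b100] = 0.4, -0.3, 0.5   # first-order terms
omega[0b011] = 0.2                                  # one pairwise term

x = latent_from_epistasis(omega)                    # latent variable (step 3)
rng = random.Random(0)                              # fixed seed, for repeatability
noise_sd = 0.01                                     # assumed noise level
y = [transfer(xi, c=1.0) + rng.gauss(0.0, noise_sd) for xi in x]   # steps 4 and 5
```

The hand-picked coefficients keep all latent values above -1, so that log(1 + x) stays defined; a dataset generated differently may need the shift and scale of step 4 to achieve this.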
3.2 Removing the Overall Nonlinearities
1. Decide on the general form of the nonlinear scaling g(y) to be
tested. Ideally, g is the inverse of f, and we have $g(y) \approx x$,
but in general, f is not known accurately. The particular functional
form can be chosen based on knowledge of the instrument transfer
function or the experimental assay. If we do not have such knowledge
about the system, monotonically increasing or decreasing test functions
can be tried that have properties consistent with the expected
nonlinearities, e.g., saturation for high values of the phenotypic data
x (see Note 6). Here, for simplicity, we assume we know the transfer
function f up to some constant c, which we will try to find using the
linear-nonlinear optimization. The chosen test function is therefore
the inverse of f:
$$g(c; y) = e^{\frac{y}{c - y}} - 1$$
2. Initialize the parameters for the linear-nonlinear optimization. If
the transfer function for which we chose the test function has
pronounced saturating behavior, a small amount of measurement noise
will yield extreme variability of the data points in the saturated
regime (see Fig. 1b). To counteract this, data points in y are
down-weighted during the optimization according to the slope
$g'(c; y)$ of the test function (see Note 7). The parameter wExp sets
the extent of this down-weighting, and epsilon determines the values
for the slope where no down-weighting occurs.
3. Perform the optimization. Here, since there is only a single
optimization parameter, we simply scan through its range and record the
quantity
$h(c) = \mathrm{var}\left(g(c; y) - g_1(c; y)\right) / \mathrm{var}\left(g(c; y)\right)$
(where $g_1$ is defined in Eq. 5). See Note 8.
4. Find the value $c^*$ that minimizes h(c). With this value, we can
perform the nonlinear transformation of the observables
$y \to g(c^*; y)$.

[Figure 2. Four scatter-plot panels (a–d). The panels report
R² = 0.94366 (b) and, in the lower row (c, d), R² = 0.80931 and
R² = 0.97473.]
Fig. 2 Example data generated in the protocol. (a) Observable mutant
phenotypes y versus latent variable x. A clear nonlinear relation is
present. (b) After removing overall nonlinearities, data is
reconstructed using the obtained significant epistatic terms. A linear
relation with a high correlation can be observed, provided that the
measurement noise is not too large. (c) Reconstructed data from all
epistatic terms (i.e., the data directly after applying the nonlinear
scaling). Large noise contributions are present for data points at
larger values of x. Comparing this to panel b shows the noise
suppression that can be achieved when only significant epistatic terms
are used for reconstruction. (d) Comparison of the values of the
significant epistatic terms and their counterparts in the latent data
3.3 Epistatic Analysis, Noise Propagation, Significant Terms
1. Calculate the epistatic coefficients of the transformed data:
$\omega_{c^*} = V H\, g(c^*; y)$.
2. Make histograms of the epistatic coefficients, per order, with a
reasonable number of terms per bin. Fit the histograms using Gaussian
distributions and record the widths.
3. Perform a linear fit to the logarithms of widths, for intermediate
orders. See Notes 9 and 10. Calculate the widths for low orders and
high orders by extrapolation. See Note 11.
4. Use the widths to set the significance thresholds, using the Šidák
correction for multiple testing, i.e., using the numbers of potential
epistatic contributions per order.
5. Identify the epistatic coefficients with a value greater than the
significance thresholds. The vector containing these terms,
$\omega_{c^*,\mathrm{sign}}$, is the estimate for the epistasis present
in the system.
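One possible reading of step 4 is sketched below in Python. Several choices here are assumptions, not taken from the protocol: the family-wise level alpha = 0.05 applied per order, a two-sided Gaussian test, noise widths that follow the idealized doubling per order (width_k = width0 · 2^k, see Note 10) rather than fitted values, and the function name itself. The number of potential terms of order k is taken as the binomial coefficient C(N, k).

```python
from math import comb
from statistics import NormalDist


def sidak_thresholds(n_positions, width0, alpha=0.05):
    """Per-order significance thresholds with a Sidak correction.

    width0 is the (assumed) Gaussian noise width extrapolated to order zero.
    """
    thresholds = {}
    for order in range(1, n_positions + 1):
        n_terms = comb(n_positions, order)            # potential terms of this order
        alpha_per_test = 1.0 - (1.0 - alpha) ** (1.0 / n_terms)
        z = NormalDist().inv_cdf(1.0 - alpha_per_test / 2.0)   # two-sided cutoff
        width = width0 * 2.0 ** order                 # idealized factor 2 per order
        thresholds[order] = z * width
    return thresholds


thresholds = sidak_thresholds(n_positions=3, width0=0.01)
# A coefficient of a given order would be called significant when its
# magnitude exceeds thresholds[order].
```

In practice the widths would come from the histogram fits of steps 2 and 3; the doubling rule is used here only to keep the sketch self-contained.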
3.4 Reconstruction of Phenotypes, Corroboration
1. Reconstitute the phenotypic data from significant epistatic
contributions, according to
$x_{\mathrm{recon}} = H^{-1} V^{-1} \omega_{c^*,\mathrm{sign}}$, and
plot the reconstituted data $x_{\mathrm{recon}}$ versus the initially
generated data x. See Note 12 and Fig. 2b.
2. Reconstitute the phenotypic data using all epistatic contributions,
according to $x_{\mathrm{recon,all}} = H^{-1} V^{-1} \omega_{c^*}$
(where $\omega_{c^*}$ was calculated in Subheading 3.3, step 1), and
plot the reconstituted data $x_{\mathrm{recon,all}}$ versus the
initially generated data x. See Note 13 and Fig. 2c.
3. Plot $\omega_{c^*,\mathrm{sign}}$ (identified in Subheading 3.3,
step 5) versus the actual epistasis ω in the system (Fig. 2d).
4 Notes
1. If the number of mutable positions is large, an explicit Hadamard
matrix may impose a large burden on the available memory resources.
Fast Hadamard transforms are available in MATLAB and other languages.
2. Preferential attachment is used here for the generation of the
initial epistatic vector since it creates a relatively sparse distribu-
tion of epistatic terms which seems consistent with the limited
amount of data that is available currently [15].
3. For the inverse epistasis operator, we use the simple explicit
inverses of H and V: $H^{-1} = \frac{1}{2^N} H$ and
$V^{-1}_{ii} = 1/V_{ii}$. It is important for speed and accuracy to not
use a general implementation for the inverse, such as inv() in MATLAB.
4. Linear-nonlinear optimization is invariant under transformation of
the latent variable according to $x \to \frac{x - a}{b}$, because (1) an
overall shift ($-\frac{a}{b}$) will be absorbed in the zeroth-order
epistatic term, which is the average phenotypic value of all data
points, and (2) a linear scaling b will be accounted for by the linear
(first-order) terms. An example of a shift and linear scaling is,
respectively, the arbitrary zero point of a binding free energy
$\Delta G_0$, and its expression in either kcal/mol or kJ/mol, both of
which physically do not matter to the system. In this protocol, the
transformation was used to populate the unsaturated part of the
transfer function. The invariance only holds if the transfer function
is monotonic and non-singular (see also Note 7) and, strictly, if the
measurement noise is zero. Practically, results are insensitive to low
amounts of measurement noise.
5. For increasing levels of measurement noise, the approach pre-
sented here will break down, because the propagating noise will
dominate the epistatic terms. This can be explored by varying
the parameter noise level in the initial generation of y.
6. In most practical cases, the chosen test function for the removal of
nonlinearities g(y) will not be the exact inverse of the transfer
function f(x). This is not necessarily an issue. The saturation effects
present in the transfer function used in the current protocol can be
reasonably well captured using other functions that have a
monotonically decreasing slope, such as a fractional power
$f(x) = x^{\alpha}$, where $0 < \alpha < 1$, with test function
$g(y) = y^{1/\alpha}$.
7. Apart from the requirement that the modeled transfer function f(x)
is monotonically increasing or decreasing for the linear-nonlinear
optimization procedure to work, there are a number of other
considerations to keep in mind. Depending on the choice for the test
function g(y), the optimization procedure may encounter some issues if
for some range of y, the slope $g'(y)$ becomes very large (indicating
saturation in f(x)) or if g(y) contains singular points (indicating an
asymptote in f(x)). With a sharply saturating function f(x), any
measurement noise present in y will lead to large uncertainties after
applying the test function g(y). In the current protocol, this effect
is remedied in two ways: (1) while calculating the minimization
quantity h(c), data points are weighted by the inverse of the slope
$g'(y)$, so that data with potentially large uncertainties will be
down-weighted, and (2) the calculation of residuals
$g(c; y) - g_1(c; y)$ is replaced by a procedure where first a linear
fit is made between $g(c; y)$ and $g_1(c; y)$ using a robust estimator
for the slope (the Theil-Sen estimator [18, 19]). The more severe case
where f(x) exhibits asymptotic behavior, and thus g(y) does not exist
for certain y, can be addressed with more sophisticated approaches. In
that case some data points have to be ignored, and epistasis has to be
calculated based on an incomplete combinatorial dataset (see, e.g.,
[15]).
8. Since there is only one optimization parameter here, we per-
form a scan through its values, rather than a search procedure.
When there are multiple parameters, a nonlinear solver such as
fmincon() in MATLAB can perform a constrained
minimization.
9. Only widths at intermediate epistatic orders are included in the
fit. Low orders may have a large fraction of significant contri-
butions, which would lead to an overestimate of the noise
width. Very high orders are noisier because they have fewer
epistatic terms to generate the histograms and the terms are
also averaged over fewer genetic backgrounds.
10. Since background-averaged epistatic terms of a certain order
are differences between two terms of a lower order, but also
averaged over half as many genetic backgrounds, the error
increases by a factor 2 per order (see ref. 5). Therefore the
expectation that the width of the Gaussian distributions of
epistatic terms increases by a factor 2 per order serves as a
consistency check in the analysis.
11. The intersection of the linear fit with the y-axis (zeroth-order
epistasis) is a measure for the error on the mean for the measurement
noise (here the noise in $g(c^*; y)$) or the per-data point noise
divided by a factor $\sqrt{2^N}$.
12. Directly comparing results to x and ω is impossible in an
experimental setting, because x and ω are latent variables
(Fig. 1a). In the current protocol, the initial data was compu-
tationally generated, allowing for the corroboration in Sub-
heading 3.4. Generally, the effect of the nonlinear scaling can
be assessed by verifying that fewer epistatic terms are necessary
to accurately reconstitute the scaled data compared to the
unscaled data.
13. Comparing the phenotypic data reconstructed using the
significant epistatic terms (x_recon) and using all epistatic terms
(x_recon,all) to the initial data (x) illustrates two things. First, there
is indeed a linear relationship between x_recon and x (potentially
shifted and linearly scaled (see Subheading 3.1, step 4 and
Note 4)). In fact, in this example, a mere 6% of variance in
the dataset is not captured using the significant epistatic terms
alone (but see also Note 5). Second, leaving out the nonsignifi-
cant terms has the additional benefit of removing noise con-
tributions that originate from the steep parts of the test
function (Fig. 2c).
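The Theil-Sen estimator mentioned in Note 7 takes the median of the slopes between all pairs of points, which makes the fitted slope insensitive to a minority of outlying data. A minimal sketch in Python follows; the function name and the toy data are illustrative assumptions and are not part of the protocol itself.

```python
from itertools import combinations
import statistics

def theil_sen(xs, ys):
    """Return (slope, intercept) from the median of all pairwise slopes."""
    slopes = [
        (ys[j] - ys[i]) / float(xs[j] - xs[i])
        for i, j in combinations(range(len(xs)), 2)
        if xs[j] != xs[i]  # skip vertical pairs
    ]
    slope = statistics.median(slopes)
    # intercept taken as the median of the residual offsets
    intercept = statistics.median([y - slope * x for x, y in zip(xs, ys)])
    return slope, intercept

# toy data: y = 2x + 1 with one gross outlier in the last point
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 100.0]
slope, intercept = theil_sen(xs, ys)
print(slope, intercept)
```

An ordinary least-squares fit to the same toy data would be pulled strongly toward the outlier, which is the motivation for using a rank-based slope estimate on noisy transformed data.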
Acknowledgments
I thank Michael A. Stiffler and DerZen Fan for critical reading of
the manuscript.
References
1. Bateson W (1907) Facts limiting the theory of heredity. Science 26:649–660
2. Fisher RA (1918) The correlation between relatives on the supposition of Mendelian inheritance. Trans Roy Soc Edinb 52:399–433
3. Phillips PC (1998) The language of gene interaction. Genetics 149:1167–1171
4. Phillips PC (2008) Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nat Rev Genet 9:855–867
5. Poelwijk FJ, Krishna V, Ranganathan R (2016) The context-dependence of mutations: a linkage of formalisms. PLoS Comput Biol 12:e1004771
6. Weinreich DM, Watson RA, Chao L (2005) Perspective: sign epistasis and genetic constraint on evolutionary trajectories. Evolution 59:1165–1174
7. Weinreich DM, Delaney NF, Depristo MA, Hartl DL (2006) Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312:111–114
8. Poelwijk FJ, Kiviet DJ, Weinreich DM, Tans SJ (2007) Empirical fitness landscapes reveal accessible evolutionary paths. Nature 445:383–386
9. Poelwijk FJ, Tănase-Nicola S, Kiviet DJ, Tans SJ (2011) Reciprocal sign epistasis is a necessary condition for multi-peaked fitness landscapes. J Theor Biol 272:141–144
10. Siepel A, Haussler D (2004) Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol 21:468–488
11. Beer T (1981) Walsh transforms. Am J Phys 49:466–472
12. Stoffer DS (1991) Walsh-Fourier analysis and its statistical applications. J Am Stat Assoc 86:461–479
13. Weinberger E (1991) Fourier and Taylor series on fitness landscapes. Biol Cybernetics 65:321–330
14. Stadler PF (2002) Spectral landscape theory. In: Crutchfield JP, Schuster P (eds) Evolutionary dynamics—exploring the interface of selection, accident, and function. Oxford University Press, Oxford, pp 231–272
15. Poelwijk FJ, Socolich M, Ranganathan R (2017) High-order epistasis linking genotype and phenotype in a protein. Submitted
16. Otwinowski J, Nemenman I (2013) Genotype to phenotype mapping and the fitness landscape of the E. coli lac promoter. PLoS One 8:e61570
17. Sailer ZR, Harms MJ (2017) Detecting high-order epistasis in nonlinear genotype-phenotype maps. Genetics 205:1079–1088
18. Theil H (1950) A rank-invariant method of linear and polynomial regression analysis. I, II, III. Nederl Akad Wetensch Proc 53:386–392, 521–525, 1397–1412
19. Sen PK (1968) Estimates of the regression coefficient based on Kendall’s tau. J Am Stat Assoc 63:1379–1389
Chapter 8
High-Throughput Reconstruction of Ancestral Protein Sequence, Structure, and Molecular Function
Kelsey Aadland, Charles Pugh, and Bryan Kolaczkowski
Abstract
Ancestral protein sequence reconstruction is a powerful technique for explicitly testing hypotheses about
the evolution of molecular function, allowing researchers to meticulously dissect how historical changes in
protein sequence impacted functional repertoire by altering the protein’s 3D structure. These techniques
have provided concrete, experimentally validated insights into ancient evolutionary processes and help
illuminate the complex relationship between protein sequence, structure, and function. Inferring the
protein family phylogenies on which ancestral sequence reconstruction depends and reconstructing the
sequences, themselves, are amenable to high-throughput computational analysis. However, determining
the structures of ancestral-reconstructed proteins and characterizing their functions typically rely on time-
consuming and expensive laboratory analyses, limiting most current studies to examining a relatively small
number of specific hypotheses. For this reason, we have little detailed, unbiased information about how
molecular function evolves across large protein family phylogenies. Here we describe a generalized protocol
that integrates ancestral sequence reconstruction with structural homology modeling and structure-based
molecular affinity prediction to characterize historical changes in protein function across families with
thousands of individual sequences. We highlight key steps in the analysis protocol requiring particularly
careful attention to avoid introducing potential errors as well as steps for which computationally efficient
subroutines can be substituted for more intensive approaches, allowing researchers to scale the analysis up
or down, depending on available resources and requirements for reproducibility and scientific rigor. In our
view, this approach provides a compelling complement to more laboratory-intensive procedures, generating
important contextual information that can help guide detailed experiments.
Key words Ancestral sequence reconstruction, Structural modeling, Protein function prediction,
Affinity prediction, Protein evolution, Molecular evolution
1 Introduction
Computational reconstruction of ancestral protein sequences has
been used as a foundation to test molecular-evolutionary hypotheses
[1], illuminate how protein structure impacts molecular
function [2], and provide functional variation to support protein
engineering [3, 4]. Ancestral sequence reconstruction (ASR) is
computationally efficient, allowing thousands of ancestral
sequences to be inferred in a few seconds using modern computers
[5]. However, structural and functional characterization of ances-
tral sequences can be much more costly and time-consuming,
limiting current ASR studies to examining at most a handful of
sequences in detail [4, 6–15].
A typical ASR study begins with a sequence alignment of
representatives from the protein family of interest and a phyloge-
netic tree describing the evolutionary relationships among the
sequences as well as the relative rates of amino acid substitutions
and other evolutionary model parameters needed to adequately
describe the molecular-evolutionary process. Using the aligned
extant sequences, phylogenetic tree and fully specified evolutionary
model, the marginal posterior probability of each amino acid resi-
due at each position in the alignment and each ancestral node on
the phylogeny can be calculated [5]. Most evolutionary models do
not infer gap states in ancestral sequences, requiring either a sec-
ondary reconstruction process to infer insertion-deletion events
[16, 17] or use of an explicit insertion-deletion (indel) model as
part of the sequence reconstruction [18, 19].
Although genes from extant organisms can be sequenced with
little uncertainty, probabilistic reconstruction of ancestral
sequences necessarily generates some uncertainty at each position
in the sequence. Most ASR studies reconstruct “maximum likeli-
hood” ancestral sequences, choosing the most probable residue at
each position in the alignment. Although this approach is expected
to minimize errors in the ancestral sequence, even if the evolution-
ary model is correct and every position is inferred with very high
posterior probability, sequence errors can become unavoidable. For
example, assuming independent sites, the probability of at least one
sequence error is approximately 0.63 (1 − 0.99^100) for a
100-residue sequence having every position reconstructed with
posterior probability 0.99. Many studies evaluate the impact of
ASR error on functional inferences using a “robustness” approach,
in which plausible alternative residues (typically non-maximum-
likelihood residues reconstructed with posterior probability >0.3)
are introduced into the maximum-likelihood sequence to observe
the effects on functional characteristics [11, 12, 20–22].
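The arithmetic behind this example can be checked directly. Assuming reconstruction errors are independent across sites, the probability of at least one error in a sequence of length L whose sites each have posterior probability p is 1 − p^L. The sketch below uses an illustrative function name that does not come from any ASR package.

```python
def prob_at_least_one_error(per_site_pp, length):
    """P(at least one wrong site), assuming each site is correct
    independently with probability per_site_pp."""
    return 1.0 - per_site_pp ** length

def expected_errors(per_site_pp, length):
    """Expected number of wrong sites under the same assumption."""
    return length * (1.0 - per_site_pp)

p_any = prob_at_least_one_error(0.99, 100)
n_exp = expected_errors(0.99, 100)
print("P(at least one error) = %.3f" % p_any)   # 0.634
print("expected errors       = %.1f" % n_exp)   # 1.0
```

Even with uniformly high per-site support, a 100-residue maximum-likelihood sequence is therefore more likely than not to contain at least one incorrect residue.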
Some researchers have argued that maximum-likelihood ASR
might introduce functional biases, particularly when many
sequence positions have largely additive contributions to protein
function [23]. Maximum-likelihood ASR can also introduce biases
in state frequency distributions under some conditions, making
them inappropriate for explicitly examining the evolution of state
frequencies [24, 25]. Sampling a large number of ancestral
sequences directly from the posterior probability distributions at
each position has been suggested as a complementary approach to
maximum-likelihood reconstruction [26, 27]. However, functionally
characterizing a large ensemble of ancestral sequences requires
significantly more laboratory resources. In addition, interpreting
the results of such a study is difficult, as many ancestral sequences
are expected to include relatively large numbers of potential errors
likely degrading protein expression or function.
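The difference between the maximum-likelihood sequence and sequences sampled from the posterior can be sketched as follows. The per-site posterior distributions are represented here as a list of dictionaries, which is an assumed in-memory format for illustration and not the output format of any particular program; the toy probabilities are invented.

```python
import random

def ml_ancestor(site_posteriors):
    """Most probable residue at every site."""
    return "".join(max(dist, key=dist.get) for dist in site_posteriors)

def sample_ancestor(site_posteriors, rng):
    """Draw one residue per site in proportion to its posterior probability."""
    seq = []
    for dist in site_posteriors:
        residues = sorted(dist)
        weights = [dist[r] for r in residues]
        seq.append(rng.choices(residues, weights=weights, k=1)[0])
    return "".join(seq)

# toy posterior distributions for a three-residue ancestor (invented values)
posteriors = [{"A": 0.95, "S": 0.05}, {"L": 0.6, "I": 0.4}, {"K": 1.0}]
rng = random.Random(1)  # fixed seed so the sampling is repeatable
samples = [sample_ancestor(posteriors, rng) for _ in range(5)]
print(ml_ancestor(posteriors), samples)
```

Sampled sequences differ from the maximum-likelihood sequence mainly at poorly supported sites, which is why a large ensemble tends to contain many sequences carrying several low-probability residues.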
Although the accuracy of ancestral sequence reconstruction has
not been thoroughly evaluated, current studies suggest that the
ASR process is likely to be highly accurate, provided a reasonable
evolutionary model is used [16, 28]. Interestingly, phylogenetic
uncertainty has little impact on ASR accuracy or uncertainty, as the
conditions under which the phylogeny becomes less certain also
make ancestral sequences increasingly similar across plausible trees
[29]. We might expect strong violations of the assumed evolution-
ary model—such as changes in site-specific evolutionary rates or
changes in amino acid substitutability—to reduce ASR accuracy, as
model violations have been shown to introduce other phylogenetic
errors [30–32]; however, this has not been examined. The impact
of alignment error on ASR accuracy has also not been systematically
evaluated, although we might expect alignment accuracy to be
particularly important for the robustness of ASR.
Once ancestral protein sequences of interest have been identi-
fied computationally, genes encoding the ancestral proteins can be
synthesized and subjected to nearly any molecular or cellular assay
commonly used to examine the functions of extant proteins. Ances-
tral protein function is typically evaluated using a highly simplified
in vitro model system, minimizing the impact of any potential
“mismatch” between the ancestral protein and the extant genomic
context within which it is being evaluated [4, 6–15, 27,
33–35]. However, some recent studies have begun examining the
cellular functions of ancestral proteins [20, 36, 37], with the caveat
that an ancestral protein’s function within a modern cellular con-
text may not accurately reflect its functional role within the appro-
priate ancestral cell.
The most detailed ASR studies attempt to identify the specific
historical substitutions responsible for changes in a protein’s
molecular function, as well as the structural mechanisms through
which the substitutions impact molecular function. Once a func-
tional difference between two sequential ancestral sequences—typ-
ically an ancestor-descendent pair on the phylogeny—has been
observed, derived residues can be introduced into the ancestral
sequence by site-directed mutagenesis, in order to identify the
patterns of substitutions necessary to recapitulate the observed
functional shift [4, 6–15, 27, 33–35]. Structural determination of
the ancestral proteins can then be used to identify the mechanisms
by which changes in sequence impact molecular function [2, 38].
Although ASR studies have provided one of the few experi-
mental means for directly examining historical evolutionary
hypotheses, their reliance on rigorous experimental techniques has

also severely limited the scope with which these approaches can be
applied. Compared to the efficiency of computational reconstruc-
tion of ancestral sequences, the synthesis of ancestral genes, expres-
sion of ancestral proteins, and functional-structural
characterization in the lab require an enormous expenditure of
time and money, limiting nearly all current ASR studies to examin-
ing only a few functional shifts at key positions in the protein family
phylogeny, typically chosen a priori by the investigator based on
patterns of gene duplication or speciation. Although undoubtedly
useful, by focusing on parts of the tree where we believe function
might change, this approach has potentially biased our view of how
function evolves across large protein families.
Here we describe a complementary ASR protocol that combines
ancestral sequence reconstruction with high-throughput
structural modeling and structure-based affinity prediction to eval-
uate the evolution of protein function across all ancestral nodes on
the phylogeny, potentially providing a more unbiased view of how
protein function evolves. This approach can be used to directly
evaluate hypotheses about large-scale patterns of functional evolu-
tion and guide traditional lab-based functional characterization
efforts, expanding both the scope and efficiency of ASR studies.
2 Methods
2.1 Sequence Collection and Protein Family Curation
The first step in any ancestral sequence reconstruction (ASR) study
is the collection and curation of protein sequences from the family
of interest. In almost every case, the root of the protein family
under study is of interest, requiring the collection and curation of
“outgroup” sequences. The goal of this step is to collect all avail-
able members of the protein family under study—including appro-
priate outgroup sequences—and no sequences that are not
members of the target protein family or outgroup. Given the
efficiency of modern phylogenetic analysis software, we see no
reason to limit the amount of protein sequence data analyzed,
other than elimination of potentially erroneous sequences and
redundant sequences. Ideally, a roughly equal number of
“ingroup” and outgroup sequences should be included in the
analysis.
There are many approaches to collecting sequence data, nearly
all of which rely on some form of sequence similarity search to
identify members of the protein family of interest. The most com-
mon approach starts with a small number of well-annotated protein
family members from heavily studied model organisms and collects
homologs using some form of protein BLAST search, typically with
an e-value cutoff of 1.0e-5 or an alternative value thought to be
appropriate for the family under study. While this approach is
probably adequate in most cases, care must be taken to avoid either

including spurious homologs based on local hits to common con-
served domains or missing distantly related homologs too dissimi-
lar to the well-studied seed sequences to be detected by BLAST or
similar approaches.
Here we develop an approach based on a domain architecture
definition of the protein family. When available, protein functional
domains—available through the Conserved Domain Database [39]
or the PFam Database [40]—can provide a concise description of
the expected sequence and structural features present in a protein
family. Functional domains are typically encoded at the sequence
level as high-level statistical models—either hidden Markov models
[40] or position-specific scoring matrices [39]—that can be used to
search sequence databases for domain-specific matches while avoid-
ing potentially limiting dependencies on specific seed sequences.
In this example, we will use position-specific scoring matrices
(PSSMs) from the Conserved Domain Database (CDD) to identify
domain matches in NCBI’s nr database. This analysis requires a
local mirror of the nr database, the BLAST command-line tools,
and the CDD, all available through NCBI. Note that some PSSM
domain models may need to be rescaled from a scaling factor of
100 to a scaling factor of 1, in order to be used in sequence database
searches. The following python script should perform the requisite
rescaling (note that this script will overwrite the existing PSSM file):
rescalePSSM.py (see footnote 1):
#!/usr/bin/env python
# Rescale a CDD PSSM (.smp) file from scalingFactor 100 to scalingFactor 1.
# WARNING: the input file is overwritten in place.
import sys, re
fname = sys.argv[1]
with open(fname, "r") as f:
    content = f.read()
# locate the "scores { ... }" block of the PSSM
start = content.find("scores {")
end = content.find("}", start)
# drop the last two digits of every score (truncating division by 100)
scores = re.sub(r"([0-9]{2})(?=[\r\n,])", "", content[start:end])
# scores that had only two digits are now empty; replace them with 0
scores = re.sub(r" (-)?,", " 0,", scores)
with open(fname, "w") as f:
    f.write(content[0:start] + scores
            + content[end:].replace("scalingFactor 100", "scalingFactor 1"))
Footnote 1: The Python scripts described in this chapter are hosted with the online version of the book.
The following BLAST command can then be run at the UNIX
command prompt:
psiblast -in_pssm cd00021.smp -db nr -out cd00021.nrhits.csv -outfmt '10 qstart qend sstart send sacc ssciname stitle evalue qlen sseq' -evalue 0.01 -max_target_seqs 10000000
Given the rescaled CDD domain model from cd00021.smp,

this command will identify all sequences in the nr database match-
ing the domain model, using an e-value cutoff of 0.01, which is
sufficient to separate true domain matches from spurious hits for
most complex, globular domains. Matching sequences will be
printed to a comma-delimited text file, cd00021.nrhits.csv, which
includes information about the matching sequence’s accession, the
species it is found in, and the location in the full-length protein
where the match was identified. One useful approach for eliminat-
ing potentially spurious single-domain hits is to eliminate any
potential matches of <70–80% of the expected domain length,
which can be estimated by examining domain hits from well-
annotated model organisms or evaluated based on the length of
the PSSM domain model. The following python script can be used
to eliminate partial domain hits:
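removePartialDomainHits.py:
#!/usr/bin/env python
# Print only those hits covering at least min_prop of the domain model length.
import sys
min_prop = 0.75  # minimum fraction of the domain model length
with open(sys.argv[1], "r") as handle:
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        linearr = line.strip().split(",")
        # columns 2 and 3 hold sstart and send; qlen is the second-to-last column
        hitlength = (int(linearr[3]) - int(linearr[2])) + 1
        qlen = int(linearr[-2])
        if float(hitlength) / float(qlen) >= min_prop:
            sys.stdout.write(line)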
Most protein families will not consist of a single functional

domain, and most functional domains can be found across multiple
families. In order to identify only members of the target protein
family, it is up to the researcher to develop a “minimal domain
architecture” that is believed to capture all members of the target
family and not match any other protein families. For each domain in
the minimal domain architecture, executing the psiblast command
above will identify all protein sequences encoding that domain.
Each list of domain hits should then be culled by eliminating partial
hits. Finally, individual domain hits must be combined by protein
accession to identify all the functional domains in each protein and
their ordering. The following Python script should form a reason-
able starting point for such an analysis:
combineDomainHits.py:
#!/usr/bin/env python
# Combine per-domain hit tables (*.nrhits.csv) by protein accession and print,
# for each protein, its species and its domains ordered by start position.
import sys, glob
combined_hits = {}
for f in glob.glob("*.nrhits.csv"):
    domname = f.split(".nrhits.csv")[0]
    with open(f, "r") as handle:
        for line in handle:
            linearr = line.strip().split(",")
            acc = linearr[4]
            spp = linearr[5]
            beg = int(linearr[2])
            end = int(linearr[3])
            if acc in combined_hits:
                combined_hits[acc].append((beg, end, domname))
            else:
                combined_hits[acc] = [spp, (beg, end, domname)]
for acc in combined_hits:
    spp = combined_hits[acc][0]
    doms = combined_hits[acc][1:]
    doms.sort()  # order domains by start position
    sys.stdout.write("%s,%s" % (acc, spp))
    for (b, e, name) in doms:
        sys.stdout.write(",%s:%d..%d" % (name, b, e))
    sys.stdout.write("\n")
Any sequences not matching the defined “minimal domain

architecture” should be eliminated. Typically, this will require
expert manual curation by a researcher familiar with the protein
family under study, its expected diversity in domain architecture,
and some familiarity with the quality of genome annotations across
the species in which the protein family is present. Common genome
annotation errors—such as domain duplications and deletions—
can produce variation in domain architecture that is artefactual
rather than biological, and it is up to the researcher whether to
include or exclude such data. As all subsequent analyses will depend
on the initial sequence data set, it is our view that time invested in
ensuring that the initial data collection and curation are both reliable and
thorough can pay huge dividends downstream, potentially boost-
ing analysis power and reducing the likelihood of errors.
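Before manual curation, an automated first pass can discard proteins whose domains do not occur in the required order. The sketch below assumes input lines in the format printed by combineDomainHits.py (accession, species, then name:start..end fields sorted by position) and assumes species names contain no commas. The domain names domA and domB and the accessions are hypothetical placeholders.

```python
def domain_order(line):
    """Return (accession, [domain names in N-to-C order]) for one line."""
    fields = line.strip().split(",")
    acc = fields[0]
    names = [f.split(":")[0] for f in fields[2:]]
    return acc, names

def matches_architecture(names, required):
    """True if the domains in `required` occur in `names` in that order
    (other domains may be interspersed)."""
    remaining = iter(names)
    return all(dom in remaining for dom in required)

# hypothetical minimal domain architecture: domA followed by domB
required = ["domA", "domB"]
lines = [
    "P1,Homo sapiens,domA:10..50,domB:80..120",
    "P2,Mus musculus,domB:5..40",
    "P3,Danio rerio,domB:5..40,domA:60..100",
]
kept = [acc for acc, names in map(domain_order, lines)
        if matches_architecture(names, required)]
print(kept)
```

Proteins that fail this filter are not necessarily non-members of the family; annotation errors of the kind described above can produce the same pattern, so rejected sequences may still merit a manual look.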
Once a suitable data set of curated ingroup and outgroup
sequences has been identified, full-length protein sequences can
be downloaded from the nr database in FASTA format using the
blastdbcmd tool:
blastdbcmd -db nr -entry NP_004169 -outfmt %f -out unaligned.fasta
Sequences corresponding to individual functional domains can
also be downloaded by supplying starting and ending ranges:
blastdbcmd -db nr -entry NP_004169 -range 100-200 -outfmt %f -out unaligned.100-200.fasta
Downloading a large number of sequences can be accomplished
by entering each accession (and range) on a single line of
a text file and using the -entry_batch flag.
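A batch file of this kind can be generated from a list of curated hits. The sketch below assumes that each line holds an accession optionally followed by a space and a start-end range, which is our reading of the blastdbcmd batch format and should be checked against the installed BLAST+ version. The function name is illustrative.

```python
def entry_batch_lines(hits):
    """hits: iterable of (accession, start, end); start and end may be
    None to request the full-length sequence."""
    lines = []
    for acc, start, end in hits:
        if start is None or end is None:
            lines.append(acc)
        else:
            lines.append("%s %d-%d" % (acc, start, end))
    return lines

# the accession is the one used in the commands above
hits = [("NP_004169", None, None), ("NP_004169", 100, 200)]
batch_text = "\n".join(entry_batch_lines(hits)) + "\n"
print(batch_text)
```

Writing batch_text to a file (for example entries.txt, a hypothetical name) would then allow a call such as blastdbcmd -db nr -entry_batch entries.txt -outfmt %f -out unaligned.fasta.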
2.2 Alignment and Phylogenetic Tree Inference
Multiple sequence alignment forms the basis for inferring the protein
family phylogeny and reconstructing ancestral sequences. Conceptually,
aligning protein sequences amounts to making residue-
level statements of homology: aligned residues from two different
sequences are inferred to have arisen from a common ancestor;
aligning a residue from one sequence to a gap in another sequence
(“-”) amounts to inferring that the residue in the first sequence
does not have a homologous residue in the second sequence, either
due to an insertion in the first or a deletion in the second. Most
approaches to phylogenetic inference treat alignment and tree
inference as separate problems, first aligning the sequences and
then inferring the most likely tree, given that alignment. However,
there are approaches that attempt to simultaneously infer the
sequence alignment and the phylogeny [19, 41–44].
Unfortunately, different sequence alignment algorithms can
produce different residue-level statements of homology, even
when overall alignment accuracies are similar [45–47]. Additionally,
some regions of the alignment may be easier to infer—e.g., highly
conserved functional domains—while other regions may be more
error-prone [47]. Any errors in the sequence alignment can poten-
tially impact phylogenetic inference and ancestral sequence recon-
struction [48–50]. Ideally, we would like to eliminate alignment
errors, but this is typically not possible.
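Because an alignment is a set of residue-level homology statements, two alignments of the same sequences can be compared by counting the residue pairs that both place in the same column. The sketch below illustrates the idea on toy data; it is not one of the published alignment-comparison scores, and the function names are illustrative.

```python
from itertools import combinations

def homology_pairs(alignment):
    """alignment: dict of id -> gapped sequence (all the same length).
    Returns the set of residue pairs ((id1, pos1), (id2, pos2)) that the
    alignment places in the same column; positions index ungapped residues."""
    ids = sorted(alignment)
    counters = dict((i, 0) for i in ids)
    pairs = set()
    ncol = len(alignment[ids[0]])
    for col in range(ncol):
        present = []
        for i in ids:
            if alignment[i][col] != "-":
                present.append((i, counters[i]))
                counters[i] += 1
        pairs.update(combinations(present, 2))
    return pairs

def shared_fraction(aln_a, aln_b):
    """Fraction of homology statements common to both alignments."""
    a, b = homology_pairs(aln_a), homology_pairs(aln_b)
    union = a | b
    return len(a & b) / float(len(union)) if union else 1.0

# two toy alignments of the same sequences that differ in gap placement
aln1 = {"s1": "AC-GT", "s2": "ACCGT"}
aln2 = {"s1": "ACG-T", "s2": "ACCGT"}
print(shared_fraction(aln1, aln2))
```

In this toy case, moving a single gap by one column changes two of the homology statements while the two alignments look almost identical by eye.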
Here we take a “robustness approach” to sequence alignment
and phylogenetic inference; we use a variety of alignment strategies
to generate a large number of plausible sequence alignments, use
each of these alignments to infer the protein family phylogeny, and
then combine these inferences—both formally and informally—to
identify the protein family phylogeny most “robust” to uncertainty
in the sequence alignment.
As an example, we will use the popular (and relatively fast)
alignment algorithms from clustalw2 [51, 52], muscle [53], and
mafft [54] to align protein sequences. If computational resources
permit, other alignment programs such as msaprobs [55], proba-
lign [56], probcons [57], and tcoffee [58] can also be used.
Assuming unaligned sequences have been collected in FASTA

format, aligning full-length protein sequences is trivial at the UNIX
command line:
clustalw2 -output=fasta -infile=unaligned.fulllength.fasta -outfile=aligned_clustalw.fasta
muscle -in unaligned.fulllength.fasta -out aligned_muscle.fasta
einsi unaligned.fulllength.fasta > aligned_einsi.fasta
Note that we are using the einsi algorithm in mafft as a general

alignment strategy for divergent, multidomain proteins. For con-
venience, we are standardizing all alignment formats to FASTA and
using a file-naming convention of aligned_<algorithm_name>.
fasta to facilitate scripting.
In addition to full-length sequence alignments, we recommend
extracting the functional domains encoding the protein family’s
previously defined “minimal domain architecture” from each
sequence and aligning them, without including variable domains
or intervening sequences, using the same alignment algorithms.
Removing variable domains and intervening sequences simplifies
the alignment process and avoids alignment errors due to highly
divergent linker sequences or variation in domain architecture
across the sequences under study [47].
Another approach to reducing potential alignment errors is to
identify and remove potentially ambiguous alignment columns
using an objective alignment-processing methodology
[59–62]. The most commonly used approach is Gblocks, a simple
and efficient ad hoc alignment-processing algorithm [63]. In our
analyses, we typically set the minimum number of sequences for a
flank position (-b2) equal to 3/5 of the total number of sequences
in the alignment, the maximum number of nonconserved positions
(-b3) to 10, and the minimum block length (-b4) to 5. We also
typically allow gap positions (-b5=a). The specific parameter
values will probably vary, depending on the protein family under
study; ideally, we want to remove potentially unreliable alignment
regions while leaving enough data to reconstruct a well-supported
phylogeny. Each sequence alignment can be processed at the com-
mand line to generate a trimmed alignment with potentially ambig-
uous regions removed.
Gblocks aligned_clustalw.fasta -b2=576 -b3=10 -b4=5 -b5=a

Gblocks aligned_muscle.fasta -b2=576 -b3=10 -b4=5 -b5=a
Gblocks aligned_einsi.fasta -b2=576 -b3=10 -b4=5 -b5=a
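The value passed to -b2 depends on the number of sequences in the alignment; the value 576 in the commands above corresponds to 3/5 of 960 sequences (the sequence count is inferred from that value, not stated in the text). A small helper, with illustrative function names, can build the command line for an alignment of any size:

```python
def count_fasta_records(lines):
    """Count the sequences in FASTA-formatted text lines."""
    return sum(1 for line in lines if line.startswith(">"))

def gblocks_command(aln_path, nseqs, b3=10, b4=5, b5="a"):
    """Build a Gblocks command with -b2 set to 3/5 of the sequences."""
    b2 = (3 * nseqs) // 5  # integer arithmetic avoids rounding surprises
    return "Gblocks %s -b2=%d -b3=%d -b4=%d -b5=%s" % (aln_path, b2, b3, b4, b5)

print(gblocks_command("aligned_clustalw.fasta", 960))
```

In practice the count would come from the alignment file itself, for example count_fasta_records(open("aligned_clustalw.fasta")).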
One approach we will use to incorporate alignment uncertainty

is an “elision” technique [64], which is similar to a “supermatrix”
approach for species tree reconstruction [65]. In this case, however,
we are concatenating different alignments of the same protein
sequences, rather than concatenating alignments of different gene
families. An “elision” alignment can be constructed from individual
sequence alignments using the following python script:
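makeElision.py:
#!/usr/bin/env python
# Concatenate all aligned_*.fasta files into a single "elision" alignment.
import glob
def parseFasta(infname):
    # return (alignment length, {sequence id: aligned sequence})
    alnlen = 0
    alignment = {}
    handle = open(infname, "r")
    line = handle.readline()
    while line:
        if line[0] == ">":
            id = line[1:].strip()
            seq = ""
            line = handle.readline()
            while line and line[0] != ">":
                seq += line.strip()
                line = handle.readline()
            alnlen = len(seq)
            alignment[id] = seq
        else:
            line = handle.readline()
    handle.close()
    return (alnlen, alignment)
# sequence ids are taken from the unaligned input file
allids = []
handle = open("unaligned.fulllength.fasta", "r")
for line in handle:
    if line[0] == ">":
        allids.append(line[1:].strip())
handle.close()
full_aln = {}
for id in allids:
    full_aln[id] = ""
for fname in sorted(glob.glob("aligned_*.fasta")):
    if fname == "aligned_elision.fasta":
        continue  # skip the output of a previous run
    (alnlen, aln) = parseFasta(fname)
    for id in allids:
        if id in aln:
            full_aln[id] += aln[id]
        else:
            # sequence missing from this alignment: pad with gaps
            full_aln[id] += ("-" * alnlen)
outf = open("aligned_elision.fasta", "w")
for id in allids:
    outf.write(">%s\n%s\n" % (id, full_aln[id]))
outf.close()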
Each sequence alignment can be used to infer a protein family

phylogeny using standard tree inference techniques, such as maxi-
mum parsimony, neighbor joining, maximum likelihood, or Bayes-
ian inference. For this example, we will use maximum likelihood
phylogenetic inference, as it is considered one of the most accurate
approaches [66, 67], and existing software is computationally effi-
cient [68–70].
Statistical phylogenetic inference methods like maximum like-
lihood and Bayesian inference rely on a probabilistic model of the
molecular-evolutionary process, and different models can result in
different tree topologies [71, 72]. Although most commonly used
models typically have little impact on the main branching pattern of
the tree, it is advisable to select the best-fit model for each sequence
alignment using a statistical model selection procedure. Perhaps the
most widely used approach is to select the best-fit model using
either the Akaike information criterion (AIC) or the Bayesian
information criterion (BIC), both of which are implemented in
ProtTest [73]:
prottest3 -i aligned_clustalw.fasta -G -F -all -o modelSelection.clustalw.results.txt
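For reference, both criteria penalize the maximized log-likelihood ln L by model complexity: AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of free parameters and n is the sample size (taken here to be the number of alignment columns, which is an assumption). A minimal sketch follows; the model names and log-likelihoods are invented for illustration and are not ProtTest output.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood

def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood

# hypothetical candidates: name -> (ln L, number of free parameters)
candidates = {"modelA": (-10250.0, 199), "modelB": (-10240.0, 219)}
n_sites = 350  # hypothetical alignment length

best_aic = min(candidates, key=lambda m: aic(*candidates[m]))
best_bic = min(candidates,
               key=lambda m: bic(candidates[m][0], candidates[m][1], n_sites))
print(best_aic, best_bic)
```

Because the BIC penalty grows with ln(n), it tends to favor simpler models than the AIC on long alignments.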
Once a suitable evolutionary model has been selected, a maxi-

mum likelihood phylogenetic tree can be rapidly inferred using
FastTree [68]; for example:
FastTree -pseudo -lg aligned_clustalw.fasta > clustalw.fasttree.tre
where the -lg parameter specifies the amino acid transition
model [74]. FastTree is extremely computationally efficient and
can produce highly accurate phylogenetic inferences on its own
[69]. Clade support is reported as SH-like aLRT scores, which
have been demonstrated to be statistically powerful but typically
slightly conservative support estimates [75, 76]. In our analyses, we
usually consider SH-like aLRT scores >0.8 to be strongly sup-
ported clades. One potential problem with FastTree output is that
redundant sequences are collapsed to polytomies in the inferred
phylogeny, which can be incompatible with some other analysis
programs. This problem can be avoided by removing completely
redundant sequences from the analysis prior to alignment and tree
inference.
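Completely redundant sequences can be removed with a few lines of Python before alignment. The sketch below keeps the first record for each distinct sequence and treats upper and lower case as identical; the function names are illustrative.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def drop_identical(records):
    """Keep the first record for each distinct sequence."""
    seen, kept = set(), []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            kept.append((header, seq))
    return kept

example = [">a", "MKV", "LT", ">b", "mkvlt", ">c", "MKVLS"]
unique = drop_identical(read_fasta(example))
print(unique)
```

It is worth recording which accessions were dropped, so that the removed sequences can be mapped back onto the retained representative when interpreting the tree.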
If the best-fit evolutionary model is not available in FastTree, or

if the inferred phylogeny is to be refined using what might be
considered a more rigorous tree search algorithm, the FastTree
phylogeny can be used as a starting tree to perform a maximum-
likelihood inference using RAxML [70]:
raxml -m PROTCATLGX -f t -s aligned_clustalw.fasta -t clustalw.fasttree.tre -n clustalwtree
RAxML has a large number of parameters that can be set to

tune the evolutionary model, the tree search algorithm, and the
clade support calculation.
After protein family phylogenies have been generated from each
sequence alignment, we recommend manually examining the set of
plausible trees to evaluate any strong consistencies or inconsistencies
across alignments. Regions of the tree that are consistently recov-
ered with strong clade support across alignments should generally
be considered reliable, whereas regions that are weakly supported or
that vary across alignments are less robust to alignment uncertainty.
In addition to this informal evaluation, consistency across
alignments can be formally evaluated using a “supertree” approach
that combines phylogenies inferred from each alignment into a
single “consensus.” To perform this analysis, use the supertree
toolkit (STK) to convert the ensemble of phylogenetic inferences
into a clade presence-absence matrix [77]:
cat tree1.tre tree2.tre ... treeN.tre > alltrees.tre
stk create_matrix -f nexus alltrees.tre alltrees.nexus
python convertMatrix.py alltrees.nexus > alltrees.fasta
where convertMatrix.py is the following python script that
converts NEXUS to FASTA format:
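#!/usr/bin/env python
# Convert the matrix block of a NEXUS file to FASTA, writing to standard output.
import sys
fname = sys.argv[1]
handle = open(fname, "r")
readmatrix = False
for line in handle:
    linearr = line.split()
    # the matrix block starts after a line beginning with "matrix"
    if len(linearr) > 0 and linearr[0].lower() == "matrix":
        readmatrix = True
        continue
    if readmatrix:
        if len(linearr) > 0 and linearr[0] == ";":
            break  # end of the matrix block
        elif len(linearr) > 0:
            id = linearr[0]
            # missing data ("?") become gaps ("-")
            seq = linearr[1].replace("?", "-")
            sys.stdout.write(">%s\n%s\n" % (id, seq))
handle.close()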
The clade presence-absence matrix can then be used to reconstruct
a “consensus supertree” using a binary likelihood model and
clade support evaluated as for any other alignment:
raxml -m BINCATX -f t -s alltrees.fasta -t startingTree.tre -n supertree

raxml -m BINCATX -f J -s alltrees.fasta -t RAxML_bestTree.supertree -n supertreeSH
Note that clade support generated from the concatenated “elision”
alignment and the supertree approach are summaries of
consistency across different alignments, and do not necessarily
reflect the support for a given clade generated by each individual
alignment. A weakly supported clade that is consistently recovered
across alignment strategies should generally be given less credibility
than a clade that is consistently recovered with high support. For
this reason, we suggest that researchers investigate both consensus
support from the elision alignment and supertree inferences and
clade support generated by individual alignments, which can be
summarized by providing the maximum, minimum, and
mean ± standard deviation of support for a given clade across
alignment strategies.
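As a minimal sketch of this summary, the function below reports the maximum, minimum, mean, and sample standard deviation of the support values obtained for one clade. The function name and the example support values are hypothetical; how support values are extracted from each tree is left to the researcher.

```python
import math


def summarize_support(values):
    """Return (max, min, mean, sample standard deviation) of support values."""
    n = len(values)
    mean = sum(values) / float(n)
    if n > 1:
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    else:
        sd = 0.0  # a single value has no spread
    return max(values), min(values), mean, sd


# hypothetical support for one clade under four alignment strategies
support = [0.98, 0.91, 0.62, 0.95]
hi, lo, mean, sd = summarize_support(support)
print("max=%.2f min=%.2f mean=%.2f +/- %.2f" % (hi, lo, mean, sd))
```

A clade with a high mean but a low minimum, as in this toy example, is one whose support depends on the alignment strategy.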
2.3 Ancestral Sequence and Insertion-Deletion Reconstruction
Given an alignment and a phylogeny, marginal maximum-likelihood
ancestral sequence reconstructions can be computed
across all ancestral nodes very efficiently:
raxml -f A -m PROTCATLGF -t tree.tre -s alignment.fasta -n alignment.ancseqs
Maximum-likelihood ancestral sequences will be aligned to the

input sequence alignment and are available in:
RAxML_marginalAncestralStates.alignment.ancseqs
RAxML numbers ancestral nodes based on a tree traversal

algorithm; a node-labeled tree is provided, so the researcher can
map ancestral sequence identifiers to node labels:
RAxML_nodeLabelledRootedTree.alignment.ancseqs
Finally, the marginal posterior probability distributions across

all residues at each site and ancestral node are provided, allowing
the researcher to evaluate alternative reconstructions:
RAxML_marginalAncestralProbabilities.alignment.ancseqs
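One way to use these distributions is to list, for each site, every state whose posterior probability exceeds a chosen cutoff. The sketch below assumes the probabilities for one site have already been read into a list that is in the same order as a list of state labels; the function name, the toy three-state alphabet, and the default cutoff are illustrative choices rather than part of RAxML.

```python
def plausible_states(probs, labels, threshold=0.3):
    """Return [(label, prob), ...] for states above threshold, best first."""
    hits = [(lab, p) for lab, p in zip(labels, probs) if p > threshold]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits


# toy example with a three-state alphabet
labels = ["K", "R", "Q"]
site = [0.55, 0.40, 0.05]
states = plausible_states(site, labels)
print(states)
if len(states) > 1:
    print("ambiguous site: consider testing the alternative state(s)")
```

Sites that return more than one state are candidates for alternative reconstructions.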
Although there are some algorithms that attempt to reconstruct
ancestral insertion-deletion events (indels) explicitly
[17, 19], the “standard” ASR algorithm treats gaps as missing
data, so ancestral indels need to be reconstructed separately and
then integrated into the sequence reconstruction. Here we perform
this task using a simple binary likelihood model applied to the
presence-absence sequence alignment. First, convert the amino
acid alignment to a presence-absence alignment using the following
python script:
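makePresenceAbsenceMatrix.py:
import sys
handle = open(sys.argv[1], "r")
line = handle.readline()
while line:
    if line[0] == ">":
        id = line[1:].strip()
        # collect the (possibly multi-line) sequence for this record #
        seq = ""
        line = handle.readline()
        while line and line[0] != ">":
            seq += line.strip()
            line = handle.readline()
        # gaps become "0" (absent); everything else becomes "1" (present) #
        indelseq = ""
        for c in seq:
            if c == "-":
                indelseq += "0"
            else:
                indelseq += "1"
        sys.stdout.write(">%s\n%s\n" % (id,indelseq))
    else:
        line = handle.readline()
handle.close()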
Next, reconstruct ancestral indels using RAxML:
raxml -f A -m BINCAT -t tree.tre -s alignment.presenceabsence.fasta -n alignment.ancindels
Finally, combine ancestral sequence reconstructions with

ancestral indel reconstructions using a python script:
putAncestralIndels.py:
import sys
SEQF = sys.argv[1] # RAxML_marginalAncestralStates.alignment.ancseqs
INDF = sys.argv[2] # RAxML_marginalAncestralStates.alignment.ancindels
# read ancestral sequences: one "nodeID sequence" pair per line #
ancseqs = {}
handle = open(SEQF,"r")
for line in handle:
    linearr = line.split()
    if len(linearr) < 2:
        continue
    id = linearr[0]
    seq = linearr[1]
    ancseqs[id] = seq
handle.close()
# mask each ancestral sequence with its reconstructed indel states #
handle = open(INDF,"r")
for line in handle:
    linearr = line.split()
    if len(linearr) < 2:
        continue
    id = linearr[0]
    ins = linearr[1]
    seq = ancseqs[id]
    sys.stdout.write(">%s\n" % id)
    for i in range(len(seq)):
        if ins[i] == "0" or seq[i] == "?":
            sys.stdout.write("-")
        else:
            sys.stdout.write(seq[i])
    sys.stdout.write("\n")
handle.close()
Although previous studies have suggested that ancestral

sequence reconstruction is expected to be highly congruent across
plausible tree topologies, it is possible to integrate reconstructions
across a set of input trees [29] or to use Bayesian approaches to
integrate reconstructions over model parameter values, rather than
relying on maximum-likelihood inference [19, 78]. It is up to the
researcher to evaluate the robustness of ancestral sequence recon-
structions; we recommend adopting the common approach of
evaluating alternative reconstructions if they are supported with
>0.3 posterior probability. Bayesian sampled ancestral sequences
can be generated using a python script (requires SciPy):
getBayesASR.py:
import sys
from scipy import stats
if len(sys.argv) < 5:
    sys.stderr.write("usage: getBayesASR.py N nodeID RAxML_marginalAncestralProbabilities.ancseqs RAxML_marginalAncestralProbabilities.ancindels\n")
    sys.stderr.write(" generates N random ancestral sequence 'draws' from the marginal probability distributions\n")
    sys.stderr.write(" including indels\n")
    sys.exit(1)
nseqs = int(sys.argv[1])
nodeid = sys.argv[2]
ancseqprobfname = sys.argv[3]
ancindprobfname = sys.argv[4]
# read prob distributions for sequences #
# NOTE: the label order must match the column order of the RAxML #
# probabilities file; check it against your RAxML version        #
seq_labels = ["A","R","N","D","C","E","Q","G","H","I","L","K","M","F","P","S","T",
              "W","Y","V"]
seq_probdists = []
handle = open(ancseqprobfname, "r")
# skip ahead to the block for the requested node #
line = handle.readline()
while line and line.strip() != nodeid:
    line = handle.readline()
# read one line of state probabilities per site until the block ends #
line = handle.readline()
while line:
    linearr = line.split()
    if len(linearr) < 20:
        break
    pdist = []
    idist = []
    i = 0
    for k in linearr:
        p = float(k)
        if p > 0.0:
            pdist.append(p)
            idist.append(i)
        i += 1
    seq_probdists.append(stats.rv_discrete(values=(idist,pdist)))
    line = handle.readline()
handle.close()
# read prob distributions for indels #
ind_labels = [0,1]
ind_probdists = []
handle = open(ancindprobfname, "r")
line = handle.readline()
while line and line.strip() != nodeid:
    line = handle.readline()
line = handle.readline()
while line:
    linearr = line.split()
    if len(linearr) < 2:
        break
    pdist = []
    idist = []
    i = 0
    for k in linearr:
        p = float(k)
        if p > 0.0:
            pdist.append(p)
            idist.append(i)
        i += 1
    ind_probdists.append(stats.rv_discrete(values=(idist,pdist)))
    line = handle.readline()
handle.close()
# now generate random sequences in FASTA format #
for i in range(nseqs):
    sys.stdout.write(">%s_rep%d\n" % (nodeid,i))
    seq = ""
    for k in range(len(seq_probdists)):
        # draw presence/absence first, then a residue if present #
        if ind_probdists[k].rvs():
            seq += seq_labels[seq_probdists[k].rvs()]
        else:
            seq += "-"
    sys.stdout.write("%s\n" % seq)
Here we treat the inference of the protein family phylogeny and

ancestral sequence reconstruction as separate problems with differ-
ent objectives. Phylogenetic inference is concerned primarily with
determining the most accurate and robust tree topology, whereas
the goal of ancestral sequence reconstruction is to identify the most
accurate and robust ancestral sequences, given the inferred protein
family tree or ensemble of plausible trees. In order to maximize
available data and improve statistical power, full-length sequences
are typically used to infer the protein family tree. However, in most
cases, only a subset of the protein family’s functional domains will
be used for ancestral reconstruction.
Any of the sequence alignments used for phylogenetic infer-
ence can be used to reconstruct ancestral sequences, and examining
reconstruction congruence across plausible alignments could be a
useful approach for characterizing ASR robustness. However, for
this methodology, we assume the researcher has access to 3D
structural information for the domain(s) of interest, which will be
used to ultimately predict protein-ligand affinities. For cases in
which multiple structures of the relevant functional domain(s) are
available, we recommend performing a structure-based sequence
alignment and using that alignment to reconstruct ancestral
sequences. First, align 3D structures using MODELLER [79, 80], which
can be run through a python script.
structureAlign.py:
import os
import sys
import glob
from modeller import *
import modeller.salign
if len(sys.argv) < 2:
    sys.stderr.write("usage: structureAlign.py directory\n")
    sys.stderr.write(" will perform an iterative structural alignment of all\n")
    sys.stderr.write(" the .pdb files in the input directory\n")
    sys.stderr.write(" the resulting alignment is printed to directory_it.pap\n")
    sys.stderr.write(" and directory_it.ali\n")
    sys.exit(1)
thedir = sys.argv[1]
if thedir[-1] == "/":
    thedir = thedir[:-1]
pdbfiles = glob.glob("%s/*.pdb" % thedir)
# set up environment for modeller #
log.verbose()
env = environ()
env.io.atom_files_directory = thedir
aln = alignment(env)
# add all the .pdb files to the modeller alignment object #
for pdb in pdbfiles:
    code = pdb.split("/")[-1].split(".pdb")[0]
    mdl = model(env, file=code)
    aln.append_model(mdl, atom_files=code, align_codes=code)
# perform the iterative structural alignment #
modeller.salign.iterative_structural_align(aln)
aln.write(file='%s_it.pap' % thedir, alignment_format='PAP')
aln.write(file='%s_it.ali' % thedir, alignment_format='PIR')
The python script requires MODELLER to be installed and produces
a structure-based alignment of all PDB files in an input
directory. The structure-based alignment is written in ALI (PIR) format,
which is very close to FASTA and must be converted to FASTA before
the next step. This structure-based alignment will
be used as a “seed” to align homologous domains from the protein
family of interest; it is important that only the corresponding
functional domains are aligned to the structure-based seed alignment.
We will use mafft (ginsi) to align sequences to the seed.
ginsi --seed structalign_it.fasta unaligned_domains.fasta > structaligned_domains.fasta
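The seed file passed to ginsi must be in FASTA format, whereas the structural alignment is written in ALI (PIR) format. A minimal conversion sketch follows; it assumes each record consists of a ">P1;code" header, one description line, and sequence lines terminated by "*". The function name and the example records are hypothetical, and comment lines or other PIR features are not handled.

```python
def pir_to_fasta(pir_text):
    """Convert PIR/ALI-formatted alignment text to FASTA text."""
    out = []
    lines = pir_text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if line.startswith(">"):
            # header looks like ">P1;code"; keep only the code
            code = line.split(";", 1)[-1]
            i += 2  # skip the header and the description line after it
            seq = ""
            while i < len(lines) and not lines[i].startswith(">"):
                seq += lines[i].strip()
                i += 1
            # drop the terminating '*' and any chain-break '/' characters
            seq = seq.replace("*", "").replace("/", "")
            out.append(">%s\n%s" % (code, seq))
        else:
            i += 1
    return "\n".join(out) + "\n"


# hypothetical two-record alignment
example = """>P1;1abc
structureX:1abc:1:A:50:A::::
MK-LV
AA--G*
>P1;2xyz
structureX:2xyz:1:A:50:A::::
MKTLV
AAQ-G*
"""
print(pir_to_fasta(example))
```

Gap characters are preserved, so the converted file remains an alignment and can serve as the seed.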
The seed sequences can then be removed prior to ancestral

reconstruction. In our view, provided diverse 3D structures of the
functional domain of interest are available, structure-based
sequence alignment will generally give better results in this applica-
tion than sequence-based alignment, which can tend to misinter-
pret highly divergent sequences as insertion/deletion events.
Protein structure tends to be much more conserved than amino
acid sequence over long evolutionary distances, so structure-based
alignments should provide a more appropriate platform for down-
stream structural modeling and affinity prediction [81, 82].
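Removing the seed sequences from the resulting alignment can be sketched as a simple FASTA filter. The sketch assumes that seed records can be recognized by a name prefix; the default "_seed_" used here is an assumption that should be checked against the headers in the actual mafft output, and the function name is hypothetical.

```python
def drop_seed_records(fasta_text, seed_prefix="_seed_"):
    """Return FASTA text without records whose name starts with seed_prefix."""
    kept = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # decide once per record whether to keep it
            keep = not line[1:].startswith(seed_prefix)
        if keep:
            kept.append(line)
    return "\n".join(kept) + "\n"


# hypothetical alignment containing one seed record
aligned = ">_seed_3ADL\nMK-L\n>seq1\nMKAL\n>seq2\nM-AL\n"
print(drop_seed_records(aligned))
```

The remaining records keep their alignment columns, so the output can be passed directly to ancestral reconstruction.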
2.3.1 Structural Modeling and Optimization
It is widely thought that proteins function primarily through their
three-dimensional structure, which determines the spatial distribution
of biochemical properties and its dynamics [83–85]. Here we
will exploit this structural basis for molecular function to provide
high-throughput molecular affinity predictions across ancestral and
extant protein sequences. We will use structural homology model-
ing to infer 3D structures of protein sequences for which empirical
structures are not available [79]. To facilitate downstream affinity
predictions, we will need at least one empirical structure of the
functional domain of interest from a protein family member or
distantly related homolog in complex with a ligand of interest,
which can be a small molecule, DNA/RNA, or another protein.
Most often, this will be retrieved from the Protein Data Bank by
sequence search [86]. Note that if the 3D structure of the func-
tional domain of interest has not been empirically solved in complex
with an appropriate ligand, it will need to be generated before
proceeding with this protocol. Generating a starting protein-ligand
complex is probably best done using an empirical structure-
determination protocol, although de novo structure prediction or
protein-ligand docking is an alternative method [87–91].
Once an appropriate protein-ligand complex has been gener-
ated, its protein sequence should be aligned to the same alignment
used to infer ancestral sequences; this will ensure that all extant and
ancestral protein sequences are aligned to the structural template,
facilitating high-throughput structural modeling. Given an alignment
of a protein sequence to a structural template, MODELLER can be
used to generate 100 structural models and evaluate their accuracy
using a number of validation scores [79].
generateStructuralModel.py:
from modeller import *
from modeller.automodel import *
#####------------ CONTROL VARIABLES ------------ #####
## you should only have to change these. ##
ALNFILE = 'alignment.ali' # alignment file
KNOWNS = 'mystructure' # name of template (known structure)
SEQ = 'mysequenceID' # name of target (sequence of unknown structure)
NMODELS = 100 # number of models to build
## ##
#####------------ END CONTROL VARIABLES ------------ #####
log.verbose() # request verbose output
env = environ() # create a new MODELLER environment to build this model in
# directories for input atom (structure) files
env.io.atom_files_directory = ['.','../..','../../..']
a = automodel(env,
              alnfile = ALNFILE,
              knowns = KNOWNS,
              sequence = SEQ,
              assess_methods=(assess.DOPE, assess.DOPEHR, assess.normalized_dope,
                              assess.GA341))
a.starting_model= 1
a.ending_model = NMODELS
a.make()
This script will print a large amount of diagnostic information

to the screen, including model assessment scores, which can be
used to identify the highest-quality structural model(s) for down-
stream analysis. The format of the sequence-template alignment file
(“alignment.ali” in this example) is MODELLER-specific and will
depend on the particular complex used for homology-based structural
modeling. Please consult the documentation for the specific
version of MODELLER used to ensure that the alignment file format is
correct. As an example, consider the following alignment.ali file,
constructed using PDBID 3ADL:
>P1;ANC948
sequence:ANC948:::::::0.00: 0.00
---–QCDPDNDPSKTPISL-LSQLCEKRN-LCSPEYD------LVSQ-QG---P---–
PH---TRTFTMRVTVGD----FV-F-QGT---GRSKKEAKHNAAEKMLDHLRQ-
CPDVPYPT--
/
........../
..........*
>P1;3ADL
structure:3ADL::A:::::0.00: 0.00
------------–SHEVGA-LQELVVQKG-WRLPEYT------VTQE-SG---P---–
AH---RKEFTMTCRVER----FI-E-IGS---GTSKKLAKR-
NAAAKMLLRVHT---------–
/
........../
..........*
In this case, an ancestral DSRM protein sequence (ANC948)

has been aligned to the 3ADL protein structure; the gaps present in
both sequences are artifacts created by extracting these
two sequences out of a larger multi-sequence alignment and are
inconsequential for structural modeling. The forward-slash char-
acters (/) separate structural chains, and the dots (.) indicate
non-amino acid molecules, a double-stranded RNA in this case,
which is present in the structural template and will be transferred to
the homology model.
Automating MODELLER to generate structural models for a large
number of aligned sequences is relatively straightforward but does
require a bit of organization. Given an alignment of protein
sequence domains and an aligned structural template, the following
python script will generate a directory hierarchy including separate
directories for each sequence, construct 100 structural models for
each sequence, and identify each sequence’s best model. Note that
this incorporates the previous generateStructuralModel.py script.
generateStructuralModels.py:
import sys
import glob
import os
## NOTE: you will need to change the PDBID to the template you are using ##
PDBID = "3ADL"
## the alignment will need to be modified for your system ##
alnstr = """
>P1;%s
sequence:%s:::::::0.00: 0.00
%s
/
........../
..........*
>P1;%s
structure:%s::A:::::0.00: 0.00
%s
/
........../
..........*
"""
modelerpy="""
from modeller import *
from modeller.automodel import *
#####------------ CONTROL VARIABLES ------------ #####
## you should only have to change these. ##
ALNFILE = 'alignment.ali' # alignment file
KNOWNS = '%s' # name of template (known structure)
SEQ = '%s' # name of target (sequence of unknown structure)
NMODELS = 100 # number of models to build
## ##
#####------------ END CONTROL VARIABLES ------------ #####
log.verbose() # request verbose output
env = environ() # create a new MODELLER environment to build this model in
# directories for input atom files
env.io.atom_files_directory = ['.','../..','../../..']
a = automodel(env,
              alnfile = ALNFILE,
              knowns = KNOWNS,
              sequence = SEQ,
              assess_methods=(assess.DOPE, assess.DOPEHR))
a.starting_model= 1
a.ending_model = NMODELS
a.make()
"""
def launchSeq(mydir,myid,myseq,number,mystruct):
    # group sequences into subdirectories of 100 (D0, D1, ...) #
    index = number // 100
    topdir = "D%d" % index
    wrkdir = "%s/%s/%s" % (mydir,topdir,myid)
    if not os.path.exists(wrkdir):
        os.makedirs(wrkdir)
    #write alignment file#
    handle = open("%s/alignment.ali" % wrkdir, "w")
    handle.write(alnstr % (myid,myid,myseq,PDBID,PDBID,mystruct))
    handle.close()
    #write modeler run file#
    handle = open("%s/runModeller.py" % wrkdir, "w")
    handle.write(modelerpy % (PDBID,myid))
    handle.close()
    os.system("cp parseBestModel.py %s/" % wrkdir)
    os.chdir("%s/" % wrkdir)
    # run both scripts through the current python interpreter #
    os.system("%s runModeller.py > SCORES.txt" % sys.executable)
    os.system("%s parseBestModel.py SCORES.txt" % sys.executable)
    os.chdir("../../../")
    sys.stderr.write("finished: %s %s\n" % (mydir,myid))
num = 0
dr = "models"
alnfname = sys.argv[1] # alignment file, in FASTA
structfname = sys.argv[2] # aligned structural template, in FASTA
# we assume the structural template is aligned to the sequence alignment,
# is in FASTA format, and has the sequence all on one line
# (the second line of the file)
handle = open(structfname, "r")
handle.readline()
structuraltemplate = handle.readline().strip()
handle.close()
handle = open(alnfname, "r")
line = handle.readline()
while line:
    if line[0] == ">":
        id = line[1:].strip()
        se = ""
        line = handle.readline()
        while line and line[0] != ">":
            se += line.strip()
            line = handle.readline()
        ## now have id and seq, build directories and launch! ##
        launchSeq(dr,id,se,num,structuraltemplate)
        num += 1
    else:
        line = handle.readline()
handle.close()
This analysis relies on another python script:
parseBestModel.py:
import os
import sys
# Calculate mean and standard deviation of data x[]:
#   mean = sum_i x_i / n
#   std = sqrt(sum_i (x_i - mean)^2 / (n-1))
def meanstdev(x):
    from math import sqrt
    n, mean, std = len(x), 0, 0
    for a in x:
        mean = mean + a
    mean = mean / float(n)
    for a in x:
        std = std + (a - mean)**2
    std = sqrt(std / float(n-1))
    return (mean, std)
scorefname = sys.argv[1]
handle = open(scorefname, "r")
# skip over all the model-build output #
line = handle.readline()
while line:
    linearr = line.split()
    if len(linearr)>3 and linearr[0]=="Filename" and linearr[1]=="molpdf" and linearr[2]=="DOPE":
        break
    line = handle.readline()
# read model scores #
models = []
molpdf = []
dope = []
dopehr = []
handle.readline() # skip the line following the header #
line = handle.readline()
while line:
    linearr = line.split()
    if len(linearr) > 3:
        models.append(linearr[0])
        molpdf.append(float(linearr[1]))
        dope.append(float(linearr[2]))
        dopehr.append(float(linearr[3]))
    line = handle.readline()
handle.close()
# need to scale by 2 * stdev #
(molpdf_mean, molpdf_stdev) = meanstdev(molpdf)
( dope_mean, dope_stdev) = meanstdev(dope )
(dopehr_mean, dopehr_stdev) = meanstdev(dopehr)
newmolpdf = [(x-molpdf_mean)/(2.0*molpdf_stdev) for x in molpdf]
newdope = [(x- dope_mean)/(2.0*dope_stdev) for x in dope ]
newdopehr = [(x-dopehr_mean)/(2.0*dopehr_stdev) for x in dopehr]
# now calculate best model (lowest combined scaled score) #
bestmodel = ""
bestscore = 100000000.0
print("model [molpdf dope dopehr] score")
printlines = []
for i in range(len(models)):
    score = newmolpdf[i] + newdope[i] + newdopehr[i]
    printlines.append((score,"%s [%.3f(%.3f) %.3f(%.3f) %.3f(%.3f)] %.3f" %
                       (models[i], molpdf[i], newmolpdf[i], dope[i],
                        newdope[i], dopehr[i], newdopehr[i], score)))
    if score < bestscore:
        bestscore = score
        bestmodel = models[i]
printlines.sort(reverse=True)
for (s,v) in printlines:
    print(v)
print("BEST MODEL (out of %d): %s" % (len(models),bestmodel))
# get top 1 models #
printlines.sort()
NMODELS = 1
for i in range(NMODELS):
    (score,info) = printlines[i]
    model = info.split()[0]
    mname = info.split(".")[0]
    cmd = "cp %s ../%s.BESTMODEL_%d.pdb" % (model,mname,i)
    print(cmd)
    os.system(cmd)
In this case, the “best” structural model is determined by

averaging over all model assessment scores, each of which is first
scaled to units of standard deviation across the 100 constructed
structural models. The researcher can increase the NMODELS
variable to evaluate additional models, allowing assessment of
robustness to variation in structural modeling. For very large data

sets, structural homology modeling should probably be parallelized
across multiple processors; the particular approach will depend on
the interface available for parallelization on a specific computer
system.
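The model-selection rule described above can be illustrated on a toy data set. The sketch below centers each assessment score on its mean, divides by twice the sample standard deviation (the scaling used in parseBestModel.py), sums the scaled scores for each model, and returns the model with the lowest total. The function names, model names, and score values are hypothetical.

```python
import math


def scale(values):
    """Center values on their mean and divide by twice the sample SD."""
    n = len(values)
    mean = sum(values) / float(n)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / float(n - 1))
    return [(v - mean) / (2.0 * sd) for v in values]


def best_model(names, *score_lists):
    """Return the model name with the lowest summed scaled score."""
    scaled = [scale(s) for s in score_lists]
    totals = [sum(col) for col in zip(*scaled)]
    return names[totals.index(min(totals))]


# three hypothetical models with two assessment scores each (lower is better)
names = ["m1.pdb", "m2.pdb", "m3.pdb"]
molpdf = [1200.0, 1100.0, 1300.0]
dope = [-9000.0, -9500.0, -8800.0]
print(best_model(names, molpdf, dope))
```

Scaling before summing keeps a score with a large numerical range, such as DOPE, from dominating the combined ranking.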
Protein-protein complexes are typically modeled appropriately
by MODELLER, but other molecular systems (DNA/RNA or small
chemical ligands) may be treated as “block” molecules during
structural modeling, which fails to account for the specific bio-
chemical properties of the ligand. In these cases, it is recommended
that the resulting structural model be “optimized” to form favor-
able protein-ligand interactions, prior to affinity prediction. Here
we will use pdb2pqr to perform a fast, force-field-based structural
optimization [92]; a more computationally expensive alternative
would be to optimize the protein-ligand structure using molecular
dynamics simulation [93].
pdb2pqr --ff amber --chain inmodel.pdb outmodel.pqr
This optimization will need to be executed for each structural

model used in the analysis.
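A minimal sketch for preparing one optimization command per structural model is shown below. It reuses the flags of the command given above; the function name and the example file names are hypothetical, and the glob pattern in the closing comment assumes the directory layout produced by the model-building scripts.

```python
def pdb2pqr_commands(pdb_paths):
    """Return one pdb2pqr argument list per input model."""
    cmds = []
    for pdb in pdb_paths:
        # name the output after the input, swapping .pdb for .pqr
        if pdb.endswith(".pdb"):
            out = pdb[:-4] + ".pqr"
        else:
            out = pdb + ".pqr"
        cmds.append(["pdb2pqr", "--ff", "amber", "--chain", pdb, out])
    return cmds


# hypothetical best-model files
for cmd in pdb2pqr_commands(["ANC948.BESTMODEL_0.pdb", "ANC949.BESTMODEL_0.pdb"]):
    print(" ".join(cmd))

# To execute rather than print (assumed layout: models/D*/ID.BESTMODEL_0.pdb):
#   import glob, subprocess
#   for cmd in pdb2pqr_commands(sorted(glob.glob("models/*/*.BESTMODEL_0.pdb"))):
#       subprocess.call(cmd)
```

Printing the commands first makes it easy to inspect them, or to hand them to a job scheduler, before anything is run.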
2.4 Affinity Prediction and Visualization
A variety of structure-based affinity prediction tools are available,
most of them tuned for predicting a protein’s affinity for small
chemical ligands such as drug leads [94–98]. Here we will use
generalized linear models developed in our lab to perform
structure-based affinity prediction; models are available for predict-
ing protein-small molecule, protein-DNA/RNA, or protein-
protein affinities [99, 100]. The software requires the protein and
the ligand in separate files, so we will use the following python
script to extract specific chains from the complex:
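extractChains.py:
import sys
pdbfname = sys.argv[1]
chains = sys.argv[2:]
handle = open(pdbfname, "r")
for line in handle:
    # only coordinate records carry a chain identifier in column 22 #
    if len(line) > 21 and line.startswith(("ATOM","HETATM","TER")):
        chain = line[21]
        if chain in chains:
            sys.stdout.write(line)
handle.close()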
Assuming the protein domain is chain A, and the ligand is

encoded as chains B and C, separate protein and ligand files can
be generated using:
extractChains.py complex.pdb A > protein.pdb
extractChains.py complex.pdb B C > ligand.pdb
A ligand file in mol2 format is also required, which can be

generated using the OpenBabel library [101]:
babel -ipdb ligand.pdb -omol2 ligand.mol2
Finally, GLM-Score is used to predict the affinity of the protein

for its ligand [100]:
GLM-Score protein.pdb ligand.pdb ligand.mol2 protein
The final argument identifies the ligand type: if the ligand is
DNA/RNA, the last “protein” should be changed to “DNA”; a small
chemical ligand should instead be identified as “small
molecule.” Of course, affinity predictions will need to be made for
each structural model, which can be scripted by the researcher or
parallelized on a large supercomputer.
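On a single multi-core machine, the per-model predictions can be sketched with a small worker pool. The sketch below assumes Python 3 and one working directory per model, each holding protein.pdb, ligand.pdb, and ligand.mol2; the function names, the directory names, and the dry_run stand-in are hypothetical. The demonstration uses dry_run so that nothing is executed; passing subprocess.call as the runner would launch the real command.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# same arguments as the single command shown above
GLM_ARGS = ["GLM-Score", "protein.pdb", "ligand.pdb", "ligand.mol2", "protein"]


def score_dir(workdir, runner=subprocess.call):
    """Run the affinity prediction inside one model directory."""
    return runner(GLM_ARGS, cwd=workdir)


def score_all(workdirs, workers=4, runner=subprocess.call):
    """Score every directory, a few at a time; returns exit codes in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: score_dir(d, runner), workdirs))


def dry_run(args, cwd=None):
    """Stand-in runner that only reports what would be executed."""
    print("[%s] %s" % (cwd, " ".join(args)))
    return 0


print(score_all(["ANC948", "ANC949"], workers=2, runner=dry_run))
```

Threads are sufficient here because each worker only waits on an external process; on a cluster, the same list of directories could instead be split across scheduler jobs.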
Affinity predictions are provided as pKds, which are -log10-transformed
dissociation constants; larger pKds indicate tighter
protein-ligand binding. Results will be available in a
“ligand_result.txt” file.
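Because pKd is a logarithmic quantity, it can help to convert predictions back to dissociation constants or to fold differences when comparing two proteins. The short sketch below assumes dissociation constants are expressed in molar units; the function names are illustrative.

```python
def pkd_to_kd(pkd):
    """Convert a pKd value to a dissociation constant (molar)."""
    return 10.0 ** (-pkd)


def fold_change(pkd_a, pkd_b):
    """Ratio Kd_b / Kd_a; values above 1 mean a binds more tightly than b."""
    return 10.0 ** (pkd_a - pkd_b)


print(pkd_to_kd(9.0))
print(fold_change(8.0, 6.0))
```

A difference of one pKd unit therefore corresponds to a tenfold difference in the dissociation constant.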
Generating protein-ligand affinity predictions for every ances-
tral and extant sequence on a large protein family phylogeny can
provide valuable information about how one aspect of protein
function has evolved, but this information can be difficult to visua-
lize. Here we develop an approach that maps affinity estimates onto
the phylogeny used to generate ancestral sequences, coloring
branches on a red-blue gradient based on predicted affinity, with
high-affinity branches colored red and branches with low affinity
colored blue.
First, pKd estimates are collected into a single text file,
organized by protein identifier. We are assuming a directory struc-
ture in which each protein-ligand complex and resulting affinity
prediction is stored in a directory named by the protein ID. Results
will be placed in all_pkds.txt.
collectPKDs.py:
import sys
import glob
outf = open("all_pkds.txt", "w")
for f in glob.glob("*/*_result.txt"):
    # the directory name is the protein identifier #
    seqid = f.split("/")[0]
    pkds = []
    handle = open(f,"r")
    for line in handle:
        linearr = line.split()
        if len(linearr) > 2 and linearr[0] == "predicted" and linearr[1] == "pKD:":
            pkds.append(float(linearr[2]))
    handle.close()
    # write the identifier followed by its pKds, highest first #
    pkds.sort(reverse=True)
    outf.write(seqid)
    for k in pkds:
        outf.write("\t%f" % k)
    outf.write("\n")
outf.close()
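Each line of all_pkds.txt holds a protein identifier followed by one or more tab-separated pKd values. When several values are present, they can be reduced to a single mean per identifier, as in the following sketch; the function name and the example lines are hypothetical.

```python
def mean_pkds(lines):
    """Map sequence ID -> mean of the pKd values on its line."""
    means = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # no predictions recorded for this ID
        values = [float(v) for v in fields[1:]]
        means[fields[0]] = sum(values) / len(values)
    return means


# hypothetical contents of all_pkds.txt
example = ["ANC948\t7.5\t7.1\t6.4", "ANC949\t5.0", "ANC950"]
means = mean_pkds(example)
for seqid in sorted(means):
    print("%s\t%.3f" % (seqid, means[seqid]))
```

Identifiers without any prediction are left out of the result, so they can be treated as missing data in later steps.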
If multiple affinity estimates were generated for each sequence,

the mean affinity estimate could be calculated for display across the
tree. The next step is to generate a phylogenetic tree that has both
branch lengths and ancestral node identifiers; this will be stored in
pkd_base_tree.tre.
createPKD_BaseTree.py:
import sys
# need to point to the tree with branch lengths and the labelled tree #
BL_TREE="ASRinput.tre"
LA_TREE="RAxML_nodeLabelledRootedTree.alignment.ancseqs"
handle = open(LA_TREE, "r")
nodetree = ""
for line in handle:
    nodetree += line.strip()
handle.close()
handle = open(BL_TREE, "r")
brlentree = ""
for line in handle:
    brlentree += line.strip()
handle.close()
outf = open("pkd_base_tree.tre", "w")
# parse trees
structurals = ["(",")",",",":",";"]
numericals = ["0","1","2","3","4","5","6","7","8","9",".","e","E","-"]
i1 = 0
i2 = 0
while i1 < len(nodetree):
    if nodetree[i1] in structurals:
        outf.write(nodetree[i1])
        i1 += 1
    else:
        # read the next tip or node label from the labelled tree #
        label = ""
        while i1 < len(nodetree) and nodetree[i1] not in structurals:
            label += nodetree[i1]
            i1 += 1
        # get branch length #
        while i2 < len(brlentree) and brlentree[i2] != ":":
            i2 +=1
        brlenstr = ""
        i2 += 1
        while i2 < len(brlentree) and brlentree[i2] in numericals:
            brlenstr += brlentree[i2]
            i2 += 1
        # print labelled information #
        if brlenstr == "":
            brlenstr = "0.0"
        outf.write("%s:%s" % (label,brlenstr))
outf.write("\n")
outf.close()
Finally, we will convert pKd predictions to colors on a red-blue

gradient. The researcher will have to set the lower and upper
bounds on pKd values, based on the specific system under study.
colorPKDTree.py:
import sys
pkdfname = "all_pkds.txt"
trefname = "pkd_base_tree.tre"
# read pKd values (first value on each line) #
pkd_map = {}
handle = open(pkdfname, "r")
for line in handle:
    linearr = line.split()
    if len(linearr) < 2:
        continue
    try:
        pkd = float(linearr[1])
    except ValueError:
        continue # skip a header line, if present #
    pkd_map[linearr[0]] = pkd
handle.close()
colors=["#0025e5","#1926d2","#3327c0","#4c28ae","#66299b",
        "#7f2a89","#992b77","#b22c64","#cc2d52","#e52e40","#ff302e"]
## the max and min pKd values should be chosen based on the system under study ##
new_max = 9.75
new_min = 4.75
# create break points for color gradient #
breaks = []
totalsize = new_max - new_min
bitsize = totalsize / (len(colors)-1)
breaks.append(new_min)
for i in range(1,len(colors)-1,1):
    breaks.append(new_min+(bitsize*i))
breaks.append(new_max)
def getColor(nodename):
    if nodename not in pkd_map.keys():
        return "[&!color=#d3d3d3]" ## missing data gets gray color ##
    else:
        pkd = pkd_map[nodename]
        for i in range(len(breaks)):
            if pkd < breaks[i]:
                return "[&!color=%s]" % colors[i]
        return "[&!color=%s]" % colors[-1]
handle = open(trefname, "r")
nodetree = ""
for line in handle:
    nodetree += line.strip()
handle.close()
sys.stdout.write("#nexus\nbegin trees;\n tree t1 = [&R] ")
structurals = ["(",")",",",":",";"]
numericals = ["0","1","2","3","4","5","6","7","8","9",".","e","E","-"]
i1 = 0
while i1 < len(nodetree):
    # parse branch length #
    if nodetree[i1] == ":":
        sys.stdout.write(nodetree[i1])
        i1 += 1
        brlen = ""
        while i1 < len(nodetree) and nodetree[i1] in numericals:
            brlen += nodetree[i1]
            i1 += 1
        sys.stdout.write(brlen)
    # write any tree structural information #
    elif nodetree[i1] in structurals:
        sys.stdout.write(nodetree[i1])
        i1 += 1
    # write the label followed by its color annotation #
    else:
        label = ""
        while i1 < len(nodetree) and nodetree[i1] not in structurals:
            label += nodetree[i1]
            i1 += 1
        colorlabel = getColor(label)
        sys.stdout.write("%s%s" % (label,colorlabel))
sys.stdout.write("\nend trees;\n")
The resulting tree will be in NEXUS format and can be visualized
using FigTree (Fig. 1). The observed patterns of changes in
predicted protein-ligand affinities can be used to guide the experi-
mental characterization of ancestral protein function.
Fig. 1 Visualizing predicted affinities of extant and ancestral-reconstructed double-stranded RNA-binding motif (DSRM) domains for dsRNA targets using FigTree
References
1. Dean AM, Thornton JW (2007) Mechanistic approaches to the study of evolution: the functional synthesis. Nat Rev Genet 8(9):675–688. https://doi.org/10.1038/nrg2160
2. Harms MJ, Thornton JW (2013) Evolutionary biochemistry: revealing the historical and physical causes of protein properties. Nat Rev Genet 14(8):559–571. https://doi.org/10.1038/nrg3540
3. Cole MF, Gaucher EA (2011) Exploiting models of molecular evolution to efficiently direct protein engineering. J Mol Evol 72(2):193–203. https://doi.org/10.1007/s00239-010-9415-2
4. Ogawa T, Shirai T (2014) Tracing ancestral specificity of lectins: ancestral sequence reconstruction method as a new approach in protein engineering. Methods Mol Biol 1200:539–551. https://doi.org/10.1007/978-1-4939-1292-6_44
5. Yang Z, Kumar S, Nei M (1995) A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141(4):1641–1650
6. Shih P, Malcolm BA, Rosenberg S, Kirsch JF, Wilson AC (1993) Reconstruction and testing of ancestral proteins. Methods Enzymol 224:576–590
7. Zmasek CM, Godzik A (2011) Strong functional patterns in the evolution of eukaryotic genomes revealed by the reconstruction of ancestral protein domain repertoires. Genome Biol 12(1):R4. https://doi.org/10.1186/gb-2011-12-1-r4
8. Whitfield JH, Zhang WH, Herde MK, Clifton BE, Radziejewski J, Janovjak H, Henneberger C, Jackson CJ (2015) Construction of a robust and sensitive arginine biosensor through ancestral protein reconstruction. Protein Sci 24(9):1412–1422. https://doi.org/10.1002/pro.2721
9. Malcolm BA, Wilson KP, Matthews BW, Kirsch JF, Wilson AC (1990) Ancestral lysozymes reconstructed, neutrality tested, and thermostability linked to hydrocarbon packing. Nature 345(6270):86–89. https://doi.org/10.1038/345086a0
10. Clifton BE, Jackson CJ (2016) Ancestral protein reconstruction yields insights into adaptive evolution of binding specificity in solute-binding proteins. Cell Chem Biol 23(2):236–245. https://doi.org/10.1016/j.chembiol.2015.12.010
11. Bridgham JT, Carroll SM, Thornton JW (2006) Evolution of hormone-receptor complexity by molecular exploitation. Science 312(5770):97–101. https://doi.org/10.1126/science.1123348
12. Bridgham JT, Ortlund EA, Thornton JW (2009) An epistatic ratchet constrains the direction of glucocorticoid receptor evolution. Nature 461(7263):515–519. https://doi.org/10.1038/nature08249
13. Voordeckers K, Brown CA, Vanneste K, van der Zande E, Voet A, Maere S, Verstrepen KJ (2012) Reconstruction of ancestral metabolic enzymes reveals molecular mechanisms underlying evolutionary innovation through gene duplication. PLoS Biol 10(12):e1001446. https://doi.org/10.1371/journal.pbio.1001446
14. Ugalde JA, Chang BS, Matz MV (2004) Evolution of coral pigments recreated. Science 305(5689):1433. https://doi.org/10.1126/science.1099597
15. van Hazel I, Sabouhanian A, Day L, Endler JA, Chang BS (2013) Functional characterization of spectral tuning mechanisms in the great bowerbird short-wavelength sensitive visual pigment (SWS1), and the origins of UV/violet vision in passerines and parrots. BMC Evol Biol 13:250. https://doi.org/10.1186/1471-2148-13-250
16. Hall BG (2006) Simple and accurate estimation of ancestral protein sequences. Proc Natl Acad Sci U S A 103(14):5431–5436. https://doi.org/10.1073/pnas.0508991103
17. Ashkenazy H, Penn O, Doron-Faigenboim A, Cohen O, Cannarozzi G, Zomer O, Pupko T (2012) FastML: a web server for probabilistic reconstruction of ancestral sequences. Nucleic Acids Res 40(Web Server issue):W580–W584. https://doi.org/10.1093/nar/gks498
18. Redelings BD, Suchard MA (2005) Joint Bayesian estimation of alignment and phylogeny. Syst Biol 54(3):401–418. https://doi.org/10.1080/10635150590947041
19. Suchard MA, Redelings BD (2006) BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 22(16):2047–2048. https://doi.org/10.1093/bioinformatics/btl175
20. Anderson DP, Whitney DS, Hanson-Smith V, Woznica A, Campodonico-Burnett W, Volkman BF, King N, Thornton JW, Prehoda KE (2016) Evolution of an ancient protein
function involved in organized multicellular- (11):2058–2071. https://doi.org/10.1093/

ity in animals. Elife 5:e10147. https://doi. molbev/msl091
org/10.7554/eLife.10147 32. Blanquart S, Lartillot N (2008) A site- and
21. Thornton JW (2004) Resurrecting ancient time-heterogeneous model of amino acid
genes: experimental analysis of extinct mole- replacement. Mol Biol Evol 25(5):842–858.
cules. Nat Rev Genet 5(5):366–375. https:// https://doi.org/10.1093/molbev/msn018
doi.org/10.1038/nrg1324 33. Risso VA, Gavira JA, Mejia-Carmona DF,
22. Chang BS, Jonsson K, Kazmi MA, Donoghue Gaucher EA, Sanchez-Ruiz JM (2013)
MJ, Sakmar TP (2002) Recreating a func- Hyperstability and substrate promiscuity in
tional ancestral archosaur visual pigment. laboratory resurrections of Precambrian
Mol Biol Evol 19(9):1483–1489 beta-lactamases. J Am Chem Soc 135
23. Williams PD, Pollock DD, Blackburne BP, (8):2899–2902. https://doi.org/10.1021/
Goldstein RA (2006) Assessing the accuracy ja311630a
of ancestral protein reconstruction methods. 34. Korithoski B, Kolaczkowski O, Mukherjee K,
PLoS Comput Biol 2(6):e69. https://doi. Kola R, Earl C, Kolaczkowski B (2015) Evo-
org/10.1371/journal.pcbi.0020069 lution of a novel antiviral immune-signaling
24. Matsumoto T, Akashi H, Yang Z (2015) Eval- interaction by partial-gene duplication. PLoS
uation of ancestral sequence reconstruction One 10(9):e0137276. https://doi.org/10.
methods to infer nonstationary patterns of 1371/journal.pone.0137276
nucleotide substitution. Genetics 200 35. Pugh C, Kolaczkowski O, Manny A,
(3):873–890. https://doi.org/10.1534/ Korithoski B, Kolaczkowski B (2016) Resur-
genetics.115.177386 recting ancestral structural dynamics of an
25. Susko E, Roger AJ (2013) Problems with antiviral immune receptor: adaptive binding
estimation of ancestral frequencies under sta- pocket reorganization repeatedly shifts RNA
tionary models. Syst Biol 62(2):330–338. preference. BMC Evol Biol 16(1):241.
https://doi.org/10.1093/sysbio/sys075 https://doi.org/10.1186/s12862-016-
26. Pollock DD, Chang BS (2007) Dealing with 0818-6
uncertainty in ancestral sequence reconstruc- 36. Finnigan GC, Hanson-Smith V, Stevens TH,
tion: sampling from the posterior distribu- Thornton JW (2012) Evolution of increased
tion. In: Liberles DA (ed) Ancestral complexity in a molecular machine. Nature
sequence reconstruction. Oxford University 481(7381):360–364. https://doi.org/10.
Press, Oxford 1038/nature10724
27. Dias R, Manny A, Kolaczkowski O, Kolacz- 37. Kratzer JT, Lanaspa MA, Murphy MN,
kowski B (2017) Convergence of domain Cicerchi C, Graves CL, Tipton PA, Ortlund
architecture, structure, and ligand affinity in EA, Johnson RJ, Gaucher EA (2014) Evolu-
animal and plant RNA-binding proteins. Mol tionary history and metabolic insights of
Biol Evol 34(6):1429–1444. https://doi. ancient mammalian uricases. Proc Natl Acad
org/10.1093/molbev/msx090 Sci U S A 111(10):3763–3768. https://doi.
28. Randall RN, Radford CE, Roof KA, Natarajan org/10.1073/pnas.1320393111
DK, Gaucher EA (2016) An experimental 38. Ortlund EA, Bridgham JT, Redinbo MR,
phylogeny to benchmark ancestral sequence Thornton JW (2007) Crystal structure of an
reconstruction. Nat Commun 7:12847. ancient protein: evolution by conformational
https://doi.org/10.1038/ncomms12847 epistasis. Science 317(5844):1544–1548.
29. Hanson-Smith V, Kolaczkowski B, Thornton https://doi.org/10.1126/science.1142819
JW (2010) Robustness of ancestral sequence 39. Marchler-Bauer A, Derbyshire MK, Gonzales
reconstruction to phylogenetic uncertainty. NR, Lu S, Chitsaz F, Geer LY, Geer RC, He J,
Mol Biol Evol 27(9):1988–1999. https:// Gwadz M, Hurwitz DI, Lanczycki CJ, Lu F,
doi.org/10.1093/molbev/msq081 Marchler GH, Song JS, Thanki N, Wang Z,
30. Kolaczkowski B, Thornton JW (2004) Perfor- Yamashita RA, Zhang D, Zheng C, Bryant
mance of maximum parsimony and likelihood SH (2015) CDD: NCBI’s conserved domain
phylogenetics when evolution is heteroge- database. Nucleic Acids Res 43(Database
neous. Nature 431(7011):980–984. https:// issue):D222–D226. https://doi.org/10.
doi.org/10.1038/nature02917 1093/nar/gku1221
31. Blanquart S, Lartillot N (2006) A Bayesian 40. Finn RD, Bateman A, Clements J, Coggill P,
compound stochastic process for modeling Eberhardt RY, Eddy SR, Heger A,
nonstationary and nonhomogeneous Hetherington K, Holm L, Mistry J, Sonn-
sequence evolution. Mol Biol Evol 23 hammer EL, Tate J, Punta M (2014) Pfam:
the protein families database. Nucleic Acids
Res 42(Database issue):D222–D230. Bioinform 8(4):1108–1119. https://doi.

https://doi.org/10.1093/nar/gkt1223 org/10.1109/TCBB.2009.68
41. Yue F, Shi J, Tang J (2009) Simultaneous 51. Larkin MA, Blackshields G, Brown NP,
phylogeny reconstruction and multiple Chenna R, McGettigan PA, McWilliam H,
sequence alignment. BMC Bioinformatics 10 Valentin F, Wallace IM, Wilm A, Lopez R,
(Suppl 1):S11. https://doi.org/10.1186/ Thompson JD, Gibson TJ, Higgins DG
1471-2105-10-S1-S11 (2007) Clustal W and Clustal X version 2.0.
42. Fleissner R, Metzler D, von Haeseler A Bioinformatics 23(21):2947–2948. https://
(2005) Simultaneous statistical multiple doi.org/10.1093/bioinformatics/btm404
alignment and phylogeny reconstruction. 52. Sievers F, Wilm A, Dineen D, Gibson TJ,
Syst Biol 54(4):548–561. https://doi.org/ Karplus K, Li W, Lopez R, McWilliam H,
10.1080/10635150590950371 Remmert M, Soding J, Thompson JD, Hig-
43. Herman JL, Challis CJ, Novak A, Hein J, gins DG (2011) Fast, scalable generation of
Schmidler SC (2014) Simultaneous Bayesian high-quality protein multiple sequence align-
estimation of alignment and phylogeny under ments using Clustal Omega. Mol Syst Biol
a joint model of protein sequence and struc- 7:539. https://doi.org/10.1038/msb.2011.
ture. Mol Biol Evol 31(9):2251–2266. 75
https://doi.org/10.1093/molbev/msu184 53. Edgar RC (2004) MUSCLE: multiple
44. Liu K, Warnow TJ, Holder MT, Nelesen SM, sequence alignment with high accuracy and
Yu J, Stamatakis AP, Linder CR (2012) SATe- high throughput. Nucleic Acids Res 32
II: very fast and accurate simultaneous estima- (5):1792–1797. https://doi.org/10.1093/
tion of multiple sequence alignments and phy- nar/gkh340
logenetic trees. Syst Biol 61(1):90–106. 54. Katoh K, Standley DM (2013) MAFFT mul-
https://doi.org/10.1093/sysbio/syr095 tiple sequence alignment software version 7:
45. Nuin PA, Wang Z, Tillier ER (2006) The improvements in performance and usability.
accuracy of several multiple sequence align- Mol Biol Evol 30(4):772–780. https://doi.
ment programs for proteins. BMC Bioinfor- org/10.1093/molbev/mst010
matics 7:471. https://doi.org/10.1186/ 55. Liu Y, Schmidt B, Maskell DL (2010) MSA-
1471-2105-7-471 Probs: multiple sequence alignment based on
46. Pervez MT, Babar ME, Nadeem A, Aslam M, pair hidden Markov models and partition
Awan AR, Aslam N, Hussain T, Naveed N, function posterior probabilities. Bioinformat-
Qadri S, Waheed U, Shoaib M (2014) Evalu- ics 26(16):1958–1964. https://doi.org/10.
ating the accuracy and efficiency of multiple 1093/bioinformatics/btq338
sequence alignment methods. Evol Bioinfor- 56. Roshan U, Livesay DR (2006) Probalign:
matics Online 10:205–217. https://doi.org/ multiple sequence alignment using partition
10.4137/EBO.S19199 function posterior probabilities. Bioinformat-
47. Thompson JD, Linard B, Lecompte O, Poch ics 22(22):2715–2721. https://doi.org/10.
O (2011) A comprehensive benchmark study 1093/bioinformatics/btl472
of multiple sequence alignment methods: cur- 57. Do CB, Mahabhashyam MS, Brudno M, Bat-
rent challenges and future perspectives. PLoS zoglou S (2005) ProbCons: probabilistic
One 6(3):e18093. https://doi.org/10. consistency-based multiple sequence align-
1371/journal.pone.0018093 ment. Genome Res 15(2):330–340. https://
48. Ogden TH, Rosenberg MS (2006) Multiple doi.org/10.1101/gr.2821705
sequence alignment accuracy and phyloge- 58. Notredame C, Higgins DG, Heringa J (2000)
netic inference. Syst Biol 55(2):314–328. T-Coffee: a novel method for fast and accu-
https://doi.org/10.1080/ rate multiple sequence alignment. J Mol Biol
10635150500541730 302(1):205–217. https://doi.org/10.1006/
49. Simmons MP, Muller KF, Webb CT (2011) jmbi.2000.4042
The deterministic effects of alignment bias in 59. Talavera G, Castresana J (2007) Improvement
phylogenetic inference. Cladistics 27 of phylogenies after removing divergent and
(4):402–416 ambiguously aligned blocks from protein
50. Wang LS, Leebens-Mack J, Kerr Wall P, sequence alignments. Syst Biol 56
Beckmann K, dePamphilis CW, Warnow T (4):564–577. https://doi.org/10.1080/
(2011) The impact of multiple protein 10635150701472164
sequence alignment on phylogenetic estima- 60. Gouveia-Oliveira R, Sackett PW, Pedersen AG
tion. IEEE/ACM Trans Comput Biol (2007) MaxAlign: maximizing usable data in
an alignment. BMC Bioinformatics 8:312.
https://doi.org/10.1186/1471-2105-8- 72. Ripplinger J, Sullivan J (2010) Assessment of

312 substitution model adequacy using frequen-
61. Capella-Gutierrez S, Silla-Martinez JM, tist and Bayesian methods. Mol Biol Evol 27
Gabaldon T (2009) trimAl: a tool for auto- (12):2790–2803. https://doi.org/10.1093/
mated alignment trimming in large-scale phy- molbev/msq168
logenetic analyses. Bioinformatics 25 73. Darriba D, Taboada GL, Doallo R, Posada D
(15):1972–1973. https://doi.org/10.1093/ (2011) ProtTest 3: fast selection of best-fit
bioinformatics/btp348 models of protein evolution. Bioinformatics
62. Wu M, Chatterji S, Eisen JA (2012) Account- 27(8):1164–1165. https://doi.org/10.
ing for alignment uncertainty in phyloge- 1093/bioinformatics/btr088
nomics. PLoS One 7(1):e30288. https:// 74. Le SQ, Gascuel O (2008) An improved gen-
doi.org/10.1371/journal.pone.0030288 eral amino acid replacement matrix. Mol Biol
63. Castresana J (2000) Selection of conserved Evol 25(7):1307–1320. https://doi.org/10.
blocks from multiple alignments for their use 1093/molbev/msn067
in phylogenetic analysis. Mol Biol Evol 17 75. Anisimova M, Gascuel O (2006) Approxi-
(4):540–552 mate likelihood-ratio test for branches: a fast,
64. Wheeler WC, Gatesy J, DeSalle R (1995) Eli- accurate, and powerful alternative. Syst Biol
sion: a method for accommodating multiple 55(4):539–552. https://doi.org/10.1080/
molecular sequence alignments with 10635150600755453
alignment-ambiguous sites. Mol Phylogenet 76. Anisimova M, Gil M, Dufayard JF,
Evol 4(1):1–9. https://doi.org/10.1006/ Dessimoz C, Gascuel O (2011) Survey of
mpev.1995.1001 branch support methods demonstrates accu-
65. de Queiroz A, Gatesy J (2007) The superma- racy, power, and robustness of fast likelihood-
trix approach to systematics. Trends Ecol Evol based approximation schemes. Syst Biol 60
22(1):34–41. https://doi.org/10.1016/j. (5):685–699. https://doi.org/10.1093/sys
tree.2006.10.002 bio/syr041
66. Mar JC, Harlow TJ, Ragan MA (2005) Bayes- 77. Hill J, Davis KE (2014) The Supertree
ian and maximum likelihood phylogenetic Toolkit 2: a new and improved software pack-
analyses of protein sequence data under rela- age with a Graphical User Interface for super-
tive branch-length differences and model vio- tree construction. Biodivers Data J 2:e1053.
lation. BMC Evol Biol 5:8. https://doi.org/ https://doi.org/10.3897/BDJ.2.e1053
10.1186/1471-2148-5-8 78. Pagel M, Meade A, Barker D (2004) Bayesian
67. Kolaczkowski B, Thornton JW (2009) Long- estimation of ancestral character states on
branch attraction bias and inconsistency in phylogenies. Syst Biol 53(5):673–684.
Bayesian phylogenetics. PLoS One 4(12): https://doi.org/10.1080/
e7891. https://doi.org/10.1371/journal. 10635150490522232
pone.0007891 79. Eswar N, Eramian D, Webb B, Shen MY, Sali
68. Price MN, Dehal PS, Arkin AP (2010) Fas- A (2008) Protein structure modeling with
tTree 2--approximately maximum-likelihood MODELLER. Methods Mol Biol
trees for large alignments. PLoS One 5(3): 426:145–159. https://doi.org/10.1007/
e9490. https://doi.org/10.1371/journal. 978-1-60327-058-8_8
pone.0009490 80. Madhusudhan MS, Webb BM, Marti-Renom
69. Liu K, Linder CR, Warnow T (2011) RAxML MA, Eswar N, Sali A (2009) Alignment of
and FastTree: comparing two methods for multiple protein structures based on sequence
large-scale maximum likelihood phylogeny and structure features. Protein Eng Des Sel 22
estimation. PLoS One 6(11):e27731. (9):569–574. https://doi.org/10.1093/pro
https://doi.org/10.1371/journal.pone. tein/gzp040
0027731 81. Kalaimathy S, Sowdhamini R, Kanagaraja-
70. Stamatakis A (2014) RAxML version 8: a tool durai K (2011) Critical assessment of
for phylogenetic analysis and post-analysis of structure-based sequence alignment methods
large phylogenies. Bioinformatics 30 at distant relationships. Brief Bioinform 12
bioinformatics/btu033 bib/bbq025
71. Ripplinger J, Sullivan J (2008) Does choice in 82. Kim C, Lee B (2007) Accuracy of structure-
model selection affect maximum likelihood based sequence alignment of automatic meth-
analysis? Syst Biol 57(1):76–85. https://doi. ods. BMC Bioinformatics 8:355. https://doi.
org/10.1080/10635150801898920 org/10.1186/1471-2105-8-355
83. Ashtawy HM, Mahapatra NR (2012) A com- PDB2PQR: expanding and upgrading auto-
parative assessment of ranking accuracies of mated preparation of biomolecular structures
conventional and machine-learning-based for molecular simulations. Nucleic Acids Res
scoring functions for protein-ligand binding 35(Web Server issue):W522–W525. https://
affinity prediction. IEEE/ACM Trans Com- doi.org/10.1093/nar/gkm276
put Biol Bioinform 9(5):1301–1313. https:// 93. Pronk S, Pall S, Schulz R, Larsson P,
doi.org/10.1109/TCBB.2012.36 Bjelkmar P, Apostolov R, Shirts MR, Smith
84. Ashtawy HM, Mahapatra NR (2015) JC, Kasson PM, van der Spoel D, Hess B,
BgN-Score and BsN-Score: bagging and Lindahl E (2013) GROMACS 4.5: a high-
boosting based ensemble neural networks throughput and highly parallel open source
scoring functions for accurate binding affinity molecular simulation toolkit. Bioinformatics
prediction of protein-ligand complexes. BMC 29(7):845–854. https://doi.org/10.1093/
Bioinformatics 16(Suppl 4):S8. https://doi. bioinformatics/btt055
org/10.1186/1471-2105-16-S4-S8 94. Dias R, Timmers LF, Caceres RA, de Azevedo
85. Brylinski M (2013) Nonlinear scoring func- WF Jr (2008) Evaluation of molecular dock-
tions for similarity-based ligand docking and ing using polynomial empirical scoring func-
binding affinity prediction. J Chem Inf Model tions. Curr Drug Targets 9(12):1062–1070
53(11):3097–3112. https://doi.org/10. 95. De Paris R, Quevedo CV, Ruiz DD, Norberto
1021/ci400510e de Souza O, Barros RC (2015) Clustering
86. Rose PW, Bi C, Bluhm WF, Christie CH, molecular dynamics trajectories for optimiz-
Dimitropoulos D, Dutta S, Green RK, Good- ing docking experiments. Comput Intell Neu-
sell DS, Prlic A, Quesada M, Quinn GB, rosci 2015:916240. https://doi.org/10.
Ramos AG, Westbrook JD, Young J, 1155/2015/916240
Zardecki C, Berman HM, Bourne PE (2013) 96. Seo MH, Park J, Kim E, Hohng S, Kim HS
The RCSB Protein Data Bank: new resources (2014) Protein conformational dynamics dic-
for research and education. Nucleic Acids Res tate the binding affinity for a ligand. Nat
41(Database issue):D475–D482. https:// Commun 5:3724. https://doi.org/10.
doi.org/10.1093/nar/gks1200 1038/ncomms4724
87. Comeau SR, Gatchell DW, Vajda S, Camacho 97. Kruger DM, Ignacio Garzon J, Chacon P,
CJ (2004) ClusPro: an automated docking Gohlke H (2014) DrugScorePPI
and discrimination method for the prediction knowledge-based potentials used as scoring
of protein complexes. Bioinformatics 20 and objective function in protein-protein
(1):45–50 docking. PLoS One 9(2):e89466. https://
88. Kastritis PL, Bonvin AM (2010) Are scoring doi.org/10.1371/journal.pone.0089466
functions in protein-protein docking ready to 98. Camacho CJ, Zhang C (2005) FastContact:
predict interactomes? Clues from a novel rapid estimate of contact and binding free
binding affinity benchmark. J Proteome Res energies. Bioinformatics 21(10):2534–2536.
9(5):2216–2225. https://doi.org/10.1021/ https://doi.org/10.1093/bioinformatics/
pr9009854 bti322
89. Kozakov D, Beglov D, Bohnuud T, Mottar- 99. Dias R, Kolaczkowski B (2017) Improving
ella SE, Xia B, Hall DR, Vajda S (2013) How the accuracy of high-throughput protein-pro-
good is automated protein docking? Proteins tein affinity prediction may require better
81(12):2159–2166. https://doi.org/10. training data. BMC Bioinformatics 18(Suppl
1002/prot.24403 5):102. https://doi.org/10.1186/s12859-
90. Lensink MF, Wodak SJ (2013) Docking, scor- 017-1533-z
ing, and affinity prediction in CAPRI. Pro- 100. Dias R, Kolazckowski B (2015) Different
teins 81(12):2082–2095. https://doi.org/ combinations of atomic interactions predict
10.1002/prot.24428 protein-small molecule and protein-DNA/
91. Roberts VA, Thompson EE, Pique ME, Perez RNA affinities with similar accuracy. Proteins
MS, Ten Eyck LF (2013) DOT2: macromo- 83(11):2100–2114. https://doi.org/10.
lecular docking with improved biophysical 1002/prot.24928
models. J Comput Chem 34 101. O’Boyle NM, Banck M, James CA, Morley C,
(20):1743–1758. https://doi.org/10.1002/ Vandermeersch T, Hutchison GR (2011)
jcc.23304 Open Babel: an open chemical toolbox. J
92. Dolinsky TJ, Czodrowski P, Li H, Nielsen JE, Cheminform 3:33. https://doi.org/10.
Jensen JH, Klebe G, Baker NA (2007) 1186/1758-2946-3-33
Chapter 9
Ancestral Sequence Reconstruction as a Tool for the Elucidation of a Stepwise Evolutionary Adaptation
Kristina Straub and Rainer Merkl
Abstract
Ancestral sequence reconstruction (ASR) is a powerful tool to infer primordial sequences from contempo-
rary, i.e., extant ones. An essential element of ASR is the computation of a phylogenetic tree whose leaves
are the chosen extant sequences. Most often, the reconstructed sequence related to the root of this tree is of
greatest interest: It represents the common ancestor (CA) of the sequences under study. If this sequence
encodes a protein, one can “resurrect” the CA by means of gene synthesis technology and study biochemi-
cal properties of this extinct predecessor with the help of wet-lab experiments.
However, ASR also deduces sequences for all internal nodes of the tree, and the well-considered analysis
of these “intermediates” can help to elucidate evolutionary processes. Moreover, one can identify key
mutations that alter proteins or protein complexes and are responsible for the differing properties of extant
proteins. As an illustrative example, we describe the protocol for the rapid identification of hotspots
determining the binding of the two subunits within the heteromeric complex imidazole glycerol phosphate
synthase.
Key words Ancestral sequence reconstruction, Vertical analysis, Evolutionary biochemistry, In silico
mutagenesis, Protein–protein interaction
1 Introduction
A major goal of life scientists is to understand the function of
proteins at the residue level, and computational biology often
contributes substantially to the identification of functionally or
structurally important residues; for a review see [1]. For example, if the 3D
structure of a protein is known, one can assess the contribution of
individual residues to protein stability [2]; additionally, one can
predict catalytic sites [3] and protein interfaces [4] by analyzing
cavities or surface residues. Moreover, the comparison of results
deduced for homologous proteins allows one to elucidate the
evolution of specific protein functions [5]. Similarly, protein
sequences can be utilized; however, the predictive power of
corresponding algorithms depends on the number of sequences
that are at hand. In the post-genomic era, computational protein
biology benefits from the enormous number of known orthologs,
i.e., sequences from different species that have the same ancestor
and encode identical or similar functions. In order to identify
residue positions that are crucial for a specific family, it is a common
approach to generate a multiple sequence alignment (MSA), which
is subsequently utilized to determine for each position in the pro-
tein the conservation level of each residue [6].
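The per-position conservation analysis described above can be sketched as follows. This is a minimal illustration, not the method of [6]: the function name `column_conservation`, the toy alignment, and the choice to ignore gap characters when computing frequencies are all assumptions made for the example.

```python
from collections import Counter


def column_conservation(msa):
    """For each alignment column, return the most frequent residue and
    its relative frequency among the non-gap characters (assumption:
    gaps are written as '-' and are excluded from the frequency)."""
    if not msa:
        return []
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for col in range(length):
        residues = [seq[col] for seq in msa if seq[col] != "-"]
        if not residues:
            # Column consists of gaps only
            result.append(("-", 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        result.append((residue, count / len(residues)))
    return result


# Hypothetical toy alignment of four aligned sequences
toy_msa = ["MKVL", "MKIL", "MRVL", "M-VL"]
for pos, (res, freq) in enumerate(column_conservation(toy_msa), start=1):
    print(pos, res, round(freq, 2))
```

Columns with a frequency close to 1 would be candidates for residues that are important for all members of the family; a real analysis would typically use a dedicated conservation score and a much larger MSA.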
This and similar approaches are often named “horizontal”,
because they are based on the analysis of a certain phase of evolu-
tion represented by the proteins found in extant species. Due to the
enormous number of known sequences, these residue distributions
can be determined quite precisely and the horizontal approach
allows the identification of residues that are important for all mem-
bers of a family. However, this method rarely identifies sets of
residues that determine specificity in a family of functionally diverse
proteins [7]. Thus, to study protein evolution, a more detailed
analysis is needed, for example, based on a clustering of sequences
by means of neighbor joining [8]. A state-of-the-art method for the
study of divergent evolution even in very large protein families is
the usage of sequence similarity networks and genome neighbor-
hood networks; for a recent review see [9]. Such cluster algorithms
are based on a simplified model of protein evolution; due to their
computational complexity, models that are more elaborate are not
applicable for the analysis of large datasets.
Although only applicable to a relatively small number of
sequences, the implementation of highly reliable phylogenetic
algorithms has added a further dimension to sequence analysis: It makes
it possible to trace back the evolution of a fair number of extant
orthologs to common ancestors. If functional diversity is known
for some of the extant orthologs, this “vertical” approach has great
potential, because one can reconstruct the sequences of putative
predecessors and identify those mutations that occurred along that
branch of the family tree on which functional diversification
occurred [7].
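The core of the vertical approach, listing the mutations that separate two nodes on a branch, can be sketched as a simple comparison of two aligned sequences. The function name `branch_substitutions`, the toy sequences, and the decision to skip positions involving gaps are assumptions for illustration; positions refer to alignment columns, not to residue numbers of any particular protein.

```python
def branch_substitutions(parent, child):
    """List substitutions between two aligned sequences (e.g. an
    ancestral node and its descendant) in the form 'K2R', where the
    number is the 1-based alignment column. Columns in which either
    sequence has a gap ('-') are skipped (assumption)."""
    if len(parent) != len(child):
        raise ValueError("sequences must be aligned to the same length")
    substitutions = []
    for pos, (a, b) in enumerate(zip(parent, child), start=1):
        if a != b and a != "-" and b != "-":
            substitutions.append(f"{a}{pos}{b}")
    return substitutions


# Hypothetical aligned sequences of a predecessor and its successor
ancestor = "MKVLAG"
descendant = "MRVLSG"
print(branch_substitutions(ancestor, descendant))
```

Applied to consecutive nodes along the branch on which functional diversification occurred, such a list narrows the set of candidate residues to those that actually changed on that branch.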
The vertical approach is a specific application of ancestral
sequence reconstruction (ASR), which has become popular during the
last decade, especially in combination with “resurrection” experi-
ments; for recent reviews see [10–13]. The typical protocol of each
ASR consists of two steps: First, the user has to compute a phylo-
genetic tree trphylo. In all cases, the extant orthologs chosen by the
user constitute the leaves, but the topology of trphylo is determined
by sequence similarity, the selected evolutionary model, and the
algorithm used for its computation. In contrast to a classical phylo-
genetic analysis, ASR requires a subsequent step that deduces for all
internal nodes of trphylo sequences that represent predecessors. The
composition of these sequences critically depends on the content of
the leaves (extant orthologs) but also on the topology of trphylo.
This is why trphylo has to fulfill certain quality criteria to guarantee
proper sequence reconstruction. Nowadays, it is straightforward to
supplement such an in silico reconstruction with wet-lab experi-
ments: One can recombinantly resurrect proteins with the help of
gene synthesis and characterize them with classical biochemical and
biophysical methods [11]. Besides their relevance for answering
evolutionary questions, resurrected proteins have become increasingly
important in protein engineering, because one can beneficially
exploit their promiscuity [14] to tailor protein function [15].
In addition, the fact that ancestral proteins are frequently “gen-
eralists” motivates their usage in vertical approaches. In the follow-
ing, we detail a protocol for the identification of specificity-
determining residues. The general strategy is to select a protein
family of interest and a property to be evaluated. Then, one has to
infer a phylogenetic tree and choose the branches of the family tree
to be analyzed. The selection of branches may depend on in silico or
wet-lab experiments aimed at finding branch-determining leaves,
i.e., extant proteins with differing functions. The final task is to
reconstruct the sequences of predecessors with the help of ASR
(Subheading 2.1) and to identify specificity-determining residues
by comparing the ancestral sequences within the
chosen branches (Subheading 2.2). Again, the assessment of these
residues may comprise in silico and/or wet-lab analyses.
We used this strategy to study the stepwise adaptation of the
protein–protein interface (PPI) from the heterodimeric imidazole
glycerol phosphate synthase (ImGPS). This enzyme mediates the
incorporation of nitrogen into PRFAR by catalyzing the transfer of
the amido nitrogen of glutamine to an acceptor substrate
[16, 17]. In bacteria and archaea, ImGPS consists of the cyclase
subunit HisF and the glutaminase subunit HisH, which assemble
with high affinity to a bi-enzyme complex [18]. Despite detailed
biochemical and structural studies [19], the specific residue posi-
tions responsible for HisF:HisH complex formation were
unknown. This is why we identified key residue positions of this
PPI by means of a vertical approach [20, 21], which is illustrated in
Fig. 1.
2 Method
Fig. 1 Identification of specificity-determining residue positions of the HisF:HisH interface by means of a
vertical approach. Initial binding studies had shown that subunits from phylogenetically unrelated species are
not compatible: The HisF subunit from the Crenarchaeon Pyrobaculum arsenaticum (paHisF) did not bind HisH
from the Proteobacterium Zymomonas mobilis (zmHisH). For the rapid identification of crucial residue
positions within the HisF interface, 87 HisF sequences from seven phyla were chosen for a vertical analysis.
Thus, we deduced ancestral sequences linking the native interaction partner of zmHisH, namely zmHisF (the
leaf of the gray branch) and the distant paHisF (the leaf of the brown branch). Ancestral proteins were
resurrected, and their binding to zmHisH was characterized experimentally. HisF corresponding to the last
universal common ancestor (LUCA-HisF) bound zmHisH. In contrast, the first intermediate (Anc1pa-HisF) on
the branch leading to paHisF that differed markedly from LUCA-HisF did not bind zmHisH. Anc1pa-HisF
deviates from LUCA-HisF by not more than 29 residues, but from paHisF by 74 residues. A subsequent in silico
analysis focusing on the PPI of HisF allowed us to narrow down the number of putative key residue positions to
two. Their role was assessed by experimental binding studies; one was identified as an interface hotspot. To
trace the species-specific evolution of PPIs in more detail, the two predecessors (Anc1tm-HisF and Anc2tm-
HisF) on the path (shown in blue) leading to HisF from Thermotoga maritima (tmHisF) were resurrected as well.
Both intermediates bound zmHisH, but tmHisF was a poor binder. The mutual exchange of residues from the
latter three sequences at corresponding positions confirmed their hotspot quality; for details see [21]. Note
that these residues are located at the rim of the PPI and only moderately conserved, which explains why they
have not been discovered previously. To avoid overloading the graph, only a few of the extant sequences are
shown with their Key2Ann [22] annotation indicating the phylogenetic lineage, i.e., the superkingdom (first
character), the phylum (following three characters), and the species name (last three characters)
2.1 Ancestral Sequence Reconstruction
1. Collect a large number of orthologs. Start with a specific
sequence of interest and use BLAST [23] to deduce orthologs
from the nr or refseq_protein databases of the NCBI [24] or the
EBI database UniProt [25]; alternatively select the
corresponding InterPro family [26] (see Note 1). Choose a
bona fide protein as a reference sequence and, if possible, several
sequences that can serve as an outgroup. Additionally, include
the sequences of those proteins (proti) that possess differing
properties, whose determinants shall be elucidated by the
subsequent analysis.
2. Create an MSA. In our experience, MAFFT [27] is a
highly versatile and robust method that can cope with large
sequence sets (see Note 2).
3. Eliminate redundant sequences and obvious outliers like those
that are much shorter or longer than the reference sequence.
Additionally, eliminate sequences that induce conspicuously
large indels in the MSA (see Note 3). A versatile tool support-
ing these tasks is Jalview [28] (see Note 4).
4. Repeat steps 2 and 3 until the MSA consists of a homogeneous
set of sequences.
5. If the protein under study is part of a larger complex, perform
MSA generation for each subunit. Afterward, concatenate the
sequences in a species-specific manner (see Note 5), and create
an MSA consisting of the concatenated sequences.
6. Optionally, replace the database identifiers with more informa-
tive names for the sequences (see Note 6).
7. Remove less informative residue positions from the MSA.
Apply Gblocks [29] to eliminate all columns containing more
than 50% gaps. Use the resulting MSA for the inference of the
phylogenetic tree, but not for the subsequent sequence recon-
struction, which is based on the full MSA.
8. Compute a phylogenetic tree trphylo with a method of choice.
We prefer PhyloBayes [30] and start eight independent MCMC
samplings in parallel with a maximal length of 50,000 samples
to guarantee congruence (see Note 7). If congruence is
reached, we deduce the consensus tree computed by readpb
from the samples following the burn-in phase of the MCMC
computation. The number of samples that have to be excluded
(burn-in) can be determined with VMCMC [31]; often, the
first 25% of the samples are considered as burn-in and dis-
carded. Alternatively, use other state-of-the-art probabilistic
methods like MrBayes [32] or BEAST [33] to compute the
phylogenetic tree (see Note 8). For a given MSA of amino acid
sequences, one can utilize ProtTest [34] to determine the best
fitting evolutionary model prior to MCMC sampling.
9. Visualize trphylo by means of NJplot [35] or FigTree [36] and
assess the length of the individual edges and their posterior
probabilities. All edge lengths must indicate mutation rates <<
1 mutation per site and the posterior probabilities of relevant
internal nodes must exceed the value of 0.75. Furthermore,
make sure that the resulting phylogenetic hierarchy of the
chosen sequences (species) is plausible: For example, compare
the topology of trphylo with the relationships of the sequences
(species), determined for the iTOL project [37] or the “nearly
universal tree” of life [38]. This comparison allows one to
eliminate cases of horizontal gene transfer and to avoid
long-branch attraction. If the tree topology is not plausible, consider
choosing a different set of sequences and repeating the procedure
(see Note 9).
10. If the sequence set does not contain an outgroup, use NJplot
[35] or an alternative algorithm to root trphylo for subsequent
sequence reconstruction. Positioning the root is critical for the
computation of the CA sequence. Choose the location of the
root according to a plausible hierarchy to be determined by one
of the methods described in step 9. If an outgroup was used for
rooting, we recommend eliminating the corresponding
sequences during sequence reconstruction to prevent unde-
sired effects on residue composition.
11. Use the rooted tree prepared in the last step and the full MSA
to reconstruct the ancestral sequences related to internal
nodes. Methods of choice are PAML [39] or FastML [40],
which can handle indels (see Note 10). If possible, choose the
same substitution model as used for tree construction. ASR
programs compute for each residue position posterior prob-
abilities for all 20 amino acids. If alternative predictions with
relatively high posterior probabilities exist, a near-ancestor
sequence ensemble can be calculated for each node; for details
see [22]. If one sequence per internal node is of interest, select
for each position the residue possessing the highest posterior
probability.
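The final selection in step 11 can be sketched as follows, assuming the posterior probabilities of one internal node have already been parsed into one dictionary per residue position. This input format, the function names, and the threshold of 0.2 for reporting alternative predictions are assumptions made for illustration; they do not reproduce the output format of PAML or FastML or the ensemble procedure of [22].

```python
def most_probable_sequence(posteriors):
    """Select, for each position, the residue with the highest
    posterior probability and join the residues into one sequence."""
    return "".join(max(site, key=site.get) for site in posteriors)


def ambiguous_sites(posteriors, alt_threshold=0.2):
    """Report positions at which at least one alternative residue
    reaches the given posterior probability (hypothetical cutoff).
    Returns tuples of (1-based position, best prediction, alternatives)."""
    flagged = []
    for pos, site in enumerate(posteriors, start=1):
        ranked = sorted(site.items(), key=lambda item: item[1], reverse=True)
        alternatives = [(aa, p) for aa, p in ranked[1:] if p >= alt_threshold]
        if alternatives:
            flagged.append((pos, ranked[0], alternatives))
    return flagged


# Hypothetical posterior probabilities for three positions of one node
node_posteriors = [
    {"M": 0.99, "L": 0.01},
    {"K": 0.55, "R": 0.40, "Q": 0.05},
    {"V": 0.90, "I": 0.10},
]
print(most_probable_sequence(node_posteriors))
print(ambiguous_sites(node_posteriors))
```

Positions reported by `ambiguous_sites` are those where alternative predictions with relatively high posterior probabilities exist and where a near-ancestor sequence ensemble might be worth considering.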
2.2 Identification of Specificity-Determining Residues by Means of Intermediate Sequences
1. In analogy to Fig. 1, determine the branches of trphylo that
interconnect the two or more recent proteins proti under
study, i.e., those that possess diversified properties.
2. Compile an initial set anc_prot, consisting of ancestral proteins
that most likely differ from the extant proteins proti and
support an efficient characterization. For example, one can
compare all ancestral sequences pairwise to choose several
intermediates, i.e., ancestral sequences that span the sequence
differences between the proti in approximately similar
proportions. We recommend the usage of Jalview for sequence
selection (see Note 11). The finding that primordial proteins are
often generalists suggests adding the CA sequence to anc_prot
and characterizing the corresponding protein with high
priority.
3. Optional step: If the 3D structure of a proti is known, compute
homology models of all anc_prot (see Note 12) and try to
minimize further the number of candidate residues to be stud-
ied in the following steps. If protein function is of interest, use
the compiled annotations of PDBsum (www.ebi.ac.uk/pdbsum/)

or an alternative database to assess the position of the differing
residues with respect to a catalytic center or a binding site. If
complex formation is under study, consider a web server like
PISA (www.ebi.ac.uk/pdbe/pisa/) that details characteristics of
residues located in PPIs. One can also predict the contribution
of residues to protein or complex stability by utilizing force
fields to calculate differences in free energy (see Note 13). For
the example presented in Fig. 1, we could reduce the number of
putative key residue positions to two by combining in silico
approaches.
4. Optional step: If experimental characterization is intended,
choose protein sequences for the resurrection experiments
and design their gene sequences. Produce the proteins recombinantly
and characterize them according to the specific problem.
The choice of suitable wet-lab experiments depends on
the characteristics under assessment and may include tests of
enzyme activity or complex stability. Additionally, it is advisable
to confirm proper protein folding by means of far-UV CD
spectroscopy.
5. Associate the determined effects with the introduced muta-
tions to deduce the stepwise evolutionary adaptation toward
the properties of recent proteins. In case of ambiguous results,
repeat steps 2–4 of the protocol given in Subheading 2.2 and
extend the analyses to additional intermediates and/or single
point mutations.
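The pairwise comparison suggested in step 2 can be sketched in Python as follows. The sketch assumes that all sequences are taken from the same MSA and therefore have equal length. The toy sequences, the function names, and the ranking criterion (how evenly an ancestor splits the differences between two extant proteins) are hypothetical choices for illustration, not a prescribed part of the protocol.

```python
def differing_positions(prot_1, prot_2):
    """Alignment columns (0-based) at which the two extant sequences differ."""
    return [i for i, (a, b) in enumerate(zip(prot_1, prot_2)) if a != b]


def difference_profile(ancestor, prot_1, prot_2):
    """Count differing columns where the ancestor matches prot_1, prot_2, or neither."""
    columns = differing_positions(prot_1, prot_2)
    like_1 = sum(ancestor[i] == prot_1[i] for i in columns)
    like_2 = sum(ancestor[i] == prot_2[i] for i in columns)
    return like_1, like_2, len(columns) - like_1 - like_2


# Hypothetical aligned toy sequences (equal length, not real proteins).
prot_1 = "MKVLAT"
prot_2 = "MRVIGS"
ancestors = {"anc_a": "MKVLGS", "anc_b": "MRVLAT", "anc_c": "MQVLAT"}

profiles = {name: difference_profile(seq, prot_1, prot_2)
            for name, seq in ancestors.items()}
# Rank candidates: the most evenly split intermediate comes first.
ranked = sorted(profiles, key=lambda name: abs(profiles[name][0] - profiles[name][1]))
for name in ranked:
    print(name, profiles[name])
```

In this toy example, anc_a shares two of the four differing residues with each extant sequence and would therefore be a natural first intermediate to characterize.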
3 Notes
1. Compiling an appropriate sequence set for ASR is more an art

than an artisanal activity and sequence selection is an iterative
process that requires several rounds of user interaction. This is
why the initial number of sequences should be as high as
possible. Choose sequences that are most likely orthologs and
avoid the addition of paralogous sequences by comparing gene
duplicates. If a Bayesian approach is used to infer the phyloge-
netic tree, running time is an issue that currently limits the
finally selected number of recent sequences to 200. Make
sure that the chosen sequences originate from phyla needed
to deduce the intended set of predecessors. If one wants to
represent the last universal common ancestor, the chosen
sequences must at least come from several bacterial and archaeal
clades. Store sequence sets in multi-FASTA file format, which is
accepted by most tools required for ASR.
2. Use a MAFFT method that is accuracy-oriented, i.e., one of the
“INS” modes. This selection depends on the size of the MSA;
for details see the MAFFT manual. For the initial generation of
large MSAs, the option --auto is also appropriate.
3. Modeling the history of insertions and deletions on an evolu-
tionary time scale is difficult and requires for most ASR algo-
rithms the manual adjustment of primordial sequences. One
can minimize errors by choosing a set of sequences of relatively
uniform length.
4. Jalview is an excellent tool for the preparation of sequence sets
used in ASR. The Jalview command Edit\Remove redundancy
allows the selection of a percentage identity threshold and
initiates the subsequent comparison of all sequence pairs. If
the similarity of any two sequences exceeds this cutoff, the
shorter sequence is discarded. A cutoff of 95% or lower is useful
to remove redundant sequences and to avoid highly articulated
subtrees. The command Calculate\Sort by length makes it pos-
sible to identify easily sequences that are much shorter or
longer than the reference sequence. These sequences and
those introducing strikingly long indels can be erased by click-
ing their name and the delete button. The command Web
Service\Alignment offers several alternatives for MSA creation,
among them MAFFT.
5. Concatenation helps to deduce a robust tree due to the stron-
ger phylogenetic signal spread over a larger set of residue posi-
tions. Make sure that the sequences originate from the same
species by using for their linkage the Tax-Id assigned by the
taxonomy browser of the NCBI. Note that concatenation is only
valid for sequences that coevolve and share the same evolution-
ary history for the entire period under study.
6. For the visual inspection of trees, it is helpful to replace the
hard to interpret database identifiers with names that indicate
the function of the proteins and/or the phylogenetic position
of the species contributing the sequences. We use our in-house
tool Key2Ann [41] to denote the phylogenetic lineage; see Fig. 1
for an example.
7. A detailed description of all the programs and their options
belonging to the software suite PhyloBayes can be found at
www.phylobayes.org. For the reconstruction of amino acid
sequences, we use the CAT or JTT model and specify a minimal
effective sample size of 100. Congruence can be tested by
calculating the maximum difference (maxdiff) of posterior
probabilities of tree bipartitions by using the PhyloBayes tool
bpcomp; the maxdiff value should be below 0.3 [29]. Compu-
tation time can be reduced by using the multi-core version
PhyloBayes-MPI. Note that an MCMC calculation may take
several weeks if a large number of recent sequences was
chosen.
8. A detailed description of the BEAST functionality can be found

at www.beast2.org. The BEAST tool LogCombiner can be used
to discard the burn-in samples and Tracer allows one to deter-
mine the effective sample size. TreeAnnotator assists the user in
summarizing information from a sample of trees onto a con-
sensus tree. Computation time of BEAST can be reduced by
incorporating the BEAGLE library for parallel processing.
9. Long branches (>1.0 mutations per site) and low posterior
probabilities (<0.75) prevent the reliable computation of
ancestral states. The same is true if the divergence of the sequence
set is too small or if the tree is highly articulated. To overcome
these problems, the content of the sequence set has to be
altered. For example, one can exclude sequences that give rise
to long branches and remove some sequences in highly articulated
subtrees.
10. According to our experience with MSAs containing a small
number of indels, FastML performs well in ASR. If the MSA
contains a larger number of indels, one can try several values of
the advanced option probability cutoff to prefer ancestral indel
over character and compare the results. For further processing,
choose the sequences computed as a marginal reconstruction.
Note that FastML does not offer all evolutionary models
implemented for PhyloBayes or BEAST. Alternatively, one can
use PRANK [42] or Historian [43] that are based on alterna-
tive models of indel evolution. Due to the method used for
indel reconstruction, the lengths of reconstructed sequences
may deviate from the mean length of extant sequences as N-
and C-termini are of higher variability than the rest of the
sequence. Thus, it might be necessary to trim the recon-
structed sequences.
11. For a set of sequences, the similarity of all pairs can easily be
determined by executing the Jalview command Calculate
\Pairwise Alignment.
12. Several alternatives are available to compute homology models
of subunits and protein complexes, among them are YASARA
[44], I-Tasser [45], or HHSearch [46] in combination with
Modeller [47]. For ASR experiments, one can expect reliable
models, because the sequence similarity between the template
(a proti) and the target (an anc_prot) is usually high.
13. The effect of a mutation on protein or complex stability can be
assessed in silico by utilizing programs like FoldX [48], which is
a stand-alone application, but also integrated into YASARA.
To predict the effect of point mutations on protein
stability, assess the corresponding ΔΔG values. To estimate
the effect on complex stability, compute the ΔΔG value that
indicates the binding energy difference between a “wild-type
complex” and a complex with a mutated PPI. |ΔΔG| values above
2 kcal/mol are considered a significant contribution of one
residue to complex stability. For this FoldX analysis, three
functions have to be executed in succession, namely
RepairPDB, BuildModel, and AnalyseComplex. For this specific
application of FoldX, the “wild-type complexes” may consist of
proti or anc_prot sequences, which can differ in their length. In
order to identify the corresponding residues, create an MSA
containing proti and anc_prot sequences to coordinate their
positions.
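The threshold given in Note 13 can be applied with a few lines of Python once the energies are available. The following sketch does not call FoldX and makes no assumption about its output files: the interaction energies, the mutation labels, and the sign convention (mutant minus wild type) are hypothetical and serve only to illustrate the arithmetic.

```python
DDG_THRESHOLD = 2.0  # kcal/mol, the threshold quoted in Note 13


def ddg(energy_wild_type, energy_mutant):
    """Return the difference mutant minus wild type (sign convention assumed here)."""
    return round(energy_mutant - energy_wild_type, 2)


def significant_positions(energy_wild_type, mutant_energies, threshold=DDG_THRESHOLD):
    """Return {mutation: ddG} for mutations whose absolute ddG exceeds the threshold."""
    result = {}
    for mutation, energy in mutant_energies.items():
        value = ddg(energy_wild_type, energy)
        if abs(value) > threshold:
            result[mutation] = value
    return result


# Hypothetical interaction energies in kcal/mol (invented for illustration).
wild_type = -12.4
mutants = {"W123A": -8.9, "E167D": -12.0, "R18K": -14.9}
hits = significant_positions(wild_type, mutants)
print(hits)  # {'W123A': 3.5, 'R18K': -2.5}
```

Under the assumed sign convention, a positive value marks a mutation that weakens the complex and a negative value one that strengthens it; both are flagged because the criterion uses the absolute value.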
Acknowledgement
This work was supported by the Deutsche Forschungsgemeinschaft

(ME2259/2-1). Calculations were facilitated by using advanced
computational infrastructure provided by the Leibniz Supercom-
puting Center of the Bavarian Academy of Sciences and Humanities
(www.lrz.de) under grant pr48fu. We thank Samuel Blanquart for
continuous support, many helpful hints, and fruitful discussions.
References
1. Lee D, Redfern O, Orengo C (2007) Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 8(12):995–1005. https://doi.org/10.1038/nrm2281
2. Schymkowitz J, Borg J, Stricher F et al (2005) The FoldX web server: an online force field. Nucleic Acids Res 33(Web Server issue):W382–W388
3. Janda JO, Meier A, Merkl R (2013) CLIPS-4D: a classifier that distinguishes structurally and functionally important residue-positions based on sequence and 3D data. Bioinformatics 29(23):3029–3035. https://doi.org/10.1093/bioinformatics/btt519
4. Zellner H, Staudigel M, Trenner T et al (2012) PresCont: predicting protein-protein interfaces utilizing four residue properties. Proteins 80(1):154–168. https://doi.org/10.1002/prot.23172
5. Plach MG, Löffler P, Merkl R, Sterner R (2015) Conversion of anthranilate synthase into isochorismate synthase: implications for the evolution of chorismate-utilizing enzymes. Angew Chem Int Ed 54(38):11270–11274. https://doi.org/10.1002/anie.201505063
6. Edgar RC, Batzoglou S (2006) Multiple sequence alignment. Curr Opin Struct Biol 16(3):368–373
7. Harms MJ, Thornton JW (2010) Analyzing protein structure and function using ancestral gene reconstruction. Curr Opin Struct Biol 20(3):360–366. https://doi.org/10.1016/j.sbi.2010.03.005
8. Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4):406–425
9. Gerlt JA (2017) Genomic enzymology: web tools for leveraging protein family sequence-function space and genome context to discover novel functions. Biochemistry 56(33):4293–4308. https://doi.org/10.1021/acs.biochem.7b00614
10. Merkl R, Sterner R (2016) Ancestral protein reconstruction: techniques and applications. Biol Chem 397(1):1–21. https://doi.org/10.1515/hsz-2015-0158
11. Thornton JW (2004) Resurrecting ancient genes: experimental analysis of extinct molecules. Nat Rev Genet 5(5):366–375. https://doi.org/10.1038/nrg1324
12. Liberles DA (2007) Ancestral sequence reconstruction. Oxford University Press, Oxford
13. Hochberg GKA, Thornton JW (2017) Reconstructing ancient proteins to understand the causes of structure and function. Annu Rev Biophys 46:247–269. https://doi.org/10.1146/annurev-biophys-070816-033631
14. Bornscheuer UT, Huisman GW, Kazlauskas RJ et al (2012) Engineering the third wave of biocatalysis. Nature 485(7397):185–194. https://doi.org/10.1038/nature11117
15. Romero-Romero ML, Risso VA, Martinez-Rodriguez S et al (2016) Engineering ancestral protein hyperstability. Biochem J 473(20):3611–3620. https://doi.org/10.1042/BCJ20160532
16. Massiere F, Badet-Denisot MA (1998) The mechanism of glutamine-dependent amidotransferases. Cell Mol Life Sci 54(3):205–222
17. Zalkin H, Smith JL (1998) Enzymes utilizing glutamine as an amide donor. Adv Enzymol Relat Areas Mol Biol 72:87–144
18. Beismann-Driemeyer S, Sterner R (2001) Imidazole glycerol phosphate synthase from Thermotoga maritima. Quaternary structure, steady-state kinetics, and reaction mechanism of the bienzyme complex. J Biol Chem 276(23):20387–20396
19. List F, Vega MC, Razeto A et al (2012) Catalysis uncoupling in a glutamine amidotransferase bienzyme by unblocking the glutaminase active site. Chem Biol 19(12):1589–1599. https://doi.org/10.1016/j.chembiol.2012.10.012
20. Reisinger B, Sperl J, Holinski A et al (2014) Evidence for the existence of elaborate enzyme complexes in the Paleoarchean era. J Am Chem Soc 136(1):122–129. https://doi.org/10.1021/ja4115677
21. Holinski A, Heyn K, Merkl R, Sterner R (2017) Combining ancestral sequence reconstruction with protein design to identify an interface hotspot in a key metabolic enzyme complex. Proteins 85(2):312–321. https://doi.org/10.1002/prot.25225
22. Bar-Rogovsky H, Stern A, Penn O et al (2015) Assessing the prediction fidelity of ancestral reconstruction by a library approach. Protein Eng Des Sel 28(11):507–518. https://doi.org/10.1093/protein/gzv038
23. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
24. Pruitt KD, Tatusova T, Klimke W, Maglott DR (2009) NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res 37(Database issue):D32–D36. https://doi.org/10.1093/nar/gkn721
25. Apweiler R, Martin M, O’Donovan C et al (2013) Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res 41(D1):D43–D47
26. Hunter S, Jones P, Mitchell A et al (2012) InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res 40(Database issue):D306–D312. https://doi.org/10.1093/nar/gkr948
27. Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4):772–780. https://doi.org/10.1093/molbev/mst010
28. Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ (2009) Jalview Version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics 25(9):1189–1191. https://doi.org/10.1093/bioinformatics/btp033
29. Castresana J (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17(4):540–552
30. Lartillot N, Lepage T, Blanquart S (2009) PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25(17):2286–2288. https://doi.org/10.1093/bioinformatics/btp368
31. Ali RH, Bark M, Miro J et al (2017) VMCMC: a graphical and statistical analysis tool for Markov chain Monte Carlo traces. BMC Bioinformatics 18(1):97. https://doi.org/10.1186/s12859-017-1505-3
32. Ronquist F, Huelsenbeck JP (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19(12):1572–1574
33. Bouckaert R, Heled J, Kuhnert D et al (2014) BEAST 2: a software platform for Bayesian evolutionary analysis. PLoS Comput Biol 10(4):e1003537. https://doi.org/10.1371/journal.pcbi.1003537
34. Abascal F, Zardoya R, Posada D (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21(9):2104–2105. https://doi.org/10.1093/bioinformatics/bti263
35. Perriere G, Gouy M (1996) WWW-query: an on-line retrieval system for biological sequence banks. Biochimie 78(5):364–369
36. Rambaut A (2012) FigTree v1.4. http://tree.bio.ed.ac.uk/software/figtree/
37. Ciccarelli FD, Doerks T, von Mering C et al (2006) Toward automatic reconstruction of a highly resolved tree of life. Science 311(5765):1283–1287
38. Puigbo P, Wolf YI, Koonin EV (2009) Search for a ‘Tree of Life’ in the thicket of the phylogenetic forest. J Biol 8(6):59. https://doi.org/10.1186/jbiol159
39. Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24(8):1586–1591. https://doi.org/10.1093/molbev/msm088
40. Ashkenazy H, Penn O, Doron-Faigenboim A et al (2012) FastML: a web server for probabilistic reconstruction of ancestral sequences. Nucleic Acids Res 40(Web Server issue):W580–W584. https://doi.org/10.1093/nar/gks498
41. Pürzer A, Grassmann F, Birzer D, Merkl R (2011) Key2Ann: a tool to process sequence sets by replacing database identifiers with a human-readable annotation. J Integr Bioinform 8(1):153. https://doi.org/10.2390/biecoll-jib-2011-153
42. Löytynoja A, Goldman N (2008) Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320(5883):1632–1635. https://doi.org/10.1126/science.1158395
43. Holmes IH (2017) Historian: accurate reconstruction of ancestral sequences and evolutionary rates. Bioinformatics 33(8):1227–1229. https://doi.org/10.1093/bioinformatics/btw791
44. Krieger E, Joo K, Lee J et al (2009) Improving physical realism, stereochemistry, and side-chain accuracy in homology modeling: four approaches that performed well in CASP8. Proteins 77 Suppl 9:114–122. https://doi.org/10.1002/prot.22570
45. Zhang Y (2008) I-TASSER server for protein 3D structure prediction. BMC Bioinformatics 9:40. https://doi.org/10.1186/1471-2105-9-40
46. Söding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21(7):951–960. https://doi.org/10.1093/bioinformatics/bti125
47. Webb B, Sali A (2014) Protein structure modeling with MODELLER. Methods Mol Biol 1137:1–15. https://doi.org/10.1007/978-1-4939-0366-5_1
48. Guerois R, Nielsen JE, Serrano L (2002) Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol 320(2):369–387
Chapter 10
Enhancing Statistical Multiple Sequence Alignment and Tree Inference Using Structural Information
Joseph L. Herman
Abstract
For highly divergent sequences, there is often insufficient information to reliably construct alignments and
phylogenetic trees. Since protein structure may be strongly conserved despite large divergences in
sequence, structural information can be used to help identify homology in such cases.
While there exist well-studied models of sequence evolution, structurally informed alignment methods
have typically made use of geometric measures of deviation that do not take into account the underlying
mutational processes. In order to integrate structural information into sequence-based evolutionary
models, we recently developed a stochastic model of structural evolution on a phylogenetic tree and
implemented this as the StructAlign plugin for the StatAlign statistical alignment package.
In this chapter, we will outline the types of analyses that can be carried out using StructAlign, illustrating
how the inclusion of structural information can be used to inform joint estimation of alignments and trees.
StructAlign can also be used to infer branch-specific rates of structural evolution, and analysis of an example
globin dataset highlights strong variation in the inferred rate across the tree. While structure is more highly
conserved within clades, the rate of structural divergence as a function of sequence variation is larger
between functionally divergent proteins. Allowing for the rate of structural divergence to vary over the tree
results in an improved fit to the empirically observed pairwise RMSD values.
Key words Protein structure, Structural alignment, RMSD, Statistical alignment, Alignment uncer-
tainty, Bayesian hierarchical models, MCMC, Parallel tempering, Molecular phylogenetics, Globins
1 Introduction
Homologous sequences may diverge substantially while still preserving
function. When comparing such sequences, a large number
of alternative alignments may achieve a very similar score or likeli-
hood [1, 2], and selecting any one of these alignments can strongly
bias the downstream inference [3–7], especially if a single guide tree
is used to construct the alignment [8].
The original version of this chapter was revised. The correction to this chapter is available at https://doi.org/
10.1007/978-1-4939-8736-8_23
One solution to this issue is to jointly estimate alignments

along with other parameters of interest (e.g., tree topologies,
branch lengths, or indel rates), as part of a hierarchical probabilistic
model [9, 10]. This allows the uncertainty in each of these vari-
ables, including the alignment, to be incorporated into the infer-
ence of the other parameters, eliminating the bias associated with
selecting a single alignment for downstream inference. As well as
reducing errors in tree reconstruction, joint estimation of align-
ments has been found to improve inference of indel rates [11, 12],
positive selection parameters [13], and the location of putative
regulatory elements [14, 15].
However, for sequences that are highly divergent, sequence
similarity alone may be insufficient for identifying homology, giving
rise to high uncertainty, and a diffuse posterior distribution over
alignments, trees, and other model parameters. While inclusion of
additional sequences into the dataset may reduce tree uncertainty
given a particular alignment and set of model parameters, the set of
plausible alignments increases as more sequences are added, espe-
cially in regions of low homology, such that the resulting distribu-
tion over trees may become less reliable and more sensitive to the
choice of model [16, 17]. Filtering out regions of the alignment
that are predicted to be unreliable or highly variable may also
appear to reduce uncertainty in such cases [18, 19], although
systematic removal of indel-rich regions may introduce additional
bias into the inferred trees [20–22].
Since protein structure often needs to be conserved in order to
preserve function, structural similarity can be used to identify
homologous proteins even in the so-called twilight zone of very
low sequence identity [23]. However, while there have been some
attempts to construct phylogenetic trees directly from pairwise
structural distances [24–26], the lack of well-developed models
for structural evolution has made it difficult to translate structural
deviation into evolutionary time. Examining the relative rates of
sequence and structural divergence over large sets of proteins, there
appear to be some global trends across multiple classes [27],
although the exact relationship may vary between families
[28, 29] as well as across different sites within a protein [30]. In
addition, sequences may undergo structurally constrained neutral
evolution while still preserving fold [31–36], such that the increase
in structural divergence may be a nonlinear [37] or discontinuous
[32, 38, 39] function of evolutionary distance.
Challis and Schmidler [40] developed a stochastic model
describing the joint evolution of structure and sequence over
time, whereby local structural deviations within a particular fold
are described using an Ornstein-Uhlenbeck diffusion process. In
order to facilitate efficient inference, the authors focused on an
independent-sites model of structural evolution, for which the
overall metric of structural similarity between two proteins can be
expressed as a sum of per-site contributions, such that dynamic
programming algorithms can still be used to perform analytical

summation over alignments. Herman et al. [41] extended this
model to multiple structures related by a tree, with an additional
baseline variance term that takes account of the intrinsic structural
variability at different sites. This model allows for the rate of struc-
tural divergence to vary independently for each branch of the tree,
allowing for more complex relationships between sequence and
structure divergence, and was incorporated into the StatAlign sta-
tistical alignment package [42] as the StructAlign plugin.
2 Materials
2.1 Running StatAlign
StatAlign is written in Java and is run via a JAR file, which can either
be obtained from the pre-compiled distribution or built from
source. The pre-compiled package can be obtained from GitHub at
https://github.com/statalign/statalign/releases/download/v3.3/StatAlign-v3.3.zip.
Source code can also be downloaded and compiled from the
GitHub repository if desired.
The graphical version of StatAlign can be run on Windows,
Mac, and Linux by double-clicking on the JAR archive (see Note 1).
Instructions for using this GUI version can be found on the StatA-
lign website at http://statalign.github.io/doc/user_manual.html.
For longer-running analyses on multiple cores, we
need to make use of the command-line version of StatAlign. In
this chapter, we will present the commands required to run the
command-line version under Unix-based systems; Windows users
can run these commands using a terminal emulator such as Cygwin.
The single-threaded version can be invoked from the JAR via
java -jar StatAlign.jar [OPTIONS] DATA_FILE1 [DATA_FILE2 ...]
For the MPI-based parallel version, there is a wrapper script

that passes the arguments to the JAR in the appropriate form
StatAlignParallel NCORES [OPTIONS] DATA_FILE1 [DATA_FILE2 ...]
Running java -jar StatAlign.jar or StatAlignParallel

with no arguments will output additional usage information and a
list of available options.
2.2 Example Dataset
In this chapter we will investigate the phylogenetic relationship
between cytoglobin [43, 44] and a set of nine other globin structures
(Table 1). The functional role of cytoglobin is currently
unknown, and there has been recent interest in determining its
relationship to other globins [45–47].

Table 1
Protein structures used in the example dataset, with heme coordination number and exogenous ligand
shown
Structure  Protein  Organism                     Coordination  Ligand
2oif       NsGb     H. vulgare (barley)          6             CN
1bin       Lhb      G. max (soybean)             5             Acetate
1lh1       Lhb      L. luteus (lupin bean)       5             Acetate
1urv       Cygb     H. sapiens (human)           6             None
2lhb       CycHb    P. marinus (lamprey)         5             CN
1myt       Mb       T. albacares (tuna)          5             None
2mm1       Mb       H. sapiens (human)           5             None
1psgA      α-Hb     L. xanthurus (spot croaker)  5             CO
2hhbA      α-Hb     H. sapiens (human)           5             None
2hhbB      β-Hb     H. sapiens (human)           5             None

For input to StatAlign, each PDB file should contain a single
chain to be analyzed (see Note 2). To construct the example
dataset, the A-chain was extracted from the PDB file corresponding
to each structure in Table 1; for 2hhbB, the B-chain was used.
The PDB files corresponding to this dataset can be found in the
examples/10_globins subdirectory distributed with StatAlign,
along with a FASTA-formatted file 10_globins.fasta containing
only the primary sequences.
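For readers who wish to prepare their own single-chain input files, the chain extraction described above can be sketched in Python. The sketch relies only on the fixed-width PDB format, in which the chain identifier occupies column 22 of each ATOM record. Keeping only ATOM records (and thereby dropping HETATM entries such as the heme group) is an assumption made here, not a documented StatAlign requirement, and the helper that builds toy records is hypothetical.

```python
def extract_chain(pdb_lines, chain_id):
    """Return the ATOM records of one chain, followed by an END record.

    In the fixed-width PDB format the chain identifier is column 22,
    i.e., index 21 of the record when counting from zero.
    """
    kept = [line.rstrip("\n") for line in pdb_lines
            if line.startswith("ATOM") and len(line) > 21 and line[21] == chain_id]
    return kept + ["END"]


def toy_atom_record(serial, chain_id, residue_number):
    """Hypothetical helper that builds a C-alpha ATOM record for this demo."""
    return (f"ATOM  {serial:>5} {' CA ':<4} {'VAL':>3} {chain_id}{residue_number:>4}"
            f"    {1.0:8.3f}{2.0:8.3f}{3.0:8.3f}")


# Two residues of chain A and one residue of chain B (invented coordinates).
records = [toy_atom_record(1, "A", 1), toy_atom_record(2, "A", 2),
           toy_atom_record(3, "B", 1)]
chain_b = extract_chain(records, "B")
print(len(chain_b) - 1)  # number of ATOM records kept for chain B
```

In practice the list of lines would be read from a downloaded PDB file and the result written to a new file, one per chain to be analyzed.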
2.3 Analysis of StatAlign Output
We will make use of R in order to analyze the output files generated
by StatAlign, and code is provided for each analysis step in this
chapter. Unless otherwise specified, the example commands are to
be run from within the directory where StatAlign is installed,
with the path to this directory saved as a STATALIGN_HOME variable
in the shell and R environments. Several R packages and some
additional scripts will also be required; once installed, the requisite
packages can be loaded using the code below:
packages = c('dplyr','coda','magrittr','ape','ggplot2',
'reshape2','data.table','gridExtra')
for (package in packages) require(package,character.only=TRUE)
3 Methods
3.1 Running StatAlign in Sequence-Only Mode
As a preliminary analysis, we will first analyze the globin dataset
using the original sequence-only model in StatAlign; later we will
go on to assess the effect of including structural information.
Table 2
Output files created by running StatAlign on the example dataset
File                               Contents                                    Format
10_globins.fasta.ll                Sample number, sequence model log           Tab-separated
                                   likelihood, total log likelihood
10_globins.fasta.tree              Sampled trees                               NEXUS
10_globins.fasta.coreModel.params  Sample number, parameters for the indel     Tab-separated
                                   model (TKF92)
10_globins.fasta.log               Specified via -log argument (by default     Modified FASTA
                                   includes sampled alignments, total log
                                   likelihood, and MCMC acceptance rates)
10_globins.fasta.length            Sample number, alignment length             Tab-separated
StatAlign can be invoked via the command below (see Note 3).
Note that the backslashes split the command over multiple
lines for readability; it can also be run on a single line:
java -jar StatAlign.jar \
    -mcmc=1m,5m,200,0 \
    examples/10_globins/10_globins.fasta
The program options specify a burn-in period of one million

iterations (during which the MCMC sampler is automatically tuned
in order to improve convergence), followed by five million sam-
pling iterations (during which the state is recorded every
200 moves), with zero pre-burn-in randomization period (setting
this to a non-zero value is useful when assessing the sensitivity of
the results to the initial state). On an Intel i7-4790K CPU
(4.00 GHz), this takes just over 2 h to complete, producing the
output files summarized in Table 2 (these will be located in the
same directory as the input data file).
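To make the meaning of the four values concrete, the following Python sketch splits the option string used above and derives the number of recorded samples. It is not part of StatAlign: the field names are hypothetical labels for the quantities described in the text, and only the "m" suffix (millions), which appears in the command above, is handled.

```python
def parse_count(text):
    """Convert '5m' to 5000000; plain integers such as '200' pass through."""
    text = text.strip().lower()
    if text.endswith("m"):
        return int(float(text[:-1]) * 1_000_000)
    return int(text)


def parse_mcmc_option(option):
    """Split an option such as '-mcmc=1m,5m,200,0' into named integer fields."""
    values = [parse_count(field) for field in option.split("=", 1)[1].split(",")]
    # Hypothetical labels for the four positions described in the text.
    names = ["burn_in", "cycles", "sample_rate", "pre_burn_in"]
    return dict(zip(names, values))


settings = parse_mcmc_option("-mcmc=1m,5m,200,0")
recorded_samples = settings["cycles"] // settings["sample_rate"]
print(settings)
print(recorded_samples)  # 25000
```

Five million sampling iterations recorded every 200 moves thus yield 25,000 samples per output file, which is the raw sample count against which any later effective sample size should be judged.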
3.2 Analysis of MCMC Output
As discussed earlier, StatAlign uses an MCMC sampler to generate
samples from the posterior distribution over alignments, trees, and
model parameters. As with all MCMC analyses, the sampler will
take time to converge to the posterior distribution, and it is impor-
tant to assess whether convergence has been achieved before pro-
ceeding with downstream analysis.
Two basic summary statistics that can help to assess conver-
gence are the total log likelihood and the alignment length. We can
read in the output files and plot the posterior distribution of these
quantities using the following commands:
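# Common prefix of all StatAlign output files for this dataset
base.name = file.path(STATALIGN_HOME,
                      "examples/10_globins/10_globins.fasta")
# Total log likelihood of each recorded sample
log.likelihood = paste0(base.name,".ll") %>%
  fread %>% select(ll.all)
# Alignment length of each recorded sample
ali.length = paste0(base.name,".length") %>%
  fread %>% select(ali.length)
# Trace and density plots for both quantities
plot(mcmc(cbind(log.likelihood,ali.length)))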
As shown in Fig. 1, although the mean values appear to have

converged, both of these trace plots show significant autocorrela-
tion. In the context of an MCMC sampler, this indicates what is
referred to as “poor mixing,” meaning that the sampler may not
explore the posterior efficiently, requiring a long time to fully
sample the relevant alignments, trees, and model parameters. We
can quantify this more rigorously by examining the autocorrelation
and effective sample size (ESS) using the acf and effectiveSize
functions in R (see Fig. 2). The effective sample size of 109 for the
alignment length indicates poor mixing in this case.
layout(t(1:2))
acf(log.likelihood)
acf(ali.length)
effectiveSize(ali.length)
## ali.length
## 108.9655
Fig. 1 Trace plots for the log likelihood and alignment length, illustrating some noticeable autocorrelation
Fig. 2 Autocorrelation plots for the log likelihood and alignment length, indicating
poor mixing
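The link between autocorrelation and effective sample size can be made explicit with a small, self-contained Python sketch. It uses the simple estimator ESS = N / (1 + 2 Σ ρ_k), with the sum truncated at the first non-positive autocorrelation. This is a deliberately simplified stand-in: the effectiveSize function of the coda package estimates the same quantity from the spectral density at frequency zero, so the two will not agree exactly. The simulated chains are hypothetical and unrelated to the globin analysis.

```python
import random


def autocorrelation(chain, lag):
    """Sample autocorrelation of a chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    variance = sum((x - mean) ** 2 for x in chain) / n
    if variance == 0:
        raise ValueError("chain is constant; autocorrelation is undefined")
    covariance = sum((chain[i] - mean) * (chain[i + lag] - mean)
                     for i in range(n - lag)) / n
    return covariance / variance


def effective_sample_size(chain, max_lag=None):
    """ESS = N / (1 + 2 * sum of autocorrelations up to the first non-positive one)."""
    n = len(chain)
    max_lag = n // 2 if max_lag is None else max_lag
    total = 0.0
    for lag in range(1, max_lag + 1):
        rho = autocorrelation(chain, lag)
        if rho <= 0:
            break
        total += rho
    return n / (1 + 2 * total)


# Simulated chains: independent draws versus a strongly autocorrelated AR(1) series.
random.seed(1)
n = 2000
white = [random.gauss(0, 1) for _ in range(n)]
sticky = [0.0]
for _ in range(n - 1):
    sticky.append(0.9 * sticky[-1] + random.gauss(0, 1))

print(round(effective_sample_size(white)), round(effective_sample_size(sticky)))
```

The autocorrelated chain has a far smaller effective sample size than the independent one even though both contain the same number of draws, which is the situation indicated by the autocorrelation plots in Fig. 2.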
3.3 Using Parallel Tempering to Improve Mixing
Slow mixing between different regions of the posterior is a common
issue when sampling high-dimensional hierarchical models
involving discrete parameters such as sequence alignments and
tree topologies. Although StatAlign employs a number of advanced
proposal distributions to efficiently explore the parameter space,
transitions between modes of similar posterior density may still
require traversal of configurations that are highly unfavorable,
making these transitions very infrequent.
One way to address this is to use a multiple-chain MCMC sampler where
each chain has an associated heat parameter, t, used to flatten out the
posterior surface in order to increase the frequency of transitions
between configurations [48]. Under this scheme, the chain with t = 1 (the
“cold chain”) generates samples from the true posterior, and the heated
chains (with t > 1) sample from flattened versions thereof, which can be
traversed more easily. Heats are swapped between chains, with each
proposed swap accepted or rejected according to the appropriate
Metropolis-Hastings ratio, so that the stationary distribution of the
cold chain is maintained.
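The swap step described above can be sketched as follows. This is a minimal illustration of the standard parallel tempering acceptance rule, assuming that a chain with inverse temperature beta (taken here as the reciprocal of the heat parameter) targets the posterior raised to the power beta. The helper name swap.accept.prob and the numerical values are hypothetical, and the exact form of tempering used inside StatAlign is not specified here.

```r
# Acceptance probability for swapping the states of two tempered chains.
# Assumes chain k targets posterior^beta[k], with beta = 1 for the cold chain;
# swap.accept.prob is a hypothetical helper, not a StatAlign function.
swap.accept.prob = function(beta.i, beta.j, logpost.i, logpost.j) {
  log.ratio = (beta.i - beta.j) * (logpost.j - logpost.i)
  min(1, exp(log.ratio))
}

# the heated chain (beta = 0.9) currently holds the better state:
# the swap is always accepted
p.up = swap.accept.prob(1.0, 0.9, logpost.i = -100, logpost.j = -90)
# the cold chain holds the better state: accepted with probability exp(-1)
p.down = swap.accept.prob(1.0, 0.9, logpost.i = -90, logpost.j = -100)
print(c(p.up, p.down))
```

Because the acceptance probability depends on the product of the temperature difference and the difference in log posterior, closely spaced temperatures give higher swap acceptance rates.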
Running several chains in parallel at different heats, parallel
tempering can be implemented efficiently by swapping tempera-
tures between specified chains according to a random sequence that
is shared between all chains [49]. Empirically we have found linear
spacings between adjacent inverse temperature values to be effec-
tive, with the default step size set to 0.01 in StatAlign v3.3 (this can
be modified via the -tempDiff argument to StatAlign). As well as
exchanging temperature parameters, StatAlign also swaps para-
meters that determine the variance of MCMC proposal distribu-
tions, ensuring that optimal acceptance ratios for individual moves
are maintained.
3.4 Running the Parallel Version of StatAlign
Before running the parallel version on the test dataset, we will first
copy the input files to a new directory, so that the new output does not
overwrite the existing output files:
mkdir -p mpi_output
cp examples/10_globins/10_globins.fasta mpi_output
StatAlign can then be run in parallel via the StatAlignParallel wrapper
script:
StatAlignParallel 8 \
-mcmc=1m,5m,200,0,20 \
-tempDiff=0.02 \
mpi_output/10_globins.fasta
The additional (fifth) MCMC argument specifies that temperature exchanges
should be proposed between parallel chains every 20 iterations. This
ensures that the chains do not drift too far apart at different
temperatures, increasing the acceptance rate of temperature swaps. Running
StatAlign with these parameters takes approximately 5 h on a machine with
eight 2.3 GHz Xeon processors.
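As a quick check on what the -tempDiff setting implies, the sketch below builds the ladder of inverse temperatures for eight chains. It assumes that the cold chain sits at beta = 1 and that each further chain is lowered by one step of size tempDiff; this assumption is consistent with the beta values listed in Table 3, but the variable names are illustrative only.

```r
n.chains = 8
temp.diff = 0.02   # value passed via -tempDiff in the command above
# assumed ladder: cold chain at beta = 1, then equal steps downwards
beta.ladder = 1 - temp.diff * (0:(n.chains - 1))
print(beta.ladder)
```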
3.5 Analyzing Parallel Output
When running in parallel mode, each chain generates a separate set of
output files, indicated by chainX in the filename (see Note 4). The
.coreModel.params files now contain the inverse temperature parameter
(beta) as the second column. We will first extract this information in
order to aggregate the samples based on the chain temperature:
n.chains = 8
chains = 0:(n.chains-1)
base.name = paste0(STATALIGN_HOME,
"/mpi_output/10_globins.fasta")
# read in core model (TKF92) MCMC output file
coreModel.list = lapply(chains,function(x)
paste0(base.name,".chain",x,".coreModel.params") %>% fread
)
coreModel = do.call(rbind,coreModel.list)
# extract the values of the beta (inverse temperature)
# parameter
beta.values = as.character(sort(unique(coreModel$beta)))
We can then read in the per-sample log likelihood and alignment length
for each chain:
log.likelihood = lapply(chains, function(x)
  paste0(base.name,".chain",x,".ll") %>% fread
) %>% do.call(rbind,.)
ali.length = lapply(chains, function(x)
  paste0(base.name,".chain",x,".length") %>% fread
) %>% do.call(rbind,.)
These values can then be aggregated based on the beta (inverse
temperature) parameter:
log.likelihood =
Map(function(i)
log.likelihood %>%
filter(coreModel$beta==i) %>%
arrange(sample) %>%
select(ll.all),
beta.values)
ali.length =
Map(function(i)
ali.length %>%
filter(coreModel$beta==i) %>%
arrange(sample) %>%
select(ali.length),
beta.values)
In order to assess the mixing between temperatures, we can create a
table indicating the frequencies with which each chain samples the
different temperature parameters. For a well-mixing parallel tempering
scheme, each chain should be sampling multiple temperature parameters
with high frequency. As shown in Table 3, for the example dataset, each
chain samples a range of temperature values with high frequency,
indicating very good mixing across temperatures:
beta.freqs = lapply(coreModel.list,with,beta) %>%
  lapply(.,table) %>%
  do.call(cbind,.)
Table 3
Frequency of inverse temperature parameters sampled in each MCMC chain
(rows: inverse temperature, beta; columns: chain)
beta     1     2     3     4     5     6     7     8
0.86  3200  2686  3177  2748  3472  3578  3477  2662
0.88  3171  2982  3252  2901  3216  3350  3381  2747
0.90  3033  3102  3395  3021  3223  3141  3222  2863
0.92  3137  3319  3311  3280  3100  3111  2932  2810
0.94  3227  3161  3119  3209  3041  2973  3051  3219
0.96  3153  3239  3057  3218  2978  2962  2967  3426
0.98  3126  3153  2901  3294  3036  2979  2979  3532
1.00  2953  3358  2788  3329  2934  2906  2991  3741
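One way to summarize a frequency table like Table 3 is to convert the counts into the proportion of samples each chain spends at each temperature and compare this with the uniform share. The sketch below does so for a small made-up count matrix; the counts, the object names, and the use of the maximum deviation as a summary are assumptions for illustration rather than part of the StatAlign workflow.

```r
# Hypothetical counts: rows = inverse temperatures, columns = chains.
beta.counts = matrix(c(40, 30, 30,
                       30, 40, 30,
                       30, 30, 40),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(beta = c("0.96", "0.98", "1"),
                                     chain = c("0", "1", "2")))
# proportion of its samples that each chain spends at each temperature
beta.prop = prop.table(beta.counts, margin = 2)
# largest departure from the uniform share 1/(number of temperatures)
max.dev = max(abs(beta.prop - 1 / nrow(beta.counts)))
print(round(beta.prop, 2))
print(max.dev)
```

Small deviations from the uniform share suggest that every chain visits every temperature about equally often; the same calculation could be applied to the beta.freqs table computed above.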
Fig. 3 Distribution of alignment lengths as a function of the temperature parameter
Fig. 4 Autocorrelation plots for the log likelihood and alignment length, sampled
using parallel tempering
As shown in Fig. 3, at higher temperature (lower inverse temperature)
values, the distribution of alignment lengths broadens, including much
longer alignments, indicating that varying the temperature is allowing
the MCMC sampler to explore a wider variety of configurations:
do.call(cbind,ali.length) %>%
boxplot(outline = FALSE,
names = beta.values,
xlab = expression("Inverse temperature ("*beta*")"),
ylab = "Alignment length",
cex.axis = 0.8)
Examining the autocorrelation for the log likelihood and alignment
length, we see that it has reduced significantly (Fig. 4), and the
effective sample size has increased more than eightfold. This indicates
that, as well as now moving between different regions of the space more
efficiently, we are now obtaining more information about the posterior
distribution than would be obtained from eight independent chains running
with beta = 1:
layout(t(1:2))
acf(log.likelihood$`1`)
acf(ali.length$`1`)
effectiveSize(mcmc(ali.length$`1`))
## ali.length
## 970.4661
The effective sample size for the insertion-deletion model parameters is
also very high, indicating good mixing across the parameter space:
tkf = coreModel %>% filter(beta==1) %>%
  select(-c(sample,beta))
effectiveSize(mcmc(tkf))
## R Lambda Theta
## 2588.848 5630.171 12228.428
The R parameter governs the geometric distribution on indel lengths, with
the expected fragment length equal to 1/R; λ is the insertion rate per
time unit (as defined by the substitution model, usually substitutions
per site), μ is the deletion rate, and θ = λ/(λ + μ) < 0.5 is the
insertion rate relative to the total indel rate [50].
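Since the output reports λ and θ rather than μ, the deletion rate can be recovered by rearranging the definition of θ given above. The following sketch shows the algebra; the helper name deletion.rate and the numerical inputs are made up for illustration.

```r
# Rearranging theta = lambda / (lambda + mu) gives
# mu = lambda * (1 - theta) / theta.
# deletion.rate is a hypothetical helper; the inputs are made-up values.
deletion.rate = function(lambda, theta) lambda * (1 - theta) / theta

lambda = 0.02
theta = 0.4
mu = deletion.rate(lambda, theta)
print(mu)
print(lambda / (lambda + mu))  # recovers theta
```

Applied to the posterior samples, e.g., deletion.rate(tkf$Lambda, tkf$Theta), this would give posterior samples of μ, assuming the column names shown in the effectiveSize output above.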
3.6 Running with Different Random Seeds to Assess Convergence
In addition to basic checks on the trace plots and autocorrelation
functions of individual parameters, a common approach for assessing
convergence is to run multiple MCMC samplers with different random seeds
or starting configurations. This can be accomplished
by rerunning StatAlign using a different value for the -seed argu-
ment, storing the output into a separate directory for each run. If
each of these samplers ends up sampling from the same posterior
distribution, it is a good indication that they have converged.
Agreement between the independent runs can be quantified more
rigorously via the Gelman-Rubin potential scale reduction factor
[51]. We will return to this when analyzing the output of
StructAlign.
3.7 Consensus Trees
The .tree files generated by StatAlign contain samples from the posterior
distribution over phylogenetic trees. To summarize this distribution, a
majority consensus tree can be generated by using StatAlign’s
ConsensusTree plugin, which can be called from the command line on the
MCMC output via the following command (see Note 5):
java -cp $STATALIGN_JAR \
statalign.postprocess.plugins.contree.CTMain \
$SEQ_DIR/$FASTA.chain{0..7}.tree \
> $SEQ_DIR/$FASTA.ctree
Fig. 5 Consensus tree for trees sampled under the sequence-only model
This command will create a .ctree file containing a Newick-formatted
consensus tree, which can then be examined in R using the ape package:
# read in the consensus tree created by StatAlign
consensus = read.tree(paste0(STATALIGN_HOME,
"/mpi_output/10_globins.fasta.ctree"))
# collapse splits with less than 60% posterior support into
# polytomies
source(paste0(STATALIGN_HOME,"/scripts/apply.min.support.R"))
consensus %<>% apply.min.support(tol=60)
root(consensus,"2oif") %>%
plot(show.node.label = TRUE,
use.edge.length = TRUE,
edge.color = "grey",
cex = 1.2,
show.tip.label = TRUE,
edge.width = 4,
no.margin = TRUE)
add.scale.bar(lwd=4,lcol="grey")
As shown in Fig. 5, the consensus tree contains a four-way polytomy,
indicating high uncertainty regarding the relative placement of the
hemoglobin, myoglobin, and cytoglobin/cyclostome clades.
3.8 Including Protein Structures
In order to better resolve the relationships between the different
clades, we can utilize the StructAlign plugin for StatAlign, which
models structural divergence using a continuous-time stochastic
process on C-alpha coordinates, combined with a Markov model
of sequence evolution [41]. The rate of structural divergence along
each branch is modeled via a diffusivity parameter, σ. To account for
non-evolutionary sources of structural variability (e.g., due to con-
formational flexibility, differences in experimental conditions, or
technical noise), each residue has an intrinsic baseline variability
parameterized based on the crystallographic B-factors (see Note 6).
StructAlign can read protein structure coordinates directly
from PDB files and can be run on the example globin dataset
using the following command (see Note 7):
pdb_files=(1bina.pdb 1lh1.pdb 1myt.pdb 1spga.pdb 1urv.pdb
2hhbA.pdb 2hhbB.pdb 2lhb.pdb 2mm1.pdb 2oif.pdb)
StatAlignParallel 8 \
-plugin:structal[printTree,printRmsd] \
-mcmc=1m,5m,200,0,20 \
-seed=1 \
-tempDiff=0.01 \
"${pdb_files[@]}"
This run takes approximately 7 h on a machine with eight 2.8 GHz Xeon
processors and generates the same output files as for the sequence-only
model, plus several additional files pertaining to the structural
component of the model (see Table 4). The printTree option causes the
.struct.tree file to be generated, which is useful for examining how
rates of structural drift vary over the tree. The printRmsd option
generates additional files that can be used to examine the relationship
between sequence and structural divergence. We will return to these files
in more detail later. The output files when running StructAlign will all
be prefixed by the name of the first PDB file in the list passed at the
command line, which in this case is “1bina.pdb”.
Table 4
Additional output files generated by StructAlign
File             Contents                                                                           Format
*.struct.params  Structural model parameters                                                        Tab-separated
*.struct.tree    Trees with branch lengths equal to structural diffusivity                          NEXUS
*.msd            Pairwise mean-squared deviation between each structure, for every MCMC sample      Tab-separated
*.mle.fasta      Maximum likelihood alignment                                                       FASTA
*.mle.rmsd       Per-site RMSD for MLE alignment                                                    Single-column
*.mle.bfactors   Per-site average B-factors (weighted by $\sqrt{3\varepsilon}$) for MLE alignment   Single-column
*.mle.super.pdb  Maximum likelihood structural superposition                                        PDB
Fig. 6 Comparison of posterior distributions for alignment length and TKF92 model parameters under the
sequence-only and sequence + structure models
As before, we can examine the effective sample size for the TKF92
parameters. The ESS values are now lower, since we are simultaneously
sampling a number of additional parameters for the structural model, but
they remain very high, indicating good mixing within the TKF92 component
of the model:
effectiveSize(mcmc(tkf.struc))
## R Lambda Theta
## 1528.826 3241.599 11782.182
Comparing the posterior distributions for the model parameters, we can
see that the structural model favors longer alignments, resulting in an
increase in the estimated values for R and λ (see Fig. 6):
comparison =
rbind(
cbind(`Alignment length`=ali.length.seq$`1`$ali.length,
tkf.seq,
Model=rep("seq",nrow(tkf.seq))
),
cbind(`Alignment length`=ali.length.struc$`1`$ali.length,
tkf.struc,
Model=rep("seq+struc",nrow(tkf.struc))
)
)
df.m = melt(comparison, id.var="Model")
ggplot(df.m, aes(x=variable, y=value)) +
geom_boxplot(aes(fill=Model)) +
facet_wrap( ~ variable, scales="free")
The same effect was noted by Herman et al. [41] and arises mainly from
differences in the alignment at loop regions: while the sequence-only
model tries to align the loops and encounters high uncertainty due to the
higher sequence divergence in these regions, the structural model tends
to favor indels when the structural divergence is significantly higher
than would be expected due to baseline variability alone.
3.9 Effect of Structural Information on Sequence Alignments
We can visualize the change in distribution over alignments in more
detail by computing a summary alignment annotated with associated
posterior probabilities for each column. To do so, we will utilize the
program WeaveAlign, which is distributed along with StatAlign. WeaveAlign
takes as input a file or files containing multiple alignments for the
same set of sequences and computes the summary alignment that maximizes
the expected accuracy under a chosen scoring scheme [7].
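The idea of column posterior probabilities and expected accuracy can be illustrated on a toy example. In the sketch below, the posterior probability of an alignment column is estimated as the fraction of sampled alignments that contain it, and each sampled alignment is scored by the mean posterior of its columns. The column encoding, the sample data, and the restriction to scoring only the sampled alignments are simplifying assumptions; WeaveAlign itself searches over recombinations of sampled columns rather than over the samples alone.

```r
# Three hypothetical sampled alignments of two sequences. Each column is
# encoded as "i|j": residue i of sequence 1 aligned to residue j of
# sequence 2, with "-" marking a gap.
samples = list(
  c("1|1", "2|-", "3|2"),
  c("1|1", "2|2", "3|-"),
  c("1|1", "2|-", "3|2")
)
# posterior probability of a column = fraction of samples containing it
counts = table(unlist(samples))
col.post = setNames(as.numeric(counts) / length(samples), names(counts))
# expected proportion of correct columns for each sampled alignment
expected.score = sapply(samples, function(a) mean(col.post[a]))
best = samples[[which.max(expected.score)]]
print(col.post)
print(best)
```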
When running StatAlign in parallel mode, the alignment
samples generated by each chain must be combined into a single
file before running WeaveAlign. This can be achieved using the
combine_logfiles.pl script distributed with StatAlign, run as
shown below (with the environment variables set to the appropriate
values):
scripts/combine_logfiles.pl $SEQ_DIR/$FASTA.chain{0..7}.log
scripts/combine_logfiles.pl $STRUC_DIR/$PDB.chain{0..7}.log
WeaveAlign can then be run on the combined logfile; here we will generate
an alignment that maximizes the expected total column score, i.e., the
proportion of correct columns in the alignment [7], using the commands
below:
java -jar WeaveAlign.jar -optgi -outgi \
$SEQ_DIR/$FASTA.combined.log
java -jar WeaveAlign.jar -optgi -outgi \
$STRUC_DIR/$PDB.combined.log
As well as a summary alignment file, WeaveAlign also generates a
graphical representation of the alignment, annotated with posterior
probabilities. As shown in Fig. 7, the addition of structural information
into the model results in decreased uncertainty for the majority of
columns, as indicated by the higher posterior probabilities (shown as a
blue line above the alignment). In addition, there are several regions
where the alignment changes significantly, which we will examine in
further detail.
WeaveAlign can be used to generate figures for specific
regions of the alignment, defined via column ranges or via sub-
sequences of a particular sequence. As an example, the region of
the alignment containing the first occurrence of the subsequence
AWEVAYDE (Ala128-Glu135) in the structure 1bin
(corresponding to the center of the H-helix) can be displayed
using the command below:
java -cp WeaveAlign.jar \
alignshow.Show \
-f imagefile.png \
-r=1bin,AWEVAYDE \
-t $SEQ_DIR/$FASTA.combined.log.scr \
$SEQ_DIR/$FASTA.combined.log.fsa
Under the sequence-only model, the three plant globins are
determined to have a length-6 indel in this region, whereas the
structural model lines up all the sequences in this region, with very
low uncertainty (see Fig. 8). Examining the corresponding struc-
tures in these regions (Fig. 9), we can see that the structural model
aligns contiguous regions, whereas the sequence-only model infers
an indel in the middle of the H-helix, which is a very unlikely
scenario.
Other notable differences in the structurally informed align-
ment occur in the region corresponding to residues Val82-Ala87
in 1bin, comprising the EF-loop region. Under the sequence-only
model, the three plant globins align with each other in this
region, with a length-3 indel relative to the other sequences,
corresponding to Val83-Asp85 in 1bin, Val85-Asp87 in 1lh1,
and Val93-Glu95 in 2oif. The leghemoglobin structure 1bin is
also inferred to have a shortened loop region relative to the other
plant globins, with a length-3 indel at the start of the F-helix,
between Ala87 and Leu89.
This is very similar to the alignment reported by Hoy et al. [53], who
proposed a length-4 indel between Ala87 and Leu89 in 1bin relative to the
2oif structure at the start of the F-helix, positing that this deletion,
along with others, may have led to reduced conformational flexibility,
causing the leghemoglobins to lose the ability to adopt a hexacoordinate
configuration with both distal and proximal histidines in coordination
with the heme.
Fig. 7 Summary alignment under sequence-only model (left) and structural model (right), annotated with
posterior probabilities for each column (blue lines)
Fig. 8 Zoomed-in view of two regions of the summary alignment corresponding to Ala128-Glu135 and Val82-
Ala87 in the sequence for 1bin (left and right panels, respectively), aligned under the sequence-only model
(left within each panel) and structural model (right within each panel)
Fig. 9 Aligned regions of helix H for leghemoglobin structure 1bin (green) and myoglobin structure 1myt (blue),
with the corresponding aligned regions shown in red using the sequence-only model (center), and structural
model (right). Figures generated using VMD [52]
In contrast, the structural model favors aligning 1bin:Val83-
Asp85 and 1lh1:Val85-Asp87 with the other eight globin
sequences, treating the start of the F-helix in 2oif as an indel, due
to the very high structural deviation observed in the region
between Thr92 and Thr97.
We can visualize the structural variability at this region via the
maximum likelihood structural superposition generated by StructAlign,
which is written to the .mle.super.pdb files (see Note 8). As shown in
Fig. 10, the structure 2oif exhibits a large deformation at the start of
the F-helix, with a number of side chains coming into contact with
residues within the E-helix, for example, Arg94 with Glu80. As discussed
by Hoy et al. [53], this deformation may help to stabilize the
conformation in which the exogenous ligand displaces His70.
Fig. 10 Maximum likelihood superposition for the ten globin structures in the
example dataset, oriented with the E-helix descending from top right to bottom
left, and the F-helix ascending vertically (left), and a view of the EF-loop section
of 2oif, including heme and cyanide ligand (right). Highlighted in red and green
on the left panel are sections of the structures 1bin and 2oif, illustrating the
large deformation that occurs in the latter at the start of the F-helix, which may
stabilize the ligand-bound conformation. Figures generated using VMD [52]
3.10 Posterior Distribution of Structural Model Parameters
As mentioned in the previous section, the large structural displacement
in the EF-loop region of 2oif relative to the other structures causes
StructAlign to treat Thr92-Thr97 as an indel. To understand how the model
makes this decision and how this affects parameter inference, we can
examine the individual parameters of the structural model in more detail.
The structural parameters can be read in from the
.struct.params output files in a manner similar to the TKF92
model parameters. Here we will illustrate reading in parameters
from four independent runs conducted with different random
number seeds, each executed in its own separate 'run_x' subdirectory, in
order to assess the consistency of the parameter estimates across runs:
# for each of four independent runs with different starting seeds
run = 1:4
# for each run, read in structural model parameters
struct.list =
lapply(run,
function(r) {
base = paste0(STRUC_DIR,"/run_",r,"/",PDB)
core =
lapply(chains,function(x)
fread(paste0(base,".chain",x,".coreModel.params"))
) %>%
do.call(rbind,.)
struct.params =
lapply(chains,function(x)
fread(paste0(base,".chain",x,".struct.params"))
) %>%
do.call(rbind,.) %>%
filter(core$beta==1) %>%
select(c(tau,eps,s2_g,nu))
return(struct.params)
}
)
Before analyzing the individual parameters, we can first assess the
reliability of the inference by comparing the posterior distributions
across runs. As shown in Fig. 11, the inferred posteriors are very
similar for all four global structural parameters, suggesting that the
chains have converged:
comparison =
lapply(run,
function(r) cbind(struct.list[[r]],Seed=r)
) %>%
do.call(rbind,.)
comparison$Seed = factor(comparison$Seed)
df.m = melt(comparison, id.var="Seed")
ggplot(df.m, aes(x=variable, y=value)) +
geom_boxplot(aes(fill=Seed)) +
facet_wrap( ~ variable, scales="free")
Fig. 11 Comparison of posterior distributions for structural model parameters with four different starting seeds
We can quantify this convergence more rigorously using the Gelman-Rubin
potential scale reduction factor (PSRF), which should be close to unity
for chains that have converged. As shown in Table 5, for each of the four
structural parameters, the upper bound for the PSRF is very close to
unity, indicating excellent convergence:
gelman = gelman.diag(mcmc.list(lapply(struct.list,mcmc)))$psrf
Table 5
Gelman-Rubin potential scale reduction factors for the structural model
parameters
Parameter  Point est.  Upper C.I.
tau        1           1.001
eps        1.009       1.028
s2_g       1           1.001
nu         1.001       1.002
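To show what the PSRF measures, the sketch below computes a basic version of the statistic for a single parameter from first principles, comparing the pooled variance estimate with the mean within-chain variance. The helper psrf.basic and the simulated runs are hypothetical; gelman.diag in coda applies additional corrections, so its point estimates will differ slightly from this simplified form.

```r
# Basic potential scale reduction factor for one parameter.
# chains: numeric matrix with one column per independent run.
# psrf.basic is a hypothetical helper, not the coda implementation.
psrf.basic = function(chains) {
  n = nrow(chains)
  W = mean(apply(chains, 2, var))        # mean within-chain variance
  B.over.n = var(colMeans(chains))       # variance of the chain means
  var.hat = (n - 1) / n * W + B.over.n   # pooled variance estimate
  sqrt(var.hat / W)
}

set.seed(1)
# four toy runs sampling the same distribution
agree = matrix(rnorm(4 * 2000), ncol = 4)
# the same runs with one chain shifted, mimicking a failure to converge
disagree = agree
disagree[, 4] = disagree[, 4] + 2
print(psrf.basic(agree))     # close to 1
print(psrf.basic(disagree))  # clearly above 1
```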
3.11 Interpreting the Structural Parameters
StructAlign models the structural variation at each residue as a
combination of a baseline non-phylogenetic term (usually parameterized
via crystallographic B-factors) and a branch-specific phylogenetic term
that governs the expected structural deviation as a function of sequence
variation [41]. The amount of non-phylogenetic baseline variance is
specified via the parameter ε, and the phylogenetic variance via the
branch-specific σ parameters. The prior distribution of the σ parameters
is governed by a global $\sigma_g$ parameter and a variance parameter, ν,
with the expected value for $\sigma_k^2$ for branch k given by
$\sigma_g^2 \exp(\nu/2)$. The prior distribution of atoms around the
center of mass is governed by the τ parameter, with the expected radius
of gyration equal to $\sqrt{3\tau}$. In our example, this can be computed
from the posterior samples via
struct = struct.list[[1]] # use only run 1 for this analysis

HPDinterval(mcmc(sqrt(3*struct[,'tau'])))
## lower upper
## tau 14.66535 16.48737
## attr(,"Probability")
## [1] 0.95
The top end of this range includes the value of 16.4 ± 0.2 reported by
Lobanov et al. [54] for all-alpha proteins of comparable size taken from
SCOP.
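The prior expectation quoted above for the branch-specific diffusivities has the same form as the mean of a log-normal distribution whose logarithm has mean equal to the log of the global diffusivity and variance ν. The following sketch checks this numerically under that assumption; treating the prior as log-normal, and the parameter values used, are assumptions made here for illustration, and the model description in [41] should be consulted for the actual prior.

```r
set.seed(1)
sigma2.g = 2.0   # hypothetical global diffusivity
nu = 0.5         # hypothetical log-scale variance
# assumed prior: log(sigma_k^2) ~ Normal(log(sigma_g^2), variance nu)
sigma2.k = exp(rnorm(200000, mean = log(sigma2.g), sd = sqrt(nu)))
simulated = mean(sigma2.k)
analytic = sigma2.g * exp(nu / 2)
print(c(simulated = simulated, analytic = analytic))
```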
The ε parameter acts as a multiplier on the baseline variance associated
with each alignment column (estimated via squared normalized B-factors),
with $B_i\sqrt{3\varepsilon}$ yielding the expected standard deviation
for site i arising from non-phylogenetic sources of structural
variability (including uncertainty in the structural superposition).
We can examine the correlation between these predicted values and
the observed per-site RMSD via three of the additional files generated
via the printRmsd option to StructAlign, i.e., the .mle.fasta
alignment file and the .mle.rmsd and .mle.bfactors files that
contain the RMSD and B-factor-based predictions, respectively.
When running in parallel mode, each chain again generates its own
version of these files; we will select the chain with the highest likeli-
hood MLE (in our example case, chain 7) for further analysis. Wea-
veAlign can again be used to generate an image with these
annotations plotted above the alignment. As shown in Fig. 12, the
correlation between predicted and observed values is very high, with
higher structural variability occurring mostly in the areas of high
alignment uncertainty in the loop regions (see Note 9):
java -cp WeaveAlign.jar \
alignshow.Show \
-t $STRUC_DIR/$PDB.chain7.mle.rmsd \
-t $STRUC_DIR/$PDB.chain7.mle.bfactors \
-c=RED -c=GREEN -png \
$STRUC_DIR/$PDB.chain7.mle.fasta
Fig. 12 Maximum likelihood alignment, annotated with predicted (green) and observed (red) per-site RMSD
The magnitude of σ has a more direct interpretation, since $\sigma^2 t$
is the expected mean-squared deviation along each of the three spatial
coordinates arising from phylogenetic structural drift over a time
interval of length t.
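This interpretation can be checked with a small simulation. The sketch below assumes pure Brownian motion, in which each coordinate of each atom is displaced by a normal deviate with variance equal to the diffusivity multiplied by the elapsed time; this is a simplification of the stochastic process used by StructAlign, and the parameter values are arbitrary.

```r
set.seed(42)
sigma2 = 0.5      # hypothetical diffusivity
branch.t = 0.8    # hypothetical time interval
n.atoms = 20000   # large, so that sample averages are stable
# under pure Brownian motion each coordinate is displaced by
# a Normal(0, sigma2 * branch.t) deviate
disp = matrix(rnorm(3 * n.atoms, sd = sqrt(sigma2 * branch.t)), ncol = 3)
msd.per.coord = colMeans(disp^2)        # each close to sigma2 * branch.t
rmsd.3d = sqrt(mean(rowSums(disp^2)))   # close to sqrt(3 * sigma2 * branch.t)
print(msd.per.coord)
print(c(rmsd.3d, sqrt(3 * sigma2 * branch.t)))
```

The factor of three that appears when converting a per-coordinate mean-squared deviation into a three-dimensional RMSD follows directly from summing the contributions of the x, y, and z coordinates.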
Previous studies have tried to model the relationship between
substitutions per site and RMSD via a global linear regression with
a fixed constant of proportionality. Using this approach, Illergård
et al. [29] derived a linear coefficient of 0.312 for the globin family.
In contrast, StructAlign models the MSD (squared RMSD) as
increasing linearly with sequence divergence, as implied by a diffu-
sion model, but permits the rate of structural divergence to vary in a
piecewise linear fashion across the tree, allowing for more complex
relationships between sequence and structure to be modeled. As we
shall see in the following section, the resulting fit between observed
and predicted pairwise RMSD is typically very good.
3.12 Effect of Structural Information on Consensus Trees
We can compute consensus trees on the StructAlign output as previously,
now running the ConsensusTree plugin both on the .tree files and the
.struct.tree files, in order to create consensus trees with branches
weighted by branch length and structural diffusivity ($\sigma^2$),
respectively:
java -cp StatAlign.jar \
statalign.postprocess.plugins.contree.CTMain \
$STRUC_DIR/$PDB.chain{0..7}.tree \
> $STRUC_DIR/$PDB.ctree
java -cp StatAlign.jar \
statalign.postprocess.plugins.contree.CTMain \
$STRUC_DIR/$PDB.chain{0..7}.struct.tree \
> $STRUC_DIR/$PDB.struct.ctree
Plotting the consensus tree as before, we can see that most of the
uncertainty present in the tree generated by the sequence-only model has
been resolved, with the cyclostome (2lhb) and cytoglobin (1urv) clade
placed next to the plant globins, and the myoglobins and hemoglobins
clustering closer to each other (see Fig. 13). This matches the
observations of Herman et al. [41] and Christensen et al. [55], who
inferred the same relationship between these clades based on analysis of
larger sets of sequences and structures.
Fig. 13 Consensus tree for trees sampled under the sequence+structure model, with branches scaled
according to branch length (left) and structural diffusivity (right)
Comparing the inferred branch lengths under the sequence
and sequence+structure models, most are of very similar length
under both models. We can examine this more closely by looking
both at the branch lengths at the tips of the tree and the pairwise
distances between leaves of the tree:
consensus.seq = read.tree(paste0(SEQ_DIR,"/10_globins.fasta.ctree"))
consensus.struc = read.tree(paste0(STRUC_DIR,"/1bina.pdb.ctree"))
n = length(consensus.struc$tip.label)
# map ordering of sequences in consensus.seq
# to that of consensus.struc
map = pmatch(consensus.struc$tip.label,consensus.seq$tip.label)
# extract edge lengths for branches leading to tips of tree

tip.edges.struc =
consensus.struc$edge.length[consensus.struc$edge[,2]<=n]
tip.edges.seq =
consensus.seq$edge.length[consensus.seq$edge[,2]<=n][map]
# combine into data.frame

df.tips = data.frame(name=consensus.struc$tip.label,
seq=tip.edges.seq,
struc=tip.edges.struc)
# find distances between leaves along edges of the tree

dist.seq = cophenetic(consensus.seq)
seqs = colnames(dist.seq)
# for consensus.struc we reorder rows and cols to match

dist.struc = cophenetic(consensus.struc)[seqs,seqs]
# extract off-diagonal elements and combine into data.frame

df.pw = data.frame(seq=dist.seq[lower.tri(dist.seq)],
struc=dist.struc[lower.tri(dist.struc)])
Plotting these quantities for the sequence-only versus
sequence+structure models, most values are seen to be very similar (see
Fig. 14), indicating that the structural model does not systematically
distort the estimated branch lengths. The tip branches leading to 2lhb
and 2oif are lengthened slightly in the sequence+structure model,
reflecting the fact that the structural model infers additional indels in
these sequences:
outliers = subset(df.tips,name %in% c('2oif','2lhb'))
# plot edge lengths for tip branches

p1 = ggplot(df.tips,aes(x=seq,y=struc)) +
geom_point() +
geom_point(data=outliers,shape=1,size=3) +
xlab("Tip branch length (seq-only)") +
ylab("Tip branch length (seq + struc)") +
geom_abline(intercept=0,slope=1) +
geom_text(data=outliers,aes(label=name),
size=4,hjust=1.5,vjust=1)
# plot pairwise distances
p2 = ggplot(df.pw,aes(x=seq,y=struc)) +
geom_point() +
xlab("Tree distance (seq-only)") +
ylab("Tree distance (seq + struc)") +
geom_abline(intercept=0,slope=1)
grid.arrange(p1,p2,ncol=2)
3.13 Variation of Rate of Structural Evolution Across the Tree
As discussed above, although inclusion of the structural model
significantly alters the distribution over alignments and trees, it does
not significantly impact the inferred branch lengths. This is due to the
fact that the model allows the rate of structural evolution to vary
independently along each branch, such that the branch lengths do not need
to adjust in order to reflect the structural deviation across the tree.
This allows the well-calibrated sequence model to be the primary
determinant of divergence times, while making use of the structural model
to improve homology inference.
Fig. 14 Left: branch lengths for branches leading to tips of the tree. Right: Pairwise tree distances
(in substitutions per site) between each pair of structures for the consensus tree computed under the
sequence-only and sequence + structure models
As shown in Fig. 13, the rate of structural evolution varies
significantly across the tree, with very low diffusivity parameters
observed within the myoglobin, α-hemoglobin and β-hemoglobin
clades, and larger rates inferred between clades. The low rates
within clades may reflect the need to preserve structure in order
to maintain function, whereas the increased rates between clades
may represent duplication/divergence events, leading to decreased
selective pressure on structure. A notable example of the latter
occurs at the branch between the hexacoordinate nonsymbiotic
plant globin, 2oif, and the two pentacoordinate leghemoglobins,
1bin and 1lh1 [56].
It is also notable that the structural diffusivity along the branch
to 2oif is very low, explaining why the model preferentially creates
an indel at the start of the F-helix, where the structural deviation is
much higher than would be expected given the level of sequence
divergence.
3.14 Pairwise Structural Diffusion Distance Is Linearly Related to RMSD
As discussed in the introduction, permitting the rate of structural
deviation to vary over the tree allows for more complex relationships
between sequence and structural divergence. In order to examine the model
fit, we can first use the MCMC output to estimate the expected RMSD
between each pair of structures, averaged over alignments and structural
superpositions. This can then be compared to the pairwise structural
diffusion distances using the consensus tree, corresponding to the
expected pairwise deviation under the StructAlign model.
We will focus here on one of the four runs, since all four
appeared to be very similar. First we read in the core model para-
meters again, in order to locate the cold chain at each iteration:
base = paste0(STRUC_DIR,"/run_1/",PDB)
core = lapply(chains, function(x)
  fread(paste0(base,".chain",x,".coreModel.params"))
) %>% do.call(rbind,.)
Next, we will read in the pairwise mean-squared deviation for each MCMC
sample, and compute the square root of the posterior mean, to yield an
estimate of the RMSD between each pair of structures:
rmsd =
lapply(chains,
function(x) fread(paste0(base,".chain",x,".msd"))
) %>%
do.call(rbind,.) %>%
filter(core$beta==1) %>% colMeans %>% sqrt
The distance between each leaf on the tree can then be com-
puted using the cophenetic function from ape:
d1 = cophenetic(consensus)
The structural diffusion distance under the StructAlign model can also be
computed this way, weighting each branch by the product of the edge
length and the structural diffusivity parameter, $\sigma_k^2$. The
expected mean-squared deviation under the model is then given by three
times this pairwise distance, which includes a contribution from x, y,
and z coordinates:
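# tree distance weighted by structural diffusivity
# (consensus.struct is assumed to hold the consensus tree read from the
# .struct.ctree file, with edges in the same order as in consensus)
ct = consensus
ct$edge.length = ct$edge.length * consensus.struct$edge.length
# convert from MSD in 1D to RMSD in 3D
d2 = sqrt(3*cophenetic(ct))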
We can now plot the two tree distances versus RMSD for each
pair of proteins:
# combine into a data.frame for plotting

st = do.call(rbind,strsplit(names(rmsd),"_"))
df = data.frame(rmsd,diffusion=d2[st],dist=d1[st])
p1 = ggplot(df,aes(x=dist,y=rmsd)) +
geom_point() +
xlab("Substitutions per site") +
ylab(expression(paste("Pairwise RMSD / ",ring(A)))) +

# relationship inferred by Illergard et al. (2009)
geom_abline(intercept=0.734,slope=0.312)
# non-phylogenetic variability (0.25 gives a good fit)

baseline.sd = 0.25
p2 = ggplot(df,aes(x=diffusion,y=rmsd)) +
geom_point() +
xlab("Weighted tree distance") +
ylab(expression(paste("Pairwise RMSD / ",ring(A)))) +
# relationship implied by StructAlign model
geom_abline(intercept=baseline.sd,slope=1)
grid.arrange(p1,p2,ncol=2)
As shown in Fig. 15, the estimated pairwise RMSD from the

structural model closely matches the empirically observed values
and is a much better fit to the data than a linear fit to sequence
distance in substitutions per site. This may explain why previous
studies have encountered difficulties modeling the relationship
between sequence divergence and structural variability using a
single linear model [29].
3.15 Distinguishing Structural Drift from Conformational Change
In the case of 2oif, much of the local structural deviation observed
in the EF-loop region can be attributed to the effect of binding the
exogenous cyanide ligand [53]. In contrast, a rice globin with very
similar sequence was crystallized in the hexacoordinate form [57]
and does not display this large deviation at the start of the F-helix.
Fig. 15 Structural deviation as a function of evolutionary distance computed using the tree distance, before
and after weighting by the branch-specific structural diffusivity parameters (left and right, respectively). The
line on the left plot shows the linear relationship inferred by Illergård et al. [29] for the globins; the line on the
right shows y = x + 0.25
Recent studies have highlighted how the existence of multiple

functional conformations may constrain sequence evolution [58];
inclusion of this information when parameterizing the baseline
variability modeled by StructAlign may help to further refine the
estimates of model parameters derived under the joint sequence-
structure model. In addition, modeling structural deviation via an
angle-based model rather than a coordinate-based model may
improve the ability to detect localized structural changes in cases
where structures cannot easily be superposed due to large confor-
mational changes [59, 60].
4 Notes
1. On Linux systems the JAR may need to first be made execut-

able by running
chmod a+x StatAlign.jar
2. There is currently no support for multiple chains; hence, any

ATOM entries corresponding to other chains will be inter-
preted as alternative conformations for the first chain. If alter-
native conformations are present in the PDB file, only the first
conformation will be used.
3. By default, the Dayhoff substitution model is used; this can be
changed via the -subst command-line argument to StatAlign.
The list of available substitution models can be obtained by
running
java -jar StatAlign.jar -list:subst
4. The default behavior is for all chains to output model parameters
at every sample point, with tree and alignment output
generated only when the temperature of the chain is equal to
unity, to simplify downstream processing. In order to save disk
space, the -reportOnlyColdChain=true argument can be
used, such that all postprocessing output will also be restricted
to the cold chain (t = 1).
5. The -start=N option to the ConsensusTree plugin can be
used to specify that the first N trees should be ignored when
computing the consensus. This is useful in cases where inspec-
tion of the log likelihood trace suggests that convergence has
not occurred by the end of the burn-in period, but occurs some
number of iterations thereafter.
6. For structures derived using NMR spectroscopy, a measure of
baseline coordinate variability can be obtained from the $S^2$
order parameters; this can be added as an additional column in
a .coor file for input to StructAlign (more details regarding
these .coor files can be found in the StatAlign documentation).
7. Note that for the analysis under the structural model we have
decreased the spacing between inverse temperature parameters
to 0.01, since the addition of the structural likelihood increases
the effect of changing the temperature parameter.
8. When running in parallel mode, each chain generates its own
.mle.super.pdb file. Typically we are most interested in the
superposition corresponding to the chain that samples the
maximum likelihood configuration, which can be determined
by inspecting the contents of the .ll files. The MLE superpo-
sition files are PDB-formatted files, with the aligned C-alpha
coordinates of each structure corresponding to a separate
chain. These aligned residues are numbered starting at 1;
hence, an offset may need to be applied in order to highlight
specific residues based on the original indexing.
9. The observed RMSD values for each column are computed
based on the aligned structures at each site, using a single
structural superposition, and constitute only a portion of the
non-phylogenetic variance explained via ε. Hence the predicted
baseline variability is typically larger than the empirical average
pairwise RMSD values.
References
1. Godzik A (1996) The structural alignment between two proteins: is there a unique answer? Protein Sci 5:1325–1338
2. Sela I, Ashkenazy H, Katoh K, Pupko T (2015) GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters. Nucleic Acids Res 43:W7–W14
3. Morrison DA, Ellis JT (1997) Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of apicomplexa. Mol Biol Evol 14:428–441
4. Ogden TH, Rosenberg MS (2006) Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol 55:314–328
5. Wong KM, Suchard MA, Huelsenbeck JP (2008) Alignment uncertainty and genomic analysis. Science 319:473–476
6. Lunter G, Rocco A, Mimouni N, Heger A, Caldeira A, Hein J (2008) Uncertainty in homology inferences: assessing and improving genomic sequence alignment. Genome Res 18:298–309
7. Herman JL, Novák Á, Lyngsø R, Szabó A, Miklós I, Hein J (2015) Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. BMC Bioinformatics 16:108
8. Nelesen S, Liu K, Zhao D, Linder CR, Warnow T (2008) The effect of the guide tree on multiple sequence alignments and subsequent phylogenetic analyses. In: Proceedings of the 2008 Pacific Symposium on Biocomputing. World Scientific. p 25–36
9. Lunter G, Drummond AJ, Miklós I, Hein J (2005) Statistical alignment: recent progress, new applications, and challenges. In: Statistical Methods in Molecular Evolution. Statistics for Biology and Health. Springer, New York, NY
10. Redelings BD, Suchard MA (2005) Joint Bayesian estimation of alignment and phylogeny. Syst Biol 54:401–418
11. Westesson O, Lunter G, Paten B, Holmes I (2012) Accurate reconstruction of insertion-deletion histories by statistical phylogenetics. PLoS One 7:e34572
12. Holmes IH (2017) Historian: accurate reconstruction of ancestral sequences and evolutionary rates. Bioinformatics 33:1227–1229
13. Redelings BD (2014) Erasing errors due to alignment ambiguity when estimating positive selection. Mol Biol Evol 31:1979–1993
14. Satija R, Pachter L, Hein J (2008) Combining statistical alignment and phylogenetic footprinting to detect regulatory elements. Bioinformatics 24:1236–1242
15. Satija R, Novák Á, Miklós I, Lyngsø R, Hein J (2009) BigFoot: Bayesian alignment and phylogenetic footprinting with MCMC. BMC Evol Biol 9:217
16. Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M, Wörheide G, Baurain D (2011) Resolving difficult phylogenetic questions: why more sequences are not enough. PLoS Biol 9:e1000602
17. Kumar S, Filipski AJ, Battistuzzi FU, Kosakovsky Pond SL, Tamura K (2012) Statistics and truth in phylogenomics. Mol Biol Evol 29:457–472
18. Talavera G, Castresana J (2007) Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol 56:564–577
19. Wu M, Chatterji S, Eisen JA (2012) Accounting for alignment uncertainty in phylogenomics. PLoS One 7:e30288
20. Gatesy J, DeSalle R, Wheeler W (1993) Alignment-ambiguous nucleotide sites and the exclusion of systematic data. Mol Phylogenet Evol 2:152–157
21. Lee MS (2001) Unalignable sequences and molecular evolution. Trends Ecol Evol 16:681–685
22. Löytynoja A, Goldman N (2008) Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320:1632–1635
23. Hasegawa H, Holm L (2009) Advances and pitfalls of protein structural alignment. Curr Opin Struct Biol 19:341–348
24. Johnson MS, Šali A, Blundell TL (1990) Phylogenetic relationships from three-dimensional protein structures. Methods Enzymol 183:670–690
25. Bujnicki JM (2000) Phylogeny of the restriction endonuclease-like superfamily inferred from comparison of protein structures. J Mol Evol 50:39–44
26. Lundin D, Poole AM, Sjöberg B-M, Högbom M (2012) Use of structural phylogenetic networks for classification of the ferritin-like superfamily. J Biol Chem 287:20565–20575
27. Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5:823
28. Panchenko AR, Wolf YI, Panchenko LA, Madej T (2005) Evolutionary plasticity of protein families: coupling between sequence and structure variation. Proteins 61:535–544
29. Illergård K, Ardell DH, Elofsson A (2009) Structure is three to ten times more conserved than sequence: a study of structural response in protein cores. Proteins 77:499–508
30. Echave J, Spielman SJ, Wilke CO (2016) Causes of evolutionary rate variation among protein sites. Nat Rev Genet 17:109–121
31. Worth CL, Gong S, Blundell TL (2009) Structural and functional constraints in the evolution of protein families. Nat Rev Mol Cell Biol 10:709–720
32. Gilson AI, Marshall-Christensen A, Choi J-M, Shakhnovich EI (2017) The role of evolutionary selection in the dynamics of protein structure evolution. Biophys J 112:1350–1365
33. Choi SC, Hobolth A, Robinson DM, Kishino H, Thorne JL (2007) Quantifying the impact of protein tertiary structure on molecular evolution. Mol Biol Evol 24:1769–1782
34. Kleinman CL, Rodrigue N, Lartillot N, Philippe H (2010) Statistical potentials for improved structurally constrained evolutionary models. Mol Biol Evol 27:1546–1560
35. Rodrigue N, Philippe H, Lartillot N (2006) Assessing site-interdependent phylogenetic models of sequence evolution. Mol Biol Evol 23:1762–1775
36. Sadowski M, Taylor W (2010) On the evolutionary origins of “fold space continuity”: a study of topological convergence and divergence in mixed alpha-beta domains. J Struct Biol 172:244–252
37. Rackovsky S (2015) Nonlinearities in protein space limit the utility of informatics in protein biophysics. Proteins 83:1923–1928
38. Sadreyev RI, Kim B-H, Grishin NV (2009) Discrete–continuous duality of protein structure space. Curr Opin Struct Biol 19:321–328
39. Holzgräfe C, Wallin S (2014) Smooth functional transition along a mutational pathway with an abrupt protein fold switch. Biophys J 107:1217–1225
40. Challis CJ, Schmidler SC (2012) A stochastic evolutionary model for protein structure alignment and phylogeny. Mol Biol Evol 29:3575–3587
41. Herman JL, Challis CJ, Novák Á, Hein J, Schmidler SC (2014) Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol 31:2251–2266
42. Novák Á, Miklós I, Lyngsø R, Hein J (2008) StatAlign: an extendable software package for joint Bayesian estimation of alignments and evolutionary trees. Bioinformatics 24:2403–2404
43. Burmester T, Ebner B, Weich B, Hankeln T (2002) Cytoglobin: a novel globin type ubiquitously expressed in vertebrate tissues. Mol Biol Evol 19:416–421
44. de Sanctis D, Dewilde S, Pesce A, Moens L, Ascenzi P, Hankeln T, Burmester T, Bolognesi M (2004) Crystal structure of cytoglobin: the fourth globin type discovered in man displays heme hexa-coordination. J Mol Biol 336:917–927
45. Hoffmann FG, Opazo JC, Storz JF (2010) Gene cooption and convergent evolution of oxygen transport hemoglobins in jawed and jawless vertebrates. Proc Natl Acad Sci U S A 107:14274–14279
46. Hoffmann FG, Opazo JC, Storz JF (2011) Differential loss and retention of cytoglobin, myoglobin, and globin-e during the radiation of vertebrates. Genome Biol Evol 3:588–600
47. Hoffmann FG, Opazo JC, Hoogewijs D, Hankeln T, Ebner B, Vinogradov SN, Bailly X, Storz JF (2012) Evolution of the globin gene family in deuterostomes: lineage-specific patterns of diversification and attrition. Mol Biol Evol 29:1735–1745
48. Geyer C (2011) Importance sampling, simulated tempering, and umbrella sampling. In: Brooks S, Gelman A, Jones G, Meng X (eds) Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC, Boca Raton, pp 295–311
49. Altekar G, Dwarkadas S, Huelsenbeck JP, Ronquist F (2004) Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics 20:407–415
50. Thorne JL, Kishino H, Felsenstein J (1992) Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 34:3–16
51. Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Stat Sci 7:457–472
52. Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14:33–38
53. Hoy JA, Robinson H, Trent JT, Kakar S, Smagghe BJ, Hargrove MS (2007) Plant hemoglobins: a molecular fossil record for the evolution of oxygen transport. J Mol Biol 371:168–179
54. Lobanov M, Bogatyreva N, Galzitskaia O (2008) Radius of gyration is indicator of compactness of protein structure. Mol Biol 42:701–706
55. Christensen AB, Herman JL, Elphick MR, Kober KM, Janies D, Linchangco G, Semmens DC, Bailly X, Vinogradov SN, Hoogewijs D (2015) Phylogeny of echinoderm hemoglobins. PLoS One 10:e0129668
56. Gupta KJ, Hebelstrup KH, Mur LA, Igamberdiev AU (2011) Plant hemoglobins: important players at the crossroads between oxygen and nitric oxide. FEBS Lett 585:3843–3849
57. Hargrove MS, Brucker EA, Stec B, Sarath G, Arredondo-Peter R, Klucas RV, Olson JS, Phillips GN (2000) Crystal structure of a nonsymbiotic plant hemoglobin. Structure 8:1005–1014
58. Sharir-Ivry A, Xia Y (2017) The impact of native state switching on protein sequence evolution. Mol Biol Evol 34:1378–1390
59. Maadooliat M, Zhou L, Najibi SM, Gao X, Huang JZ (2016) Collective estimation of multiple bivariate density functions with application to angular-sampling-based protein loop modeling. J Am Stat Assoc 111:43–56
60. Golden M, García-Portugués E, Sørensen M, Mardia KV, Hamelryck T, Hein J (2017) A generative angular model of protein structure evolution. Mol Biol Evol 34:2085–2100
Chapter 11
The Influence of Protein Stability on Sequence Evolution: Applications to Phylogenetic Inference
Ugo Bastolla and Miguel Arenas
Abstract
Phylogenetic inference from protein data is traditionally based on empirical substitution models of evolu-
tion that assume that protein sites evolve independently of each other and under the same substitution
process. However, it is well known that the structural properties of a protein site in the native state affect its
evolution, in particular the sequence entropy and the substitution rate. Starting from the seminal proposal
by Halpern and Bruno, where structural properties are incorporated in the evolutionary model through
site-specific amino acid frequencies, several models have been developed to tackle the influence of protein
structure on sequence evolution. Here we describe stability-constrained substitution (SCS) models that
explicitly consider the stability of the native state against both unfolded and misfolded states. One of them,
the mean-field model, provides an independent sites approximation that can be readily incorporated in
maximum likelihood methods of phylogenetic inference, including ancestral sequence reconstruction.
Next, we describe its validation with simulated and real proteins and its limitations and advantages with
respect to empirical models that lack site specificity. We finally provide guidelines and recommendations to
analyze protein data accounting for stability constraints, including computer simulations and inferences of
protein evolution based on maximum likelihood. Some practical examples are included to illustrate these
procedures.
Key words Stability-constrained substitution models, Mean-field substitution model, Protein folding
stability, Protein evolution, Ancestral protein reconstruction
1 Introduction
Mathematical models of protein evolution not only improve our

understanding of the evolutionary process [1] but also have practi-
cal applications such as the design of therapies [2–4] and novel
enzymatic properties [5–7]. Traditional substitution models of
protein evolution are based on empirical amino acid substitution
matrices such as JTT, WAG, or HIVb (see [8] for a review). How-
ever, these models assume that protein sites evolve independently of
each other under the same substitution process, while it is well
known that natural selection targets the structure and the stability
of the native state of the protein, which is achieved through physical
interactions between amino acids at different sites (for a review see

[9–12]). This suggests that protein evolution models must explic-
itly represent the selective constraints on the structure and the
stability of the native state.
A variety of models have been developed for this purpose. They
belong to two main classes, depending on how selective constraints
are implemented. The first group is that of stability-constrained
protein evolution models (for a review see [13–15]), which put the
focus on the stability of the native state. These models attempt to
estimate the folding free energy ΔG of the native state of a mutated
protein under the assumption that the mutation maintains the
coarse-grained structure of the native state (typically, the contact
matrix) and changes the interactions in the native and non-native
states (although only few models consider non-native states). The
fitness of the mutant is modeled as the fraction of correctly folded
protein, $f = 1/(1 + e^{\Delta G/RT})$. The second class corresponds to structurally
constrained models of protein evolution, introduced by Julian
Echave [16], which estimate the structural change due to the
mutation through elastic network models (ENMs) [17] and
model the fitness as a function of the structural distance between
the mutated structure and a target structure. In these models the
change in stability is neglected, in part because ENMs do not allow
it to be estimated. The selective importance of the native structure,
assumed by structurally constrained models, is justified by the fact
that the precise native structure determines the functional dynamics
of the protein, as ENM-based studies have shown [18]. Under
realistic situations, mutations modify both the structure and the
stability of the native state, and both of them are targeted by natural
selection. However, for the moment, there are no models that try
to estimate the effect of both changes on the protein fitness.
In this chapter we focus our attention on stability-constrained
substitution (SCS) models, which have been investigated in computer
simulations for many years [19, 20]. However, these models
are not yet well established in current methods for phylogenetic
inference, mainly because they require the computation of protein
stability, which involves physical interactions and induces probabilistic
dependencies between protein sites. In contrast, current software
for phylogenetic inference computes the likelihood of the
observed data under the assumption that protein sites evolve independently
of each other. Note that likelihood computations are necessary
for selecting the best-supported substitution model, which is a
crucial step for phylogenetic inference [21, 22]. Consequently, we
believe that it is urgent not only to design novel models of protein
evolution but also to implement these models in software tools
useful for the community.
Unfortunately, substitution models with dependencies
between sites increase enormously the complexity of likelihood
computations, so that they can be managed only through sampling
methods such as Monte Carlo [23, 24] that have inherent limita-
tions in computer efficiency and may get trapped in local maxima.
An alternative is to derive a model with independent sites that
effectively enforces stability constraints, in the spirit of mean-field
models from physics. Of course, the resulting model will be less
realistic than a model with dependencies between sites but still may
represent real data better than empirical substitution models that
neglect stability constraints. One of these models is the mean-field
model (MF) [25, 26].
In the present chapter we describe models of protein evolution
with explicit stability constraints and models that effectively incor-
porate these constraints into site-independent substitution matri-
ces, highlighting their implementation in phylogenetic frameworks.
We also present some applications of these frameworks in the
simulation and evolutionary analysis of diverse protein data. We
finally provide guidelines and recommendations to use the pre-
sented frameworks.
2 Simulation of Protein Evolution with Stability Constraints
Models of protein evolution that explicitly represent selection on

protein stability allow for describing the coevolution between pro-
tein sites [27, 28]. Here we describe two of these models, designed
in our lab, that have been implemented in a phylogenetic frame-
work to simulate the evolution of protein sequences.
2.1 Modeling Protein Evolution with Stability Constraints
The thermodynamic model adopted in the simulator of protein
sequence evolution ProteinEvolver [29] estimates the stability of
the native state not only against the unfolded state but also against
compact, wrongly folded conformations (misfolded states) that are
usually neglected in other models of protein stability. The characteristics
of protein sequences that weaken the stability of frequently
formed misfolded conformations are referred to as negative design.
Its evolutionary importance has been recognized through statistical
analysis of protein sequences [29], and it has been proposed to have
important evolutionary consequences (for a review see [15, 30]).
The stability of the native state is estimated from the contact
matrix representation of one native structure in the Protein Data
Bank (PDB), with $C_{ij} = 1$ if any two atoms in residues i and j are closer
than 4.5 Å and 0 otherwise:
$$G_{\mathrm{nat}}(C^{\mathrm{nat}}, A) = \sum_{ij} C_{ij}^{\mathrm{nat}} \, U(A_i, A_j) \qquad (1)$$
where $A_i$ is the amino acid at site i (for instance leucine), $C^{\mathrm{nat}}$ is the
native contact matrix, and U(a, b) are the 210 contact interaction
parameters derived in [31]. Contacts with $|i - j| < 4$ are not
considered because they are formed in almost all alternative conformations.
The conformational entropy of the native state is not
considered because it is assumed to be almost the same as in other
misfolded conformations. The folding free energy is computed as:

$$\Delta G = G_{\mathrm{nat}}(C^{\mathrm{nat}}, A) + kT \ln\left( e^{-G_{\mathrm{misf}}/kT} + e^{-G_{\mathrm{unf}}/kT} \right) \qquad (2)$$
where $G_{\mathrm{unf}}/kT = -L S_{\mathrm{unf}}$ is the free energy of the unfolded state, considered
independent of the protein sequence of L residues, and $G_{\mathrm{misf}}$ is
the free energy of the misfolded state, which depends on the
sequence. It is expected that the free energy of unfolding is negligible
with respect to misfolding for hydrophobic sequences and long
proteins, since $G_{\mathrm{misf}}$ is expected to increase faster than L with the
number of residues. $G_{\mathrm{misf}}$ is estimated from a statistical mechanical
model of the misfolded state as [32]:
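$$G_{\mathrm{misf}}(A) = \sum_{ij} \langle C_{ij} \rangle U_{ij} - \frac{1}{2kT} \sum_{ijkl} \left( \langle C_{ij} C_{kl} \rangle - \langle C_{ij} \rangle \langle C_{kl} \rangle \right) U_{ij} U_{kl} - kT L S_c + \frac{1}{6(kT)^2} \sum_{ijklmn} \left\langle (C_{ij} - \langle C_{ij} \rangle)(C_{kl} - \langle C_{kl} \rangle)(C_{mn} - \langle C_{mn} \rangle) \right\rangle U_{ij} U_{kl} U_{mn} \qquad (3)$$
where $U_{ij} = U(A_i, A_j)$, $\langle C_{ij} \rangle$ represents the frequency of contacts
between residues at sequence distance $|i - j|$ in compact
structures of L residues, and $\langle C_{ij} C_{kl} \rangle$ represents contact correlations,
which are precomputed from a representative subset of the PDB.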
The program DeltaGREM computes stabilities ΔG for sequence-
structure pairs in the PDB and a list of user-supplied mutations or
for multiple sequence alignments that include the PDB sequence. It
is freely available from https://ub.cbm.uam.es/index.php.
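The native-state energy and the folding free energy described above can be illustrated with a minimal Python sketch. It is not the DeltaGREM implementation: the function names, the toy coordinates, and the single-entry interaction table are hypothetical, the sum in Eq. 1 is assumed to run once over each residue pair (i < j), and the unfolded-state term is assumed to enter as G_unf = -kT L S_unf. The misfolded free energy of Eq. 3 is taken as a precomputed input.

```python
import math

def contact_matrix(residue_atoms, cutoff=4.5, min_separation=4):
    """residue_atoms[i] is a list of (x, y, z) atom coordinates (angstroms) of residue i.
    Two residues are in contact if any pair of their atoms is closer than cutoff;
    pairs with |i - j| < min_separation are ignored, as in the text."""
    n = len(residue_atoms)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if any(math.dist(a, b) < cutoff
                   for a in residue_atoms[i] for b in residue_atoms[j]):
                contacts[i][j] = contacts[j][i] = 1
    return contacts

def native_energy(contacts, sequence, U):
    """Eq. 1, counting each pair once (assumption). U maps a sorted residue pair to an energy."""
    g = 0.0
    for i in range(len(sequence)):
        for j in range(i + 1, len(sequence)):
            if contacts[i][j]:
                g += U[tuple(sorted((sequence[i], sequence[j])))]
    return g

def folding_free_energy(g_nat, g_misf, s_unf, length, kT=1.0):
    """Eq. 2, with G_unf = -kT * L * S_unf (sign assumed); log-sum-exp avoids overflow."""
    g_unf = -kT * length * s_unf
    a, b = -g_misf / kT, -g_unf / kT
    m = max(a, b)
    return g_nat + kT * (m + math.log(math.exp(a - m) + math.exp(b - m)))

# Toy input (hypothetical): six single-atom residues spaced 1 angstrom apart on a line
toy_atoms = [[(float(i), 0.0, 0.0)] for i in range(6)]
toy_contacts = contact_matrix(toy_atoms)
toy_U = {("L", "L"): -1.0}  # placeholder value, not one of the 210 parameters of [31]
toy_g_nat = native_energy(toy_contacts, "LLLLLL", toy_U)
toy_dg = folding_free_energy(toy_g_nat, g_misf=-1.0, s_unf=0.0, length=6)
```

In this toy chain only the pairs (0, 4) and (1, 5) satisfy both the distance cutoff and the sequence-separation rule, so the native energy is the sum of two contact terms.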
Given the estimate of ΔG, two alternative models are used to
compute the acceptance probability of a mutation.
1. In the neutral model, all sequences with ΔG < ΔGthr are con-
sidered viable and equally fit, and all other sequences are elimi-
nated by negative selection. The threshold is chosen as 98% of
the ΔG of the sequence in the PDB, so that this sequence would
be selected and less stable sequences would be discarded.
2. In the fitness model, the fitness of the protein sequence is computed
as the fraction of the folded protein:
$$f = \frac{1}{1 + e^{\Delta G/kT}} \qquad (4)$$
Note that, for low temperature and large protein sequences, f tends to
a step function, f = 1 if ΔG < 0 and f = 0 otherwise. This
binary fitness function enforces neutral evolution that is unable to
distinguish between proteins whose ΔG has the same sign. In general,
if the starting sequence has fitness $f_{\mathrm{wt}}$ and the mutated
sequence has fitness $f_{\mathrm{mut}}$, the acceptance probability is computed
as the fixation probability in a population of N individuals:
$$P_{\mathrm{fix}}(f_{\mathrm{mut}}, f_{\mathrm{wt}}, N) = \frac{(f_{\mathrm{wt}}/f_{\mathrm{mut}}) - 1}{(f_{\mathrm{wt}}/f_{\mathrm{mut}})^N - 1} \qquad (5)$$
As extensively discussed elsewhere [33, 34], the above fitness
function establishes a formal analogy between molecular evolution
and statistical physics, in the sense that an evolving population
(in the limit of very low mutation rate, in which Eq. 5 is valid,
under an unbiased mutation process) reaches a Boltzmann-like
distribution in sequence space, $P(A_1 \ldots A_L) \propto e^{N \ln f(A)}$, in which
$-\ln f(A)$ plays the role of energy (sequences with higher fitness are
more frequently found) and 1/N plays the role of evolutionary
temperature (small populations are more tolerant to slightly deleterious
mutations and attain lower fitness). This analogy with statistical
mechanics plays a key role in the development of the MF model.
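The two acceptance rules can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the ProteinEvolver code: the function names are hypothetical, the fixation probability is taken in the form ((f_wt/f_mut) - 1) / ((f_wt/f_mut)^N - 1), and the wild-type fitness is assumed to be strictly positive.

```python
import math

def folded_fraction(delta_g, kT=1.0):
    """Eq. 4: fitness as the fraction of folded protein, f = 1 / (1 + exp(dG / kT))."""
    x = delta_g / kT
    if x > 700.0:  # exp(x) would overflow; f is numerically zero
        return 0.0
    return 1.0 / (1.0 + math.exp(x))

def accept_neutral(delta_g_mut, delta_g_thr):
    """Neutral model: a mutant is viable (probability 1) only if dG < dG_thr."""
    return 1.0 if delta_g_mut < delta_g_thr else 0.0

def fixation_probability(f_mut, f_wt, n):
    """Eq. 5 for a population of n individuals, evaluated with expm1 for accuracy."""
    if f_mut <= 0.0:
        return 0.0
    x = math.log(f_wt / f_mut)
    if x == 0.0:  # equal fitness: neutral limit 1/n
        return 1.0 / n
    if n * x > 700.0:  # strongly deleterious: denominator overflows, probability ~ 0
        return 0.0
    return math.expm1(x) / math.expm1(n * x)

# Example with hypothetical values: threshold at 98% of the stability of the PDB sequence
delta_g_pdb = -5.0
delta_g_thr = 0.98 * delta_g_pdb
```

With equal fitness values the function returns the neutral value 1/N, while a fitter mutant fixes with a probability that approaches 1 - f_wt/f_mut for large N.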
2.2 Implementation in the Computer Simulator ProteinEvolver
We implemented these SCS models in the computer program
ProteinEvolver. This framework simulates protein sequence evolution
along phylogenetic trees. Note that computer simulations are very
useful in population genetics and evolution for hypothesis testing,
validation of analytical methods, model selection, and estimation of
evolutionary parameters [35, 36].
ProteinEvolver implements the following steps. First, a phylo-
genetic tree is either specified by the user or is internally simulated
under the coalescent model [37] extended with recombination
(including recombination hotspots following Posada and Wiuf
[38] and an adaptation of the intracodon recombination algorithm
[39, 40] to simulate protein evolution with recombination (see
Fig. 1)), demographics (population growth rate and demographic
periods), longitudinal sampling, and user-specified population
structure with migration [41, 42]. Second, a protein sequence is
assigned to the most recent common ancestor (MRCA), or grand
MRCA (GMRCA) if recombination is simulated, and is evolved
forward in time, from the root to the tip nodes, along the phylog-
eny (Fig. 1) [43]. The number of simulated substitution events
depends on the branch lengths, and the kind of simulated substitu-
tions depends on the applied substitution model of evolution. In
addition to the SCS substitution models described above, Protei-
nEvolver implements a variety of empirical substitution models of
protein evolution. ProteinEvolver is freely available from https://
github.com/MiguelArenas/proteinevolver.
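The root-to-tip step can be pictured with a toy Python sketch for a tree without recombination. It is not the ProteinEvolver algorithm: the nested-dictionary tree format, the function names, and the choice of proposing point mutations at a rate of one per site per unit branch length are all assumptions made for illustration. The acceptance rule is passed in as a function, so either the neutral or the fitness model could be plugged in.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def evolve_branch(seq, branch_length, accept_prob, rng):
    """Propose point mutations along one branch (toy rate: len(seq) proposals per unit length)
    and accept each one with probability accept_prob(current_sequence, mutant_sequence)."""
    seq = list(seq)
    t = rng.expovariate(len(seq))
    while t < branch_length:
        site = rng.randrange(len(seq))
        mutant = seq.copy()
        mutant[site] = rng.choice([a for a in AMINO_ACIDS if a != seq[site]])
        if rng.random() < accept_prob("".join(seq), "".join(mutant)):
            seq = mutant
        t += rng.expovariate(len(seq))
    return "".join(seq)

def simulate(tree, root_seq, accept_prob, rng):
    """tree is a nested dict {'name', 'length', 'children'} (hypothetical format).
    Returns a dict mapping every internal and tip node name to its sequence."""
    sequences = {}

    def visit(node, parent_seq):
        seq = evolve_branch(parent_seq, node.get("length", 0.0), accept_prob, rng)
        sequences[node["name"]] = seq
        for child in node.get("children", []):
            visit(child, seq)

    visit(tree, root_seq)
    return sequences

toy_tree = {"name": "root", "length": 0.0,
            "children": [{"name": "A", "length": 0.5}, {"name": "B", "length": 0.5}]}
```

Calling simulate(toy_tree, "ACDEFG", accept, random.Random(0)) with an accept function that always returns 0 leaves every node with the root sequence, which is a quick sanity check on the traversal.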
Fig. 1 Illustrative example of the recursive algorithm to simulate protein evolution along an ancestral
recombination graph based on two recombination events. White and gray circles correspond to coalescence
and recombination nodes, respectively. (1) The evolution starts from the GMRCA node; the protein is evolved
along branches according to the SCS substitution model and the branch lengths. (3) The simulation reaches a
recombinant node and because its parental recombinant node has not been assigned to a protein yet, the
evolutionary process continues toward other direction (4). (6) The simulation reaches a parental recombinant
node, and because its parental has already been assigned to a protein, (7) the simulation combines the two
proteins according to the recombination breakpoint at position 3. (9) Another recombinant node is reached,
and because its parental node has not been reached yet, a protein is assigned to this node and the simulation
continues in the other direction (10). (11) The parental node is reached, and (12) the recombinant fragments
are combined according to the recombination breakpoint at position 4. At the end of the process, a sequence
was simulated for every internal and tip node
2.3 SCS Models Outperform Empirical Substitution Models in Terms of Distribution of Frequencies Among Sites and Maximum Likelihood
We tested whether the SCS models improve on the results obtained with
traditional empirical substitution models by analyzing ten protein
families (photoactive yellow proteins, triosephosphate isomerases,
rubredoxins, kinesins, phage lysozymes, ferredoxins, DNA ligases,
heat shock proteins, oxysterol-binding proteins, and retroviral
aspartyl proteases) [29]. For each protein family, we downloaded
from the Pfam database a multiple sequence alignment (MSA),
together with its associated phylogenetic tree and a representative
protein structure deposited in the PDB. We also selected the best-fitting
empirical amino acid substitution model with ProtTest [44].
We then performed 200 simulations of protein evolution along

the Pfam tree under the best-fitting empirical amino acid substitu-
tion model, the neutral SCS model, and the fitness SCS model.
Using the Kullback-Leibler (KL) divergence [45], we found
that the simulated amino acid distributions based on the SCS
models were closer to the real distribution than the simulated
amino acid distributions based on the empirical substitution
model [29]. We also found that the neutral SCS model was robust
in generating similar results under different thermodynamic fea-
tures, while the fitness SCS model was more dependent on the
thermodynamic parameters.
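A comparison of this kind can be sketched as follows. The snippet is a generic illustration of the KL divergence between two amino acid distributions, not the analysis pipeline of [29]: the function names, the pseudocount, and the choice to compare single alignment columns are assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_distribution(column, pseudocount=1.0):
    """Relative amino acid frequencies in an alignment column (gaps and unknowns ignored).
    The pseudocount keeps every frequency positive so that the divergence stays finite."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def kl_divergence(p, q):
    """KL(p || q) = sum_a p(a) * log(p(a) / q(a)), in nats."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0.0)

# Hypothetical columns: one from a real alignment, one from a simulated alignment
observed = amino_acid_distribution("LLLLIV")
simulated = amino_acid_distribution("LLIIVA")
divergence = kl_divergence(observed, simulated)
```

A smaller divergence indicates that the simulated frequencies are closer to the observed ones; identical distributions give exactly zero.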
In addition, protein structures reconstructed with homology
modeling [46] from simulations under the SCS models generated a
better sequence-structure pair (protein folding stability closer to
that from the real protein) than proteins simulated with empirical
amino acid substitution models [29]. Altogether, substitution
models that consider protein stability provide a better approxima-
tion of the real evolutionary process.
3 The Mean-Field Substitution Model Accounts for Stability Constraints While Adopting Independent Sites
3.1 The Mean-Field Model
As we discussed above, modeling selection on protein stability
induces dependencies between protein sites that make the computation
of the likelihood function extremely cumbersome. A possible
shortcut consists in developing a model with independent sites
that effectively enforces protein stability, in the spirit of mean-field
models in statistical physics. Specifically, we assume independence
between sites, $P(A_1 \ldots A_L) = \prod_i P_i(A_i)$, and we compute the site-specific
amino acid distributions $P_i(A_i)$ obtained when the evolutionary
process becomes stationary, imposing two conditions:
1. The stationary distribution has minimum KL divergence with
respect to the site-unspecific distribution Pmut(a) obtained
under mutation alone, i.e., we minimize
    \sum_i \sum_{A_i} P_i(A_i) \left[ \log P_i(A_i) - \log P^{mut}(A_i) \right].
2. This minimization is performed for a given average value of the
fitness or, which is equivalent, for a given value of the average
stability attained in the sequence ensemble,
    \sum_{A_1 \ldots A_L} \prod_i P_i(A_i) \, \Delta G(C_{nat}, A_1 \ldots A_L) = X,
where ΔG is computed with Eqs. 1–3 considering stability against
both unfolding and misfolding.
These conditions are analogous to those that determine the
Boltzmann distribution, which is the maximum entropy distribution
(i.e., the distribution with minimum KL divergence with respect to
the uniform distribution) for a given average value of the energy.
The constrained minimization is performed through the global
Lagrange multiplier Λ that represents natural selection, analogous
to 1/T in a physical system. The main difference between a physical
system and molecular evolution is that the reference distribution is
not the uniform distribution, but it is the distribution induced by
mutation alone. The final site-specific amino acid distributions are
given by:
    P_i(A_i) = \frac{P^{mut}(A_i) \, e^{\Lambda \phi_i(A_i)}}{\sum_a P^{mut}(a) \, e^{\Lambda \phi_i(a)}} \qquad (6)
Given Pmut(a) and Λ, the site-specific selective factors ϕi(a) are
computed recursively without any new free parameter in a time that
increases only quadratically with the number of residues L.
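As an illustration of Eq. 6, the following minimal Python sketch reweights a mutational distribution by site-specific selective factors. The function name, the three-letter toy alphabet, and the numerical values of Pmut and ϕi are hypothetical and chosen only for readability; in ProtEvol the ϕi(a) are derived from the protein structure, which is not reproduced here. The sketch assumes the sign convention in which a larger ϕi(a) increases the frequency of amino acid a.

```python
import math

def site_specific_distribution(p_mut, phi_i, lam):
    """Site-specific amino acid distribution in the form of Eq. 6.

    p_mut: dict amino acid -> mutational frequency Pmut(a)
    phi_i: dict amino acid -> selective factor phi_i(a) at site i
    lam:   selection parameter Lambda (lam = 0 means mutation alone)
    """
    weights = {a: p_mut[a] * math.exp(lam * phi_i[a]) for a in p_mut}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}

# Hypothetical toy values for a single site.
P_MUT = {"L": 0.5, "K": 0.3, "G": 0.2}
PHI_I = {"L": 1.0, "K": -1.0, "G": 0.0}

neutral = site_specific_distribution(P_MUT, PHI_I, 0.0)
selected = site_specific_distribution(P_MUT, PHI_I, 2.0)
```

With Λ = 0 the function returns Pmut unchanged, which reflects the role of the mutational distribution as the reference; increasing Λ concentrates the distribution on the amino acids with the largest ϕi(a).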
For a given Pmut(a), the selective parameter Λ is computed by
maximizing the likelihood of the sequence in the PDB with respect
to the model, i.e., we maximize ∑i log(Pi(A_i^PDB, Λ)). For the
choice of the mutational distribution, the following options are
left to the user.
(a) Each frequency Pmut(a) is identified with the frequency of the
amino acid a in the PDB sequence or in a user-supplied MSA that
contains the PDB sequence.
(b) The Pmut(a) are obtained from a mutation model at the
nucleotide level, with three parameters that represent the
equilibrium nucleotide frequencies and a fourth one that
represents the transition to transversion ratio. The stationary
distributions of the 61 sense codons (excluding stop codons, which
have fixation probability equal to zero) are computed, and the
amino acid distribution is obtained by summing over the codons of
each amino acid. The four mutational parameters are optimized by
maximizing the likelihood of the observed amino acids, i.e.,
∑a n(a) log(Pmut(a, μ)), where μ denotes the mutational
parameters.
(c) Pmut(a) is the mean between (a) and (b).
(d) Pmut(a) is obtained as in (b), but with mutational parameters
that are input by the user. In particular, in this way we can fix
the value of the nucleotide frequencies and, consequently, the
hydrophobicity of the protein sequence, which is correlated with
the thymine (T) content since T at the second codon position is
almost only found in codons that code for hydrophobic amino acids.
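The nucleotide-level option can be sketched as follows. This is an illustration under a stated assumption, not the ProtEvol implementation: we take the stationary frequency of each sense codon to be proportional to the product of the equilibrium frequencies of its three nucleotides, which holds for a reversible single-nucleotide mutation process from which stop codons are excluded. Under that assumption the transition to transversion ratio does not affect the stationary distribution, so it does not appear below. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def mutational_amino_acid_distribution(nuc_freq):
    """Pmut(a) from equilibrium nucleotide frequencies.

    Assumption of this sketch: sense-codon weights are products of
    nucleotide frequencies, renormalized after dropping stop codons.
    """
    weights = {}
    for codon, aa in GENETIC_CODE.items():
        if aa == "*":  # stop codons have zero fixation probability
            continue
        w = nuc_freq[codon[0]] * nuc_freq[codon[1]] * nuc_freq[codon[2]]
        weights[aa] = weights.get(aa, 0.0) + w
    total = sum(weights.values())
    return {aa: w / total for aa, w in weights.items()}

HYDROPHOBIC = "FLIMV"  # amino acids encoded by codons with T at position 2

uniform = mutational_amino_acid_distribution({"T": 0.25, "C": 0.25, "A": 0.25, "G": 0.25})
t_rich = mutational_amino_acid_distribution({"T": 0.40, "C": 0.20, "A": 0.20, "G": 0.20})
hydrophobic_uniform = sum(uniform[a] for a in HYDROPHOBIC)
hydrophobic_t_rich = sum(t_rich[a] for a in HYDROPHOBIC)
```

Comparing hydrophobic_uniform with hydrophobic_t_rich illustrates the link between T content and hydrophobicity mentioned above.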
Adopting the MF model, we verified the evolutionary importance of
selection against misfolded conformations by comparing results
obtained with the full model with results obtained with a reduced
model in which stability against misfolding is not implemented,
i.e., ΔG = Gnat(Cnat, A) + kT L Sunf. We call this reduced model
the native model. This model is inferior to the full model in
several respects: (i) if misfolding is not considered, the resulting
sequences are on average more hydrophobic than sequences in the
PDB; (ii) in particular, exposed sites with few contacts are more
hydrophobic than observed, indicating that it is negative design
against misfolding that acts to limit the hydrophobicity of exposed
sites; (iii) the likelihood of observed sequences is much higher
with the full model than with the native model; (iv) the average
folding free energy (taking into account both unfolded and
misfolded states) is negative with the full model but positive with
the reduced model, i.e., the sequences produced with the reduced
model are not stable. These results confirm that it is important to
impose stability against misfolding in SCS models.
3.2 The Wild-Type (WT) Model
Another possibility to develop a model with independent sites that
implements stability constraints consists in computing the effect on
stability and fitness of any possible mutation at site i starting
from the wild-type sequence. We thus compute site-specific amino
acid frequencies from Eq. 6 with
ϕi(a) = log f(A_1^WT ... A_L^WT; A_i = a), i.e., the wild-type
sequence with the mutation A_i^WT → a. Note that the WT evolutionary
model is only valid one mutation away from the sequence in the PDB,
while the MF model is designed to remain valid after a long
evolutionary divergence. The parameters Pmut(a) and Λ are determined
as in the MF model.
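When the selective factors ϕi(a) are fixed, as in the WT model, the maximum likelihood determination of Λ reduces to a one-dimensional search. The sketch below uses a plain grid search, which is our simplification; the optimizer actually used by the programs is not described here. All names and the toy values (a two-letter alphabet, four sites with identical ϕi) are hypothetical.

```python
import math

def site_distribution(p_mut, phi_i, lam):
    """Eq. 6 for one site (same sign convention as in the text)."""
    weights = {a: p_mut[a] * math.exp(lam * phi_i[a]) for a in p_mut}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}

def log_likelihood(sequence, p_mut, phi, lam):
    """Sum over sites of log P_i(observed amino acid, Lambda)."""
    return sum(math.log(site_distribution(p_mut, phi_i, lam)[aa])
               for aa, phi_i in zip(sequence, phi))

def fit_lambda(sequence, p_mut, phi, grid):
    """Return the grid value of Lambda with the highest log-likelihood."""
    return max(grid, key=lambda lam: log_likelihood(sequence, p_mut, phi, lam))

# Hypothetical toy data: three sites carry the favored residue, one does not.
P_MUT = {"L": 0.5, "K": 0.5}
PHI = [{"L": 1.0, "K": 0.0}] * 4
SEQUENCE = "LLLK"
GRID = [0.01 * k for k in range(501)]  # Lambda from 0 to 5

best_lambda = fit_lambda(SEQUENCE, P_MUT, PHI, GRID)
```

In this toy case the log-likelihood is 3Λ − 4 log(1 + e^Λ) up to a constant, so the optimum lies at Λ = log 3, and the grid search returns the nearest grid value.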
3.3 The Substitution Process
To fully specify the substitution process, the site-specific amino
acid frequencies P^i_a = Pi(a) modeled with Eq. 6 must be
complemented with site-specific exchangeability matrices E^i_ab, and
the site-specific substitution rates that define the substitution
process used to compute the likelihood function are computed as
Q^i_ab = E^i_ab P^i_b. The exchangeability matrices that
characterize the dynamics of the substitution process are assumed to
be symmetric; thus, the detailed balance is satisfied, and P^i_a are
the stationary distributions. The matrices E^i_ab are computed with
the method of Halpern and Bruno as the product between a global
exchangeability matrix E^mut_ab that represents the mutation process
and a fixation probability analogous to Eq. 5, such that the
site-specific frequency of amino acid a is the power of its
site-specific fitness [47]. Specifically, if we write
P^i_a = P^mut_a F^i_a, where F^i_a are site-specific selective
factors, the exchangeability matrices are given by:
    E^i_{ab} = E^{mut}_{ab} \, \frac{\log F^i_a - \log F^i_b}{F^i_a - F^i_b} \qquad (7)
which is also a symmetric matrix that fulfills detailed balance. The
substitution rates are maximal if the two amino acids have the same
selective factors (in which case the fixation probability tends to
1), the selective factors are large, and the mutational
exchangeability is large.
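Eq. 7 and the resulting rates Q^i_ab = E^i_ab P^i_b can be sketched as follows. The function names, the three-letter toy alphabet, and all numerical values are hypothetical. When the two selective factors coincide, the ratio in Eq. 7 is replaced by its limit 1/F^i_a.

```python
import math

def selective_ratio(f_a, f_b, tol=1e-12):
    """(log F_a - log F_b) / (F_a - F_b), with the limit 1/F_a for F_a = F_b."""
    if abs(f_a - f_b) < tol:
        return 1.0 / f_a
    return (math.log(f_a) - math.log(f_b)) / (f_a - f_b)

def site_rates(e_mut, p_mut, p_site):
    """Site-specific rates Q_ab = E_ab * P_b with E_ab as in Eq. 7.

    e_mut:  dict frozenset({a, b}) -> symmetric mutational exchangeability
    p_mut:  dict amino acid -> mutational frequency
    p_site: dict amino acid -> site-specific frequency (e.g., from Eq. 6)
    """
    f = {a: p_site[a] / p_mut[a] for a in p_site}  # selective factors F_a
    rates = {}
    for a in p_site:
        for b in p_site:
            if a != b:
                e_ab = e_mut[frozenset((a, b))] * selective_ratio(f[a], f[b])
                rates[(a, b)] = e_ab * p_site[b]
    return rates

# Hypothetical toy values.
P_MUT = {"L": 0.5, "K": 0.3, "G": 0.2}
P_SITE = {"L": 0.7, "K": 0.1, "G": 0.2}
E_MUT = {frozenset("LK"): 1.0, frozenset("LG"): 0.5, frozenset("KG"): 2.0}

Q = site_rates(E_MUT, P_MUT, P_SITE)
```

Because E^i_ab is symmetric, the flux P^i_a Q^i_ab equals P^i_b Q^i_ba for every pair, which is the detailed balance condition stated above.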
We allow three different models for the global exchangeability
matrix. (a) E^mut_ab is equal to an empirical exchangeability
matrix, WAG [48] or JTT [49]. We call it the emp exchangeability
model. (b) The average flux between each pair of amino acids,
(1/L) ∑i P^i_a E^i_ab P^i_b, is equal to the flux of the empirical
model, E^emp_ab P^emp_a P^emp_b (flux exchangeability model).
(c) E^mut_ab is computed from a mutation process at the nucleotide
level, with parameters that are either optimized to fit the observed
amino acid frequencies or imposed as input (mut exchangeability
model).
3.4 Implementation in the Ancestral Sequence Reconstruction Framework ProtASR
The MF model was implemented in the computer simulator ProtEvol
and in the ancestral sequence reconstruction (ASR) framework
ProtASR [50].
ProtEvol computes global (whole protein) and local (site-specific)
amino acid frequencies and exchangeability matrices that satisfy
stability of the native state against both unfolding and misfolding.
The program is freely available from
https://ub.cbm.uam.es/index.php.
ProtASR is an evolutionary framework to infer ancestral pro-
tein sequences from a multiple sequence alignment (MSA) of pro-
teins, a rooted phylogenetic tree, a protein structure representative
of the proteins of the MSA, and a set of thermodynamic parameters.
Internally, ProtASR runs ProtEvol to generate global and local
amino acid frequencies and exchangeability matrices. Next, these
frequencies and matrices are transferred to the well-established
program PAML [51] where the ASR is performed under joint or
marginal maximum likelihood (ML) approaches [52]. ProtASR is
freely available from https://github.com/MiguelArenas/protasr.
3.5 Ancestral Proteins Reconstructed Under SCS Models Present More Realistic Folding Stabilities than Those Reconstructed Under Empirical Substitution Models
Ancestral sequence reconstruction (ASR) is a useful tool of
evolutionary biology [53, 54] with a wide variety of applications
such as HIV vaccine development [2, 3, 55] or reconstruction of
proteins of extinct organisms [56, 57]. The accuracy of ASR methods
is crucial to obtain realistic sequences, and thus considering
realistic substitution models is recommended.
In this example, we analyzed the performance of ProtASR in
reconstructing ancestral proteins under MF and empirical models in
terms of protein stability [50].
We analyzed a total of six protein families present in different
bacterial species (D-ala-D-ala ligases, chaperone proteins dnaK,
triosephosphate isomerases, tryptophan synthases α chain,
thioredoxins I, and SH2 domain) whose folding stability had been
previously studied [58]. After obtaining the corresponding MSA, an
ML phylogenetic tree was inferred and rooted using a eukaryotic
protein as the out-group. Next, for each protein family, a total of
50 computer simulations were performed with ProteinEvolver by
evolving a representative protein of the MSA (for which there is a
PDB structure) along the corresponding phylogeny and under an SCS
model (see Subheading 2). We also performed computer simulations
with the SCS model described by Williams et al. [59].
The simulated MSA was later used to perform ASR with Pro-
tASR under MF and empirical substitution models. In addition,
ProtASR adopting the MF model was compared with PhyloBayes
[60] adopting CAT models [61]. Next, the folding free energy of
the inferred ancestral protein sequences was computed with the
program DeltaGREM described above [32].
We found that ancestral sequences inferred with ProtASR
under MF generated free energies significantly closer to those of
the simulated sequences than ancestral protein sequences recon-
structed with empirical models. We also found that the recon-
structed sequences were more stable than the simulated
sequences, a bias that was previously observed in [59]. However,
ASR adopting the MF model reduced this bias with respect to
empirical substitution models, which is an apparently counterintui-
tive result since the MF model enforces folding stability, whereas
empirical models do not consider this condition [50].
3.6 Ancestral Prokaryotic Proteins Reconstructed Under SCS Models Present Different Energy Fluctuations over Time
In this example, we reconstructed the ancestral sequences of five
extant prokaryotic protein families (D-ala-D-ala ligases, chaperone
proteins dnaK, triosephosphate isomerases, tryptophan synthases α
chain, and thioredoxins I) with ProtASR under the MF model. These
protein families were selected because they were used in a previous
study that evaluated the evolution of folding thermodynamic
properties [58].
We found that the folding free energies varied broadly across
evolution [50]. All protein families presented periods of increase,
conservation, and decrease of free energies following a seascape
model of protein evolution [62].
4 Guidelines, Recommendations, and Practical Examples for Using ProteinEvolver and ProtASR
In this section we present some guidelines and recommendations to
simulate protein evolution with ProteinEvolver and to infer
ancestral protein sequences with ProtASR under SCS models. A
practical example for each framework is also described.
4.1 Guidelines and Recommendations for Simulating Protein Evolution with ProteinEvolver
ProteinEvolver is a computer program written in C that runs from
the command line. Its input is very simple with just a main input
file that calls secondary input files.
As for any computer simulator, the first step is to design the
simulation study including the choice of the parameters to mimic
the desired evolutionary scenario, the required number of
simulations, and the output format. Second, ProteinEvolver includes
detailed documentation and several examples, which we recommend
reading in detail. Next, we describe the input and output
information of this framework.
Since the simulation of molecular evolution is a stochastic
process [43], the user has to indicate the number of computer
simulations to be performed. The simulation of protein evolution
is performed upon a phylogeny. This phylogeny can be user-
specified or can be simulated with ProteinEvolver under the coales-
cent with recombination, demographics, longitudinal sampling,
population structure, and migration (see Subheading 2.2). For the
latter, the user has to specify the sample size (number of protein
sequences of the simulated MSA), population size, and, optionally,
other population genetics parameters (e.g., recombination rate,
distributions for recombination hotspots, population growth rate,
demographic periods, number of populations, and migration rate).
Next, the user has to specify a substitution model of
protein evolution, which could be empirical or stability-
constrained. Concerning SCS models, the user has to indicate a
protein structure, a representative set of alternative contact matrices
(already included in the package), and some thermodynamic para-
meters (see Subheading 2.2). Proportion of invariable sites and
additional rate heterogeneity among sites can be optionally speci-
fied. Finally, a sequence for the MRCA node can also be user-
specified or, alternatively, internally computed by sampling from
the amino acid frequencies.
Concerning the outputs, the program generates an MSA of
proteins of the sample (and, optionally, of proteins of ancestral
nodes) that can be written in formats such as fasta, phylip, or nexus.
Optionally, the program also outputs the simulated recombination
breakpoints and folding energies of the simulated proteins.
Next, we describe a practical example to simulate data with
ProteinEvolver under a site-dependent SCS model. We apply the
second example (simulation of protein sequences under the neutral
site-dependent SCS model) included in the program package.
1. Setting up the input files. First, we can explore the file
parameters, which is the main input file. In this file, the text in
brackets is ignored by the program. The default specifications in
this example indicate the simulation of two replicates. Since the
setting input tree/s file is empty, the program will perform a
coalescent simulation. The coalescent simulation considers a sample
of 8 individuals (proteins) with a length of 255 amino acids. The
effective population size is 1000 individuals, and its variation
over time is considered with the specification of a population
growth rate. Longitudinal sampling is not specified in this example.
A homogeneous recombination rate along the sequence is specified,
as well as a substitution rate and an out-group with a fixed branch
length of 0.1. The settings of the substitution model are specified
in the file Pop_evol.in. There, a PDB file 1TRE.pdb, its chain, a
list of alternative contact matrices structures.in, several
thermodynamic parameters (temperature and configurational
entropies) that we recommend not altering [29], the specification
of the neutral SCS model, and other minor information are
specified. Returning to the main input file, the user can indicate
the desired output information such as the format of the simulated
MSA, coalescent trees and network, coalescent times, or
recombination breakpoints.
2. Running the computer simulations. First, the program must be
compiled. In the directory src of the package, just type make all;
some unimportant warnings may appear. Next, the executable file
ProteinEvolver1.2.0 should be placed in the same directory as the
input files, and to run it one has to type ./ProteinEvolver1.2.0.
The program will automatically recognize the input file parameters,
and the simulation may take a few seconds. Simulating a larger
sample size, sequence length, population size, substitution rate,
and/or recombination rate will increase the computer time.
3. Analyzing the results. A folder named Results will be created in
the working directory and will include all the output data. For
each replicate (#), the output file sequences# provides the
simulated protein MSA, and NetworkFile# provides the
simulated recombination network in branch list format
[63]. The output file breakpoints presents a list of simulated
recombination breakpoints, times presents a list of times of
coalescent events, and trees presents the simulated coalescent
tree/s in Newick format. A folder named ProteinStability pre-
sents the folding energy for each simulated protein at every
ancestral and tip node.
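Once the simulated MSA is available, a first inspection can be scripted. The following minimal Python sketch assumes that the fasta output format was selected; it parses fasta text and computes the amino acid frequencies observed at each alignment column. The function names and the embedded toy alignment are hypothetical and do not reproduce actual ProteinEvolver output.

```python
from collections import Counter

def parse_fasta(text):
    """Parse fasta-formatted text into a dict name -> sequence."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}

def site_frequencies(alignment):
    """Observed amino acid frequencies at each column of an MSA."""
    seqs = list(alignment.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences in an MSA must have equal length")
    return [{aa: n / len(seqs) for aa, n in Counter(column).items()}
            for column in zip(*seqs)]

# Toy alignment (hypothetical); in practice the text would be read from a
# simulated MSA file written to the Results folder.
EXAMPLE = """>seq1
MKVL
>seq2
MKIL
>seq3
MRIL
>seq4
MKIL
"""

msa = parse_fasta(EXAMPLE)
freqs = site_frequencies(msa)
```

Each element of freqs describes one alignment column, so variable and conserved sites can be told apart directly.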
4.2 Guidelines and Recommendations for Inferring Ancestral Protein Sequences with ProtASR
ProtASR is a computer program written in C and Perl that runs on
the command line. The program includes detailed documentation and
several examples, which we also recommend reading in detail. Its
input is very simple with just a main input file that calls
secondary input files. The input files are an MSA of protein
sequences, a rooted phylogenetic tree for the MSA, a PDB protein
structure that should be representative of the MSA, and a series of
parameters to specify the desired substitution model. For beginners
we recommend applying the default parameters provided in the
examples included in the package since those parameters have
provided a good fit to diverse real data [25, 29, 50].
Next, we describe a practical example to infer ancestral protein
sequences with ProtASR under the MF model. We apply the example of
rubredoxins that is included in the program package.
1. Setting up the input files. In the file Settings, the user has to
specify the alignment file (in nexus format) that must include a
rooted phylogenetic tree (in Newick format), a substitution model
(in this example, it is MF), a PDB file and chain, and a variety of
thermodynamic parameters that we recommend leaving at their default
values.
2. Running the inferences. First, the program must be compiled. In
the directory src of the package, just type make all; some
unimportant warnings may appear, and the compilation should take
less than a minute. Next, all the material (files and folders)
generated after the compilation should be placed in the directory
of the input files (or vice versa), and there just type perl
ProtASR_main.pl Settings.txt to run the ASR. The analysis will take
several seconds. Datasets with a larger sample size and/or sequence
length will increase the computer time.
3. Analyzing the results. A folder named RESULTS will be created
in the working directory. There the output file InferredAnces-
tralProteins.txt presents the inferred ancestral protein sequence
for each node of the phylogenetic tree, and LocalResultsLikeli-
hood.txt presents the estimated ML at local (site) and global
(entire protein) levels (further information is included in the
output directories Global_ASR_ML and Local_ASR_ML). The
output directory Meanfield includes additional information gen-
erated by the MF model such as folding free energies, local and
global amino acid frequencies, and exchangeability matrices.
5 Concluding Remarks
Protein evolution is a complex process in which different
evolutionary forces generate new variants upon which selection
operates (e.g., toward stable proteins). As a consequence,
substitution models of evolution that incorporate structural
properties of the native state, such as secondary structure and
solvent accessibility, have produced a better fit to real data than
traditional empirical substitution models. A number of SCS models
have been developed, but mainly due to their complexity, they have
not yet been implemented in useful frameworks for evolutionary
biologists.
In this chapter we described some SCS models and their evaluation
and implementation in freely available frameworks to simulate
protein evolution and to reconstruct ancestral proteins. We believe
that future work on SCS models should continue developing realistic
models but also implementing such models in useful frameworks for
evolutionary analysis, as we proposed in the different studies
described in this chapter.
Acknowledgments
M.A. was supported by the grant “Ramón y Cajal” RYC-2015-18241
from the Spanish Government. U.B. is supported by the grant
BIO2016-79043 from the Spanish Ministry of Economy.
References
1. Schmitt AO, Schuchhardt J, Ludwig A, Brockmann GA (2007) Protein evolution within and between species. J Theor Biol 249(2):376–383. https://doi.org/10.1016/j.jtbi.2007.08.001
2. Gao F, Bhattacharya T, Gaschen B, Taylor J, Moore JP, Novitsky V, Yusim K, Lang D, Foley B, Beddows S, Alam M, Haynes B, Hahn BH, Korber B (2003) Consensus and ancestral state HIV vaccines. Science 299(5612):1515–1518
3. Arenas M, Posada D (2010) Computational design of centralized HIV-1 genes. Curr HIV Res 8(8):613–621
4. Wilson C, Agafonov RV, Hoemberger M, Kutter S, Zorba A, Halpin J, Buosi V, Otten R, Waterman D, Theobald DL, Kern D (2015) Kinase dynamics. Using ancient protein kinases to unravel a modern cancer drug’s mechanism. Science 347(6224):882–886. https://doi.org/10.1126/science.aaa1823
5. Perez-Jimenez R, Ingles-Prieto A, Zhao ZM, Sanchez-Romero I, Alegre-Cebollada J, Kosuri P, Garcia-Manyes S, Kappock TJ, Tanokura M, Holmgren A, Sanchez-Ruiz JM, Gaucher EA, Fernandez JM (2011) Single-molecule paleoenzymology probes the chemistry of resurrected enzymes. Nat Struct Mol Biol 18(5):592–596
6. Wijma HJ, Floor RJ, Janssen DB (2013) Structure- and sequence-analysis inspired engineering of proteins for enhanced thermostability. Curr Opin Struct Biol 23(4):588–594. https://doi.org/10.1016/j.sbi.2013.04.008
7. Cole MF, Gaucher EA (2011) Utilizing natural diversity to evolve protein function: applications towards thermostability. Curr Opin Chem Biol 15(3):399–406. https://doi.org/10.1016/j.cbpa.2011.03.005
8. Arenas M (2015) Trends in substitution models of molecular evolution. Front Genet 6:319. https://doi.org/10.3389/fgene.2015.00319
9. Liberles DA, Teichmann SA, Bahar I, Bastolla U, Bloom J, Bornberg-Bauer E, Colwell LJ, de Koning AP, Dokholyan NV, Echave J, Elofsson A, Gerloff DL, Goldstein RA, Grahnen JA, Holder MT, Lakner C, Lartillot N, Lovell SC, Naylor G, Perica T, Pollock DD, Pupko T, Regan L, Roger A, Rubinstein N, Shakhnovich E, Sjolander K, Sunyaev S, Teufel AI, Thorne JL, Thornton JW, Weinreich DM, Whelan S (2012) The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci 21(6):769–785
10. Bastolla U (2014) Detecting selection on protein stability through statistical mechanical models of folding and evolution. Biomol Ther 4:291–314
11. Wilke CO (2012) Bringing molecules back into molecular evolution. PLoS Comput Biol 8(6):e1002572
12. Sikosek T, Chan HS (2014) Biophysics of protein evolution and evolutionary protein biophysics. J R Soc Interface 11(100):20140419. https://doi.org/10.1098/rsif.2014.0419
13. Goldstein RA (2011) The evolution and evolutionary consequences of marginal thermostability in proteins. Proteins 79(5):1396–1407
14. Serohijos AW, Shakhnovich EI (2014) Merging molecular mechanism and evolution: theory and computation at the interface of biophysics and evolutionary population genetics. Curr Opin Struct Biol 26:84–91. https://doi.org/10.1016/j.sbi.2014.05.005
15. Bastolla U, Dehouck Y, Echave J (2017) What evolution tells us about protein physics, and protein physics tells us about evolution. Curr Opin Struct Biol 42:59–66. https://doi.org/10.1016/j.sbi.2016.10.020
16. Echave J (2008) Evolutionary divergence of protein structure: the linearly forced elastic network model. Chem Phys Lett 457(4):413–416. https://doi.org/10.1016/j.cplett.2008.04.042
17. Tirion MM (1996) Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Phys Rev Lett 77(9):1905–1908
18. Bahar I, Rader AJ (2005) Coarse-grained normal mode analysis in structural biology. Curr Opin Struct Biol 15(5):586–592. https://doi.org/10.1016/j.sbi.2005.08.007
19. Bornberg-Bauer E, Chan HS (1999) Modeling evolutionary landscapes: mutational stability, topology, and superfunnels in sequence space. Proc Natl Acad Sci U S A 96(19):10689–10694
20. Bastolla U, Porto M, Eduardo Roman MH, Vendruscolo MH (2003) Connectivity of neutral networks, overdispersion, and structural conservation in protein evolution. J Mol Evol 56(3):243–254
21. Lemmon AR, Moriarty EC (2004) The importance of proper model assumption in bayesian phylogenetics. Syst Biol 53(2):265–277
22. Zhang J (1999) Performance of likelihood ratio tests of evolutionary hypotheses under inadequate substitution models. Mol Biol Evol 16(6):868–875
23. Bordner AJ, Mittelmann HD (2013) A new formulation of protein evolutionary models that account for structural constraints. Mol Biol Evol 31(3):736–749
24. Rodrigue N, Lartillot N, Bryant D, Philippe H (2005) Site interdependence attributed to tertiary structure in amino acid sequence evolution. Gene 347(2):207–217
25. Arenas M, Sanchez-Cobos A, Bastolla U (2015) Maximum likelihood phylogenetic inference with selection on protein folding stability. Mol Biol Evol 32(8):2195–2207. https://doi.org/10.1093/molbev/msv085
26. Bastolla U, Porto M, Roman HE, Vendruscolo M (2006) A protein evolution model with independent sites that reproduces site-specific amino acid distributions from the Protein Data Bank. BMC Evol Biol 6:43
27. Anishchenko I, Ovchinnikov S, Kamisetty H, Baker D (2017) Origins of coevolution between residues distant in protein 3D structures. Proc Natl Acad Sci U S A 114:9122–9127. https://doi.org/10.1073/pnas.1702664114
28. Wang ZO, Pollock DD (2005) Context dependence and coevolution among amino acid residues in proteins. Methods Enzymol 395:779–790. https://doi.org/10.1016/S0076-6879(05)95040-4
29. Arenas M, Dos Santos HG, Posada D, Bastolla U (2013) Protein evolution along phylogenetic histories under structurally constrained substitution models. Bioinformatics 29(23):3020–3028
30. Echave J, Wilke CO (2017) Biophysical models of protein evolution: understanding the patterns of evolutionary sequence divergence. Annu Rev Biophys 46:85–103. https://doi.org/10.1146/annurev-biophys-070816-033819
31. Bastolla U, Farwer J, Knapp EW, Vendruscolo M (2001) How to guarantee optimal stability for most representative structures in the Protein Data Bank. Proteins 44(2):79–96
32. Minning J, Porto M, Bastolla U (2013) Detecting selection for negative design in proteins through an improved model of the misfolded state. Proteins 81(7):1102–1112. https://doi.org/10.1002/prot.24244
33. Sella G, Hirsh AE (2005) The application of statistical physics to evolutionary biology. Proc Natl Acad Sci U S A 102(27):9541–9546
34. Mustonen V, Lassig M (2005) Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies. Proc Natl Acad Sci U S A 102(44):15936–15941. https://doi.org/10.1073/pnas.0505537102
35. Arenas M (2012) Simulation of molecular data under diverse evolutionary scenarios. PLoS Comput Biol 8(5):e1002495
36. Hoban S, Bertorelle G, Gaggiotti OE (2012) Computer simulations: tools for population and evolutionary genetics. Nat Rev Genet 13(2):110–122
37. Kingman JFC (1982) The coalescent. Stoch Process Appl 13:235–248
38. Posada D, Wiuf C (2003) Simulating haplotype blocks in the human genome. Bioinformatics 19(2):289–290
39. Arenas M, Posada D (2010) Coalescent simulation of intracodon recombination. Genetics 184(2):429–437
40. Arenas M (2013) Computer programs and methodologies for the simulation of DNA sequence data with recombination. Front Genet 4:9
41. Arenas M, Posada D (2014) Simulation of genome-wide evolution under heterogeneous substitution models and complex multispecies coalescent histories. Mol Biol Evol 31(5):1295–1301
42. Hudson RR (1998) Island models and the coalescent process. Mol Ecol 7(4):413–418
43. Yang Z (2006) Computational molecular evolution. Oxford University Press, Oxford
44. Abascal F, Zardoya R, Posada D (2005) ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21(9):2104–2105
45. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
46. Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A (2000) Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29:291–325
47. Halpern AL, Bruno WJ (1998) Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol 15(7):910–917
48. Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18(5):691–699
49. Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci 8(3):275–282
50. Arenas M, Weber CC, Liberles DA, Bastolla U (2017) ProtASR: an evolutionary framework for ancestral protein reconstruction with selection on folding stability. Syst Biol 66:1054–1064. https://doi.org/10.1093/sysbio/syw121
51. Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24(8):1586–1591
52. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13(5):555–556
53. Merkl R, Sterner R (2016) Ancestral protein reconstruction: techniques and applications. Biol Chem 397(1):1–21. https://doi.org/10.1515/hsz-2015-0158
54. Liberles DA (2007) Ancestral sequence reconstruction. Oxford University Press, Oxford
55. Kothe DL, Li Y, Decker JM, Bibollet-Ruche F, Zammit KP, Salazar MG, Chen Y, Weng Z, Weaver EA, Gao F, Haynes BF, Shaw GM, Korber BT, Hahn BH (2006) Ancestral and consensus envelope immunogens for HIV-1 subtype C. Virology 352(2):438–449
56. Gaucher EA, Govindarajan S, Ganesh OK (2008) Palaeotemperature trend for Precambrian life inferred from resurrected proteins. Nature 451(7179):704–707
57. Hobbs JK, Shepherd C, Saul DJ, Demetras NJ, Haaning S, Monk CR, Daniel RM, Arcus VL (2012) On the origin and evolution of thermophily: reconstruction of functional precambrian enzymes from ancestors of Bacillus. Mol Biol Evol 29(2):825–835. https://doi.org/10.1093/molbev/msr253
58. Bastolla U, Moya A, Viguera E, van Ham RC (2004) Genomic determinants of protein folding thermodynamics in prokaryotic organisms. J Mol Biol 343(5):1451–1466
59. Williams PD, Pollock DD, Blackburne BP, Goldstein RA (2006) Assessing the accuracy of ancestral protein reconstruction methods. PLoS Comput Biol 2(6):e69
60. Lartillot N, Lepage T, Blanquart S (2009) PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25(17):2286–2288. https://doi.org/10.1093/bioinformatics/btp368
61. Lartillot N, Philippe H (2004) A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol 21(6):1095–1109
62. Mustonen V, Lassig M (2009) From fitness landscapes to seascapes: non-equilibrium dynamics of selection and adaptation. Trends Genet 25(3):111–119. https://doi.org/10.1016/j.tig.2009.01.002
63. Arenas M, Patricio M, Posada D, Valiente G (2010) Characterization of phylogenetic networks with NetTest. BMC Bioinformatics 11(1):268
Chapter 12
Navigating Among Known Structures in Protein Space

Aya Narunsky, Nir Ben-Tal, and Rachel Kolodny
Abstract
Present-day protein space is the result of 3.7 billion years of evolution, constrained by the underlying
physicochemical qualities of the proteins. It is difficult to differentiate between evolutionary traces and
effects of physicochemical constraints. Nonetheless, as a rule of thumb, instances of structural reuse, or
focusing on structural similarity, are likely attributable to physicochemical constraints, whereas sequence
reuse, or focusing on sequence similarity, may be more indicative of evolutionary relationships. Both types
of relationships have been studied and can provide meaningful insights to protein biophysics and evolution,
which in turn can lead to better algorithms for protein search, annotation, and maybe even design.
In broad strokes, studies of protein space vary in the entities they represent, the similarity measure
comparing these entities, and the representation used. The entities can be, for example, protein chains,
domains, supra-domains, or smaller protein sub-parts denoted themes. The measures of similarity
between the entities can be based on sequence, structure, function, or any combination of these. The
representation can be global, encompassing the whole space, or local, focusing on a particular region
surrounding protein(s) of interest. Global representations include lists of grouped proteins, protein
networks, and maps. Networks are the abstraction that is derived most directly from the similarity
data: each node is the protein entity (e.g., a domain), and edges connect similar domains. Selecting
the entities, the similarity measure, and the abstraction are three intertwined decisions: the similarity
measures allow us to identify the entities, and the selection of entities influences what is a meaningful
similarity measure. Similarly, we seek entities that are related to each other in a way for which a simple
representation describes their relationships succinctly and accurately. This chapter will cover studies that
rely on different entities, similarity measures, and a range of representations to better understand protein
structure space. Scholars may use publicly available navigators offering a global representation, and in
particular the hierarchical classifications SCOP, CATH, and ECOD, or a local representation, which
encompasses structural alignment algorithms. Alternatively, scholars can configure their own navigator
using existing tools. To demonstrate this DIY (do it yourself) approach for navigating in protein space,
we investigate substrate-binding proteins. By presenting sequence similarities among this large and
diverse protein family as a network, we can infer that one member (pdb ID 4ntl; of yet unknown
function) may bind methionine and suggest a putative binding mechanism.
Key words Protein space navigation, Structure space, Evolutionary relationships in protein space
1 Introduction
1.1 Protein Structure Space
Protein structure space is an abstract model which we use when we
study large, representative sets of protein structures and their
interrelationships. Inspecting these large datasets allows us to bet-
ter understand protein evolution and biophysics. While protein
space is not real, the entities that populate it are: for example,
these can be protein chains or domains; furthermore, their compar-
isons are meaningful. Thus, the first and essential step when study-
ing protein structure space is to decide on the set of entities and the
measure of similarity among them (coupled with a method to
compute it). We can then calculate all-against-all comparisons of
these entities to construct the initial dataset. Because the abstract
model is derived from these comparisons, it is essential that this
initial set is as accurate and comprehensive as possible. Navigating
in protein structure space is in many ways navigating within this
initial dataset, and we can do this either locally or globally.
1.2 Navigation Modes
Navigating “locally” or “globally” in protein structure space is a
metaphor, which describes how we study the dataset. By “local,”
we mean that we identify small sets of comparisons, which we deem
relevant. Given a query protein chain, or query protein domain, we
think of the comparisons of that protein and its near structural
neighbors (i.e., other proteins in the dataset that are similar to it)
as covering its local region in structure space. Navigating locally is
moving between (overlapping) local regions, akin to moving
between landmarks when using a navigation app. By “global,” we
mean that we derived a model which integrates information from
many (possibly all) comparisons and explore this model. Alterna-
tively, we can think of this model as a data structure that organizes
all entities based on the relationships between them. Navigating
globally means that we either explore the properties of this data
structure, akin to staring at a map, or move between proteins based
on their location in the data structure.
1.3 The Potential of Studying Protein Structure Space
Studying protein structure space can help us better understand
protein evolution and biophysics. It may also have a practical
value: insights could be used in protein structure prediction, protein
function prediction, and protein design. By way of motivation,
we list a few examples; there are many more (e.g., those listed in
[1, 2]). Evolution scholars have navigated protein space looking for
clues in the remnants of evolutionary processes [3, 4]. For example,
Choi et al. [5] derive the “multiple birth model” for proteins from
maps, Dokholyan et al. [6] offered support for all proteins evolving
from a few precursors, Alva et al. [7] studied the relationship
between convergent and divergent evolution, Farias-Rico et al.
traced the evolutionary relationships between ancient superfolds
[8], and Nepomnyachiy et al. [9] highlighted the complex nature of
reuse patterns, which often overlap with each other. Studying pro-
tein structure space also revealed biophysical properties of proteins:
examples include the work of Skolnick et al. [10], Nepomnyachiy
et al. [11], and Mackenzie et al. [12]. Understanding the space of
all structures can help in protein structure prediction and in better
organizing the databases for structure search [13]. A global per-
spective also offered a hint to the relationship between protein
structure and function, showing that there is a localized region of
high function diversity [14]. Notice that one size does not fit all:
different insights were gained from representations of protein space
that varied in the sets of entities curated and in the way the entities
were compared to each other.
2 Materials and Methods
2.1 The Entities
The entities are derived from the proteins of known structure in the
Protein Data Bank (PDB) [15] and can be parts of proteins of
different scales, depending on the question at hand. With minimal
processing, these can be protein complexes or protein chains. One
could also consider protein domains [16, 17] (or even supra-
domains [18]), or meaningful sub-domain entities: protein frag-
ments (e.g., [19, 20]), protein themes [9], protein interfaces [21],
protein-peptide complexes [22], repetitive secondary structure ele-
ments (e.g., Smotifs [23]), or tertiary structural motifs (TERMS)
[12]. Alternatively, the structures could possibly be predictions
[24], or homology models [25]. Typically, one would use datasets
that were curated by others (e.g., the domain sets in SCOP [26],
CATH [27], or ECOD [28]). It is important to consider if the
entities are mutually exclusive, or not. For example, domains are
mutually exclusive because when partitioning chains to domains,
each residue is associated with only a single domain; in contrast,
themes cover multiple (nested) segments in a protein chain.
2.2 Relating the Entities
Comparing proteins can be based on their sequences, structures, or
functions. The most straightforward measure is sequence similarity,
which suggests shared evolutionary ancestor(s) [29]. Sequence
alignment tools vary in sensitivity: less sensitive methods rely
directly on the protein sequences (e.g., BLAST). More sensitive
methods rely on an enriched version of the sequences: either
sequence profiles (e.g., PSI-BLAST) or HMMs (e.g., HHSearch
[30] or HMMER [31]); these are probabilistic models that include
not only the protein sequence but also sequences of its close homologues
[30, 31]. Using sensitive sequence aligners like HHSearch
or HMMER reveals more distant evolutionary relationships. To
avoid relating pairs of proteins that have diverged beyond what
we would consider similar, scholars add an additional restriction
that the structures of the aligned residues be similar [11, 32]. Structural
changes may nevertheless have emerged during evolution, and
proteins often undergo conformational
changes [33, 34]. Note that using profile or HMM-based
sequence aligners requires calculating these profiles or HMMs;
one can use pre-calculated ones (which influences the set of entities
available). Alternatively, it is possible to compare the structures of
the proteins. Structure similarity is often viewed as a method for
relating proteins that were similar further back in evolutionary
history, with sequences that diverged beyond the point where one
can identify their common ancestry; for example, the SCOP “fold,”
CATH “Architecture,” and ECOD “X” levels are based on struc-
ture similarity. This is akin to using a more powerful telescope to
look back in time [35]. A concern when relying only on structure
similarity to study protein evolution is that these proteins may share
structures simply because these structures are especially favorable from a
biophysical perspective. In other words, what we see may be merely a
consequence of the biophysical properties and constraints [36],
perhaps due to convergent evolution. To compare structures, we
use one of many structural alignment methods. In fact, structural
alignment is a vast field with many intricacies, far beyond the scope
of this chapter. For more details, see [37–40] and below in the
section highlighting structural alignment servers.
The similarity measure (be it based on sequence or on struc-
ture) can be local or global.1 In global similarity, the proteins are
considered in their entirety. In contrast, in local similarity, we
consider subsections, so that proteins can be identified as similar
even if there is only a partial match. The disadvantage of using a
global similarity measure is that to be meaningful, we must first
segment our proteins to pieces, which are similar in their entirety
(e.g., domains); this creates a chicken-and-egg situation, because
we want to segment the proteins in a way that we can find globally
meaningful similarities. The disadvantage of using a local similarity
measure is that it leads to non-transitive relationships: protein A
may be locally similar to protein B, and protein B locally similar to
protein C, while proteins A and C have nothing in
common ([1] has an illustration of this). Non-transitive relationships
are counterintuitive when we think of the notion of similarity
and especially when we integrate all these relationships into a uni-
fied (global) model of protein space.
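The non-transitivity of local similarity can be illustrated with a minimal sketch. It assumes a toy similarity measure (sharing an identical substring of a minimal length) standing in for a real local alignment, and three made-up sequences; none of these come from the studies cited above.

```python
def shares_local_segment(seq_a: str, seq_b: str, min_len: int = 4) -> bool:
    """Toy local similarity: True if the two sequences share an identical
    substring of at least min_len residues (a stand-in for local alignment)."""
    segments = {seq_a[i:i + min_len] for i in range(len(seq_a) - min_len + 1)}
    return any(seq_b[j:j + min_len] in segments
               for j in range(len(seq_b) - min_len + 1))

# Hypothetical two-segment "proteins" (not real sequences).
protein_a = "MKVLAGHW"  # segments MKVL + AGHW
protein_b = "AGHWTPEC"  # segments AGHW + TPEC
protein_c = "TPECDNQR"  # segments TPEC + DNQR

a_b = shares_local_segment(protein_a, protein_b)  # A and B share AGHW
b_c = shares_local_segment(protein_b, protein_c)  # B and C share TPEC
a_c = shares_local_segment(protein_a, protein_c)  # A and C share nothing
print(a_b, b_c, a_c)  # True True False
```

Here B is related to both A and C through different segments, so a network built from such relationships connects A and C only indirectly.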
Footnote 1. Notice that the terms used here characterize the similarity measure, not the style of navigation in protein space;
they are used in the same sense as in the Needleman–Wunsch and Smith–Waterman sequence alignment algorithms.
2.3 Addressing Redundancy
The PDB is redundant, and some proteins are far more abundant
than others (e.g., due to research interests of the scholars studying
these proteins) [41]. This suggests that when seeking a global
perspective, one should either rely on nonredundant datasets or
remove (cull) the redundancy oneself.
Notice that we consider an entity redundant if the dataset includes
another copy of that entity: i.e., one that is (globally) similar to
it. Hence, both the definition of the entities and the measures of
similarity influence this redundancy removal process. There are
software packages, and servers, that implement algorithms for
removing redundancy: two popular ones are CD-HIT [42] and
PISCES [43].
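To make the idea of culling concrete, here is a minimal sketch of a greedy procedure: entities are visited from longest to shortest, and one is kept only if it is not too similar to any representative already kept. This is an illustration under stated assumptions, not the CD-HIT or PISCES implementation; the similarity function (Python's `difflib` ratio), the threshold of 0.7, and the sequences are all placeholders.

```python
from difflib import SequenceMatcher

def toy_similarity(seq_a: str, seq_b: str) -> float:
    """Placeholder global similarity in [0, 1]; a real pipeline would use
    a sequence or structure aligner instead."""
    return SequenceMatcher(None, seq_a, seq_b).ratio()

def cull_redundant(entities: dict, threshold: float = 0.7) -> list:
    """Greedy culling: keep an entity only if it is below the similarity
    threshold to every representative kept so far."""
    representatives = []
    # Longest first, ties broken by name so the result is deterministic.
    ordered = sorted(entities.items(), key=lambda item: (-len(item[1]), item[0]))
    for name, seq in ordered:
        if all(toy_similarity(seq, entities[rep]) < threshold
               for rep in representatives):
            representatives.append(name)
    return representatives

# Hypothetical chains (made-up sequences).
entities = {
    "chain_1": "MKTAYIAKQRQISFVKSHFSRQ",
    "chain_2": "MKTAYIAKQRQISFVKSHFSRQ",  # exact copy of chain_1
    "chain_3": "MKTAYIAKQRQISFVKSHFSR",   # chain_1 without its last residue
    "chain_4": "GSSGLVPRGSHMWWDDEEPPLL",  # unrelated
}
nonredundant = cull_redundant(entities)
print(nonredundant)  # ['chain_1', 'chain_4']
```

Note that the outcome depends on both the entity definition and the similarity measure, as stated above: changing either changes which entities count as copies.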
2.4 Data Structures for Global Representation
For a global perspective, one must derive a data structure, or an
abstract model, from the dataset of all proteins and their comparisons.
Scholars used three types of models: (1) networks, (2) classifications,
and (3) maps (for a review of these, see [2]). A network is
the data structure closest to the raw data. To construct it, one only
needs to list the meaningful similarities, and the network is a
straightforward representation of the entities (as nodes) and the
similarities (as edges connecting these nodes). A classification
groups the entities into nonoverlapping sets of proteins. It is
assumed that proteins in the same set in the classification (i.e.,
with the same classification) are similar to each other, while those
not in the same set are not (or less so). The classifications are
hierarchical, and proteins are grouped with decreasing degrees of
similarity. Hence, to construct a classification, one needs to weight
the importance of the similarities identified among the protein
entities: emphasizing the ones that are within a set and downplay-
ing the ones between sets. Finally, in a map, each protein is repre-
sented by a point, and the points are positioned in two or three
dimensions, so that the distance between them approximates the
dissimilarity between the proteins they represent. The mapping is
calculated by first converting the measures of similarity between the
protein entities to an all-by-all dissimilarity matrix, followed by a
multidimensional scaling (MDS) to project this matrix to a lower
(two or three) dimension. Because the position of a protein is not
indicative of its relationship to other proteins in a straightforward
manner, maps were not used for local navigation. Rather, the
insights were derived from a global perspective [5, 14, 35, 44, 45].
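The mapping step can be sketched as follows. This is a minimal illustration of classical (Torgerson) MDS, one of several MDS variants and not necessarily the one used in the cited studies. It assumes a symmetric dissimilarity matrix and uses simple power iteration with deflation instead of a numerical library; the four-point matrix is a made-up example.

```python
import math
import random

def classical_mds(dissimilarity, dims=2, iterations=200, seed=0):
    """Classical MDS for a symmetric dissimilarity matrix (list of lists).
    Returns one coordinate list of length `dims` per entity."""
    n = len(dissimilarity)
    squared = [[d * d for d in row] for row in dissimilarity]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    # Double centering turns squared dissimilarities into a Gram matrix.
    gram = [[-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

    def multiply(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for axis in range(dims):
        # Power iteration for the leading eigenvector of the current matrix.
        vec = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iterations):
            product = multiply(gram, vec)
            norm = math.sqrt(sum(x * x for x in product))
            if norm < 1e-12:
                break
            vec = [x / norm for x in product]
        eigenvalue = sum(v * p for v, p in zip(vec, multiply(gram, vec)))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][axis] = scale * vec[i]
        # Deflation: remove the component just found before the next axis.
        gram = [[gram[i][j] - eigenvalue * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return coords

# Made-up dissimilarities: the corners of a 3 x 1 rectangle.
r = math.sqrt(10)
toy_dissimilarity = [
    [0, 3, r, 1],
    [3, 0, 1, r],
    [r, 1, 0, 3],
    [1, r, 3, 0],
]
coords = classical_mds(toy_dissimilarity)
for point in coords:
    print([round(x, 3) for x in point])
```

In this toy case the dissimilarities are exactly Euclidean, so the two-dimensional map reproduces them; for real protein data the map is only an approximation, which is one reason positions are hard to interpret locally.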
2.5 Publicly Available Navigators for Protein Structure Space
Defining a meaningful nonredundant set of entities, calculating the
relationships between them, and collecting all this information to a
centralized data structure require both ingenuity and computational
resources. Even more so, as the database of all protein
structures (the PDB) is constantly growing, the calculations need
to be routinely updated. Consequently, many groups have set up
web servers with data for navigating protein structure space; these
navigators have datasets which were curated, compared, and
organized—some at a single time point (but possibly with a more
elaborate organization)—while others are maintained up-to-date.
The navigators enable users to move in protein structure space as if
they are using a navigation app. Some of the navigators offer their
users a global perspective of protein structure space as well.
2.6 Navigators with a Global Perspective
The most established resources for navigating protein structure
space are the hierarchical classifications; the popular ones are
SCOP from the Murzin lab, CATH from the Orengo lab, and
ECOD from the Grishin Lab; another popular classification—
Pfam [46]—is not discussed here because it is based on sequence
rather than structure. For a recent and extensive review of the
classifications, see [47]. The classifications organize the data in a
hierarchy: a user can gain a perspective of the whole space by
drilling down, starting at the top. For example, starting at the
highest level of SCOP, we see that structure space has regions of
all-alpha domains, all-beta domains, alpha+beta domains, and
alpha/beta domains, where the two latter classes include both
alpha and beta elements, separated or intertwined, respectively
[48]. Alternatively, one can search for a specific protein and con-
sider the classification of its domains and the list of all its related
proteins—ones whose domains are classified similarly (at different
levels of the hierarchy). In short, the data structure that is used in
the hierarchies is a collection of sets (or lists), organized as a tree;
each entity is classified in several (nested) sets (depending on the
height of the hierarchy). The similarity measure used is based on
the sequences (at the lower levels of the hierarchy) and structures
(at the higher levels of the hierarchy). The entities classified are
domains: nonoverlapping subsections of the protein chains, which
cover all chain residues (or, in other words, each PDB chain is
segmented into one or more domains such that each residue is
part of exactly one domain). There is much discussion, and controversy,
about the correct definition of domains [49–51]; the existence of
several domain databases (rather than one) is a clear
indication of this.
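The tree-of-nested-sets data structure can be sketched in a few lines. The example below assumes a CATH-like four-level code per domain; the domain names and numeric codes are invented for illustration and do not correspond to real classification entries.

```python
# Level names follow the CATH convention; the codes below are hypothetical.
LEVELS = ("class", "architecture", "topology", "homologous_superfamily")

classification = {
    "dom_a": (3, 1, 1, 1),
    "dom_b": (3, 1, 1, 2),
    "dom_c": (3, 2, 1, 1),
    "dom_d": (1, 1, 1, 1),
}

def deepest_shared_level(code_a, code_b):
    """Name of the lowest level at which two domains fall in the same set,
    or None if they differ already at the top level."""
    shared = None
    for level, part_a, part_b in zip(LEVELS, code_a, code_b):
        if part_a != part_b:
            break
        shared = level
    return shared

def members_of(prefix):
    """All domains in the set identified by a code prefix, e.g. (3, 1)."""
    prefix = tuple(prefix)
    return sorted(name for name, code in classification.items()
                  if code[:len(prefix)] == prefix)

print(deepest_shared_level(classification["dom_a"], classification["dom_b"]))
print(members_of((3,)))
```

Drilling down from the top corresponds to calling `members_of` with progressively longer prefixes; looking up the relatives of a given domain corresponds to using prefixes of its own code.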
In practical terms, domains are the entities classified in SCOP,
CATH, ECOD, or in servers curating domains like CDD
[52]. More formally, there are several (not necessarily overlapping)
definitions of a domain [16, 17, 53]: (1) a structurally distinct
region (perhaps a compact unit) [54], (2) a segment that is identi-
fied as an evolutionary unit based on observations of reuse in
protein space, (3) an independently folding unit, and (4) a section
with assigned biochemical function. The domains in the hierarchi-
cal classifications are defined based on reuse. Unfortunately, these
domains, which are classified in the different databases, are not the
same ones (for comparisons, see [50, 51, 55, 56]); a recent study
estimates that only 60% of CATH domains have a similar SCOP
counterpart [53]. Nonetheless, the domains in the hierarchical
classifications have similar lengths of approximately 100 residues;
this is the average of the domain-length distributions in
SCOP, CATH, and ECOD (see Fig. 8b in [28]). Indeed, splitting a
protein chain into domains is challenging [49], leading to many
algorithmic methods devoted to this task (e.g., [54, 57–59]), and a
significant amount of human intervention in some of the classifica-
tions (rather than only relying on automatic domain assignment
procedures). Regardless of how automatic the procedure for identifying
the domain boundaries is, a fundamental problem remains if
the domains are defined based on reuse: the reuse patterns in
protein space are not simply reuse of segments of an appropriate
length (~100 residues). Rather, it is a complicated pattern of nested
segments that are reused to different extents [9, 27]. Consequently,
there is more than one way to reduce this complex pattern into
domain definitions. Due to this very same complexity, once the
domains are defined, there are many instances of common parts
(segments) between domains that are not wholly similar and are
thus classified differently (at different levels of the hierarchy)
[11, 29, 60–62].
The classification hierarchies maintain an up-to-date dataset
representing the complete and current PDB, with an intuitive
user interface. In CATH and ECOD, one can drill down the tree
to explore different members of the sets; CATH also has a sunburst
visualization, which indicates the relative sizes of the classified sets.
Since the last version (1.75 in 2009) of the classic SCOP, the
classification diverged into two variants: SCOPe and SCOP2.
SCOPe [63] is a continuously and (mostly) automatically updated
extension of classic SCOP. In contrast, SCOP2 [64] changed the
data structure: rather than the classic tree of sets, it uses a network;
the network representation (sometimes called graphs) is implemen-
ted with a web tool based on the visualization software Graphviz
[65]. In all classifications, the user can search for a specific protein
chain or domain and explore the local context of that protein within
the data structure (typically, within the hierarchy), allowing the user
to see proteins of similar sequence (with the same classification at
the lower levels) and of similar structure (with the same classification
at higher levels).
2.7 Publicly Available Navigators for Local Environments of Structure Space
Another way of navigating protein structure space is zooming into
a local region, while ignoring the global view, and exploring by
moving between such local environments. Starting from the protein
of interest, we think of its local environment as a list of its
structural neighbors (sorted from near to distant ones); we can
then move in space by selecting one of these neighbors to see its
slightly shifted local environment (centered on this neighbor). We
think of this process as navigating in protein structure space, like a
driver following a navigation app without seeing the full landscape.
For this, all one must have is the list of neighbors for each protein in
the dataset. The entities considered are typically both PDB chains
and domains (either taken from the classifications or calculated with
an automatic domain parser). Because the overall data structure is
not considered, the structural alignment remains the most important
computational component. Thus, such navigators were often
set up by groups developing structural alignment methods. What
transforms a structural alignment server into a useful navigator is
speed: to navigate comfortably, the server must be fast. This is
because when navigating, we search for structural neighbors repeat-
edly, each time starting at a different protein. Indeed, significant
sophistication is needed to build servers that are up-to-date, fast,
and comprehensive.
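Local navigation needs only a sorted neighbor list per protein, which the following minimal sketch makes explicit. The protein names and similarity scores are hypothetical, and the simple walk (always move to the nearest neighbor not yet visited) is just one possible way to explore; a real server would compute the neighbors with a structural aligner.

```python
from collections import defaultdict

# Hypothetical pairwise similarity scores (higher means more similar).
pairwise_scores = {
    ("P1", "P2"): 0.9,
    ("P1", "P3"): 0.6,
    ("P2", "P4"): 0.8,
    ("P3", "P4"): 0.5,
    ("P4", "P5"): 0.7,
}

def build_neighbor_lists(scores):
    """For each protein, list its neighbors sorted from near to distant."""
    lists = defaultdict(list)
    for (a, b), score in scores.items():
        lists[a].append((b, score))
        lists[b].append((a, score))
    return {protein: sorted(found, key=lambda item: (-item[1], item[0]))
            for protein, found in lists.items()}

def navigate(neighbor_lists, start, steps):
    """Move from one local environment to the next by repeatedly selecting
    the nearest neighbor that has not been visited yet."""
    path = [start]
    current = start
    for _ in range(steps):
        unvisited = [name for name, _ in neighbor_lists[current]
                     if name not in path]
        if not unvisited:
            break
        current = unvisited[0]
        path.append(current)
    return path

neighbor_lists = build_neighbor_lists(pairwise_scores)
print(neighbor_lists["P4"])                # local environment of P4
print(navigate(neighbor_lists, "P1", 3))   # ['P1', 'P2', 'P4', 'P5']
```

Each call to `neighbor_lists[...]` stands in for one query to a structural alignment server, which is why server speed dominates the navigation experience.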
The differences between the structural alignment servers are
largely due to the differences between the structural alignment
methods. We list examples of structural alignment servers that
allow users to locally navigate in protein structure space. The
PDB website has precomputed structural alignments for a repre-
sentative nonredundant dataset, calculated using the FATCAT
aligner [66]. The European PDB website has PDBeFold [67], a
structural alignment server based on the SSM aligner [68]. NCBI’s
server is called VAST+ and is based on the aligner VAST
[69]. PhyreStorm [70] is a new server, which relies on TM-align
[71] and offers a very comfortable navigation experience. Another
new server is TopSearch (using the structural aligner TopMatch),
which has the unique feature that it considers larger entities,
namely protein oligomers [72].
2.8 DIY: Build-Your-Own Navigator
There are several reasons why scholars may want to customize their
own navigator to explore protein structure space, or parts of
it. First, the entities they wish to include may be specific to their
problem: a set of proteins that is not covered in the public servers
(perhaps a more redundant one), unpublished structures, or even
predicted ones. Also, one may want to study subsections of pro-
teins, which are different from chains or domains, for example,
shorter themes [9] or loops [73]. Second, scholars may want to
compare the entities themselves, as it gives them flexibility in the
choice of a specific sequence or structure alignment program, full
control over the parameters used, and the ability to enforce addi-
tional conditions when comparing proteins (e.g., a minimal align-
ment length). In some cases, even though there is a publicly
available structural alignment server, it is not fast enough for navi-
gating structure space; for these, one may prefer to pre-calculate all-
against-all comparisons (e.g., using the parallel power of a com-
puter cluster). We list just a few examples of comparison methods
that were used in a similar context: HHSearch [30], Matt [74], CE
[75], Mammoth [76], 3D-BLAST [77], FragBag [78], TM-align
[71], SSM [68], GRASP [79], and STRUCTAL [80]. Third, the
structural alignment servers do not offer a global perspective of
structure space, only a local one, and one may be interested in this
global perspective. Finally, scholars have different preferences when
exploring structures in a molecular viewer, both in terms of the
viewer they are using and its configuration.
If the navigator is based on a network data structure, it is easy to
build your own navigator with the network visualization tool
Cytoscape [81] and its molecular viewer configuration apps
CyToStruct [82] or structureViz [83]. To represent a part of
protein space as a network, one needs to define the list of nodes
(entities to be compared) and the edges that connect them (pairs of
entities that are similar). This is very easy to do with Cytoscape: a
(fantastic) open-source network analysis and visualization tool.
Given the list of nodes and edges, Cytoscape visualizes the infor-
mation as a well-laid two-dimensional network; one can configure
this visualization easily and extensively. For example, the color of
the nodes may depend on the structural class of the entities they
represent, and the thickness of the edges may depend on the
similarity of the entities they connect. This provides the global
perspective. To gain a local perspective, one would like to use a
molecular viewer to study the nodes or edges and the structures or
structural alignments they represent.
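As a minimal sketch of preparing such node and edge lists, the snippet below writes them as two comma-separated tables, a generic format that network tools such as Cytoscape can typically import. The column names, domain identifiers, and similarity values are assumptions made for illustration; they are not prescribed by Cytoscape or by the studies cited here.

```python
import csv
import io

# Hypothetical entities with an attribute usable for node coloring.
nodes = [
    {"id": "dom_1", "structural_class": "all-alpha"},
    {"id": "dom_2", "structural_class": "all-alpha"},
    {"id": "dom_3", "structural_class": "alpha/beta"},
]
# Hypothetical similar pairs with an attribute usable for edge thickness.
edges = [
    {"source": "dom_1", "target": "dom_2", "similarity": 0.82},
    {"source": "dom_2", "target": "dom_3", "similarity": 0.41},
]

def to_csv(rows, fieldnames):
    """Serialize a list of dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

node_table = to_csv(nodes, ["id", "structural_class"])
edge_table = to_csv(edges, ["source", "target", "similarity"])
print(edge_table)
# To save for import: open("edges.csv", "w").write(edge_table)
```

Keeping node attributes (for coloring) and edge attributes (for thickness) in separate tables mirrors the two kinds of visual configuration described above.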
Molecular viewers need to be configured: these are sophisti-
cated software tools, with many alternative settings. By configuring
the molecular viewer, one can display and highlight the relevant
parts in the protein structure. Popular molecular viewers are
PyMOL [84], UCSF Chimera [85], Jmol [86], VMD [87], and
recently NGL—a particularly fast web-based viewer [88]; for a
review of these and more, see [89]. There are two methods of
configuring molecular viewers: (1) manually, using the graphical
user interface (GUI) and (2) by running a script in the language
specific to that viewer. Configuring the viewer manually is easier for
a novice but far more tedious; configuring it via scripts requires
command of the scripting language but facilitates repeated visuali-
zations dramatically. To link the entities in Cytoscape with a molec-
ular viewer, one can install one of two Cytoscape apps: structureViz
or CyToStruct. structureViz is tightly coupled with UCSF Chi-
mera. In structureViz, node attributes can specify PDB names, so
that the corresponding pdb file opens in UCSF Chimera; the
molecular viewer can also be configured via its GUI. In contrast,
CyToStruct is suited for users who configure the molecular viewers
via scripts; it is very powerful in that it allows using any molecular
viewer, and within that viewer configuring anything that can be
specified via a script, or equivalently, computed with that software.
CyToStruct can run any molecular viewer (and any external
program in general) from all nodes and edges (a menu opens when
right-clicking on it), with scripts that are tailored to each node or
edge. To configure CyToStruct, the user has to specify the external
program, a template of script to be run, and a file with node- or
edge-specific data for that template. CyToStruct then creates the
runnable script by infusing the node- or edge-specific data into the
template and runs the molecular viewer with a copy of this script.
The source code of CyToStruct is publicly available
(https://bitbucket.org/sergeyn/cytostruct/wiki/Home), along with a series of
demos that users can rely on as a starting point. The demos include
visualization using the four popular molecular viewers (each with
their own syntax), configuring the visualization of complete struc-
tures, protein interfaces, structurally aligning multiple structures,
and selecting specific residues. CyToStruct can also be used within
the web-based version of Cytoscape (Cytoscape.js), to provide an
online visualization combining a network and a molecular viewer.
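The idea of infusing node-specific data into a script template can be sketched generically. The snippet below is not CyToStruct's actual configuration format; it only illustrates template filling with Python's `string.Template`, producing a short PyMOL-style command script per node. The PDB identifiers, colors, and residue numbers are taken from the case study and Fig. 2 of this chapter, while the template layout and the `node_data` table are assumptions.

```python
from string import Template

# Hypothetical script template; $-placeholders are filled per node.
SCRIPT_TEMPLATE = Template(
    "fetch $pdb_id\n"
    "hide everything\n"
    "show cartoon, $pdb_id\n"
    "color $color, $pdb_id\n"
    "show sticks, $pdb_id and resi $residue\n"
)

# Hypothetical node-specific data (one entry per network node).
node_data = {
    "4ntl": {"color": "orange", "residue": "144"},
    "4qhq": {"color": "blue", "residue": "143"},
}

def render_script(pdb_id):
    """Fill the template with the data of one node and return the script text."""
    return SCRIPT_TEMPLATE.substitute(pdb_id=pdb_id, **node_data[pdb_id])

script_4ntl = render_script("4ntl")
print(script_4ntl)
# The resulting text could be saved to a file and passed to the viewer.
```

Because the template is plain text, the same mechanism works for any viewer with a scripting language: only the template changes, not the node data.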
We present two examples for DIY navigators. The first is the
navigator that Nepomnyachiy et al. customized for a global view of
protein structure space [11]. The entities, or nodes in the network,
are 9710 SCOP domains (70% nonredundant set). These domains
were compared using the structural aligner SSM [68]; for suffi-
ciently meaningful alignments, Nepomnyachiy et al. calculated
measures of the similarity of the domains. Then, they define several
networks, each characterized by its edges, which connect all domain
pairs that were aligned with parameters better than some fixed
thresholds: a minimal alignment length (55, 75 residues), maximal
RMSD (2, 2.5, and 3 Å), and minimal percent sequence similarity
(30, 40, and 50%). By coloring the nodes based on their SCOP
class, all-alpha, all-beta, alpha/beta, and alpha+beta, they could see
that protein structure space has a continuous region (the alpha/
beta domains) and discrete regions [11]. The Cytoscape networks
provide a global view, but navigating in specific regions of structure
space is also interesting. Nepomnyachiy et al. link and configure the
molecular viewer using CyToStruct [82] to see the domains and
the alignments; they also package and distribute the data and configuration
files (http://cs.haifa.ac.il/~trachel/domain_motif_networks/),
allowing anyone to study protein structure space in this way.
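Defining several networks from one set of alignments amounts to filtering the same list with different threshold combinations. The sketch below uses the threshold values quoted above; the alignment records are invented, and treating the thresholds as inclusive is an assumption, since the text only says "better than".

```python
from itertools import product

# Hypothetical alignment records: aligned length (residues), RMSD (Angstrom),
# and percent sequence similarity for each pair of domains.
alignments = [
    {"pair": ("d1", "d2"), "length": 80, "rmsd": 1.8, "seq_sim": 55.0},
    {"pair": ("d2", "d3"), "length": 60, "rmsd": 2.4, "seq_sim": 35.0},
    {"pair": ("d3", "d4"), "length": 50, "rmsd": 2.9, "seq_sim": 45.0},
]

# Threshold values as listed in the text.
MIN_LENGTHS = (55, 75)
MAX_RMSDS = (2.0, 2.5, 3.0)
MIN_SEQ_SIMS = (30, 40, 50)

def edges_for(min_length, max_rmsd, min_seq_sim):
    """Edges of the network defined by one combination of thresholds."""
    return [a["pair"] for a in alignments
            if a["length"] >= min_length
            and a["rmsd"] <= max_rmsd
            and a["seq_sim"] >= min_seq_sim]

# One network (edge list) per threshold combination.
networks = {params: edges_for(*params)
            for params in product(MIN_LENGTHS, MAX_RMSDS, MIN_SEQ_SIMS)}

for params, edge_list in sorted(networks.items()):
    print(params, edge_list)
```

Comparing the edge lists across combinations shows how permissive thresholds connect more of the space, while strict ones leave only the closest relationships.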
2.9 Case Study
We present here a new example, where Cytoscape and CyToStruct
are used to navigate protein space for function inference. The
navigator helps because a careful examination of populated regions
in the protein universe can help decipher unknown qualities of
proteins found in these regions. Here, we demonstrate this using
substrate-binding proteins (SBPs) [90]. SBPs are involved in trans-
port of substrates into the cell, where their role is to recognize the
substrate and relay it to its transmembrane transporter. Although
they vary in size and share relatively low sequence similarity, they
share a similar, highly conserved, fold. In general, their shape is a
lung-like structure, formed of two structurally similar globular
domains, connected by a hinge. The hinge facilitates alteration
between substrate-free and substrate-bound conformations; sub-
strate binding to a cavity between the two domains brings them
closer to one another, into a bound, or “closed,” conformation.
[Fig. 1 color legend, substrates: Acid, Alanine, Choline, Cysteine, DNA, Fatty Acid, Ferrum, Glx/Asx, Glutathione, Heme, Leucine/Isoleucine, Lysine/Arginine, Metal, Methionine, Peptide, Phosphate, Siderophore, Spermidine/Putrescine, Sugar, Tungstate, Unknown]
Fig. 1 Navigating protein structure space to study proteins with unknown function. Left panel: network of
substrate-binding proteins. Each node represents a single PDB chain; two nodes are connected by an edge if
they share some sequential and structural similarity. The nodes are colored according to the substrate; see
color-code at the bottom. White nodes represent proteins of unknown function. Middle panel: zooming-in on
the top-right cluster. This cluster is composed mostly of amino acid binding proteins. Right panel: zooming-in
on one connected component. Violet nodes represent methionine binding proteins. 4ntl, represented here by a
white node encircled in orange, has no bound substrate, and its function is unknown. It is connected to the two
central nodes, 4qhq and 3tqw (encircled in blue and purple). The figure was created using Cytoscape [94]
A dataset of binding proteins was collected from the 70% NR
PDB, by using the website text search. This dataset was extended by
adding proteins that share at least 30% of their sequence, over a
segment of at least 35 residues, with an RMSD lower than 3.5 Å,
with the proteins in the initial dataset. Cytoscape generated the
network (Fig. 1, left panel), where each node represents a protein in
the dataset and two nodes are connected if the proteins are deemed
related (more than 30% sequence similarity, over more than 35 resi-
dues, with less than 3.5 Å RMSD). With this particular choice,
several clusters are formed, so that in general SBPs which bind
similar substrates (as evident in their PDB structures) belong to
the same cluster (Fig. 1, left panel). Thus, their binding preferences
and modes of interaction with the substrate can be predicted by the
cluster they are found in. For example, one cluster is formed by
SBPs that bind amino acids (Fig. 1, middle panel). A connected
component within this cluster contains SBPs that generally bind
methionine (Fig. 1, right panel). The substrate of one of these SBPs
(white, encircled in orange, pdb 4ntl) is unknown. However, in this
case a likely hypothesis is that it also binds methionine.
The sequence identity between the query and its neighbors is
less than 40%; thus this functional inference, which is in keeping
with the conjecture listed in CDD [52], is not trivial [91].
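The network construction and the neighbor-based inference described above can be sketched in a few lines of Python. This is only an illustrative sketch: the three thresholds are those quoted in the text, but the tuple layout of `pairs`, the function names, the chain label `toy1`, and all numerical values are assumptions made for the example, not data from the study.

```python
from collections import Counter, defaultdict

# Thresholds quoted in the text; everything else here is hypothetical.
MIN_IDENTITY = 30.0   # percent sequence similarity
MIN_LENGTH = 35       # aligned residues
MAX_RMSD = 3.5        # angstroms


def build_network(pairs):
    """Return an adjacency map containing only pairs that pass all three thresholds."""
    graph = defaultdict(set)
    for a, b, identity, length, rmsd in pairs:
        if identity > MIN_IDENTITY and length > MIN_LENGTH and rmsd < MAX_RMSD:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def suggest_substrate(graph, labels, query):
    """Return the most common substrate among annotated first neighbors of `query`."""
    votes = Counter(labels[n] for n in graph.get(query, ()) if n in labels)
    return votes.most_common(1)[0][0] if votes else None


# Toy input: (chain A, chain B, % similarity, aligned length, RMSD). Values are invented.
pairs = [
    ("4ntl", "4qhq", 38.0, 210, 2.1),
    ("4ntl", "3tqw", 35.0, 200, 2.4),
    ("4qhq", "3tqw", 45.0, 220, 1.5),
    ("4ntl", "toy1", 12.0, 40, 5.0),   # fails the thresholds, so no edge is added
]
labels = {"4qhq": "methionine", "3tqw": "methionine", "toy1": "choline"}

graph = build_network(pairs)
suggestion = suggest_substrate(graph, labels, "4ntl")
print(suggestion)
```

A majority vote over first neighbors is the simplest possible rule; as the text stresses, such a suggestion is a hypothesis to be examined structurally, not a final annotation.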
Fig. 2 Methionine binding in the SBPs 4ntl, 4qhq, and 3tqw. (A) Structural superposition of the 4ntl query
(orange) with 4qhq and 3tqw (blue and purple, respectively). The superposition is over the C-terminal lobe to
highlight the conformational change between the bound (closed; 4qhq and 3tqw) and unbound (open; query)
states of the SBPs. The bound methionine is shown as red spheres. (B) The methionine binding site in 4qhq.
Methionine is shown as a stick model, and the polar residues of the binding site are depicted as
wireframes. The hydrogen bonds that mediate methionine’s interactions with these residues and with water
molecules (red spheres) are marked as red dashed lines. The highly conserved Arg143 is also marked. (C) The
methionine binding site in 3tqw. The highly conserved Arg113, equivalent of Arg143 in panel B, is marked. (D)
Putative encounter complex between methionine and the query. Arg144 (depicted as wireframe) has the same
location and rotameric state as its equivalents: Arg143 of 4qhq and Arg113 of 3tqw. The dashed line shows
the putative hydrogen bond, which could form between the arginine and the methionine carbonyl group. The
figure was created using the PyMOL molecular viewer [84]
Using CyToStruct [82] and our molecular viewer of choice, we
can examine this hypothesis in detail. Reassuringly, comparison of
this query protein with its first neighbors in protein space (Fig. 1,
right panel, the two nodes at the center of the cluster, encircled in
blue and purple) supports this inference, as they share high struc-
tural similarity with the query (Fig. 2a). As both neighbors (pdb 4qhq
and pdb 3tqw) have a bound methionine in their PDB structure
(Fig. 2b, c), a superposition of the structures can even be used to
using ConSurf [92, 93], shows that the binding cavity is highly
conserved, providing further support for the inferred function and
binding mode. In particular, the three binding sites feature a highly
conserved arginine residue (conservation grade of 9 on a 1–9 scale).
Furthermore, in all three proteins, the arginine populates the exact
same rotameric state, which allows it to form a hydrogen bond with
the methionine substrate (Fig. 2b–d). In addition, water molecules
that participate in the binding are also found in all the structures.
However, not all the interactions that are found in the two bound
states have equivalents in the query, and the structural superposition
indicates that it is in an open conformation (Fig. 2a). This
suggests that binding may follow the induced fit model,
where methionine is initially recognized by the conserved arginine
residue in the open conformation. This interaction may induce a
shift of the protein to its closed conformation, where additional
residues interact with methionine. Further investigation is needed
to examine this suggestion.
3 Conclusions and Outlook
How did proteins emerge in evolution, and how do they evolve?
Theoretically, a protein could emerge and evolve by linking one
amino acid after another. Scholars believe that this approach is
doomed, because the vast majority of polypeptide chains would
not even fold. Thus, we presume that proteins emerged by mixing
and matching short amino acid fragments (peptides) from the
primordial soup, evolving by recombination, decoration, and muta-
tion. Lupas et al. wrote an insightful review of this [29]. While most
protein scientists would agree with this suggested scenario, the
mechanics and details of the process that gave rise to proteins,
and that govern their evolution, are yet to be understood.
This leads to two observations: (1) We can look for clues to
address these fundamental questions in current proteins by study-
ing the reuse patterns in all proteins of known structure. (2) We can
mine the evolutionary signal to identify common ancestry and
improve methods of protein similarity search, function annotation,
and design. For both of these, navigating in protein space can be
very useful.
References
1. Kolodny R, Pereyaslavets L, Samson AO, Levitt M (2012) On the universe of protein folds. Annu Rev Biophys 42:559. https://doi.org/10.1146/annurev-biophys-083012-130432
2. Ben-Tal N, Kolodny R (2014) Representation of the protein universe using classifications, maps, and networks. Israel J Chem 54:1286
3. Zeldovich KB, Shakhnovich EI (2008) Understanding protein evolution: from protein physics to Darwinian selection. Annu Rev Phys Chem 59:105–127
4. Trifonov EN, Berezovsky IN (2003) Evolutionary aspects of protein structure and folding. Curr Opin Struct Biol 13(1):110–114
5. Choi IG, Kim SH (2006) Evolution of protein structural classes and protein sequence families. Proc Natl Acad Sci U S A 103(38):14056–14061. https://doi.org/10.1073/pnas.0606239103
6. Dokholyan NV, Shakhnovich B, Shakhnovich EI (2002) Expanding protein universe and its origin from the biological big bang. Proc Natl Acad Sci 99(22):14132–14136. https://doi.org/10.1073/pnas.202497999
7. Alva V, Remmert M, Biegert A, Lupas AN, Söding J (2010) A galaxy of folds. Protein Sci 19(1):124–130. https://doi.org/10.1002/pro.297
8. Farías-Rico JA, Schmidt S, Höcker B (2014) Evolutionary relationship of two ancient protein superfolds. Nat Chem Biol 10(9):710–715. https://doi.org/10.1038/nchembio.1579
9. Nepomnyachiy S, Ben-Tal N, Kolodny R (2017) Complex evolutionary footprints revealed in an analysis of reused protein segments of diverse lengths. Proc Natl Acad Sci U S A 114:11703
10. Skolnick J, Arakaki AK, Lee SY, Brylinski M (2009) The continuity of protein structure space is an intrinsic property of proteins. Proc Natl Acad Sci 106:15690. https://doi.org/10.1073/pnas.0907683106
11. Nepomnyachiy S, Ben-Tal N, Kolodny R (2014) Global view of the protein universe. Proc Natl Acad Sci 111:11691. https://doi.org/10.1073/pnas.1403395111
12. Mackenzie CO, Zhou J, Grigoryan G (2016) Tertiary alphabet for the observable protein structural universe. Proc Natl Acad Sci U S A 113(47):E7438–E7447
13. Kolodny R, Petrey D, Honig B (2006) Protein structure comparison: implications for the nature of ‘fold space’, and structure and function prediction. Curr Opin Struct Biol 16(3):393–398
14. Osadchy M, Kolodny R (2011) Maps of protein structure space reveal a fundamental relationship between protein structure and function. Proc Natl Acad Sci 108(30):12301–12306. https://doi.org/10.1073/pnas.1102727108
15. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28(1):235–242
16. Koehl P (2006) Protein structure classification. In: Reviews in Computational Chemistry. John Wiley & Sons, Inc., New York, pp 1–55. https://doi.org/10.1002/0471780367.ch1
17. Ponting CP, Russell RR (2002) The natural history of protein domains. Annu Rev Biophys Biomol Struct 31(1):45–71. https://doi.org/10.1146/annurev.biophys.31.082901.134314
18. Vogel C, Berzuini C, Bashton M, Gough J, Teichmann SA (2004) Supra-domains: evolutionary units larger than single protein domains. J Mol Biol 336(3):809–823. https://doi.org/10.1016/j.jmb.2003.12.026
19. Kolodny R, Koehl P, Guibas L, Levitt M (2002) Small libraries of protein fragments model native protein structures accurately. J Mol Biol 323(2):297–307
20. Vanhee P, Verschueren E, Baeten L, Stricher F, Serrano L, Rousseau F, Schymkowitz J (2011) BriX: a database of protein building blocks for structural analysis, modeling and design. Nucleic Acids Res 39(Suppl 1):D435–D442
21. Davis FP, Sali A (2005) PIBASE: a comprehensive database of structurally defined protein interfaces. Bioinformatics 21(9):1901–1907
22. Vanhee P, Reumers J, Stricher F, Baeten L, Serrano L, Schymkowitz J, Rousseau F (2009) PepX: a structural database of non-redundant protein–peptide complexes. Nucleic Acids Res 38(Suppl 1):D545–D551
23. Fernandez-Fuentes N, Dybas JM, Fiser A (2010) Structural characteristics of novel protein folds. PLoS Comput Biol 6(4):e1000750
24. Ovchinnikov S, Park H, Varghese N, Huang P-S, Pavlopoulos GA, Kim DE, Kamisetty H, Kyrpides NC, Baker D (2017) Protein structure determination using metagenome sequence data. Science 355(6322):294–298
25. Pieper U, Eswar N, Davis FP, Braberg H, Madhusudhan MS, Rossi A, Marti-Renom M, Karchin R, Webb BM, Eramian D (2006) MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res 34(Suppl 1):D291–D295
26. Lo Conte L, Ailey B, Hubbard TJP, Brenner SE, Murzin AG, Chothia C (2000) SCOP: a structural classification of proteins database. Nucleic Acids Res 28(1):257–259
27. Orengo C, Michie A, Jones S, Jones D, Swindells M, Thornton J (1997) CATH - a hierarchic classification of protein domain structures. Structure 5(8):1093–1108
28. Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, Shi S, Kim B-H, Grishin NV (2014) ECOD: an evolutionary classification of protein domains. PLoS Comput Biol 10(12):e1003926. https://doi.org/10.1371/journal.pcbi.1003926
29. Lupas AN, Ponting CP, Russell RB (2001) On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J Struct Biol 134(2–3):191–203
30. Söding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21(7):951–960
31. Eddy SR (2009) A new generation of homology search tools based on probabilistic inference. Genome Inform 1:205–211
32. Alva V, Söding J, Lupas AN (2016) A vocabulary of ancient peptides at the origin of folded proteins. elife 4:e09410
33. Kosloff M, Kolodny R (2008) Sequence-similar, structure-dissimilar protein pairs in the PDB. Proteins 71(2):891–902
34. Narunsky A, Nepomnyachiy S, Ashkenazy H, Kolodny R, Ben-Tal N (2015) ConTemplate suggests possible alternative conformations for a query protein of known structure. Structure 23(11):2162–2170
35. Holm L, Sander C (1996) Mapping the protein universe. Science 273(5275):595–603
36. Skolnick J, Gao M, Zhou H (2014) On the role of physics and evolution in dictating protein structure and function. Israel J Chem 54(8–9):1176–1188
37. Hasegawa H, Holm L (2009) Advances and pitfalls of protein structural alignment. Curr Opin Struct Biol 19(3):341–348
38. Kolodny R, Koehl P, Levitt M (2005) Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J Mol Biol 346(4):1173–1188
39. Kolodny R, Linial N (2004) Approximate protein structural alignment in polynomial time. Proc Natl Acad Sci U S A 101(33):12201–12206
40. Carugo O (2007) Recent progress in measuring structural similarity between proteins. Curr Protein Pept Sci 8(3):241
41. Yanover C, Vanetik N, Levitt M, Kolodny R, Keasar C (2014) Redundancy-weighting for better inference of protein structural features. Bioinformatics 30(16):2295–2301
42. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13):1658–1659
43. Wang G, Dunbrack RL (2003) PISCES: a protein sequence culling server. Bioinformatics 19(12):1589–1591. https://doi.org/10.1093/bioinformatics/btg224
44. Choi I-G, Kim S-H (2007) Global extent of horizontal gene transfer. Proc Natl Acad Sci 104(11):4489–4494. https://doi.org/10.1073/pnas.0611557104
45. Orengo CA, Flores TP, Taylor WR, Thornton JM (1993) Identification and classification of protein fold families. Protein Eng 6(5):485–500. https://doi.org/10.1093/protein/6.5.485
46. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR (2014) Pfam: the protein families database. Nucleic Acids Res 42:D222. https://doi.org/10.1093/nar/gkt1223
47. Pearl FMG, Sillitoe I, Orengo CA (2015) Protein structure classification. In: eLS. John Wiley & Sons, Ltd., New York. https://doi.org/10.1002/9780470015902.a0003033.pub3
48. Levitt M, Chothia C (1976) Structural patterns in globular proteins. Nature 261(5561):552–558
49. Holland TA, Veretnik S, Shindyalov IN, Bourne PE (2006) Partitioning protein structures into domains: why is it so difficult? J Mol Biol 361(3):562–590
50. Hadley C, Jones DT (1999) A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure 7(9):1099–1112
51. Day R, Beck DAC, Armen RS, Daggett V (2003) A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci 12(10):2150–2160. https://doi.org/10.1110/ps.0306803
52. Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR (2010) CDD: a conserved domain database for the functional annotation of proteins. Nucleic Acids Res 39(Suppl 1):D225–D229
53. Kelley LA, Sternberg MJ (2015) Partial protein domains: evolutionary insights and bioinformatics challenges. Genome Biol 16(1):1–3. https://doi.org/10.1186/s13059-015-0663-8
54. Veretnik S, Gu J, Wodak S (2009) Identifying structural domains in proteins. In: Gu G, Bourne P (eds) Structural bioinformatics, 2nd edn. Wiley-Blackwell, Hoboken, NJ, pp 485–513
55. Schaeffer RD, Jonsson AL, Simms AM, Daggett V (2011) Generation of a consensus protein domain dictionary. Bioinformatics 27(1):46–54. https://doi.org/10.1093/bioinformatics/btq625
56. Csaba G, Birzele F, Zimmer R (2009) Systematic comparison of SCOP and CATH: a new gold standard for protein structure analysis. BMC Struct Biol 9(1):23
57. Redfern OC, Harrison A, Dallman T, Pearl FM, Orengo CA (2007) CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures. PLoS Comput Biol 3(11):e232. https://doi.org/10.1371/journal.pcbi.0030232
58. Zhou H, Xue B, Zhou Y (2007) DDOMAIN: dividing structures into domains using a normalized domain–domain interaction profile. Protein Sci 16(5):947–955. https://doi.org/10.1110/ps.062597307
59. Alexandrov N, Shindyalov I (2003) PDP: protein domain parser. Bioinformatics 19(3):429–430. https://doi.org/10.1093/bioinformatics/btg006
60. Krishna SS, Grishin NV (2005) Structural drift: a possible path to protein fold change. Bioinformatics 21(8):1308–1310
61. Pascual-García A, Abia D, Ortiz ÁR, Bastolla U (2009) Cross-over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures. PLoS Comput Biol 5(3):e1000331. https://doi.org/10.1371/journal.pcbi.1000331
62. Edwards H, Deane CM (2015) Structural bridges through fold space. PLoS Comput Biol 11(9):e1004466
63. Fox NK, Brenner SE, Chandonia J-M (2014) SCOPe: structural classification of proteins—extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res 42(D1):D304–D309. https://doi.org/10.1093/nar/gkt1240
64. Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG (2013) SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res 42:D310. https://doi.org/10.1093/nar/gkt1242
65. Ellson J, Gansner E, Koutsofios L, North SC, Woodhull G (2001) Graphviz—open source graph drawing tools. In: International symposium on graph drawing. Springer, Heidelberg, pp 483–484
66. Prlić A, Bliven S, Rose PW, Bluhm WF, Bizon C, Godzik A, Bourne PE (2010) Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics 26(23):2983–2985. https://doi.org/10.1093/bioinformatics/btq572
67. Krissinel E, Henrick K (2003) Protein structure comparison in 3D based on secondary structure matching (SSM) followed by C-alpha alignment, scored by a new structural similarity function. Proceedings of the 5th International Conference on Molecular Structural Biology, Vienna, vol. 88
68. Krissinel E, Henrick K (2004) Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D 60(Pt 12 Pt 1):2256–2268
69. Madej T, Lanczycki CJ, Zhang D, Thiessen PA, Geer RC, Marchler-Bauer A (2014) MMDB and VAST+: tracking structural similarities between macromolecular complexes. Nucleic Acids Res 42:D297. https://doi.org/10.1093/nar/gkt1208
70. Mezulis S, Sternberg MJE, Kelley LA (2016) PhyreStorm: a web server for fast structural searches against the PDB. J Mol Biol 428(4):702–708. https://doi.org/10.1016/j.jmb.2015.10.017
71. Zhang Y, Skolnick J (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33(7):2302–2309. https://doi.org/10.1093/nar/gki524
72. Wiederstein M, Gruber M, Frank K, Melo F, Sippl MJ (2014) Structure-based characterization of multiprotein complexes. Structure 22(7):1063–1070. https://doi.org/10.1016/j.str.2014.05.005
73. Berezovsky IN, Guarnera E, Zheng Z (2017) Basic units of protein structure, folding, and function. Prog Biophys Mol Biol 128:85–99. https://doi.org/10.1016/j.pbiomolbio.2016.09.009
74. Menke M, Berger B, Cowen L (2008) Matt: local flexibility aids protein multiple structure alignment. PLoS Comput Biol 4(1):e10
75. Shindyalov I, Bourne P (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 11(9):739–747
76. Ortiz A, Strauss C, Olmea O (2002) MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 11(11):2606–2621
77. Tung CH, Huang JW, Yang JM (2007) Kappa-alpha plot derived structural alphabet and BLOSUM-like substitution matrix for rapid search of protein structure database. Genome Biol 8(3):R31
78. Budowski-Tal I, Nov Y, Kolodny R (2010) FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately. Proc Natl Acad Sci U S A 107(8):3481–3486. https://doi.org/10.1073/pnas.0914097107
79. Petrey D, Xiang Z, Tang CL, Xie L, Gimpelev M, Mitros T, Soto CS, Goldsmith-Fischman S, Kernytsky A, Schlessinger A, Koh IY, Alexov E, Honig B (2003) Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins 53(Suppl 6):430–435. https://doi.org/10.1002/prot.10550
80. Subbiah S, Laurents DV, Levitt M (1993) Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr Biol 3(3):141–148
81. Saito R, Smoot ME, Ono K, Ruscheinski J, Wang P-L, Lotia S, Pico AR, Bader GD, Ideker T (2012) A travel guide to Cytoscape plugins. Nat Methods 9(11):1069–1076
82. Nepomnyachiy S, Ben-Tal N, Kolodny R (2015) CyToStruct: augmenting the network visualization of cytoscape with the power of molecular viewers. Structure 23(5):941–948
83. Morris JH, Huang CC, Babbitt PC, Ferrin TE (2007) structureViz: linking Cytoscape and UCSF chimera. Bioinformatics 23(17):2345–2347. https://doi.org/10.1093/bioinformatics/btm329
84. Schrodinger, LLC (2010) The PyMOL molecular graphics system, Version 1.3r1. Schrodinger, LLC, New York
85. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE (2004) UCSF chimera—a visualization system for exploratory research and analysis. J Comput Chem 25(13):1605–1612
86. Jmol: an open-source java viewer for chemical structure in 3D. http://www.jmol.org/
87. Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14(1):33–38
88. Rose AS, Hildebrand PW (2015) NGL viewer: a web application for molecular visualization. Nucleic Acids Res 43(Web Server issue):W576–W579. https://doi.org/10.1093/nar/gkv402
89. O’Donoghue SI, Goodsell DS, Frangakis AS, Jossinet F, Laskowski RA, Nilges M, Saibil HR, Schafferhans A, Wade RC, Westhof E (2010) Visualization of macromolecular structures. Nat Methods 7:S42–S55
90. Berntsson RP-A, Smits SH, Schmitt L, Slotboom D-J, Poolman B (2010) A structural classification of substrate-binding proteins. FEBS Lett 584(12):2606–2617
91. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A (2013) A large-scale evaluation of computational protein function prediction. Nat Methods 10(3):221–227
92. Glaser F, Pupko T, Paz I, Bell RE, Bechor-Shental D, Martz E, Ben-Tal N (2003) ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics 19(1):163–164
93. Ashkenazy H, Abadi S, Martz E, Chay O, Mayrose I, Pupko T, Ben-Tal N (2016) ConSurf 2016: an improved methodology to estimate and visualize evolutionary conservation in macromolecules. Nucleic Acids Res 44(W1):W344–W350
94. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504. https://doi.org/10.1101/gr.1239303
Chapter 13
A Graph-Based Approach for Detecting Sequence Homology
in Highly Diverged Repeat Protein Families
Jonathan N. Wells and Joseph A. Marsh
Abstract
Reconstructing evolutionary relationships in repeat proteins is notoriously difficult due to the high degree
of sequence divergence that typically occurs between duplicated repeats. This is complicated further by the
fact that proteins with a large number of similar repeats are more likely to produce significant local sequence
alignments than proteins with fewer copies of the repeat motif. Furthermore, biologically correct sequence
alignments are sometimes impossible to achieve in cases where insertion or translocation events disrupt the
order of repeats in one of the sequences being aligned. Combined, these attributes make traditional
phylogenetic methods for studying protein families unreliable for repeat proteins, due to the dependence
of such methods on accurate sequence alignment.
We present here a practical solution to this problem, making use of graph clustering combined with the
open-source software package HH-suite, which enables highly sensitive detection of sequence relationships.
Multiple rounds of homology searches via alignment of profile hidden Markov models generate large
sets of related proteins. By representing the relationships between proteins in these sets as
graphs, subsequent clustering with the Markov cluster algorithm enables robust detection of repeat protein
subfamilies.
Key words Repeat proteins, Sequence homology, Graph clustering, Profile-HMM alignment, Protein
families, Evolution
1 Introduction
Proteins comprising tandem structural motif repeats are ubiquitous
and can be separated into five broad classes [1, 2]; these range from
low-complexity domains with repeat units less than ten residues in
length to large “beads on a string” type proteins such as titin
(Fig. 1). Due to their sequence diversity and functional importance,
the two most important classes of repeat proteins are those that
form open, solenoid structures, such as leucine-rich repeat (LRR)
proteins [3], and those that form closed, toroidal structures, such as
the WD40 domain [4].
[Figure 1 graphic: representative structures ordered by repeat unit length (amino acids) across Classes I to V:
poly-X repeats; collagen G-X-Y repeats (PDB 1BKV); Pds5B HEAT repeats (PDB 5HDT); And-1 N-terminus WD40
repeats (PDB 5GVA); titin repeats, domains 9-11 (PDB 5JDD). A second scale shows the estimated performance
of the method.]
Fig. 1 Repeat protein classes. Repeat proteins can be roughly categorized according to the length of the
repeated sequence motif. The simplest repeats are long tracts of a single amino acid, most
commonly alanine or glutamine. These are often associated with disease, most famously the poly-Q tracts in
Huntington’s disease, but are nonetheless prevalent and have recently been shown to play an important role in
facilitating rapid protein divergence in eukaryotes [28]. The most diverse classes, both structurally and
functionally, are III and IV, which include families of solenoid proteins such as the HEAT repeat containing
Hawk family [22], of which Pds5B is a member, and ubiquitous domains such as the WD40 beta-propeller
[4]. At the other extreme are proteins such as titin, which comprises hundreds of repeated domains joined by
short linker regions. The method described here is likely to perform best on those proteins in classes III and IV
While common in all domains of life, repeat proteins are partic-
ularly prevalent in eukaryotes, and many families are ancient and
highly conserved [5, 6]. However, the precise sequence of the
repeating unit tends to be of secondary importance to its structure.
Indeed, the sequence similarity between many homologous repeat
proteins approaches that of random sequences, even while their
structures remain highly similar [7, 8]. Additional repeats tend to
be gained via internal duplications [9], and provided the overall
structure of the protein is not disrupted, the insertion/deletion
rate can be quite rapid, particularly in the case of proteins where
the repeats are structurally independent of each other [10].
An observation with important implications for this method is that,
despite considerable evidence for the high degree of conservation of
repeat proteins across eukaryotes, within individual proteins, the
repeats themselves are often highly divergent. Elegant work from
the Koonin lab has shown that this is because recent duplications are
generally subject to a period of relaxed purifying selection, leading to
rapid sequence divergence prior to fixation [11].
This divergence presents serious difficulties when attempting to
reconstruct the evolutionary history of these proteins. Accurate
phylogenetic trees are critically dependent on good sequence align-
ments, and in the case of repeat proteins, these are often impossible
to produce, particularly when multiple internal insertion or dele-
tion events have occurred. Numerous attempts have been made to
develop methods that can be used to improve the quality of repeat
proteins’ multiple sequence alignments, and thus phylogenetic
trees [12–16]. However, an independent benchmark of several
such programs has shown them to be highly dependent on the
type of repeat unit under investigation and concluded that charac-
teristics of detected repeats (unit length, number of repeats, etc.)
varied more with the algorithm being used than with the underly-
ing data [17].
In this chapter, we describe a conceptually simple method that
can be used to reveal within-species evolutionary relationships
between highly divergent repeat proteins when phylogenetic analy-
sis is not possible. This method makes extensive use of the open-
source software HH-suite [18, 19], available from the Söding
research group or GitHub (https://github.com/soedinglab/hh-suite),
and also the Markov cluster (MCL) algorithm [20], which
has been in widespread use in computational biology, including for
detection of protein families [21]. Though computationally inten-
sive, the method that follows can be applied straightforwardly by a
user familiar with UNIX-based command line applications and with
basic programming experience.
Briefly, multiple rounds of homology searches in HH-suite are
used to generate a large set of putatively related repeat proteins
from those of initial interest. This larger set of proteins is repre-
sented as a graph, in which edges between proteins are weighted
according to the rank of the best alignment between them, relative
to all other homologous proteins detected. By clustering this graph
using the MCL algorithm, different subfamilies of repeat proteins
are revealed. These clusters are robust to parameter changes, and
their statistical significance can be assessed with permutation tests
of the underlying graph.
The principal problem motivating the development of this
method lies in the fact that proteins sharing multiple copies of the
same repeat motif may produce alignments that erroneously sug-
gest a close evolutionary relationship where none exists. As an
example, one can imagine a scenario in which a highly diverged
repeat protein produces high-scoring alignments with a larger protein
containing many well-conserved copies of the canonical motif,
simply because the larger protein provides many opportunities for
such alignments. We reasoned that the likelihood of obtaining false
positives in this manner could be reduced by performing reciprocal
searches, using both proteins in any given pair as queries. Unlike
the pair in the scenario just described, pairs directly related to each
other would be expected to produce mutually high-ranking align-
ments, regardless of which was used as the query.
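The reciprocal-search reasoning above can be illustrated with a small Python sketch. It assumes that each query yields a hit list ordered from best to worst alignment, and it combines the two directional ranks by a simple product; this scoring formula, the function names, and the toy protein names are assumptions chosen for illustration, not the weighting scheme of the published protocol.

```python
def rank_score(hits, target):
    """Score in (0, 1]: 1.0 for the top-ranked hit, lower for later hits, 0.0 if absent."""
    if target not in hits:
        return 0.0
    return 1.0 - hits.index(target) / len(hits)


def mutual_weight(search_results, a, b):
    """Combine both directional rank scores; zero unless each search finds the partner."""
    forward = rank_score(search_results.get(a, []), b)
    backward = rank_score(search_results.get(b, []), a)
    return forward * backward


# Toy hit lists (best hit first). "big" has many repeats, so it tops the list for
# "small", but "small" ranks last when "big" itself is used as the query.
results = {
    "small": ["big", "sib"],
    "sib": ["small", "big"],
    "big": ["other1", "other2", "other3", "small"],
}

w_sibling = mutual_weight(results, "small", "sib")
w_large = mutual_weight(results, "small", "big")
print(w_sibling, w_large)
```

In this toy case the large protein is the single best hit for the small one, yet the reciprocal search down-weights that edge relative to the edge between the two mutually high-ranking proteins.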
By generating a network out of many such reciprocal searches,
families of related repeat proteins can then be obtained by clustering.
The underlying assumption is that, on
average, closely related proteins will produce more high-ranking
alignments within their immediate family than they will with pro-
teins outside the family, even if individual alignments are sometimes
misleading.
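To make the clustering step more concrete, the following is a toy, pure-Python sketch of the two operations at the heart of Markov clustering: expansion (squaring the column-stochastic matrix) and inflation (raising entries to a power and renormalizing). It is not the MCL program itself, which is far more efficient and should be used in practice; the function names, the fixed iteration count, the pruning threshold, and the example graph are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Scale each column so that it sums to one."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)] for i in range(n)]


def mcl(weights, inflation=2.0, iterations=50):
    """Cluster a symmetric weight matrix by alternating expansion and inflation."""
    n = len(weights)
    # Add self-loops, then turn every column into a probability distribution.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: matrix squaring, i.e. two-step random walks.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones, renormalize.
        m = normalize_columns([[x ** inflation for x in row] for row in m])
    # Rows that retain flow are attractors; their non-zero columns form a cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two tightly connected triangles (0-1-2 and 3-4-5) joined by one weak edge (2-3).
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
         (2, 3): 0.1}
weights = [[0.0] * 6 for _ in range(6)]
for (i, j), w in edges.items():
    weights[i][j] = weights[j][i] = w

clusters = mcl(weights, inflation=2.0)
print(clusters)
```

Higher inflation values generally produce finer-grained clusters, which is why this parameter is the main one to explore when checking how robust the resulting subfamilies are.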
We have previously used this approach to demonstrate the
common ancestry of members of the Hawk family (HEAT repeat
proteins associated with kleisins)—a set of widely conserved pro-
teins that arose in the last common ancestor of eukaryotes [22] and
which play an essential role in regulating the activity of condensin
and cohesin. The protocol as presented here is the same as that used
to describe the Hawk family, but is directly applicable to other
families of repeat proteins with qualitatively similar structures, for
example, leucine-rich repeat proteins (Fig. 2). For proteins from
different repeat classes, such as beta-propellers, adjustments to
certain method parameters will likely be needed, as discussed in
the Notes section.
2 Methods
2.1 Preliminary Setup and Database Building
HH-suite, specifically the command-line programs hhblits and
hhsearch, works by aligning profile hidden Markov models (profile
HMMs) using the Viterbi algorithm [23, 24]. Each profile HMM is
essentially a condensed version of a multiple sequence alignment,
unique to the protein of interest. Since the quantity of information
in profile HMMs is much greater than in the corresponding raw
amino acid sequences, homologous relationships between proteins
can be detected with much higher sensitivity than with methods such as
BLAST or PSI-BLAST [25, 26].
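The idea that a profile condenses a multiple sequence alignment can be illustrated with per-column residue frequencies. This is a deliberately simplified sketch: a real profile HMM, as built by HH-suite, also models insertion and deletion transitions and applies pseudocounts and sequence weighting, none of which is shown here. The function name and the toy alignment are assumptions.

```python
from collections import Counter


def column_profile(alignment):
    """Return, for each alignment column, the residue frequencies (gaps ignored)."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


# Toy alignment of four equally long sequences; "-" marks a gap.
msa = ["LRNL", "LKNL", "L-DL", "IRNL"]
profile = column_profile(msa)
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

Comparing two such position-specific tables carries more information than comparing two single sequences, which is the intuition behind the gain in sensitivity described above.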
Before building the network, a search database must first be
built for the species of interest. This is by some margin the most
time-consuming step in this method, but fortunately only needs
doing once, after which it can be saved for reuse. To begin with, the
UniProt database itself must be clustered and converted into a set
of profile HMMs; pre-compiled versions can be downloaded
directly from the Söding lab (http://wwwuser.gwdg.de/~compbiol/data/hhsuite/),
and using these is strongly recommended.
Fig. 2 Example network showing leucine-rich repeat subfamilies. Starting with leucine-rich repeat protein
1 (LRR1, not shown) to initiate searches, a network showing different subgroups within the large and diverse
LRR family was generated, using the protocol that follows, with an MCL inflation parameter of 3.0 for the final
clustering. Several known families are recapitulated here, most notably the Toll-like receptors, highlighted
top-left. The networks generated are very dense, with the majority of proteins being connected to tens to
hundreds of others. To aid visualization, only those with a reported true positive probability of 1.0 are shown.
Darker edges represent higher mutual ranks
The species-specific database is then built using the clustered
UniProt database. As the practical details of this step are beyond the
scope of this method description, readers are instead referred to the
excellent documentation on the matter provided with the
HH-suite package. For smaller proteomes consisting of fewer
than 10,000 sequences, this takes about a day on a powerful desktop
or laptop, but for larger proteomes, such as those of mammals,
it is simpler to build the database in batches on a computer
cluster.
2.2 Generating Homology Networks
Once the required species databases have been built, we are ready to
begin carrying out the sequence searches needed to generate our
network. This can be initiated using either multiple candidate
sequences or just a single representative of the family of interest
(see Note 1 for additional information on selecting seed sequences).

For each sequence, generate a profile HMM by running hhblits
against the clustered UniProt database, with the output format set
to hhm, or to a3m if the multiple sequence alignment is
required. Multiple iterations can be carried out for increased
sensitivity, but the default value of two iterations is almost certainly
sufficient at this stage.
Next, use hhsearch to search the species-specific database that
you generated previously, with the multi-species profile HMM from
the first step as the query. This is equivalent to running hhblits
with a single iteration, except that hhblits carries out some
additional steps. The species-specific results from this step should
be saved in the default hhresults format; the profile HMMs and
sequence alignments are not required. This file will contain up to
500 proteins that have been identified as sharing significant sequence
similarity (assuming default hhsearch settings). The amino acid
sequence of each of these proteins must then be downloaded, a
step most easily achieved programmatically.
Once these sequences have been downloaded, the first round of
homologue searches is complete.
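The two search steps for a single query can be sketched as a pair of command lines assembled in Python. This is a minimal illustration, not the authors' pipeline: the function name, file names, and database paths are hypothetical, and the options used (`-i`, `-d`, `-n`, `-ohhm`, `-o`) should be checked against the help text of the installed HH-suite version.

```python
from pathlib import Path


def build_search_commands(query_fasta, uniprot_db, species_db, out_dir, iterations=2):
    """Return (hhblits_cmd, hhsearch_cmd) for one query sequence.

    All paths are placeholders chosen for illustration.
    """
    stem = Path(query_fasta).stem
    hhm_path = str(Path(out_dir) / f"{stem}.hhm")  # profile HMM written by hhblits
    hhr_path = str(Path(out_dir) / f"{stem}.hhr")  # results file written by hhsearch
    # Step 1: build a profile HMM against the clustered UniProt database.
    hhblits_cmd = ["hhblits", "-i", str(query_fasta), "-d", str(uniprot_db),
                   "-n", str(iterations), "-ohhm", hhm_path]
    # Step 2: search the species-specific database with that profile HMM.
    hhsearch_cmd = ["hhsearch", "-i", hhm_path, "-d", str(species_db), "-o", hhr_path]
    return hhblits_cmd, hhsearch_cmd
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; the same function can then be reused for every sequence downloaded in the first round.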
Each of these new sequences is then subjected to the same
procedure as before, i.e., searched against the clustered UniProt
database to generate a profile HMM and then searched against the
species-specific database. However, once the results files from this
second round of searches have been acquired, it is no longer neces-
sary to download the new protein sequences that have appeared. At
this point, all the proteins that will be represented in the final
network are present, along with many more which will be
discarded.
The set of results files must now be processed and converted
into graph format. Each protein, whether used as a query sequence
or returned as a hit, can be represented as a node in the graph.
Similarly, the significant alignments between each query protein
and the hits can be represented as edges from the former to the
latter, weighted by the rank of the alignment. Importantly, many
edges will be redundant, corresponding to alignments between
different sections of the repetitive sequence; in these cases, simply
use the highest-ranking alignment between the protein pair and
discard the rest.
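Removing redundant edges can be sketched as follows, assuming each results file has already been parsed into (rank, hit, expect value, probability) tuples. The parsing itself is not shown, the protein names are borrowed from Fig. 3, the numerical values are invented for illustration, and skipping self-hits is an assumption rather than a step stated in the text.

```python
def best_alignments(query, hits):
    """Keep only the best-ranked alignment for each (query, hit) pair.

    hits: iterable of (rank, hit_id, evalue, probability); rank 1 is the best.
    Returns {(query, hit_id): (rank, evalue, probability)}.
    """
    best = {}
    for rank, hit_id, evalue, probability in hits:
        if hit_id == query:
            continue  # assumption: a protein's alignment to itself is not an edge
        key = (query, hit_id)
        if key not in best or rank < best[key][0]:
            best[key] = (rank, evalue, probability)
    return best


# Two alignments to Ycg1 (different repeat sections); only rank 1 survives.
example_hits = [(1, "Ycg1", 1e-30, 1.0), (2, "Scc2", 1e-10, 0.99), (3, "Ycg1", 1e-5, 0.80)]
edges = best_alignments("Scc3", example_hits)
```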
The resulting graph will be directed, and likely quite large. In
order to reduce the size, we can filter out edges and nodes that do
not meet certain significance criteria. The choice of these thresh-
olds is partly case-dependent (see Note 2), but for demonstration
purposes, discard all edges with an expect value of greater than 0.01
(thus controlling the false-positive rate) and a true positive proba-
bility less than 0.15. These values are given with each alignment in
the results files for query proteins.
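The demonstration thresholds can be applied with a small filter such as the one below. It is a sketch under two assumptions: an edge is kept only if it satisfies both criteria, and probabilities have been converted to the 0 to 1 scale used in the text (HH-suite result files print them as percentages). The edge dictionary layout is hypothetical.

```python
MAX_EVALUE = 0.01        # demonstration threshold from the text
MIN_PROBABILITY = 0.15   # demonstration threshold from the text, as a fraction


def passes_thresholds(evalue, probability,
                      max_evalue=MAX_EVALUE, min_probability=MIN_PROBABILITY):
    """True if an alignment is significant enough to remain in the network."""
    return evalue <= max_evalue and probability >= min_probability


def filter_edges(edges):
    """edges: {(query, hit): (rank, evalue, probability)} -> filtered copy."""
    return {pair: values for pair, values in edges.items()
            if passes_thresholds(values[1], values[2])}
```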
Since some proteins will have many significant homologues and
others very few, it is necessary to normalize the ranks (and therefore
the edge weights) of each alignment. Do so according to the
following formula:
$$ f(r) = \frac{r_{\max} - r_{\min}}{99r + r_{\max} - 100\,r_{\min}}, \qquad 1 \le r \le 500 $$
where $r$ is the rank of the alignment/edge in question and $r_{\max}$ and
$r_{\min}$ are the maximum and minimum ranks in the result file describing
the alignment. This ensures that the normalized edge weights
lie between 1.0 and 0.01, with the former being the best possible
rank and the latter the worst (see Note 3 for alternative edge
weighting possibilities).
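As a quick check of the normalization, the function below evaluates f(r) = (r_max - r_min) / (99r + r_max - 100 r_min). It is a minimal sketch; the handling of a file in which all ranks are equal (where the formula gives 0/0) is an assumption, not something the text specifies.

```python
def normalize_rank(r, r_min, r_max):
    """Map a rank onto an edge weight between 1.0 (best) and 0.01 (worst)."""
    if r_max == r_min:
        return 1.0  # assumption: a lone alignment receives the best weight
    return (r_max - r_min) / (99 * r + r_max - 100 * r_min)


best = normalize_rank(1, 1, 500)     # r = r_min
worst = normalize_rank(500, 1, 500)  # r = r_max
```

Substituting r = r_min makes the denominator equal to the numerator, and substituting r = r_max makes it one hundred times larger, which is where the bounds of 1.0 and 0.01 come from.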
To simplify the subsequent clustering step, we now need to
make the network undirected. At this point, each pair of connected
proteins will have either two edges between them or a single edge;
the latter can occur either because an alignment did not meet the
significance thresholds imposed on it or, much more commonly,
because one of the proteins involved only appeared in the second
round of searches and was thus never queried. Since these proteins
are only of secondary importance to the family of interest, if a node
in the network has a degree of less than two, then discard it. For
each pair of remaining nodes, collapse the edges between them into
a single, undirected edge using the geometric mean. Since the
geometric mean is never greater than the arithmetic mean, and is
lower whenever the two weights differ, this avoids giving
too much weight to high-ranking but low-significance alignments
from proteins with few detectable homologues.
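One possible reading of this step is sketched below: pairs connected in only one direction are dropped (following the "Discard 1-way edges" label in Fig. 3), and reciprocal pairs are collapsed with the geometric mean of their two normalized weights. The data layout and function name are hypothetical, and other interpretations of the degree criterion are possible.

```python
import math


def collapse_to_undirected(directed):
    """directed: {(source, target): weight} -> {(a, b): weight} with a < b.

    Only reciprocal pairs are kept; each receives the geometric mean
    of the two directed weights.
    """
    undirected = {}
    for (a, b), w_ab in directed.items():
        if a == b:
            continue
        w_ba = directed.get((b, a))
        if w_ba is None:
            continue  # one-way edge: discarded under this interpretation
        undirected[tuple(sorted((a, b)))] = math.sqrt(w_ab * w_ba)
    return undirected


network = collapse_to_undirected({("A", "B"): 1.0, ("B", "A"): 0.25, ("A", "C"): 0.5})
```

In this toy input the reciprocal pair receives a weight of 0.5, below the arithmetic mean of 0.625, which illustrates how a top rank from one side is tempered by a weaker rank from the other.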
2.3 Clustering Networks and Assessing Significance
At this stage, the network is undirected and has been trimmed
down considerably from the size that would otherwise be produced.
Clustering is carried out with the MCL algorithm, which
can be downloaded as a standalone program from Stijn van Dongen's
personal website (https://micans.org/mcl/index.html) or used as
implemented in other programs such as Cytoscape [27]. A sensible
inflation parameter should be used (roughly speaking, this is a
measure of the granularity of the resulting clusters): I = 2.5 is a
good starting point but may need to be changed depending on the
properties of your network (see Note 4). The workflow leading to
this point is summarized graphically in Fig. 3.
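For the standalone program, the undirected network can be written in MCL's label ("abc") format, one edge per line, and clustered from the command line. The sketch below only formats the input and assembles the command; the helper names and file names are hypothetical, and the options shown (`--abc`, `-I`, `-o`) should be confirmed against the MCL manual for the installed version.

```python
def to_abc(edges):
    """Format {(a, b): weight} as tab-separated 'label label weight' lines."""
    return "".join(f"{a}\t{b}\t{w:.6g}\n" for (a, b), w in sorted(edges.items()))


def mcl_command(abc_path, out_path, inflation=2.5):
    """Command line for clustering a label-format network with MCL."""
    return ["mcl", str(abc_path), "--abc", "-I", str(inflation), "-o", str(out_path)]


abc_text = to_abc({("Scc3", "Ycg1"): 0.5, ("Scc2", "Scc3"): 0.1})
command = mcl_command("network.abc", "clusters.txt")
```

After writing `abc_text` to `network.abc`, the command can be run with `subprocess.run(command, check=True)`.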
Fig. 3 Summary of network construction. First choose a small number of proteins to initiate the searches. Use
hhblits and the clustered UniProt database to generate profile HMMs for each sequence, and then use this with
hhsearch to search the species database. Carry out a second round of searches using the results from the first
round as queries in the second. After choosing appropriate thresholds for alignment significance, build a
network using the alignment ranks as edge weights. Simplify this network by removing nodes with degree = 1
and collapse edge pairs using the geometric mean. Cluster this network using the MCL algorithm to reveal
subfamilies within the larger repeat motif family. Figure adapted from Wells et al. [22]
(Panel labels: Proteins of interest; Search for paralogues of starting proteins; Edges weighted by ranks; Search for paralogues of all hits returned; Generate network from results; Quality control and simplify network, discard 1-way edges; MCL clustering to identify families; Clusters of closely related proteins.)
Once you have obtained clusters containing your proteins of
interest, their statistical significance can be assessed with permutation
tests of the underlying network. Specifically, the probability
of obtaining a specific cluster by chance can be calculated by
randomizing the ranks of the underlying alignments in each result
file, regenerating the network and clustering it. This is then
repeated as many times as is computationally practical. On a
powerful desktop computer (tested on 8 Intel® Core™
i7-4790K CPUs @ 4.00 GHz), a network containing
~200 nodes can be regenerated and clustered on the order of 10^5
times overnight, but for larger networks, access to a computer
cluster may be necessary.
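The permutation test can be outlined as below. This is a sketch under several assumptions that the text does not fix: a cluster counts as recovered only if exactly the same set of proteins reappears, the network rebuilding and MCL clustering are supplied by the caller as a function (`build_and_cluster`, hypothetical), and the empirical probability uses the common (count + 1) / (n + 1) estimator.

```python
import random


def shuffle_ranks(result, rng):
    """result: {hit_id: rank} for one query; returns the same hits with permuted ranks."""
    hits = list(result)
    ranks = [result[h] for h in hits]
    rng.shuffle(ranks)
    return dict(zip(hits, ranks))


def cluster_p_value(results, target, build_and_cluster, n_perm=1000, seed=0):
    """Estimate how often `target` reappears when alignment ranks are randomized.

    results: {query: {hit_id: rank}}
    build_and_cluster: callable taking such a dict and returning an iterable of clusters
    """
    rng = random.Random(seed)
    target = frozenset(target)
    count = 0
    for _ in range(n_perm):
        permuted = {q: shuffle_ranks(r, rng) for q, r in results.items()}
        clusters = {frozenset(c) for c in build_and_cluster(permuted)}
        if target in clusters:
            count += 1
    return (count + 1) / (n_perm + 1)
```

Because every permutation requires a full rebuild and clustering run, `n_perm` is the parameter that determines whether a desktop machine suffices.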
3 Discussion
Although qualitatively different in terms of their output, there are

some interesting parallels between phylogenetic trees and the clus-
tered graphs produced here. In an ideal graph, different clusters are
equivalent to monophyletic clades within a tree-based representa-
tion of the true evolutionary history. In the same fashion, varying
the inflation parameter is analogous to producing sub-trees by
cutting internal branches at specific time points.
However, less welcome similarities are also likely to exist, and
though we have not investigated the process, it seems probable that
the formation of some clusters in the graph may involve a process

akin to long-branch attraction. Specifically, if large groups of highly
divergent proteins are present, none of which exhibit distinct simi-
larities to each other, then these may cluster together as a result of
being densely connected, yet evenly weighted across the group.
Concerted evolution leading to homogenization of repeats has
also been observed in repeat proteins [11], and this too is likely
to confound analyses in cases where it occurs. Given this, indepen-
dent validation of clusters should be sought wherever possible, for
example, by looking for enrichment of GO terms or shared struc-
tural similarities.
In summary, the method described here is a robust, semiquantitative
alternative to traditional tree-based descriptions of phylogeny
and is particularly powerful for repeat protein families for which
it is not possible to generate accurate multiple sequence alignments.
Although it was originally designed to demonstrate the shared ancestry
of a single group of proteins, many distantly related
proteins are acquired during construction of the network, so it can
also be used more broadly as a tool for generating new hypotheses.
4 Notes
There are several points to make about the method described

above. As mentioned in the introduction, different classes of repeat
proteins will likely require different parameters and should be
approached on a case-by-case basis. The following notes offer
some additional guidance on how different parameter choices affect
the resulting networks:
1. Beginning with selection of seed proteins for the first round of
sequence searches, it may be necessary to remove extraneous
domains if present. While the method will work for both repeti-
tive and non-repetitive proteins, if the core domains of interest
are non-repetitive, then traditional phylogenetic methods are
likely to be more appropriate. Otherwise, manually removing
such domains will avoid the possibility of subsequent searches
being led off track by high-scoring alignments in non-repetitive
regions.
2. When choosing alignment significance thresholds for inclusion
in the network (e.g., expect value, true positive probability,
alignment length), as long as the values are reasonable, then
their main effect will be on the size of the network. More
stringent thresholds will decrease the overall size of the network,
at the expense of distant family members in each cluster, whereas
relaxed thresholds will increase the computational requirements
in carrying out permutation tests for each cluster.
3. Attributes other than the relative rank of alignments can also be
used to weight edges for clustering, for example, true positive
probability. However, since we are more interested in relative
relationships, using the rank is the most robust metric. To
understand this, it helps to imagine two hypothetical protein
families whose average rates of divergence differ and consider
how a relative metric such as rank would behave versus absolute
metrics.
4. An important strength of MCL is the fact that in most cases the
only parameter that needs to be explicitly set by the user is the
inflation parameter, which affects the granularity of the resulting
clustered graph. As a rule of thumb, values in the region of
1.2–6.0 are reasonable, with 2.0 being the default [20]; larger
values will tend to produce smaller clusters and vice versa. To
adhere to best scientific practice, once this parameter has been
set, it should not be changed, and to avoid researcher bias, the
value should ideally be settled on before revealing node labels.
This is not unrealistic given prior knowledge about the average
size of species gene families. For example, an inflation parameter
that gave clusters with a median size of two in a human network
would almost certainly be inappropriate, since most repeat pro-
tein families are considerably larger.
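The sanity check suggested in Note 4 can be sketched as a median cluster size computed from MCL's label output, in which each line lists the members of one cluster. The function names are hypothetical, and what counts as a plausible median is left to the reader's prior knowledge of the family in question.

```python
from statistics import median


def parse_mcl_clusters(text):
    """Each non-empty line of the clustering output is one cluster of node labels."""
    return [line.split() for line in text.splitlines() if line.strip()]


def median_cluster_size(clusters):
    """Median number of proteins per cluster."""
    return median(len(cluster) for cluster in clusters)


clusters = parse_mcl_clusters("A\tB\tC\nD\tE\nF\n")
size = median_cluster_size(clusters)
```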
References
1. Kajava AV (2001) Review: proteins with repeated sequence—structural prediction and modeling. J Struct Biol 134:132–144. https://doi.org/10.1006/jsbi.2000.4328
2. Kajava AV (2012) Tandem repeats in proteins: from sequence to structure. J Struct Biol 179:279–288. https://doi.org/10.1016/j.jsb.2011.08.009
3. Kobe B, Deisenhofer J (1994) The leucine-rich repeat: a versatile binding motif. Trends Biochem Sci 19:415–421
4. Neer EJ, Schmidt CJ, Nambudripad R, Smith TF (1994) The ancient regulatory-protein family of WD-repeat proteins. Nature 371:297–300. https://doi.org/10.1038/371297a0
5. Marcotte EM, Pellegrini M, Yeates TO, Eisenberg D (1999) A census of protein repeats. J Mol Biol 293:151–160. https://doi.org/10.1006/jmbi.1999.3136
6. Schaper E, Gascuel O, Anisimova M (2014) Deep conservation of human protein tandem repeats within the eukaryotes. Mol Biol Evol 31:1132–1148. https://doi.org/10.1093/molbev/msu062
7. Andrade MA, Petosa C, O'Donoghue SI et al (2001) Comparison of ARM and HEAT protein repeats. J Mol Biol 309:1–18. https://doi.org/10.1006/jmbi.2001.4624
8. Sutherland TD, Campbell PM, Weisman S et al (2006) A highly divergent gene cluster in honey bees encodes a novel silk family. Genome Res 16:1414–1421. https://doi.org/10.1101/gr.5052606
9. Björklund ÅK, Ekman D, Elofsson A (2006) Expansion of protein domain repeats. PLoS Comput Biol 2:0959–0970. https://doi.org/10.1371/journal.pcbi.0020114
10. Schüler A, Bornberg-Bauer E (2016) Evolution of protein domain repeats in Metazoa. Mol Biol Evol 33:3170
11. Persi E, Wolf YI, Koonin EV (2016) Positive and strongly relaxed purifying selection drive the evolution of repeats in proteins. Nat Commun 7:13570. https://doi.org/10.1038/ncomms13570
12. Szklarczyk R, Heringa J (2004) Tracking repeats using significance and transitivity. Bioinformatics 20(Suppl 1):i311–i317. https://doi.org/10.1093/bioinformatics/bth911
13. Söding J, Remmert M, Biegert A, Lupas AN (2006) HHsenser: exhaustive transitive profile search using HMM-HMM comparison. Nucleic Acids Res 34:374–378. https://doi.org/10.1093/nar/gkl195
14. Newman AM, Cooper JB (2007) XSTREAM: a practical algorithm for identification and architecture modeling of tandem repeats in protein sequences. BMC Bioinformatics 8:382. https://doi.org/10.1186/1471-2105-8-382
15. Vo A, Nguyen N, Huang H (2010) Solenoid and non-solenoid protein recognition using stationary wavelet packet transform. Bioinformatics 26:i467–i473. https://doi.org/10.1093/bioinformatics/btq371
16. Szalkowski AM, Anisimova M (2013) Graph-based modeling of tandem repeats improves global multiple sequence alignment. Nucleic Acids Res 41:e162–e162. https://doi.org/10.1093/nar/gkt628
17. Schaper E, Kajava AV, Hauser A, Anisimova M (2012) Repeat or not repeat?--Statistical validation of tandem repeat prediction in genomic sequences. Nucleic Acids Res 40:10005–10017. https://doi.org/10.1093/nar/gks726
18. Söding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21:951–960. https://doi.org/10.1093/bioinformatics/bti125
19. Remmert M, Biegert A, Hauser A, Söding J (2011) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9:173–175. https://doi.org/10.1038/nmeth.1818
20. Van Dongen S (2000) A cluster algorithm for graphs. Rep Inf Syst 10:1–40
21. Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30:1575–1584
22. Wells JN, Gligoris TG, Nasmyth KA, Marsh JA (2017) Evolution of condensin and cohesin complexes driven by replacement of kite by hawk proteins. Curr Biol 27:R17–R18. https://doi.org/10.1016/j.cub.2016.11.050
23. Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755–763
24. Viterbi A (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 13:260–269. https://doi.org/10.1109/TIT.1967.1054010
25. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410. https://doi.org/10.1016/S0022-2836(05)80360-2
26. Altschul SF, Madden TL, Schäffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
27. Cline MS, Smoot M, Cerami E et al (2007) Integration of biological networks and gene expression data using Cytoscape. Nat Protoc 2:2366–2382. https://doi.org/10.1038/nprot.2007.324
28. Chavali S, Chavali PL, Chalancon G et al (2017) Constraints and consequences of the emergence of amino acid repeats in eukaryotic proteins. Nat Struct Mol Biol 24:765–777. https://doi.org/10.1038/nsmb.3441
Chapter 14
Exploring Enzyme Evolution from Changes in Sequence, Structure, and Function
Jonathan D. Tyzack, Nicholas Furnham, Ian Sillitoe, Christine M. Orengo,
and Janet M. Thornton
Abstract
The goal of our research is to increase our understanding of how biology works at the molecular level, with
a particular focus on how enzymes evolve their functions through adaptations to generate new specificities
and mechanisms. FunTree (Sillitoe and Furnham, Nucleic Acids Res 44:D317–D323, 2016) is a resource
that brings together sequence, structure, phylogenetic, and chemical and mechanistic information for 2340
CATH superfamilies (Sillitoe et al., Nucleic Acids Res 43:D376–D381, 2015) (which all contain at least
one enzyme) to allow evolution to be investigated within a structurally defined superfamily.
We will give an overview of FunTree’s use of sequence and structural alignments to cluster proteins
within a superfamily into structurally similar groups (SSGs) and generate phylogenetic trees augmented by
ancestral character estimations (ACE). This core information is supplemented with new measures of
functional similarity (Rahman et al., Nat Methods 11:171–174, 2014) to compare enzyme reactions
based on overall bond changes, reaction centers (the local environment atoms involved in the reaction),
and the structural similarities of the metabolites involved in the reaction. These trees are also decorated with
taxonomic and Enzyme Commission (EC) code and GO annotations, forming the basis of a comprehensive
web interface that can be found at http://www.funtree.info. In this chapter, we will discuss the various
analyses and supporting computational tools in more detail, describing the steps required to extract
information.
Key words FunTree, Enzyme evolution, CATH, EC-Blast, Phylogenetic tree
1 Introduction
FunTree [1] is a resource for exploring the evolution of protein

function through relationships in sequence, structure, phylogeny,
and function. It catalogues 2340 CATH superfamilies with over
400,000 representative sequences (selected to cover taxonomic
lineage and function), over 70,000 structural domains, and 2358
EC (Enzyme Commission) codes.
FunTree can be used to place structures and sequences in the
context of their structural and functional evolution, allowing the
investigation of how novel enzyme functions have evolved within a
structurally similar group (SSG). This can also be helpful to identify

new but currently unobserved reactions and substrates for known
enzymes, as well as possible reactions for enzyme sequences/struc-
tures of unknown function.
Often CATH [2] superfamilies can be structurally highly
diverse, hindering the confident atomic superimposition of all
structures in the superfamily. A key step in the generation of
FunTree is the sub-clustering of each superfamily into distinct SSGs,
in which all interstructure SIMAX scores are less than 9 Å
(SIMAX is the RMSD between two domains multiplied by
the number of residues in the larger domain and divided by the number
of aligned residues).
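The SIMAX definition translates directly into a short function. The sketch below is only an illustration of the formula and threshold given above; the function names are hypothetical, and it does not reproduce how FunTree or CATH compute the underlying superposition.

```python
SIMAX_THRESHOLD = 9.0  # angstroms, as stated in the text


def simax(rmsd, n_residues_a, n_residues_b, n_aligned):
    """RMSD scaled by (residues in the larger domain) / (aligned residues)."""
    if n_aligned <= 0:
        raise ValueError("n_aligned must be positive")
    return rmsd * max(n_residues_a, n_residues_b) / n_aligned


def is_valid_ssg(pairwise_simax):
    """True if every pairwise SIMAX score in a candidate group is below the threshold."""
    return all(score < SIMAX_THRESHOLD for score in pairwise_simax)


# Hypothetical pair: 2.0 Å RMSD, domains of 300 and 250 residues, 200 aligned.
score = simax(2.0, 300, 250, 200)
```

The scaling penalizes superpositions that achieve a low RMSD by aligning only a small fraction of the larger domain.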
The core output of FunTree is a phylogenetic tree for each SSG
(discussed in Subheading 2.2.1), calculated from structure-guided
sequence alignments using a novel agglomerative clustering tech-
nique. The resulting alignments are provided to TreeBest [4],
along with a species tree derived from species relationships in the
NCBI Taxonomic definitions, to generate a maximum likelihood-
based phylogenetic tree.
The phylogenetic trees are decorated with information such as
EC code, GO annotations, and multidomain architecture (MDA)
and augmented with various ancillary analyses describing the diversity
in areas such as enzyme chemistry and taxa distribution. These
will all be described in more detail in the Methods section; however,
it is important to note that some annotations such as GO and EC
are assigned to entire gene products rather than the individual
structural domains included in the SSGs. Most functions can be
ascribed to a single domain, but many are a product of domain
combinations or multiple gene products. Thus, as FunTree is a
domain-centric resource, some annotations might be relevant at
the protein rather than domain level. The FunTree pipeline describing
the various steps in collecting, processing, and presenting the
data is shown in Fig. 1.
The trees generated in FunTree can become difficult to navigate
due to their size and mixed media content. To facilitate easy
navigation, a web interface has been constructed using the JavaScript
D3 libraries [5] to provide intuitive and user-friendly functionality
(e.g., using the mouse wheel to zoom and dragging
images to pan) and interaction with the trees (e.g., collapsing and
expanding nodes by clicking).
2 Methods
FunTree [1] can be browsed by CATH [2] superfamily or searched
by CATH superfamily, UniProtKB accession, and EC code or by
entering a text string for a fuzzy search. Overview statistics and
high-level results are produced at the CATH superfamily level, with
the phylogenetic trees and lower level results produced at the
structurally similar group (SSG) level. These are discussed in more
detail in the remainder of this section.
Fig. 1 The FunTree pipeline. (a) An overview of the workflow for collecting and processing sequence,
structure, and functional information in FunTree. (b) A detailed schematic representation of the various
steps in data collection, processing, and visualization in FunTree
(Panel a steps: Cluster Domains; Align Sequences; Gather Functional Data; Calculate Phylogeny; Annotate Functional Information; Display & Visualise. Panel b data collection sources: CATH, M-CSA, CATH-Gene3D, PDBSum, ArchSchema, UniProtKB, KEGG. Panel b data processing: agglomerative clustering to generate structurally informed multiple sequence alignments; phylogenetic tree calculated using TreeBeST guided by a taxonomic tree; collation of structure, sequence, and functional annotations; clustering of small molecule data using SMSD; clustering of reactions based on overall bond changes, reaction centres, and small molecule sub-graph similarity; maximum likelihood ancestral character estimation. Panel b visualisation: Annotated Phylogenetic Tree, ArchSchema Graph, Ancestral Character Tree, Ligand Similarity Tree, Summary Statistics, Reaction Similarity Tree, Annotated Alignment.)
2.1 CATH Superfamily Results Gateway
This page is the gateway for results at the CATH superfamily level
for the selected domain (Fig. 2), where each thumbnail provides a
link to a detailed analysis of the selected results. The SSGs within
the superfamily are shown in Clusters with a link to lower level
results for that SSG.
Fig. 2 Superfamily gateway. CATH superfamily results for CATH 3.20.20.120 Enolase. Each thumbnail
provides a link to a detailed analysis of the selected results. The SSGs within the superfamily are shown in
Clusters with a link to lower level results for that SSG
2.1.1 Domain Architectures
This page shows an interactive force-directed graph generated by
ArchSchema [6] of the multidomain architectures (MDAs) associated
with the current search, with the current domain shown at
the center connected to increasingly more complicated
architectures.
1. The colored graph nodes represent the different MDAs and can
be dragged to reorganize the graph. Hovering over the graph
nodes shows the following information for that MDA:
(a) Number of sequences
(b) Number of structures
(c) List of EC codes (annotated by UniProtKB [7])
(d) List of structures
2. The colored domain bars show the domain composition, where
hovering over the bar reveals the domain name and clicking
opens the webpage for that CATH superfamily.
2.1.2 Overview Stats
This page contains a dynamic, interactive plot allowing various
properties of CATH superfamilies to be plotted on two axes
(Fig. 3). The following properties can be plotted on either a
linear or a log scale and can also be used to scale and color the
data points:
1. Alphabetical order (x-axis only)
2. Average conservation score for each position in the alignment
(ScoreCons) [8] for SSGs
3. Number of multidomain architectures (MDAs)
4. Number of full Enzyme Commission (EC) codes
5. Number of partial EC codes
6. Number of sequences
7. Number of structures
8. Number of structurally similar groups (SSGs)
Fig. 3 CATH superfamily Overview Stats. The plot shows the number of sequences on the y-axis against the
number of structures on the x-axis with color representing the number of EC codes and size representing the
number of partial EC codes
2.1.3 EC Wheel
This page shows the EC hierarchy as an unrooted tree with EC
codes within the superfamily labelled outside the wheel.
Nodes/leaves for class, sub-class, sub-subclass, and numerical
identifier are highlighted for the enzymes found in the superfamily.
2.1.4 EC-Blast
This page shows the EC classification rendered as a circular rooted
tree.
1. Leaves represent EC code and are colored by primary EC class.
EC codes that are found in the superfamily are pushed out of the
circle and are colored blue.
2. Links between EC codes found in the superfamily and their
10 most similar functions as calculated by EC-Blast [3] are highlighted
in blue, tracing the path through the tree between them.
3. Hovering over an EC code highlights in red connections to the
top 10 most similar reactions, which are also listed on the right
of the page.
2.1.5 CATH
This is a link to the CATH page [2] for that superfamily containing
further information on structure and function.
2.2 Structurally Similar Group (SSG) Results Gateway
This page is the gateway for results at the SSG level for the selected
domain (Fig. 4). Each thumbnail provides a link to a detailed
analysis of the selected results.
Fig. 4 SSG (structurally similar group) Gateway: SSG results for CATH 3.20.20.120 SSG1. Each thumbnail
provides a link to a detailed analysis of the selected results. The SSGs within the superfamily are shown in
Clusters with a link to lower level results for that SSG
2.2.1 FunTree: Rooted Phylogenetic Tree
This page contains a rooted phylogenetic tree for the selected SSG,
with annotations and links embedded in the nodes and leaves
(Fig. 5).
1. Navigation is implemented using the mouse wheel to zoom,
dragging the image to pan, clicking on a node to collapse/
expand that node, clicking on text for links to data sources,
and hovering over text/images for more information.
2. At each node of the tree, a confidence score can be found. This is
the bootstrap confidence score provided by TreeBest for the
bifurcation at that node. Please note that as these trees are
automatically generated, some of the bifurcations might have low
confidence scores and should be considered with caution.
3. The annotations at the end of each leaf are as follows:
(a) The first number/text section is the node name (internal to
FunTree) made up of a reference and the taxonomic code.
(b) If the leaf represents an enzyme, the next three circles show
the similarity between reactions in the EC code on a bond
change, reaction center, and sub-structure basis, respectively.
Coloring is based on the degree of similarity as
calculated by EC-Blast [3].
(c) Primary EC code, containing a hyperlink to the IntEnz
database.
(d) UniProtKB identifier, containing a link to the UniProtKB
record.
(e) If the sequence represents a known structural superfamily,
then the PDB (linked to PDBe entry [9]) and CATH
domain (linked to the CATH superfamily page) are shown.
(f) The MDA of the protein at each leaf is depicted showing
the domains as uniquely colored bars along a line, the
position and length of which are proportional to the total
sequence. Hovering over each bar shows the CATH superfamily,
and clicking navigates to the CATH
superfamily page.
Fig. 5 Rooted phylogenetic tree for SSG1 in CATH 3.20.20.120 Enolase. Each node contains a score that
measures the confidence in the bifurcation. Each leaf contains labels for reaction similarity represented as
green circles, EC code/function (where available), UniProtKB sequence, representative PDB domain (where
available), and a domain bar representing the multidomain architecture (MDA). See Subheading 2.1 for further
details
2.2.2 Taxa Distribution
The taxa distribution shows the distribution of taxonomic classes
within the SSG tree.
1. Hovering over the band reveals the taxonomic lineage (shown
top left) as well as the percentage of sequences in the tree that
belong to that group.
2.2.3 Ancestral Character Estimation (ACE) Tree
This is a circular representation of the phylogenetic tree based on
SSG alignments showing likelihoods of functions at ancestral
nodes.
1. Hovering over a node shows the EC code/function with the
maximum likelihood for that node.
2. Hovering over leaves of the tree shows the contribution to
function annotation from each internal node in the lineage.
2.2.4 Reaction Clustering
This page shows a tree representing the similarities between reactions
based on bond changes calculated by EC-Blast [3], where the
clustering is made using the PVClust [10] methods as implemented
in R (Fig. 6).
1. The tree can be zoomed using the mouse wheel or moved/
panned by dragging the image.
2. Each leaf shows a schematic of the reaction with color coding
highlighting the atoms that are involved in the reaction.
Fig. 6 Reaction Clustering for SSG1 in CATH 3.20.20.120 Enolase. A tree representing the similarities between
reactions based on bond changes calculated by EC-Blast, where the clustering is made using the PVClust
methods as implemented in R
2.2.5 GO Clustering
This page shows a tree representing the similarities between GO
annotations using a semantic similarity score.
2.2.6 Ligand Clustering
This page shows a similarity tree of all the small molecules found in
all the reactions in the SSG. The similarities are calculated using
SMSD [11], and the clustering is made using the PVClust methods
as implemented in R.
2. By hovering over the leaves of the tree, the reaction is displayed,
and the other ligands in the reaction are highlighted.
2.2.7 EC Wheel
The functionality is as described in Subheading 2.1.3 but for data at
the SSG level.
2.2.8 Annotated This page shows the multiple sequence alignment generated with
Alignment the BioJS [12] module (Fig. 7) that was used to build the phyloge-
netic tree. The sequences in the alignment are annotated by sec-
ondary structure where available and catalytic residues as
catalogued in the M-CSA [13] (bright red if from the curated
M-CSA, light red if from the predicted M-CSA).
Fig. 7 Annotated alignment for phosphatidylinositol (PI) phosphodiesterase

1. The alignments can be scrolled vertically (to show more
sequences) and horizontally (to show different parts of the
sequence).
2. The sequences can be selected, ordered, and filtered by the
various data fields including sequence identity.
3. Other formatting options include editing the color scheme and
hiding/showing visual elements such as labels and headers.
4. There is also functionality to import and export data for external
analysis.
2.2.9 Overview Stats
The functionality is as described in Subheading 2.1.2 but for data at
the SSG level.
2.3 Examples of the Application of FunTree
As FunTree [1] holds data across many domain superfamilies, it is
possible to use FunTree to make large-scale general observations
about how enzymes have evolved their function [14]. These observations
can be made at the domain and residue level, exploring how
function is modulated via the addition/removal of domains within
a multi-domain architecture or adaptations of the catalytic/binding
pocket. This allows analyses to be prepared comparing the number
and types of evolutionary steps observed within domain
superfamilies [15].
Furthermore, detailed analysis within a single superfamily or
for a specific enzyme can be undertaken. An example of this is the
evolution of functionality within the phosphatidylinositol-
phosphodiesterase superfamily (CATH 3.20.20.190), which is
summarized briefly here but discussed more comprehensively in
reference [16]. This superfamily shows relatively high structural
conservation, presenting just one structurally similar group, but
the phylogenetic tree generated within FunTree reveals three clades
(see Fig. 8). Clades C1 and C3 show hydrolase activity (EC: 3.1.4)
using a metal cofactor, whereas Clade C2 exhibits a transition to lyase
activity (EC: 4.6.1). The structure-informed sequence alignment
reveals that none of the three metal-chelating residues are conserved
in Clade C2, so that a metal is no longer bound, resulting in
the cyclic intermediate leaving the active site prior to hydrolysis and
giving the change from hydrolase to lyase functionality. The mech-
anistic changes that give rise to this change in functionality can be
explored further using the Mechanism and Catalytic Site Atlas
(M-CSA [13], formerly called MACiE [17] and CSA [18]).
FunTree is an important resource providing a comprehensive
analysis of the evolution of enzyme functionality within structurally
similar subdivisions of CATH superfamilies. Not only will this
improve our understanding of the link between enzyme structure
and function but, coupled with FunTree's various supporting analyses
such as structural alignments and measures of molecular
similarity, it also offers potential to inform de novo enzyme design,
annotate sequences/structures of unknown function, and propose novel
indications for drugs.
Fig. 8 Summary of phylogenetic, functional, metabolite, and multidomain architectures for the
phosphatidylinositol-phosphodiesterase superfamily (3.20.20.190) [16]. This shows a diagrammatic
representation of the FunTree phylogenetic tree with associated functional data and multidomain architectures.
Domain 3.20.20.190 performs all molecular functionality and is represented in green in the multidomain
architecture analysis. Three major clades (C1–C3) are highlighted. Within the first group, a number of
functional sub-groups can be observed, with differences in function defined by changes in substrate or
product formed
References
1. Sillitoe I, Furnham N (2016) FunTree: advances in a resource for exploring and contextualising protein function evolution. Nucleic Acids Res 44:D317–D323. https://doi.org/10.1093/nar/gkv1274
2. Sillitoe I, Lewis TE, Cuff A et al (2015) CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res 43:D376–D381. https://doi.org/10.1093/nar/gku947
3. Rahman SA, Cuesta SM, Furnham N et al (2014) EC-BLAST: a tool to automatically search and compare enzyme reactions. Nat Methods 11:171–174. https://doi.org/10.1038/nmeth.2803
4. Ruan J, Li H, Chen Z et al (2007) TreeFam: 2008 update. Nucleic Acids Res 36:D735–D740. https://doi.org/10.1093/nar/gkm1005
5. Bostock M (2017) https://d3js.org
6. Tamuri AU, Laskowski RA (2010) ArchSchema: a tool for interactive graphing of related Pfam domain architectures. Bioinformatics 26:1260–1261. https://doi.org/10.1093/bioinformatics/btq119
7. Uniprot Consortium (2009) The universal protein resource (UniProt) 2009. Nucleic Acids Res 37:D169–D174. https://doi.org/10.1093/nar/gkn664
8. Valdar WSJ (2002) Scoring residue conservation. Proteins Struct Funct Genet 48:227–241. https://doi.org/10.1002/prot.10146
9. Gutmanas A, Alhroub Y, Battle GM et al (2014) PDBe: Protein Data Bank in Europe. Nucleic Acids Res 42:D285–D291. https://doi.org/10.1093/nar/gkt1180
10. Suzuki R, Shimodaira H (2006) Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics 22:1540–1542. https://doi.org/10.1093/bioinformatics/btl117
11. Rahman S, Bashton M, Holliday GL et al (2009) Small Molecule Subgraph Detector (SMSD) toolkit. J Cheminform 1:12. https://doi.org/10.1186/1758-2946-1-12
12. Yachdav G, Goldberg T, Wilzbach S et al (2015) Anatomy of BioJS, an open source community for the life sciences. elife 4:e07009. https://doi.org/10.7554/eLife.07009
13. Ribeiro AJM, Holliday GL, Furnham N et al (2018) Mechanism and Catalytic Site Atlas (M-CSA): a database of enzyme reaction mechanisms and active sites. Nucleic Acids Res 46(D1):D618–D623
14. Furnham N, Dawson NL, Rahman SA et al (2016) Large-scale analysis exploring evolution of catalytic machineries and mechanisms in enzyme superfamilies. J Mol Biol 428:253–267. https://doi.org/10.1016/j.jmb.2015.11.010
15. Tyzack JD, Furnham N, Sillitoe I et al (2017) Understanding enzyme function evolution from a computational perspective. Curr Opin Struct Biol 47:131–139. https://doi.org/10.1016/j.sbi.2017.08.003
16. Furnham N, Sillitoe I, Holliday GL et al (2012) Exploring the evolution of novel enzyme functions within structurally defined protein superfamilies. PLoS Comput Biol 8:e1002403. https://doi.org/10.1371/journal.pcbi.1002403
17. Holliday GL, Bartlett GJ, Almonacid DE et al (2005) MACiE: a database of enzyme reaction mechanisms. Bioinformatics 21:4315–4316. https://doi.org/10.1093/bioinformatics/bti693
18. Furnham N, Holliday GL, de Beer TAP et al (2014) The catalytic site atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic Acids Res 42:D485–D489. https://doi.org/10.1093/nar/gkt1243
Chapter 15
Identification of Protein Homologs and Domain Boundaries by Iterative Sequence Alignment
Dustin Schaeffer and Nick V. Grishin
Abstract
Evolutionary domains are protein regions with observable sequence similarity to other known domains.
Here we describe how to use common sequence and profile alignment algorithms (i.e., BLAST, HHsearch)
to delineate putative domains in novel protein sequences, given a reference library of protein domains. In
this case, we use our database of evolutionary domains (ECOD) as a reference, but other domain sequence
libraries could be used (e.g., SCOP, CATH). We describe our domain partition algorithm along with
specific notes on how to avoid domain indexing errors when working with multiple data sources and
software algorithms with differing outputs.
Key words Protein domains, Homologs, Sequence alignment
1 Introduction
Protein domains are regions sharing a common origin (and sometimes
function) that may be observed shuffled among homologous
proteins by recombination, deterioration, fusion, and other evolutionary
events [1, 2]. Clear identification of protein domains aids in
the identification of novel homologs and can give insight into protein
function [3, 4]. The detection of homologous proteins and the
delineation of their domain boundaries can be complicated by
multidomain proteins in the search space [5]. Multidomain proteins
in the domain search space can lead to nonhomologous
domains being classified as homologous, and potential novel insertions
or deletions may be missed. Discontinuous domains caused
by insertion may also complicate partition and assignment of
boundaries [6]. By careful selection of a reference database of
domains, combined with use of publicly available sequence alignment
software, the detection of similarity can be used to determine
domain boundaries in a set of input proteins [7]. Here we demonstrate
how the Basic Local Alignment Search Tool (BLAST) can be
combined with profile-profile detection (HHsearch) to determine
domain boundaries in an input protein [8–10]. Additionally, we
show how a database of proteins with known domain architectures
can be used to efficiently partition multidomain proteins. For our
domain database, we use our Evolutionary Classification Of protein
Domains (ECOD), but this method could be adapted to work with
other common domain databases [11].
2 Materials
2.1 Software
1. BLAST+ version 2.2.25+ or greater (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/).
2. HHsuite v2.0.15 or greater (https://github.com/soedinglab/hh-suite).
3. PSIPRED (http://bioinfadmin.cs.ucl.ac.uk/downloads/psipred/).
2.2 Databases
1. ECOD domain description flat file (http://prodata.swmed.edu/ecod/distributions/ecod.latest.domains.txt).
2. Local PDB with PDBml or mmCIF no-atom headers (ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided).
3. Non-redundant sequence set for generation of reference profiles (http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/).
3 Methods
3.1 Preparation of Domain and Protein Sequence Databases
Given a set of domain sequence ranges, generate a set of domain
sequences in FASTA format. In this case we will translate modified
or unnatural residues to unknown residues, although it is possible
in some cases to identify parent amino acids and translate accordingly
(see Note 1). If prepared sequence databases are used, this
step can be skipped. This protocol assumes basic scripting knowledge
(e.g., Python, Perl, or Ruby) and the ability to parse and write
structured data formats (e.g., XML, mmCIF) [12]. We will use a
sample workspace as illustrated in Fig. 1; directory structures can
be adapted for individual computing needs and infrastructures.
The overall workflow of this domain partition is illustrated in
Fig. 2. The workflow presented here assumes the sequence inputs
are sourced from PDB structure depositions.
1. Download the PDBml no-atom headers and place into work-
space (see Note 2):
(a) /data/pdb
2. Download the ECOD domain description file and place into
workspace:
(b) /work/domain_search
Fig. 1 A sample workspace for domain partition. We delineate directories for storage of external databases
(top left), the reference domain database against which we search (top right), necessary downloadable
software programs (bottom left), and the contents of the search directory for a chain A found in the PDB
deposition 5XCT
Fig. 2 Workflow for domain partition by iterative sequence alignment. Briefly, the workflow can be split into
three large components. The search databases and the query workspace are prepared based on your domain
definitions and external protein database (left). Alignments are generated for the query proteins against the
reference databases, and the subsequent alignments are post-processed into structured data files containing
only well-covered hits (middle). Well-covered hits are used to iteratively assign and partition domain
boundaries using protein-protein sequence hits with the highest precedence and domain-domain profile
hits with the lowest (right)
3. Using domain ranges (PDB seqid indices) and the PDBml
no-atom XML files, prepare a set of domain FASTA sequences
and place into a single text file (see Note 3):
(c) /work/domain_search/domain_ref_seq.fasta
4. For each PDB protein sequence containing domains in the
ECOD file, using the structured residues with small (<20
residue) unstructured gaps included, prepare a set of protein
FASTA sequences and place into a single flat file (see Note 4):
(d) /work/domain_search/protein_ref_seq.fasta
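Step 3 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes the chain has already been read from the PDBml file into a list of three-letter residue codes ordered by seqid (1-based), and the function names, the identifier `example_domain`, and the toy chain are all hypothetical. Any residue that is not one of the 20 standard amino acids is written as X, as described above.

```python
# Hypothetical helpers; a sketch of writing a domain FASTA record from seqid ranges.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def parse_range(range_str):
    """Expand a seqid range string such as '1-3,7-8' into a sorted list of seqids."""
    seqids = set()
    for segment in range_str.split(","):
        start, end = segment.split("-")
        seqids.update(range(int(start), int(end) + 1))
    return sorted(seqids)

def domain_sequence(chain_residues, range_str):
    """One-letter sequence of a domain; modified or unnatural residues become 'X'."""
    return "".join(
        THREE_TO_ONE.get(chain_residues[seqid - 1], "X")
        for seqid in parse_range(range_str)
    )

def fasta_record(identifier, sequence, width=60):
    """Format one FASTA record with a fixed line width."""
    lines = [">" + identifier]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"

# Toy chain: seqid 1 is selenomethionine (MSE), which is written as X here.
chain = ["MSE", "ALA", "GLY", "LYS", "TRP", "SER", "VAL", "LEU"]
print(fasta_record("example_domain", domain_sequence(chain, "1-3,7-8")), end="")
```

The discontinuous range `1-3,7-8` yields the sequence XAGVL. Because the range is expressed in seqid indices, no insertion codes or author numbering have to be interpreted at this stage.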
3.2 Preparation of BLAST Query Sequence Databases
1. Generate a BLAST query database given an input FASTA
library of domain sequences:
(a) /programs/blast/bin/makeblastdb -dbtype prot -in
/work/domain_search/domain_ref_seq.fasta -out
/work/domain_search/domain_ref_seq -title domain_ref_seq
2. Generate a BLAST query database given an input FASTA
library of protein sequences:
(b) /programs/blast/bin/makeblastdb -dbtype prot -in
/work/domain_search/protein_ref_seq.fasta -out
/work/domain_search/protein_ref_seq -title protein_ref_seq
3.3 Preparation of HHsearch Reference Profile Database
This step can be omitted if you have chosen to use a pre-generated
profile database. Select a subset of your original sequence database
that is more sparsely populated. We will use the ECOD F40 representatives
in our example (see Note 4).
1. For each reference domain sequence, generate a reference
sequence profile using HHblits queried against a non-redundant
protein sequence database (e.g., UniRef30, nr, RefSeq, etc.):
(a) Use PSIPRED secondary structure prediction to aid with
HHsearch alignments.
(b) HHblits can be allocated multiple CPUs using the
-cpu switch; select a value that is appropriate for your local
computing infrastructure.
(c) We find that three iterations are sufficient (-n 3) to locate
close sequence homologs.
(d) /programs/hhsuite/bin/hhblits -i
/data/domain_data/e1mppA2/e1mppA2.fasta -d
/data/hhsuite/lib/UniRef30.fa -ohhm
/data/domain_data/e1mppA2/e1mppA2.hhm -n 3 -cpu 8 -addss
-psipred /programs/psipred -psipred_data /data/psipred
2. Concatenate the set of domain profiles (HHM files) into a
single file:
(a) cat /data/domain_data/*/*.hhm >
/work/domain_search/domain_ref_seq_40.hhm
3.4 Prepare Query Workspace
1. Download or prepare the set of query sequences. We will
demonstrate with a recent week of PDB depositions
(20170929) with a single protein (5xct_B). If a set of query
sequences is highly redundant, it is appropriate to cluster the
set using CD-HIT or blastclust to reduce the size of the search
set [13].
2. Create a subdirectory for each query sequence:
(a) /work/domain_sequences/20170929/5xct_B
3. Distribute a FASTA file for each query sequence into each
subdirectory:
(b) /work/domain_sequences/20170929/5xct_B/5xct_B.fa
4. Create a sequence profile for each query sequence using
HHblits:
(c) /work/domain_sequences/20170929/5xct_B/5xct_B.hhm
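Steps 2 and 3 lend themselves to a short script. The sketch below is one possible way to do it, assuming the query set arrives as a single multi-FASTA text whose record identifiers can be used directly as directory names; the function names are hypothetical and the sequence shown is a placeholder, not the real 5xct_B sequence. A temporary directory stands in for /work/domain_sequences so the example can be run anywhere.

```python
from pathlib import Path
import tempfile

def read_fasta(text):
    """Parse FASTA text into a dict of {identifier: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier is the first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def build_workspace(fasta_text, root):
    """Create <root>/<query>/ and write <query>.fa into it for every record."""
    written = []
    for name, sequence in read_fasta(fasta_text).items():
        query_dir = Path(root) / name
        query_dir.mkdir(parents=True, exist_ok=True)
        fasta_path = query_dir / (name + ".fa")
        fasta_path.write_text(">" + name + "\n" + sequence + "\n")
        written.append(fasta_path)
    return written

# Placeholder sequence for illustration only.
example = ">5xct_B\nMKTAYIAKQR\nQISFVKSHFS\n"
with tempfile.TemporaryDirectory() as tmp:
    paths = build_workspace(example, Path(tmp) / "20170929")
    print([p.relative_to(tmp).as_posix() for p in paths])
```

Keeping one directory per query means that every later output (BLAST XML, HHM profile, HHsearch result) can be located from the query identifier alone.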
3.5 Performing BLAST and HHsearch Queries
1. For each query FASTA sequence, perform a BLAST search
against both reference sequence BLAST databases (domain
and protein) using the XML output format (-outfmt 5):
(a) /programs/blast/bin/blastp -query
/work/domain_sequences/20170929/5xct_B/5xct_B.fa -db
/work/domain_search/domain_ref_seq -outfmt 5
-num_alignments 5000 -evalue 0.002 >
/work/domain_sequences/20170929/5xct_B/5xct_B.domain_blast.xml
(b) /programs/blast/bin/blastp -query
/work/domain_sequences/20170929/5xct_B/5xct_B.fa -db
/work/domain_search/protein_ref_seq -outfmt 5
-num_alignments 5000 -evalue 0.002 >
/work/domain_sequences/20170929/5xct_B/5xct_B.protein_blast.xml
2. For each query HMM, perform an HHsearch against the reference
sequence profile database:
(a) /programs/hhsuite/bin/hhsearch -i
/work/domain_sequences/20170929/5xct_B/5xct_B.hhm -d
/work/domain_search/domain_ref_seq_40.hhm -o
/work/domain_sequences/20170929/5xct_B/5xct_B.hh_result
-cpu 8
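When many queries are processed, it can help to build the command lines programmatically. The following sketch only constructs and prints the three commands for one query workspace; it does not run them. The directory layout follows the sample workspace, while the function names, the database names without a file extension, and the use of `-out` in place of shell redirection are choices made here for illustration.

```python
import shlex

BLASTP = "/programs/blast/bin/blastp"
HHSEARCH = "/programs/hhsuite/bin/hhsearch"
REF_DIR = "/work/domain_search"

def blastp_command(query_fasta, database, output_xml):
    """Argument list for one BLAST search with XML output (-outfmt 5)."""
    return [BLASTP, "-query", query_fasta, "-db", database,
            "-outfmt", "5", "-num_alignments", "5000",
            "-evalue", "0.002", "-out", output_xml]

def hhsearch_command(query_hhm, database_hhm, output, cpus=8):
    """Argument list for one profile-profile search."""
    return [HHSEARCH, "-i", query_hhm, "-d", database_hhm,
            "-o", output, "-cpu", str(cpus)]

def commands_for_query(query_dir, name):
    """Domain BLAST, protein BLAST, and HHsearch commands for one query."""
    prefix = f"{query_dir}/{name}"
    return [
        blastp_command(f"{prefix}.fa", f"{REF_DIR}/domain_ref_seq",
                       f"{prefix}.domain_blast.xml"),
        blastp_command(f"{prefix}.fa", f"{REF_DIR}/protein_ref_seq",
                       f"{prefix}.protein_blast.xml"),
        hhsearch_command(f"{prefix}.hhm", f"{REF_DIR}/domain_ref_seq_40.hhm",
                         f"{prefix}.hh_result"),
    ]

for command in commands_for_query("/work/domain_sequences/20170929/5xct_B", "5xct_B"):
    # To execute a command: subprocess.run(command, check=True)
    print(shlex.join(command))
```

Passing argument lists rather than a single shell string avoids quoting problems when paths contain unusual characters.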
3.6 Collate HHsearch Output to Structured Data Format
To better work with HHsearch results, it is convenient to parse
them into a structured data format, so that inconsistencies and errors
in batch jobs can be identified early in the process. It is possible but
not recommended to interpret results directly from the standard
HHsearch result format.
1. Locate the completed HHsearch outputs from the query
sequence workspace.
2. Record the hit number, HH probability of homology (%), and
HH E-value from the HH summary result block.
3. Using the original domain reference ranges, index aligned
positions in hit alignments to the original domain residue indices
(see Note 3).
4. Convert alignments into ranges of aligned residue indices from
both the query sequence and the reference sequence.
5. Calculate the residue coverage of the reference alignment over
the reference domain sequence.
6. For hits with more than 70% of the reference domain aligned to
the query, deposit the query aligned range, the reference
aligned range, the HH probability of homology, the HH E-
value, and the coverage of the template sequence into a
structured data format in the query workspace directory.
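Steps 4 to 6 can be illustrated as follows. The sketch assumes the HHsearch alignment has already been parsed into two lists of aligned residue indices (query and reference); parsing the HHsearch output itself is not shown. The function names, the JSON field names, and the example values are hypothetical.

```python
import json

def to_ranges(indices):
    """Collapse residue indices into inclusive (start, end) ranges."""
    ranges = []
    for i in sorted(set(indices)):
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], i)
        else:
            ranges.append((i, i))
    return ranges

def format_ranges(ranges):
    """Render ranges as a string such as '10-49,60-99'."""
    return ",".join(f"{start}-{end}" for start, end in ranges)

def collate_hit(hit_num, probability, evalue, query_idx, ref_idx, ref_len,
                min_coverage=0.7):
    """Return a record for a well-covered hit, or None if coverage is 70% or less."""
    coverage = len(set(ref_idx)) / ref_len
    if coverage <= min_coverage:
        return None
    return {
        "hit_num": hit_num,
        "hh_prob": probability,
        "hh_evalue": evalue,
        "query_range": format_ranges(to_ranges(query_idx)),
        "ref_range": format_ranges(to_ranges(ref_idx)),
        "ref_coverage": round(coverage, 3),
    }

# Invented example: 80 aligned positions against a 100-residue reference domain.
query_idx = list(range(10, 50)) + list(range(60, 100))
ref_idx = list(range(5, 85))
print(json.dumps(collate_hit(1, 99.2, 1.5e-20, query_idx, ref_idx, 100), indent=2))
```

In this example the query aligned range is reported as 10-49,60-99 and the reference coverage as 0.8, so the hit is kept.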
3.7 Collate Protein BLAST Hits
In the previous step, we chose an XML format for BLAST output.
Data locations for BLAST results are presented as XPath
statements.
1. For each protein BLAST query result, record the following:
(a) Database used (//BlastOutput/BlastOutput_db)
(b) Query submitted (//BlastOutput/BlastOutput_query-def)
(c) Query length (//BlastOutput/BlastOutput_query-len)
2. Iterate over the protein BLAST hits and their high-scoring
segment pairs (HSPs) and determine whether the hit is of
sufficient quality for further consideration.
3. For each hit, record the hit number (Hit/Hit_num), the hit
length (Hit/Hit_len), and the hit definition (Hit/Hit_def).
For protein queries conducted against a protein reference database
containing a set of PDB chains, the hit definition is a four-
character PDB identifier and a chain identifier of up to four
characters.
4. For each high-scoring segment pair (Hit_hsps/Hsp), record
the HSP E-value (Hsp/Hsp_evalue) and generate the aligned
range for the query sequence (Hsp_query-from .. Hsp_query-to)
as well as the aligned range for the reference
sequence (Hsp_hit-from .. Hsp_hit-to).
5. Considering HSPs from lowest to highest E-value, discard an
HSP if its query aligned range overlaps a previous HSP by more
than five residues, if its reference aligned range overlaps a
previous HSP by more than ten residues, or if it exceeds an
E-value threshold of 5e-3. Otherwise, aggregate the
retained HSP query and reference aligned ranges for the hit.
6. If neither the respective aggregate query aligned range nor
reference aligned range differs from the total query length or
total reference length by more than 50 residues, retain the
protein-protein hit and the query and reference aligned ranges
for further consideration.
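The collation above can be sketched with the Python standard library. The embedded XML is a hand-written, heavily trimmed stand-in for real `-outfmt 5` output, and the hit name `xxxx_A` is a placeholder. Overlap is measured here against the union of all previously retained HSPs, which is one reading of step 5; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

EXAMPLE_XML = """<BlastOutput>
  <BlastOutput_db>protein_ref_seq</BlastOutput_db>
  <BlastOutput_query-def>5xct_B</BlastOutput_query-def>
  <BlastOutput_query-len>120</BlastOutput_query-len>
  <BlastOutput_iterations><Iteration><Iteration_hits><Hit>
    <Hit_num>1</Hit_num><Hit_def>xxxx_A</Hit_def><Hit_len>130</Hit_len>
    <Hit_hsps>
      <Hsp><Hsp_evalue>1e-40</Hsp_evalue>
        <Hsp_query-from>1</Hsp_query-from><Hsp_query-to>80</Hsp_query-to>
        <Hsp_hit-from>5</Hsp_hit-from><Hsp_hit-to>84</Hsp_hit-to></Hsp>
      <Hsp><Hsp_evalue>1e-10</Hsp_evalue>
        <Hsp_query-from>78</Hsp_query-from><Hsp_query-to>118</Hsp_query-to>
        <Hsp_hit-from>90</Hsp_hit-from><Hsp_hit-to>130</Hsp_hit-to></Hsp>
      <Hsp><Hsp_evalue>1e-5</Hsp_evalue>
        <Hsp_query-from>40</Hsp_query-from><Hsp_query-to>70</Hsp_query-to>
        <Hsp_hit-from>40</Hsp_hit-from><Hsp_hit-to>70</Hsp_hit-to></Hsp>
    </Hit_hsps>
  </Hit></Iteration_hits></Iteration></BlastOutput_iterations>
</BlastOutput>"""

def read_hits(xml_text):
    """Return the query length and a list of hits with their HSP ranges."""
    root = ET.fromstring(xml_text)
    query_len = int(root.findtext("BlastOutput_query-len"))
    hits = []
    for hit in root.iter("Hit"):
        hsps = [{
            "evalue": float(h.findtext("Hsp_evalue")),
            "query": (int(h.findtext("Hsp_query-from")), int(h.findtext("Hsp_query-to"))),
            "ref": (int(h.findtext("Hsp_hit-from")), int(h.findtext("Hsp_hit-to"))),
        } for h in hit.iter("Hsp")]
        hits.append({"num": int(hit.findtext("Hit_num")),
                     "def": hit.findtext("Hit_def"),
                     "len": int(hit.findtext("Hit_len")), "hsps": hsps})
    return query_len, hits

def filter_hsps(hsps, max_evalue=5e-3, max_query_overlap=5, max_ref_overlap=10):
    """Aggregate aligned residues of retained HSPs, best E-value first."""
    query_used, ref_used = set(), set()
    for hsp in sorted(hsps, key=lambda h: h["evalue"]):
        q = set(range(hsp["query"][0], hsp["query"][1] + 1))
        r = set(range(hsp["ref"][0], hsp["ref"][1] + 1))
        if (hsp["evalue"] > max_evalue
                or len(q & query_used) > max_query_overlap
                or len(r & ref_used) > max_ref_overlap):
            continue  # discard this HSP
        query_used |= q
        ref_used |= r
    return query_used, ref_used

def well_covered(query_used, ref_used, query_len, ref_len, tolerance=50):
    """Step 6: neither sequence may have more than 50 unaligned residues."""
    return (query_len - len(query_used) <= tolerance
            and ref_len - len(ref_used) <= tolerance)

query_len, hits = read_hits(EXAMPLE_XML)
for hit in hits:
    q_used, r_used = filter_hsps(hit["hsps"])
    print(hit["num"], hit["def"], len(q_used), len(r_used),
          well_covered(q_used, r_used, query_len, hit["len"]))
```

In the toy input, the third HSP is discarded because it overlaps the first by far more than five query residues; the two retained HSPs cover 118 query residues and 121 reference residues, so the hit passes step 6.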
3.8 Collate Domain BLAST Hits
Collation of domain BLAST hits is similar to that of protein
BLAST hits, with some small modifications to tighten constraints on
HSP overlaps arising from discontinuous domains.
1. As in protein BLAST, record the database used, the query
submitted, and the query length.
2. For each domain hit, record the reference domain identifier
(Hit/Hit_def), the hit length (Hit/Hit_len), and the hit number
(Hit/Hit_num).
3. For each HSP in a hit, allow no more than five residues overlap
between query aligned residues and no more than ten residues
overlap between reference aligned residues. HSPs must have an
E-value lower than 2e-3 to be accepted. The total coverage of
aligned reference residues over the reference sequence must be
70% or greater for the hit to be accepted.
4. Collect the protein BLAST, domain BLAST, and domain
HHsearch results into a single structured data format, where
each method contains a list of hits with the respective query
aligned range, reference aligned range, reference aligned coverage,
and quality score (E-value for BLAST, HH probability of
homology for HHsearch) associated with each hit.
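One possible shape for the combined file in step 4 is a JSON document with one list of hits per method. The layout, the field names, and every identifier and score below are invented for illustration; the chapter does not prescribe a particular format. The sketch also applies the 70% reference coverage requirement from step 3 to the domain BLAST hits.

```python
import json

def reference_coverage(ref_ranges, ref_length):
    """Fraction of the reference sequence covered by inclusive (start, end) ranges."""
    covered = set()
    for start, end in ref_ranges:
        covered.update(range(start, end + 1))
    return len(covered) / ref_length

def hit_record(reference, query_ranges, ref_ranges, ref_length, score):
    """One hit with its aligned ranges, coverage, and quality score."""
    return {
        "reference": reference,
        "query_range": query_ranges,
        "reference_range": ref_ranges,
        "reference_coverage": round(reference_coverage(ref_ranges, ref_length), 3),
        "score": score,
    }

def collate(query, protein_hits, domain_hits, hh_hits, min_coverage=0.7):
    """Merge the three hit lists; domain BLAST hits below 70% coverage are dropped."""
    kept = [h for h in domain_hits if h["reference_coverage"] >= min_coverage]
    return {
        "query": query,
        "protein_blast": {"score_type": "evalue", "hits": protein_hits},
        "domain_blast": {"score_type": "evalue", "hits": kept},
        "domain_hhsearch": {"score_type": "hh_probability", "hits": hh_hits},
    }

# Placeholder reference names and scores.
domain_hits = [
    hit_record("ref_domain_1", [(1, 110)], [(3, 112)], 115, 1e-30),
    hit_record("ref_domain_2", [(120, 159)], [(1, 40)], 100, 1e-4),
]
hh_hits = [hit_record("ref_domain_3", [(115, 170)], [(2, 57)], 60, 97.5)]
print(json.dumps(collate("5xct_B", [], domain_hits, hh_hits), indent=2))
```

Recording the score type next to each list keeps E-values and HH probabilities from being compared with each other by mistake in later steps.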
3.9 Domain Partition by Iteration Over Alignments
Given a set of well-scoring hits to protein sequences, domain
sequences, and domain profiles, we are prepared to partition the
query sequence into domains.
1. For a query protein, process alignments in the following order:
protein sequence hits, domain sequence hits, and domain profile
hits. If either less than 5% of the query sequence is unassigned
or less than ten residues remain unassigned, the
partition is complete.
2. For each protein sequence alignment, if at least 90% of the
query aligned residues have not been assigned and less than
5% of the query aligned residues have been assigned to a
previous putative domain, then define domains based on this
protein sequence alignment.
Fig. 3 An example domain partition using multiple aligners. A domain partition using iterative sequence
alignment of a fusion structure of Fv and MST1 coiled coil (5xct_B). This novel domain architecture (a) was
partitioned into its components by hits against a Fv Ig beta-sandwich (b) domain (1mfa_L) and a coiled-coil
domain (c) from MST1 kinase (2jo8_B)
3. To define domains from a protein-protein alignment, consider
the sequence ranges from the domains which comprise the
reference protein. Generate a 1:1 mapping between aligned
positions, omitting those positions in the reference protein
which align to a gap. Using this mapping, generate domain
ranges for the query protein sequence based on the domains of
the reference protein. Discard any domains which are shorter
than the global gap tolerance (20 residues).
4. For each domain sequence alignment, if at least 90% of the
query aligned residues have not been assigned and less than 5%
of the query aligned residues have been assigned to a
previous putative domain, then define a domain based on this
domain sequence alignment.
5. To define domains from a domain-domain alignment, simply
assign the query aligned residues to the reference domain as a
putative domain. Retain the family identification of the refer-
ence domain; this is indicative of the homologous link between
the putative domain and the reference domain.
6. For each domain profile alignment, if at least 90% of the query
aligned residues have not been assigned, less than 20 of the
query aligned residues overlap with a previous putative domain,
the HH probability of homology is greater than 90%, and if the
alignment itself is longer than 20 residues, then use the aligned
residues to define a putative domain.
7. For each putative domain, close small gaps (length < 20 resi-
dues) in the putative aligned domains internal to the defined
range.
8. Check partition completeness by examining the total number of
uncovered residues in the query sequence (<20 residues is
considered good) or the length of the longest uncovered segment.
See Fig. 3 for a sample domain partition result.
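The iteration described in this section can be condensed into a short sketch. It is an illustration under several assumptions rather than the authors' code: each hit is given as a set of query residue positions, protein-protein hits are assumed to have been mapped already to one entry per reference domain (step 3), hits within a method are assumed to arrive sorted from best to worst score, and small gaps are closed once at the end without taking residues from other domains. All names and example values are hypothetical.

```python
METHOD_ORDER = {"protein_blast": 0, "domain_blast": 1, "domain_hhsearch": 2}

def to_ranges(residues):
    """Collapse residue positions into inclusive (start, end) ranges."""
    ranges = []
    for i in sorted(residues):
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], i)
        else:
            ranges.append((i, i))
    return ranges

def is_complete(query_len, assigned):
    """Step 1: fewer than 5% of residues, or fewer than ten, remain unassigned."""
    unassigned = query_len - len(assigned)
    return unassigned < 0.05 * query_len or unassigned < 10

def accept(hit, assigned):
    """Acceptance rules of steps 2, 4, and 6."""
    aligned = hit["residues"]
    overlap = len(aligned & assigned)
    if len(aligned) - overlap < 0.9 * len(aligned):
        return False
    if hit["method"] == "domain_hhsearch":
        return overlap < 20 and hit["probability"] > 90 and len(aligned) > 20
    return overlap < 0.05 * len(aligned)

def close_gaps(residues, max_gap=20):
    """Step 7: fill internal gaps shorter than max_gap residues."""
    ordered = sorted(residues)
    closed = set(ordered)
    for a, b in zip(ordered, ordered[1:]):
        if b - a - 1 < max_gap:
            closed.update(range(a + 1, b))
    return closed

def partition(query_len, hits):
    """Assign putative domains in order of method precedence."""
    assigned, domains = set(), []
    for hit in sorted(hits, key=lambda h: METHOD_ORDER[h["method"]]):
        if is_complete(query_len, assigned):
            break
        if accept(hit, assigned):
            domains.append({"reference": hit["reference"],
                            "residues": set(hit["residues"])})
            assigned |= hit["residues"]
    for domain in domains:
        others = set()
        for other in domains:
            if other is not domain:
                others |= other["residues"]
        domain["residues"] |= close_gaps(domain["residues"]) - others
    return [(d["reference"], to_ranges(d["residues"])) for d in domains]

hits = [
    {"method": "domain_hhsearch", "reference": "dom_B",
     "residues": set(range(100, 196)), "probability": 98.0},
    {"method": "domain_hhsearch", "reference": "dom_C",
     "residues": set(range(90, 131)), "probability": 99.0},
    {"method": "domain_blast", "reference": "dom_A",
     "residues": set(range(1, 40)) | set(range(51, 96))},
]
print(partition(200, hits))
```

For this 200-residue toy query, the domain BLAST hit is processed first and its 11-residue internal gap is closed, the first profile hit defines a second domain, and the second profile hit is rejected because most of its aligned residues are already assigned. The result is dom_A at 1-95 and dom_B at 100-195, leaving nine residues uncovered.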
4 Notes
1. Depositions in the Protein Data Bank often contain modified or
unnatural amino acids. In some cases, such as selenomethionine
or alkylated lysines, modifications reflect a change from a
natural amino acid made to aid purification or structure
determination. Where a parent compound has been recorded
for a modified amino acid, it is possible to revert the residue
to its natural form. Individual modified amino acids are
recorded in the PDBx:pdbx_struct_mod_residue datablock in
the PDB. Residues in this datablock are indexed by both
PDB seqid (PDBx:label_seq_id) and the more commonly used
PDB residue number (PDBx:auth_seq_id). The parent compound
of the modified residue is recorded in the
PDBx:parent_comp_id record.
2. A common problem using protein structures is that many
resources still use the 80-column PDB legacy format. Both
the mmCIF and PDBml formats offer richer data in a more
easily parsed format [12]. Additionally, many recently released
EM structures are simply too large to be described by the
legacy format. There are advantages and disadvantages to
choosing either. Although mmCIF is far more compact (due
to a less verbose loop data structure), there are fewer off-the-shelf
parsing packages available for commonly used scripting and
programming languages. XML parsing tools for PDBml files
are commonly available in many languages, but XML is
extremely verbose, and care must be taken to avoid ruinous
performance traps when using these files on a large scale.
One easy option is to use the "no-atom" PDBml files for those
analyses using only structure metadata (such as sequence).
3. Users of the PDB 80-column legacy format are accustomed to
sequence data in PDB depositions being organized by PDB
chain identifier and PDB residue number. PDB residue number
has many properties which make it unsuitable for use as a
domain index: it is not necessarily sequential, may or may not
incorporate insertion codes, and in rare cases is not unique
within a chain. Internally, both mmCIF and PDBml representations
of structures use a sequence index (seqid) to represent
the position of a residue within a polymer, an entity index
(entity_id) to indicate a specific chemical compound within a
deposition, and an asymmetric index (asym_id) to represent a
specific instance of a chemical compound in 3D space within
the asymmetric unit. The PDBx:pdbx_poly_seq_scheme datablock
provides a single location wherein each of these concepts
is collectively represented.
4. When we refer to structured residues, we refer to those residues
that are structurally resolved (i.e., present in ATOM records).
In addition, we close small gaps (<20 residues) that are
unresolved in sequence by including them. We choose this length
because, in general, domains shorter than 20 residues are not
observed. This structured residue model is thus a subset of the
residues in the deposited sequence. In the subsequent workflow,
it is important to accurately resolve positions in alignments to
this structured residue model, in order to avoid indexing
errors.
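The indexing issue raised in Note 3 can be made concrete with a small example. The rows below are invented and only loosely modelled on the kind of information held in the pdbx_poly_seq_scheme datablock (asymmetric identifier, seqid, residue name, author residue number, insertion code); reading them from a real PDBml or mmCIF file is not shown, and the function names are hypothetical.

```python
# Invented rows: (asym_id, seqid, residue name, author residue number, insertion code)
SCHEME_ROWS = [
    ("A", 1, "MET", "-1", ""),
    ("A", 2, "GLY", "27", ""),
    ("A", 3, "ALA", "27", "A"),
    ("A", 4, "LYS", "27", "B"),
    ("A", 5, "VAL", "28", ""),
    ("A", 6, "LEU", "30", ""),
]

def build_lookup(rows):
    """Map (asym_id, author number, insertion code) to the sequential seqid."""
    return {(asym, number, ins): seqid for asym, seqid, _name, number, ins in rows}

def author_to_seqid_range(lookup, asym_id, start, end):
    """Convert a range given as (author number, insertion code) pairs to seqids."""
    first = lookup[(asym_id, start[0], start[1])]
    last = lookup[(asym_id, end[0], end[1])]
    return first, last

lookup = build_lookup(SCHEME_ROWS)
first, last = author_to_seqid_range(lookup, "A", ("27", ""), ("28", ""))
print(f"author 27-28 -> seqid {first}-{last} ({last - first + 1} residues)")
```

The author range 27-28 looks like two residues but spans four seqid positions because of the insertion codes 27A and 27B, which is why domain ranges are stored as seqid indices throughout this protocol.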
Acknowledgments
This work was supported in part by the National Institutes of
Health (GM094575 to NVG) and the Welch Foundation (I-1505
to NVG).
References
1. Soding J, Lupas AN (2003) More than the sum of their parts: on the evolution of proteins from peptides. BioEssays 25(9):837–846
2. Leipe DD, Aravind L, Grishin NV, Koonin EV (2000) The bacterial replicative helicase DnaB evolved from a RecA duplication. Genome Res 10(1):5–16
3. Tyzack JD, Furnham N, Sillitoe I, Orengo CM, Thornton JM (2017) Understanding enzyme function evolution from a computational perspective. Curr Opin Struct Biol 47(Suppl C):131–139. https://doi.org/10.1016/j.sbi.2017.08.003
4. Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, Shi S, Kim BH, Grishin NV (2014) ECOD: an evolutionary classification of protein domains. PLoS Comput Biol 10(12):e1003926. https://doi.org/10.1371/journal.pcbi.1003926
5. Song N, Sedgewick RD, Durand D (2007) Domain architecture comparison for multidomain homology identification. J Comput Biol 14(4):496–516. https://doi.org/10.1089/cmb.2007.A009
6. Holland TA, Veretnik S, Shindyalov IN, Bourne PE (2006) Partitioning protein structures into domains: why is it so difficult? J Mol Biol 361(3):562–590. https://doi.org/10.1016/j.jmb.2006.05.060
7. Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG (2008) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36(Database issue):D419–D425
8. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402
9. Soding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21(7):951–960. https://doi.org/10.1093/bioinformatics/bti125
10. Remmert M, Biegert A, Hauser A, Soding J (2011) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9:173. https://doi.org/10.1038/nmeth.1818
11. Cheng H, Liao Y, Schaeffer RD, Grishin NV (2015) Manual classification strategies in the ECOD database. Proteins 83(7):1238–1251. https://doi.org/10.1002/prot.24818
12. Westbrook J, Ito N, Nakamura H, Henrick K, Berman HM (2005) PDBML: the representation of archival macromolecular structure data in XML. Bioinformatics 21(7):988–992. https://doi.org/10.1093/bioinformatics/bti082
13. Fu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28(23):3150–3152. https://doi.org/10.1093/bioinformatics/bts565
Chapter 16
A Roadmap to Domain Based Proteomics

Carsten Kemena and Erich Bornberg-Bauer
Abstract
Protein domains are reusable segments of proteins and play an important role in protein evolution. By
combining the elements from a relatively small set of domains into unique arrangements, a large number of
distinct proteins can be generated. Since domains often have specific functions, changes in their arrange-
ment usually affect the overall protein function. Furthermore, domains are well amenable to computational
representations, e.g., by Hidden Markov Models (HMMs), and these HMMs are widely represented in
various databases. Therefore, domains can be efficiently used for proteomic analyses. Here, we describe how
domains are annotated using different domain databases and then how to assess the annotation quality of
proteomes. We next show how functional annotations of domains in large-scale data such as whole genomes
or transcriptomes can be used to analyze molecular differences between species. Furthermore, we describe
methods to analyze the changes in domain content of proteins which significantly helps to characterize and
reconstruct the modular evolution of proteins. Altogether, domain-based methods offer a computationally
highly effective approach to analyze large amounts of proteomic data in an evolutionary setting.
Key words Protein domain, Molecular evolution
1 Introduction
Domains are modular building blocks of proteins. Protein domains
have a conserved sequence, often describe a specific structure or
function and, since they occur in different proteins and frequently
in changing combination with other domains, they represent inde-
pendent units of evolution [1, 2]. Furthermore, the alternative
combinations of domains in a single sequence can generate new
proteins with varying functions [1, 3]. Therefore, a relatively small
number of domains can generate a much higher number of distinct
proteins [4] and allow for a fast adaptation to changing conditions
and the generation of new functions without the need to
completely generate a new protein from scratch. Additionally,
domains are much more conserved when compared to the sur-
rounding linker regions and are therefore traceable even in
sequences that are only very distantly related where, e.g., BLAST
is not able to find matches anymore [5].
Changes on the genome/gene level alter domain arrangements
during evolution and these changes can be tracked, e.g., using
DomRates to determine the evolutionary origin of a domain
arrangement. DomRates is based on an algorithm previously
described by Moore et al. [6]. The most common way to generate
new protein domain arrangements is by fusion of two existing
domain arrangements or the fission of a single domain arrangement
into two separate ones. Another common way is to lose a domain at
either end of the protein [7, 8].
Many domains have a known function. Gene Ontology [9]
assigns formalized terms to them and thereby allows possible
functions to be assigned to unknown sequences by identifying the
domains in a sequence. Changes in the function of a protein are
often reflected in the domains the protein contains. Therefore,
studying the changes of domain arrangements and thereby the
function of a protein is of biological importance.
Domains are identified by the maintainers of databases such as Pfam
and Superfamily based on either structural or sequence similarity.
These databases usually store the domains as HMMs, i.e.,
profile-based models, which have a high sensitivity and selectivity
as they are trained on a data set of known family members. However,
some databases (e.g., PROSITE [10]) use string patterns
(regular expressions) to represent a domain. The latter approach is
much faster but can only find very conserved motifs and can
only give a binary classification.
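To illustrate why a string pattern can only give a yes/no answer, the following minimal sketch converts a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. The conversion rules cover only the basic syntax elements (x, ranges, exclusions, anchors), and both the function name and the toy sequence are assumptions made for illustration.

```python
import re

def prosite_to_regex(pattern):
    """Translate basic PROSITE-style pattern syntax into a Python regex.

    Handles: '-' separators, 'x' wildcards, (n) / (n,m) repeats,
    {ABC} exclusions, and '<' / '>' terminal anchors.
    """
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")
    regex = regex.replace("x", ".")
    regex = regex.replace("{", "[^").replace("}", "]")   # exclusions first
    regex = regex.replace("(", "{").replace(")", "}")    # then repeat counts
    regex = regex.replace("<", "^").replace(">", "$")
    return regex

# Pattern in PROSITE-style syntax and a toy sequence (illustrative only).
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(pattern)
toy_sequence = "MSCAACAAALAAAAAAAAHAAAHKK"

match = re.search(regex, toy_sequence)
has_motif = match is not None   # binary outcome: no score, no e-value
```

The result is either a match or no match. In contrast to a profile HMM hit, there is no score that could be used to rank weak against strong matches.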
The databases can then be used to scan a sequence for occurrences
of known domains and thereby, for example, to gain functional
insights as described above.
This approach, of course, can only identify known
domains. Other approaches, such as the analysis of hydrophobic clusters
in alignments, can identify new, formerly uncharacterized
domains [11]. These approaches detect sequence fragments
which are compatible with some basic structural elements, such as
α-helices and β-sheets, and assume that the resulting spatial proximity
of hydrophobic residues (e.g., L, I, W) indicates whether a relatively
compact and stable conformation exists. This method, therefore,
does not require any prior homology information, but it also does not
provide a reliability measure such as a p-value.
Altogether, domains and domain arrangements are therefore
ideal candidates to be studied when one is interested in proteome
evolution. In the following sections we cover the steps needed for
domain-based proteomics, starting with data preparation and
then moving on to different analysis methods.
2 Materials
All the tools used in the method section are freely available. Below
is a list of used programs with a short description of their purpose.
A Roadmap to Domain Based Proteomics 289
2.1 Databases
• Gene Ontology: Gene Ontology (GO) [9] is an effort to provide
a vocabulary to represent biological functions. Website:
http://geneontology.org.
• InterPro: InterPro [12] is a meta-domain database. It contains
domains from 14 databases and groups identical domains from
different databases into the same InterPro ID. Website:
http://www.ebi.ac.uk/interpro/.
• Pfam: Pfam [13] is a database of domains, with about 16,700
domains (version 31). The domains are based on sequence
conservation and are clustered into clans based on similarity of
either sequence or structure. For each domain family an e-value
threshold is defined to separate random hits from real domain
instance occurrences. Website: http://pfam.xfam.org/.
Besides databases, several programs are needed to analyze the data.
An overview can be found in Table 1.
Table 1
List of software programs needed

Program | Description
BioBundle | A small collection of programs we use to prepare the data. Website: https://github.com/CarstenK/BioBundle
DAMA | DAMA [14] annotates sequences with Pfam domains. The results are based on an existing (e.g., HMMER) annotation that is then improved by using different filter criteria. Website: http://www.lcqb.upmc.fr/DAMA/
DOGMA | DOGMA can assess the quality of proteomes and transcriptomes based on the occurrences of domains. Website: http://domainworld.uni-muenster.de/programs/dogma/
DomRates | Program to trace evolutionary changes in domain arrangements. Website: http://domainworld.uni-muenster.de/programs/domrates/
gffread | gffread is part of the cufflinks package [15] and is used to extract protein sequences from a genome based on a GFF file. Website: http://cole-trapnell-lab.github.io/cufflinks/install/
HMMER | HMMER is a program suite containing programs to construct sequence HMMs of, e.g., domains. These HMMs can then be used in searches for further matches in other sequences. Website: http://hmmer.org/
InterProScan | Program to annotate proteins with domains contained in the InterPro domain database. Website: http://www.ebi.ac.uk/interpro/download.html
PfamScan | The database Pfam [13] provides software to annotate sequences with Pfam domains. The software as well as the domain database are needed to annotate sequences. It uses the HMMER program suite to find domain matches and then uses the Pfam e-value thresholds to filter out overlaps and spurious hits. Website: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/
RADIANT | A fast domain annotation program. Used here together with DOGMA for a fast quality assessment. Website: http://domainworld.uni-muenster.de/programs/radiant/
3 Methods
Here, we describe how to use different programs in a domain-based
proteomics analysis. An overview of the different steps that will be
performed is shown in Fig. 1. Among the many potential mishaps,
we list those that we experience in our research as the most frequent
ones and present solutions for them as well.
[Figure 1: workflow diagram. Inputs: proteome, or genome + GFF, and genes of interest. Data preparation: protein extraction (gffread), stop and pseudogene removal (stopCleaner), isoform removal (isoformCleaner), quality check (DOGMA + RADIANT). Domain annotation: hmmscan, PfamScan, InterProScan (+ DAMA). Analysis: DomRates, topGO.]
Fig. 1 Workflow of a domain-based proteome analysis. The steps “data preparation,” “domain annotation,”
and the analysis itself are covered
3.1 Preparing a Data Set for Domain Annotation and Subsequent Domain-Based Analyses
Proteomes for the species to be analyzed can be found in publicly
available databases, e.g., on general portals (e.g., NCBI [16] or
Ensembl [17]) or on more specialized websites for certain species
groups (e.g., Hymenoptera genomes [18]) or single species. The
simplest way to obtain a proteome set is to download the proteome
directly, but sometimes only a genome and a GFF file are available.
In this case gffread can extract the mRNAs from the genome and
translate them into proteins.
It is important to make sure that the gene annotation version
fits the genome version. If this is not the case the protein extraction
might fail or, in the worst case, might extract incorrect proteins due
to shifts in protein coordinates. Even if the versions match, problems
might occur. A possible error is that two identical gene
annotations exist (with the same ID) or that the same ID has been used
twice for different genes. In these cases the gene annotation needs
to be fixed manually, either by removing the duplicated gene annotation
(first case) or by changing the gene ID (second case).
In other cases, e.g., if scaffolds in the genome file are missing,
the providers of the GFF/genome need to be contacted to ask for
correction. Sometimes the GFF/genome files contain a prefix to
the scaffold names in either the GFF or the genome sequence file
but not in both. The solution is simply to remove the prefix (or add
it to the other file).
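Mismatching scaffold names can be detected before running the protein extraction. The following is a minimal sketch, assuming a FASTA genome and a tab-separated GFF file whose first column holds the scaffold name; the function names and the toy data (including the prefix "Hsal_") are hypothetical and not part of any of the programs listed in Table 1.

```python
def fasta_names(fasta_text):
    """Return the scaffold names, i.e., the first word of each '>' header."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names

def gff_seqids(gff_text):
    """Return the values of column 1 of all non-comment GFF lines."""
    seqids = set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        seqids.add(line.split("\t")[0])
    return seqids

def compare_scaffold_names(fasta_text, gff_text):
    """Return (names only in the GFF, names only in the genome), sorted."""
    genome = fasta_names(fasta_text)
    annotation = gff_seqids(gff_text)
    return sorted(annotation - genome), sorted(genome - annotation)

# Toy input: the GFF carries a prefix on one scaffold name.
toy_genome = ">scaffold_1 length=8\nACGTACGT\n>scaffold_2\nACGTACGT\n"
toy_gff = (
    "##gff-version 3\n"
    "Hsal_scaffold_1\tsrc\tgene\t1\t8\t.\t+\t.\tID=g1\n"
    "scaffold_2\tsrc\tgene\t1\t8\t.\t+\t.\tID=g2\n"
)
only_in_gff, only_in_genome = compare_scaffold_names(toy_genome, toy_gff)
```

Names that appear only in the GFF point either to a prefix problem or to scaffolds that are missing from the genome file. Names that appear only in the genome are usually harmless, since a scaffold may simply carry no annotated gene.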
On the first run of gffread on a genome file it creates an index
file (<genome file>.fai) that contains the names and positions of
the scaffolds for faster access. This file is not regenerated automatically
when the genome file is changed. It is therefore important to
delete the index file after the genome file has been changed manually,
as otherwise gffread will keep using the outdated index.
The terminating stop codon in coding sequences (CDS) is
signified by a stop sign (“.”) at the end of the protein. The “.” at
the end is a minor technical issue which most programs will easily
cope with, though not all. Additionally, sometimes one or several
“.” occur in the middle of a protein sequence, either on purpose
because the GFF contains pseudogenes or because the gene annotation
is erroneous. These genes should be removed because, for example, a
domain could otherwise be found after a recently acquired stop codon.
However, if the proteome was downloaded as protein
sequences, the stop codons at the end might have been removed
already but not the ones in the middle, which might be masked by a
different character (e.g., “U”) as such characters usually cause fewer
problems for programs. These genes should still be removed to avoid wrong
domain annotations as mentioned above.
Additionally, it is important to realize that some read-through
genes (e.g., selenoproteins [19]) do contain a stop codon in the
sequence which is either recoded as an amino acid or
simply ignored during translation. These cases cannot be handled
automatically and need attention by the user. We use the
stopCleaner program of the BioBundle package to remove the
stop signs at the end as well as potentially problematic genes.
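The logic of this cleaning step can be sketched in a few lines. The following is a minimal illustration and not the implementation of stopCleaner; the function name, the default stop characters, and the toy sequences are assumptions. Masking characters such as "U" are deliberately not treated as stops here, because the same letter can denote selenocysteine, so such cases still need manual attention.

```python
def clean_stops(proteins, stop_chars=".*"):
    """Remove one terminal stop sign and drop sequences with internal stops.

    proteins: dict mapping sequence ID to protein sequence.
    Returns (kept, removed_ids).
    """
    kept, removed = {}, []
    for seq_id, seq in proteins.items():
        if seq and seq[-1] in stop_chars:
            seq = seq[:-1]                      # terminal stop sign
        if any(residue in stop_chars for residue in seq):
            removed.append(seq_id)              # internal stop: likely pseudogene
        else:
            kept[seq_id] = seq
    return kept, removed

# Toy proteome (hypothetical IDs and sequences).
toy_proteins = {"g1-RA": "MKTLV.", "g2-RA": "MK.TLV.", "g3-RA": "MKTLV"}
kept, removed = clean_stops(toy_proteins)
```

Read-through genes that the user wants to keep would have to be exempted explicitly, for example by removing their IDs from the input before calling the function.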
A quality check will be performed using DOGMA [20]. It
searches for a set of conserved domain arrangements in the proteome
or transcriptome of interest and computes a quality score
based on the percentage of found arrangements. For the quality
check as well as for the enrichment and DomRates analyses that will
be performed at the end of this chapter (see Fig. 1), each gene
should be represented by a single isoform only. Otherwise, a bias
will be introduced. Since there is no generally accepted way to
mark different isoforms, one first needs to determine how isoforms
are marked. This has to be done manually. A common way is to
mark the end of the protein IDs (e.g., -PA, -PB, ..., or .t1, .t2, ...).
However, it is not always that simple, as gene names might not contain
the isoform information. In this case the GFF file is needed: the
Parent field can be used to identify isoforms because isoforms of the
same gene should have different IDs but identical parent IDs. To
help with this procedure the isoformCleaner program can be
used. It provides different options to keep only the longest isoform
version of a protein. Here, we make use of the possibility to define a
character (option “-s”) which separates the gene name from the
isoform identifier, e.g., the sequence ID “HSAL20401-RB” will be
split into gene ID “HSAL20401” and isoform “RB”.
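The separator-based selection of the longest isoform can be sketched as follows. This is a minimal illustration of the idea, not the code of isoformCleaner; the function name is hypothetical, and splitting at the last occurrence of the separator is an assumption.

```python
def longest_isoforms(proteins, separator="-"):
    """Keep the longest isoform per gene.

    proteins: dict mapping sequence ID (gene ID + separator + isoform)
    to protein sequence. IDs without the separator are treated as genes
    with a single isoform. Ties are resolved in favor of the first entry.
    """
    best = {}
    for seq_id, seq in proteins.items():
        gene_id, _, _isoform = seq_id.rpartition(separator)
        if not gene_id:                 # separator absent
            gene_id = seq_id
        current = best.get(gene_id)
        if current is None or len(seq) > len(current[1]):
            best[gene_id] = (seq_id, seq)
    return {seq_id: seq for seq_id, seq in best.values()}

# Toy data; only the ID "HSAL20401-RB" is taken from the text.
toy_proteins = {
    "HSAL20401-RA": "MKT",
    "HSAL20401-RB": "MKTLLV",
    "HSAL20402-RA": "MA",
}
selected = longest_isoforms(toy_proteins)
```

If the isoform information is only available through the Parent field of the GFF file, the same selection can be applied after grouping the sequence IDs by their parent ID instead of splitting the ID string.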
A general problem is that the quality of available data can vary a
lot. It is therefore important to check the quality of the selected
proteomes. BUSCO [21] and DOGMA [20] are both programs to
analyze proteome and transcriptome quality. Here, we describe
only the usage of DOGMA, as it is based on domains and therefore
suits our purpose well. However, it can be used just as well for
general quality checks even if no domain analyses are planned.
Both programs give very similar results.
DOGMA compares the domains found in the proteome under
analysis with a predefined core set. For best results it is recommended
to use the core set of the smallest group containing
the species to analyze. In the case of the ant used in this example the
insect group is the best fit.
The resulting quality score of 94.55 is very good. A more specific
core set should always give a quality score lower than or equal to that
obtained with a less specific core set, as more domain arrangements are
checked. Therefore, the better the core set fits, the more accurate
the quality estimation will be.
In general one should try to have only proteomes which have a
quality value of at least 75 (ideally higher) as the lower the number
the less comparable are the proteomes.
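The idea of a percentage-based quality score and the cutoff of 75 can be illustrated with a small sketch. This is a simplified stand-in and not the scoring scheme implemented in DOGMA; the function names, the representation of arrangements as tuples of domain IDs, and the toy values are assumptions.

```python
def quality_score(core_arrangements, found_arrangements):
    """Percentage of core domain arrangements found in a proteome."""
    core = set(core_arrangements)
    if not core:
        raise ValueError("the core set must not be empty")
    return 100.0 * len(core & set(found_arrangements)) / len(core)

def select_proteomes(scores, minimum=75.0):
    """Return the names of proteomes whose score reaches the minimum."""
    return sorted(name for name, score in scores.items() if score >= minimum)

# Toy core set of four arrangements (hypothetical domain IDs).
core = [("D1",), ("D2", "D3"), ("D4",), ("D5", "D6", "D7")]
found = [("D1",), ("D2", "D3"), ("D4",), ("D9",)]
score = quality_score(core, found)

usable = select_proteomes({"species_a": 94.55, "species_b": 61.2, "species_c": score})
```

With a more specific (larger) core set, the denominator grows, which is why the score obtained with it cannot exceed the score obtained with a less specific subset as long as the additional arrangements are not all found.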
3.1.1 Annotating Sequences with Domains
The first step in annotating the sequences with domains is to decide
which database to use, as many different ones exist. They differ in
the number of domains they contain, and in the way they define
them (e.g., more structural or sequence based). Here, for demonstration
purposes, we use the Pfam and the InterPro databases and
apply them to the prepared file from the previous section. It contains
17,146 sequences that will be annotated with domains such that we
can perform a domain-based functional enrichment or rearrangement
analysis in the next step.
This is the standard approach to annotate sequences with Pfam
domains. However, there are two ways to further increase the
domain coverage of a proteome. One simple approach would be
to raise the e-value cutoff (i.e., relax the significance threshold) to
allow for more domain hits. However, this increased sensitivity would
also increase the number of false positive hits. A better approach is to
incorporate co-occurrence information as, for example, done in
CODD [22] or DAMA [14]. Domains often occur preferentially
together with specific other domains. This information
can be used during the annotation of sequences. For example,
consider a sequence with two domain hits, one with an e-value
above the default threshold and one below it. By default only the
latter domain would be annotated. But if co-occurrence information
suggests that both domains often occur together, one can recover the
one with the too high e-value.
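The recovery of a weak hit through co-occurrence can be sketched as follows. This is a deliberately simple illustration of the principle and does not reproduce the algorithms of CODD or DAMA; the function name, the single global threshold (Pfam defines thresholds per family), the relaxed threshold, and the domain names are all assumptions.

```python
def rescue_by_cooccurrence(hits, threshold, relaxed_threshold, cooccurring_pairs):
    """Accept confident hits and rescue weak hits supported by co-occurrence.

    hits: list of (domain, evalue) for one sequence.
    cooccurring_pairs: set of frozensets of two domain names that are
    known to occur together frequently.
    """
    confident = [domain for domain, evalue in hits if evalue <= threshold]
    accepted = list(confident)
    for domain, evalue in hits:
        if threshold < evalue <= relaxed_threshold:
            supported = any(
                frozenset((domain, partner)) in cooccurring_pairs
                for partner in confident
            )
            if supported:
                accepted.append(domain)
    return accepted

# Toy hits on one sequence (hypothetical domain names).
toy_hits = [("DomA", 1e-20), ("DomB", 0.5), ("DomC", 0.5)]
known_pairs = {frozenset(("DomA", "DomB"))}
annotated = rescue_by_cooccurrence(toy_hits, 1e-3, 1.0, known_pairs)
```

Here DomB is recovered because it frequently occurs together with the confidently detected DomA, whereas DomC has the same e-value but no such support and stays excluded.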
In the next step it is shown how to use DAMA to improve
domain coverage:
The second approach is to use InterPro, a meta-database which
combines several different domain databases into a single one. This
will result in a higher coverage as well because different databases
contain different domains although, in general, a large overlap
exists between the models characterizing one domain. However,
the results will often contain overlapping domains from different
databases, making further post-processing necessary for many analyses
(e.g., analysis of arrangement changes).
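One common way to post-process such overlapping hits is a greedy filter that keeps the best-scoring hit in each region. The sketch below shows this idea under stated assumptions: hits are given as (start, end, ID, e-value) tuples with 1-based inclusive coordinates, any overlap counts as a conflict, and the IDs are hypothetical. It is not the procedure used by any particular program mentioned in this chapter.

```python
def resolve_overlaps(hits):
    """Greedily keep non-overlapping hits, preferring lower e-values.

    hits: list of (start, end, domain_id, evalue), 1-based inclusive.
    Returns the kept hits sorted by start position.
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h[3]):       # best e-value first
        start, end = hit[0], hit[1]
        if all(end < k[0] or start > k[1] for k in kept):
            kept.append(hit)
    return sorted(kept)

# Toy hits from two member databases on the same protein.
toy_hits = [
    (10, 100, "dbA_0001", 1e-30),
    (15, 95, "dbB_0007", 1e-10),     # overlaps the first hit
    (150, 200, "dbA_0002", 1e-5),
]
arrangement = resolve_overlaps(toy_hits)
```

The resulting list is a linear, non-overlapping domain arrangement that can be passed on to a rearrangement analysis.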
3.2 DomRates: Analyzing Domain Arrangement Changes Along a Phylogenetic Tree
The analysis of domain arrangement changes can provide insights
into the kind of events that were important for a new species. We
trace back domain arrangement changes using the DomRates program.
Based on a domain annotation and a given phylogeny it is
able to reconstruct the events that led to the extant species in the
data set.
We will analyze a small set of hymenopterans. For each species
we prepare the proteome as described above and put the final
domain annotations together in one folder. The set additionally
includes an outgroup (Drosophila melanogaster) to reconstruct
the ancestral state at the root of the hymenopteran branch. Furthermore,
a phylogenetic tree in Newick format is needed. The labels in
the tree correspond to the domain file names without a fixed suffix.
For later visualization we will produce a statistics file which will be
used in a subsequent step. With only six species the tree is very
short. We therefore use the “-l” option to adjust the legend.
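Because the tree labels have to match the annotation file names, it can be worth checking this correspondence before starting the analysis. The following is a minimal sketch, assuming an unquoted Newick string; the function names, the suffix ".dom", and the branch lengths in the toy tree are hypothetical.

```python
import re

def newick_leaf_labels(newick):
    """Extract leaf labels: names that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)

def check_tree_against_files(newick, file_names, suffix):
    """Return (labels without file, files without label), both sorted."""
    if not suffix:
        raise ValueError("a non-empty suffix is required")
    labels = set(newick_leaf_labels(newick))
    stems = {name[: -len(suffix)] for name in file_names if name.endswith(suffix)}
    return sorted(labels - stems), sorted(stems - labels)

# Toy tree with three of the species used in this chapter.
toy_tree = "((Apis_mellifera:0.2,Nasonia_vitripennis:0.3):0.1,Drosophila_melanogaster:0.5);"
toy_files = ["Apis_mellifera.dom", "Nasonia_vitripennis.dom"]
missing_files, unused_files = check_tree_against_files(toy_tree, toy_files, ".dom")
```

In this toy case the outgroup is present in the tree but has no annotation file, which is the kind of mismatch that should be resolved before the run.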
The output of the visualization script is a tree with pie charts
showing the amount and proportion of rearrangement events
which have happened along the branches (Fig. 2).
3.3 Functional Enrichment Analysis Based on Domain Annotations
It is often of great interest whether gene sets (e.g., genes under positive
selection) have a common function as this can help to find
biological explanations. Domains, as known functional units, combined
with a defined biological vocabulary (e.g., Gene Ontology)
can be used to characterize genes with respect to the molecular
or biological processes they are involved in or the cellular
component they are active in.
The Gene Ontology (GO) consortium provides mappings of
GO terms to domains of different databases. Here, we will use the
pfam2go mapping that assigns GO terms to numerous Pfam
domains. The combination of the domain assignments is then
used to identify the function of a protein and perform analyses of
enrichment of certain terms in a set of genes. The R package
topGO [23] provides several algorithms and statistical tests that
can be used for the enrichment analysis. It compares the GO terms
of genes of interest with the larger “GO universe” to identify
overrepresented terms.
[Figure 2: phylogenetic tree with pie charts on the branches. Legend: Fusion, Fission, Terminal Loss, Terminal Emergence, Single Domain Loss, Single Domain Emergence. Values and labels as printed in the figure: 395 Linepithema_humile, 133, 457 Atta_cephalotes, 143, 164, 327 Harpegnathos_saltator, 427 Apis_mellifera, 706 Nasonia_vitripennis, Drosophila_melanogaster; scale bar 0.50.]
Fig. 2 Domain arrangement changes along a selected set of hymenopterans
The first step is to prepare the GO universe. For this, we can use
a simple Python script which merges a domain annotation file with
the pfam2go file into the needed format. The universe file is a
simple two-column text file. The first column contains the gene
names, the second one a comma separated list of GO-IDs asso-
ciated with the gene. The file containing the genes of interest is a
simple text file with one gene ID per line, for example, all the genes
which are differentially expressed or have another common feature
that is of interest.
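The merging step can be sketched as follows. This is a minimal illustration and not the script used by the authors; it assumes that the domain annotation has already been reduced to (gene ID, Pfam accession) pairs and that pfam2go lines have the layout "Pfam:<accession> <name> > GO:<term name> ; <GO-ID>". The function names and the toy input are hypothetical.

```python
from collections import defaultdict

def parse_pfam2go(text):
    """Map Pfam accessions to sets of GO-IDs; lines not starting with 'Pfam:' are skipped."""
    mapping = defaultdict(set)
    for line in text.splitlines():
        if not line.startswith("Pfam:") or " ; " not in line:
            continue
        left, _, go_id = line.rpartition(" ; ")
        accession = left.split()[0][len("Pfam:"):]
        mapping[accession].add(go_id.strip())
    return mapping

def build_universe(gene_domains, pfam2go):
    """Collect the GO-IDs of all domains of each gene (version suffix ignored)."""
    universe = defaultdict(set)
    for gene_id, accession in gene_domains:
        universe[gene_id] |= pfam2go.get(accession.split(".")[0], set())
    return {gene: sorted(terms) for gene, terms in universe.items() if terms}

def format_universe(universe):
    """Two columns: gene name, then a comma-separated list of GO-IDs."""
    return "\n".join(f"{gene}\t{','.join(terms)}" for gene, terms in sorted(universe.items()))

# Toy lines in the layout of the pfam2go file, and toy annotation pairs.
toy_pfam2go = (
    "!comment line\n"
    "Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor activity ; GO:0004930\n"
    "Pfam:PF00001 7tm_1 > GO:G protein-coupled receptor signaling pathway ; GO:0007186\n"
)
toy_annotation = [("geneA", "PF00001.21"), ("geneB", "PF99999.1")]
universe = build_universe(toy_annotation, parse_pfam2go(toy_pfam2go))
universe_text = format_universe(universe)
```

Genes without any mapped GO term (geneB in the toy example) are left out of the universe, since they cannot contribute to the enrichment test.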
We can now install (if necessary) the topGO package directly
from R. Subsequently, we load the package using the library
command.
The next step is to load the data from the files and prepare it for
the following GO term enrichment analysis.
An enrichment test can be performed for three different categories:
molecular function (MF), biological process (BP), and cellular
component (CC). The category to be analyzed can be
changed by simply changing the abbreviation used in the ontology
parameter.
Table 2
Top 10 enriched GO terms based on the “parentChild” algorithm with a Fisher test

No. | GO.ID | Term | Annotated | Significant | Expected | Fisher
1 | GO:0032196 | Transposition | 104 | 18 | 4.30 | 9.5e-12
2 | GO:0006259 | DNA metabolic process | 245 | 26 | 10.14 | 1.2e-10
3 | GO:0044710 | Single-organism metabolic process | 689 | 59 | 28.52 | 1.3e-08
4 | GO:0008152 | Metabolic process | 2302 | 120 | 95.27 | 4.5e-06
5 | GO:0006508 | Proteolysis | 279 | 23 | 11.55 | 1.8e-05
6 | GO:0034641 | Cellular nitrogen compound metabolic pro... | 952 | 41 | 39.40 | 5.3e-05
7 | GO:0006313 | Transposition, DNA-mediated | 104 | 18 | 4.30 | 0.0013
8 | GO:0006725 | Cellular aromatic compound metabolic pro... | 796 | 34 | 32.94 | 0.0029
9 | GO:0046483 | Heterocycle metabolic process | 797 | 34 | 32.99 | 0.0030
10 | GO:0006310 | DNA recombination | 128 | 20 | 5.30 | 0.0061
The “results.csv” file will now contain a list of all GO terms that
are enriched in the gene set of interest and have a p-value < 0.05.
An example output is shown in Table 2.
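To make the columns of such a result table easier to interpret, the following sketch computes the expected count and a classic one-sided Fisher (hypergeometric) p-value for a single GO term. It is written in Python for illustration only and is not the topGO implementation; in particular, the “parentChild” algorithm additionally conditions on parent terms, so its p-values differ from this classic test. The function names and the toy numbers are assumptions.

```python
from math import comb

def expected_count(annotated, n_interest, n_universe):
    """Genes of interest expected to carry the term by chance."""
    return annotated * n_interest / n_universe

def fisher_enrichment_p(annotated, significant, n_interest, n_universe):
    """One-sided p-value P(X >= significant) of the hypergeometric distribution.

    annotated:   genes in the universe carrying the term
    significant: genes of interest carrying the term
    """
    upper = min(annotated, n_interest)
    tail = sum(
        comb(annotated, k) * comb(n_universe - annotated, n_interest - k)
        for k in range(significant, upper + 1)
    )
    return tail / comb(n_universe, n_interest)

# Toy example: universe of 10 genes, 4 carry the term,
# 3 genes of interest, all 3 carry the term.
toy_expected = expected_count(4, 3, 10)
toy_p = fisher_enrichment_p(4, 3, 3, 10)
```

The Expected column of Table 2 is consistent with this proportionality: 4.30/104 and 10.14/245 both give about 0.041, i.e., the fraction of the universe that belongs to the genes of interest.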
In this chapter, we gave a basic overview of why the analysis of
domains is important. Additionally, we described the basic methods
to prepare and analyze data within a protein domain context.
Domains allow a fast evolutionary analysis of large data sets and,
by using GO term assignments, make functional analyses possible
as well. However, it is important to remember that not all domains
have a known function and that not all proteins contain a domain,
which might influence the analysis. Additionally, the database used
might have a species bias (e.g., contain a disproportionate
number of eukaryotic domains) which will influence coverage
and functional depth of analyses based on such annotations.
Acknowledgements
We would like to thank Mark Harrison and Ulrike Brandt for
helpful suggestions.
References
1. Vogel C, Bashton M, Kerrison ND, Chothia C, Teichmann SA (2004) Structure, function and evolution of multidomain proteins. Curr Opin Struct Biol 14(2):208–216
2. Moore AD, Asa KB, Ekman D, Bornberg-Bauer E, Elofsson A (2008) Arrangements in the modular evolution of proteins. Trends Biochem Sci 33(9):444–451
3. Lees JG, Dawson NL, Sillitoe I, Orengo CA (2016) Functional innovation from changes in protein domains and their combinations. Curr Opin Struct Biol 38:44–52
4. Levitt M (2009) Nature of the protein universe. Proc Natl Acad Sci USA 106(27):11079–11084
5. Remmert M, Biegert A, Hauser A, Soding J (2011) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9(2):173–175
6. Moore AD, Grath S, Schüler A, Huylmans AK, Bornberg-Bauer E (2013) Quantification and functional analysis of modular protein evolution in a dense phylogenetic tree. Biochim Biophys Acta Proteins Proteomics 1834(5):898–907
7. Moore AD, Bornberg-Bauer E (2012) The dynamics and evolutionary potential of domain loss and emergence. Mol Biol Evol 29(2):787–796
8. Kersting AR, Bornberg-Bauer E, Moore AD, Grath S (2012) Dynamics and adaptive benefits of protein domain emergence and arrangements during plant genome evolution. Genome Biol Evol 4(3):316–329
9. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1):25–29
10. Sigrist CJA, Castro E, de Cerutti L, Cuche BA, Hulo N, Bridge A, Lydie B, Xenarios I (2013) New and continuing developments at PROSITE. Nucleic Acids Res 41(Database-Issue):344–347
11. Bitard-Feildel T, Heberlein M, Bornberg-Bauer E, Callebaut I (2015) Detection of orphan domains in Drosophila using “hydrophobic cluster analysis”. Biochimie 119:244–253
12. Finn RD, Attwood TK, Babbitt PC, Bateman A, Bork P, Bridge AJ, Chang HY, Dosztanyi Z, El-Gebali S, Fraser M, Gough J, Haft D, Holliday GL, Huang H, Huang X, Letunic I, Lopez R, Lu S, Marchler-Bauer A, Mi H, Mistry J, Natale DA, Necci M, Nuka G, Orengo CA, Park Y, Pesseat S, Piovesan D, Potter SC, Rawlings ND, Redaschi N, Richardson L, Rivoire C, Sangrador-Vegas A, Sigrist C, Sillitoe I, Smithers B, Squizzato S, Sutton G, Thanki N, Thomas PD, Tosatto SC, Wu CH, Xenarios I, Yeh LS, Young SY, Mitchell AL (2017) InterPro in 2017–beyond protein family and domain annotations. Nucleic Acids Res 45(D1):D190–D199
13. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, Potter SC, Punta M, Qureshi M, Sangrador-Vegas A, Salazar GA, Tate J, Bateman A (2016) The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44(D1):D279–D285
14. Bernardes JS, Vieira FR, Zaverucha G, Carbone A (2016) A multi-objective optimization approach accurately resolves protein domain architectures. Bioinformatics 32(3):345–353
15. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28(5):511–515
16. NCBI Resource Coordinators (2017) Database Resources of the National Center for Biotechnology Information. Nucleic Acids Res 45(D1):D12–D17
17. Yates A, Akanni W, Amode MR, Barrell D, Billis K, Carvalho-Silva D, Cummins C, Clapham P, Fitzgerald S, Gil L, Giron CG, Gordon L, Hourlier T, Hunt SE, Janacek SH, Johnson N, Juettemann T, Keenan S, Lavidas I, Martin FJ, Maurel T, McLaren W, Murphy DN, Nag R, Nuhn M, Parker A, Patricio M, Pignatelli M, Rahtz M, Riat HS, Sheppard D, Taylor K, Thormann A, Vullo A, Wilder SP, Zadissa A, Birney E, Harrow J, Muffato M, Perry E, Ruffier M, Spudich G, Trevanion SJ, Cunningham F, Aken BL, Zerbino DR, Flicek P (2016) Ensembl 2016. Nucleic Acids Res 44(D1):D710–D716
18. Elsik CG, Tayal A, Diesh CM, Unni DR, Emery ML, Nguyen HN, Hagen DE (2016) Hymenoptera Genome Database: integrating genome annotations in HymenopteraMine. Nucleic Acids Res 44(D1):793–800
19. Labunskyy VM, Hatfield DL, Gladyshev VN (2014) Selenoproteins: molecular pathways and physiological roles. Physiol Rev 94(3):739–777
20. Dohmen E, Kremer LPM, Bornberg-Bauer E, Kemena C (2016) DOGMA: domain-based transcriptome and proteome quality assessment. Bioinformatics 32(17):2577–2581
21. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31(19):3210–3212
22. Terrapon N, Gascuel O, Marechal E, Breehelin L (2009) Detection of new protein domains using co-occurrence: application to Plasmodium falciparum. Bioinformatics 25(23):3077–3083
23. Alexa A, Rahnenführer J (2016) topGO: enrichment analysis for gene ontology. R package version 2.26.0
Chapter 17
Modeling of Protein Tertiary and Quaternary Structures
Based on Evolutionary Information
Gabriel Studer, Gerardo Tauriello, Stefan Bienert,
Andrew Mark Waterhouse, Martino Bertoni, Lorenza Bordoli,
Torsten Schwede, and Rosalba Lepore
Abstract
Proteins are subject to evolutionary forces that shape their three-dimensional structure to meet specific
functional demands. The knowledge of the structure of a protein is therefore instrumental to gain
information about the molecular basis of its function. However, experimental structure determination is
inherently time consuming and expensive, making it impossible to follow the explosion of sequence data
deriving from genome-scale projects. As a consequence, computational structural modeling techniques
have received much attention and established themselves as a valuable complement to experimental
structural biology efforts. Among these, comparative modeling remains the method of choice to model
the three-dimensional structure of a protein when homology to a protein of known structure can be
detected.
The general strategy consists of using experimentally determined structures of proteins as templates for
the generation of three-dimensional models of related family members (targets) of which the structure is
unknown. This chapter provides a description of the individual steps needed to obtain a comparative model
using SWISS-MODEL, one of the most widely used automated servers for protein structure homology
modeling.
Key words Homology modeling, Oligomeric proteins, Quaternary structure, Protein structure pre-
diction, Model quality assessment, Model quality estimates, SWISS-MODEL
1 Introduction
Homology modeling, or comparative protein structure modeling,
is a technique to generate a three-dimensional model of a protein
from its amino acid sequence (target) using the structures of related
proteins as reference (templates) [1, 2]. The applicability and success
of the approach mainly depend upon two factors: the extent of
sequence similarity between the target protein and the template,
and the extent of structural divergence during evolution
[3]. Although confident results are expected in case of close
homologs, each step of the modeling process should be carefully
considered as it can affect the intended applications of the model.
In this context, the availability of automated servers with user-friendly
web interfaces allows nonspecialists, too, to generate reliable
3D models without the need to install complex software and databases,
and provides easy access to results, their visualization, and
interpretation. SWISS-MODEL pioneered the field of automated
modeling servers 25 years ago. Since then, it has been continuously
improved and its functionality greatly extended following both
methodological and conceptual advances [4–7].
From a technical point of view, comparative modeling consists
of the following main steps:
1. The amino acid sequence of the target protein is compared to
the sequences of homologous proteins of known structure in
order to identify reliable templates.
2. The amino acid sequences of the target and the selected tem-
plates are aligned.
3. Three-dimensional models of the target are built based on the
alignments.
4. The global and local quality of the resulting models is evaluated.
There is broad consensus that steps 1 and 2 are the
most critical ones [8], especially when only remote homologs with
known structure are available. Accordingly, much effort has been
devoted over the past years to the development and improvement
of dedicated methods. Position-specific sequence profiles (PSSM)
and profile-based hidden Markov models (HMM) have represented
a breakthrough in this context, being able to enhance template
identification and alignment even in case of remote homology.
The importance of a curated and annotated template library at this
stage of the process can hardly be overstated. The SWISS-MODEL
Template Library (SMTL) is a database of experimental
structures derived from the Protein Data Bank (PDB) [9], which
are thoroughly processed, annotated, and organized to support
efficient query of high-quality template structure data [5]. The
SMTL is searched with two sequence search methods: BLAST
[10] and HHblits [11]. While the first delivers fast and accurate
templates of closely related sequences, the second adds sensitivity in
case of distant homology, taking advantage of HMM-HMM align-
ments and incorporating secondary structure information pre-
dicted by PSIPRED [12].
Once a list of potential templates is obtained, the next step is to
select the best possible among the available ones. To this aim, any
available information about the target and the template should be
taken into account. In other words, sequence similarity is not the
only factor to consider, especially if different templates with similar
sequences exist. The experimental quality of the structure is an
important parameter, as well as the environmental conditions in
which it has been determined. Hence it can be beneficial for the
planned application to select a template structure which is in com-
plex with a given ligand or substrate or in a certain conformational
state. If the target and the template share the same function, their
active sites or functional residues are expected to be conserved and
aligned [13]. However, this is not necessarily what a sequence
alignment algorithm produces, i.e., the alignment that maximizes
a certain similarity score. On the other hand, even small mistakes in
the alignment can translate into prominent errors in the final
model. Therefore, a careful inspection of the alignment is generally
recommended since it provides a valid a priori estimate of the
expected accuracy of the model and, when possible given the
information known about the specific protein family, allows avoid-
ing modeling errors by manually correcting alignment problems.
SWISS-MODEL facilitates addressing these two aspects. Based on
the analysis of different target-template alignment properties such
as sequence identity, sequence similarity, and secondary structure
agreement, SWISS-MODEL computes a Global Model Quality
Estimation score (GMQE), indicating the expected quality of the
model resulting from the given alignment [5]. Comprehensive
protein annotations and display functionalities are also provided
to aid the identification of problems in the alignment; for example,
template secondary structure information is shown to facilitate
identification of errors due to incorrect placement of insertions/deletions
in conserved regions.
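Two of the simplest alignment properties mentioned above, sequence identity and target coverage, can be computed directly from a pairwise alignment. The sketch below is only meant to make these quantities concrete; it is not the GMQE score, which combines several properties, and the choice of denominator for the identity (aligned, non-gap columns) is one of several common conventions. The function name and the toy alignment are assumptions.

```python
def alignment_identity(target_aln, template_aln, gap="-"):
    """Return (percent identity over aligned columns, percent target coverage)."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    aligned = [
        (a, b) for a, b in zip(target_aln, template_aln) if a != gap and b != gap
    ]
    target_length = sum(a != gap for a in target_aln)
    if not aligned or target_length == 0:
        return 0.0, 0.0
    identical = sum(a == b for a, b in aligned)
    identity = 100.0 * identical / len(aligned)
    coverage = 100.0 * len(aligned) / target_length
    return identity, coverage

# Toy target-template alignment (hypothetical sequences).
toy_target = "MKT-LLVA"
toy_template = "MRTGL--A"
identity, coverage = alignment_identity(toy_target, toy_template)
```

In this toy alignment five columns are aligned without gaps, four of them are identical, and five of the seven target residues are covered by the template.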
The aspects described so far are all key elements to identify a
reliable template to produce single-chain protein models. But it is
well known that the function of a protein is often the result of its
interaction with other proteins, forming either homo- or hetero-
oligomeric complexes. Therefore, the oligomeric state of the target
protein must be considered in order to generate biologically mean-
ingful models. However, protein oligomeric states are difficult to
characterize experimentally and, during evolution, show only lim-
ited conservation: i.e., the number of subunits and their binding
modes in different complexes can vary substantially [14]. As a
consequence, modeling a protein in its correct quaternary structure
is still far from being a routine procedure. In SWISS-MODEL, we
recently introduced a new strategy to infer the stoichiometry and
the overall structure of oligomers by homology [15]. The method
exploits a novel description of protein-protein interface conserva-
tion as a function of evolutionary distance and an efficient distance
measure to structurally compare homologous multimeric protein
complexes. Interface conservation scores, structural clustering, and
classical interface descriptors are combined in a supervised machine
learning algorithm to provide a quaternary structure quality esti-
mate (QSQE). This information, together with annotation of the
template quaternary structure, is provided in SWISS-MODEL to
improve the selection of homologous protein templates for the
subsequent modeling steps.
Immunoglobulins represent an important class of hetero-
oligomeric proteins, for which classical comparative modeling
approaches give only suboptimal results, especially for the variable
loop regions responsible for epitope recognition. However, specific
protocols exist that enable the modeling of this class of molecules
with high accuracy based on the canonical structures of antibody
loops [16]. In line with this, the pre-screening of the target
sequence in SWISS-MODEL has been extended to identify
whether a target sequence represents an immunoglobulin and, if a
matching sequence signal is detected, to send the data to the Prediction
of Immunoglobulin Structure server PIGSPro [17], where the
modeling job can proceed. Alternatively, modeling can also be
performed using the standard SWISS-MODEL workflow. However,
this is only recommended if the template structure is closely
related, that is, if there are no insertions/deletions in the
variable loop regions.
Given a target-template alignment, the model coordinates are
built based on the assumption that aligned residues between the
target and the template are structurally equivalent. In practice, this
requires the transfer of information from the template and the
modeling of parts where such information is missing, i.e., insertions
and deletions in the alignment. SWISS-MODEL relies on the
OpenStructure software framework [18] and the ProMod3 model-
ing engine (manuscript in preparation) to generate the atomic
coordinates for all residues of the target protein that are in the
range covered by the target-template alignment. This is achieved
by fully automatically performing the following steps: (1) building
an initial model, (2) loop modeling, (3) modeling of side chains,
(4) energy minimization, and (5) model quality estimation. Each
step is summarized below (see also Fig. 1).
1.1 Building an Initial Model
In this step, structural information from template residues is
transferred to the corresponding target residues as defined by the
target-template alignment. Several algorithms have been developed to
accomplish this task, based on different approaches which are
reviewed elsewhere [19]. In SWISS-MODEL this is done by transferring
the atomic coordinates from the corresponding template residues in
Cartesian space. ProMod3 aims at inferring as many atom positions as
possible from template structures, depending on the conservation of
the corresponding residues between target and template. This usually
results in an incomplete model with missing side-chain coordinates
and gaps originating from amino acid insertions/deletions.
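The coordinate transfer described above can be illustrated with a minimal sketch. It is not the ProMod3 implementation: the function name, the data layout, and the simplified rule (copy all atoms for identical residues, backbone atoms only for substituted residues) are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[float, float, float]
Residue = Dict[str, Coord]  # atom name -> Cartesian coordinates
BACKBONE = ("N", "CA", "C", "O")


def build_initial_model(target_aln: str, template_aln: str,
                        template_residues: List[Residue]) -> List[Optional[Residue]]:
    """Copy template coordinates onto target residues along an alignment.

    Returns one entry per target residue: a dict of atom coordinates, or
    None where the target residue faces a gap (an insertion that has to be
    modeled in a later step). Hypothetical helper, not a ProMod3 call.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    model: List[Optional[Residue]] = []
    t_idx = 0  # index of the next unused template residue
    for tgt, tpl in zip(target_aln, template_aln):
        tpl_res = None
        if tpl != "-":
            tpl_res = template_residues[t_idx]
            t_idx += 1
        if tgt == "-":
            continue  # deletion: template residue has no target counterpart
        if tpl_res is None:
            model.append(None)  # insertion: no structural information
        elif tgt == tpl:
            model.append(dict(tpl_res))  # conserved: transfer all atoms
        else:
            # substituted (assumed rule): keep backbone, rebuild side chain later
            model.append({a: xyz for a, xyz in tpl_res.items() if a in BACKBONE})
    return model
```

The `None` entries and the residues lacking side-chain atoms correspond to the gaps and missing side chains that the subsequent loop and side-chain modeling steps have to fill.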
Fig. 1 Flowchart of the comparative protein structure modeling pipeline implemented in SWISS-MODEL. Color code in the original figure: blue, user input; yellow, user action; green, automatic
1.2 Loop Modeling
With the possible exception of antibody loops, as discussed later in
this chapter, modeling protein loops is a challenging task and often
a major source of modeling errors. Loop modeling methods can be
categorized into two main groups: ab initio and database
approaches [19–22]. ProMod3 uses geometric criteria to query a
database containing high-resolution X-ray structures for matching
loop candidates. Candidate loops are fitted to the loop stems using
the cyclic coordinate descent algorithm [23] and scored based on
statistical potentials of mean force [24]. The best candidate is then
selected according to its score and inserted into the model.
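The database query and score-based selection can be sketched as follows. This is a simplified illustration, not the ProMod3 code. The geometric criterion used here (loop length plus stem-to-stem Cα distance within a tolerance) and all names are assumptions. The fitting by cyclic coordinate descent is omitted, and the statistical potential is replaced by a caller-supplied scoring function for which lower values are assumed to be better.

```python
import math
from typing import Callable, List, NamedTuple, Optional, Sequence, Tuple

Coord = Tuple[float, float, float]


class LoopCandidate(NamedTuple):
    name: str
    ca_coords: Sequence[Coord]  # Cα trace including both stem residues


def stem_distance(ca_coords: Sequence[Coord]) -> float:
    """Distance between the first and last Cα atom of a fragment."""
    return math.dist(ca_coords[0], ca_coords[-1])


def select_loop(candidates: List[LoopCandidate], n_stem: Coord, c_stem: Coord,
                loop_length: int, score: Callable[[LoopCandidate], float],
                tolerance: float = 1.0) -> Optional[LoopCandidate]:
    """Return the best-scoring candidate that matches the gap geometry."""
    target = math.dist(n_stem, c_stem)
    matches = [c for c in candidates
               if len(c.ca_coords) == loop_length + 2  # loop plus two stems
               and abs(stem_distance(c.ca_coords) - target) <= tolerance]
    if not matches:
        return None  # no database fragment fits this gap
    return min(matches, key=score)  # assumed convention: lower is better
```

A return value of `None` corresponds to the situation where a database approach cannot close the gap and another strategy is required.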
1.3 Modeling of Side Chains
To model non-conserved side chains, ProMod3 extracts side-chain
conformations from the Dunbrack rotamer library [25] and determines
their optimal conformation by minimizing the SCWRL4 energy
function [26] using a graph-based approach [27].
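The optimization problem behind this step can be made concrete with a small sketch. It enumerates all rotamer combinations exhaustively, which is only feasible for a handful of residues. The graph-based approach cited above solves the same kind of problem efficiently, and this sketch does not reproduce it. The energy tables are hypothetical placeholders and not the SCWRL4 energy function.

```python
from itertools import product
from typing import Dict, List, Tuple


def best_rotamers(self_energy: List[List[float]],
                  pair_energy: Dict[Tuple[int, int], List[List[float]]]):
    """Find the rotamer assignment with the lowest total energy.

    self_energy[i][r]         energy of rotamer r at residue i
    pair_energy[(i, j)][r][s] interaction energy of rotamer r at residue i
                              with rotamer s at residue j
    """
    best, best_e = None, float("inf")
    # Exhaustive search over one rotamer index per residue
    for combo in product(*(range(len(rots)) for rots in self_energy)):
        e = sum(self_energy[i][r] for i, r in enumerate(combo))
        e += sum(m[combo[i]][combo[j]] for (i, j), m in pair_energy.items())
        if e < best_e:
            best, best_e = combo, e
    return best, best_e
```

The example shows why pairwise terms matter: a rotamer that is optimal in isolation may be rejected because it clashes with a neighboring side chain.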
1.4 Energy Minimization
The modeling process can produce stereochemical irregularities
and clashes, which ProMod3 resolves in this step. The model is
parameterized using the CHARMM27 force field [28] and energies
are evaluated using the OpenMM package [29]. ProMod3 applies
short steepest descent and conjugate gradient minimization
iteratively on the model until all stereochemical problems are
resolved or an upper bound of iterations is reached. This concludes
the modeling process and produces the final model.
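The control flow of this step (short minimization rounds repeated until no problems remain or an iteration limit is hit) can be sketched with a toy example. The sketch uses plain steepest descent on a hypothetical soft-repulsion potential between point atoms. It uses neither CHARMM27 nor OpenMM, it omits conjugate gradient minimization, and all parameter values are arbitrary assumptions.

```python
import math


def find_clashes(coords, min_dist, tol=0.0):
    """Return index pairs of atoms closer than min_dist - tol."""
    return [(i, j) for i in range(len(coords)) for j in range(i + 1, len(coords))
            if math.dist(coords[i], coords[j]) < min_dist - tol]


def steepest_descent_step(coords, min_dist, step):
    """One step down the gradient of E = sum over overlapping pairs of (min_dist - d)^2."""
    grads = [[0.0, 0.0, 0.0] for _ in coords]
    for i, j in find_clashes(coords, min_dist):
        d = math.dist(coords[i], coords[j])
        if d == 0.0:
            continue  # coincident atoms: direction undefined, skipped in this toy
        scale = 2.0 * (min_dist - d) / d
        for k in range(3):
            delta = coords[i][k] - coords[j][k]
            grads[i][k] -= scale * delta  # dE/dx_i
            grads[j][k] += scale * delta  # dE/dx_j
    for atom, g in zip(coords, grads):
        for k in range(3):
            atom[k] -= step * g[k]


def minimize_until_clean(coords, min_dist=3.0, tol=0.01, step=0.1,
                         steps_per_round=20, max_rounds=50):
    """Repeat short minimization rounds until clash-free or max_rounds is reached."""
    coords = [list(map(float, c)) for c in coords]
    for round_no in range(max_rounds):
        if not find_clashes(coords, min_dist, tol):
            return coords, round_no  # all problems resolved
        for _ in range(steps_per_round):
            steepest_descent_step(coords, min_dist, step)
    return coords, max_rounds  # upper bound of iterations reached
```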
1.5 Model Quality Estimation
Quality estimation tools aim to quantify modeling errors and give
estimates of the expected model accuracy on both a global and a
per-residue scale. From a modeling perspective, such estimates are
useful to select the best model from a set of alternatives or to detect
local errors. But, even more importantly, they aim to determine the
usefulness of a model for a specific application at hand [30, 31].
Various tools assessing physical plausibility are routinely applied to
models based on experimental data [32]. However, while stereo-
chemistry is a necessary condition for a high-quality model, it is not
a sufficient criterion to indicate similarity of a theoretical model to
the native target structure. Knowledge-based approaches with sta-
tistical potentials of mean force [24] constitute a valid complement
for estimating the expected accuracy of a theoretical model. SWISS-
MODEL relies on QMEAN [33, 34] to assign global and
per-residue quality estimates. QMEAN linearly combines four sta-
tistical potentials of mean force. Two of them evaluate pairwise
distances, the first between all chemically distinguishable heavy
atoms and the second between Cβ atoms. Two more potentials
evaluate backbone torsion angles and packing of the model.
The accuracy of models generated by SWISS-MODEL is con-
tinuously assessed by the CAMEO project [35] based on weekly
blind prediction of proteins from the upcoming PDB release.
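The linear combination of the four potential terms can be written down as a short sketch. The term names follow the labels used later in this chapter (all atom, Cβ, torsion, solvation), with the solvation term assumed to correspond to the packing potential. The weights and the intercept are placeholders supplied by the caller, not the published QMEAN parameters.

```python
TERMS = ("all_atom", "cbeta", "torsion", "solvation")


def combine_terms(term_scores, weights, intercept=0.0):
    """Weighted linear combination of four statistical potential terms.

    term_scores and weights are dicts keyed by the names in TERMS.
    The weights are hypothetical; real values are fitted on reference data.
    """
    missing = [t for t in TERMS if t not in term_scores or t not in weights]
    if missing:
        raise KeyError("missing terms: " + ", ".join(missing))
    return intercept + sum(weights[t] * term_scores[t] for t in TERMS)
```

The same function can be applied to whole-model terms for a global score or to terms evaluated around a single residue for a per-residue score.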
1.6 Concluding Remarks
The availability of robust, fully automated workflows for protein
structure modeling has made homology modeling the method of
choice for generating three-dimensional models of proteins when
experimental structures are not available. Easy-to-use interactive
web servers and reliable model quality estimation tools also allow
nonspecialists to use protein models successfully in structure-based
applications in biomedical research.
2 Materials
1. A computer with access to the Internet and a web browser.
2. The amino acid sequence, either in FASTA format or as plain text, or
the UniProtKB identifier of the target to be modeled.
3. The DeepView desktop application, available at
https://spdbv.vital-it.ch/ (optional).
4. A DeepView project (optional).
5. The coordinates in PDB format of the structure to be used as a
template (optional).
6. A target-template alignment, either in FASTA or Clustal for-
mat (optional).
7. An e-mail address (optional).
3 Methods
Figure 1 illustrates the workflow of the modeling pipeline
implemented in SWISS-MODEL. A detailed description of the
individual steps is provided in the following sections.
3.1 Access the SWISS-MODEL Workspace
The SWISS-MODEL website is available at
https://swissmodel.expasy.org. From the homepage and any other page
on the website, the user can create a password-protected user
account, which is associated with an e-mail address. It is also
possible to use SWISS-MODEL anonymously, i.e., without
registration. However, in this case it is necessary to bookmark
individual project URLs in order to access the results once the
session has been closed. By default, projects are stored for 2 weeks
on the server with an option to extend the project lifetime.
3.2 Start a New Modeling Project
Depending on the type of information at hand, there are different
modes to start a new modeling project. If only the sequence of the
target protein to be modeled is available, the first step is to search
for templates, as described in Subheading 3.2.1 (Sequence mode).
Alternatively, if the model should be based on a specific template
structure, three different modes can be used according to the
template-related information available: Template mode
(Subheading 3.2.2), Alignment mode (Subheading 3.2.3), and Project
mode (Subheading 3.2.4).
3.2.1 Sequence Mode: Starting from the Sequence of the Target Protein
1. Insert the amino acid sequence of the target protein into the
main input box of the homepage. The sequence can be
provided either as plain text or in FASTA format. Alternatively,
the UniProtKB identifier can be used.
2. Press the “Validate” button or the return key. A sequence
validation step is performed to check for nonstandard amino
acid codes and to reformat the input sequence. If the target
UniProtKB identifier is provided as input, the protein sequence
is automatically retrieved and validated. After validation, a
non-editable wrapped view of the target sequence is displayed.
3. If the target protein is heteromeric, i.e., it consists of different
protein chains as subunits, it is possible to enter an additional
amino acid sequence by clicking the “Add Hetero Target”
button. Repeat this step until all subunit sequences have been
entered.
4. The next step is to identify reliable templates to be used for
modeling. Two options are available to perform this task:
(a) Manual template selection: This option allows the user to
inspect the template search results before selecting one or
more template structures for modeling, taking into
account information such as quality of the experimental
structure, oligomeric state, bound ligands, or crystalliza-
tion conditions. To use this option, click the “Search for
Templates” button and proceed to step 3.3.
(b) Automatic template selection: Using this option, when
the template search is complete, templates are ranked
according to the expected quality of the resulting models
and a number of templates are selected automatically. This
option is especially useful for well-characterized protein
families where target-template sequence similarity is
expected to be sufficiently high to automatically generate
unambiguous alignments and high-quality models. Note
that also in this option, the full template search results can
be inspected to select additional template structures in
case the automated modeling results are not satisfactory.
To use this option, click the “Build Model” button and
proceed to step 3.4 to access the modeling results.
In both cases, as soon as the template search starts, the target
sequence is automatically scanned to verify whether any
immunoglobulin variable domain is present in the input. If this
is the case, the user is provided with a link to the Prediction of
Immunoglobulin Structure server PIGSPro [17]. The link redirects
to the PIGSPro server home page, where the input form is
pre-filled with the detected antibody variable domains and the
modeling process can proceed using a protocol developed
specifically for immunoglobulins (see Note 1).
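Before submission, the input sequence can be checked locally in the spirit of the validation performed in step 2. The sketch below is an assumption about what such a check might look like and does not reproduce the server's validation rules. In particular, restricting the alphabet to the 20 standard one-letter codes and wrapping at 60 characters are choices made for illustration.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # assumed alphabet: 20 standard codes


def parse_target_sequence(text: str) -> str:
    """Accept plain text or a single FASTA record and return a clean sequence."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    if lines and lines[0].startswith(">"):
        lines = lines[1:]  # drop the FASTA header line
    seq = "".join("".join(ln.split()) for ln in lines).upper()
    if not seq:
        raise ValueError("empty sequence")
    bad = sorted(set(seq) - STANDARD_AA)
    if bad:
        raise ValueError("nonstandard amino acid codes: " + ", ".join(bad))
    return seq


def wrap(seq: str, width: int = 60) -> str:
    """Return the sequence as fixed-width lines for display."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
```

For a heteromeric target, the same check would be applied to each subunit sequence separately.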
3.2.2 Template Mode: Using a Specific Three-Dimensional Structure as Template
1. Click the “User Template” button.
2. Submit the target sequence as described in “Sequence mode.”
3. Upload the template coordinates in PDB format.
4. Click the “Build Model” button and proceed to step 3.4.
3.2.3 Alignment Mode: Using a Specific Template from the SWISS-MODEL Template Library (SMTL) with a User-Defined Target-Template Alignment
1. Click the “Target-Template Alignment” button.
2. Provide the target-template alignment, either by copy/paste of
the alignment or by uploading a file. The alignment must be in
FASTA or Clustal format. Make sure that the provided template
sequence and ID correspond to SEQRES and ID of the
SMTL entry, respectively. Note that the content of the SMTL
can be browsed manually (Menu: Modeling—Template
Library).
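A FASTA alignment can be checked against the requirements of step 2 before it is uploaded. The following sketch makes several assumptions that are not stated in the protocol: the target record comes first and the template record second, gaps are written as "-", and the ungapped template sequence must equal the full SEQRES sequence. The template ID and SEQRES string have to be looked up by the user, and the ID used in the test is made up.

```python
def read_fasta_alignment(text):
    """Parse FASTA text into a list of (name, aligned_sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records.append([header.split()[0] if header else "", []])
        elif records:
            records[-1][1].append(line)
    return [(name, "".join(parts)) for name, parts in records]


def check_alignment(text, template_id, template_seqres):
    """Return a list of problems found; an empty list means no problem was detected."""
    records = read_fasta_alignment(text)
    if len(records) != 2:
        raise ValueError("expected exactly two records: target, then template")
    (_, target_aln), (tpl_id, tpl_aln) = records
    problems = []
    if len(target_aln) != len(tpl_aln):
        problems.append("aligned sequences differ in length")
    if tpl_id != template_id:
        problems.append("template ID mismatch")
    if tpl_aln.replace("-", "") != template_seqres:
        problems.append("template sequence does not match SEQRES")
    return problems
```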
3.2.4 Project Mode: Using a Specific Three-Dimensional Structure as Template by Manually Adjusting the Target-Template Alignment in the DeepView Desktop Application
The desktop application DeepView (available for Windows and
Mac OS) [1] allows for visualization of one or more template
structures, and manual editing of the target-template alignment.
Projects generated in DeepView, or obtained in step 3.4, can be
submitted for modeling after manual manipulation of the
target-template alignment.
1. Click the “DeepView Project” button.
2. Upload the DeepView project file.
3.3 Template Identification
The Template Results page provides an overview of the available
templates as well as interactive views and selection tools.
1. From the views below, select one or more template structures
for modeling.
(a) Templates. The main table displays the list of top 50 tem-
plates, ranked according to the expected quality of the
resulting model. The complete list of templates is accessi-
ble by links at the bottom of the Template Results page.
Features such as coverage, model quality estimates
(GMQE and QSQE), oligomeric state, and bound ligands
are shown in a condensed tabular form. Each of the table
rows can be expanded to display additional information
and the target-template alignments. If the desired template
could not be found or the templates do not cover the
full target sequence, please refer to Notes 2–4. Note that
selecting a template with a given bound ligand of interest
does not guarantee that the ligand will be present in the
final model (see Note 5).
(b) Quaternary Structure. The results of the quaternary struc-
ture analysis are provided for oligomeric targets. Tem-
plates are clustered and displayed in a decision tree
according to their quaternary structure features: oligo-
meric state, stoichiometry, topology, and interface similar-
ity. Each leaf of the tree corresponds to a template, with a
bar indicating sequence identity and coverage to the tar-
get. In each cluster, templates are ranked according to
their QSQE score. According to the selected clustering
level, a protein-protein interaction (PPI) fingerprint plot
informs about the conservation of template interfaces as a
function of the evolutionary distance within the protein
family. Please refer to Notes 6 and 7 if no template with
the expected oligomeric state is found.
(c) Sequence Similarity. A chart displays how templates relate
to each other in the sequence similarity space. Each tem-
plate is shown as a circle on an interactive plot, which
allows selecting individual or groups of templates for fur-
ther inspection and template superposition.
(d) Alignment of Selected Templates. Target-template alignments
are shown with different coloring options for highlighting
various sequence and structure properties along
the alignment (e.g., secondary structure, hydrophobicity,
solvent accessibility). Careful inspection of the alignment
is recommended since it makes it possible to identify, and
possibly prevent, modeling errors beforehand (see Note 8).
2. Click the “Build Model” button to run the modeling using the
selected templates and proceed to step 3.4.
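Two of the features shown in the template table, coverage and sequence identity, can be recomputed from a target-template alignment. Definitions of both quantities vary between tools. The sketch below assumes that coverage is the fraction of target residues aligned to a template residue and that identity is counted over aligned columns only, which may differ from the values reported by SWISS-MODEL.

```python
def alignment_stats(target_aln: str, template_aln: str):
    """Return (coverage, identity) for a pairwise alignment with '-' as gap."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    target_len = sum(c != "-" for c in target_aln)
    aligned = identical = 0
    for tgt, tpl in zip(target_aln, template_aln):
        if tgt != "-" and tpl != "-":
            aligned += 1                # column with a residue in both sequences
            identical += (tgt == tpl)   # same amino acid in both
    coverage = aligned / target_len if target_len else 0.0
    identity = identical / aligned if aligned else 0.0
    return coverage, identity
```

Inspecting both numbers together is useful because a template with high identity may still cover only a small part of the target.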
3.4 Accessing Modeling Results
After completion of the modeling process, a detailed report of the
modeling project is generated and can be accessed from the
workspace. Model coordinates can be downloaded either as PDB files
or as DeepView project files.
The generated model(s) can be inspected in the model results
page using the embedded structural viewer. The target-template
alignment used for modeling is also shown and linked to the
structure visualization such that hovering the mouse over the
alignment highlights the corresponding residue in the viewer and
vice versa (Fig. 2).
Fig. 2 Modeling results for the superoxide dismutase [Cu-Zn] protein from S. pombe (SOD1, UniProtKB AC:
P28758) generated in automated mode in SWISS-MODEL. SpSOD1 is predicted as homo-2-mer including 1 Zn
ion and 1 Cu ion per subunit as cofactors based on the experimental structure of the deep-sea yeast
Cryptococcus liquefaciens homologue (SMTL: 3ce1.1.A; [36]) as template
3.5 Model Quality Estimation
By default, models and alignments are colored based on the
per-residue quality estimates from the QMEAN scoring function
[37]. The color gradient ranges from red to blue, indicating low to
high estimated per-residue quality. The same information is also
available for every model in the form of a Local Quality plot, as well
as in the B-factor column of the downloadable PDB file. The
Global Quality plot gives an estimate of the overall model quality,
based on four individual terms: Cβ, all atom, solvation, and torsion.
The QMEAN score is also compared to what one would expect
from experimentally determined protein structures of similar size
using a Z-score scheme (hence 0.0 would be the optimal score).
This is illustrated in the Comparison plot (Fig. 2).
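Because the per-residue estimates are stored in the B-factor column, they can be recovered from a downloaded PDB file with a few lines of code. The sketch below relies on the fixed-column layout of PDB ATOM records and, as an assumption, reads one value per residue from the Cα atom. The second function shows the generic Z-score calculation, where the reference scores are placeholders for values obtained from experimental structures of similar size.

```python
from statistics import mean, pstdev


def per_residue_quality(pdb_text: str):
    """Map (chain ID, residue number) to the value in the B-factor column."""
    quality = {}
    for line in pdb_text.splitlines():
        # PDB columns (1-based): atom name 13-16, chain 22, resSeq 23-26, B-factor 61-66
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            key = (line[21], int(line[22:26]))
            quality[key] = float(line[60:66])
    return quality


def z_score(model_score: float, reference_scores) -> float:
    """Number of standard deviations between a score and the reference mean."""
    return (model_score - mean(reference_scores)) / pstdev(reference_scores)
```

Under this convention a negative Z-score means the model scores lower than the reference structures on average.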
4 Notes
1. While for most proteins classical homology modeling works
well, for some protein families the results are not satisfying. For
example, dedicated modeling methods for immunoglobulins
exploit protein family-specific sequence and structure properties
and are therefore more effective. In line with this, if an
immunoglobulin sequence is present in the input, SWISS-MODEL
provides a link to the Prediction of ImmunoGlobulin
Structure server PIGSPro [17]. The PIGSPro core strategy
consists of modeling the conserved region of the antibody by
homology while the prediction of five of the six hypervariable
loops is based on the canonical structure model [38–40]. The
hypervariable loop H3, for which no complete canonical structure
has been identified, is modeled based on a different template
selection method [41]. While for both the framework and
loops the user can modify the selected template structures
according to their own needs, very accurate models
(Cα-RMSD close to 1 Å) can be obtained using fully automated
modeling, as estimated by the results of independent
assessments [17, 42].
2. The most frequent reason for not finding any template during a
template search is that none of the sequences in the SMTL
library shares a significant sequence similarity with the target
sequence. With a few possible exceptions (see Note 3), this
means that no suitable template is available for comparative
modeling. In this case, de novo modeling methods may be
considered as an alternative. However, one should keep in
mind that the accuracy of such techniques is considerably
lower than that achievable by comparative modeling [43, 44].
3. Stringent quality filters are applied to protein structures from
PDB in order to be included in the SMTL. For example,
structures only providing Cα coordinates are considered low
quality and are excluded. Therefore, it can happen that a spe-
cific PDB entry is not listed in a template search despite sharing
significant sequence similarity with the target. The availability
of specific template structures can be checked by directly que-
rying the SMTL (Menu: Modeling—Template Library).
In case a model should be generated based on template
coordinates not present in SMTL, e.g., for newly solved struc-
tures not yet deposited to PDB, it is possible to create a model
based on such templates by either the “Template mode” or the
“Project mode” options (details in Subheading 3.2).
4. Increasingly, multiple alternative experimental template struc-
tures are available for target proteins of interest, often covering
different regions of the amino acid sequence, representing
different conformational states, or harboring different ligands.
Using this information from heterogeneous templates effi-
ciently is an area of active research in the field of comparative
modeling [45]. Structural information from multiple templates
can, in some cases, be complementary and the modeling pro-
cedure potentially benefits from the added information from
alternative templates [46]. The main improvement by multi-
template modeling is generally due to an increased coverage of
the target, i.e., increased size of the produced model. Such
functionality is work in progress in SWISS-MODEL and currently
not supported. Alternative solutions are available for this
purpose, both as stand-alone software and web servers [47, 48].
5. Biologically relevant ligands and cofactors are modeled based
on a homology transfer approach from the templates identified
in the SMTL. This approach is rather restrictive; that is, it
requires a high conservation of ligand-binding site residues
between target and template both in terms of sequence and
structure. This implies that a given ligand which is present in
the template will not necessarily be present in the final model.
Docking strategies can be considered as a valuable solution in
these cases. For this purpose, SWISS-MODEL provides a
direct link to send models to SwissDock [49, 50] from the
model results page.
6. Currently, the number of heteromeric complexes characterized
experimentally at atomic resolution is limited. Consequently,
there is a high chance that no suitable templates are identified
for a heteromeric protein complex. In this case, a possible
solution is to perform a template search for each subunit of
the complex and model them separately as monomers. Result-
ing models can then be used as input for external docking
software. The performance of current methods for protein
docking is assessed during the Critical Assessment of PRedic-
tion of Interactions (CAPRI) [51].
7. In case a specific oligomeric state is expected (e.g., a homo-
tetramer) but not found during the template search, the user
can build a monomeric model and employ external software to
enforce a specific stoichiometry [52] or symmetry [53] using
the monomer as input. If further experimental information is
available, e.g., interactions between subunits of the complex, a
hybrid modeling approach can be suitable [54–56].
8. There are several different possible sources of modeling errors,
which are typically detected by the provided local quality esti-
mation tools. Some minor issues such as local stereochemical
violations due to small structural distortions or incorrect mod-
eling of side chains can be resolved by energy minimization. In
other cases, further inspection of both the alignment and the
model is needed. For alignments of templates with low
sequence similarity, incorrect positioning of insertions and
deletions may give rise to model errors; manual correction of
the alignment may be a solution in some of these cases. For
modeling very long insertions, library-based approaches have
limitations and one has to resort to de novo modeling
approaches such as Rosetta [57] and I-TASSER [58]. However,
from a practical point of view, modeling long insertions is
typically less reliable, and it is important to critically evaluate
whether these efforts are necessary and sufficient for the
intended application of the model. Finally, deviations of the
model from the native structure may be due to structural
divergence between target and template during evolution.
The structural variability of the protein family at hand typically
provides a good estimate for the errors expected in the models
[59]. One of the main limitations of comparative modeling
today is the intrinsic dependence on information from template
structures; that is, it is not able to “predict” structural diver-
gence where the target deviates from the templates.
Approaches for refining model coordinates based on molecular
modeling simulations have made significant progress in recent
years in bringing models closer to the native structure, albeit
at a very high computational cost [60].
References
1. Guex N, Peitsch MC, Schwede T (2009) Automated comparative protein structure modeling with SWISS-MODEL and Swiss-PdbViewer: a historical perspective. Electrophoresis 30(Suppl 1):S162–S173
2. Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234:779–815
3. Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5:823–826
4. Arnold K, Bordoli L, Kopp J et al (2006) The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics 22:195–201
5. Biasini M, Bienert S, Waterhouse A et al (2014) SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res 42:W252–W258
6. Kiefer F, Arnold K, Kunzli M et al (2009) The SWISS-MODEL repository and associated resources. Nucleic Acids Res 37:D387–D392
7. Waterhouse A, Bertoni M, Bienert S et al (2018) SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res 46(W1):W296–W303
8. Kryshtafovych A, Venclovas C, Fidelis K et al (2005) Progress over the first decade of CASP experiments. Proteins 61(Suppl 7):225–236
9. Berman H, Henrick K, Nakamura H et al (2007) The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res 35:D301–D303
10. Altschul SF, Madden TL, Schaffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
11. Remmert M, Biegert A, Hauser A et al (2011) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9:173–175
12. Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195–202
13. Sillitoe I, Cuff AL, Dessailly BH et al (2013) New functional families (FunFams) in CATH to improve the mapping of conserved functional sites to 3D structures. Nucleic Acids Res 41:D490–D498
14. Aloy P, Ceulemans H, Stark A et al (2003) The relationship between sequence and interaction divergence in proteins. J Mol Biol 332:989–998
15. Bertoni M, Kiefer F, Biasini M et al (2017) Modeling protein quaternary structure of homo- and hetero-oligomers beyond binary interactions by homology. Sci Rep 7:10480
16. Marcatili P, Olimpieri PP, Chailyan A et al (2014) Antibody modeling using the prediction of immunoglobulin structure (PIGS) web server [corrected]. Nat Protoc 9:2771–2783
17. Lepore R, Olimpieri PP, Messih MA et al (2017) PIGSPro: prediction of immunoGlobulin structures v2. Nucleic Acids Res 45:W17
18. Biasini M, Schmidt T, Bienert S et al (2013) OpenStructure: an integrated software framework for computational structural biology. Acta Crystallogr D Biol Crystallogr 69:701–709
19. Fiser A (2010) Template-based protein structure modeling. Methods Mol Biol 673:73–94
20. Choi Y, Deane CM (2010) FREAD revisited: accurate loop structure prediction using a database search algorithm. Proteins 78:1431–1440
21. Liang S, Zhang C, Zhou Y (2014) LEAP: highly accurate prediction of protein loop conformations by integrating coarse-grained sampling and optimized energy scores with all-atom refinement of backbone and side chains. J Comput Chem 35:335–341
22. Messih MA, Lepore R, Tramontano A (2015) LoopIng: a template-based tool for predicting the structure of protein loops. Bioinformatics 31:3767–3772
23. Canutescu AA, Dunbrack RL Jr (2003) Cyclic coordinate descent: a robotics algorithm for protein loop closure. Protein Sci 12:963–972
24. Sippl MJ (1990) Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J Mol Biol 213:859–883
25. Shapovalov MV, Dunbrack RL Jr (2011) A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure 19:844–858
26. Krivov GG, Shapovalov MV, Dunbrack RL Jr (2009) Improved prediction of protein side-chain conformations with SCWRL4. Proteins 77:778–795
27. Xu J (2005) Rapid protein side-chain packing via tree decomposition. In: Miyano S, Mesirov J, Kasif S, Istrail S, Pevzner PA, Waterman M (eds) Research in computational molecular biology: 9th Annual International Conference, RECOMB 2005, Cambridge, MA, USA, May 14–18, 2005. Proceedings. Springer Berlin, Heidelberg, pp 423–439
28. Mackerell AD Jr, Feig M, Brooks CL 3rd (2004) Extending the treatment of backbone energetics in protein force fields: limitations of gas-phase quantum mechanics in reproducing protein conformational distributions in molecular dynamics simulations. J Comput Chem 25:1400–1415
29. Eastman P, Swails J, Chodera JD et al (2017) OpenMM 7: rapid development of high performance algorithms for molecular dynamics. PLoS Comput Biol 13:e1005659
30. Baker D, Sali A (2001) Protein structure prediction and structural genomics. Science 294:93–96
31. Schwede T, Sali A, Honig B et al (2009) Outcome of a workshop on applications of protein models in biomedical research. Structure 17:151–159
32. Read RJ, Adams PD, Arendall WB 3rd et al (2011) A new generation of crystallographic validation tools for the Protein Data Bank. Structure 19:1395–1412
33. Benkert P, Biasini M, Schwede T (2011) Toward the estimation of the absolute quality of individual protein structure models. Bioinformatics 27:343–350
34. Benkert P, Kunzli M, Schwede T (2009) QMEAN server for protein model quality estimation. Nucleic Acids Res 37:W510–W514
35. Haas J, Roth S, Arnold K et al (2013) The protein model portal--a comprehensive resource for protein structure and model information. Database 2013:bat031
36. Teh AH, Kanamasa S, Kajiwara S et al (2008) Structure of Cu/Zn superoxide dismutase from the heavy-metal-tolerant yeast Cryptococcus liquefaciens strain N6. Biochem Biophys Res Commun 374:475–478
37. Benkert P, Tosatto SC, Schomburg D (2008) QMEAN: a comprehensive scoring function for model quality assessment. Proteins 71:261–277
38. Chothia C, Lesk AM (1987) Canonical structures for the hypervariable regions of immunoglobulins. J Mol Biol 196:901–917
39. Morea V, Tramontano A, Rustici M et al (1998) Conformations of the third hypervariable region in the VH domain of immunoglobulins. J Mol Biol 275:269–294
40. Tramontano A, Chothia C, Lesk AM (1990) Framework residue 71 is a major determinant of the position and conformation of the second hypervariable region in the VH domains of immunoglobulins. J Mol Biol 215:175–182
41. Messih MA, Lepore R, Marcatili P et al (2014) Improving the accuracy of the structure prediction of the third hypervariable loop of the heavy chains of antibodies. Bioinformatics 30:2733–2740
42. Almagro JC, Teplyakov A, Luo J et al (2014) Second antibody modeling assessment (AMA-II). Proteins 82:1553–1562
43. Moult J (2005) A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 15:285–289
44. Tai CH, Bai H, Taylor TJ et al (2014) Assessment of template-free modeling in CASP10 and ROLL. Proteins 82(Suppl 2):57–83
45. Meier A, Soding J (2015) Automatic prediction of protein 3D structures by probabilistic multi-template homology modeling. PLoS Comput Biol 11:e1004343
46. Larsson P, Wallner B, Lindahl E et al (2008) Using multiple templates to improve quality of homology models in automated homology modeling. Protein Sci 17:990–1002
47. Cheng J (2008) A multi-template combination algorithm for protein comparative modeling. BMC Struct Biol 8:18
48. Webb B, Sali A (2014) Comparative protein structure modeling using MODELLER. Curr Protoc Bioinformatics 47:5.6.1–5.6.32
49. Grosdidier A, Zoete V, Michielin O (2011) Fast docking using the CHARMM force field with EADock DSS. J Comput Chem 32:2149–2159
50. Grosdidier A, Zoete V, Michielin O (2011) SwissDock, a protein-small molecule docking web service based on EADock DSS. Nucleic Acids Res 39:W270–W277
51. Lensink MF, Velankar S, Wodak SJ (2017) Modeling protein-protein and protein-peptide complexes: CAPRI 6th edition. Proteins 85:359–377
52. Esquivel-Rodriguez J, Filos-Gonzalez V, Li B et al (2014) Pairwise and multimeric protein-protein docking using the LZerD program suite. Methods Mol Biol 1137:209–234
53. Pierce B, Tong W, Weng Z (2005) M-ZDOCK: a grid-based approach for Cn symmetric multimer docking. Bioinformatics 21:1472–1478
54. De Vries SJ, Van Dijk M, Bonvin AM (2010) The HADDOCK web server for data-driven biomolecular docking. Nat Protoc 5:883–897
55. Leaver-Fay A, Tyka M, Lewis SM et al (2011) ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol 487:545–574
56. Russel D, Lasker K, Webb B et al (2012) Putting the pieces together: integrative modeling platform software for structure determination of macromolecular assemblies. PLoS Biol 10:e1001244
57. Simons KT, Kooperberg C, Huang E et al (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268:209–225
58. Yang J, Yan R, Roy A et al (2015) The I-TASSER suite: protein structure and function prediction. Nat Methods 12:7–8
59. Maghrabi AHA, Mcguffin LJ (2017) ModFOLD6: an accurate web server for the global and local quality estimation of 3D protein models. Nucleic Acids Res 45(W1):W416–W421
60. Heo L, Feig M (2018) What makes it difficult to refine protein models further via molecular dynamics simulations? Proteins 86(Suppl 1):177–188
Chapter 18
Interface-Based Structural Prediction of Novel Host-Pathogen Interactions
Emine Guven-Maiorov, Chung-Jung Tsai, Buyong Ma, and Ruth Nussinov
Abstract
About 20% of cancer cases worldwide are estimated to be associated with infections.
However, the molecular mechanisms by which infections contribute to host tumorigenesis are still
unknown. To evade host defense, pathogens hijack host proteins at different levels: sequence, structure,
motif, and binding surface, i.e., interface. Interface similarity allows pathogen proteins to compete with
host counterparts to bind to a target protein and rewire physiological signaling, which can result in persistent
infections as well as cancer. Identification of host-pathogen interactions (HPIs), along with their structural
details at atomic resolution, may provide mechanistic insight into pathogen-driven cancers and
guide new therapeutic interventions. HPI data that include structural details are scarce, and large-scale
experimental detection is challenging. Therefore, there is an urgent and mounting need for efficient and robust
computational approaches to predict HPIs and their complex (bound) structures. In this chapter, we review
the first and currently only interface-based computational approach to identify novel HPIs. The concept of
interface mimicry promises to identify more HPIs than complete sequence or structural similarity. We
illustrate this concept with a case study on Kaposi’s sarcoma herpesvirus (KSHV) to elucidate how it subverts
host immunity and contributes to malignant transformation of the host cells.
Key words Host-pathogen interaction prediction, Protein–protein interaction, Structural network, Superorganism network, Molecular mimicry, Interface mimicry
1 Introduction
1.1 Molecular Mimicry
Signaling pathways shape and convey the cell’s responses to stimuli
from its environment; however, pathogens can circumvent this
response by “repurposing” host signaling. Pathogens can interact
with the host through proteins, metabolites, small molecules, and
nucleic acids [1]. Direct protein-protein interactions are the most
common interaction type (see Note 1). By interfering with key
pathways pathogens can reshape physiological signaling, subverting
the immune system, altering the cytoskeletal organization [2, 3],
modifying membrane and vesicular trafficking [2, 4, 5], boosting
pathogen entry into the host cell, changing the cell cycle regulation
[6, 7], and modulating apoptosis [8]. All host-pathogen interactions
(HPIs) aim to ensure pathogen survival within the host.
Pathogens evolved several strategies to cross-talk with their
hosts. One powerful way is molecular mimicry, which has been
extensively reviewed in our recent study [9]. There are four differ-
ent levels of molecular (protein) mimicry: hijacking (1) both
sequence and structure of a protein or a domain, (2) only structure
without sequence homology, (3) sequence of a short motif—motif
mimicry, and (4) structure of a binding surface without sequence
similarity—interface mimicry. Global sequence and structural simi-
larity is much rarer than interface similarity both within and across
species. Thus, utilizing interface mimicry allows pathogens to tar-
get more host proteins. The concept of interface mimicry, proposed
over two decades ago, suggested that proteins with different global
structures can interact in similar ways, via similar binding surfaces
[10–12]. Interfaces are frequently “reused” by distinct proteins
[13], suggesting that these recurring architectures are favorable
scaffolds [12].
Interface mimicry is often observed within (intraspecies/
endogenous) [13–15] and across species (interspecies/host-patho-
gen/exogenous) [16, 17]. Similarity in endogenous and exoge-
nous protein-protein interfaces permits pathogenic proteins to
compete with their host counterparts [17], rewire host signaling,
and cause infections, as well as cancer. Identification of the HPIs
and the rewired host-pathogen superorganism protein interaction
network, together with structural details, should provide critical
insights into pathogenic virulence strategies underlying infections
and pathogen-driven cancers, and hence help develop innovative
therapeutics [18].
To date, the HPI networks show that different pathogens often
target the same host pathway, and certain host pathways are
attacked at several nodes to guarantee alteration of host cell signal-
ing [19]. Although there are several available host-pathogen
metaorganism interaction networks [19–29], there have been few
attempts to integrate these HPI networks with the human 3D
structural protein-protein interactions (PPIs) [17]. Traditional
node-and-edge representation of the PPI networks simplifies the
“big picture.” They depict which proteins interact, but not how.
Structural networks allow a higher resolution with mechanistic
insights, showing which residues are involved in the interaction
and thus which binary interactions can co-occur or are mutually
exclusive [16, 30] (see Note 2). The power of structural networks in
displaying the details of endogenous signaling pathways was
demonstrated earlier [30–33]. They are also vital to comprehend
the mechanisms exerted by pathogens to avert and subvert host cell
signaling and circumvent immune response [18]. Structures exhibit
which endogenous PPIs are ablated by the HPIs, whether the
virulence factors in different strains of the same pathogenic species
have distinct HPIs, and possible outcomes of mutations on either
the host or the pathogenic proteins.
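To make the distinction between co-occurring and mutually exclusive interactions concrete, the following minimal Python sketch labels two partners of the same protein as mutually exclusive when the interface residues they contact on that protein overlap. The partner names, residue numbers, and the `min_shared` threshold are hypothetical and chosen only for illustration; in a real structural network these sets would come from solved or modeled complexes.

```python
from itertools import combinations

# Hypothetical interface annotations for one hub protein: for each partner,
# the set of hub residue numbers it contacts (illustrative values only).
interfaces_on_hub = {
    "partner_A": {12, 15, 16, 19, 44},
    "partner_B": {15, 16, 20, 47},
    "partner_C": {101, 104, 108},
}


def classify_pairs(interfaces, min_shared=1):
    """Label each pair of partners as mutually exclusive or able to co-occur.

    Two partners are treated as mutually exclusive when the hub residues
    they contact overlap by at least `min_shared` residues (an assumed,
    simplistic criterion).
    """
    labels = {}
    for a, b in combinations(sorted(interfaces), 2):
        shared = interfaces[a] & interfaces[b]
        if len(shared) >= min_shared:
            labels[(a, b)] = "mutually exclusive"
        else:
            labels[(a, b)] = "can co-occur"
    return labels


pair_labels = classify_pairs(interfaces_on_hub)
for pair, label in pair_labels.items():
    print(pair, label)
```

A node-and-edge network would show all three partners as equivalent neighbors of the hub; only the residue-level view separates the overlapping pair from the compatible ones.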
The challenging large-scale experimental characterization of
HPIs [34, 35], coupled with the scarcity of experimentally con-
firmed HPI data, especially structural details, escalates the demand
for efficient and robust computational approaches to predict HPIs
along with their complex (bound) structures. In this chapter, we
first review available computational approaches to predict HPIs and
present the only interface-based computational approach available
to identify novel HPIs and their complex structures. Then, we
illustrate the usefulness of our approach with a case study on
Kaposi’s sarcoma herpesvirus (KSHV).
1.2 Review of Available Computational Tools to Identify HPIs
Several HPI databases have been developed for experimentally
identified HPIs, including PHISTO [36], HPIDB [37], Proteopathogen [38],
PATRIC [39], PHI-base [40], PHIDIAS [41],
HoPaCI-DB [42], VirHostNet [43], ViRBase [44], VirusMentha
[45], and HCVpro [46]. These databases comprise only a limited
number of pathogens. Given that at least hundreds of different
species can infect the host, thousands of HPIs are still unknown.
Enrichment of the host-pathogen interactome and construction of
comprehensive HPI networks will still mostly rely on computa-
tional models in the near future [47]. Numerous studies computa-
tionally identified large-scale HPIs and built HPI networks for
viruses and bacteria [20, 24, 48–56].
Although prediction of human PPIs is a well-established area,
modeling of interspecies interactions is comparatively new. Still, several
attempts have focused on computational approaches to identify
HPIs [34], most of which rely on sequence homology [49, 52,
54, 57–63]. Homology-based approaches are successful only if the
sequence similarity is high, but not all virulence factors have homologs
in humans. For instance, a secreted protein of H. pylori, VacA,
does not have sequence similarity with any other known viral,
bacterial, or eukaryotic proteins [64], but it alters signaling
through several host pathways [65]. Thus, sequence-based meth-
ods cannot detect VacA’s HPIs, highlighting the importance of
considering the 3D structures of proteins in predicting HPIs.
There are also sequence-based comparative methods that consider
structure [48, 55, 56, 61, 62, 66–70]; interologs (interacting
homologs/conserved interactions) [71, 72]; and transcriptome
data [73]. Available structure-based techniques often depend on
global structural similarity rather than interface mimicry
[55, 69]. One method combines interface data with sequence
homology and gene expression, but the predicted interacting host
and pathogenic proteins should satisfy a minimum of 80% sequence
identity over at least 50% of template host PPI complexes [66]. To
the best of our knowledge, none of the current approaches utilizes
solely interface structures to model HPIs, except our recently
developed interface-based method [74].
It has been suggested that the existing interface structures in
the PDB are diverse enough to cover the majority of the endogenous PPIs
[75–78]. Hence, the success of template-based approaches to model
endogenous PPIs is high [15] and is expected to increase further
with the growing number of resolved PPI 3D structures [79]
and advances in computational biology. Exogenous interactions, in contrast, are
underrepresented in the PDB: few exogenous interfaces are available.
Since exogenous interfaces hijack endogenous ones, the available
endogenous and exogenous interfaces together may represent most of the
structural host-pathogen interface space (see Note 2).
2 Methods
2.1 Modeling HPIs
Here, we review the first and to date only computational approach
that utilizes solely interface mimicry to predict putative HPIs and
their 3D structures as complexes [74]. Local structural resemblance
is sufficient; there is no need for sequence similarity. This approach
reveals not only targets of pathogenic proteins and how they inter-
act, but also the host endogenous PPIs which may be disrupted by
these potential HPIs. Figure 1 displays our workflow. In a typical docking
study, the interacting protein partners are already known and
the main purpose is to discern how they interact structurally.
Therefore, the inputs of docking algorithms are the structures of the
two monomeric target proteins to be docked to each other. However,
when dealing with HPIs, the main aim is to identify both the
interacting partners and how they interact. Normally, the
pathogenic proteins (one of the targets in a docking study) are
known but not their partners in the host (second target). Hence,
before performing docking, we need to identify those potential
host interactors.
To accomplish this, we generate all known human interfaces—
including endogenous and exogenous—in the PiFace interface
database, as described in [14]. Each interface has two chains (part-
ners/sides). There are 26,236 human interfaces in our template set.
Then, we structurally align these interfaces with the pathogenic
proteins by MultiProt [80]. The structural alignment thresholds
for the number of matching interface residues and the hot spots
follow the PRISM algorithm [81–84]. If the pathogenic protein is
aligned with one side of the human interface, it may interact with
the complementary side. Thus, the pathogenic protein can compete
with the first side of the interface—with which it is structurally
aligned—to bind to the second side, thereby abrogating the endog-
enous binary interaction in the template PPI (Fig. 1). Structural
complementarity does not necessarily guarantee chemical complementarity
and favorable interaction energy. For instance, 8 KSHV
proteins aligned with 15,350 interfaces, but only 96 of the resulting
candidate pairs are energetically favorable. So, after detection of the potential partners
in humans with structural complementarity, we check whether
these potential HPI pairs have favorable interaction energy. To do
that we perform docking with two programs: PRISM [81–84] and
Rosetta (local refinement) [85–87]. We take HPIs as energetically
favorable only if their Rosetta interface scores (I_sc) are below -5
and total energy scores are below zero. We also calculate Rosetta
I_sc for the endogenous template PPIs and compare them with
those of modeled HPIs to determine whether the pathogenic
protein may outcompete the endogenous partner by binding to a target
host protein with a higher affinity. For some template PPIs, Rosetta
gives extremely low, unrealistic I_sc values due to intermolecular disulfide
bonds. To correct this, we calculate Rosetta I_sc both with and
without the disulfide bonds. We consider the HPIs as
favorable interactions if they have I_sc below -5 with both Rosetta
scorings. Note that Rosetta I_sc has no units and does not reflect the
real binding free energy. It only indicates whether an interaction
is favorable or not.

Fig. 1 Workflow of our interface-based HPI modeling approach. In the first step, we extract human interfaces
from the PDB. Then, we obtain the structures of the pathogenic proteins from the PDB. Before docking, we
need to identify the potential HPI pairs since docking programs require two target proteins. To do that, we
structurally align the pathogenic proteins with the human interfaces in our template set. If the pathogenic
protein is aligned with the B-side of the interface, it can interact with the complementary A-side. After
determining potential HPI pairs, we perform docking of these pairs with PRISM [81–84] and Rosetta (local
refinement) [85–87] to select the energetically favorable ones. We further assess the likelihood that the HPI
models take place in the cell based on the percent match of the interface residues with the template interface
and probability of the template interface being a real biological interface. In the final optional step, we filter our
energetically favorable HPI results according to tissue expression of the human proteins by checking whether
the interactors of the pathogenic proteins are expressed in the same tissue where the pathogen resides
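The pairing and energy-filtering logic described above can be sketched in a few lines of Python. This is only an illustration of the decision rules, not of PRISM or Rosetta themselves: the class and function names are hypothetical, template identifiers are assumed to be a four-character PDB ID followed by two chain letters (as in Table 1), the score values in the usage lines are invented, and the cutoff is written as -5.0 on the assumption that favorable Rosetta interface scores are negative.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """A pathogen protein structurally aligned to one side of a template interface."""
    pathogen: str       # pathogen protein name
    template: str       # assumed format: 4-character PDB ID + two chain letters
    aligned_chain: str  # template chain that the pathogen protein resembles


def candidate_partner_chain(aln):
    """Return the complementary template chain, i.e. the putative host target."""
    side_1, side_2 = aln.template[4], aln.template[5]
    if aln.aligned_chain == side_1:
        return side_2
    if aln.aligned_chain == side_2:
        return side_1
    raise ValueError(f"chain {aln.aligned_chain!r} is not part of {aln.template}")


def is_favorable(isc_with_disulfides, isc_without_disulfides, total_energy,
                 isc_cutoff=-5.0):
    """Energy filter: both I_sc values below the cutoff and total energy below zero."""
    return (isc_with_disulfides < isc_cutoff
            and isc_without_disulfides < isc_cutoff
            and total_energy < 0.0)


# vIL6 resembles chain D of the 3duhBD template, so chain B is its candidate target.
aln = Alignment(pathogen="vIL6", template="3duhBD", aligned_chain="D")
print(candidate_partner_chain(aln))

# Invented scores: the second model passes only because of a disulfide artifact,
# so requiring both scorings to pass rejects it.
print(is_favorable(-6.8, -6.5, -120.0))
print(is_favorable(-35.0, -3.0, -120.0))
```

Requiring the cutoff to hold under both scorings is what keeps a disulfide-inflated score from passing the filter on its own.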
To further evaluate the likelihoods of our HPI models, we
calculate the “percent match” of the interfaces as the ratio
of the number of interface residues that are aligned with the pathogenic
protein to the number of interface residues in the endogenous
template PPI, expressed as a percentage. Each template interface is assigned a
weight based on its size, such that larger interfaces have higher weights.
If the template interface has fewer than 30 residues (n < 30), the weight is 0.5; if
30 < n < 50, the weight is 1; if 50 < n < 80, the weight is 1.5; and if n > 80
(very large interface), the weight is 2. Score1 given in Table 1 is the
product of the interface percent match and the corresponding
interface weight.
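As a worked illustration of Score1, the sketch below computes the percent match and the size-based weight and checks the result against rows of Table 1. The function names are hypothetical. The text leaves the boundary sizes (exactly 30, 50, or 80 residues) unspecified; Table 1 assigns a weight of 1.5 to a 50-residue interface, so the sketch treats each lower bound as inclusive, which is an assumption for 30 and 80.

```python
def interface_weight(n_template_residues):
    """Size-based weight of a template interface.

    Lower bounds are treated as inclusive (assumption; only n = 50 is
    confirmed by Table 1).
    """
    if n_template_residues < 30:
        return 0.5
    if n_template_residues < 50:
        return 1.0
    if n_template_residues < 80:
        return 1.5
    return 2.0


def percent_match(n_aligned, n_template_residues):
    """Percentage of template interface residues aligned with the pathogen protein."""
    return 100.0 * n_aligned / n_template_residues


def score1(n_aligned, n_template_residues):
    """Score1 = percent match x interface weight."""
    return percent_match(n_aligned, n_template_residues) * interface_weight(n_template_residues)


# (KSHV protein, human protein, residues aligned, residues in template interface)
examples = [
    ("K4", "CCL4", 29, 35),
    ("vCyclin", "CDK4", 39, 51),
    ("vCyclin", "CDK2", 48, 94),
    ("vIL6", "IL12B", 26, 50),
]
for virus, host, aligned, total in examples:
    print(virus, host, round(percent_match(aligned, total), 1), round(score1(aligned, total), 1))
```

Because of the weight, Score1 can exceed 100: a partial match to a large interface can outrank a near-complete match to a small one.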
We employ EPPIC (Evolutionary Protein-Protein Interface
Classifier) [88] to evaluate whether the template interfaces are real
biological interfaces or crystal artifacts. The EPPIC server gives the
probability of a particular interface to be biological. Score2 in
Table 1 is the product of Score1 and the probability of being a
biological interface. The higher the Score2, the more confidence
we have that a particular HPI model would take place in the cell, as
they are better mimics of real biological endogenous interfaces (see
Note 3).
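Score2 and the resulting ranking can be illustrated with a few rows from Table 1. In this sketch the Score1 values and probabilities are copied from the table, while the function name and the tuple layout are hypothetical; it does not query the EPPIC server.

```python
# (KSHV protein, human protein, Score1, probability that the template
# interface is biological), copied from Table 1.
models = [
    ("K4", "CCL4", 82.9, 0.90),
    ("vCyclin", "CDK4", 114.7, 0.98),
    ("vCyclin", "CDK2", 102.1, 1.00),
    ("vFLIP", "TNR6", 52.8, 0.11),
]


def score2(score1_value, p_biological):
    """Score2 = Score1 x probability of the template interface being biological."""
    return score1_value * p_biological


# Rank the models from most to least confident.
ranked = sorted(models, key=lambda m: score2(m[2], m[3]), reverse=True)
for virus, host, s1, p in ranked:
    print(f"{virus}-{host}: Score2 = {score2(s1, p):.1f}")
```

The vFLIP-TNR6 model shows the effect of the second factor: its Score1 is moderate, but the low probability that its template interface is biological pushes it to the bottom of the ranking.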
Finally, as an optional step, the results can be filtered according
to tissue expression, checking whether the host partners of the
pathogenic proteins are expressed in the same tissue where the
pathogen resides. We take the tissue expression data from the
Human Protein Atlas, which includes 19,709 human proteins,
mapping to 7106 human PDBs [89, 90]. If the pathogen is a
bacterial species, it resides in only certain tissues. For instance,
Helicobacter pylori is mainly in the stomach and gastrointestinal
tract, making it reasonable to focus on human proteins that are
expressed in these tissues. However, if the pathogen is a virus, it can
infect several different—if not all—tissues. Therefore, filtering
according to tissue expression is an optional step depending on
the pathogen type (see Note 4).
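The optional tissue filter amounts to a set intersection, as in the sketch below. All protein names, tissue annotations, and the function name are placeholders; real annotations would be taken from a resource such as the Human Protein Atlas, whose data format is not modeled here. Passing no tissue list leaves the models unfiltered, which corresponds to the viral case.

```python
def filter_by_tissue(hpi_models, expression, pathogen_tissues=None):
    """Keep HPI models whose host protein is expressed where the pathogen resides.

    hpi_models       -- list of (pathogen_protein, host_protein) pairs
    expression       -- dict mapping host protein -> set of tissues
    pathogen_tissues -- tissues colonized by the pathogen, or None to skip
                        filtering (e.g. for a virus infecting many tissues)
    """
    if pathogen_tissues is None:
        return list(hpi_models)
    wanted = set(pathogen_tissues)
    return [
        (pathogen_protein, host_protein)
        for pathogen_protein, host_protein in hpi_models
        if expression.get(host_protein, set()) & wanted
    ]


# Placeholder data for illustration only.
expression = {
    "HOST_A": {"stomach", "liver"},
    "HOST_B": {"brain"},
    "HOST_C": {"stomach"},
}
models = [("PATH_1", "HOST_A"), ("PATH_1", "HOST_B"),
          ("PATH_2", "HOST_C"), ("PATH_2", "HOST_D")]

bacterial = filter_by_tissue(models, expression, {"stomach"})
viral = filter_by_tissue(models, expression)
print(bacterial)
print(len(viral))
```

In this sketch a host protein with no expression record (HOST_D) is dropped when filtering is active; keeping such proteins instead would be an equally defensible choice.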
2.2 Constructing the Structural Superorganism Network
As we have the complex (bound) structures of the predicted HPIs,
it is possible to construct the structural interspecies interaction
network. Our template set serves as the human endogenous binary
interactions. 26,236 interfaces map to 3366 distinct human PPIs.
The predicted HPIs serve as exogenous interactions. So, all
pairwise interactions in the structural network will have structures
as complexes. The topological features of the resulting superorganism
network can be calculated by the NetworkAnalyzer [91] application
in Cytoscape [92]. Functional annotation of pathogenic
targets in the host can be performed by DAVID [93, 94].

Table 1
HPIs for KSHV proteins
KSHV protein  KSHV PDB  Human protein  Human PDB  Template  I_sc(HPI)  I_sc(PPI)  Aligned  Total  % Match  Weight  Sc1    P(bio)  Sc2
K4            2fhtA     CCL4           2x6lB      2x6lBD    -8.86      -9.53      29       35     82.9     1       82.9   0.9     74.6
K4            2fhtA     CXCR4          2k03D      2k03CD    -6.76      -11.26     25       55     45.5     1.5     68.2   0.48    32.7
K6            1zxtA     CCL5           1u4lB      1u4lAB    -8.90      -11.30     26       34     76.5     1       76.5   0.77    58.9
vCyclin       1g3nC     CDK4           3g33A      3g33AD    -6.07      -8.24      39       51     76.5     1.5     114.7  0.98    112.4
vCyclin       1g3nC     CDK2           1w98A      1w98AB    -5.41      -13.69     48       94     51.1     2       102.1  1       102.1
vIL6          1i1rB     IL12B          3duhB      3duhBD    -6.83      -13.47     26       50     52.0     1.5     78.0   0.9     70.2
vIL6          1i1rB     INAR1          3se4A      3se4AB    -6.08      -11.52     26       53     49.1     1.5     73.6   0.91    67.0
vIRF1         4hlxA     UBP21          3i3tG      3i3tGH    -5.88      -16.50     18       51     35.3     1.5     52.9   0.97    51.4
vFLIP         3cl3A     TNR6           3ezqI      3ezqIJ    -5.57      -11.04     19       54     35.2     1.5     52.8   0.11    5.8
vBCL2         1k3kA     ITA2B          2vdkA      2vdkAB    -5.19      -13.52     20       63     31.7     1.5     47.6   1       47.6
KSHV PDB and Human PDB, PDB chain of the KSHV and human protein; Template, template interface; I_sc(HPI), I_sc of the HPI; I_sc(PPI), I_sc of the template PPI; Aligned, # of residues aligned; Total, # of residues in template interface; P(bio), probability of template interface being a biological interface.
I_sc refers to Rosetta interface score, where we ignored disulfide bonds. If the I_sc of the modeled HPI is lower than the I_sc of the template PPI, it means that the pathogenic protein may
have higher affinity to the target protein than the endogenous partner of the target
To compare the pathogen of interest with other bacteria/
viruses, we can also build the structural interspecies network for
all known HPIs in PDB. There are 299 HPIs in PDB between
human and different bacterial, yeast, and viral species.
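The construction of the superorganism network described in this section can be sketched with plain Python data structures. This stands in for, and does not reproduce, the NetworkAnalyzer and Cytoscape analysis: the function names are hypothetical, only a handful of edges from Tables 1 and 2 are used, and node degree is the only topological feature computed.

```python
from collections import defaultdict

# A few endogenous template PPIs (Table 2) and predicted HPIs (Table 1).
endogenous = [("CCL4", "CCL4"), ("CXCR4", "SDF1"), ("CDK4", "CCND3"),
              ("CDK2", "CCNE1"), ("IL12B", "IL23A")]
exogenous = [("K4", "CCL4"), ("K4", "CXCR4"), ("vCyclin", "CDK4"),
             ("vCyclin", "CDK2"), ("vIL6", "IL12B")]


def build_network(endogenous_edges, exogenous_edges):
    """Merge host PPIs and HPIs into one undirected network with typed edges."""
    adjacency = defaultdict(set)
    edge_type = {}
    for kind, edges in (("endogenous", endogenous_edges), ("exogenous", exogenous_edges)):
        for a, b in edges:
            adjacency[a].add(b)
            adjacency[b].add(a)
            edge_type[frozenset((a, b))] = kind
    return adjacency, edge_type


def degrees(adjacency):
    """Number of distinct neighbors per node, ignoring self-interactions."""
    return {node: len(neighbors - {node}) for node, neighbors in adjacency.items()}


adjacency, edge_type = build_network(endogenous, exogenous)
degree = degrees(adjacency)
for node in sorted(degree, key=degree.get, reverse=True):
    print(node, degree[node])
```

Keeping the edge type alongside the adjacency makes it straightforward to draw the two edge classes differently or to remove the endogenous edges and view the HPI network alone.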
2.3 Case Study
Our interface-based HPI modeling method was successfully
applied to H. pylori before and can be applied to any commensal
or pathogenic microorganism. As a case study to illustrate the utility
of the concept, here we applied it to KSHV, infection of which is
associated with a blood/lymph vessel cancer—Kaposi’s sarcoma—
and lymphoma [95]. We modeled its HPIs and constructed its
structural superorganism network. We analyzed eight KSHV pro-
teins, vCyclin, vFLIP, vBCL2, vIL6, vIRF1, vIRF2, and viral che-
mokines (K4 and K6). We found 96 putative HPIs. All our HPI
models have 3D structures as complexes (see Note 5). Table 1
shows some examples from these 96 HPIs and Table 2 displays
the human PPIs that are potentially disrupted by these HPIs.
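The link between a predicted HPI and the endogenous PPI it may disrupt follows directly from the template chains, as the sketch below shows for a few rows of Tables 1 and 2. The function name is hypothetical, and identifiers are assumed to be a four-character PDB ID followed by chain letters, as they appear in the tables.

```python
# (KSHV protein, human target chain, template interface, human PPI in the template),
# copied from Tables 1 and 2.
rows = [
    ("K4", "2x6lB", "2x6lBD", "CCL4-CCL4"),
    ("K4", "2k03D", "2k03CD", "CXCR4-SDF1"),
    ("vCyclin", "3g33A", "3g33AD", "CDK4-CCND3"),
    ("vIL6", "3duhB", "3duhBD", "IL12B-IL23A"),
]


def displaced_chain(target_chain_id, template_id):
    """Return the template chain that the viral protein would replace.

    The target keeps its place in the complex; the other chain of the
    template interface is the endogenous partner being competed out.
    """
    pdb_id, chains = template_id[:4], template_id[4:]
    if target_chain_id[:4] != pdb_id or len(chains) != 2:
        raise ValueError("target and template must share a PDB ID with two chains")
    target = target_chain_id[4]
    if target not in chains:
        raise ValueError(f"chain {target!r} is not part of {template_id}")
    return chains.replace(target, "", 1)


report = [(virus, ppi, displaced_chain(target, template))
          for virus, target, template, ppi in rows]
for virus, ppi, chain in report:
    print(f"{virus} may disrupt {ppi} by replacing chain {chain}")
```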
Our HPI candidates may elucidate the roles of KSHV in modulation
of host signaling and contribution to malignant transformation.
For instance, we found that KSHV chemokines and
cytokines, like K4, K6, and vIL6, target many human chemokine
and cytokine receptors (Fig. 2). Signaling through the cytokine and
chemokine receptors is critical for T-cell recruitment to the infected
host tissue to eradicate the pathogens and for regulation of their
activation and differentiation [96]. Blockage of these pathways by
the KSHV proteins may underlie the molecular mechanisms of
evading the immune system and persistence of infection. We also
found that vCyclin interferes with several CDKs (Fig. 2), thereby
disrupting normal host cell cycle regulation, which may contribute
to aberrant proliferation in malignant transformation.

Table 2
Potentially disrupted endogenous host PPIs due to predicted KSHV HPIs
KSHV protein  Human PPI disrupted by KSHV protein  PDB for the human PPI disrupted
K4            CCL4-CCL4                            2x6lBD
K4            CXCR4-SDF1                           2k03CD
K6            CCL5-CCL5                            1u4lAB
vCyclin       CDK4-CCND3                           3g33AD
vCyclin       CDK2-CCNE1                           1w98AB
vIL6          IL12B-IL23A                          3duhBD
vIL6          INAR1-IFNW1                          3se4AB
vIRF1         UBP21-RL40                           3i3tGH
vFLIP         TNR6-FADD                            3ezqIJ
vBCL2         ITA2B-ITB3                           2vdkAB

Fig. 2 KSHV proteins mimic the human protein-protein interfaces, blocking human PPIs. (a) Endogenous
human PPI between IL12B and IL23A. (b) Our HPI model between vIL6 and IL12B. (c) Superimposed view of
PPI and HPI shows that vIL6 almost perfectly mimics the interface on IL23A to bind to IL12B. (d) through (l)
also show the superimposed structures of endogenous human PPIs and modeled HPIs. Human proteins are
shown in cyan and pink; and KSHV proteins are shown in gray. Gray proteins bind to pink proteins by hijacking
the interface on cyan proteins (only the interface similarity is enough, no need for global structural similarity).
Thus, they may block the pink-cyan protein interactions
In addition to mimicked endogenous interfaces, hijacked exog-
enous interfaces can also be identified through our approach. A
given pathogenic protein can mimic both human and pathogenic
proteins from other species. For instance, we found that KSHV
vCyclin mimics other viral vCyclin proteins to target human CDKs
(Fig. 3).
We also constructed the structural superorganism network
between human and KSHV (Fig. 4). The endogenous human
PPIs are template PPIs and the exogenous virus-human interactions
are HPI models. There are 3366 human PPIs and 96 HPIs in
this network. Our results indicate that KSHV proteins can potentially
target the highly connected part of the network and hub
proteins, like CDK2 in the human PPI network. Hub proteins are
critical to many cellular functions, establishing pathway cross talk.
Targeting them is an ingenious pathogen strategy, since by attacking only a single
protein pathogens can interfere with several pathways. The
KSHV-targeted human proteins are enriched in
17 KEGG pathways (Table 3). Among the highly enriched pathways
are cytokine and chemokine signaling, and viral carcinogenesis.

Fig. 3 KSHV proteins mimic not only host interactions, but also HPIs from other species. Panels (a), (b), and (c)
show the superimposed structures of our HPI models for KSHV with the known exogenous interactions
with proteins from other species. Pink proteins are from human, green proteins are from other pathogens,
and gray proteins are KSHV proteins. Gray proteins bind to pink proteins by hijacking the interfaces on green
proteins

Fig. 4 Structural superorganism network for KSHV and human, where all binary interactions have structures as
complexes. Endogenous human interactions (black edges) are obtained from crystal structures in PDB
(template interface set), where human proteins are shown as gray circular nodes. Exogenous interactions
(red edges) are our HPI models for 8 KSHV proteins, where viral proteins are shown as blue diamond nodes. (a)
KSHV proteins target the highly connected part of the human PPI network. (b) Structural HPI network without
the endogenous template interactions. Most targets of individual KSHV proteins are distinct, but some are
shared across different KSHV proteins
Table 3
Functional enrichment of KSHV-targeted human proteins by DAVID [93, 94]
KEGG pathways enriched                                    No. of genes  %     P value   KSHV-targeted human proteins
Cytokine-cytokine receptor interaction                    10            13.9  7.20E-05  CCL3, CCL2, CCL13, TNR6, CCL4, ACVR1, CCL5, CXCR4, INAR1, IL12B
Chemokine signaling pathway                               9             12.5  9.80E-05  RHOA, CCL3, CCL2, CCL13, CCL4, CCL5, JAK2, CCL14, CXCR4
Herpes simplex infection                                  8             11.1  5.60E-04  CCL2, C1QBP, TNR6, CDK2, CCL5, JAK2, INAR1, IL12B
Measles                                                   7             9.7   6.10E-04  TNR6, CCND3, CDK4, CDK2, JAK2, INAR1, IL12B
p53 signaling pathway                                     5             6.9   1.90E-03  TNR6, CCND3, CASP9, CDK4, CDK2
Influenza A                                               7             9.7   2.40E-03  CCL2, TNR6, CASP9, CCL5, JAK2, INAR1, IL12B
Pathways in cancer                                        10            13.9  3.50E-03  RHOA, ITA2B, FGFR2, TNR6, CASP9, CDK4, CDK2, CXCR4, ARHGB, BMP2
Hepatitis B                                               6             8.3   5.70E-03  CCNA2, TNR6, CASP9, CDK4, CDK2, INAR1
Chagas disease (American trypanosomiasis)                 5             6.9   9.20E-03  CCL3, CCL2, TNR6, CCL5, IL12B
Toll-like receptor signaling pathway                      5             6.9   9.80E-03  CCL3, CCL4, CCL5, INAR1, IL12B
PI3K-Akt signaling pathway                                8             11.1  1.90E-02  ITA2B, FGFR2, CCND3, CASP9, CDK4, CDK2, JAK2, INAR1
African trypanosomiasis                                   3             4.2   2.80E-02  TNR6, HBA, IL12B
Small-cell lung cancer                                    4             5.6   3.00E-02  ITA2B, CASP9, CDK4, CDK2
Glutathione metabolism                                    3             4.2   6.20E-02  GSTA4, GSTP1, GSTM2
Cell cycle                                                4             5.6   7.60E-02  CCNA2, CCND3, CDK4, CDK2
Viral carcinogenesis                                      5             6.9   8.00E-02  CCNA2, RHOA, CCND3, CDK4, CDK2
Signaling pathways regulating pluripotency of stem cells  4             5.6   1.00E-01  FGFR2, ACVR1, JAK2, BMP2

3 Concluding Remarks
Insight into mechanisms of infectious diseases and pathogen-driven
cancers at the molecular level is limited. Identification of novel
HPIs and their atomistic details may illuminate how virulence
factors modulate host signaling, and stimulate innovative therapeutics.
Large-scale detection of HPIs will rely on computational techniques
in the near future due to current limitations of experimental
methodologies. Most computational approaches rely on sequence
homology, which prevents their application to
pathogenic proteins that have no sequence homologs in humans.
Interface architectures are conserved within and across species
regardless of the entire sequence and the structure of the proteins.
Here we reviewed the first and only available interface-based
method to uncover novel HPIs and their complex 3D structures.
This approach predicts not only the HPIs, but also the potentially
disrupted endogenous human PPIs. It can be applied to any microbial
organism, including commensals and pathogens.
4 Notes
1. Our approach is based on the reasonable assumption that
pathogenic proteins may alter host signaling. However, interactions
through metabolites and small molecules also have
roles in modulation of the host responses. Moreover, the interaction
of a particular pathogen with other microbial species in the
microbiota, and different combinations of bacterial species, also
affect the overall response.
2. Some of the limitations are as follows: coverage of endogenous
human PPIs is low; available endogenous protein structures are
biased toward permanent, not transient, interactions; disor-
dered proteins are underrepresented in the PDB; and most
pathogenic proteins lack crystal structures.
3. Both experimental and computational methods have false posi-
tives with varying rates. Although HPIs predicted here may
have false positives, we cannot calculate the exact false-positive
rate due to limited experimental HPI data. We tried to mini-
mize the error rates by calculating the percent match of the
HPI models with the corresponding template PPI and incor-
porating the probability of template interfaces being real
biological interfaces. Predicted models should be tested by
experiments. Computational screening of big data can provide
possible leads to experiments guiding functional characteriza-
tion while avoiding testing millions of possible binary combi-
nations of host and pathogenic proteins.
4. In addition to filtering by tissue expression, HPI models can
also be filtered by subcellular localization of the host proteins.
For instance, if the pathogenic protein is found in the cytoplasm
of the host cell, then it cannot interact with the host
nuclear proteins. Since large-scale subcellular localization
data are not available for all proteins, applying this filter is left to the
researcher's discretion.
5. Proteins often assemble into multi-protein complexes. Model-
ing only pairwise interactions between host and pathogenic
proteins may not be sufficient.
Acknowledgments
This project has been funded in whole or in part with federal funds
from the National Cancer Institute, National Institutes of Health,
under contract number HHSN261200800001E. The content of
this publication does not necessarily reflect the views or policies of
the Department of Health and Human Services, nor does mention
of trade names, commercial products, or organizations imply
endorsement by the US Government. This research was supported
(in part) by the Intramural Research Program of the NIH, National
Cancer Institute, Center for Cancer Research. This study utilized
the high-performance computational capabilities of the Biowulf
PC/Linux cluster at the National Institutes of Health (NIH),
Bethesda, MD (http://biowulf.nih.gov).
References
1. Durmus S, Cakir T, Ozgur A, Guthke R (2015) A review on computational systems biology of pathogen-host interactions. Front Microbiol 6:235. https://doi.org/10.3389/fmicb.2015.00235
2. Stebbins CE, Galan JE (2001) Structural mimicry in bacterial virulence. Nature 412(6848):701–705. https://doi.org/10.1038/35089000
3. Sal-Man N, Biemans-Oldehinkel E, Finlay BB (2009) Structural microengineers: pathogenic Escherichia coli redesigns the actin cytoskeleton in host cells. Structure 17(1):15–19. https://doi.org/10.1016/j.str.2008.12.001
4. Kahn RA, Fu H, Roy CR (2002) Cellular hijacking: a common strategy for microbial infection. Trends Biochem Sci 27(6):308–314. https://doi.org/10.1016/S0968-0004(02)02108-4
5. Finlay BB, McFadden G (2006) Anti-immunology: evasion of the host immune system by bacterial and viral pathogens. Cell 124(4):767–782. https://doi.org/10.1016/j.cell.2006.01.034
6. Moody CA, Laimins LA (2010) Human papillomavirus oncoproteins: pathways to transformation. Nat Rev Cancer 10(8):550–560. https://doi.org/10.1038/nrc2886
7. Filippova M, Song H, Connolly JL, Dermody TS, Duerksen-Hughes PJ (2002) The human papillomavirus 16 E6 protein binds to tumor necrosis factor (TNF) R1 and protects cells from TNF-induced apoptosis. J Biol Chem 277(24):21730–21739. https://doi.org/10.1074/jbc.M200113200
8. Shirin H, Sordillo EM, Kolevska TK, Hibshoosh H, Kawabata Y, Oh SH, Kuebler JF, Delohery T, Weghorst CM, Weinstein IB, Moss SF (2000) Chronic Helicobacter pylori infection induces an apoptosis-resistant phenotype associated with decreased expression of p27(kip1). Infect Immun 68(9):5321–5328
9. Guven-Maiorov E, Tsai CJ, Nussinov R (2016) Pathogen mimicry of host protein-protein interfaces modulates immunity. Semin Cell Dev Biol 58:136–145. https://doi.org/10.1016/j.semcdb.2016.06.004
10. Tsai CJ, Lin SL, Wolfson HJ, Nussinov R (1996) A dataset of protein-protein interfaces generated with a sequence-order-independent comparison technique. J Mol Biol 260(4):604–620. https://doi.org/10.1006/jmbi.1996.0424
11. Tsai CJ, Lin SL, Wolfson HJ, Nussinov R (1996) Protein-protein interfaces: architectures and interactions in protein-protein interfaces and in protein cores. Their similarities and differences. Crit Rev Biochem Mol Biol 31(2):127–152. https://doi.org/10.3109/10409239609106582
12. Keskin O, Nussinov R (2005) Favorable scaffolds: proteins with different sequence, structure and function may associate in similar ways. Protein Eng Des Sel 18(1):11–24. https://doi.org/10.1093/protein/gzh095
13. Keskin O, Nussinov R (2007) Similar binding sites and different partners: implications to shared proteins in cellular pathways. Structure 15(3):341–354. https://doi.org/10.1016/j.str.2007.01.007
14. Cukuroglu E, Gursoy A, Nussinov R, Keskin O (2014) Non-redundant unique interface structures as templates for modeling protein interactions. PLoS One 9(1):e86738. https://doi.org/10.1371/journal.pone.0086738
15. Muratcioglu S, Guven-Maiorov E, Keskin O, Gursoy A (2015) Advances in template-based protein docking by utilizing interfaces towards completing structural interactome. Curr Opin Struct Biol 35:87–92. https://doi.org/10.1016/j.sbi.2015.10.001
16. Franzosa EA, Garamszegi S, Xia Y (2012) Toward a three-dimensional view of protein networks between species. Front Microbiol 3:428. https://doi.org/10.3389/fmicb.2012.00428
17. Franzosa EA, Xia Y (2011) Structural principles within the human-virus protein-protein interaction network. Proc Natl Acad Sci U S A 108(26):10538–10543. https://doi.org/10.1073/pnas.1101440108
18. Guven-Maiorov E, Tsai CJ, Nussinov R (2017) Structural host-microbiota interaction networks. PLoS Comput Biol 13(10):e1005579. https://doi.org/10.1371/journal.pcbi.1005579
19. Bhavsar AP, Guttman JA, Finlay BB (2007) Manipulation of host-cell pathways by bacterial pathogens. Nature 449(7164):827–834. https://doi.org/10.1038/nature06247
24. Shapira SD, Gat-Viks I, Shum BO, Dricot A, de Grace MM, Wu L, Gupta PB, Hao T, Silver SJ, Root DE, Hill DE, Regev A, Hacohen N (2009) A physical and regulatory map of host-influenza interactions reveals pathways in H1N1 infection. Cell 139(7):1255–1267. https://doi.org/10.1016/j.cell.2009.12.018
25. Zhang L, Villa NY, Rahman MM, Smallwood S, Shattuck D, Neff C, Dufford M, Lanchbury JS, Labaer J, McFadden G (2009) Analysis of vaccinia virus-host protein-protein interactions: validations of yeast two-hybrid screenings. J Proteome Res 8(9):4311–4318. https://doi.org/10.1021/pr900491n
26. Khadka S, Vangeloff AD, Zhang C, Siddavatam P, Heaton NS, Wang L, Sengupta R, Sahasrabudhe S, Randall G, Gribskov M, Kuhn RJ, Perera R, LaCount DJ (2011) A physical interaction network of dengue virus and human proteins. Mol Cell Proteomics 10(12):M111.012187. https://doi.org/10.1074/mcp.M111.012187
27. Jager S, Cimermancic P, Gulbahce N, Johnson JR, McGovern KE, Clarke SC, Shales M, Mercenne G, Pache L, Li K, Hernandez H, Jang GM, Roth SL, Akiva E, Marlett J,
Stephens M, D’Orso I, Fernandes J, Fahey M,
20. Uetz P, Dong YA, Zeretzke C, Atzler C, Mahon C, O’Donoghue AJ, Todorovic A,
Baiker A, Berger B, Rajagopala SV, Morris JH, Maltby DA, Alber T, Cagney G,
Roupelieva M, Rose D, Fossum E, Haas J Bushman FD, Young JA, Chanda SK, Sund-
(2006) Herpesviral protein networks and their quist WI, Kortemme T, Hernandez RD, Craik
interaction with the human proteome. Science CS, Burlingame A, Sali A, Frankel AD, Krogan
311(5758):239–242. https://doi.org/10. NJ (2011) Global landscape of HIV-human
1126/science.1116804 protein complexes. Nature 481
21. von Schwedler UK, Stuchell M, Muller B, (7381):365–370. https://doi.org/10.1038/
Ward DM, Chung HY, Morita E, Wang HE, nature10719
Davis T, He GP, Cimbora DM, Scott A, Kraus- 28. Pichlmair A, Kandasamy K, Alvisi G,
slich HG, Kaplan J, Morham SG, Sundquist WI Mulhern O, Sacco R, Habjan M, Binder M,
(2003) The protein network of HIV budding. Stefanovic A, Eberle CA, Goncalves A,
Cell 114(6):701–713 Burckstummer T, Muller AC, Fauster A,
22. Calderwood MA, Venkatesan K, Xing L, Chase Holze C, Lindsten K, Goodbourn S,
MR, Vazquez A, Holthaus AM, Ewence AE, Kochs G, Weber F, Bartenschlager R, Bowie
Li N, Hirozane-Kishikawa T, Hill DE, Vidal M, AG, Bennett KL, Colinge J, Superti-Furga G
Kieff E, Johannsen E (2007) Epstein-Barr virus (2012) Viral immune modulators perturb the
and virus human protein interaction maps. human molecular network by common and
Proc Natl Acad Sci U S A 104 unique strategies. Nature 487
pnas.0702332104 nature11289
23. de Chassey B, Navratil V, Tafforeau L, Hiet 29. Rozenblatt-Rosen O, Deo RC, Padi M,
MS, Aublin-Gex A, Agaugue S, Meiffren G, Adelmant G, Calderwood MA, Rolland T,
Pradezynski F, Faria BF, Chantier T, Le Grace M, Dricot A, Askenazi M, Tavares M,
Breton M, Pellet J, Davoust N, Mangeot PE, Pevzner SJ, Abderazzaq F, Byrdsong D, Car-
Chaboud A, Penin F, Jacob Y, Vidalain PO, vunis AR, Chen AA, Cheng J, Correll M,
Vidal M, Andre P, Rabourdin-Combe C, Lot- Duarte M, Fan C, Feltkamp MC, Ficarro SB,
teau V (2008) Hepatitis C virus infection pro- Franchi R, Garg BK, Gulbahce N, Hao T,
tein network. Mol Syst Biol 4:230. https://doi. Holthaus AM, James R, Korkhin A,
org/10.1038/msb.2008.66 Litovchick L, Mar JC, Pak TR, Rabello S,
Rubio R, Shen Y, Singh S, Spangle JM, Proteopathogen, a protein database for study-
Tasan M, Wanamaker S, Webber JT, ing Candida albicans--host interaction. Prote-
Roecklein-Canfield J, Johannsen E, Barabasi omics 9(20):4664–4668. https://doi.org/10.
AL, Beroukhim R, Kieff E, Cusick ME, Hill 1002/pmic.200900023
DE, Munger K, Marto JA, Quackenbush J, 39. Wattam AR, Abraham D, Dalay O, Disz TL,
Roth FP, DeCaprio JA, Vidal M (2012) Inter- Driscoll T, Gabbard JL, Gillespie JJ, Gough R,
preting cancer genomes using systematic host Hix D, Kenyon R, Machi D, Mao C, Nordberg
network perturbations by tumour virus pro- EK, Olson R, Overbeek R, Pusch GD,
teins. Nature 487(7408):491–495. https:// Shukla M, Schulman J, Stevens RL, Sullivan
doi.org/10.1038/nature11288 DE, Vonstein V, Warren A, Will R, Wilson
30. Guven Maiorov E, Keskin O, Gursoy A, Nussi- MJ, Yoo HS, Zhang C, Zhang Y, Sobral BW
nov R (2013) The structural network of (2014) PATRIC, the bacterial bioinformatics
inflammation and cancer: merits and chal- database and analysis resource. Nucleic Acids
lenges. Semin Cancer Biol 23(4):243–251. Res 42(Database issue):D581–D591. https://
https://doi.org/10.1016/j.semcancer.2013. doi.org/10.1093/nar/gkt1099
05.003 40. Urban M, Pant R, Raghunath A, Irvine AG,
31. Guven-Maiorov E, Keskin O, Gursoy A, Pedro H, Hammond-Kosack KE (2015) The
VanWaes C, Chen Z, Tsai CJ, Nussinov R Pathogen-Host Interactions database
(2015) The architecture of the TIR domain (PHI-base): additions and future develop-
signalosome in the toll-like Receptor-4 signal- ments. Nucleic Acids Res 43(Database issue):
ing pathway. Sci Rep 5:13128. https://doi. D645–D655. https://doi.org/10.1093/nar/
org/10.1038/srep13128 gku1165
32. Guven-Maiorov E, Keskin O, Gursoy A, Nus- 41. Xiang Z, Tian Y, He Y (2007) PHIDIAS: a
sinov R (2015) A structural view of negative pathogen-host interaction data integration
regulation of the toll-like receptor-mediated and analysis system. Genome Biol 8(7):R150.
inflammatory pathway. Biophys J 109 https://doi.org/10.1186/gb-2007-8-7-r150
(6):1214–1226. https://doi.org/10.1016/j. 42. Bleves S, Dunger I, Walter MC,
bpj.2015.06.048 Frangoulidis D, Kastenmuller G, Voulhoux R,
33. Acuner-Ozbabacan ES, Engin BH, Guven- Ruepp A (2014) HoPaCI-DB: host-Pseudo-
Maiorov E, Kuzu G, Muratcioglu S, monas and Coxiella interaction database.
Baspinar A, Chen Z, Van Waes C, Gursoy A, Nucleic Acids Res 42(Database issue):
Keskin O, Nussinov R (2014) The structural D671–D676. https://doi.org/10.1093/nar/
network of Interleukin-10 and its implications gkt925
in inflammation and cancer. BMC Genomics 43. Guirimand T, Delmotte S, Navratil V (2015)
15(Suppl 4):S2. https://doi.org/10.1186/ VirHostNet 2.0: surfing on the web of virus/
1471-2164-15-S4-S2 host molecular interactions data. Nucleic Acids
34. Nourani E, Khunjush F, Durmus S (2015) Res 43(Database issue):D583–D587. https://
Computational approaches for prediction of doi.org/10.1093/nar/gku1121
pathogen-host protein-protein interactions. 44. Li Y, Wang C, Miao Z, Bi X, Wu D, Jin N,
Front Microbiol 6:94. https://doi.org/10. Wang L, Wu H, Qian K, Li C, Zhang T,
3389/fmicb.2015.00094 Zhang C, Yi Y, Lai H, Hu Y, Cheng L, Leung
35. Brito AF, Pinney JW (2017) Protein-protein KS, Li X, Zhang F, Li K, Li X, Wang D (2015)
interactions in virus-host systems. Front ViRBase: a resource for virus-host ncRNA-
Microbiol 8:1557. https://doi.org/10.3389/ associated interactions. Nucleic Acids Res 43
fmicb.2017.01557 (Database issue):D578–D582. https://doi.
36. Durmus Tekir S, Cakir T, Ardic E, Sayilirbas org/10.1093/nar/gku903
AS, Konuk G, Konuk M, Sariyer H, Ugurlu A, 45. Calderone A, Licata L, Cesareni G (2015) Vir-
Karadeniz I, Ozgur A, Sevilgen FE, Ulgen KO usMentha: a new resource for virus-host pro-
(2013) PHISTO: pathogen-host interaction tein interactions. Nucleic Acids Res 43
search tool. Bioinformatics 29 (Database issue):D588–D592. https://doi.
(10):1357–1358. https://doi.org/10.1093/ org/10.1093/nar/gku830
bioinformatics/btt137 46. Kwofie SK, Schaefer U, Sundararajan VS, Bajic
37. Kumar R, Nanduri B (2010) HPIDB--a unified VB, Christoffels A (2011) HCVpro: hepatitis C
resource for host-pathogen interactions. BMC virus protein interaction database. Infect Genet
Bioinformatics 11(Suppl 6):S16. https://doi. Evol 11(8):1971–1977. https://doi.org/10.
org/10.1186/1471-2105-11-S6-S16 1016/j.meegid.2011.09.001
38. Vialas V, Nogales-Cadenas R, Nombela C, 47. Arnold R, Boonen K, Sun MG, Kim PM
Pascual-Montano A, Gil C (2009) (2012) Computational analysis of
interactomes: current and future perspectives proteins in microbial pathogens. Bioinformat-

for bioinformatics approaches to model the ics 31(4):590–592. https://doi.org/10.
host-pathogen interaction space. Methods 57 1093/bioinformatics/btu681
(4):508–518. https://doi.org/10.1016/j. 58. Krishnadev O, Srinivasan N (2011) Prediction
ymeth.2012.06.011 of protein-protein interactions between human
48. Doolittle JM, Gomez SM (2011) Mapping host and a pathogen and its application to three
protein interactions between dengue virus and pathogenic bacteria. Int J Biol Macromol 48
its human and insect hosts. PLoS Negl Trop (4):613–619. https://doi.org/10.1016/j.
Dis 5(2):e954. https://doi.org/10.1371/jour ijbiomac.2011.01.030
nal.pntd.0000954 59. Dyer MD, Murali TM, Sobral BW (2007)
49. Tyagi N, Krishnadev O, Srinivasan N (2009) Computational prediction of host-pathogen
Prediction of protein-protein interactions protein-protein interactions. Bioinformatics
between Helicobacter pylori and a human 23(13):i159–i166. https://doi.org/10.1093/
host. Mol BioSyst 5(12):1630–1635. https:// bioinformatics/btm208
doi.org/10.1039/b906543c 60. Doxey AC, McConkey BJ (2013) Prediction of
50. Xu Q, Xiang EW, Yang Q (2011) Transferring molecular mimicry candidates in human path-
network topological knowledge for predicting ogenic bacteria. Virulence 4(6):453–466.
protein-protein interactions. Proteomics 11 https://doi.org/10.4161/viru.25180
(19):3818–3825. https://doi.org/10.1002/ 61. Mahajan G, Mande SC (2017) Using structural
pmic.201100146 knowledge in the protein data bank to inform
51. Remmele CW, Luther CH, Balkenhol J, the search for potential host-microbe protein
Dandekar T, Muller T, Dittrich MT (2015) interactions in sequence space: application to
Integrated inference and evaluation of host- Mycobacterium tuberculosis. BMC Bioinfor-
fungi interaction networks. Front Microbiol matics 18(1):201. https://doi.org/10.1186/
6:764. https://doi.org/10.3389/fmicb. s12859-017-1550-y
2015.00764 62. Mariano R, Wuchty S (2017) Structure-based
52. Evans P, Dampier W, Ungar L, Tozeren A prediction of host-pathogen protein interac-
(2009) Prediction of HIV-1 virus-host protein tions. Curr Opin Struct Biol 44:119–124.
interactions using virus and host sequence https://doi.org/10.1016/j.sbi.2017.02.007
motifs. BMC Med Genet 2:27. https://doi. 63. Becerra A, Bucheli VA, Moreno PA (2017)
org/10.1186/1755-8794-2-27 Prediction of virus-host protein-protein inter-
53. Zhang M, Su S, Bhatnagar RK, Hassett DJ, Lu actions mediated by short linear motifs. BMC
LJ (2012) Prediction and analysis of the pro- Bioinformatics 18(1):163. https://doi.org/
tein interactome in Pseudomonas aeruginosa 10.1186/s12859-017-1570-7
to enable network-based drug target selection. 64. Jones KR, Whitmire JM, Merrell DS (2010) A
PLoS One 7(7):e41202. https://doi.org/10. tale of two toxins: helicobacter pylori CagA and
1371/journal.pone.0041202 VacA modulate host pathways that impact dis-
54. Huo T, Liu W, Guo Y, Yang C, Lin J, Rao Z ease. Front Microbiol 1:115. https://doi.org/
(2015) Prediction of host – pathogen protein 10.3389/fmicb.2010.00115
interactions between Mycobacterium tubercu- 65. Manente L, Perna A, Buommino E, Altucci L,
losis and Homo sapiens using sequence motifs. Lucariello A, Citro G, Baldi A, Iaquinto G,
BMC Bioinformatics 16:100. https://doi.org/ Tufano MA, De Luca A (2008) The Helico-
10.1186/s12859-015-0535-y bacter pylori’s protein VacA has direct effects
55. Doolittle JM, Gomez SM (2010) Structural on the regulation of cell cycle and apoptosis in
similarity-based predictions of protein interac- gastric epithelial cells. J Cell Physiol 214
tions between HIV-1 and Homo sapiens. Virol (3):582–587. https://doi.org/10.1002/jcp.
J 7:82. https://doi.org/10.1186/1743- 21242
422X-7-82 66. Davis FP, Barkan DT, Eswar N, McKerrow JH,
56. de Chassey B, Meyniel-Schicklin L, Aublin- Sali A (2007) Host pathogen protein interac-
Gex A, Navratil V, Chantier T, Andre P, Lot- tions predicted by comparative modeling. Pro-
teau V (2013) Structure homology and inter- tein Sci 16(12):2585–2596. https://doi.org/
action redundancy for discovering virus-host 10.1110/ps.073228407
protein interactions. EMBO Rep 14 67. Drayman N, Glick Y, Ben-nun-shaul O, Zer H,
(10):938–944. https://doi.org/10.1038/ Zlotnick A, Gerber D, Schueler-Furman O,
embor.2013.130 Oppenheim A (2013) Pathogens use structural
57. Petrenko P, Doxey AC (2015) mimicMe: a web mimicry of native host ligands as a mechanism
server for prediction and analysis of host-like for host receptor engagement. Cell Host
Microbe 14(1):63–73. https://doi.org/10. 77. Gao M, Skolnick J (2010) Structural space of

1016/j.chom.2013.05.005 protein-protein interfaces is degenerate, close
68. Aloy P, Bottcher B, Ceulemans H, Leutwein C, to complete, and highly connected. Proc Natl
Mellwig C, Fischer S, Gavin AC, Bork P, Acad Sci U S A 107(52):22517–22522.
Superti-Furga G, Serrano L, Russell RB https://doi.org/10.1073/pnas.1012820107
(2004) Structure-based assembly of protein 78. Kundrotas PJ, Zhu Z, Janin J, Vakser IA
complexes in yeast. Science 303 (2012) Templates are available to model nearly
(5666):2026–2029. https://doi.org/10. all complexes of structurally characterized pro-
1126/science.1092645 teins. Proc Natl Acad Sci U S A 109
69. Rajasekharan S, Rana J, Gulati S, Sharma SK, (24):9438–9441. https://doi.org/10.1073/
Gupta V, Gupta S (2013) Predicting the host pnas.1200678109
protein interactors of Chandipura virus using a 79. Franzosa EA, Xia Y (2012) Structural models
structural similarity-based approach. Pathog for host-pathogen protein-protein interac-
Dis 69(1):29–35. https://doi.org/10.1111/ tions: assessing coverage and bias. Pac Symp
2049-632X.12064 Biocomput:287–298
70. Zhang A, He L, Wang Y (2017) Prediction of 80. Shatsky M, Nussinov R, Wolfson HJ (2004) A
GCRV virus-host protein interactome based on method for simultaneous alignment of multiple
structural motif-domain interactions. BMC protein structures. Proteins 56(1):143–156.
Bioinformatics 18(1):145. https://doi.org/ https://doi.org/10.1002/prot.10628
10.1186/s12859-017-1500-8 81. Tuncbag N, Gursoy A, Nussinov R, Keskin O
71. Lee SA, Chan CH, Tsai CH, Lai JM, Wang FS, (2011) Predicting protein-protein interactions
Kao CY, Huang CY (2008) Ortholog-based on a proteome scale by matching evolutionary
protein-protein interaction prediction and its and structural similarities at interfaces using
application to inter-species interactions. BMC PRISM. Nat Protoc 6(9):1341–1354.
Bioinformatics 9(Suppl 12):S11. https://doi. https://doi.org/10.1038/nprot.2011.367
org/10.1186/1471-2105-9-S12-S11 82. Keskin O, Nussinov R, Gursoy A (2008)
72. Krishnadev O, Srinivasan N (2008) A data inte- PRISM: protein-protein interaction prediction
gration approach to predict host-pathogen by structural matching. Methods Mol Biol
protein-protein interactions: application to rec- 484:505–521. https://doi.org/10.1007/
ognize protein interactions between human 978-1-59745-398-1_30
and a malarial parasite. In Silico Biol 8 83. Baspinar A, Cukuroglu E, Nussinov R,
(3–4):235–250 Keskin O, Gursoy A (2014) PRISM: a web
73. Schulze S, Henkel SG, Driesch D, Guthke R, server and repository for prediction of
Linde J (2015) Computational prediction of protein-protein interactions and modeling
molecular pathogen-host interactions based their 3D complexes. Nucleic Acids Res 42
on dual transcriptome data. Front Microbiol (Web Server issue):W285–W289. https://doi.
6:65. https://doi.org/10.3389/fmicb.2015. org/10.1093/nar/gku397
00065 84. Ogmen U, Keskin O, Aytuna AS, Nussinov R,
74. Guven-Maiorov E, Tsai CJ, Ma B, Nussinov R Gursoy A (2005) PRISM: protein interactions
(2017) Prediction of host-pathogen interac- by structural matching. Nucleic Acids Res 33
tions for helicobacter pylori by interface mim- (Web Server):W331–W336. https://doi.org/
icry and implications to gastric cancer. J Mol 10.1093/nar/gki585
Biol 429(24):3925–3941. https://doi.org/ 85. Gray JJ, Moughon S, Wang C, Schueler-
10.1016/j.jmb.2017.10.023 Furman O, Kuhlman B, Rohl CA, Baker D
75. Zhang QC, Petrey D, Deng L, Qiang L, Shi Y, (2003) Protein-protein docking with simulta-
Thu CA, Bisikirska B, Lefebvre C, Accili D, neous optimization of rigid-body displacement
Hunter T, Maniatis T, Califano A, Honig B and side-chain conformations. J Mol Biol 331
(2012) Structure-based prediction of protein- (1):281–299
protein interactions on a genome-wide scale. 86. Wang C, Schueler-Furman O, Baker D (2005)
Nature 490(7421):556–560. https://doi. Improved side-chain modeling for protein-
org/10.1038/nature11503 protein docking. Protein Sci 14
76. Zhang QC, Petrey D, Norel R, Honig BH (5):1328–1339. https://doi.org/10.1110/
(2010) Protein interface conservation across ps.041222905
structure space. Proc Natl Acad Sci U S A 107 87. Wang C, Bradley P, Baker D (2007) Protein-
(24):10896–10901. https://doi.org/10. protein docking with backbone flexibility. J
1073/pnas.1005894107 Mol Biol 373(2):503–519. https://doi.org/
10.1016/j.jmb.2007.07.050
88. Duarte JM, Srebniak A, Scharer MA, Capitani Heijne G, Nielsen J, Ponten F (2015) Proteo-
G (2012) Protein interface classification by mics. Tissue-based map of the human prote-
evolutionary analysis. BMC Bioinformatics ome. Science 347(6220):1260419. https://
13:334. https://doi.org/10.1186/1471- doi.org/10.1126/science.1260419
2105-13-334 91. Yang H, Ke Y, Wang J, Tan Y, Myeni SK, Li D,
89. Uhlen M, Bjorling E, Agaton C, Szigyarto CA, Shi Q, Yan Y, Chen H, Guo Z, Yuan Y, Yang X,
Amini B, Andersen E, Andersson AC, Yang R, Du Z (2011) Insight into bacterial
Angelidou P, Asplund A, Asplund C, virulence mechanisms against host immune
Berglund L, Bergstrom K, Brumer H, response via the Yersinia pestis-human pro-
Cerjan D, Ekstrom M, Elobeid A, Eriksson C, tein-protein interaction network. Infect
Fagerberg L, Falk R, Fall J, Forsberg M, Bjork- Immun 79(11):4413–4424. https://doi.org/
lund MG, Gumbel K, Halimi A, Hallin I, 10.1128/IAI.05622-11
Hamsten C, Hansson M, Hedhammar M, 92. Shannon P, Markiel A, Ozier O, Baliga NS,
Hercules G, Kampf C, Larsson K, Wang JT, Ramage D, Amin N,
Lindskog M, Lodewyckx W, Lund J, Schwikowski B, Ideker T (2003) Cytoscape: a
Lundeberg J, Magnusson K, Malm E, software environment for integrated models of
Nilsson P, Odling J, Oksvold P, Olsson I, biomolecular interaction networks. Genome
Oster E, Ottosson J, Paavilainen L, Persson A, Res 13(11):2498–2504. https://doi.org/10.
Rimini R, Rockberg J, Runeson M, 1101/gr.1239303
Sivertsson A, Skollermo A, Steen J, 93. Huang d W, Sherman BT, Lempicki RA (2009)
Stenvall M, Sterky F, Stromberg S, Bioinformatics enrichment tools: paths toward
Sundberg M, Tegel H, Tourle S, Wahlund E, the comprehensive functional analysis of large
Walden A, Wan J, Wernerus H, Westberg J, gene lists. Nucleic Acids Res 37(1):1–13.
Wester K, Wrethagen U, Xu LL, Hober S, Pon- https://doi.org/10.1093/nar/gkn923
ten F (2005) A human protein atlas for normal
and cancer tissues based on antibody proteo- 94. Huang d W, Sherman BT, Lempicki RA (2009)
mics. Mol Cell Proteomics 4(12):1920–1932. Systematic and integrative analysis of large gene
https://doi.org/10.1074/mcp.M500279- lists using DAVID bioinformatics resources.
MCP200 Nat Protoc 4(1):44–57. https://doi.org/10.
1038/nprot.2008.211
90. Uhlen M, Fagerberg L, Hallstrom BM,
Lindskog C, Oksvold P, Mardinoglu A, 95. Dissinger NJ, Damania B (2016) Recent
Sivertsson A, Kampf C, Sjostedt E, advances in understanding Kaposi’s sarcoma-
Asplund A, Olsson I, Edlund K, Lundberg E, associated herpesvirus. F1000Res 5:F1000.
Navani S, Szigyarto CA, Odeberg J, https://doi.org/10.12688/f1000research.
Djureinovic D, Takanen JO, Hober S, Alm T, 7612.1
Edqvist PH, Berling H, Tegel H, Mulder J, 96. Luther SA, Cyster JG (2001) Chemokines as
Rockberg J, Nilsson P, Schwenk JM, regulators of T cell differentiation. Nat Immu-
Hamsten M, von Feilitzen K, Forsberg M, nol 2(2):102–107. https://doi.org/10.1038/
Persson L, Johansson F, Zwahlen M, von 84205
Chapter 19
Predicting Functions of Disordered Proteins with MoRFpred

Christopher J. Oldfield, Vladimir N. Uversky, and Lukasz Kurgan
Abstract
Intrinsically disordered proteins and regions are involved in a wide range of cellular functions, and they
often facilitate protein-protein interactions. Molecular recognition features (MoRFs) are segments of
intrinsically disordered regions that bind to partner proteins, where binding is concomitant with a transition
to a structured conformation. MoRFs facilitate translation, transport, signaling, and regulatory processes
and are found across all domains of life. A popular computational tool, MoRFpred, accurately predicts
MoRFs in protein sequences. MoRFpred is implemented as a user-friendly web server that is freely available
at http://biomine.cs.vcu.edu/servers/MoRFpred/. We describe this predictor, explain how to run the
web server, and show how to interpret the results it generates. We also demonstrate the utility of this web
server based on two case studies, focusing on the relevance of evolutionary conservation of MoRF regions.
Key words Intrinsic disorder, Prediction, Molecular recognition features, MoRFs, Protein-protein
interactions, MoRFpred
1 Introduction
Intrinsically disordered proteins (IDPs) and protein regions (IDRs)
do not form stable three-dimensional structures, yet perform varied
and vital biological functions [1–4]. Because their function does not
require a stable structure, IDPs and IDRs pose several challenges to
both experimental and computational study [3]. The crux of these
challenges on the computational side is the lack of conservation in
many IDRs relative to structured proteins. Without the need to
maintain rigid structures, many IDRs diverge drastically, even in
closely related species [5]. Lack of conservation confounds
established methods for function annotation that rely on sequence
similarity to transfer functional annotations. Lack of conservation is
not universal in IDRs, however; many IDRs are conserved, or more
commonly are conserved in portions of their sequences [5].
One mechanism of IDP function relies on short functional elements
within IDRs. The evolutionary origin of these functional elements is
seemingly idiosyncratic: some examples have been found to be
evolutionarily conserved [6–8], whereas others have been proposed to
be emergent sequence features [9, 10]. A common function of these
short functional elements is binding to molecular partners, often
other proteins [11, 12]. These types of features are likely common
across many biological processes [13], such as cell cycle regulation,
modulation of cellular structure, and apoptosis.
One model of these functional elements is known as molecular
recognition features (MoRFs) [14]. It models functional elements
within IDRs as short regions of increased structural propensity
within longer regions of intrinsic disorder [13]. Examples of these
types of functional regions can readily be inferred from protein
structures and sequence properties [11]. Several predictors of MoRFs
have been developed [13–17]. Initial predictors relied on direct
interpretation of the MoRF model, by scanning for patterns in
predictions of intrinsic disorder and employing a second level of
prediction over patterns of interest [13, 15]. The most recent MoRF
predictors, including MoRFpred, relax the strict reliance on disorder
prediction patterns while still directly considering disorder
predictions [17]. Several other methods of MoRF prediction have been
independently developed [13–16, 18–21]. In addition to MoRFs, several
other related models of functional elements within IDRs have been
proposed. Eukaryotic linear motifs (ELMs) model these elements as
short sequence motifs, which can be predicted by pattern matching and
filtering spurious matches [22]. Though they are very different
models, MoRF and ELM predictions frequently coincide [23]. Further,
several generalized models of binding regions within IDRs have been
developed [24–27]. Relative to other methods, MoRFpred was developed
on a well-defined dataset of short functional elements that bind to
other proteins within larger regions of intrinsic disorder. As with
all methods of this type, the specificity is difficult to assess
exactly, but this predictor features a good estimated sensitivity
[17]. MoRFpred is useful for gaining insight into the function of
novel IDPs.
MoRFpred is available as a user-friendly web server at
http://biomine.cs.vcu.edu/servers/MoRFpred/. This server has been
extensively used by the community since it was released in early
2012. Usage data collected with the Google Analytics platform reveal
that MoRFpred was utilized close to 9000 times by over 2700 unique
users from 711 cities and 71 countries. The article that introduced
this computational tool had already been cited 175 times (source:
Google Scholar on June 29, 2018).
2 Materials and Methods
2.1 Datasets
For training of MoRFpred, a set of MoRFs was constructed beginning
with known binding regions from the Protein Data Bank (PDB) [28].
Bound peptides from PDB were carefully filtered for clear binding to
a longer protein chain and mapped back to their source proteins. This
procedure resulted in a dataset of 842 MoRFs. To avoid training and
testing on similar proteins, these MoRFs were grouped into 427
clusters and divided into testing and training sets. This gave
training and testing sets with 421 and 419 MoRFs, respectively, with
no protein more than 30% identical between the two sets (see Note 1).
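The cluster-level split described above can be sketched in Python as follows. This is a minimal illustration, not the procedure used to build the MoRFpred datasets: the function name `split_by_cluster`, the alternating assignment of shuffled clusters, and the toy identifiers are all assumptions. The sketch only shows why assigning whole clusters to one side keeps similar proteins from appearing in both sets.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, seed=0):
    """Assign whole clusters alternately to training and test sets.

    cluster_of maps a sequence identifier to its cluster label. Every
    member of a cluster ends up on the same side of the split.
    (Hypothetical helper; the assignment rule is an assumption.)
    """
    members = defaultdict(list)
    for seq_id, cluster in cluster_of.items():
        members[cluster].append(seq_id)

    clusters = sorted(members)              # deterministic order before shuffling
    random.Random(seed).shuffle(clusters)

    train, test = [], []
    for i, cluster in enumerate(clusters):
        target = train if i % 2 == 0 else test
        target.extend(members[cluster])
    return sorted(train), sorted(test)


# Toy example: six sequences in four clusters.
toy = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 3, "f": 4}
train_ids, test_ids = split_by_cluster(toy)
print(train_ids, test_ids)
```

In practice the cluster labels would come from a sequence clustering tool run at the chosen identity threshold; that step is outside the scope of this sketch.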
A set of negative examples that, with near certainty, do not contain
MoRFs was constructed from protein chains that have been completely
structurally characterized by X-ray crystallography at a high
resolution. The chance of intrinsic disorder in the negative set was
minimized by only selecting monomeric proteins without large
cofactors that contained no missing residues due to lack of electron
density. Further, any protein with a significant amount of predicted
intrinsic disorder, >30% of residues, was discarded. Filtering for
proteins with less than 30% identity resulted in a set of 28 proteins.
2.2 Architecture
MoRFpred is a support vector machine (SVM) over a rich feature space
merged with a sequence similarity-based prediction (Fig. 1). Features
considered for the linear kernel SVM predictor included five disorder
prediction methods [29–32], relative solvent accessible surface
prediction [33], B-factor prediction [34], PSI-BLAST-generated
position-specific scoring matrices (PSSMs), and amino acid propensity
scales from AAindex [35]. Two broad sets of features were derived
from each of these sources: (1) per-residue values over a window of
25 residues and (2) values aggregated over a window. Aggregation
methods included taking the difference between values over a window
of 25 residues and values over a smaller window, which captures the
features found to be useful for previous MoRF predictors. For
example, previous MoRF predictors relied on elevated predicted
disorder surrounding a predicted MoRF, but depressed values for the
MoRF region itself. Indeed, the corresponding difference-based
aggregation was found to be one of the strongest MoRF features.
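A difference-based window feature of the kind described above might look like the following Python sketch. The 25-residue outer window comes from the text; the inner window width of 5, the use of the mean as the aggregate, the truncation of windows at sequence ends, and the function names are assumptions made for illustration only.

```python
def window_mean(values, center, width):
    """Mean of `values` in a window of `width` residues centered on index
    `center` (0-based), truncated at the sequence ends."""
    half = width // 2
    lo = max(0, center - half)
    hi = min(len(values), center + half + 1)
    segment = values[lo:hi]
    return sum(segment) / len(segment)


def difference_feature(values, center, outer=25, inner=5):
    """Wide-window mean minus narrow-window mean at one residue.

    A positive value means the residue sits in a local dip of the
    profile relative to its surroundings (assumed inner width of 5).
    """
    return window_mean(values, center, outer) - window_mean(values, center, inner)


# Toy disorder profile: a 5-residue dip inside a highly disordered region.
profile = [0.9] * 10 + [0.2] * 5 + [0.9] * 10
print(difference_feature(profile, center=12))
```

For a predicted-disorder profile, a large positive value at a residue corresponds to the pattern exploited by earlier MoRF predictors: high disorder in the flanks and lower disorder in the candidate MoRF itself.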
Feature selection for the SVM predictor was based on a best-first
iterative addition of ranked features. Features were ranked based on
a combination of biserial correlations [36] and single-feature
predictive performance, where poorly correlated or poorly performing
features were removed from consideration. Iterative addition of
features was based on a modified fivefold cross-validation procedure,
where a feature was only added if it improved prediction performance
by at least 1%.
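The iterative addition step can be illustrated with a short Python sketch. It is not the MoRFpred implementation: `forward_select` and the `score` callable are hypothetical, the 1% threshold is interpreted here as a relative improvement (the text does not say whether it is relative or absolute), and the toy scoring function stands in for the cross-validated performance estimate.

```python
def forward_select(ranked_features, score, min_gain=0.01):
    """Greedy best-first addition of pre-ranked features.

    ranked_features: feature names, best-ranked first.
    score: callable taking a list of feature names and returning a
           performance estimate (e.g., from cross-validation).
    A feature is kept only if it improves the current best score by at
    least `min_gain` in relative terms (assumed interpretation).
    """
    selected, best = [], None
    for feature in ranked_features:
        candidate = selected + [feature]
        value = score(candidate)
        if best is None or value >= best * (1.0 + min_gain):
            selected, best = candidate, value
    return selected, best


# Toy scoring function: each feature contributes a fixed, additive gain.
gains = {"disorder_diff": 0.60, "pssm_col": 0.001, "rsa_mean": 0.05}
chosen, final_score = forward_select(
    ["disorder_diff", "pssm_col", "rsa_mean"],
    score=lambda feats: sum(gains[f] for f in feats),
)
print(chosen, final_score)
```

In this toy run the second feature is skipped because its gain is below the threshold, while the third is still considered and accepted.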
Fig. 1 Architecture of MoRFpred. The input sequence is used to generate sequence properties, from which
input features are derived by windowed averaging. A support vector machine predicts MoRFs based on these
input features. The SVM prediction is merged with similarity-based predictions to produce the final MoRFpred
score, where scores above 0.5 are predicted MoRFs (M) and those less than 0.5 are predicted non-MoRFs (n)
Similarity-based predictions were made using a PSI-BLAST search
against MoRF-containing proteins in the training set. PSI-BLAST
matches were selected based on an e-value threshold. An e-value = 0.5
was selected based on optimization of performance of the merged
predictor. MoRF annotations from the training set are transferred to
the query protein based on the PSI-BLAST alignment. These transferred
annotations are merged with the SVM-based prediction by adding one to
the SVM prediction result and dividing by two, which ensures that the
merged prediction for transferred annotations will be over the 0.5
threshold.
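The merging rule can be written out in a few lines of Python. The arithmetic, (SVM score + 1) / 2 for residues with a transferred annotation, follows the text; leaving the SVM score unchanged for all other residues, and the assumption that SVM scores lie between 0 and 1, are our reading rather than something stated explicitly. The function name is hypothetical.

```python
def merge_scores(svm_scores, transferred):
    """Combine per-residue SVM scores with alignment-transferred annotations.

    svm_scores: floats, assumed to lie in [0, 1].
    transferred: booleans, True where a MoRF annotation was transferred
                 from an aligned training protein.
    """
    merged = []
    for score, is_transferred in zip(svm_scores, transferred):
        if is_transferred:
            # Maps [0, 1] onto [0.5, 1], so the residue reaches the threshold.
            merged.append((score + 1.0) / 2.0)
        else:
            merged.append(score)  # assumption: untouched without a transfer
    return merged


print(merge_scores([0.2, 0.8, 0.4], [True, False, True]))
```

Note that an SVM score of exactly 0 maps to exactly 0.5, so the merged value is strictly above the threshold only when the SVM score itself is positive.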
2.3 Predictive Quality
Prediction performance was assessed in the MoRFpred publication
[17], using the true-positive and false-positive rates, overall
accuracy (ACC), area under the ROC curve (AUC), and success rate.
Success rate is a per-sequence measure of performance, where a
sequence is considered successfully predicted if the MoRF residues
have a higher average prediction score than the non-MoRF residues.
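As a concrete reading of the success rate definition, the following Python sketch computes it for a small set of annotated sequences. The function name, the input layout (one pair of score and label lists per sequence), and the decision to skip sequences that lack either class are assumptions; only the comparison of mean scores comes from the text.

```python
def success_rate(proteins):
    """Fraction of sequences whose MoRF residues have a higher mean
    prediction score than their non-MoRF residues.

    proteins: iterable of (scores, labels) pairs, where labels use
              1 for MoRF residues and 0 for non-MoRF residues.
    Sequences without both classes are skipped (assumed handling).
    """
    successes = 0
    counted = 0
    for scores, labels in proteins:
        morf = [s for s, y in zip(scores, labels) if y == 1]
        other = [s for s, y in zip(scores, labels) if y == 0]
        if not morf or not other:
            continue
        counted += 1
        if sum(morf) / len(morf) > sum(other) / len(other):
            successes += 1
    return successes / counted if counted else 0.0


toy = [
    ([0.1, 0.8, 0.9, 0.2], [0, 1, 1, 0]),  # MoRF residues score higher
    ([0.7, 0.3, 0.2, 0.6], [0, 1, 1, 0]),  # MoRF residues score lower
]
print(success_rate(toy))
```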
MoRFpred performance was assessed in comparison to ANCHOR and
previously developed MoRF predictors. MoRFpred had ACC = 94.7% and
the highest performance by success rate and AUC evaluations, with
values of 71.8% and 67.3%, respectively. The original MoRF predictor
had a very low false-positive rate, which artificially inflated its
ACC value due to the large proportion of non-MoRF residues in the
dataset. Adjusting the MoRFpred threshold to an equally low
false-positive rate results in nearly double the true-positive rate
of the original MoRF predictor.
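The effect of class imbalance on ACC can be seen with a little arithmetic. The sketch below expresses accuracy in terms of the true-positive rate, the false-positive rate, and the fraction of positive residues. All numbers in the example are hypothetical and chosen only to illustrate the point; they are not taken from the MoRFpred evaluation.

```python
def accuracy(tpr, fpr, positive_fraction):
    """Overall accuracy from the true-positive rate, the false-positive
    rate, and the fraction of residues that are positives."""
    negative_fraction = 1.0 - positive_fraction
    return tpr * positive_fraction + (1.0 - fpr) * negative_fraction


# Hypothetical dataset in which 5% of residues are MoRF residues.
never_predicts_morfs = accuracy(tpr=0.0, fpr=0.0, positive_fraction=0.05)
finds_half_of_morfs = accuracy(tpr=0.5, fpr=0.10, positive_fraction=0.05)
print(never_predicts_morfs, finds_half_of_morfs)
```

With these made-up values, a predictor that never calls a MoRF reaches an accuracy of about 0.95, higher than a predictor that recovers half of the MoRF residues at a 10% false-positive rate (about 0.88). This is why comparisons at a matched false-positive rate, or threshold-independent measures such as AUC, are more informative than ACC alone.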
2.4 Web Server
The MoRFpred web server is freely available at
http://biomine.cs.vcu.edu/servers/MoRFpred/. The server can be
accessed with an Internet connection and any modern web browser. All
computations that are needed to complete predictions are performed on
the server side.
On our web server, results for submitted sequences are returned
within 20 min of submission (see Note 2). The runtime of MoRFpred is
dominated by the PSI-BLAST search, whose runtime varies with protein
length and database similarity.
The main server page is where proteins are submitted for prediction.
The web server only requires FASTA sequences of the proteins of
interest to perform MoRFpred predictions. Up to five FASTA-formatted
protein sequences may be entered into the large text entry field per
submission. An e-mail address is required for each submission. All
required programs for generating prediction features, including
PSI-BLAST and the disorder, RSA, and B-factor predictors, are run
automatically by scripts on the server. Upon completion of
predictions for each submission, the server will send an e-mail
notification with links to the prediction results.
2.5 Running MoRFpred
From the main server page, three steps are required to submit
sequences and obtain MoRFpred predictions (Fig. 2, steps are
highlighted with red numbers corresponding to the step):
Fig. 2 Primary MoRFpred page, for submission of sequences for prediction. Red numbers indicate the
sequence of steps required to submit a prediction
1. Copy your FASTA-formatted sequence (see Note 3) from its
source file or web page and paste it into the text box (see Notes
4 and 5).
2. Enter an e-mail address. This is the address to which links to the
prediction results will be sent.
3. Click “Run MoRFpred!”. This submits the sequences to our
server for MoRFpred predictions.
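Before pasting sequences into the form, it can help to check them locally against the limits given in Notes 3 to 5 (a ">" header per sequence, at most five sequences per submission, at most 1000 residues per sequence). The following is a minimal sketch of such a check; the function names are hypothetical, the example peptide is made up, and nothing here talks to the server.

```python
MAX_SEQUENCES = 5   # sequences per submission (Note 4)
MAX_LENGTH = 1000   # residues per sequence (Note 5)

def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def check_submission(text):
    """Return a list of problems; an empty list means the stated limits are met."""
    records = parse_fasta(text)
    problems = []
    if not records:
        problems.append("no sequences found")
    if len(records) > MAX_SEQUENCES:
        problems.append(f"{len(records)} sequences; limit is {MAX_SEQUENCES}")
    for header, seq in records:
        if not seq:
            problems.append(f"{header}: empty sequence")
        elif len(seq) > MAX_LENGTH:
            problems.append(f"{header}: {len(seq)} residues; limit is {MAX_LENGTH}")
    return problems

def batches(records, size=MAX_SEQUENCES):
    """Split a longer list of records into submissions of at most `size` sequences."""
    return [records[i:i + size] for i in range(0, len(records), size)]

example = ">toy_peptide made-up example\nMEEPQSDPSV\nEPPLSQETFS\n"
print(check_submission(example))
```

For more than five proteins, `batches` gives the groups that would be submitted one after another, as described in Note 2.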
Once sequences are submitted for prediction, the browser is
redirected to a status page that gives the current position of the
submission in the server queue. This page will be automatically
redirected to the results page when predictions are completed.
The queue on the server is first come, first served, and if there is a
large number of submissions, predictions may be delayed. Even if
the web page is closed at this point, links to predictions will still be
received through e-mail (see Note 6).
2.6 MoRFpred Results
The results page includes a link to the raw results (Fig. 3, red 1) as
well as a color-coded text display of MoRFpred results (Fig. 3, red
2). The raw results file (results.csv) gives results for each submitted
sequence, each in three comma-delimited lines:
Fig. 3 MoRFpred prediction results page. Red numbers correspond to the primary features of the results page
1. The input sequence: the FASTA header followed by each resi-
due of the input sequence.
2. Binary MoRFpred predictions: the string “MoRFpred” fol-
lowed by one character for each residue in the input
sequence—“O” for non-MoRF residues and “D” for MoRF
residues.
3. The raw prediction output: the string “prob” followed by one
floating point number between 0 and 1 for each residue of the
input sequence. Predicted MoRF residues correspond to values
greater than 0.5.
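The three-line layout described above can be read back with a short script. This is a sketch under stated assumptions: every residue, prediction character, and score is taken to be its own comma-separated field, and the FASTA header is assumed to contain no commas. The function names and the sample rows are hypothetical, so the output should be checked against a real results.csv file.

```python
import csv
import io

def morf_regions(probs, threshold=0.5):
    """Return 1-based inclusive (start, end) runs where the score exceeds threshold."""
    regions, start = [], None
    for pos, p in enumerate(probs, start=1):
        if p > threshold:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos - 1))
            start = None
    if start is not None:
        regions.append((start, len(probs)))
    return regions

def parse_results(text):
    """Parse results text laid out as three comma-delimited rows per sequence."""
    rows = [[f.strip() for f in row] for row in csv.reader(io.StringIO(text)) if row]
    if len(rows) % 3 != 0:
        raise ValueError("expected three rows per sequence")
    parsed = []
    for i in range(0, len(rows), 3):
        seq_row, call_row, prob_row = rows[i:i + 3]
        if call_row[0] != "MoRFpred" or prob_row[0] != "prob":
            raise ValueError(f"unexpected row labels in block starting at row {i + 1}")
        if not (len(seq_row) == len(call_row) == len(prob_row)):
            raise ValueError(f"rows differ in length in block starting at row {i + 1}")
        probs = [float(x) for x in prob_row[1:]]
        parsed.append({
            "header": seq_row[0],
            "sequence": "".join(seq_row[1:]),
            "calls": "".join(call_row[1:]),
            "probs": probs,
            "regions": morf_regions(probs),  # runs with score > 0.5
        })
    return parsed

# Made-up rows that follow the described layout.
SAMPLE = (
    ">toy_peptide,M,E,E,P,Q\n"
    "MoRFpred,O,D,D,O,O\n"
    "prob,0.12,0.81,0.66,0.30,0.05\n"
)
for entry in parse_results(SAMPLE):
    print(entry["header"], entry["regions"])
```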
The color-coded text display of MoRFpred predictions
includes the FASTA header of each input sequence (Fig. 3, red
3), and several aligned rows. The rows are aligned by residue from
N-terminus to C-terminus. The rows are, from top to bottom, the
input sequence (Fig. 3, red 4), binary MoRFpred results (Fig. 3,
red 5) indicating non-MoRF residues (green “n”) and MoRF
residues (red “M”), and the raw prediction value (Fig. 3, red 6)
multiplied by 10 and rounded, with alternating residues in black
and white.
The notification e-mail contains links both to the results page
(Fig. 4, red 1) and to the raw results file (Fig. 4, red 2). This e-mail
can be saved to access results at a later time.
3 Case Studies
As subjects of the case studies, we selected two proteins of different
origin: human p53 (a 393-residue-long protein) and RNase E from
E. coli (a 1061-residue-long protein). These two proteins have very
different biological functions, are characterized by different levels
of intrinsic disorder, and possess different numbers of MoRFs.
Fig. 4 Notification e-mail. Red numbers correspond to links to prediction results
3.1 Case Study: p53
Because of its crucial biological roles in regulation of apoptosis,
genomic stability, and inhibition of angiogenesis, as well as many
mechanisms of anticancer activity, cellular tumor antigen p53 is one
of the most studied proteins. The p53 signaling pathway is activated
in response to a variety of stress signals. Activated p53 is
accumulated in the nucleus, where its binding to specific DNA
results in the induction or inhibition of a range of different genes
[37, 38], many of which are involved in apoptosis, growth arrest, or
senescence [39–42]. In the unstressed mammalian cells, continu-
ous ubiquitination of the non-phosphorylated p53 by double-min-
ute-2 ubiquitin ligase (MDM2) [43] and subsequent proteasomal
degradation ensure short lifetime and low levels of p53. There is
also a negative feedback between the p53 and Akt pathways [44],
where Akt is activated in cells exposed to various stimuli ranging
from hormones to growth factors, and to extracellular matrix com-
ponents [45], and controls the MDM2-mediated targeting of p53
for degradation [46]. Loss of p53 function due to mutations in this
protein or some other alterations in the pathways leading to its
activation and regulation is a common feature in the majority of
human cancers [47]. Such mutations account for ~90% of cancer-
related mutations in the TP53 gene and are found in 50% of human
cancers [48]. For example, up to 50% of advanced-stage prostate
cancers contain mutations in p53 [49], and progression of prostate
cancer to metastatic disease is characterized by the loss of p53
[50]. Furthermore, p53 levels may have prognostic value in uro-
logical oncology [51].
Fig. 5 Case study: p53. The correspondence between intrinsic disorder predictions (red line), sequence
conservation (blue line), binding regions (orange boxes), and predicted MoRF regions (green boxes) is shown.
Binding regions are discussed in the text. Sequence conservation is calculated from a set of p53 orthologs
(OrthoDB) as the relative profile entropy over maximum entropy-weighted sequence (large values indicate
greater conservation)
There are three major functional domains in human p53, the
intrinsically disordered N-terminal regulatory domain (residues
1–92), the ordered central DNA-binding domain (DBD, residues
94–292) [52–54], and the intrinsically disordered C-terminal olig-
omerization and regulatory domain (residues 293–393) [55]. The
regulatory domains can be further subdivided into functional
subdomains/regions, such as transactivation domain 1 (TAD1;
residues 1–40), TAD2 (residues 40–60), and a proline-rich region,
PR (residues 64–92), in the N-terminal regulatory domain, and
tetramerization or oligomerization domain (OD; residues
325–356) and a regulatory C-terminal domain (CTD; residues
356–393) in the C-terminal regulatory domain [55, 56]. The
N-terminal and C-terminal regulatory domains show exceptional
binding promiscuity. Some of the illustrative examples of proteins
interacting with the N-terminal transactivation region of p53
include CBP/p300, CSN5/Jab1, MDM2, RPA, TFIIH, and
TFIID [43], whereas the CTD of p53 is engaged in interaction
with 14-3-3, GSK3β, hGcn5, PARP-1, S100Bββ, TAF, TAF1, and
TRRAP, to name a few [43]. Importantly, despite their crucial role
in biological activities of p53, the regulatory regions of this protein
are characterized by relatively poor evolutionary conservation,
whereas the central DBD is highly conserved among dif-
ferent species. Irrespective of the general lack of conservation, there
are four MoRFs in human p53 that overlap with or are included
in the known binding sites of this protein (see Fig. 5).
The first MoRF (see Fig. 5, box A) coincides with the MDM2-
binding site of p53. MDM2 is the E3 ubiquitin-protein ligase that
is known as an important oncogene due to its overexpression in
many human cancers, such as breast, colon, and prostate cancers, as
well as hematologic malignancies and sarcomas [57]. MDM2 is
most famous for its vital role in the p53 regulation via binding to
a short stretch (residues 13–29) of the p53 TAD1 that prevents
p53-driven activation or inhibition of various genes, via the
MDM2-mediated p53 ubiquitination that targets this protein for
the proteasomal degradation, and via active p53 transport out of
the nucleus due to the presence of a nuclear export signal in MDM2
[58, 59]. Therefore, alteration of the p53-MDM2 interaction pathway
is considered a promising target for cancer therapy [57].
X-ray crystallographic studies of the p53-MDM2 complex revealed
that the MDM2-binding region of p53 forms an α-helical structure
bound to a deep groove on the surface of the N-terminal domain of
MDM2 (residues 17–125) [60].
The second MoRF (see Fig. 5, box B) is included in the p53N
fragment (residues 33–60) responsible for the p53 interaction with
the N-terminal domain of the single-stranded DNA (ssDNA)-
binding protein, replication protein A (RPA) [61]. This RPA70N
domain is characterized by an oligonucleotide/oligosaccharide-
binding fold typical for the ssDNA-binding domains, whereas the
p53N fragment, which is disordered in isolation, forms two amphi-
pathic helices, H1 and H2, following RPA70N binding [61]. Also,
unlike other MoRFs in this protein, this MoRF displays a large
amount of sequence conservation (see Fig. 5, conservation score).
The third MoRF (see Fig. 5, box C) is a part of the p53
tetramerization domain (residues 325–356), the structure of which comprises
a short β-strand (residues 326–333) followed by an α-helix (resi-
dues 335–355). These two structural elements are connected by a
sharp turn facilitated by a conserved glycine residue (Gly334). Two
monomers of the p53 tetramerization domain associate to form an
antiparallel double-stranded sheet, and the antiparallel association
of their helices forms a two-helical bundle. Four chains form a
tetramer that can be described as a dimer of primary dimers [62].
The fourth MoRF (see Fig. 5, box D) is a part of the highly
promiscuous C-terminal binding region of p53 (residues 374–388)
that can bind to cyclin A [63], sirtuin [64], CBP [65], or S100Bββ
[66]. It was pointed out that upon interaction with different part-
ners, this binding region of p53 displays all three major secondary
structure types in the four complexes [67], where its core fragment
becomes an α-helix when bound to S100Bββ [66], a β-strand when
bound to sirtuin [64], and a coil with two distinct backbone
trajectories when bound to CBP [65] and cyclin A2 [63].
MoRFpred correctly identifies the four MoRF regions in p53
(see Fig. 5, green boxes), in spite of the significantly different
conservation profiles of the four MoRF regions (see Fig. 5, blue
lines). We note the relatively low conservation of the first, third, and
fourth MoRF region and much higher conservation values for the
second region. Interestingly, the red lines that identify the putative
propensities for disorder, which were generated with VSL2B [68],
correctly identify both termini of p53 as intrinsically disordered.
However, they also register dips where the MoRFs are located.
These dips are a by-product of the fact that MoRF regions become
structured upon interacting with the protein partner, reducing the
inherent propensity of these amino acids to be intrinsically
disordered.
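One simple way to quantify such a dip is to compare the mean disorder score inside a MoRF region with the mean score in its flanking residues. The sketch below assumes a per-residue disorder profile (for example, one produced by a disorder predictor) is already available as a list of numbers. The function name, the flank width, and the toy profile are hypothetical, and this is not the analysis used to draw Fig. 5.

```python
from statistics import mean

def dip_depth(disorder, start, end, flank=10):
    """Mean flanking disorder minus mean disorder inside residues start..end.

    Positions are 1-based and inclusive; a positive value indicates a dip.
    """
    inside = disorder[start - 1:end]
    left = disorder[max(0, start - 1 - flank):start - 1]
    right = disorder[end:end + flank]
    flanks = left + right
    if not inside or not flanks:
        raise ValueError("region or flanks are empty")
    return mean(flanks) - mean(inside)

# Made-up profile: a highly disordered stretch with a dip at residues 11-15.
profile = [0.9] * 10 + [0.6] * 5 + [0.9] * 10
print(round(dip_depth(profile, 11, 15, flank=5), 3))
```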
3.2 Case Study: RNase E
Endoribonucleases are hydrolytic enzymes that catalyze the endo-
nucleolytic cleavage of RNA, have various specificities, are univer-
sally present in all organisms, and typically operate under tight
cellular regulation. Endoribonucleases are involved in the matura-
tion, modification, and degradation of different RNAs [69]. There
are at least five endoribonucleases in E. coli (RNases I*, III, E, G,
P). Among various activities attributed to RNase E are processing
of transfer RNA, 9S ribosomal RNA, catalytic RNA of RNase P,
transfer/messenger RNA (t/mRNA) that rescues stalled ribosomes
[70–72], and general mRNA decay [73].
Being one of the larger E. coli proteins, RNase E consists of
1061 amino acid residues [74, 75]. There are two functionally
different domains in this protein, the catalytic N-terminal domain
(NTD; residues 1–498) and the regulatory C-terminal domain
(CTD; residues 499–1061) [76–78]. Although the NTD is rela-
tively conserved and has numerous homologues [79], there is little
sequence conservation in the CTD [80], which is also characterized
by low sequence complexity. The purified CTD was shown to be
mostly disordered by a set of biophysical techniques, such as limited
proteolysis, SDS–PAGE, SAXS, and far-UV CD [81]. Despite
being highly disordered, the CTD was shown to interact with
other degradosome components and with structured RNA
[81]. In agreement with these experimental data, computational
analysis clearly indicated that the NTD of RNase E was expected to
be mostly structured, whereas the CTD had characteristics of a
highly disordered protein [81].
The CTD is highly disordered, which is in agreement with the
high values of the putative propensities for disorder generated for
this protein with VSL2B [68] (see Fig. 6, red line). CTD is also
characterized by the presence of four regions of increased structural
propensity (labeled as segments A, B, C, and D, respectively),
which correspond to MoRFs. The four MoRFs were correctly
identified by the MoRFpred method (green boxes). Importantly,
all these segments are related to various biological activities of
RNase E, such as membrane targeting and CTD self-association
(segment A corresponding to residues 565–585) or interactions
with the components of the RNA degradosome, helicase
(segment B, which is a portion of the arginine-rich domain (resi-
dues 628–843)) [78, 82], enolase (segment C (residues 833–850))
[81], and polynucleotide phosphorylase PNPase (segment D,
RNase E residues 1021–1061) [81]. As in the case of p53,
some of the MoRF regions (see Fig. 6, segments C and D) are
concomitant with a substantial decrease in the putative propensity
for disorder (red line), but the remaining two regions do not
register these dips. However, MoRFpred is still capable of identify-
ing these MoRF regions, in spite of their high propensity for
disorder and lack of conservation (blue line).
[Fig. 6 plot: tracks for RNA binding, binding regions (A–D), and predicted MoRFs; y-axes: disorder score and conservation score; x-axis: residue index (450–1,050)]
Fig. 6 Case study: RNase E. The correspondence between intrinsic disorder predictions (red line), sequence
conservation (blue line), binding regions (orange boxes), and predicted MoRF regions (green boxes) is shown.
Binding regions are discussed in the text. Sequence conservation is calculated from a set of RNase E orthologs
(OrthoDB) as the relative profile entropy over maximum entropy-weighted sequence (large values indicate
greater conservation)
4 Notes
1. The datasets can be downloaded from
http://biomine.cs.vcu.edu/servers/MoRFpred/.
2. To partially compensate for the long runtime of the algorithm,
up to five sequences can be submitted simultaneously to the
web server. As soon as the results for one batch of up to five
sequences are returned, another set of sequences can be
submitted.
3. In FASTA format, each sequence is prefixed by a line beginning
with “>” followed by some identifying text. The sequence
should begin on the following line.
4. Up to five sequences can be submitted at a time. Ensure that
each sequence has its own “FASTA header,” which is a separate
line beginning with “>.”
5. The maximum length of each submitted sequence is 1000
residues.
6. It is advised to store or bookmark the link at this point. Pre-
dictions are stored on the server for at least 3 months, and
keeping the link will allow return to the results pages. It also
protects against lost predictions, in the case that an incorrect
notification e-mail address was entered.
References
1. Wang C, Uversky VN, Kurgan L (2016) Disordered nucleiome: abundance of intrinsic disorder in the DNA- and RNA-binding proteins in 1121 species from Eukaryota, Bacteria and Archaea. Proteomics 16(10):1486–1498
2. Peng Z, Yan J, Fan X, Mizianty MJ, Xue B, Wang K, Hu G, Uversky VN, Kurgan L (2015) Exceptionally abundant exceptions: comprehensive characterization of intrinsic disorder in all domains of life. Cell Mol Life Sci 72(1):137–151
3. Habchi J, Tompa P, Longhi S, Uversky VN (2014) Introducing protein intrinsic disorder. Chem Rev 114(13):6561–6588
4. Dunker AK, Babu MM, Barbar E, Blackledge M, Bondos SE, Dosztányi Z, Dyson HJ, Forman-Kay J, Fuxreiter M, Gsponer J, Han K-H, Jones DT, Longhi S, Metallo SJ, Nishikawa K, Nussinov R, Obradovic Z, Pappu RV, Rost B, Selenko P, Subramaniam V, Sussman JL, Tompa P, Uversky VN (2013) What’s in a name? Why these proteins are intrinsically disordered. Intrinsically Disord Proteins 1(1):e24157
5. Brown CJ, Takayama S, Campen AM, Vise P, Marshall TW, Oldfield CJ (2002) Evolutionary rate heterogeneity in proteins with long disordered regions. J Mol Evol 55:104
6. Meszaros B, Tompa P, Simon I, Dosztanyi Z (2007) Molecular principles of the interactions of disordered proteins. J Mol Biol 372(2):549–561
7. Trudeau T, Nassar R, Cumberworth A, Wong ET, Woollard G, Gsponer J (2013) Structure and intrinsic disorder in protein autoinhibition. Structure 21(3):332–341
8. Varadi M, Guharoy M, Zsolyomi F, Tompa P (2015) DisCons: a novel tool to quantify and classify evolutionary conservation of intrinsic protein disorder. BMC Bioinformatics 16(1):153
9. Ait-Bara S, Carpousis AJ, Quentin Y (2015) RNase E in the gamma-Proteobacteria: conservation of intrinsically disordered noncatalytic region and molecular evolution of microdomains. Mol Genet Genomics 290(3):847–862
10. Davey NE, Cyert MS, Moses AM (2015) Short linear motifs – ex nihilo evolution of protein regulation. Cell Commun Signal 13(1):43
11. Mohan A, Oldfield CJ, Radivojac P, Vacic V, Cortese MS, Dunker AK, Uversky VN (2006) Analysis of molecular recognition features (MoRFs). J Mol Biol 362(5):1043–1059
12. Vacic V, Oldfield CJ, Mohan A, Radivojac P, Cortese MS, Uversky VN, Dunker AK (2007) Characterization of molecular recognition features, MoRFs, and their binding partners. J Proteome Res 6(6):2351–2366
13. Oldfield CJ, Cheng Y, Cortese MS, Romero P, Uversky VN, Dunker AK (2005) Coupled folding and binding with alpha-helix-forming molecular recognition elements. Biochemistry 44(37):12454–12470
14. Yan J, Dunker AK, Uversky VN, Kurgan L (2016) Molecular recognition features (MoRFs) in three domains of life. Mol BioSyst 12(3):697–710
15. Cheng Y, Oldfield CJ, Meng J, Romero P, Uversky VN, Dunker AK (2007) Mining α-helix-forming molecular recognition features with cross species sequence alignments. Biochemistry 46(47):13468–13477
16. Malhis N, Gsponer J (2015) Computational identification of MoRFs in protein sequences. Bioinformatics 31(11):1738–1744
17. Disfani FM, Hsu WL, Mizianty MJ, Oldfield CJ, Xue B, Dunker AK, Uversky VN, Kurgan L (2012) MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics 28(12):i75–i83
18. Malhis N, Jacobson M, Gsponer J (2016) MoRFchibi SYSTEM: software tools for the identification of MoRFs in protein sequences. Nucleic Acids Res 44:W488
19. Jones DT, Cozzetto D (2015) DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics 31(6):857–863
20. Fang C, Noguchi T, Tominaga D, Yamana H (2013) MFSPSSMpred: identifying short disorder-to-order binding regions in disordered proteins based on contextual local evolutionary conservation. BMC Bioinformatics 14:300
21. Xue B, Dunker AK, Uversky VN (2010) Retro-MoRFs: identifying protein binding sites by normal and reverse alignment and intrinsic disorder prediction. Int J Mol Sci 11(10):3725–3747
22. Puntervoll P, Linding R, Gemünd C, Chabanis-Davidson S, Mattingsdal M, Cameron S, Martin DMA, Ausiello G, Brannetti B, Costantini A, Ferrè F, Maselli V, Via A, Cesareni G, Diella F, Superti-Furga G, Wyrwicz L, Ramu C, McGuigan C, Gudavalli R, Letunic I, Bork P, Rychlewski L, Küster B, Helmer-Citterich M, Hunter WN, Aasland R, Gibson TJ (2003) ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res 31(13):3625–3630
23. Meszaros B, Dosztanyi Z, Simon I (2012) Disordered binding regions and linear motifs--bridging the gap between two models of molecular recognition. PLoS One 7(10):e46829
24. Peng Z, Wang C, Uversky VN, Kurgan L (2017) Prediction of disordered RNA, DNA, and protein binding regions using DisoRDPbind. Methods Mol Biol 1484:187–203
25. Meszaros B, Simon I, Dosztanyi Z (2009) Prediction of protein binding regions in disordered proteins. PLoS Comput Biol 5(5):e1000376
26. Dosztanyi Z, Meszaros B, Simon I (2009) ANCHOR: web server for predicting protein binding regions in disordered proteins. Bioinformatics 25(20):2745–2746
27. Khan W, Duffy F, Pollastri G, Shields DC, Mooney C (2013) Predicting binding within disordered protein regions to structurally characterised peptide-binding domains. PLoS One 8(9):e72838
28. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The protein data bank. Nucleic Acids Res 28(1):235–242
29. Dosztanyi Z, Csizmok V, Tompa P, Simon I (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21(16):3433–3434
30. Ward JJ, McGuffin LJ, Bryson K, Buxton BF, Jones DT (2004) The DISOPRED server for the prediction of protein disorder. Bioinformatics 20(13):2138–2139
31. McGuffin LJ (2008) Intrinsic disorder prediction from the analysis of multiple protein fold recognition models. Bioinformatics 24(16):1798–1804
32. Mizianty MJ, Stach W, Chen K, Kedarisetti KD, Disfani FM, Kurgan L (2010) Improved sequence-based prediction of disordered regions with multilayer fusion of multiple information sources. Bioinformatics 26(18):i489–i496
33. Faraggi E, Xue B, Zhou Y (2009) Improving the prediction accuracy of residue solvent accessibility and real-value backbone torsion angles of proteins by guided-learning through a two-layer neural network. Proteins 74(4):847–856
34. Schlessinger A, Yachdav G, Rost B (2006) PROFbval: predict flexible and rigid residues in proteins. Bioinformatics 22(7):891–893
35. Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M (2008) AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 36(Database issue):D202–D205
36. Tate RF (1954) Correlation between a discrete and a continuous variable. Point-Biserial correlation. Ann Math Statist 25(3):603–607
37. Zhao R, Gish K, Murphy M, Yin Y, Notterman D, Hoffman WH, Tom E, Mack DH, Levine AJ (2000) Analysis of p53-regulated gene expression patterns using oligonucleotide arrays. Genes Dev 14(8):981–993
38. Balint EE, Vousden KH (2001) Activation and activities of the p53 tumour suppressor protein. Br J Cancer 85(12):1813–1823
39. el-Deiry WS (1998) Regulation of p53 downstream genes. Semin Cancer Biol 8(5):345–357
40. Yu J, Zhang L, Hwang PM, Rago C, Kinzler KW, Vogelstein B (1999) Identification and classification of p53-regulated genes. Proc Natl Acad Sci U S A 96(25):14517–14522
41. Sax JK, El-Deiry WS (2003) p53-induced gene expression analysis. Methods Mol Biol 234:65–71
42. Fridman JS, Lowe SW (2003) Control of apoptosis by p53. Oncogene 22(56):9030–9040
43. Anderson CW, Appella E (2004) Signaling to the p53 tumor suppressor through pathways activated by genotoxic and nongenotoxic stress. In: Bradshaw RA, Dennis EA (eds) Handbook of cell signaling. Academic Press, New York, pp 237–247
44. Gottlieb TM, Leal JF, Seger R, Taya Y, Oren M (2002) Cross-talk between Akt, p53 and Mdm2: possible implications for the regulation of apoptosis. Oncogene 21(8):1299–1303
45. Nicholson KM, Anderson NG (2002) The protein kinase B/Akt signalling pathway in human malignancy. Cell Signal 14(5):381–395
46. Abraham AG, O’Neill E (2014) PI3K/Akt-mediated regulation of p53 in cancer. Biochem Soc Trans 42(4):798–803
47. Muller PA, Vousden KH (2013) p53 mutations in cancer. Nat Cell Biol 15(1):2–8
48. Soussi T, Beroud C (2001) Assessing TP53 status in human tumours to evaluate clinical outcome. Nat Rev Cancer 1(3):233–240
49. Bookstein R (1994) Tumor suppressor genes in prostatic oncogenesis. J Cell Biochem Suppl 19:217–223
50. Pencik J, Wiebringhaus R, Susani M, Culig Z, Kenner L (2015) IL-6/STAT3/ARF: the guardians of senescence, cancer progression and metastasis in prostate cancer. Swiss Med Wkly 145:w14215
51. Wolff JM, Stephenson RN, Jakse G, Habib FK (1994) Retinoblastoma and p53 genes as prognostic indicators in urological oncology. Urol Int 53(1):1–5
52. Joerger AC, Ang HC, Veprintsev DB, Blair CM, Fersht AR (2005) Structures of p53 cancer mutants and mechanism of rescue by second-site suppressor mutations. J Biol Chem 280(16):16030–16037
53. Canadillas JM, Tidow H, Freund SM, Rutherford TJ, Ang HC, Fersht AR (2006) Solution structure of p53 core domain: structural basis for its instability. Proc Natl Acad Sci U S A 103(7):2109–2114
54. Wang Y, Rosengarth A, Luecke H (2007) Structure of the human p53 core domain in the absence of DNA. Acta Crystallogr D Biol Crystallogr 63(Pt 3):276–281
55. Joerger AC, Fersht AR (2008) Structural biology of the tumor suppressor p53. Annu Rev Biochem 77:557–582
56. Uversky VN, Oldfield CJ, Midic U, Xie H, Xue B, Vucetic S, Iakoucheva LM, Obradovic Z, Dunker AK (2009) Unfoldomics of human diseases: linking protein intrinsic disorder with diseases. BMC Genomics 10(Suppl 1):S7
57. Bianco R, Ciardiello F, Tortora G (2005) Chemosensitization by antisense oligonucleotides targeting MDM2. Curr Cancer Drug Targets 5(1):51–56
58. Moll UM, Petrenko O (2003) The MDM2-p53 interaction. Mol Cancer Res 1(14):1001–1008
59. Nag S, Qin J, Srivenugopal KS, Wang M, Zhang R (2013) The MDM2-p53 pathway revisited. J Biomed Res 27(4):254–271
60. Kussie PH, Gorina S, Marechal V, Elenbaas B, Moreau J, Levine AJ, Pavletich NP (1996) Structure of the MDM2 oncoprotein bound to the p53 tumor suppressor transactivation domain. Science 274(5289):948–953
61. Bochkareva E, Kaustov L, Ayed A, Yi GS, Lu Y, Pineda-Lucena A, Liao JC, Okorokov AL, Milner J, Arrowsmith CH, Bochkarev A (2005) Single-stranded DNA mimicry in the p53 transactivation domain interaction with replication protein A. Proc Natl Acad Sci U S A 102(43):15412–15417
62. Mora P, Carbajo RJ, Pineda-Lucena A, Sanchez del Pino MM, Perez-Paya E (2008) Solvent-exposed residues located in the beta-sheet modulate the stability of the tetramerization domain of p53--a structural and combinatorial approach. Proteins 71(4):1670–1685
63. Lowe ED, Tews I, Cheng KY, Brown NR, Gul S, Noble ME, Gamblin SJ, Johnson LN (2002) Specificity determinants of recruitment peptides bound to phospho-CDK2/cyclin A. Biochemistry 41(52):15625–15634
64. Avalos JL, Celic I, Muhammad S, Cosgrove MS, Boeke JD, Wolberger C (2002) Structure of a Sir2 enzyme bound to an acetylated p53 peptide. Mol Cell 10(3):523–535
65. Mujtaba S, He Y, Zeng L, Yan S, Plotnikova O, Sachchidanand SR, Zeleznik-Le NJ, Ronai Z, Zhou MM (2004) Structural mechanism of the bromodomain of the coactivator CBP in p53 transcriptional activation. Mol Cell 13(2):251–263
66. Rustandi RR, Baldisseri DM, Weber DJ (2000) Structure of the negative regulatory domain of p53 bound to S100B(betabeta). Nat Struct Biol 7(7):570–574
67. Oldfield CJ, Meng J, Yang JY, Yang MQ, Uversky VN, Dunker AK (2008) Flexible nets: disorder and induced fit in the associations of p53 and 14-3-3 with their partners. BMC Genomics 9(Suppl 1):S1
68. Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z (2006) Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics 7:208
69. Ehretsmann CP, Carpousis AJ, Krisch HM (1992) Specificity of Escherichia coli endoribonuclease RNase E: in vivo and in vitro analysis of mutants in a bacteriophage T4 mRNA processing site. Genes Dev 6(1):149–159
70. Huang H, Liao J, Cohen SN (1998) Poly(A)- and poly(U)-specific RNA 3′ tail shortening by E. coli ribonuclease E. Nature 391(6662):99–102
71. Kushner SR (2002) mRNA decay in Escherichia coli comes of age. J Bacteriol 184(17):4658–4665; discussion 4657
72. Ow MC, Kushner SR (2002) Initiation of tRNA maturation by RNase E is essential for cell viability in E. coli. Genes Dev 16(9):1102–1115
73. Steege DA (2000) Emerging features of mRNA decay in bacteria. RNA 6(8):1079–1090
74. Casaregola S, Jacq A, Laoudj D, McGurk G, Margarson S, Tempete M, Norris V, Holland IB (1992) Cloning and analysis of the entire Escherichia coli ams gene. ams is identical to hmp1 and encodes a 114 kDa protein that migrates as a 180 kDa protein. J Mol Biol 228(1):30–40
75. Claverie-Martin F, Diaz-Torres MR, Yancey SD, Kushner SR (1991) Analysis of the altered mRNA stability (ams) gene from Escherichia coli. Nucleotide sequence, transcriptional analysis, and homology of its product to MRP3, a mitochondrial ribosomal protein from Neurospora crassa. J Biol Chem 266(5):2843–2851
76. Lopez PJ, Marchand I, Joyce SA, Dreyfus M (1999) The C-terminal half of RNase E, which organizes the Escherichia coli degradosome, participates in mRNA degradation but not rRNA processing in vivo. Mol Microbiol 33(1):188–199
77. Cohen SN, McDowall KJ (1997) RNase E: still a wonderfully mysterious enzyme. Mol Microbiol 23(6):1099–1106
78. McDowall KJ, Cohen SN (1996) The N-terminal domain of the rne gene product has RNase E activity and is non-overlapping with the arginine-rich RNA-binding site. J Mol Biol 255(3):349–355
79. Wachi M, Umitsuki G, Shimizu M, Takada A, Nagai K (1999) Escherichia coli cafA gene encodes a novel RNase, designated as RNase G, involved in processing of the 5′ end of 16S rRNA. Biochem Biophys Res Commun 259(2):483–488
80. Kaberdin VR, Miczak A, Jakobsen JS, Lin-Chao S, McDowall KJ, von Gabain A (1998) The endoribonucleolytic N-terminal half of Escherichia coli RNase E is evolutionarily conserved in Synechocystis sp. and other bacteria but not the C-terminal half, which is sufficient for degradosome assembly. Proc Natl Acad Sci U S A 95(20):11637–11642
81. Callaghan AJ, Aurikko JP, Ilag LL, Gunter Grossmann J, Chandran V, Kuhnel K, Poljak L, Carpousis AJ, Robinson CV, Symmons MF, Luisi BF (2004) Studies of the RNA degradosome-organizing domain of the Escherichia coli ribonuclease RNase E. J Mol Biol 340(5):965–979
82. Taraseviciene L, Bjork GR, Uhlin BE (1995) Evidence for an RNA binding region in the Escherichia coli processing endoribonuclease RNase E. J Biol Chem 270(44):26391–26398
Chapter 20
Exploring Protein Conformational Diversity

Alexander Miguel Monzon, Maria Silvina Fornasari, Diego Javier Zea,
and Gustavo Parisi
Abstract
The native state of proteins is composed of conformers in dynamical equilibrium. In this chapter, different
issues related to conformational diversity are explored using a curated and experimentally based database
called CoDNaS (Conformational Diversity in the Native State). This database is a collection of redundant
structures for the same sequence. CoDNaS estimates the degree of conformational diversity using different
global and local structural similarity measures. It allows the user to explore how structural differences
among conformers change as a function of several structural features providing further biological informa-
tion. This chapter explores the measurement of conformational diversity and its relationship with sequence
divergence. Also, it discusses how proteins with high conformational diversity could affect homology
modeling techniques.
Key words Conformational diversity, CoDNaS database, Conformers, Native state, Protein dynam-
ics, Protein evolution
1 Introduction
Since the early crystallization studies on hemoglobin, it has been known
that two or more conformational states are required to sustain
biological function. The native state is then better represented by
an ensemble of alternative protein conformations in equilibrium. A
wide range of protein movements between conformers have been
explored. These range from large relative domain movements [1],
secondary and tertiary element rearrangements [2], and loop dis-
placements [3] to small residue rearrangements [4]. Structural dif-
ferences between these conformers define the conformational
diversity of the protein. In general, most proteins have a few well-
defined conformational states. Human hemoglobin has two well-
established T and R conformations [5], and the dimeric catabolite
activator protein [6] has three, just to mention two examples.
However, intrinsically disordered proteins (IDPs) have a native
state with multiple conformers characterized by their high
flexibility and mobility, defining very complex ensembles [7].
Whatever the mechanisms underlying conformational changes are, it is
clear that in many cases a protein requires switching among differ-
ent native structures to be functional. The conformational ensem-
ble concept becomes then a key tool to explain an endless list of
essential protein properties such as function [8–10], enzyme and
antibody promiscuity [11], signal transduction [12], protein-
protein recognition [13], origin of diseases [14], emergence of
new protein functions [15], evolutionary rate [16], and order-
disorder transitions [17], just to mention some of the most
important.
This chapter describes how to explore protein conformational
diversity using an experimentally based database. It gives some
practical advice for the analysis of its data and it highlights the
relevance of including conformational diversity in the study of
protein evolution and homology modeling.
2 Methods
2.1 Discovering Protein Conformational Diversity Through Structure Redundancy
The study of protein conformational diversity can be addressed
using redundant structures of the same protein obtained under
different experimental conditions (e.g., with or without substrate or
post-translational modifications, or at different pH values). This
ensemble of structures provides snapshots of the protein's dynamism
in its native state, and its members can be considered putative
native conformations [18, 19]. The sequence redundancy in the
Protein Data Bank (PDB) [20] is therefore an essential input to
experimentally based studies of protein conformational diversity
that provide insight into protein function. The continuous growth of
the PDB in recent years has granted access to many protein native
ensembles [21]. However, more effort is needed to address the
incompleteness of the PDB in terms of its dynamical information
content [22]. Different databases and methods have been developed
over the last 10 years to take advantage of this redundancy (Table 1).
Table 1
Databases of protein conformational diversity

Database name   URL                                             Protein chains
PDBFlex [64]    http://pdbflex.org/                             28,939
CoDNaS [24]     http://ufq.unq.edu.ar/codnas/                   21,152
DynDom [67]     http://fizz.cmp.uea.ac.uk/dyndom/               1,578
PSCDB [68]      http://idp1.force.cs.is.nagoya-u.ac.jp/pscdb/   839
MolMov [1]      http://www.molmovdb.org                         230
2.2 Exploring Protein Conformational Diversity in the Native State Using the CoDNaS Database
CoDNaS is a database of protein conformations derived from
experimental structures [23, 24]. CoDNaS contains a redundant
collection of 3D structures for each protein obtained under
different experimental conditions. It has an extensive annotation of
the different conditions under which each conformer was obtained.
CoDNaS also offers different structural similarity measures
among conformers such as global RMSD and TM score, as well as
local measures (e.g., RMSD per position expressed as Z-scores). In
this way, CoDNaS facilitates the analysis of key information on
small structural differences. CoDNaS allows users to easily relate
the degree of conformational diversity with physical, chemical, and
biological properties. The latest version of the CoDNaS database
(version 2.5) includes 73% of all available protein structures in the
PDB (21,152 different protein chains, 320,144 structures, and
an average of 15.09 conformers per protein), and provides different
tools to run sequence searches, display structural flexibility profiles,
visualize structural alignments and bound ligands, and allow
users to browse the database by different structural classes.
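As an illustration of a local measure, the sketch below computes the deviation at each position between two conformers and expresses it as Z-scores. It assumes the two conformers are already superimposed and share the same positions; the coordinates are toy values, and the function names are hypothetical. The exact procedure used by CoDNaS may differ.

```python
import math
from statistics import mean, pstdev


def per_position_deviation(conf_a, conf_b):
    """Distance (in Å) between equivalent C-alpha atoms of two superimposed conformers."""
    if len(conf_a) != len(conf_b):
        raise ValueError("conformers must have the same number of aligned positions")
    return [math.dist(a, b) for a, b in zip(conf_a, conf_b)]


def global_rmsd(deviations):
    """Global RMSD computed from the per-position deviations."""
    return math.sqrt(mean(d * d for d in deviations))


def z_scores(values):
    """Standardize values so that flexible positions stand out as high Z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy, already superimposed C-alpha traces (hypothetical coordinates).
conformer_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
conformer_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 4.0, 0.0)]

deviations = per_position_deviation(conformer_a, conformer_b)
profile = z_scores(deviations)
print(round(global_rmsd(deviations), 2), [round(z, 2) for z in profile])
```

In this toy case only the last position moves, so it is the only one with a positive Z-score.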
2.2.1 Database Implementation, Biological Annotation, and External Links
Different conformers for each protein were identified and extracted
from the PDB using the following protocol:
– BLASTClust [25] was run against all protein chains deposited in
the PDB to obtain all available clusters at 95% local sequence
identity with a minimum coverage of 0.90 between all the
sequences in the cluster. The 95% limit was set to allow for
putative sequence variations of a given protein. However, to
avoid including homologous structures in a given CoDNaS entry,
UniProt accession numbers were used to check that all conformers
belong to the same protein.
– Only clusters with at least two structures were considered, and
each crystallographic structure was required to have a resolution
better than 4.00 Å.
– To estimate the structural dissimilarity between conformers in
each cluster, the C-alpha root mean square deviation (RMSD) was
calculated with MAMMOTH (see Note 3) [26] for all possible pairs
of conformers of each protein. The maximum C-alpha RMSD value for
each protein entry was recorded as a measure of the extent of its
conformational diversity.
– Additionally, all conformers of a given protein were clustered
using a hierarchical procedure according to the RMSD values
between them. This enables users to identify different
conformational substates present in the native state of the protein.
– Furthermore, CoDNaS is cross-linked with other databases:
UniProt [27]; Class, Architecture, Topology and Homology
(CATH) [28]; Enzyme Commission [29]; MobiDB [30]; and
Gene Ontology [31].
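The third step of the protocol above (maximum pairwise C-alpha RMSD as the extent of conformational diversity) can be sketched as follows. This is a minimal illustration and not the CoDNaS implementation: it assumes the conformers have already been structurally aligned (the chapter uses MAMMOTH for that step), and the conformer names and coordinates are hypothetical.

```python
import math
from itertools import combinations


def rmsd(conf_a, conf_b):
    """RMSD (in Å) between two superimposed conformers with equivalent atoms."""
    if len(conf_a) != len(conf_b) or not conf_a:
        raise ValueError("conformers need the same, non-zero number of atoms")
    squared = [math.dist(a, b) ** 2 for a, b in zip(conf_a, conf_b)]
    return math.sqrt(sum(squared) / len(squared))


def conformational_diversity(conformers):
    """Return (max_rmsd, pair) over all pairs of conformers.

    `conformers` maps a conformer name to its list of C-alpha coordinates.
    """
    if len(conformers) < 2:
        raise ValueError("at least two conformers are required")
    pairs = combinations(sorted(conformers), 2)
    scored = [(rmsd(conformers[a], conformers[b]), (a, b)) for a, b in pairs]
    return max(scored)


# Hypothetical two-atom conformers, already superimposed.
toy_conformers = {
    "apo": [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)],
    "holo": [(0.0, 0.0, 0.0), (3.8, 1.0, 0.0)],
    "open": [(0.0, 0.0, 0.0), (3.8, 3.0, 0.0)],
}

max_rmsd, max_pair = conformational_diversity(toy_conformers)
print(round(max_rmsd, 2), max_pair)
```

The pair returned alongside the value corresponds to what the chapter calls the pair of maximum conformational diversity.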
2.2.2 Working Case: Conformational Diversity of the Ephrin Type-A Receptor 4
Human ephrin type-A receptor 4 (EphA4) is a tyrosine kinase
receptor. Eph receptors and their ephrin ligands are both anchored
onto the plasma membrane and are subdivided into two subclasses
(A and B) based on their sequence conservation and binding
preferences [32]. In general, type-A receptors bind type-A ephrins,
but EphA4 is the only receptor capable of binding all nine
ephrins, as well as other small molecules, through overlapping
interfaces. The binding pattern of EphA4 can be explained by
exploring its ensemble of conformers. EphA4 has two groups of
conformers, closed and open forms, which have been biologically
characterized and identified by molecular dynamics simulations and
NMR studies [33]. Hence, the open and closed conformations of EphA4
can be easily explored in CoDNaS using the information provided by
the hierarchical clustering based on the RMSD values between all
pairs of conformers (Fig. 1). It is interesting to note the
differences between the 29 conformers available in CoDNaS: ten of
them were obtained by nuclear magnetic resonance (NMR) and 19 by
X-ray diffraction. The protein can be found in CoDNaS by searching
for its UniProt accession number “P54764”, which leads to the entry
page (the protein pool identifier in CoDNaS is “2WO1_A”). The entry
page includes a set of boxes with different information about the
protein, such as protein overview, structural information,
conformers, clusters of conformational states, and information about
the pair of maximum conformational diversity. EphA4 has a maximum
conformational diversity of RMSD = 3.23 Å between the structures
2WO3 chain A and 2LW8 chain A, model 7.
Fig. 1 Dendrogram of the EphA4 conformations. We can observe different conformational substates due to the
experimental method used and transitions between open (red) and closed (blue) conformations. Filled nodes
indicate that the conformer has a bound ligand
Fig. 2 Comparison between conformers of the EphA4 based on clustering information. (a) Superimposition of
ten conformers from the NMR ensemble (PDB code = 2LW8). (b) Superimposition of 16 closed conformations
(blue) of the EphA4. (c) Superimposition of three open conformations (red) of the EphA4
Figure 1 shows two main groups at the top, one containing all
NMR conformers and the other containing all X-ray conformers
(see also Note 1). Among the group of X-ray conformers, we can
observe two branches which separate open and closed conforma-
tions of the EphA4. Filled nodes indicate conformers in complex
with the ligand. Superimposition of these three different groups
(NMR, X-ray closed, and X-ray open) reveals a high conformational
variability in the regions of the B–C, D–E, G–H, and J–K loops (see
Fig. 2) [34]. In particular, the flexibility of the D–E and J–K loops,
which move upon binding to ephrin ligands, may be directly asso-
ciated with EphA4 function and binding pattern.
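A dendrogram such as the one in Fig. 1 comes from hierarchical clustering of the pairwise RMSD values. The sketch below groups conformers by average-linkage agglomerative clustering and stops merging at a distance cutoff. The linkage criterion, the cutoff, and the toy RMSD values are assumptions made for illustration; they are not taken from CoDNaS or from the EphA4 entry.

```python
def average_linkage_clusters(names, dist, cutoff):
    """Merge the two closest clusters until their average distance exceeds `cutoff`.

    `dist` maps a pair of conformer names (in either order) to their RMSD in Å.
    """
    clusters = [[name] for name in names]

    def pair_distance(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    def linkage(c1, c2):
        total = sum(pair_distance(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        candidates = [(i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        i, j = min(candidates, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > cutoff:
            break  # remaining clusters are treated as distinct substates
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


# Hypothetical pairwise RMSD values (Å) for four conformers.
toy_names = ["closed_1", "closed_2", "open_1", "open_2"]
toy_rmsd = {
    ("closed_1", "closed_2"): 0.4,
    ("open_1", "open_2"): 0.5,
    ("closed_1", "open_1"): 2.8,
    ("closed_1", "open_2"): 3.0,
    ("closed_2", "open_1"): 2.9,
    ("closed_2", "open_2"): 3.1,
}

substates = average_linkage_clusters(toy_names, toy_rmsd, cutoff=1.0)
print(substates)
```

With these toy values the closed and open conformers end up in two separate groups, which mirrors the kind of separation described for the X-ray conformers.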
2.3 Practical Issues Concerning Conformational Diversity
2.3.1 How Large Are Conformational Changes in Known Structural Space?
The extent of conformational diversity was studied in a curated
dataset (see Note 2) of ~5000 proteins with more than 5 conformers
(see Note 6) per protein [35]. This study found three protein classes
based on their dynamical behavior: rigid, malleable, and partially
disordered proteins. Approximately 60% of the analyzed proteins
belong to the first group, the rigid proteins. The conformational
diversity of each protein was measured as the maximum RMSD (see
Note 5) after an all-versus-all pairwise comparison of conformers.
The RMSD distribution of rigid proteins peaks at 0.8 Å, a value
close to the crystallographic error, which is near 0.5 Å (see
Fig. 3). This result agrees with earlier studies that found a
positively skewed distribution of RMSD [19, 36]. It also agrees with
previous work that found an average RMSD of 0.5 Å between structures
of the same protein in unbound states, a value slightly different
from that observed between apo and substrate-bound forms [37].
Apparently, large-scale protein motions are not necessary to sustain
biological function in the majority of the studied proteins. This
observation is supported by the finding that even small changes
between conformers can greatly affect the catalytic parameters and
biological behavior of enzymes [38, 39].
Fig. 3 The conformational diversity distribution can be represented by three
main sets of proteins: rigid (all protein conformers without intrinsically
disordered regions, IDRs), partially disordered (with IDRs at least in one
conformer and also in the pair of maximum conformational diversity), and
malleable (with IDRs at least in one conformer of the protein but the pair of
maximum conformational diversity remains ordered)
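The three classes defined in the caption of Fig. 3 can be expressed as a small decision rule. The sketch below is only an illustration: the `Conformer` record and `classify_protein` function are hypothetical names, and it assumes that "IDRs in the pair of maximum conformational diversity" means an IDR in at least one member of that pair.

```python
from dataclasses import dataclass


@dataclass
class Conformer:
    name: str
    has_idr: bool  # True if this conformer shows an intrinsically disordered region


def classify_protein(conformers, max_pair):
    """Classify a protein as rigid, malleable, or partially disordered.

    `max_pair` holds the names of the two conformers with the maximum RMSD.
    """
    by_name = {c.name: c for c in conformers}
    if not any(c.has_idr for c in conformers):
        return "rigid"
    # Assumption: one disordered member of the pair is enough.
    if any(by_name[name].has_idr for name in max_pair):
        return "partially disordered"
    return "malleable"


# Hypothetical conformer sets.
rigid_set = [Conformer("c1", False), Conformer("c2", False)]
malleable_set = [Conformer("c1", False), Conformer("c2", False), Conformer("c3", True)]
disordered_set = [Conformer("c1", True), Conformer("c2", False)]

print(classify_protein(rigid_set, ("c1", "c2")))
print(classify_protein(malleable_set, ("c1", "c2")))
print(classify_protein(disordered_set, ("c1", "c2")))
```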
2.3.2 Which Kind of Proteins Have Larger Conformational Changes?
The tail of the distribution shown in Fig. 3 contains mainly IDPs,
malleable and partially disordered proteins in particular, and a
minor proportion of globular or ordered proteins [35]. It is
important to note that IDPs contain very flexible regions, which
often appear as missing residues in the structures derived from
crystallographic studies. Almost half of these IDPs show
order-disorder transitions; that is, they have regions that are
disordered in one group of conformers but ordered in alternative
conformations. Surprisingly, regions gaining order upon ligand
binding are almost as common as regions becoming disordered upon
binding. IDPs showing order-disorder transitions reach the highest
RMSD values in their aligned ordered regions [17]. The high
RMSD values between conformers are related to the increase of
structural differences in the globular or ordered region of IDPs.
These differences can be high due to very flexible loops or regions
adopting variable conformations (e.g., malleable and partially
disordered proteins).
Regarding globular or ordered proteins, large conformational
movements have been previously described by M. Gerstein in
the MolMov database [1]. Most of the changes involve domains
and/or fragments such as loops, normally moving as rigid bodies with
hinge, shear, or more complex motions [40].
What is the relationship between sequence variation and conformational diversity?
Several studies highlight the existence of evolutionary signals
coming from the conformational diversity of proteins [36]. For
instance, it has been shown that proteins with large conformational
diversity show lower evolutionary rates than proteins with more
similar conformers [16]. It is reasonable to think that different
conformers impose different structural constraints in protein evo-
lution. Consequently, it is possible that the reduction in the rate of
nonsynonymous mutations could be an effect of the increment in
the structural constraints in the presence of different conformers.
This is supported by the fact that 30% of the structurally con-
strained sites in a protein are conformer specific. In particular,
conformers with the highest affinity to their ligands seem to be
the ones that have more constraints on the divergence of their
sequences [41].
Protein dynamism imposes different kinds of constraints on
sequence variation. For example, residues playing a key role in
hinge regions during collective movements, as well as other
dynamically important positions, tend to be conserved throughout
evolution [42, 43]. Also, there are coevolving residue pairs that
facilitate the transition between conformers by cooperatively
forming and breaking contacts [44]. Coevolving residue pairs can
lead to covariation signals that are detectable in a multiple
sequence alignment [45]. Since residues in contact have a propensity
to coevolve, detected covariation pairs can be used to predict
protein tertiary structures [46]. In particular, coevolving residues
that result from conformer-specific contacts could be used to predict
the structure of different conformers and transition states [47]. A
couple of studies show that covariation methods are good at
predicting contacts that are conserved in all the structures of a
given protein [48, 49]. Consequently, covariation methods perform
poorly at predicting conformer-specific contacts. However,
this coevolutionary information can be used together with molecular
dynamics to predict protein conformers when at least one
structure and 2000 homologous sequences are available [50].
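To make the idea of a covariation signal concrete, the sketch below computes the mutual information between two columns of a multiple sequence alignment, one of the simplest covariation scores. It is not the method of any of the cited studies; practical tools usually add corrections for phylogeny, sampling, and background signal. The toy alignment and function names are hypothetical.

```python
import math
from collections import Counter


def column(msa, index):
    """Extract one alignment column from a list of equal-length aligned sequences."""
    return [sequence[index] for sequence in msa]


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two columns, ignoring gapped pairs."""
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != "-" and b != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    marginal_i = Counter(a for a, _ in pairs)
    marginal_j = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (marginal_i[a] * marginal_j[b]))
    return mi


# Toy alignment: columns 0 and 1 covary, column 2 is fully conserved.
toy_msa = ["AKLD", "AKLD", "GELD", "GELD"]

covarying = mutual_information(column(toy_msa, 0), column(toy_msa, 1))
conserved = mutual_information(column(toy_msa, 0), column(toy_msa, 2))
print(covarying, conserved)
```

A fully conserved column carries no covariation signal, which is one reason why such scores need enough sequence diversity in the alignment.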
2.3.3 Importance of the Conformational Diversity in Homology Modeling
Template-based modeling (TBM) is based on the fact that homologous
proteins with detectable sequence similarity possess similar
3D structures. Pioneering work by Chothia and Lesk found that
structural divergence increases with evolutionary distance,
measured as identity percentage, following a nonlinear relationship
[51]. Very similar sequences show modest structural differences,
which suddenly increase when the percentage of sequence identity
drops below 30%. Their results and conclusions have been verified
by numerous studies [52–56]. These studies found moderate-to-high
correlation coefficients between different parameters related
to structural and sequence similarity, i.e., RMSD versus identity
percentage and evolutionary distance. They also found linear and
nonlinear behavior, and an invariably low structural variation at
100% identity (~0.5 Å). However, when conformational diversity
is taken into account the relationship between sequence and
structural divergence is more complex [57]. Figure 4 shows how
RMSDs between homologous proteins change as a function of
identity percentage. This figure was derived from an all-versus-all
pairwise alignment between all the conformers for 2024 proteins
from 524 families. It is possible to observe that at around 100%
identity (the conformational diversity of the protein) (see Note 4)
several proteins show RMSDs as high as those reached by sequence
divergence during evolution (say about 30–40% identity). This
means that the structural divergence is a complex process since a
given sequence (at 100% identity) could reach several angstroms of
conformational (structural) diversity. Interestingly, if we split the
population of proteins according to their degree of conformational
diversity (rigid and highly dynamic proteins) we can observe in
Fig. 5 that the rigid proteins could certainly be more suitable to
TBM methodologies than highly dynamic ones. The rigid proteins show
an average RMSD of 0.39 Å at 100% identity, meaning that more
similar sequences have more similar structures. This last statement,
basic to TBM reliability, apparently is not true for highly dynamic
proteins (average RMSD at 100% identity of 1.17 Å).
Fig. 4 RMSD versus percent of sequence identity. RMSD values were obtained from an all-versus-all
comparison between two homologous proteins considering all their conformers. The figure contains about
3.5 million comparisons
Fig. 5 Maximum RMSD versus sequence percent identity. Points refer to the maximum RMSD obtained from
an all-versus-all comparison between conformers from two homologous proteins. Red dots are pairs of highly
dynamic homologous proteins (conformational diversity >0.5 Å) and blue dots are pairs of rigid proteins
(conformational diversity ≤0.5 Å)
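The split used for Fig. 5 can be sketched as a simple partition of comparison records at the 0.5 Å threshold given in the caption, followed by an average of the RMSD at 100% identity in each group. The record layout (keys `identity`, `rmsd`, and `cd`) and the values are hypothetical; in particular, the sketch assumes that each record carries a single conformational diversity value.

```python
from statistics import mean

CD_THRESHOLD = 0.5  # Å, threshold stated in the caption of Fig. 5


def split_by_dynamics(records):
    """Separate records into rigid (cd <= threshold) and highly dynamic (cd > threshold)."""
    rigid = [r for r in records if r["cd"] <= CD_THRESHOLD]
    dynamic = [r for r in records if r["cd"] > CD_THRESHOLD]
    return rigid, dynamic


def mean_rmsd_at_identity(records, identity=100.0, tolerance=1e-6):
    """Average RMSD of the records at a given percent identity, or None if there are none."""
    values = [r["rmsd"] for r in records if abs(r["identity"] - identity) < tolerance]
    return mean(values) if values else None


# Hypothetical comparison records (identity in %, rmsd and cd in Å).
toy_records = [
    {"identity": 100.0, "rmsd": 0.3, "cd": 0.3},
    {"identity": 100.0, "rmsd": 0.5, "cd": 0.5},
    {"identity": 100.0, "rmsd": 1.2, "cd": 1.2},
    {"identity": 35.0, "rmsd": 2.1, "cd": 0.4},
    {"identity": 40.0, "rmsd": 3.0, "cd": 1.8},
]

rigid, dynamic = split_by_dynamics(toy_records)
print(mean_rmsd_at_identity(rigid), mean_rmsd_at_identity(dynamic))
```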
3 Notes
1. The differences among protein structures obtained by NMR
and by X-ray diffraction are well established [58]. In conformational
analysis, we should avoid mixing NMR and X-ray
conformers in order to prevent biases in the RMSD values of
our analysis. It is well known that NMR ensembles have larger
RMSD values between models than conformers obtained by
X-ray. However, having conformers obtained by these two
methods provides us with a complementary source of informa-
tion on protein conformational diversity by making sure that
flexibility information is well reproduced in both cases [18].
2. X-ray resolution is another important aspect to take into
account in the ensembles of structures. It is always recom-
mended to use structures with a resolution of 2.5 Å or better
for good backbone and side-chain estimations. Despite that,
we can use structures with a resolution of 2.6–4 Å to estimate
backbone RMSD. More information about structure resolu-
tion can be found at http://proteopedia.org/wiki/index.php/
Resolution.
3. The problem of quantifying the differences between two struc-
tures of the same protein is nontrivial. There are several meth-
ods to calculate the global RMSD between a pair of structures
and the magnitude of the RMSD value depends on the struc-
tural alignment algorithm used. A good review of structural
comparison methods can be found in Ref. [59]. In CoDNaS,
MAMMOTH [26] was used to align each pair of conformers
and estimate the global RMSD. MAMMOTH is a sequence-
independent structural alignment program which provides the
user with statistical reliability data of the results. MAMMOTH
uses the MaxSub algorithm [60] to identify the maximal subset
from a set of paired-up atoms which are spatially close. MAM-
MOTH defines “close” as a distance of <4 Å after superposi-
tion. This is the reason why MAMMOTH RMSD values are
mainly between 0 and 4 Å. In addition to this, it is protein
length independent.
4. Even if conformers belong to the same protein, the percentage
of identity after a structural or sequential alignment could be
less than 100%. Sometimes the structure of a protein is resolved
after introducing some point mutations to the sequence. These
mutations introduce mismatches in the alignment. Another
source of mismatches is the gaps introduced by missing resi-
dues in structural alignment of crystallographic structures. The
best way to check whether the structure corresponds to a
specific protein or not is mapping the PDB code to the UniProt
accession numbers. SIFTS offers residue-level mapping from
PDB to UniProt [61]. It is possible to use the MIToS package
to parse SIFTS XML files and use their residue-level mapping
to guide rigid structural alignments [62].
5. It is well known that crystallographic contacts affect crystallo-
graphic structures. However, several works found that the
RMSD pattern observed between many different crystallo-
graphic structures for the same protein follows a trend derived
from the inherent flexibility of the protein rather than from the
crystallization conditions [18, 19, 63, 64]. Mainly, they found
that particular structural variations between different structures
from the same protein are independent of crystallization con-
ditions. Thus, different or identical ligands, even close homol-
ogous proteins, could or could not show structural variations
derived from the experimental conditions used for structure
determination. Also, Sikic and co-workers have found that loop
flexibility is independent of crystal packing contacts [58]. Con-
sequently, in general, the pattern of flexibility in a protein is
robust to the structural bias introduced by crystal packing or
crystallographic contacts. In previous works the effect of crystal
contacts on backbone flexibility has been explored using nor-
mal mode analysis [65]. It was found that correlations between
backbone flexibility profiles, predicted using simple structure-
based methods and experimental profiles, have shown to be
almost identical when sites involved in crystal contacts were
included or removed. Besides these findings, we evaluated the
influence of crystallographic contacts in the estimation of the
conformational diversity of a protein. For this purpose, a subset
of monomeric proteins from CoDNaS was taken, with only
one protein chain in the asymmetric unit (392 pairs of con-
formers) in order to remove heterocomplexes. Using UCSF
Chimera [66] the number of crystallographic contacts at 4.5 Å
of distance in both conformers of the pair showing the maxi-
mum RMSD for that protein was estimated. The number of
contacts and their correlation with the maximum RMSD were
studied. If the maximum RMSD were affected by crystallo-
graphic contacts, a high correlation (negative or positive)
between mean number of contact atoms and RMSD value
would have been expected. However, we obtained a negligible
Spearman’s correlation coefficient of 0.048.
6. To make a robust estimation of protein conformational diver-
sity using the RMSD between conformers, it is important to
take into account the number of conformers per protein. Pro-
teins with more than five conformers are suggested in order to
study backbone flexibility. Moreover, approximately 20 would
be the suggested number for a reliable study on side-chain
heterogeneity [18].
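The check described in Note 5 relies on Spearman's rank correlation between the number of crystallographic contacts and the maximum RMSD. A minimal sketch of that statistic is shown below, using only the standard library; the input values are toy numbers and not the data behind the reported coefficient of 0.048.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: mean number of contact atoms and maximum RMSD (Å) per protein.
toy_contacts = [12, 30, 18, 25, 40, 22]
toy_max_rmsd = [0.8, 0.6, 1.9, 0.4, 1.1, 0.7]

rho = spearman(toy_contacts, toy_max_rmsd)
print(round(rho, 3))
```

A coefficient close to zero, as reported in Note 5, indicates no monotonic relationship between the two quantities.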
Acknowledgments
The authors would like to thank Paula Benencio for helping us with
manuscript proofreading.
References
1. Gerstein M, Lesk AM, Chothia C (1994) Structural mechanisms for domain movements in proteins. Biochemistry 33:6739–6749
2. Gerstein M, Krebs W (1998) A database of macromolecular motions. Nucleic Acids Res 26:4280–4290
3. Gu Y, Li D-W, Brüschweiler R (2015) Decoding the mobility and time scales of protein loops. J Chem Theory Comput 11:1308–1314
4. Gora A, Brezovsky J, Damborsky J (2013) Gates of enzymes. Chem Rev 113:5871–5923
5. Perutz MF, Bolton W, Diamond R et al (1964) Structure of haemoglobin. An X-ray examination of reduced horse haemoglobin. Nature 203:687–690
6. Popovych N, Sun S, Ebright RH et al (2006) Dynamically driven protein allostery. Nat Struct Mol Biol 13:831–838
7. Dunker AK, Keith Dunker A, Silman I et al (2008) Function and structure of inherently disordered proteins. Curr Opin Struct Biol 18:756–764
8. Boehr DD, McElheny D, Dyson HJ et al (2006) The dynamic energy landscape of dihydrofolate reductase catalysis. Science 313:1638–1642
9. Tsai CJ, Del Sol A, Nussinov R (2009) Protein allostery, signal transmission and dynamics: a classification scheme of allosteric mechanisms. Mol BioSyst 5:207–216
10. Hilser VJ (2010) Biochemistry. An ensemble view of allostery. Science 327:653–654
11. James LC, Roversi P, Tawfik DS (2003) Antibody multispecificity mediated by conformational diversity. Science 299:1362–1367
12. Smock RG, Gierasch LM (2009) Sending signals dynamically. Science 324:198–203
13. Yogurtcu ON, Bora Erdemli S, Nussinov R et al (2008) Restricted mobility of conserved residues in protein-protein interfaces in molecular simulations. Biophys J 94:3475–3485
14. Lynch TJ, Bell DW, Sordella R et al (2004) Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N Engl J Med 350:2129–2139
15. Tokuriki N, Stricher F, Serrano L et al (2008) How protein stability and new functions trade off. PLoS Comput Biol 4:e1000002
16. Zea DJ, Miguel Monzon A, Fornasari MS et al (2013) Protein conformational diversity correlates with evolutionary rate. Mol Biol Evol 30:1500–1503
17. Zea DJ, Monzon AM, Gonzalez C et al (2016) Disorder transitions and conformational diversity cooperatively modulate biological function in proteins. Protein Sci 25:1138–1146
18. Best RB, Lindorff-Larsen K, DePristo MA et al (2006) Relation between native ensembles and experimental structures of proteins. Proc Natl Acad Sci U S A 103:10901–10906
19. Burra PV, Zhang Y, Godzik A et al (2009) Global distribution of conformational states derived from redundant models in the PDB points to non-uniqueness of the protein structure. Proc Natl Acad Sci U S A 106:10505–10510
20. Berman HM, Westbrook J, Feng Z et al (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242
21. Wei G, Xi W, Nussinov R et al (2016) Protein ensembles: how does nature harness thermodynamic fluctuations for life? The diverse functional roles of conformational ensembles in the cell. Chem Rev 116:6516. https://doi.org/10.1021/acs.chemrev.5b00562
22. Marino-Buslje C, Monzon AM, Zea DJ et al (2017) On the dynamical incompleteness of the Protein Data Bank. Brief Bioinform. https://doi.org/10.1093/bib/bbx084
23. Monzon AM, Juritz E, Fornasari MS et al (2013) CoDNaS: a database of conformational diversity in the native state of proteins. Bioinformatics 29:2512–2514
24. Monzon AM, Rohr CO, Fornasari MS et al (2016) CoDNaS 2.0: a comprehensive database of protein conformational diversity in the native state. Database 2016:baw038
25. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410
26. Ortiz AR, Strauss CEM, Olmea O (2002) MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 11:2606–2621
27. The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res 45:D158–D169
28. Sillitoe I, Lewis TE, Cuff A et al (2015) CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res 43:D376–D381
29. Bairoch A (2000) The ENZYME database in 2000. Nucleic Acids Res 28:304–305
30. Potenza E, Di Domenico T, Walsh I et al (2015) MobiDB 2.0: an improved database of intrinsically disordered and mobile proteins. Nucleic Acids Res 43:D315–D320
31. Ashburner M, Ball CA, Blake JA et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29
32. Qin H, Shi J, Noberini R et al (2008) Crystal structure and NMR binding reveal that two small molecule antagonists target the high affinity ephrin-binding channel of the EphA4 receptor. J Biol Chem 283:29473–29484
33. Qin H, Lim L, Song J (2012) Protein dynamics at Eph receptor-ligand interfaces as revealed by crystallography, NMR and MD simulations. BMC Biophys 5:2
34. Bowden TA, Aricescu AR, Nettleship JE et al (2009) Structural plasticity of eph receptor A4 facilitates cross-class ephrin signaling. Structure 17:1386–1397
35. Monzon AM, Zea DJ, Fornasari MS et al (2017) Conformational diversity analysis reveals three functional mechanisms in proteins. PLoS Comput Biol 13:1–29
36. Parisi G, Zea DJ, Monzon AM et al (2015) Conformational diversity and the emergence of sequence signatures during evolution. Curr Opin Struct Biol 32:58–65
37. Gutteridge A, Thornton J (2005) Conformational changes observed in enzyme crystal structures upon substrate binding. J Mol Biol 346:21–28
38. Mesecar AD, Stoddard BL, Koshland DE Jr (1997) Orbital steering in the catalytic power of enzymes: small structural changes with large catalytic consequences. Science 277:202
39. Koshland DE (1998) Conformational changes: how small is big enough? Nat Med 4:1112–1114
40. Rashin AA, Rashin AHL, Jernigan RL (2010) Diversity of function-related conformational changes in proteins: coordinate uncertainty, fragment rigidity, and stability. Biochemistry 49:5683–5704
41. Juritz E, Palopoli N, Fornasari S et al (2013) Protein conformational diversity modulates sequence divergence. Mol Biol Evol 30:79–87
42. Liu Y, Bahar I (2012) Sequence evolution correlates with structural dynamics. Mol Biol Evol 29:2253–2263
43. Saldaño TE, Monzon AM, Parisi G et al (2016) Evolutionary conserved positions define protein conformational diversity. PLoS Comput Biol 12:e1004775
44. Jeon J, Nam H-J, Choi YS et al (2011) Molecular evolution of protein conformational changes revealed by a network of evolutionarily coupled residues. Mol Biol Evol 28:2675–2685
45. Codoñer FM, Fares MA (2008) Why should we care about molecular coevolution? Evol Bioinformatics Online 4:29–38
46. de Oliveira SHP, Shi J, Deane CM (2017) Comparing co-evolution methods and their application to template-free protein structure prediction. Bioinformatics 33:373–381
47. Morcos F, Jana B, Hwa T et al (2013) Coevolutionary signals across protein lineages help capture multiple protein conformations. Proc Natl Acad Sci U S A 110:20533–20538
48. Rodriguez-Rivas J, Marsili S, Juan D et al (2016) Conservation of coevolving protein interfaces bridges prokaryote–eukaryote homologies in the twilight zone. Proc Natl Acad Sci U S A 113:15018–15023
49. Zea DJ, Monzon AM, Parisi G et al (2018) How is structural divergence related to evolutionary information? Mol Phylogenet Evol. https://doi.org/10.1016/j.ympev.2018.06.033
50. Sfriso P, Duran-Frigola M, Mosca R et al (2016) Residues coevolution guides the systematic identification of alternative functional conformations in proteins. Structure 24:116–126
51. Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5:823–826
52. Koehl P, Levitt M (2002) Sequence variations within protein families are linearly related to structural variations. J Mol Biol 323:551–562
53. Hubbard TJ, Blundell TL (1987) Comparison of solvent-inaccessible cores of homologous proteins: definitions useful for protein modelling. Protein Eng 1:159–171
54. Russell RB, Barton GJ (1994) Structural features can be unconserved in proteins with similar folds. An analysis of side-chain to side-chain contacts secondary structure and accessibility. J Mol Biol 244:332. https://doi.org/10.1006/jmbi.1994.1733
55. Wen B, Lampe JN, Roberts AG et al (2005) Evolutionary plasticity of protein families: coupling between sequence and structure variation. Proteins 61:535–544
56. Illergård K, Ardell DH, Elofsson A (2009) Structure is three to ten times more conserved than sequence--a study of structural response in protein cores. Proteins 77:499–508
57. Monzon AM, Zea DJ, Marino-Buslje C et al (2017) Homology modeling in a dynamical world. Protein Sci 26:2195
58. Sikic K, Tomic S, Carugo O (2010) Systematic comparison of crystal and NMR protein structures deposited in the protein data bank. Open Biochem J 4:83–95
59. Kufareva I, Abagyan R (2012) Methods of protein structure comparison. In: Orry AJW, Abagyan R (eds) Homology modeling: methods and protocols. Humana Press, Totowa, NJ, pp 231–257
60. Siew N, Elofsson A, Rychlewski L et al (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics 16:776–785
61. Velankar S, Dana JM, Jacobsen J et al (2013) SIFTS: structure integration with function, taxonomy and sequences resource. Nucleic Acids Res 41:D483–D489
62. Zea DJ, Anfossi D, Nielsen M et al (2016) MIToS.jl: Mutual information tools for protein sequence analysis in the Julia language. Bioinformatics 33(4):564–565
63. Zoete V, Michielin O, Karplus M (2002) Relation between sequence and structure of HIV-1 protease inhibitor complexes: a model system for the analysis of protein flexibility. J Mol Biol 315:21–52
64. Hrabe T, Li Z, Sedova M et al (2016) PDBFlex: exploring flexibility in protein structures. Nucleic Acids Res 44:D423–D428
65. Maguid S, Fernández-Alberti S, Parisi G et al (2006) Evolutionary conservation of protein backbone flexibility. J Mol Evol 63:448–457
66. Pettersen EF, Goddard TD, Huang CC et al (2004) UCSF chimera--a visualization system for exploratory research and analysis. J Comput Chem 25:1605–1612
67. Lee RA, Razaz M, Hayward S (2003) The DynDom database of protein domain motions. Bioinformatics 19:1290–1291
68. Amemiya T, Koike R, Kidera A et al (2012) PSCDB: a database for protein structural change upon ligand binding. Nucleic Acids Res 40:D554–D558
Chapter 21
High-Throughput Antibody Structure Modeling and Design Using ABodyBuilder
Jinwoo Leem and Charlotte M. Deane
Abstract
Antibodies are proteins of the adaptive immune system; they can be designed to bind almost any molecule,
and are increasingly being used as biotherapeutics. Experimental antibody design is an expensive and time-
consuming process, and computational antibody design methods can now be used to help develop new
therapeutics and diagnostics. Within the design pipeline, accurate antibody structure modeling is essential,
as it provides the basis for antibody-antigen docking, binding affinity prediction, and estimating thermal
stability. Ideally, models should be rapidly generated, allowing the exploration of the breadth of antibody
space. This allows methods to replicate the natural processes of antibody diversification (e.g., V(D)J
recombination and somatic hypermutation), and cope with large volumes of data that are typical of next-
generation sequencing datasets. Here we describe ABodyBuilder and PEARS, algorithms that build and
mutate antibody model structures. These methods take ~30 s to generate a model antibody structure.
Key words Antibody structure prediction, Side-chain prediction, Accuracy estimation, Developability
1 Introduction
In vertebrate organisms, antibodies are produced by B cells as part
of the adaptive immune response. Through immunoglobulin gene
recombination and somatic hypermutation, it is theoretically possible
to generate ~10^11 antibodies [1], each of which is specific for a
foreign molecule (also known as an “antigen”). Antibodies bind
their targets with high binding affinity, typically in the nanomolar
range [2]. The level of antibody diversity, along with their unique
binding properties, has led to an interest in designing antibodies for
a wide range of applications, especially as novel biotherapeutics
[e.g., 3–6].
Antibodies are composed of four polypeptide chains: two
“heavy” chains and two “light” chains (Fig. 1). Each chain is
made up of multiple immunoglobulin domains, and is split into
two regions: the variable (V) and constant (C). The V regions of the
heavy (VH) and light (VL) chains combine to form the variable
fragment (Fv), which is responsible for antigen recognition. Each V
region has three loops, which have the highest degree of sequence
and structural diversity between different antibodies [7]. These are
known as the complementarity-determining region (CDR) loops.
Three CDRs from the VH (CDRH1, CDRH2, CDRH3) and three
CDRs from the VL (CDRL1, CDRL2, CDRL3) form the majority
of the antigen-binding site. Despite variation in sequence, five of
the six CDR loops (CDRH1, CDRH2, CDRL1, CDRL2, and
CDRL3) are thought to adopt a limited number of conformations,
known as the canonical classes [e.g., 8–10]. The remainder of the V
domains is collectively known as the antibody’s framework. Residues
in the framework region and packing of the VH and VL
domains (i.e., VH–VL orientation) can also affect antigen binding
[11, 12].
Fig. 1 Structure of an antibody molecule. (a) Antibodies are formed from two pairs of two protein chains: the
heavy chains (green) and the light chains (cyan). Each chain has a series of immunoglobulin domains, known
as the variable (V) or constant (C) regions. The two variable domains combine to form the variable fragment
(Fv), and at the tip of the Fv are the CDR loops, which form the majority of the antigen-binding site. (b) The
variable fragment has six CDR loops: CDRH1, CDRH2, and CDRH3 from the VH domain, and CDRL1, CDRL2,
and CDRL3 on the VL.
Antibody design campaigns aim to engineer a new antibody
structure that can bind a target of interest, or modify an antibody
for enhanced function, such as affinity or thermostability. Several
methods are used for experimental antibody design, such as phage
display [13] and immunizing “humanized” mice [14]. However,
these techniques require extensive resources [15]. To facilitate the
progress of experimental work, computational techniques can help
improve an antibody’s affinity [16], increase its safety and stability
[17–19], target new antigens [20], or explore new binding
modes [21].
[Figure 2 panels: Target antigen of interest; Structural modelling; Function prediction; Structural modifications; Sequence design]
Fig. 2 Starting from an initial target sequence, it is imperative to build a model structure of the antibody
[24–26]. Next, the model antibody structure is tested for a particular function; for example, it is docked to
the target antigen for predicting binding affinity [27, 28]. From this newly formed complex, the antibody
structure is allowed to mutate, leading to a new antibody sequence. This cycle is repeated, leading to multiple
possible designs.
Traditionally, antibodies have been computationally “redesigned”
by mutation in silico. However, the ideal end goal would
be to develop a complete antibody “design” methodology, where
antibodies can be built de novo for a novel target [22, 23]. For both
redesign and complete design, a pipeline must give an amino acid
sequence that generates an antibody structure with desired proper-
ties, e.g., an antibody with sub-nanomolar binding affinity (Fig. 2)
[29, 30]. Thus, computational design pipelines rely on several
tools, including sequence annotation and analysis [31], structural
modeling [24–26], function prediction [27, 28, 32], and mutation
[33, 34].
Antibody structural modeling lies at the heart of computational
antibody design [22, 23]. Although the antigen structure may be
available (experimental or model), the sequence and the structure
of the cognate antibody during the design phase are often
unknown. Thus, it is imperative to build structural models to help
understand how an antibody can interact with the antigen, e.g., via
antibody-antigen docking [4, 27, 28]. Models allow users to inves-
tigate the impact of design choices (e.g., single amino acid muta-
tions or grafting CDR loops) on specific design objectives (e.g.,
antibody stability and safety) [35].
Once the antibody structure is modeled, mutations can be
introduced in silico, leading to directed antibody evolution. In
many cases, the modeled backbone structure is retained; only the
amino acid side chains are swapped and repacked [33, 34].
In this chapter, we outline the procedure for using ABody-
Builder for antibody structural modeling, and PEARS for single-
amino-acid mutations. Both tools are suited for computational
antibody redesign, though it is possible to use them for complete
design problems.
1.1 Antibody Structural Modeling
Antibody structure prediction can cover a broad range of problems,
such as CDR loop prediction [36–39] and predicting the orientation
between the variable domains [40, 41]. This chapter specifically
focuses on predicting the structure of the Fv [24, 25, 28], as
this is the domain that is primarily responsible for antigen binding.
Antibody structure prediction is usually undertaken in a
template-based manner as the frameworks of antibody structures
are highly conserved. Most protocols follow a similar procedure,
with minor variations; as an example, Fig. 3 shows an overview of
the ABodyBuilder algorithm.
For a target antibody sequence, modeling programs first iden-
tify one or more template structure(s) to model the framework
region. Templates can be selected from the Protein Data Bank
(PDB) [44], or from a curated database, such as the Structural
Antibody Database (SAbDab) [2]. The coordinates of the template
structure(s) are copied and used as a scaffold for subsequent steps.
Next, the orientation between the VH and VL domains is predicted.
This can be done by using the VH–VL orientation of the template
structure [24, 25, 45], machine learning techniques [40], or
computational docking algorithms [26].
In the third stage, the CDR loops are modeled. This is often
done by knowledge-based methods, such as FREAD [36, 46,
47]. Using a database of previously observed structural fragments,
FREAD predicts the CDR loop structure based on sequence simi-
larity to the target CDR sequence and anchor geometry [47]. If a
suitable fragment is not available, CDR loops can be predicted by
ab initio methods, such as MODELLER [48] and Rosetta
[28, 37]. Programs such as Sphinx use both fragment-based and
ab initio techniques for predicting the CDR loops, which is partic-
ularly useful for the CDRH3 loop [38]. For the CDRH1, CDRH2,
CDRL1, CDRL2, and CDRL3 loops, it is possible to predict the
canonical form of the loop based on sequence [10, 49].
Finally, the torsion angles of the side chains, known as the χ
angles, are predicted using the backbone information alone.
Some modeling methods, such as ABodyBuilder, rely on dedicated
side-chain prediction tools [24, 25, 42]. Other pipelines use a built-
in side-chain prediction algorithm [43, 50], or a solvation model
[51]. Following side-chain prediction, the model structure in some
protocols undergoes energy minimization [42, 43].
Fig. 3 Overview of the ABodyBuilder modeling methodology [24]. Most modeling
methods follow a similar workflow with minor variations [25, 26, 42, 43]. The
figure shows the following steps:
1. Annotate and number antibody sequence using ANARCI (example input: EVQLQQSGAE..., DIVMTQSQKF...).
2. Search framework template: VH+VL (single), or VH/VL (multiple). Predict orientation for multiple templates.
3. Search CDR loop templates: FREAD searches for CDR loop fragments using environment-specific substitution, anchor RMSD, and checks for clashes. If a suitable fragment is not available, use a length-matched sequence-similar template. Otherwise, model the loop ab initio with MODELLER.
4. Model side chains: PEARS predicts side chains using data from an IMGT position-dependent rotamer library.
5. Annotate model accuracy: estimate the expected RMSD value for a confidence threshold (e.g. 75%), based on framework superimposition data and FREAD benchmark.
Once the model is generated, ABodyBuilder annotates the
expected model accuracy, and is currently the only freely available
methodology that offers this functionality [24]. ABodyBuilder estimates
the probability that a region, e.g., the framework or the
CDRH1 loop, is modeled within a root mean square deviation
(RMSD) threshold, given the sequence identity or loop length. For
example, ABodyBuilder reports the probability that the model’s
CDRL3 loop would be predicted with an RMSD within 1.0 Å. For
design purposes, this allows users to understand the limitations of
the model structure, and whether the model should be considered
for further applications in silico. ABodyBuilder also flags positions
that can cause potential developability issues [52], helping users
eliminate some of the sources of error in antibody production.
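To make the idea of a data-driven accuracy estimate concrete, the following is a minimal sketch of how such probabilities could be read from benchmark data. It is not the ABodyBuilder implementation: the function names and the benchmark RMSD values are hypothetical, and we assume that the estimate is a simple empirical fraction over previously modeled loops of the same CDR type and length.

```python
import math

def prob_within(rmsds, threshold):
    """Fraction of benchmark models with RMSD at or below the threshold."""
    return sum(r <= threshold for r in rmsds) / len(rmsds)

def rmsd_at_confidence(rmsds, confidence):
    """Smallest benchmark RMSD that covers at least `confidence` of the models."""
    ordered = sorted(rmsds)
    # Index of the first value whose cumulative fraction reaches the confidence level.
    k = max(0, math.ceil(confidence * len(ordered)) - 1)
    return ordered[k]

# Hypothetical benchmark RMSDs (in Å) for loops of one CDR type and length; not real data.
benchmark = [0.4, 0.6, 0.7, 0.9, 1.1, 1.3, 1.8, 2.5]

p = prob_within(benchmark, 1.0)             # probability of an RMSD within 1.0 Å
r75 = rmsd_at_confidence(benchmark, 0.75)   # RMSD expected at 75% confidence
```

With these illustrative values, half of the benchmark loops fall within 1.0 Å, and 75% of them fall within 1.3 Å.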
1.2 Directed Evolution by Side-Chain Prediction
Side-chain prediction methods can be used to predict all the side
chains on a model structure, or they can be used to introduce
mutations [e.g., 33, 34]. It is assumed that in most cases, changing
a single-amino-acid residue has little impact on the overall structure
of a protein [33]. Thus, in silico mutation can be considered a
specialized case of side-chain prediction.
In the traditional side-chain prediction problem, every resi-
due’s χ angle(s) must be predicted. In order to simplify the confor-
mational search space, the χ angles are described in discrete forms,
known as rotamers. Side-chain prediction methods generate pre-
dictions by sampling rotamers from rotamer libraries, which
describe the probability of a rotamer for a given structural property.
The most common structural property is the ϕ/ψ angles of the
backbone [53, 54]. Other properties such as secondary structure
[55] or an amino acid’s position in a protein fragment [56] have
also been used. For PEARS, our antibody-specific side-chain pre-
dictor, rotamer probabilities are dependent on their IMGT position
[57]. Numbering schemes such as the IMGT scheme provide a
method for comparing the amino acid sequences of two or more
antibodies. In theory, a given position should represent a specific
part of the immunoglobulin domain, and capture features such as
the distribution of amino acids. While there are various schemes
available [8, 58], the IMGT scheme is often preferred as it has a
clear correlation to structure [57, 59].
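A position-dependent rotamer library can be pictured as a lookup table keyed by numbering position and residue type. The sketch below is illustrative only and is not the PEARS code: the observations are invented, the three-bin division of χ1 is an assumption, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def chi1_bin(chi1):
    """Assign a chi1 angle (degrees, -180 to 180) to the bin centred at 60, 180, or -60."""
    if 0.0 <= chi1 < 120.0:
        return 60
    if -120.0 <= chi1 < 0.0:
        return -60
    return 180

def build_library(observations):
    """Turn (position, residue, chi1) observations into per-position bin probabilities."""
    counts = defaultdict(Counter)
    for position, residue, chi1 in observations:
        counts[(position, residue)][chi1_bin(chi1)] += 1
    library = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        library[key] = {b: n / total for b, n in counter.items()}
    return library

def most_probable(library, position, residue):
    """Return the most frequently observed chi1 bin for a residue at a position."""
    bins = library[(position, residue)]
    return max(bins, key=bins.get)

# Hypothetical observations: (IMGT-style position label, residue, chi1 in degrees).
observations = [
    ("H50", "SER", 62.0), ("H50", "SER", -65.0),
    ("H50", "SER", 58.0), ("H50", "SER", 175.0),
    ("L40", "TYR", -60.0), ("L40", "TYR", -70.0),
]
library = build_library(observations)
```

In this toy library, the second entry has all of its observations in a single bin, which is the kind of unimodal position that can be predicted before the rest of the structure.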
2 Materials
2.1 Web Requirements
A WebGL and JavaScript-enabled web browser, such as Google
Chrome, is recommended for ABodyBuilder
(http://opig.stats.ox.ac.uk/webapps/abodybuilder) and PEARS
(http://opig.stats.ox.ac.uk/webapps/pears). Currently, PEARS is part of the
ABodyBuilder pipeline, though users can separately submit structures to
PEARS for predicting the side chains, or mutating the antibody in
silico.
2.2 Additional Software Requirements
To view the model structures locally, we recommend
PyMOL (https://sourceforge.net/projects/pymol/), which is
available for Linux, Macintosh, and Windows. Users can download
annotations, e.g., sequence liabilities, as CSV files, which can be
opened in most spreadsheet applications (e.g., LibreOffice).
3 Methods
3.1 ABodyBuilder
3.1.1 Sequence Annotation
In the sequence submission form, submit the amino acid sequence
of the target antibody. In order to model a paired antibody (including
single-chain Fvs), submit sequences for both the heavy and light
chains, while for single-domain antibodies (for example, VHH
antibodies), submit the sequence for one chain (see Note 1). In
the text below, we describe the procedure for paired antibodies.
1. The submitted target sequence is numbered by ANARCI [31],
which uses a database of hidden Markov models (HMMs; see
Note 1) to number antibody sequences. During this process,
the antibody’s framework region and CDR loops are identified
using the definitions from [9].
3.1.2 Framework Template Selection and Orientation Prediction
Once the sequence has been annotated by ANARCI, ABodyBuilder
searches for a template framework structure from SAbDab [2].
1. ABodyBuilder identifies the template with the highest
sequence identity to the target sequence across the framework
region. If there is an antibody structure that is at least 80%
sequence-identical for both chains, ABodyBuilder uses this
structure as a single “global” template. Otherwise, it uses a
“hybrid” template, in which separate templates are used for the VH
and VL. See Note 2 for examples of template selection.
2. If ABodyBuilder finds a global template, its orientation is used.
For hybrid templates, the orientation of the antibody with the
highest global sequence identity is used. See Note 2 for examples
of orientation selection.
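The selection rule above can be sketched as follows. This is an illustration under stated assumptions rather than the ABodyBuilder source: the function name is hypothetical, each candidate structure is assumed to carry one framework identity per chain, and the "global" identity is approximated as the mean of the two chain identities. The VL identity given for 3nig is invented for the example; the other values are taken from Note 2.

```python
def select_templates(candidates, threshold=80.0):
    """candidates maps a structure ID to (VH identity, VL identity), in percent."""
    def global_identity(name):
        vh, vl = candidates[name]
        return (vh + vl) / 2.0  # assumption: mean identity over both chains

    # A global template needs both chains at or above the threshold.
    passing = [n for n, (vh, vl) in candidates.items()
               if vh >= threshold and vl >= threshold]
    if passing:
        best = max(passing, key=global_identity)
        return {"mode": "global", "VH": best, "VL": best, "orientation": best}

    # Otherwise pick the best template per chain, and take the orientation
    # from the structure with the highest identity across both chains.
    return {
        "mode": "hybrid",
        "VH": max(candidates, key=lambda n: candidates[n][0]),
        "VL": max(candidates, key=lambda n: candidates[n][1]),
        "orientation": max(candidates, key=global_identity),
    }

# Target 1h0d:BA: 1ejo passes the threshold on both chains.
result_1h0d = select_templates({"1ejo": (89.0, 86.0)})

# Target 12e8:HL: the 70.0 for the 3nig light chain is a made-up placeholder.
result_12e8 = select_templates({"1i3g": (79.8, 95.0), "3nig": (90.0, 70.0)})
```

Under these assumptions the first call returns a global template (1ejo), while the second returns a hybrid template with 3nig for the VH, 1i3g for the VL, and the orientation taken from 1i3g.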
3.1.3 Prediction of the CDR Loops
The CDR loops are predicted by a combination of FREAD [46, 47]
and MODELLER [48]. The loops are predicted in the order of
CDRL2, CDRH2, CDRL1, CDRH1, CDRL3, and then CDRH3.
The ordering is based on our ability to predict each CDR loop
individually, and the frequency of Cβ-Cβ contacts between CDR
loops. The CDRL2 and CDRH2 loops are predicted first because
they are usually modeled with the highest accuracy and there are no
contacts between them. This is followed by CDRL1 and CDRH1
as they are the next best predicted loops, and then the CDRL3 and
CDRH3.
1. FREAD is a database method; a CDR loop-specific database is
used to predict each loop, i.e., a CDRL3-specific database is
used to predict the CDRL3 loop. FREAD selects loops using
an environment-specific substitution score, the anchor RMSD, and
a check for clashes with the scaffold (i.e., the framework region
and existing CDR loops). If there are no suitable fragments in
the CDR-specific database, FREAD uses an antibody-specific
database, which includes fragments from all six CDR loops.
2. If FREAD does not find a suitable prediction, ABodyBuilder
searches for a length-matched loop with the highest BLOSUM62
score to the target CDR loop sequence.
3. If a length-matched sequence-similar loop is not available,
MODELLER is used to model the loop ab initio.
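The three-step fallback described above can be sketched as a simple cascade. The code below is a hedged illustration, not the ABodyBuilder implementation: the function names are hypothetical, the FREAD search is represented only by its list of accepted fragments, and a count of identical positions stands in for the BLOSUM62 score to keep the example self-contained.

```python
# Order in which the loops are modeled, as stated in the text.
LOOP_ORDER = ["CDRL2", "CDRH2", "CDRL1", "CDRH1", "CDRL3", "CDRH3"]

def identity_score(a, b):
    """Stand-in for a BLOSUM62 score: number of identical aligned positions."""
    return sum(x == y for x, y in zip(a, b))

def predict_loop(target, fread_hits, loop_db, score=identity_score):
    """Return (method, template sequence) for one CDR loop.

    fread_hits: fragments accepted by the database search (may be empty).
    loop_db:    known loop sequences to fall back on.
    """
    if fread_hits:
        return ("FREAD", fread_hits[0])
    same_length = [s for s in loop_db if len(s) == len(target)]
    if same_length:
        best = max(same_length, key=lambda s: score(target, s))
        return ("length-matched", best)
    return ("ab initio", None)

# Hypothetical loop sequences for illustration.
db = ["QQWSSNPLT", "QQYSSYPT", "QQYNNYPLT"]
hit = predict_loop("QQYNSYPLT", [], db)
no_hit = predict_loop("QQYNSYPLTFG", [], db)
```

Here the first target has no accepted fragment but two length-matched candidates, so the more similar one is returned; the second target has no loop of matching length and would be passed to ab initio modeling.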
3.1.4 Side-Chain Prediction
Once the CDR loops are predicted on the template framework
structure, the side chains of the model are predicted using
PEARS. PEARS uses an IMGT position-dependent distribution
of amino acid rotamers in antibody structures.
1. PEARS first builds the disulfide bridges in the antibody struc-
ture, typically between IMGT positions H23-H104 and
L23-L104.
2. Next, PEARS identifies side chain types that are known to have
a unimodal χ1 angle distribution (e.g., L116 tyrosine). The
side chains at these positions are predicted first using rotamers
with the same χ1 angle bin. If there are no suitable predictions,
these positions are predicted in the next step.
3. The remaining side chains are predicted by dead-end elimina-
tion [60] and then graph decomposition, similar to other side-
chain prediction methods [33, 61]. If no suitable predictions
can be made, only a Cβ atom is placed.
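For readers unfamiliar with dead-end elimination, the sketch below applies the original criterion of Desmet et al. [60] to a toy problem: a rotamer is discarded when its best possible energy is still worse than the worst possible energy of another rotamer at the same position. The energies, position names, and function names are hypothetical, and the graph decomposition step that follows in PEARS is not shown.

```python
def bound(i, r, alive, self_e, pair_e, pick):
    """Self energy of rotamer r at position i plus the best (pick=min) or
    worst (pick=max) pairwise energy with every other position."""
    return self_e[i][r] + sum(
        pick(pair_e(i, r, j, s) for s in alive[j]) for j in alive if j != i
    )

def dead_end_eliminate(self_e, pair_e):
    """Repeatedly remove rotamers that cannot be part of the lowest-energy solution."""
    alive = {i: list(rots) for i, rots in self_e.items()}
    changed = True
    while changed:
        changed = False
        for i in alive:
            survivors = [
                r for r in alive[i]
                if not any(
                    bound(i, r, alive, self_e, pair_e, min)
                    > bound(i, t, alive, self_e, pair_e, max)
                    for t in alive[i] if t != r
                )
            ]
            if len(survivors) < len(alive[i]):
                alive[i] = survivors
                changed = True
    return alive

# Toy energies (arbitrary units) for two positions with two rotamers each.
SELF = {"A": {"a1": 0.0, "a2": 5.0}, "B": {"b1": 0.0, "b2": 1.0}}
PAIR = {("a1", "b1"): 0.5, ("a1", "b2"): 0.0, ("a2", "b1"): 0.0, ("a2", "b2"): 0.2}

def pair_energy(i, r, j, s):
    """Symmetric lookup of the pairwise energy between two rotamers."""
    return PAIR.get((r, s), PAIR.get((s, r), 0.0))

remaining = dead_end_eliminate(SELF, pair_energy)
```

In this example a single rotamer survives at each position, and the surviving pair is also the lowest-energy combination found by enumerating all four possibilities.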
3.1.5 Annotation of Model Structure and Download Links
Once ABodyBuilder completes the modeling process (~30 s), the
user is immediately redirected to the results page, summarizing the
templates that were used for the framework and the CDRs (Fig. 4).
In addition, sequence alignments of the model and target
sequences are provided. Users can choose to submit the structure
for paratope prediction (Antibody i-Patch) [27] or epitope predic-
tion (EpiPred) [32], or view the model structure for model accu-
racy and sequence liabilities (Fig. 4).
3.2 PEARS
3.2.1 Structure Input Form
The first step requires the user to upload the structure of the
antibody, with or without the antigen (Fig. 5), and specify the
antibody chains, for example, “HL.” To mutate residues in the
antibody structure, the desired amino acid sequence of the
antibody is then submitted (see Note 3). PEARS generates the mutated
structure and directs the user to a results page, where the antibody
structure, renumbered in the IMGT scheme, is available along with
a text file listing all the predicted χ angles.
Fig. 4 Screenshots of the ABodyBuilder results and viewer pages. Once a model is built, users are directed to
the results page (top) that lists the template structures that were used to model different regions of the
antibody. The viewer page (bottom) shows the model using BioPV [62]
Fig. 5 Screenshot of the PEARS input and results pages. In the input page (top), users can submit a modified
sequence of the antibody (see Note 3). The output page (bottom) shows the final prediction, and users can
download a tab-separated file that lists the χ angles in the final model
3.2.2 Mutation of the Input Structure
When mutating the antibody structure, PEARS aligns the submitted
sequence to the amino acid sequence in the structure (see Note 3).
In the single-mutation case, PEARS simply uses the lowest
energy rotamer to fit into the target position. Otherwise, it runs
dead-end elimination and graph decomposition.
3.2.3 Resolving Clashes in the Structure
When PEARS predicts the side-chain structure of a target position,
it uses a KD-tree algorithm to check for clashes. Two atoms are
considered to clash if they are closer than 63% of the sum of their
van der Waals radii, which is similar to previously established
cutoffs [34]. If clashes are detected, PEARS first adds Gaussian noise
to the χ angles; if this does not resolve the clashes, no predictions
are made, and the position is left with only a Cβ atom.
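The clash criterion can be written down in a few lines. The sketch below is not the PEARS code: it compares every pair of atoms directly rather than using a KD-tree, which gives the same answer for small inputs but scales less well, and the van der Waals radii are the commonly used Bondi values, which we assume here because the chapter does not state which radii PEARS uses.

```python
import math

# Bondi van der Waals radii in Å (an assumption; see the lead-in).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def clashes(atom_a, atom_b, fraction=0.63):
    """True if two (element, (x, y, z)) atoms are closer than
    `fraction` of the sum of their van der Waals radii."""
    (elem_a, xyz_a), (elem_b, xyz_b) = atom_a, atom_b
    cutoff = fraction * (VDW_RADII[elem_a] + VDW_RADII[elem_b])
    return math.dist(xyz_a, xyz_b) < cutoff

def find_clashes(side_chain, scaffold):
    """Return index pairs (side-chain atom, scaffold atom) that clash."""
    return [
        (i, j)
        for i, a in enumerate(side_chain)
        for j, b in enumerate(scaffold)
        if clashes(a, b)
    ]

# Hypothetical coordinates in Å.
side_chain = [("C", (0.0, 0.0, 0.0)), ("O", (5.0, 0.0, 0.0))]
scaffold = [("N", (0.0, 1.9, 0.0)), ("C", (5.0, 0.0, 2.5))]
pairs = find_clashes(side_chain, scaffold)
```

For two carbon atoms the cutoff is 0.63 × 3.40 Å ≈ 2.14 Å, so a pair 2.0 Å apart is flagged while a pair 2.5 Å apart is not.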
4 Notes
1. Sequences are initially numbered by ANARCI. If ANARCI
cannot number the sequence (e.g., not a variable domain
sequence), or it cannot detect the anchor residues of the
CDR loops, ABodyBuilder immediately stops. We found that
most antibodies that were not modeled by ABodyBuilder had
failed at the ANARCI stage of the pipeline. Thus, users are
advised to check that their sequence can be numbered using the
ANARCI web application (http://opig.stats.ox.ac.uk/webapps/anarci)
before running ABodyBuilder.
2. Below are example selections of global and hybrid templates.

Target antibody   VH template (sequence identity)   VL template (sequence identity)   Orientation template
1h0d:BA           1ejo:H (89%)                      1ejo:L (86%)                      1ejo:HL
12e8:HL           3nig:E (90%)                      1i3g:L (95%)                      1i3g:HL

In the case of 1h0d:BA, a single “global” template is used
as both chains of 1ejo:HL have at least 80% sequence identity to the
target sequence. For 12e8:HL, a “hybrid” template is used as
the heavy chain of 1i3g:HL has 79.8% sequence identity to the
target sequence, which is below the 80% threshold. However, across
both chains, 1i3g:HL has the highest global sequence identity, and is
thus used for predicting the VH–VL orientation.
3. To mutate an antibody, we recommend submitting a sequence
with identical length to the input structure, though PEARS can
align sequences with mismatching lengths. Furthermore, the
chain identifiers must refer to the antibody chains in the structure;
otherwise the sequence alignment will fail. To keep existing
side chains fixed, use lowercase letters at those positions and
uppercase letters elsewhere. If multiple side-chain mutations are
required, we recommend rerunning ABodyBuilder, though PEARS can handle
multiple mutations at once.
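The letter-case convention in Note 3 can be illustrated with a short sketch that classifies each position of a submitted sequence. This is our reading of the convention, not the PEARS parser: the function name and action labels are hypothetical, and the sketch assumes sequences of identical length, whereas PEARS itself can align sequences of different lengths.

```python
def plan_side_chains(structure_seq, submitted_seq):
    """Classify each position as 'fixed', 'repack', or 'mutate'.

    Lowercase in the submitted sequence keeps the existing side chain,
    uppercase asks for a prediction, and a changed letter is a mutation.
    """
    if len(structure_seq) != len(submitted_seq):
        raise ValueError("this sketch assumes sequences of identical length")
    plan = []
    for index, (old, new) in enumerate(zip(structure_seq.upper(), submitted_seq)):
        if new.islower():
            if new.upper() != old:
                raise ValueError(f"position {index}: cannot keep {old} fixed as {new.upper()}")
            plan.append((index, old, "fixed"))
        elif new == old:
            plan.append((index, old, "repack"))
        else:
            plan.append((index, new, "mutate"))
    return plan

# Hypothetical five-residue example: keep E and V, repack Q and V, mutate L to A.
plan = plan_side_chains("EVQLV", "evQAV")
```

Each entry of the resulting plan gives the zero-based position, the residue to be placed there, and the action to take.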
References
1. Georgiou G, Ippolito GC, Beausang J, Busse CE, Wardemann H, Quake SR (2014) The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat Biotechnol 32:158–168
2. Dunbar J, Krawczyk K, Leem J, Baker T, Fuchs A, Georges G, Shi J, Deane CM (2014) SAbDab: the structural antibody database. Nucleic Acids Res 42:D1140–D1146
3. Chames P, Van Regenmortel M, Weiss E, Baty D (2009) Therapeutic antibodies: successes, limitations and hopes for the future. Br J Pharmacol 157:220–233
4. Kuroda D, Shirai H, Jacobson MP, Nakamura H (2012) Computer-aided antibody design. Protein Eng Des Sel 25:507–521
5. Reichert JM (2017) Antibodies to watch in 2017. MAbs 9:167–181
6. Weiner GJ (2015) Building better monoclonal antibody-based therapeutics. Nat Rev Cancer 15:361–370
7. Schroeder HW, Cavacini L (2010) Structure and function of immunoglobulins. J Allergy Clin Immunol 125:41–52
8. Chothia C, Lesk A (1987) Canonical structures for the hypervariable regions of immunoglobulins. J Mol Biol 196:901–917
9. North B, Lehmann A, Dunbrack RL (2011) A new clustering of antibody CDR loop conformations. J Mol Biol 406:228–256
10. Nowak J, Baker T, Georges G, Kelm S, Klostermann S, Shi J, Sridharan S, Deane CM (2016) Length-independent structural similarities enrich the antibody CDR canonical class model. MAbs 8:751–760
11. Dunbar J, Fuchs A, Shi J, Deane CM (2013) ABangle: Characterising the VH-VL orientation in antibodies. Protein Eng Des Sel 26:611–620
12. Foote J, Winter G (1992) Antibody framework residues affecting the conformation of the hypervariable loops. J Mol Biol 224:487–499
13. McCafferty J, Griffiths AD, Winter G, Chiswell DJ (1990) Phage antibodies: filamentous phage displaying antibody variable domains. Nature 348:552–554
14. Lee E-C, Liang Q, Ali H, Bayliss L, Beasley A, Bloomfield-Gerdes T, Bonoli L, Brown R, Campbell J, Carpenter A, Chalk S, Davis A, England N, Fane-Dremucheva A, Franz B, Germaschewski V, Holmes H, Holmes S, Kirby I, Kosmac M, Legent A, Lui H, Manin A, O’Leary S, Paterson J, Sciarrillo R, Speak A, Spensberger D, Tuffery L, Waddell N, Wang W, Wells S, Wong V, Wood A, Owen MJ, Friedrich GA, Bradley A (2014) Complete humanization of the mouse immunoglobulin loci enables efficient therapeutic antibody discovery. Nat Biotechnol 32:356–363
15. Liu X, Taylor RD, Griffin L, Coker S-F, Adams R, Ceska T, Shi J, Lawson ADG, Baker T (2017) Computational design of an epitope-specific Keap1 binding antibody using hotspot residues grafting and CDR loop swapping. Sci Rep 7:41306
16. Lippow SM, Wittrup KD, Tidor B (2007) Computational design of antibody-affinity improvement beyond in vivo maturation. Nat Biotechnol 25:1171–1176
17. Choi Y, Hua C, Sentman CL, Ackerman ME, Bailey-Kellogg C (2015) Antibody humanization by structure-based computational protein design. MAbs 7:1045–1057
18. Miklos AE, Kluwe C, Der BS, Pai S, Sircar A, Hughes RA, Berrondo M, Xu J, Codrea V, Buckley PE, Calm AM, Welsh HS, Warner CR, Zacharko MA, Carney JP, Gray JJ, Georgiou G, Kuhlman B, Ellington AD (2012) Structure-based design of supercharged, highly thermoresistant antibodies. Chem Biol 19:449–455
19. Olimpieri PP, Marcatili P, Tramontano A (2015) Tabhu: tools for antibody humanization. Bioinformatics 31:434–435
20. Lewis SM, Wu X, Pustilnik A, Sereno A, Huang F, Rick HL, Guntas G, Leaver-Fay A, Smith EM, Ho C, Hansen-Estruch C, Chamberlain AK, Truhlar SM, Conner EM, Atwell S, Kuhlman B, Demarest SJ (2014) Generation of bispecific IgG antibodies by structure-based design of an orthogonal Fab interface. Nat Biotechnol 32:191–198
21. Dunbar J, Knapp B, Fuchs A, Shi J, Deane CM (2014) Examining variable domain orientations in antigen receptors gives insight into TCR-like antibody design. PLoS Comput Biol 10:1–10
22. Lapidoth GD, Baran D, Pszolla GM, Norn C, Alon A, Tyka MD, Fleishman SJ (2015) AbDesign: an algorithm for combinatorial backbone design guided by natural conformations and sequences. Proteins 83:1385–1406
23. Li T, Pantazes RJ, Maranas CD (2014) OptMAVEn – a new framework for the de novo design of antibody variable region models targeting specific antigen epitopes. PLoS One 9:1–17
24. Leem J, Dunbar J, Georges G, Shi J, Deane CM (2016) ABodyBuilder: automated antibody structure prediction with data-driven accuracy estimation. MAbs 8:1259–1268
25. Marcatili P, Olimpieri PP, Chailyan A, Tramontano A (2014) Antibody structural modeling with prediction of immunoglobulin structure (PIGS). Nat Protoc 9:2771–2783
26. Sivasubramanian A, Sircar A, Chaudhury S, Gray JJ (2009) Toward high-resolution homology modeling of antibody Fv regions and application to antibody-antigen docking. Proteins 74:497–514
27. Krawczyk K, Baker T, Shi J, Deane CM (2013) Antibody i-Patch prediction of the antibody binding site improves rigid local antibody-antigen docking. Protein Eng Des Sel 26:621–629
28. Weitzner BD, Jeliazkov JR, Lyskov S, Marze N, Kuroda D, Frick R, Adolf-Bryfogle J, Biswas N, Dunbrack RL Jr, Gray JJ (2017) Modeling and docking of antibody structures with Rosetta. Nat Protoc 12:401–416
29. Huang P-S, Boyken SE, Baker D (2016) The coming of age of de novo protein design. Nature 537:320–327
30. Khoury GA, Smadbeck J, Kieslich CA, Floudas CA (2014) Protein folding and de novo protein design for biotechnological applications. Trends Biotechnol 32:99–109
31. Dunbar J, Deane CM (2016) ANARCI: antigen receptor numbering and receptor classification. Bioinformatics 32:298–300
32. Krawczyk K, Liu X, Baker T, Shi J, Deane CM (2014) Improving B-cell epitope prediction and its application to global antibody-antigen docking. Bioinformatics 30:2288–2294
33. Krivov GG, Shapovalov MV, Dunbrack RL (2009) Improved prediction of protein side-chain conformations with SCWRL4. Proteins 77:778–795
34. Nagata K, Randall A, Baldi P (2012) SIDEpro: a novel machine learning approach for the fast and accurate prediction of side-chain conformations. Proteins 80:142–153
35. Almagro JC, Teplyakov A, Luo J, Sweet RW, Kodangattil S, Hernandez-Guzman F, Gilliland GL (2014) Second antibody modeling assessment (AMA-II). Proteins 82:1553–1562
36. Choi Y, Deane CM (2011) Predicting antibody complementarity determining region structures without classification. Mol BioSyst 7:3327–3334
37. Finn JA, Koehler Leman J, Willis JR, Cisneros A, Crowe JE, Meiler J (2016) Improving loop modeling of the antibody complementarity-determining region 3 using knowledge-based restraints. PLoS One 11:e0154811
38. Marks C, Nowak J, Klostermann S, Georges G, Dunbar J, Shi J, Kelm S, Deane CM (2017) Sphinx: merging knowledge-based and ab initio approaches to improve protein loop prediction. Bioinformatics 33:1346–1353
39. Messih MA, Lepore R, Marcatili P, Tramontano A (2014) Improving the accuracy of the structure prediction of the third hypervariable loop of the heavy chains of antibodies. Bioinformatics 30:2733–2740
40. Bujotzek A, Dunbar J, Lipsmeier F, Schäfer W, Antes I, Deane CM, Georges G (2015a) Prediction of VH-VL domain orientation for antibody variable domain modeling. Proteins 83:681–695
41. Marze NA, Lyskov S, Gray JJ (2016) Improved prediction of antibody VL-VH orientation. Protein Eng Des Sel 29:409–418
42. Yamashita K, Ikeda K, Amada K, Liang S, Tsuchiya Y, Nakamura H, Shirai H, Standley DM (2014) Kotai antibody builder: automated high-resolution structural modeling of antibodies. Bioinformatics 30:3279–3280
43. Bujotzek A, Fuchs A, Qu C, Benz J, Klostermann S, Antes I, Georges G (2015b) MoFvAb: modeling the Fv region of antibodies. MAbs 7:838–852
44. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242
45. Maier JKX, Labute P (2014) Assessment of fully automated antibody homology modeling protocols in molecular operating environment. Proteins 82:1599–1610
46. Choi Y, Deane CM (2010) FREAD revisited: accurate loop structure prediction using a database search algorithm. Proteins 78:1431–1440
47. Deane CM, Blundell TL (2001) CODA: a combined algorithm for predicting the structurally variable regions of protein models. Protein Sci 10:599–612
48. Šali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234:779–815
49. Adolf-Bryfogle J, Xu Q, North B, Lehmann A, Dunbrack RL Jr (2015) PyIgClassify: a database of antibody CDR structural classifications. Nucleic Acids Res 43:D432–D438
50. Berrondo M, Kaufmann S, Berrondo M (2014) Automated aufbau of antibody structures from given sequences using Macromoltek’s SmrtMolAntibody. Proteins 82:1636–1645
51. Zhu K, Day T, Warshaviak D, Murrett C, Friesner R, Pearlman D (2014) Antibody structure determination using a combination of homology modeling, energy-based refinement, and loop prediction. Proteins 82:1646–1655
52. Jarasch A, Koll H, Regula JT, Bader M, Papadimitriou A, Kettenberger H (2015) Developability assessment during the selection of novel therapeutic antibodies. J Pharm Sci 104:1885–1898
53. Shapovalov MV, Dunbrack RL (2011) A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure 19:844–858
54. Towse C-L, Rysavy S, Vulovic I, Daggett V (2016) New dynamic rotamer libraries: data-driven analysis of side-chain conformational propensities. Structure 24:187–199
55. Lovell SC, Word JM, Richardson JS, Richardson DC (2000) The penultimate rotamer library. Proteins 40:389–408
56. Chinea G, Padron G, Hooft RWW, Sander C, Vriend G (1995) The use of position-specific rotamers in model building by homology. Proteins 23:415–421
57. Lefranc M-P, Pommié C, Ruiz M, Giudicelli V, Foulquier E, Truong L, Thouvenin-Contet V, Lefranc G (2003) IMGT unique numbering for immunoglobulin and T cell receptor variable domains and Ig superfamily V-like domains. Dev Comp Immunol 27:55–77
58. Kabat EA, Wu TT, Bilofsky H, Reid-Miller M, Perry HM (1983) Sequences of proteins of immunological interest, 3rd edn. National Institutes of Health, Bethesda
59. Lefranc M-P (2014) Immunoglobulin and T cell receptor genes: IMGT and the birth and rise of Immunoinformatics. Front Immunol 5:22
60. Desmet J, Maeyer MD, Hazes B, Lasters I (1992) The dead-end elimination theorem and its use in protein side-chain positioning. Nature 356:539–542
61. Miao Z, Cao Y, Jiang T (2011) RASP: rapid modeling of protein side chain conformations. Bioinformatics 27:3117–3122
62. Biasini M (2015) pv: v1.8.1
Chapter 22
In Silico-Directed Evolution Using CADEE

Beat Anton Amrein, Ashish Runthala, and Shina Caroline Lynn Kamerlin
Abstract
Recent years have seen an explosion of interest in both sequence- and structure-based approaches toward in
silico-directed evolution. We recently developed a novel computational toolkit, CADEE, which facilitates
the computer-aided directed evolution of enzymes. Our initial work (Amrein et al., IUCrJ 4:50–64, 2017)
presented a pedagogical example of the application of CADEE to triosephosphate isomerase, to illustrate
the CADEE workflow. In this contribution, we describe this workflow in detail, including code input/
output snippets, in order to allow users to set up and execute CADEE simulations on any system of interest.
Key words Enzyme design, Directed evolution, Computational enzymology, Computational enzyme
design, Empirical valence bond
1 Introduction
Directed evolution has revolutionized biotechnology, allowing
enzyme activity to be modified in a targeted fashion with the
requirement of minimal prior knowledge about the mechanistic
features of the enzyme [1–5]. Nevertheless, despite constant meth-
odological advances [6–8] there remain challenges with this
approach, in no small part due to the vastness of the sequence
space that needs sampling and the very small likelihood of identify-
ing beneficial mutations [5, 9, 10]. Here, computational
approaches can play important roles in focusing the multidimen-
sional search space, in guiding experimental design, and, ultimately,
in performing the directed evolution itself in silico. A number of
sequence- and structure-based methods aim to address this problem;
for reviews, we refer the reader to, e.g., Refs. [11–13].
We recently developed a novel semiautomated approach for the
computer-aided directed evolution of enzymes (CADEE) [14],
based on Warshel’s empirical valence bond (EVB) approach
[15]. EVB is an established and powerful method for investigating
enzymatic systems, as illustrated in Fig. 1.
Fig. 1 Examples of various enzymes that have been studied with the EVB approach. The experimental
activation free energies (ΔG‡exp) are shown in dark blue, and the calculated activation free energies (ΔG‡calc)
are shown in sky blue. DHFR, Lys, AR, CM, Try, PAS, DhlA, TIM, RlPMH, AchE, ODC, CA, ATP, and KSI denote
dihydrofolate reductase, lysozyme, aldose reductase, chorismate mutase, trypsin, a bacterial arylsulfatase,
haloalkane dehalogenase, triosephosphate isomerase, a bacterial phosphonate monoester hydrolase, acetyl-
choline esterase, orotidine monophosphate decarboxylase, carbonic anhydrase, F1-ATPase, and ketosteroid
isomerase, respectively. CC-BY adopted from Ref. [14], based on data originally presented in Refs. [16–19]
We present here a methodology guide to CADEE [14] that helps
the user in setting up, deploying, and analyzing CADEE-based
simulations. The main advantage of CADEE is that it removes the
tedious manual computational setup and analysis steps that would
otherwise be required when running large numbers of independent
EVB simulations, thus greatly simplifying the simulation process.
This framework is based on powerfully interwoven tools and a
command-line interface. Specifically, CADEE assists the user with
preparation of the simulations (cadee prep), performing the actual
equilibration and EVB simulations (cadee dyn), and the subsequent
analysis (cadee ana). In addition, due to CADEE’s dependence on
the EVB approach, the main workhorse of CADEE is the Q simu-
lation package [20, 21], and CADEE uses a similar workflow to Q
(for a description of this workflow, see the Q manual [21]). In
terms of other external dependencies, CADEE utilizes the
mpi4py [22] python package to distribute the computational work-
load during molecular dynamics simulations, and relies on Open
Babel [23] to convert molecular structures into different formats.
To automatize part of the analysis workflow, an in-house script
collection (qscripts) is utilized. Finally, to allow protein structures
to be automatically modified with residue substitutions, SCWRL
[24] is required.
The EVB approach has been extensively used to study biochemical
reactivity [15–17, 25–29] (Fig. 1), and it forms the
main basis for CADEE, as it allows for extensive conformational
sampling from multiple different starting conformations with low
computational cost, thus accelerating the convergence of the calcu-
lated activation free energies. In addition, while the EVB approach
requires calibration to a reference system [15, 17], once the EVB
parabolas have been calibrated for the system of interest, the exact
same parameters are then used unchanged to rapidly model the
same reaction in different environments, for example in different
enzymes or enzyme variants. This means that for computational
studies with EVB, only a single frame of reference is required, and
thus it removes the need for additional tedious reparameterization
steps from the workflow. This makes EVB a perfect tool for in
silico-directed evolution and the later screening stages of computa-
tional enzyme design [14, 26, 30–32]. The theoretical background
for CADEE has been described in detail elsewhere, and we refer the
reader to the original CADEE paper [14] for more details about the
approach. In this contribution, we will specifically guide the user
through the CADEE workflow with input files and pedagogical
examples.
2 Background and Default Settings
In this section, we introduce important technical considerations
and practical aspects that help users to understand how to perform
CADEE simulations.
2.1 Spherical Boundary Conditions
In order to economize on computational time, CADEE simulations
are performed using spherical rather than periodic boundary
conditions, with the protein immersed in a spherical droplet of
water described using the surface-constrained all-atom solvent
model (SCAAS) [33], with long-range electrostatics treated using
the local reaction field (LRF) approach [34]. In this model, all
residues within the inner 85% of the droplet are fully mobile,
residues within the outer 15% of the droplet are restrained to
their crystallographic positions using a 10 kcal mol⁻¹ Å⁻² positional
restraint, and the motion of all residues outside the droplet is
fully restricted with a 200 kcal mol⁻¹ Å⁻² harmonic restraint to the
initial crystallographic coordinates. Additionally, ionization of residues
is restricted to the innermost 85% of the droplet (i.e., the
mobile region) in order to avoid system instabilities. The use of
such a setup both accelerates the convergence of the calculations and
gives the user flexibility in terms of the sphere size used (allowing
for example for extension of the sphere size where needed to study
distant mutations or convergence of energies or dynamics). However,
it also means that the user needs to benchmark different
sphere sizes until the results become independent of the sphere
size. The minimal usable sphere size varies for different systems but
is typically in the range from 20 to 30 Å radius, depending on the
system specifics.
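The three restraint regions described above can be summarized in a short sketch. This is not CADEE or Q code: the function name is hypothetical, and it assumes that the "inner 85%" is measured along the sphere radius and that each residue is represented by a single distance from the sphere center.

```python
def restraint_force_constant(distance, sphere_radius, mobile_fraction=0.85):
    """Return the positional restraint (kcal mol^-1 A^-2) applied to a residue.

    distance      -- distance of the residue from the sphere center (in A)
    sphere_radius -- radius of the water droplet (in A)
    Assumption: the 85% / 15% split is taken along the radius.
    """
    if sphere_radius <= 0:
        raise ValueError("sphere_radius must be positive")
    if distance <= mobile_fraction * sphere_radius:
        return 0.0    # inner 85%: fully mobile, no restraint
    if distance <= sphere_radius:
        return 10.0   # outer 15% shell: restrained to crystallographic positions
    return 200.0      # outside the droplet: motion fully restricted


# Example: a 20 A sphere has its mobile region up to 17 A from the center.
for d in (10.0, 18.0, 25.0):
    print(d, restraint_force_constant(d, 20.0))
```

Under these assumptions, benchmarking a larger sphere simply moves more residues from the restrained regions into the mobile one.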
2.2 Speed and Computing Resources
As CADEE is a distributed computing framework that runs a large
number of individual tasks at once, it is important to keep computational
overhead low. In CADEE, this overhead minimization is
achieved with the following tricks:
1. In the case of multistep reactions, only the rate-limiting step is
simulated to an initial approximation, and other steps are only
simulated once a more limited selection of hits have been
identified, to focus resources.
2. We have included standard settings intended for the best use of
resources:
(a) Hysteresis tends to be reduced in the runs due to equili-
bration at the approximate transition state along the reac-
tion coordinate.
(b) Short thermalization and 8 ns of tandem equilibration and
EVB phases.
(c) Four replicas each with 8 EVB snapshots = 32 data points
for statistics.
(d) The EVB calculations are initiated from structural snap-
shots collected every nanosecond of the initial equilibra-
tion; this allows post-calculation assignment of when the
system has sufficiently equilibrated so that stable energet-
ics are reached.
(e) By default, each enzyme variant is simulated for a total of
50 ns.
3. In order to decrease the simulation time, we advise the user to
start the EVB simulations at the transition state, propagating
the trajectories to the reactant and product complexes. This
both accelerates convergence (as the user is starting from a state
with partial bonds to reacting atoms) and, provided that suffi-
cient computational resources are available, reduces the real
time of the simulations as trajectories can be propagated in
both directions at once.
4. CADEE efficiency is high, thanks to a pleasingly parallel imple-
mentation, allowing hundreds to thousands of simulations to
be performed in parallel.
5. While these high-efficiency defaults are best suited for produc-
tion simulations, they are inconvenient for test simulations and
the initial system setup. To overcome this, a special script can be
employed, as described in Subheading 4.3.2.
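As a rough check of the default settings listed above, the simulation time per enzyme variant can be worked out from the number of replicas, the equilibration length, and the number and length of the EVB runs. The sketch below is only illustrative arithmetic: the function names are hypothetical, the 520 ps EVB run length is taken from Subheading 2.3, and the short thermalization phase is ignored.

```python
def total_simulation_ns(replicas=4, equilibration_ns=8.0,
                        evb_runs=8, evb_run_ps=520.0):
    """Approximate simulation time (ns) per enzyme variant.

    Assumption: thermalization is short enough to be neglected.
    """
    per_replica_ns = equilibration_ns + evb_runs * evb_run_ps / 1000.0
    return replicas * per_replica_ns


def data_points(replicas=4, evb_runs=8):
    """Number of EVB energy profiles available for statistics."""
    return replicas * evb_runs


print(round(total_simulation_ns(), 2))  # about 12 ns per replica, close to 50 ns in total
print(data_points())
```

With the defaults, each replica accounts for roughly 12 ns, which is consistent with the quoted total of about 50 ns per variant.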
2.3 Simulation Packages (Simpacks)
A simpack as used by CADEE is a tarball, which contains the input
files for a CADEE simulation. Once a simulation has started, all the
results are appended to the simpack. It is therefore not only an
input, but also an output file. Once a simulation is started (“cadee
dyn”), the contents of the simpack are copied to a temporary folder.
The simulation is then spooled to the right position (skipping steps
that have been computed previously) and then molecular dynamics
simulations are performed. Once a simulation step has been com-
pleted, it will be compressed, collected, and appended to the sim-
pack in intervals to reduce strain on the file system. A default
CADEE simpack contains a total of 8 ns of equilibration time.
Every 1000 ps thereof, a snapshot is used to perform a medium-
length EVB simulation (each 520 ps total length).
By default, we suggest the user primarily relies on the medium-
length EVB simulations for estimating the likely activation free
energies for the constructs being tested (the longer the simulation
time, the more likely the simulations have converged). As each
simpack contains eight medium-length EVB runs, it is possible to
allow retro-actively for additional N ns initial equilibration, by
removing the data points of the first N medium-length EVB runs.
The rationale for this more elaborate internal setup (in comparison to
traditional Q inputs and workflows) is that CADEE hides this
complexity from the user, so that it does not reduce productivity
but instead gives the user more flexibility during the analysis. For
example, in our model system below, we have discarded
the first two data points from each simpack, in order to give the
system an additional 2 ns of initial equilibration time, resulting in a
total equilibration time of 3 ns.
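The retroactive choice of equilibration time can be illustrated with a few lines of Python. This is a sketch rather than part of the CADEE analysis tools: the function name is hypothetical, and it assumes that the EVB runs start from snapshots taken after 1, 2, ..., 8 ns of equilibration.

```python
def retained_snapshots(discard_first=0, n_runs=8, interval_ns=1):
    """Return the equilibration times (ns) of the EVB runs that are kept.

    discard_first -- number of leading EVB runs to drop (N in the text)
    """
    if not 0 <= discard_first <= n_runs:
        raise ValueError("discard_first must be between 0 and n_runs")
    start_times = [interval_ns * (i + 1) for i in range(n_runs)]
    return start_times[discard_first:]


kept = retained_snapshots(discard_first=2)
print(kept)           # the first retained run starts after 3 ns of equilibration
print(4 * len(kept))  # energy profiles left when four replicas are used
```

Dropping the first two runs leaves six runs per simpack, the first of which starts after 3 ns of equilibration.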
Technical details: Simpacks use the following nomenclature
protocol: [variant-name]_[replica].tar, for example for the wild-
type protein: “wt_0.tar, wt_1.tar, wt_2.tar, wt_3.tar” and for a
histidine 104 to alanine variant: “H104A_0.tar, H104A_1.tar,
H104A_2.tar, H104A_3.tar”, etc. For our working example,
CADEE creates four independent replicas (seeds) for each enzyme
variant, leading to a total of 4 × 8 = 32 medium EVB runs. In
addition, we have decided to manually remove the first 2 EVB
simulations from each simpack, to allow for longer initial equilibration
without increasing real simulation time, effectively yielding
4 × 6 = 24 medium EVB energy profiles (see also the previous
paragraph). As a baseline, all simpacks must contain a topology file
(mutant.top), a simulation-ready PDB file (mutant.pdb), and the
FEP file (mutant.fep) which contains the EVB parameters for the
different reacting states for the reaction being studied. If molecular
dynamics simulations should be performed, the simpacks must
contain numbered input files (*.inp), as per the following scheme:
01_* to 09_*: initialization and thermalization, 1000_eq.inp to
4660_fep.inp: equilibration and FEP files. Files containing the
string (“_eq”) are 50 ps equilibration runs (the reason for the
short 50 ps equilibration runs being to allow data backup at least
every hour of real time on all tested clusters). For more details
about the content of the input files, refer to the Q manual [21].
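The simpack naming scheme lends itself to a small helper for generating and parsing file names, for example when collecting results per variant. The sketch below is not taken from the CADEE source; both function names are hypothetical, and only the [variant-name]_[replica].tar pattern described above is assumed.

```python
import re

# Pattern for "[variant-name]_[replica].tar", e.g. "H104A_2.tar"
SIMPACK_RE = re.compile(r"^(?P<variant>.+)_(?P<replica>\d+)\.tar$")


def simpack_names(variant, replicas=4):
    """Return the expected simpack file names for one enzyme variant."""
    return ["{}_{}.tar".format(variant, i) for i in range(replicas)]


def parse_simpack_name(filename):
    """Split a simpack file name into (variant, replica number)."""
    match = SIMPACK_RE.match(filename)
    if match is None:
        raise ValueError("not a simpack name: {!r}".format(filename))
    return match.group("variant"), int(match.group("replica"))


print(simpack_names("H104A"))
print(parse_simpack_name("wt_3.tar"))
```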
3 CADEE Installation
This protocol has been written specifically in conjunction with
CADEE version 0.9. However, we note that CADEE is constantly
in development, and therefore we advise the reader to refer to the
latest CADEE release and associated instructions, available at
https://github.com/kamerlinlab/cadee. For simplicity, here, we will guide the user
through CADEE installation and setup using the model example
of triosephosphate isomerase, as with our initial CADEE
paper [14].
3.1 The Wild-Type Enzyme Reaction
CADEE relies on the user to have already characterized and validated
the reaction mechanism of the enzyme of interest, and to
have calibrated the EVB coupling and gas-phase shift parameters
against relevant experimental or computational data (usually
corresponding to the energetics of the reaction catalyzed by the
wild-type enzyme, or the corresponding uncatalyzed reaction in
aqueous solution), as described in Ref. [14]. It is therefore crucial
that the system is carefully prepared, as the quality of data obtained
from all subsequent steps builds on the correct modeling of the
baseline reaction (i.e., for the EVB calibration).
3.2 Installation and System Requirements
CADEE has been written and tested on Linux machines. The
parallel computing has been tested on a variety of Intel as well as
AMD clusters, as these systems were accessible through the Swedish
National Infrastructure for Computing (SNIC) at various sites
in Linköping (NSC/Triolith https://www.nsc.liu.se/), Uppsala
(UPPMAX/Tintin and Rackham https://www.uppmax.uu.se/),
and Umeå (HPC2N/Akka, Abisko, and Kebnekaise
https://www.hpc2n.umu.se/). We note that while the software was written to
run on all SNIC-provided resources, we have not, as yet, tested it
on other SNIC clusters. In addition, we have not used other
resource managers than SLURM, as this is the primary resource
manager on SNIC systems. For simplicity, we assume that the user
will be using a Debian- or Ubuntu-based system, as those systems
are in widespread use.
The CADEE installer may be downloaded from our official
GitHub repository, which is located at
http://www.github.com/kamerlinlab/cadee. For new users, we recommend following the
CADEE installation instructions described in the following sections
in order to get started.
3.3 How to Read this Chapter
Throughout this section, we assume that a modern implementation
of the Bourne again shell (bash) is installed and used by the user.
Code that needs to be typed into a terminal emulator is explicitly
identified as a Code Input Snippet:
# Code Input Snippet (1)

/bin/bash --version
Note that lines ending with \\ imply that the command con-
tinues on the next line (therefore “Enter” should not be used or the
command might not work as intended). Similarly, the
corresponding output is explicitly identified as a Code Output
Snippet:
# Code Output Snippet (1)

GNU bash, version 4.3.48(1)-release (x86_64-pc-linux-gnu)
Copyright (C) 2013 Free Software Foundation, Inc.
[...]
Here, lines containing [...] represent messages that have been
removed to simplify and shorten output. And finally, all lines starting
with “#” are comments.
3.4 Downloading and Installing Third-Party Programs
3.4.1 Installing a Compiler, MPI, git, and Python 2.7
As Q is to be compiled by the user, a FORTRAN compiler is
necessary and we suggest the use of the free and open-source
software (FOSS) compiler gfortran [35]. CADEE also requires a
working MPI implementation, for example OpenMPI [36] and/or
MPICH [37]. We further recommend installation of the versioning
program Git (https://git-scm.com), as this allows for the installation
instructions to be followed exactly. Since CADEE is a Python
2.7 [38] based framework that utilizes setuptools in the installation,
Python and setuptools need to be installed. Finally, CADEE
also requires Open Babel [23].
Assuming a Debian-based Linux distribution, the following
commands will resolve the required and recommended software
dependencies:

sudo apt-get install gfortran openmpi-bin git openbabel
sudo apt-get install mpich gcc python2.7 python-pip

# The software should be installed without errors.
3.4.2 Licensing and Downloading Q
Q [39] needs to be licensed, downloaded, and installed, as CADEE
relies on the functional capabilities of this molecular simulation
package (see http://xray.bmc.uu.se/~aqwww/q/ for further
details). First-time users are advised to thoroughly familiarize
themselves with Q, and to establish and try out simple reaction
mechanisms with EVB before starting CADEE simulations, to
facilitate the usage of CADEE with their own systems of interest.
3.4.3 Licensing and Downloading SCWRL4
SCWRL4 [24] needs to be licensed, downloaded, and installed, as
CADEE utilizes the functional capabilities of this package to rapidly
predict a likely side-chain orientation (rotamer). Users are advised
to visit http://dunbrack.fccc.edu/scwrl4/ for instructions on the
licensing, download, and installation of SCWRL4.
3.5 Download and Installation of CADEE
3.5.1 Downloading CADEE
CADEE can be either downloaded from
https://github.com/kamerlinlab/cadee or obtained by git-cloning the repository to
the user’s computer:
# Code Input Snippet (3)
cd $HOME
mkdir -p Downloads
cd $HOME/Downloads
git clone https://github.com/kamerlinlab/cadee cadee
cd $HOME/Downloads/cadee
export CADEE_DIR="$PWD"

# Example output:
Cloning into ’cadee’...
remote: Counting objects: 298, done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 298 (delta 25), reused 40 (delta 18), pack-reused 246
Receiving objects: 100% (298/298), 479.34 KiB | 491.00 KiB/s, done.
Resolving deltas: 100% (118/118), done.
Checking connectivity... done.
3.5.2 Q and SCWRL Installation
First, SCWRL4 should be installed to a folder in $PATH, as also
mentioned in the download instructions. Next, a copy of the Q
executables has to be placed in the folder prepared for them
($CADEE_DIR/cadee/executables/q): Once Q has been compiled,
qfep5, qdyn5, qprep5, and qcalc5 should be copied to
$CADEE_DIR/cadee/executables/q/. Alternatively, the setup.py
script will search in the $PATH for the Q executables.
3.5.3 CADEE Installation
Once all required dependencies are installed (see Subheading 3.4),
one may proceed to install CADEE:

python setup.py install --user
The setup first locates the executables of Q, Open Babel, and
SCWRL4, and it will stop the installation if an error is detected, or if
any of the following executables is not found: qdyn5, qfep5,
qprep5, scwrl4, and babel. Additionally, Python libraries (mpi4py,
numpy) might need to be installed during the installation to make
sure that CADEE will function properly.

Welcome to CADEE Pre-Setup Check.
[...]
Installing cadee script to $HOME/.local/bin
[...]
Finished processing dependencies for cadee==0.9
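Before running the installer, it can be convenient to check independently which of the executables named above are visible on $PATH. The following is a minimal sketch of such a check, not the logic of CADEE's own setup.py: the function name is hypothetical, and it is written for Python 3 (shutil.which is not available in Python 2.7).

```python
import shutil

# Executables that the CADEE setup is described as looking for.
REQUIRED = ("qdyn5", "qfep5", "qprep5", "scwrl4", "babel")


def missing_executables(names=REQUIRED, which=shutil.which):
    """Return the executables from `names` that cannot be found on $PATH.

    `which` can be replaced by another lookup function, e.g. for testing.
    """
    return [name for name in names if which(name) is None]


missing = missing_executables()
if missing:
    print("Not found on $PATH: " + ", ".join(missing))
else:
    print("All required executables were found.")
```

Note that this only searches $PATH; Q executables copied to $CADEE_DIR/cadee/executables/q/ would not be detected by this sketch.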
4 Testing the CADEE Installation
4.1 First Start
Once CADEE has been installed successfully, the CADEE wrapper
script (“cadee”) will be available on the command line. This script
has been written for the ease of users familiar with Q, as the syntax
is maintained between the two programs.

cadee --help
If the output is similar to the following lines, CADEE has been
installed successfully:
(C) Copyright 2017 Beat Anton Amrein & Shina Caroline Lynn
Kamerlin
Usage:
cadee [ prep(p) | dyn(d) | ana(a) | tool(t) ]
Multi Core Tasks:
mpirun -n X cadee dyn
mpiexec -n X cadee dyn
X == Number of cores to use; 2+.
In case the output does not resemble the above output (e.g.,
“cadee: command not found”), the installation has failed (or is
incomplete), and we suggest users refer to the troubleshooting
described in Subheading 8.2.
4.2 Preparing a CADEE Simulation
As described in the introduction, to use CADEE, the valence bond
states describing different reacting species for the reaction of interest
need to be pre-parameterized and calibrated for running the
EVB simulations that underlie CADEE (for details about the EVB
approach, see, e.g., Refs. [15, 17]). To simplify CADEE usage and
understanding, and to allow for easy CADEE testing, we have
included a set of sample EVB input files for the user (see
$CADEE_DIR/example). For more information about the theo-
retical background, we refer the user to earlier publications [14, 15,
17].
In order to run properly, CADEE requires the following files:
1. A structure file in PDB format ($CADEE_DIR/example/wt.pdb),
comprising the wild-type enzyme with correct ionization
states for ionizable residues, solvated in a water droplet (generated
by Qprep5): The initial coordinates are typically obtained
from the Protein Data Bank [40, 41] and then adjusted to be
compatible with Q.
2. A “FEP file” ($CADEE_DIR/example/wt.fep), i.e., a file containing
the force field parameters for the different EVB states,
for the purposes of the simulation setup: Note that Qdyn does
not distinguish between general free energy perturbation and
specific EVB calculations when reading input.
3. The qprep5 input file ($CADEE_DIR/example/wt.qpinp),
which was used to generate the initial simulation-ready PDB
file: CADEE will use this file to check the system configuration
and to prepare the relevant topologies.
4. The full path to the folder containing all topology and parameter
files needed to perform the simulation
($CADEE_DIR/example/libraries/).
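The four inputs listed above can be verified with a short pre-flight check before "cadee prep" is called. This is a hedged sketch and not part of CADEE: the function name is hypothetical, and it only tests that the files and the library folder exist, not that their contents are valid.

```python
import os


def missing_inputs(pdb, fep, qpinp, library_dir):
    """Return the paths among the four CADEE inputs that do not exist.

    pdb, fep, qpinp -- paths to the PDB, FEP, and qprep5 input files
    library_dir     -- folder with the topology and parameter files
    """
    problems = [path for path in (pdb, fep, qpinp) if not os.path.isfile(path)]
    if not os.path.isdir(library_dir):
        problems.append(library_dir)
    return problems


# Example with the file names used in this chapter (relative to the
# current working directory):
print(missing_inputs("wt.pdb", "wt.fep", "wt.qpinp", "./libraries/"))
```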
4.2.1 Preparing a CADEE Simulation
To begin with our working example, a simulation may be prepared
with CADEE’s “prep” keyword:

mkdir testing_example
cd testing_example
cp -r $CADEE_DIR/example/* .
cadee prep wt.pdb wt.fep wt.qpinp ./libraries/ --template $CADEE_DIR/\\
simpack_templates/simpack_template_0.05ns_15ps_2.5ps_32.5ps.tar.bz2
In this example, the sample input files included with the
CADEE distribution ($CADEE_DIR/testing_example/wt) contain
the benchmark simulations of the reaction with the wild-type
enzyme. If the “cadee prep” command is initiated without the
“--libmut” or “--alascan” arguments, CADEE will create a subfolder
“wt” and prepare a “wild-type” simulation, i.e., no mutations are
introduced and the EVB FEP file is not modified. This is useful,
because the wild-type reaction is the first to be tested by CADEE,
and if the wild-type reaction does not run properly, then there is no
point in trying out other enzyme variants.
Note also that without the --template argument, CADEE
would prepare inputs worth 12 ns of simulation time. This default
setup will be used later in these protocols. For the sake of testing
CADEE in a short time period, we use instead a 0.1 ns template, for
demonstration purposes (the energetics obtained are thus not
meaningful in and of themselves, and the simulations are rather
merely illustrative of how to prepare and execute CADEE).

[...]
INFO:root:No parameters provided. Will prepare simpacks from input; "wt".
INFO:prep.create_inputs:Creating input files for wt.
INFO:root:Packing wt:
INFO:root:Pack # 0, Seed: 582993
Success! You find your simpacks in $CADEE_DIR/testing_example/wt.
In case the subfolder “wt” exists, CADEE will warn the user
about this. In many cases, the wt.qpinp files then need to be
adapted to a CADEE-specific format, using absolute filenames
and coordinates: The “cadee prep” command will try to automati-
cally perform these changes and create a new file, inserting “.new”
before the file extension; for example, in the example used here, this
would be “wt.new.qpinp” (if this file already exists, “cadee prep”
stops and asks the user to remove “wt.new.qpinp”). Only then are
the wild-type simpacks created and finally packed. The very last line
indicates that the simpacks are ready to use. Caution: The simpacks
have been prepared, but not yet computed. Instructions for the
computation are provided in Subheading 4.3.1. Note that instead
of deleting the old “wt.new.qpinp” the input line may be adjusted,
and instead of wt.qpinp, wt.new.qpinp may be used:

cadee prep wt.pdb wt.fep wt.new.qpinp ./libraries/
The above input will then, in turn, generate the following
output:

[...]
CRITICAL:root:Cannot continue: Folder $CADEE_DIR/testing_example/wt exists.
Please (re)move it.
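Two conditions described in this subsection stop "cadee prep": an existing output folder and an existing "wt.new.qpinp" file. The sketch below shows how the ".new" file name is derived and how the folder could be checked beforehand. Both function names are hypothetical, and the code mirrors only the behavior described in the text, not the CADEE implementation.

```python
import os


def new_qpinp_name(path):
    """Insert ".new" before the file extension, e.g. wt.qpinp -> wt.new.qpinp."""
    root, ext = os.path.splitext(path)
    return root + ".new" + ext


def output_folder_exists(workdir, variant="wt"):
    """True if the subfolder that 'cadee prep' would create is already there."""
    return os.path.isdir(os.path.join(workdir, variant))


print(new_qpinp_name("wt.qpinp"))
if output_folder_exists("."):
    print("Folder 'wt' exists and must be (re)moved before running cadee prep.")
```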
A directory listing of the folder indicated on the last line reveals:
ls $CADEE_DIR/testing_example/wt
wt_0.tar wt_1.tar wt_2.tar wt_3.tar
CADEE sequentially names the simpacks [mutant-name]_[X].tar,
with X in [0,1,2,3,...], where X denotes the replica number for
each individual trajectory (for example wt_0.tar, wt_1.tar, wt_2.tar,
wt_3.tar, for the wild-type enzyme). Caution: If the simpack is
untarred, it is important to make sure that this is done inside an
empty (sub-)folder, because simpacks do not contain any directory
structure, only files. It should however not be necessary for users to
unpack simpacks, as this is performed internally by CADEE.
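Should it nevertheless be necessary to inspect a simpack by hand, the caution above can be respected by always unpacking into a freshly created, empty folder. The following sketch uses only the Python standard library; the function name is hypothetical, and it assumes (as stated above) that simpacks contain plain files without any directory structure.

```python
import os
import shutil
import tarfile
import tempfile


def unpack_simpack(simpack_path, parent_dir=None):
    """Unpack a simpack into a new, empty folder and return that folder.

    Only regular files are written, and only under their base name, so
    nothing can end up outside the newly created folder.
    """
    target = tempfile.mkdtemp(prefix="simpack_", dir=parent_dir)
    with tarfile.open(simpack_path) as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            source = archive.extractfile(member)
            destination = os.path.join(target, os.path.basename(member.name))
            with open(destination, "wb") as handle:
                shutil.copyfileobj(source, handle)
    return target
```

A call such as unpack_simpack("wt_0.tar") would then return the path of the new folder, leaving the current directory untouched.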
4.2.2 Preparing for Molecular Dynamics Simulations
Put simply, a simpack contains input files required to run qdyn5 and
qfep5, both of which are utilities that are needed in order to
perform and analyze EVB simulations. When CADEE is computing
simpacks, no new files are generated in the simpack folder, but the
simpacks simply increase in size from a couple of megabytes to
gigabytes. It is therefore crucial that the folder containing the
simpacks holds enough free storage to accommodate this. A sim-
pack contains all restart information needed, and if a run is inter-
rupted and later restarted, the simpack alone is enough to restart
the CADEE simulation. Simpacks should in principle not be cor-
rupted, except if CADEE has stopped ungracefully, for example if a
simulation runs out of storage. Clearly, however, in the event that
simpacks are corrupted, then the corrupted simpacks need to be
repaired before proceeding (see the troubleshooting description in
Subheading 8.3).
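Because simpacks grow from megabytes to gigabytes, it is worth checking the free storage of the simpack folder before starting a run. The sketch below is one possible way to do this in Python 3. The function name is hypothetical, and the expected size per simpack (gb_per_simpack) is a placeholder assumption that should be replaced by a value measured for the system at hand.

```python
import shutil


def enough_space(folder, n_simpacks, gb_per_simpack=2.0):
    """Return True if `folder` has room for n_simpacks fully computed simpacks.

    gb_per_simpack is an assumed final size per simpack, not a CADEE constant.
    """
    needed_bytes = n_simpacks * gb_per_simpack * 1024 ** 3
    free_bytes = shutil.disk_usage(folder).free
    return free_bytes >= needed_bytes


# Example: four wild-type replicas in the current folder.
print(enough_space(".", n_simpacks=4))
```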
4.3 Performing Ensemble Simulations Using CADEE
4.3.1 Efficient Way: Saving CPU Time (“CADEE Dyn”)
CADEE includes scripts to automatize parallel computation, and in
a standard CADEE simulation several simpacks are computed at
once. When in doubt, the user should refer to their supercomputing
support team, as scripts may need to be adapted to the computing
resources available. For simplicity, we assume that an interactive
session is available and/or the user is able to run CADEE locally.
For the working example presented here, a four-core allocation is
required.
The following command will copy the wild-type simulation
that was prepared above in Subheading 4.2.1 into a new folder
and start the simulation. Note: Depending on the architecture and
speed of the computing resource, this computation may need up to
2 h to complete on a four-core machine.

mkdir -p $HOME/global/cadee_tutorial
cp -r $CADEE_DIR/testing_example/* $HOME/global/cadee_tutorial
# you may need to adjust mpirun
mpirun.mpich -np 5 cadee dyn $HOME/global/cadee_tutorial/wt | tee cadee.log
This command will launch “cadee dyn” with four working tasks
(plus one for input/output). Note that the log file will be only
written to standard out (the console) by default, and when using
the “| tee cadee.log” part of above command, the log is additionally
written to cadee.log. We note that the resulting simpack includes all
input files and output files; that is, “cadee dyn” will not generate
special output files, but instead the simpack files will become larger
(for more about simpacks, see Subheading 2.3). Depending on the
mpi implementation used, the “mpirun.mpich” command needs to
be adjusted (possible commands include “srun,” “mpiexec,” or
“mpirun”). The command above should create output similar to:

[...]
Rank 0: Started @ 1506234059.03 MPI Info: enabled: True rank: 0 size: 5
0 - 170924 08:20:59,032 - dyn - INFO - Settings: Path:
$HOME/global/cadee_tutorial/wt, Alpha: None, Hij: None, Force mapping: False.
0 - 170924 08:20:59,032 - dyn - INFO - Add input file wt_2.tar.
0 - 170924 08:20:59,033 - dyn - INFO - Prioritized.
0 - 170924 08:20:59,075 - cadee.dyn.tools - INFO - Committed cadee.db.
0 - 170924 08:20:59,076 - dyn - INFO - Number of simpacks left on queue 3.
1 - 170924 08:20:59,076 - dyn - INFO - Working on
$HOME/global/cadee_tutorial/wt/wt_3.tar.
2 - 170924 08:20:59,197 - dyn.traj - INFO - Next qdyn simulation step initialized.
2 - 170924 08:20:59,227 - dyn.traj - INFO - 01_dyn_seed.inp

0 - 170924 08:20:59,878 - dyn - INFO - Sleeping @ 0.8 s.
0 - 170924 08:21:22,216 - dyn - INFO - Slept for 22.3 seconds.
2 - 170924 08:21:22,162 - dyn.traj - WARNING - Found HOT ATOM’
2 - 170924 08:21:22,199 - dyn.traj - INFO - 02_dyn_no_shake.inp
2 - 170924 08:22:20,817 - dyn.traj - INFO - 03_dyn_warm_1.inp
[...]
1 - 170924 08:32:58,208 - dyn - INFO - Backup timing 0.049s, MB: 9.65 Speed:
196.77 MB/s
1 - 170924 08:32:58,212 - dyn.traj - INFO - 1190_eq.inp
[...]
1 - 170924 09:01:01,811 - dyn.traj - INFO - 1450_fep.inp
[...]
92.55 MB/s
0 - 170924 09:02:05,815 - dyn - INFO - Sending shutdown message to 1
0 - 170924 09:02:05,915 - dyn - INFO - Worker 1 was removed from worker-list:
There are 0 (out of 4) left[...]
0 - 170924 09:02:05,916 - dyn - INFO - Preparing to end this Simulation! Syncing...
0 - 170924 09:02:05,916 - dyn - INFO - Database connection closed.
0 - 170924 09:02:05,916 - dyn - INFO - Removing Temporary Files...
0 - 170924 09:02:05,916 - dyn - INFO - DONE. Exiting
As implied in the log excerpt above, the parallel dynamics
simulation was started with five MPI ranks; the first process (rank 0) is
used as “master” for distributing work, I/O control, and logging
the job. All ranks ≥ 1 are worker processes, performing the actual
number crunching. To lower the load on the file system, parallel
input/output is limited to 8 simpacks reading/writing simultaneously,
except when more than 128 cores are used (then #cores/16
are allowed). The temporary storage is assumed to be set in the
environment variable $CADEE_TMP. If this variable is not set,
/scratch/, then /tmp and then /dev/shm are used. Caution: It is the
user’s responsibility to make sure that enough space is available in the
folder for temporary files.
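The rules in the preceding paragraph can be condensed into three small functions. This is a sketch of the described behavior, not an excerpt from CADEE: the function names are hypothetical, integer division is assumed for #cores/16, and the fallback is assumed to pick the first candidate folder that exists.

```python
import os


def worker_count(mpi_size):
    """Number of worker ranks when rank 0 acts as master."""
    if mpi_size < 2:
        raise ValueError("cadee dyn needs at least 2 MPI ranks")
    return mpi_size - 1


def max_parallel_io(cores):
    """Simpacks allowed to read/write at once (assumes integer division)."""
    return cores // 16 if cores > 128 else 8


def temp_dir(environ=None, isdir=os.path.isdir):
    """Resolve the temporary folder: $CADEE_TMP, else the first existing fallback."""
    environ = os.environ if environ is None else environ
    if environ.get("CADEE_TMP"):
        return environ["CADEE_TMP"]
    for candidate in ("/scratch/", "/tmp", "/dev/shm"):
        if isdir(candidate):
            return candidate
    return None


print(worker_count(5))       # "mpirun -np 5" gives four workers plus one master
print(max_parallel_io(256))
print(temp_dir())
```

For the five-rank example above, this yields four workers and an I/O limit of 8 simultaneous simpacks.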
4.3.2 Saving Wall-Clock Time
In certain scenarios, it is important to get results fast, and to use the
available resources for speed, not for efficiency, such as when a
wild-type reaction needs to be prototyped for a certain enzyme. In such a
case, CADEE ships scripts which need to be adjusted to the user’s
machine. These scripts are located in
$CADEE_DIR/cadee/tools/pcadee.sh and $CADEE_DIR/cadee/tools/srunq.sh,
respectively. Once adapted to the computer system of interest,
they can be launched by:

mkdir -p $HOME/global/cadee_tutorial_wallclock
cp -r $CADEE_DIR/testing_example/* $HOME/global/cadee_tutorial_wallclock
$CADEE_DIR/cadee/tools/pcadee.sh $HOME/global/cadee_tutorial_wallclock/wt
Simpack Folder $HOME/global/cadee_tutorial_wallclock/wt

Will use 4 per simpack.
Will run at most 1 simpacks at one time.
This will use 4 cores from 4.
Will Distribute Jobs and Start Work in 1 Second

===============================================
$HOME/global/cadee_tutorial_wallclock/wt/wt_1.tar
wt_1.tar @0 > Start: Son Sep 24 09:45:04 CEST 2017
wt_1.tar @0 > You supplied a SIMPACK:
wt_1.tar @0 > Unpacking Simpack
($HOME/global/cadee_tutorial_wallclock/wt/wt_1.tar) to tmpdir (/tmp/8788).
wt_1.tar @0 >
wt_1.tar @0 >
wt_1.tar @0 > ###########
wt_1.tar @0 > # CONFIG: #
wt_1.tar @0 > ###########
wt_1.tar @0 > bkp int: 540
wt_1.tar @0 > simpack: $HOME/global/cadee_tutorial_wallclock/wt/wt_1.tar
wt_1.tar @0 > cores: 4
wt_1.tar @0 > exe: mpiexec -n 4 $HOME/bin/qdyn5p
wt_1.tar @0 > md5sum: 6219dabb4f56f72bb914c1cb159f79a2
$HOME/bin/qdyn5p
wt_1.tar @0 > workdir: /tmp/8788

wt_1.tar @0 >
wt_1.tar @0 >
wt_1.tar @0 >
wt_1.tar @0 > Working Directory; localhost:/tmp/8788
wt_1.tar @0 > Preparing 01_dyn_seed ...
wt_1.tar @0 > Running MD Simulation on 01_dyn_seed.inp ...
wt_1.tar @12 > Finished: 01_dyn_seed.log
wt_1.tar @12 > Zipping.
wt_1.tar @12 > Backup Skipped.
wt_1.tar @12 > Working Directory; localhost:/tmp/8788
wt_1.tar @12 > Preparing 02_dyn_no_shake ...
wt_1.tar @12 > Running MD Simulation on 02_dyn_no_shake.inp ...
wt_1.tar @35 > Finished: 02_dyn_no_shake.log
[...]
wt_1.tar @517 > Running MD Simulation on 1260_fep.inp ...
wt_1.tar @542 > Finished: 1260_fep.log
wt_1.tar @542 > Backup Complete, Duration: 0, [ Son Sep 24 09:54:06 CEST 2017 ]
[...]
wt_1.tar @967 > Running MD Simulation on 1450_fep.inp ...
wt_1.tar @986 > Finished: 1450_fep.log
wt_1.tar @986 > Backup Skipped.
wt_1.tar @986 > Backup Complete, Duration: 0, [ Son Sep 24 10:01:31 CEST 2017 ]
[...]
wt_1.tar @986 > All OK.
wt_1.tar @986 > End: Son Sep 24 10:01:31 CEST 2017
wt_1.tar @986 > Duration: 986 s
Cleanup Done.
[...]
[...]
No Simpacks left. Terminating after 4424.
5 Pedagogical Examples
In this section, we demonstrate the commands required to reproduce
the pedagogical examples from our original CADEE publication [14].
While both the preparation and the analysis of the example can be
performed on a laptop computer, we strongly advise the user to run the
dynamics simulations on a supercomputing cluster: each simpack needs
several days to finish and this example involves approximately 1100
simpacks, so laptop and office computers will be inadequate to perform
all required computations in a reasonable time frame. Please note that
CADEE counts residues sequentially, starting at 1, and therefore
renumbers residues when residues are missing. The PDB structure used in
this example (1ney) is missing the first residue, so 1 has to be added
to convert a CADEE residue number into the PDB sequence number. For
better readability, we use CADEE residue numbering throughout the whole
text.
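The renumbering rule can be illustrated with a short Python sketch. The helper names are hypothetical and not part of CADEE, and the constant offset of 1 is only valid for a structure such as 1ney, where a single residue is missing at the start of the chain; structures with internal gaps would need a per-residue mapping instead.

```python
# Offset between CADEE numbering and PDB numbering for 1ney, which lacks
# its first residue (assumption: no further gaps in the chain).
PDB_OFFSET = 1


def cadee_to_pdb(cadee_number: int, offset: int = PDB_OFFSET) -> int:
    """Convert a sequential CADEE residue number to the PDB residue number."""
    return cadee_number + offset


def pdb_to_cadee(pdb_number: int, offset: int = PDB_OFFSET) -> int:
    """Convert a PDB residue number back to CADEE numbering."""
    return pdb_number - offset


# The three hotspot positions used later in this chapter.
print([cadee_to_pdb(n) for n in (92, 163, 171)])
```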
5.1 Example: Alanine Scan
Preparing simpacks for an alanine scan is straightforward once the
system has been correctly prepared and benchmarked. The argument
needed to run an alanine scan using CADEE is “--alascan”.
Optional parameters are “--radius” (mutate all residues within a
certain radius around the center of the simulation sphere) and
“--nummuts” (prepare alanine scan inputs for the N innermost
residues, increasing the radius around the simulation center until N
residues are included).

cp -r $CADEE_DIR/example $CADEE_DIR/pedagogical_example
cd $CADEE_DIR/pedagogical_example
cadee prep wt.pdb wt.fep wt.qpinp libraries --alascan --nummuts 48
For example, in the original CADEE paper [14], we used --nummuts 48
to make sure that a full compute node of the Abisko cluster at HPC2N
in Umeå [42] was used.

Determining Radius needed to accommodate 48 mutants. Please
wait...Done! Radius is 14.24
[...]
INFO:root:Preparing alascan.
INFO:prep.alascan:Won’t mutate residue, contains FEPatoms: 164
INFO:prep.alascan:Look up center_xyz in qprep5inp -> 23.311
42.835 14.513
INFO:prep.alascan:Won’t mutate residue, contains FEPatoms: 495
INFO:prep.alascan:Mutate (’ARG’, ’97’) to ALA
[...]
INFO:prep.alascan:Mutate (’VAL’, ’6’) to ALA
INFO:prep.create_inputs:Creating input files for LEU235ALA .
[...]
INFO:prep.create_inputs:Creating input files for LEU92ALA .
INFO:root:Packing LEU235ALA:
INFO:root:Packing SER95ALA:

[...]
Success! You find your simpacks in $CADEE_DIR/pedagogical_example/ala_scan .
This code snippet will generate the simpacks for 48 protein variants
(including WT) and will create four seeds per input (a total of 192
simpacks).
Next, to actually perform the computational alanine scan, the
simpacks should be copied to a location with plenty of storage (a
minimum of 3 gigabytes per simpack is recommended); the simulation may
then be started:

cp -r $CADEE_DIR/pedagogical_example $HOME/global
mpirun.mpich -np 193 cadee dyn $HOME/global/pedagogical_example/ala_scan
$HOME/global/pedagogical_example/ala_scan, Alpha: None, Hij: None, Force
mapping: False.
0 - 170924 10:24:12,531 - dyn - INFO - Add input file LEU12ALA_2.tar.
[...]
0 - 170924 10:24:12,531 - dyn - INFO - Add input file CYS40ALA_1.tar.
0 - 170924 10:24:12,532 - dyn - INFO - Prioritized.
[...]
[# All 192 (=48x4) simpacks need to be processed.]
[# We recommend using 192 cores and approx. 10 days wallclock time]
[# Alternatively, the job can be split into smaller parts, for example 4x48]
[...]
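The bookkeeping behind the numbers in this example (192 simpacks, 193 MPI ranks, at least 3 gigabytes per simpack) can be checked before submitting a job. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical and not part of CADEE, and the rule "one rank per simpack plus one master rank" is inferred from the -np 193 used in the command above.

```python
import shutil


def alascan_requirements(n_variants: int = 48, seeds: int = 4,
                         gb_per_simpack: float = 3.0) -> dict:
    """Estimate simpack count, MPI ranks, and minimum storage for a scan."""
    n_simpacks = n_variants * seeds
    return {
        "simpacks": n_simpacks,
        # One worker per simpack plus the master rank (assumption based on
        # the "-np 193" in the example command).
        "mpi_ranks": n_simpacks + 1,
        "min_storage_gb": n_simpacks * gb_per_simpack,
    }


def has_enough_space(path: str, required_gb: float) -> bool:
    """Check whether the file system holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


req = alascan_requirements()
print(req)
```

A call such as `has_enough_space("/path/to/storage", req["min_storage_gb"])` could then be used as a guard before copying the simpacks; the path shown here is a placeholder.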
Simulating one simpack on modern hardware usually takes between
1 week and 10 days. Older CPUs and/or a larger simulation sphere
increase the simulation time to 2 weeks or longer. After the
simulation is finished, an analysis tool is available to specifically
analyze the alanine scan and to prepare the next simulation. The
activation free energies can be evaluated using the standard EVB
mapping procedure, with user-defined off-diagonal elements (Hij) and
gas-phase shifts (α). In the present case, we are using
triosephosphate isomerase as our model system, following our initial
CADEE paper [14]. The off-diagonal and gas-phase shift parameters have
been previously published in the Supporting Information of Ref. [14],
and were calibrated to 60.0 (Hij) and 229.0 (α) kcal mol⁻¹,
respectively. CADEE will automatically prepare the free energy mapping
of all EVB simulations performed, provided that the aforementioned EVB
parameters have been defined on the command line.

mpirun.mpich -n 2 cadee dyn \
$HOME/global/pedagogical_example/ala_scan \
--hij 60.0 --alpha 229.0 --force
This will add the mapping output to cadee.db. In case a new EVB free
energy mapping should be enforced, the --force flag can be used. This
might be advisable if the free energy mapping was interrupted
ungracefully, or when the initial EVB parameters provided as input
need to be corrected. CADEE will then first remove the old EVB mapping
results and subsequently restart the mapping all over again (apart
from the mapping files, no other files will be deleted by
--force_map).

$HOME/global/pedagogical_example/ala_scan, Alpha: 229.0, Hij: 60.0, Force
mapping: True.
0 - 170924 14:54:16,618 - dyn - INFO - Add input file ARG97ALA_0.tar.
[...]
1 - 170924 14:54:16,937 - dyn - INFO - Working on $HOME/global/pedagogical_example/ala_scan/ARG97ALA_1.tar.
[...]
1 - 170924 14:54:57,647 - dyn.ana - INFO - deleting old .qana.mapped files...
1 - 170924 14:55:02,504 - dyn.ana - INFO - ARG97ALA 1 medium 1190_eq dGa:
14.67 dG0: 9.25
1 - 170924 14:55:07,619 - dyn.ana - INFO - ARG97ALA 1 medium 1650_eq dGa:
14.97 dG0: 9.34
[...]
0.03 MB/s
0 - 170924 19:57:23,050 - dyn - INFO - Sending shutdown message to 1

0 - 170924 19:57:23,151 - dyn - INFO - Worker 1 was removed from worker-list:
There are 0 (out of 1) left
[...]
0 - 170924 19:57:23,151 - dyn - INFO - Preparing to end this Simulation!
Syncing...
0 - 170924 19:57:23,289 - dyn - INFO - Database connection closed.
0 - 170924 19:57:23,289 - dyn - INFO - Removing Temporary Files...
0 - 170924 19:57:23,291 - dyn - INFO - DONE. Exiting
5.2 Example: Automated Analysis of a CADEE Alanine Scan
The analyse.py script is used to perform automated analysis of
CADEE alanine scans. This script can be called by:
cadee ana alanize cadee.db
firefox index.html

# Firefox will be started and a web ui can be used.
This will create a file index.html, which can be downloaded and
opened in the browser of the user’s choice. The web user interface
(UI), which is also shown in Fig. 2, is intuitive to use: the user may
select the residues to be mutated next directly through this
interface, and the HTML interface will present the user with
information on how to proceed.
The web user interface allows the user to click on a residue and
then choose the action that should be performed on it. The input
needed to run cadee.py will then be displayed in the “CADEE command”
section.
5.3 Example: Manual Analysis of CADEE Simulations
In some cases, the user might want the raw data from a CADEE
simulation in order to perform analysis with their own post-processing
scripts, and to avoid information overload from the cadee.db files
(which are in sqlite3 format). For those cases, we provide a script
to convert the activation energies or the free energies to the
comma-separated value (csv) file format. The corresponding data
can then be opened with any relevant spreadsheet software (see also
Fig. 3).

cadee ana csv cadee.db activation_barriers.csv #dG*
cadee ana csv_exo cadee.db free_energy.csv #ddG
/bin/ls

Exporting Barrier
Success... Wrote activation_barriers.csv...
Exporting deltaG
Success... Wrote free_energies.csv...
activation_barriers.csv free_energy.csv
Fig. 2 A screenshot of the web user interface for the analysis of alanine scans. CC-BY adapted from Ref. [14]
Fig. 3 The initial alanine scan performed on triosephosphate isomerase (TIM). The alanine scan was prepared
with the --nummuts argument as follows: After the initial system setup, 48 residues distributed radially around
the center of the simulation sphere that were neither alanine nor glycine were selected, and hydrogen atoms
and heavy atoms other than backbone atoms and Cβ were removed. Each variant was subsequently
re-solvated and the CADEE simulation was started. From the displayed data, three positions were selected
to start the next simulations: L93, Y164, and T172, respectively, in positions 92, 163, and 171 in CADEE (see
main text). CC-BY adapted from Ref. [14]
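Because cadee.db is a regular sqlite3 file, it can also be inspected directly with the Python standard library. The sketch below makes no assumption about the table or column names used by CADEE (they are not documented in this chapter); it simply lists whatever tables and columns a given database contains. The function name and the demonstration table are hypothetical.

```python
import sqlite3


def describe_database(con: sqlite3.Connection) -> dict:
    """Return {table name: [column names]} for every table in the database."""
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    layout = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the name.
        columns = con.execute(f'PRAGMA table_info("{table}")').fetchall()
        layout[table] = [col[1] for col in columns]
    return layout


# Demonstration on an in-memory database with a made-up table. For real
# data one would open the file instead, e.g. sqlite3.connect("cadee.db").
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE demo_results (mutant TEXT, barrier REAL)")
print(describe_database(demo))
demo.close()
```

Note that `sqlite3.connect` silently creates an empty file if the given path does not exist, so it is worth checking the path before opening a real database.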
5.4 Example: Concatenating cadee.db Files
In some cases, the user might want to merge cadee.db files (which
are in sqlite3 format). For those cases, we provide a script to
concatenate two or more cadee.db files.

cadee ana cat cadee.db cadee.db1
/bin/ls *.db

[...]
0 - 170924 20:27:49,213 - cadee.dyn.tools - INFO - Committed
cadee.db.
cadee1.db
cadee2.db concat_cadee.db
5.5 Example: Point Saturation Mutagenesis
Once a promising hotspot site has been identified, one way to
continue the CADEE analysis is to perform computational combinatorial
saturation mutagenesis on it (Fig. 4). This can be simplified
by using a reduced set of amino acids for the calculations (see Reetz
[43]), and CADEE supports different amino acid libraries for this
purpose. A list of the amino acid libraries implemented in
CADEE is shown in Table 1.
In the current working example, we have saturated three positions
to all 20 natural amino acids; the runs were initiated as follows:

cd $CADEE_DIR/pedagogical_example
cadee prep wt.pdb wt.fep wt.qpinp libraries --libmut 92:SATURATE \
--libmut 163:SATURATE --libmut 171:SATURATE
Fig. 4 The pedagogical example and point saturation mutagenesis of residues 93, 164, and 172, compared to
the wild-type simulation on the left. The data has been sorted by residue number. The free energy profiles of
the L93W, L93Q, L93R, Y164R, T172R, L93K, Y164K, and T172K variants did not converge. CC-BY adapted
from Ref. [14]
Table 1
CADEE’s built-in amino acid libraries, with the associated shortcut(s)
and the one-letter amino acid codes of each libraryᵃ

Shortcut                      Library used
All, saturation               ARNDCEQGHILKMFPSTWYV
NDT                           FLIVYHNDCRSG
Special                       CGP
Hydrophobic                   AVILMFYW
Minus, negative, charged-     DE
Plus, positive, charged+      RHK
Charged                       DERHK
Neutral                       STNQ
Custom sequence of one-letter amino acid codes:
“AVILME”                      AVILME
“AAVILME”                     Error (cannot use A 2×)

ᵃ When “cadee prep” is launched with the --libmut argument, the designated amino acid
position will be mutated to a library of amino acids. For example, “cadee prep . . . --libmut
92:ALL” will saturate position 92, mutating it to all 20 natural amino acids, whereas “cadee
prep . . . --libmut 92:SPECIAL 163:MINUS” will prepare a combinatorial saturation
mutagenesis run (92(wt/C/P/G) combined with 163(wt/D/E), i.e.,
[3 + 1] × [2 + 1] = 12 mutants)
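The arithmetic in the footnote of Table 1 can be reproduced with a small Python sketch. It is an illustration only: the dictionary transcribes one shortcut per library from Table 1, the function names are hypothetical, and the way CADEE itself parses and validates library keywords is not shown in this chapter. The wild-type residue is counted once per position, so a full saturation gives 20 options rather than 21.

```python
from math import prod

AMINO_ACIDS = set("ARNDCEQGHILKMFPSTWYV")

# Shortcut -> one-letter codes, transcribed from Table 1.
LIBRARIES = {
    "ALL": "ARNDCEQGHILKMFPSTWYV",
    "NDT": "FLIVYHNDCRSG",
    "SPECIAL": "CGP",
    "HYDROPHOBIC": "AVILMFYW",
    "MINUS": "DE",
    "PLUS": "RHK",
    "CHARGED": "DERHK",
    "NEUTRAL": "STNQ",
}


def resolve_library(spec: str) -> str:
    """Expand a shortcut, or validate a custom one-letter sequence."""
    key = spec.upper()
    residues = LIBRARIES.get(key, key)
    if not set(residues) <= AMINO_ACIDS:
        raise ValueError(f"unknown amino acid code in {spec!r}")
    if len(set(residues)) != len(residues):
        raise ValueError(f"duplicate amino acid in {spec!r}")
    return residues


def options_at(wild_type: str, spec: str) -> int:
    """Number of residues tested at one position, wild type included once."""
    return len(set(resolve_library(spec)) | {wild_type})


def count_variants(positions) -> int:
    """Number of combinatorial variants for [(wild_type, spec), ...]."""
    return prod(options_at(wt, spec) for wt, spec in positions)


# Footnote example: position 92 (wt L) with SPECIAL, 163 (wt Y) with MINUS.
print(count_variants([("L", "SPECIAL"), ("Y", "MINUS")]))
```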
In the original CADEE paper [14], we used --nummuts 48 to make sure
that a full compute node of the Abisko cluster at HPC2N in Umeå [42]
could be used. The --libmut keyword calls SCWRL4 to prepare inputs for
the arbitrary point mutations. As arguments, CADEE expects a residue
number followed by a colon and then a keyword or a sequence of
one-letter amino acid codes. The above input would hence mutate the
residues at positions 92, 163, and 171 to all 20 natural amino acids.
This command results in 3 × 20 × 4 = 240 simpacks. Note that
throughout the main text we are using CADEE’s sequential residue
numbers; these correspond to residues 93, 164, and 172 in the PDB
structure.

INFO:root:Preparing libmut - LIBrary MUTatagenesis.
92:SATURATE (+ native/wt)
[...]
INFO:prep.pyscwrl:Clash-Score was: 5.262, will now re-run and
allow Scwrl4 to modify residues [90, 92]
INFO:prep.pyscwrl:Clash-Score new: 0.0 ==> Keep.
[...]
The code output snippet above describes what CADEE is doing: First,
it creates a mutant and computes a “Clash-Score” (which grows
exponentially with the overlap of the van der Waals radii). CADEE
then allows Scwrl4 to realign the residues in the clash zone (i.e.,
residues 90 and 92) and reruns Scwrl4. As no steric clashes remain
after this rerun, CADEE keeps the new configuration (for details
about the Scwrl4 calculations, see the Scwrl4 publication [24]).
[...]
INFO:root:Working on $CADEE_DIR/pedagogical_example/libmut/Y163W.
INFO:prep.pyscwrl:Clash-Score was: 0.218, will now re-run and allow Scwrl4 to modify
residues [163, 183, 667]
INFO:prep.pyscwrl:Clash-Score new: 0.460648886108 ==> Rollback!
[...]
In this case, the initial clash was minor, and allowing SCWRL4 to
realign residues 163, 183, and 667 did not improve the Clash-Score,
so CADEE reverts to the original alignment (“Rollback!”).
INFO:root:Working on $CADEE_DIR/pedagogical_example/libmut/L092D.
INFO:prep.pyscwrl:No clashes detected.
[...]
In the above case, changing leucine to an aspartic acid did not
cause clashes. CADEE hence continues with this suggestion.
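The three outcomes shown in the log excerpts above (no clashes, “Keep”, and “Rollback!”) suggest a simple decision rule, sketched here in Python. This is an interpretation of the log output and not CADEE's actual implementation: the function name is hypothetical, and the treatment of an unchanged score (counted here as a rollback) is an assumption.

```python
from typing import Optional


def clash_decision(initial_score: float,
                   rerun_score: Optional[float] = None) -> str:
    """Decide which side-chain placement to use, following the log excerpts."""
    if initial_score == 0.0:
        # "No clashes detected.": the first placement is used as it is.
        return "no clashes"
    if rerun_score is None:
        raise ValueError("a clash was found, so a rerun score is required")
    # Keep the rerun only if it lowered the score; an equal score is
    # treated as a rollback (assumption, not visible in the logs).
    return "keep" if rerun_score < initial_score else "rollback"


# Values taken from the log excerpts in this section.
print(clash_decision(5.262, 0.0))
print(clash_decision(0.218, 0.460648886108))
print(clash_decision(0.0))
```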
[...]
INFO:prep.create_inputs:Creating input files for Y163R.
[...]
INFO:prep.create_inputs:Creating input files for T171H.
INFO:root:Packing L092V:
INFO:root:Packing Y163M:
[...]
ample/libmut.
Finally, the generated simpacks need to be computed with
“cadee dyn” (not shown).
Fig. 5 Data obtained through partial combinatorial saturation mutagenesis at positions 93(A/G/H), 164(S/P/H/
E/C/A), and 172(W/S/R/L/D). ΔG‡ denotes the calculated activation free energies for each variant, and the
error bars denote the standard deviation over 4 × 6 = 24 EVB trajectories per variant. As displayed, some
variants have very large uncertainty in the calculated values. These instabilities can be caused by different
factors or combinations of factors (for example structural instabilities caused by the insertion of the new
residue or insufficient equilibration time). To improve the equilibration, the trend within data collection could
be studied and longer simulations conducted, or the current ones extended, or additional simulations
performed. CC-BY adapted from Ref. [14]
5.6 Example: Combinatorial Saturation Mutagenesis
After a reduced set of interesting amino acids has been selected by
the user, combinatorial saturation mutagenesis can be performed
(Fig. 5) to screen whether the selected mutations are additive or,
when introduced at the same time, cause a larger effect than the
individual mutations (epistasis). Note that CADEE was written with the
aim of testing the saturation mutagenesis of several different
residues together with a single command: “cadee prep . . . --libmut”.
We have therefore decided to use the results obtained from individual
point saturation at each of the three positions, choosing a subset
of amino acids to be tested at each hot spot:

mv libmut point_saturation
cadee prep wt.pdb wt.fep wt.qpinp libraries --libmut 92:'AGH' \
163:'SPHECA' 171:'WSRLD'
This would prepare 4 × 7 × 6 = 168 enzyme variants
(× 4 seeds = 672 simpacks), as we demonstrated in the case of the
published triosephosphate isomerase data [14].
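The count of 168 variants can be verified by enumerating the combinations explicitly. The Python sketch below is an illustration and not CADEE code: the function name is hypothetical, the wild-type residues (L92, Y163, T171 in CADEE numbering) are taken from the text, and the variant labels merely imitate the style seen in the log output of this section (for example L092A-Y163S-T171W); the label "wt" for the unmutated enzyme is an assumption.

```python
from itertools import product

# (CADEE position, wild-type residue, residues to test) for each hot spot.
POSITIONS = [(92, "L", "AGH"), (163, "Y", "SPHECA"), (171, "T", "WSRLD")]
SEEDS = 4  # four seeds per variant, as in the alanine scan example


def variant_names(positions):
    """Enumerate all combinations, keeping the wild type as one option."""
    # None stands for "leave this position as wild type".
    choices = [[None] + list(library) for _, _, library in positions]
    names = []
    for combo in product(*choices):
        parts = [f"{wt}{pos:03d}{new}"
                 for (pos, wt, _), new in zip(positions, combo)
                 if new is not None]
        names.append("-".join(parts) or "wt")
    return names


names = variant_names(POSITIONS)
print(len(names), len(names) * SEEDS)
print(names[:3])
```

Enumerating the names in this way also makes it easy to check afterwards that a simpack exists for every expected variant.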

INFO:root:Preparing libmut - LIBrary MUTatagenesis.
92:AGH (+ native/wt)
163:SPHECA (+ native/wt)
171:WSRLD (+ native/wt)
INFO:root:[(92, [’A’, ’G’, ’H’]), (163, [’S’, ’P’, ’H’, ’E’,
’C’, ’A’]), (171, [’W’, ’S’, ’R’, ’L’, ’D’])]
[...]
INFO:root:Working on $CADEE_DIR/pedagogical_example/libmut/
Y163A.
INFO:prep.pyscwrl:No clashes detected.
[...]
Y163S-T171W.
allow Scwrl4 to modify residues [163, 171, 519, 547, 549,
551, 730, 767, 771]
[...]
L092A-Y163S-T171W.
549, 551, 730, 767, 771]
[...]
L092H-Y163C-T171R.
562, 730, 767, 771]
[...]
L092H-Y163A-T171D.
allow Scwrl4 to modify residues [90, 92, 163, 171, 730, 767]
[...]
INFO:prep.create_inputs:Creating input files for L092G.
[...]
INFO:prep.create_inputs:Creating input files for Y163C-T171L.
[...]
INFO:prep.create_inputs:Creating input files for L092H-Y163S-T171D.
[...]
INFO:root:Packing L092A-Y163S:
INFO:root:Packing L092A-Y163H-T171R:
INFO:root:Packing Y163A:
[...]
ample/libmut.
Finally, the generated simpacks need to be computed with
“cadee dyn” (not shown).
5.7 CADEE Customization
CADEE provides a straightforward and fast way to generate and
test hundreds to thousands of mutants of a well-parameterized
EVB reaction. To generate simpacks, CADEE relies on
“simpack-templates”: Currently, one simpack-template is included as a
default, and a second one has been used in Subheading 4.2. We
strongly recommend that users examine the existing templates and
adjust them to their requirements. Additional templates and
documentation (readme.md) are available in
$CADEE_DIR/simpack_templates/.
6 Limitations of CADEE
In the case of multistep reaction profiles where only the initial
rate-limiting step was subjected to CADEE evolution, we recommend
taking the best CADEE hits and running EVB on all other reaction
steps to ensure that the proposed residue substitutions do not cause
a change in the rate-limiting step. We also recommend running
additional (and longer) simulations for the best hits identified,
both to improve the quality of the predictions obtained and to reduce
the risk of false positives due to insufficient sampling time.
6.1 Specific Limitations of CADEE v 0.9:
• Thermalization (initial system heating) and equilibration are
performed at the putative transition state, i.e., at λ = 0.5 along
the reaction coordinate.
• The automatic EVB mapping does not currently support more
complex functional forms for the off-diagonal term.
• Rank 0 is reserved for input/output and does not perform
calculations.
• The --trajcsv argument is not officially supported.
• Temperature averages are not extracted from Qdyn6 log files.
• Arbitrary mutations are only obtained using SCWRL4; other
tools are not currently available, but can be implemented by
adding additional modules.
7 How to Deploy CADEE Effectively
Protein structures are complex, and efficiently predicting
functionally beneficial mutations therefore requires a well-defined
strategy. The following protocol could be used to evaluate the residue
alterations that result from the CADEE analysis. Here, if multiple
domains are encoded in the sequence, the scores could be individually
estimated for each of the domains.
1. Selecting and curating the native enzyme structure:
(a) This is the most important step for the ultimate accuracy of
the simulations, as the structural topology is used by CADEE for
the EVB analysis. Hence, the selected protein structure should
be the experimentally solved structure or the closest predicted
model with a well-defined near-native topology for structurally
continuous or discontinuous domain(s), to accurately define the
active-site contours.
(b) The active-site charge and proton configuration for the
reactive residues should be exactly the same as the
enzyme’s native state actually interacting with the reac-
tant, as it would accurately guide the EVB computation by
considering the polarization effects of such charges, and
result in biologically meaningful ΔΔG scores.
2. Pre-analyzing the considered structure before the CADEE
analysis:
(a) Selecting the top-ranked set of homologs by screening the
protein sequence and structure databases, using information
retrieved from additional datasets, viz. structural
classification of proteins (SCOP [44]), evolutionary
classification of protein domains (ECOD [45]), protein family
(Pfam [46]), conserved domain database (CDD [47]), SMART [48],
and TIGRFAM [49], and applying culling options to reduce
spurious redundant hits. Together, this information supports
the selection of a good candidate that has diversified in its
sequence evolution while retaining the structural topology of
the considered sequence, which is especially useful when a
well-annotated database does not exist.
(b) Evaluating the co-localized residues with sequence and
structural penalty-based profiles, and iterating this step
for the final top-ranked hits, as this easily resolves the
alignment shift errors usually incurred when distant and close
hits are considered simultaneously. An easy way to assess
accuracy is the sequence overlap of the functionally conserved
and active-site residues.
(c) Evaluating the residue propensity at each locus in correla-
tion with the mean scores scattered across the chain in the
constructed profile to further correlate with the CADEE
results.
(d) Evaluating the biophysical restraints imposed on the sec-
ondary/tertiary structure by altering a given residue.
(e) Estimating whether the considered mutation overlaps
with the structural state observed in the closest hits
through tools like PSIPRED [50, 51].
(f) Robustly evaluating the stability of the sequence, if some
mutation(s) are considered. This is because the native
energetic Z-score of a structure should be well predicted
prior to the test: if a mutation forcefully decreases that
level, the extent of the decrease could be devastating for the
model, whereas an increase should not simply be dismissed as
meaningless. Currently, CADEE evaluates the structural topology
for localized atomic clashes through SCWRL4 to produce the
best set of ΔΔG scores. If any of these steps is not
implemented properly and the protocol accumulates errors that
subsequently propagate into an alteration of the protein
structure, especially in or near the demarcated sphere zone,
CADEE will ultimately yield meaningless results; care is
therefore crucial.
(g) Evaluating the CADEE results through coevolving resi-
due propensities and substitution matrices would allow
the CADEE-generated top-ranked scores to be justified
with the underlying biochemical mechanism defining the
improved activity or stability.
(h) The resulting mutations should be considered and evaluated
biochemically to elucidate their plausible role in either
stabilizing the enzyme structure or improving the enzymatic
activity.
8 Troubleshooting
8.1 CADEE Installation
A successful CADEE installation ends with this line:

Finished processing dependencies for cadee
If the end of the installation output does not closely resemble this
line, something in the setup did not work. The user is advised to
double-check that all the dependencies are installed, and to try to
(re-)install CADEE.
8.2 CADEE First Start
If the first start of the CADEE script fails, two things could have
gone wrong:
1. Problem: Setup failed. The following script can be used to check
if CADEE is installed properly.

ok=1
msg="\n\n\n\n\n"
# Check 1: can Python import the cadee module?
python -c 'import cadee'
if [[ $? -ne 0 ]]
then
    msg="$msg\nUnable to load cadee module!"
    ok=0
fi
# Check 2: is the cadee executable in $PATH? (exit status 127 = command not found)
cadee > /dev/null
if [[ $? -eq 127 ]]
then
    msg="$msg\nUnable to locate 'cadee' in \$PATH."
    ok=0
fi
# Report the collected messages.
if [[ $ok -eq 1 ]]
then
    msg="$msg\nCADEE is installed."
else
    msg="$msg\nCADEE is *NOT* installed!
DIAGNOSIS:"
fi
echo -e "\n\n$msg"
CADEE is installed.
2. Problem: Debian or Ubuntu are being used, and the Python
script path is not in the user’s $PATH. If the above script
produces the output "Unable to locate 'cadee' in $PATH."
but not "Unable to load cadee module!", then the issue might
be the $PATH variable.

# ONLY FOR DEBIAN/UBUNTU
echo "$PATH" | grep "$HOME/.local/bin" || echo 'Please fix $PATH.'
Please fix $PATH.
Solution: If the user is using Debian or Ubuntu, the Python script
path might be missing from $PATH. This can be fixed by adding the
following line to the end of the $HOME/.profile file (using the
user’s favorite text editor):
PATH="$HOME/.local/bin:$PATH"
Alternatively, the following script can be executed:

# Ubuntu / Debian ONLY
if [ -d "$HOME/.local/bin" ] ; then
    ok=0
    # Is $HOME/.local/bin already listed in $PATH?
    echo "$PATH" | grep -q "$HOME/.local/bin:" && ok=1
    echo "$PATH" | grep -q ":$HOME/.local/bin" && ok=1
    if [ $ok -eq 0 ]
    then
        echo 'export PATH="$HOME/.local/bin:$PATH"' >> "$HOME/.profile"
        source "$HOME/.profile"
        echo 'Added $HOME/.local/bin to .profile.'
    else
        echo 'Stop: $HOME/.local/bin is already in your $PATH.'
    fi
else
    echo 'Stop: $HOME/.local/bin is not a directory. '
fi
Added $HOME/.local/bin to .profile.
8.3 Simpacks
On some computer systems, the compute time needed to perform one
simulation exceeds the wall-clock limit. CADEE is hence able to
restart and continue simulations: as there is no special restart
flag, the user can simply resubmit the original submission file to
continue the simulation with the same command. The most
straightforward way to detect unfinished simpacks is to compare the
simpack sizes (/bin/ls -lS). Sometimes, however, a node may have
crashed, or a hard disk quota may have been hit, and hence a simpack
may be faulty and not finish even with enough wall-clock time
available. In those cases, it is advisable to untar the simpack and
repack it. A script to do this is:

cadee tool repair_simpack /$HOME/global/cadee_tutorial/wt/\
wt_0.tar
[...]
1. Searching duplicate logfiles:
2. Searching duplicate energy files:
3. Searching missing restartfiles:
4. Searching damaged logfiles:
5. Search for logfiles lacking ’terminated normally’:
6. Search for gzipped logfiles lacking ’terminated normally’:
/$HOME/global/cadee_tutorial/wt/wt_0.tar:
No Problems with this simpack. Awesome!
If the script stops because of a bad tar archive (and it is certain
that it is a simpack issue), the --force flag may be used to force
repacking of the archive. CAUTION: If the parameter is not actually
a simpack, the script will behave unpredictably, which may lead to
data loss, especially if applied with the --force flag. The flag is
especially helpful after crashes caused by disk space shortages. When
it is used, the faulty simpack tarballs are unpacked, faulty files
are deleted, and a new, uncorrupted archive is written back to the
disk. Caution: The original simpack will be overwritten during this
process.
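The size comparison suggested above (/bin/ls -lS) can also be automated. The following Python sketch flags simpacks that are noticeably smaller than the typical simpack in a folder. It is a heuristic illustration only: the function names and the 90% threshold are arbitrary assumptions, not CADEE defaults, and a small archive is merely a hint that a simpack may be unfinished, not proof.

```python
from pathlib import Path
from statistics import median


def flag_small(sizes: dict, fraction: float = 0.9) -> list:
    """Return names whose size is below `fraction` of the median size."""
    if not sizes:
        return []
    cutoff = fraction * median(sizes.values())
    return sorted(name for name, size in sizes.items() if size < cutoff)


def small_simpacks(folder: str, fraction: float = 0.9) -> list:
    """Collect the sizes of all *.tar files in `folder` and flag small ones."""
    sizes = {p.name: p.stat().st_size for p in Path(folder).glob("*.tar")}
    return flag_small(sizes, fraction)


# Demonstration with made-up sizes (in bytes); no files are read here.
print(flag_small({"wt_0.tar": 3_000, "wt_1.tar": 3_100, "wt_2.tar": 400}))
```

Simpacks flagged in this way are candidates for a closer look with the repair script described in this section.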
8.3.1 Simpack Customization
A simpack contains all files necessary to perform an EVB simulation
with Q. A minimal simpack hence contains files for the
(1) initialization, (2) thermalization/heat-up, (3) equilibration,
(4) free energy perturbation/empirical valence bond computation, and
(5) empirical valence bond free energy mapping. More detailed
information about simpacks and how they can be customized can be
found in $CADEE_DIR/simpack_templates/readme.md.
9 Overview and Conclusions
We recently developed a comprehensive and widely automated
toolkit for the computer-aided directed evolution of enzymes
(CADEE), freely available for download from GitHub at the following
link: https://github.com/kamerlinlab/cadee. The theoretical
background to CADEE has been described in detail in Ref. [14]. The
current contribution provides detailed protocols for different types
of simulations supported by CADEE, as well as relevant snippets of
code input/output, and when used in conjunction with the original
CADEE publication [14] provides a comprehensive overview of the
current scope, limitations, and future prospects of CADEE.
Acknowledgments
The European Research Council provided financial support under
the European Community’s Seventh Framework Programme
(FP7/2007-2013)/ERC Grant Agreement 306474. SCLK
would also like to thank the Knut and Alice Wallenberg Foundation
and the Royal Swedish Academy of Sciences for a Wallenberg
Academy Fellowship, and the Swedish Research Council for
providing support through project grant 2015-04928. All
calculations were performed on the Abisko cluster at the HPC2N center
in Umeå and on the Triolith cluster at the NSC in Linköping, thanks
to a generous supercomputing allocation provided by the Swedish
National Infrastructure for Computing (SNIC grant 2015/16-12). In
addition, we would like to thank Arina Gromova for extensive testing
of CADEE, Fabian Steffen-Munsberg for initial testing,
and Miha Purg for helpful discussions about qscripts/qtools.
References
1. Bornscheuer UT (1998) Directed evolution of enzymes. Angew Chem Int Ed 37:3105–3108
2. Bull AT, Ward AC, Goodfellow M (2000) Search and discovery strategies for biotechnology: the paradigm shift. Microbiol Mol Biol Rev 64:573–606
3. Tao H, Cornish VW (2002) Milestones in directed enzyme evolution. Curr Opin Chem Biol 6:858–864
4. Currin A, Swainston N, Day PJ, Kell DB (2015) Synthetic biology for the directed evolution of biocatalysts: navigating sequence space intelligently. Chem Soc Rev 44:1172–1239
5. Packer MS, Liu DR (2015) Methods for the directed evolution of proteins. Nat Rev Genet 16:379–394
6. Arnold FH, Volkov AA (1999) Directed evolution of biocatalysts. Curr Opin Chem Biol 3:54–59
7. Jäckel C, Kast P, Hilvert D (2008) Protein design by directed evolution. Annu Rev Biophys 37:153–173
8. Currin A, Swainston N, Day PJ, Kell DB (2015) Synthetic biology for the directed evolution of protein biocatalysts: navigating sequence space intelligently. Chem Soc Rev 44:1172–1239
9. Romero PA, Arnold FH (2009) Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol 10:866–876
10. Gumulya Y, Sanchis J, Reetz MT (2012) Many pathways in laboratory evolution can lead to improved enzymes: how to escape from local minima. ChemBioChem 13:1060–1066
11. Barrozo A, Borstnar R, Marloie G, Kamerlin SCL (2012) Computational protein engineering: bridging the gap between rational design and laboratory evolution. Int J Mol Sci 13:12428–12460
12. Kiss G, Çelebi-Ölçum N, Moretti R, Baker D, Houk KN (2012) Computational enzyme design. Angew Chem Int Ed 52:5700–5725
13. Romero-Rivera A, Garcia-Borràs M, Osuna S (2017) Computational tools for the evaluation of laboratory-engineered biocatalysts. Chem Commun 53:284–297
14. Amrein BA, Steffen-Munsberg F, Szeler I, Purg M, Kulkarni Y, Kamerlin SCL (2017) CADEE: computer-aided directed evolution of enzymes. IUCrJ 4:50–64
15. Warshel A, Weiss RM (1980) An empirical valence bond approach for comparing reactions in solutions and in enzymes. J Am Chem Soc 102:6218–6226
16. Warshel A, Sharma PK, Kato M, Xiang Y, Liu H, Olsson MHM (2006) Electrostatic basis for enzyme catalysis. Chem Rev 106:3210–3235
17. Kamerlin SCL, Warshel A (2010) The EVB as a quantitative tool for formulating simulations and analyzing biological and chemical reactions. Faraday Discuss 145:71–106
18. Luo J, van Loo B, Kamerlin SCL (2012) Examining the promiscuous phosphatase activity of Pseudomonas aeruginosa arylsulfatase: a comparison to analogous phosphatases. Proteins Struct Funct Bioinf 80:1211–1226
19. Barrozo A, Duarte F, Bauer P, Carvalho ATP, Kamerlin SCL (2015) Cooperative electrostatic interactions drive functional evolution in the alkaline phosphatase superfamily. J Am Chem Soc 137:9061–9076
20. Q Official Website. http://xray.bmc.uu.se/~aqwww/q
21. Manual for the molecular Dynamics package Q. http://xray.bmc.uu.se/~aqwww/q/documents/qman5.pdf
22. MPI4Py. https://pypi.python.org/pypi/mpi4py
23. O’Boyle NM, Banck M, James CA, Morley C, Vandermeersch T, Hutchison GR (2011) Open babel: an open chemical toolbox. J Cheminform 3:33–33
24. Krivov GG, Shapovalov MV, Dunbrack RL (2009) Improved prediction of protein side-chain conformations with SCWRL4. Proteins Struct Funct Bioinf 77:778–795
25. Frushicheva MP, Cao J, Chu ZT, Warshel A (2010) Exploring challenges in rational enzyme design by simulating the catalysis in artificial Kemp eliminase. Proc Natl Acad Sci 107:16869–16874
33. King G, Warshel A (1989) A surface constrained all-atom solvent model for effective simulations of polar solutions. J Chem Phys 91:3647–3661
34. Lee FS, Warshel A (1992) A local reaction field method for fast evaluation of long-range electrostatic interactions in molecular simulations. J Chem Phys 97:3100–3107
35. Stallman RM (2009) GCC developer community, using the Gnu compiler collection: A Gnu manual for Gcc version 4.3.3. CreateSpace. p 636
36. Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, Sahay V, Kambadur P, Barrett B, Lumsdaine A, Castain RH, Daniel DJ, Graham RL, Woodall TS (2004) Open MPI: Goals, concept, and design of a next generation MPI implementation. In: Kranzlmüller D, Kacsuk P, Dongarra J (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface: 11th European PVM/MPI Users’ Group Meeting Budapest, Hungary, September 19–22, 2004. Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 97–104
37. Gropp W (2002) MPICH2: A New Start for MPI Implementations. In: Proceedings of the 9th European PVM/MPI Users’ Group Meeting on recent advances in parallel virtual machine and message passing interface, Springer-Verlag, p 7
26. Frushicheva MP, Cao J, Warshel A (2011) Challenges and advances in validating enzyme design proposals: the case of Kemp eliminase
catalysis. Biochemistry 50:3849–3858 38. Python Software Foundation. Python Lan-
27. Kamerlin SCL, Warshel A (2011) The empiri- guage Reference, version 2.7. http://www.
cal valence bond model: theory and applica- python.org/
tions. WIREs Comput Mol Sci 1:30–45 39. Marelius J, Kolmodin K, Feierberg I, Åqvist J
28. Amrein BA, Bauer P, Duarte F, Janfalk Carls- (1998) Q: A molecular dynamics program for
son Å, Naworyta A, Mowbray SL, free energy calculations and empirical valence
Widersten M, Kamerlin SCL (2015) Expand- bond simulations in biomolecular systems. J
ing the catalytic triad in epoxide hydrolases and Mol Graph Model 16:213–225
related enzymes. ACS Catal 5:5702–5713 40. Berman HM, Westbrook J, Feng Z,
29. Ben-David M, Sussman JL, Maxwell CI, Gilliland G, Bhat TN, Weissig H, Shindyalov
Szeler K, Kamerlin SCL, Tawfik DS (2015) IN, Bourne PE (2000) The Protein Data Bank.
Catalytic stimulation by restrained active-site Nucleic Acids Res 28:235–242
floppiness—the case of high density 41. Berman HM, Henrick K, Nakamura H (2003)
lipoprotein-bound serum paraoxonase-1. J Announcing the worldwide Protein Data Bank.
Mol Biol 427:1359–1374 Nat Struct Mol Biol 10:980–980
30. Roca M, Vardi-Kilshtain A, Warshel A (2009) 42. HPC2N. http://www.hpc2n.umu.se/
Toward accurate screening in computer-aided
enzyme design. Biochemistry 48:3046–3056 43. Reetz MT, Wu S (2008) Greatly reduced
amino acid alphabets in directed evolution:
31. Frushicheva MP, Mills MJL, Schopf P, Singh making the right choice for saturation muta-
MK, Prasad RB, Warshel A (2014) Computer genesis at homologous enzyme positions.
aided enzyme design and catalytic concepts. Chem Commun 21:5499–5501
Curr Opin Chem Biol 21:56–62
44. Murzin AG, Brenner SE, Hubbart T, Chothia
32. Carvalho ATP, Barrozo A, Doron D, Kilshtain C (1995) SCOP: a structural classification of
AV, Major DT, Kamerlin SCL (2014) Chal- proteins database for the investigation of
lenges in computational studies of enzyme sequences and structures. J Mol Biol
structure, function and dynamics. J Mol 247:536–540
Graph Model 54:62–79
45. Cheng H, Schaeffer RD, Liao Y, Kinch LN, for the functional annotation of proteins.
Pei J, Shi S, Kim BH, Grishin NV (2014) Nucleic Acids Res 39(Database):D225–D229
ECOD: an evolutionary classification of pro- 48. Ponting CP, Schultz J, Milpetz F, Bork P
tein domains. PLoS Comput Biol 10: (1999) SMART: identification and annotation
e1003926 of domains from signalling and extracellular
46. Finn RD, Bateman A, Clements J, Coggill P, protein sequences. Nucleic Acids Res
Eberhardt RY, Eddy SR, Heger A, 27:229–232
Hetherington K, Holm L, Mistry J, Sonnham- 49. Haft DH, Selengut JD, White O (2003) The
mer ELL, Tate J, Punta M (2014) Pfam: the TIGRFAMs database of protein families.
protein families database. Nucleic Acids Res Nucleic Acids Res 31:371–373
42:D222–D230 50. Jones DT (1999) Protein secondary structure
47. Marchler-Bauer A, Lu S, Anderson JB, prediction based on position-specific scoring
Chitsaz F, Derbyshire MK, DeWeese-Scott C, matrices. J Mol Biol 292:195–202
Fong JH, Geer LY, Geer RC, Gonzales NR, 51. Buchan DWA, Minneci F, Nugent TCO,
Gwadz M, Hurwitz DI, Jackson JD, Ke Z, Bryson K, Jones DT (2013) Scalable web ser-
Lanczycki J, Lu F, Marchler GH, vices for the PSIPRED Protein Analysis Work-
Mullokandov M, Omelchenko MV, Robertson bench. Nucleic Acids Res 41(W1):
CL, Song JS, Thanki N, Yamashita RA, W340–W348
Zhang D, Zhang N, Zheng C, Bryant SH
(2011) CDD: a conserved domain database
Correction to: Enhancing Statistical Multiple Sequence
Alignment and Tree Inference Using Structural Information
Joseph L. Herman
Correction to:
Chapter 10 in: Tobias Sikosek (ed.),
Computational Methods in Protein Evolution,
Methods in Molecular Biology, vol. 1851,
https://doi.org/10.1007/978-1-4939-8736-8_10
The published version of this book included errors in code listings in Chapter 10. These
code listings have been corrected and text has been updated.
The updated online version of this chapter can be found at
https://doi.org/10.1007/978-1-4939-8736-8_10
