We use the set of manually curated human reactions to electronically infer reactions in fourteen evolutionarily divergent eukaryotic species for which high-quality whole-genome sequence data are available, and for which a comprehensive, high-quality set of protein predictions therefore exists. These species include the laboratory mouse and rat, the nematode C. elegans, and budding and fission yeasts. The estimated success rate of our orthology inference strategy is the percentage of eligible reactions (defined in step 2 below) in the current human data set for which an event can be inferred in the model organism. By this measure, success rates range from 82.4% for the laboratory mouse to 9.8% for P. falciparum.
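The success-rate measure can be written as a short calculation. The following is a minimal sketch; the function name and the counts in the usage line are hypothetical and chosen only for illustration, not taken from a Reactome release.

```python
def success_rate(eligible_reactions: int, inferred_reactions: int) -> float:
    """Percentage of eligible human reactions for which an event was inferred."""
    if eligible_reactions <= 0:
        raise ValueError("there must be at least one eligible reaction")
    if not 0 <= inferred_reactions <= eligible_reactions:
        raise ValueError("inferred count must lie between 0 and the eligible count")
    return 100.0 * inferred_reactions / eligible_reactions


# Hypothetical counts, for illustration only.
rate = success_rate(eligible_reactions=1000, inferred_reactions=824)
print(f"{rate:.1f}%")
```

Note that the denominator is the number of eligible reactions, not the total number of human reactions, so excluded reactions do not lower the rate.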
Electronic inference proceeds in four steps:
- Protein homology data were obtained from PANTHER. PANTHER uses the Reference proteome dataset maintained by UniProt to generate phylogenetic trees of protein-coding genes across numerous species. Homologs are derived from these trees and annotated by type (ortholog, paralog). Additionally, where a protein has multiple orthologs, PANTHER infers the least diverged ortholog based on protein sequence divergence, and we use this information during our inference process. A detailed description of PANTHER's methodology can be found in ‘PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees’.
- All human reactions in the Reactome knowledgebase involving one or more proteins are eligible for electronic inference, with two exceptions: reactions that were themselves inferred based on data from the model organism, and reactions involving species in addition to human (e.g., HIV infection of human cells). Each eligible reaction is checked to determine whether every protein in its input, output and (if present) catalyst has at least one homologous protein (HP) in the organism undergoing inference. If a human reaction involves a complex, at least 75% of the accessioned protein components of the human complex must have HPs in the model organism.
- For each reaction that meets these criteria, an equivalent reaction is created for the model organism by replacing each human protein with its model organism HP. If a human protein corresponds to more than one model organism HP, a DefinedSet called ‘Homologues of …’ is created, with the model organism HPs as members.
- After all possible reactions have been inferred for the species, every human pathway that contains at least one inferred reaction is inferred for the species as well.
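Steps 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Reactome implementation: the `Reaction` and `Complex` classes, the `homologs` dictionary (human protein to list of model organism HPs), the tuple used to stand in for a DefinedSet, and all function names are hypothetical. The sketch also assumes that complex components without an HP are simply left out of the inferred complex, which the text above does not specify.

```python
from dataclasses import dataclass, field

COMPLEX_THRESHOLD = 0.75  # at least 75% of complex components must have HPs


@dataclass
class Complex:
    name: str
    proteins: list  # accessioned protein components


@dataclass
class Reaction:
    name: str
    inputs: list = field(default_factory=list)    # proteins (str) or Complex
    outputs: list = field(default_factory=list)
    catalyst: list = field(default_factory=list)


def map_protein(protein, homologs):
    """Replace a human protein with its HP, or a DefinedSet if it has several."""
    hps = homologs.get(protein, [])
    if not hps:
        return None
    if len(hps) == 1:
        return hps[0]
    # Stand-in for a DefinedSet whose members are the model organism HPs.
    return ("DefinedSet", f"Homologues of {protein}", tuple(hps))


def map_complex(cx, homologs):
    """Infer a complex only if enough of its components have HPs."""
    mapped = [map_protein(p, homologs) for p in cx.proteins]
    found = [m for m in mapped if m is not None]
    if not cx.proteins or len(found) / len(cx.proteins) < COMPLEX_THRESHOLD:
        return None
    # Assumption: components without an HP are omitted from the inferred complex.
    return Complex(cx.name, found)


def infer_reaction(rx, homologs):
    """Return the model organism reaction, or None if any participant fails."""
    mapped = {}
    for role in ("inputs", "outputs", "catalyst"):
        entities = []
        for entity in getattr(rx, role):
            if isinstance(entity, Complex):
                result = map_complex(entity, homologs)
            else:
                result = map_protein(entity, homologs)
            if result is None:
                return None
            entities.append(result)
        mapped[role] = entities
    return Reaction(rx.name, **mapped)


def infer_pathways(pathways, inferred_reaction_names):
    """Keep every pathway that contains at least one inferred reaction."""
    return [
        name
        for name, reaction_names in pathways.items()
        if any(r in inferred_reaction_names for r in reaction_names)
    ]
```

Here `pathways` is assumed to map a pathway name to the names of the reactions it contains, so a pathway is carried over as soon as one of its reactions has been inferred.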
These electronically inferred reactions are predictions based on a number of assumptions. Most basically, we assume that if we can find model organism HPs corresponding to all proteins involved in a human reaction, then those proteins mediate the same reaction in the model organism. This may not be true. Conversely, we may miss a truly homologous reaction in the model organism because it is mediated by structurally divergent proteins that PANTHER’s techniques did not identify as homologs. Similarly, complexes in which fewer than 75% of the subunits are homologous between species may nevertheless perform the same function. The electronically inferred reactions presented in Reactome are thus not data, but hypotheses useful to direct the design of confirmatory experiments.
A modified version of the orthoinference process was used to create a first draft of the SARS-CoV-2 infection pathway. Events in the SARS-CoV-2 pathway corresponding to each event in the SARS-CoV-1 pathway were created and populated with SARS-CoV-2 protein-containing physical entities based on orthology to SARS-CoV-1 proteins. We continue to replace predicted SARS-CoV-2 events with experimentally validated events as the relevant experimental evidence becomes available.