VPOD: A database of opsins and machine-learning models to predict λmax phenotypes.
Histogram distributions of Vertebrate and Invertebrate Opsin Light Sensitivity Data - λmax - from VPOD_het_1.1, with scaled Kernel Density Estimate (KDE) curves overlaid to better visualize the general shape and characteristics of our λmax distributions
We introduce the Visual Physiology Opsin Database, a newly compiled database for all heterologously expressed opsin genes with λmax phenotypes (wavelength of maximal absorbance; peak sensitivity). VPOD_1.1 contains 1123 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 90 separate publications.
We use VPOD data and deepBreaks (an ML tool designed for exploring genotype-phenotype associations) to show that regression-based machine learning (ML) models often reliably predict λmax, account for non-additive effects of mutations on function, and identify functionally critical amino acid sites.
We provide an approach that lays the groundwork for future robust exploration of molecular-evolutionary patterns governing phenotype, with potential broader applications to any family of genes with quantifiable and comparable phenotypes.
UPDATE!!! - Want to predict opsin λmax with our models?
If so, visit our GitHub for OPTICS, an open-source tool that predicts the Opsin Phenotype (λmax)
Instructions for navigating VPOD data files, including raw and curated data used for training models, and for accessing scripts/notebooks used for sequence data manipulation, testing, and model training.
- Navigate to the folder vpod_data/VPOD_1.X (i.e. vpod_data/VPOD_1.0 or vpod_data/VPOD_1.1)
- The folder vpod_1.1_data_subsets_2024-05-02 contains all data subsets for VPOD_1.1.
- Files marked xxx.txt (ex. vert.txt) are the unaligned data subsets.
- Files marked xxx_aligned.txt (ex. vert_aligned.txt) are the aligned data subsets (not formatted; will become obsolete in later versions of VPOD).
- Files marked VPOD_xxx_het_1.1 (ex. VPOD_vert_het_1.1.fasta) are the fully aligned and formatted data subsets.
- Files marked xxx_meta (ex. wds_meta.tsv) are the corresponding metadata files for each subset (includes species, accession, gene names, λmax, etc).
- Select the raw_database_files folder for the raw components of the database, which you can load into a mySQL database to create your own formatted dataset using steps 0-2 of the vpod_main_wf.ipynb Jupyter notebook. For more information on the specific meaning of data columns, see Frazer et al. 2024.
- litsearch.csv - All literature search information relevant to the current version of VPOD.
- references.csv - All publication references relevant to the current version of VPOD.
- opsins.csv - All opsin sequence data and taxonomic meta-data.
- heterologous.csv - All opsin phenotype data (λmax) and experiment related meta-data.
- Navigate to the folder results_files - All subfolders contain results specific to different subtests outlined in Frazer et al. 2024
- wt_model_predicting_all_mutants_test - Contains results of the Wild-Type model (which lacks data from artificially mutated sequences) predictions on all experimentally mutated opsins in VPOD.
- This folder contains results for several versions of VPOD (i.e. VPOD_1.0 and VPOD_1.1).
- mutant_pred_analysis.ipynb - Contains code for, and results of, a Wilcoxon Signed-Rank Test for a statistically significant difference between the distributions of squared prediction errors of the whole-dataset model and the wild-type model on all mutant data. Results are displayed directly in the notebook.
- epistasis_pred_test - Contains predictions by, and comparisons between, our WT and WDS models on 111 'epistatic mutations' (non-additive), to assess more generally the capability of our ML models to predict epistatic interactions between mutations.
- epistasis_analysis.ipynb - Contains code for and results of Wilcoxon Signed-Rank Tests for statistically significant differences in the distributions of squared error between the WDS-epi model and the WT model, the WDS-epi model and the expected additive mutation λmax values (EAV), and the WT model and EAV, respectively.
- full_iter_sample_results - Contains results of our 'sample iterate test' on each dataset - where x number of datapoints are removed from the training data prior to training and used as test data. This is repeated until all points have been sampled once.
- imputation_tests - Contains results comparing the predictions of phylogenetic imputation and our ML models for the same data points (varies by dataset).
- main_model_results - Contains model training results for each dataset, separated by alignment method used (MAFFT, MUSCLE, and GBLOCKS following MUSCLE alignment) and then by database version (i.e. VPOD_1.0 or VPOD_1.1).
- msp_tests - Contains results for model predictions from each dataset on thirty unseen wild-type invertebrate opsins from a separately curated MSP dataset.
- perf_v_tds - Contains results tracking the correlation between dataset size and model R^2, to better understand how training data relate to model performance.
- sws_ops_prediction_comparison_test - Contains results comparing predictive capabilities of models trained on different data subsets by randomly selecting and removing the same 25 wild-type ultraviolet or short-wave sensitive opsins from the training data of the WDS, Vertebrate, WT, and UVS/SWS datasets.
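The mutant and epistasis analysis notebooks listed above compare models with a Wilcoxon Signed-Rank Test on paired squared prediction errors. As a rough illustration of that kind of comparison (not the notebooks' actual code), the sketch below implements a two-sided test with the normal approximation using only the Python standard library. The function names and every λmax value are made up for the example, and the approximation is coarse for small samples; for real analyses an established implementation such as `scipy.stats.wilcoxon` is likely the better choice.

```python
import math
from statistics import NormalDist


def squared_errors(observed, predicted):
    """Per-sequence squared prediction error."""
    return [(o - p) ** 2 for o, p in zip(observed, predicted)]


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test on paired samples.

    Uses the normal approximation without tie or continuity correction.
    Returns (min(W+, W-), p_value).
    """
    # Pairs with a zero difference carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # positions i..j map to ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return min(w_plus, w_minus), p_value


# Toy values (nm) invented for illustration only; not VPOD data.
observed = [500.0, 420.0, 360.0, 535.0, 480.0, 445.0, 510.0, 390.0]
pred_wds = [498.0, 423.0, 361.0, 531.0, 485.0, 444.0, 504.0, 397.0]
pred_wt = [490.0, 431.0, 372.0, 522.0, 494.0, 430.0, 526.0, 372.0]

se_wds = squared_errors(observed, pred_wds)
se_wt = squared_errors(observed, pred_wt)
stat, p_value = wilcoxon_signed_rank(se_wds, se_wt)
print(f"W = {stat}, p = {p_value:.4f}")
```

In this toy case every squared error of the first model is smaller than the paired error of the second, so the smaller rank sum is zero.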
Instructions for accessing scripts and Jupyter notebooks used to create the database and train ML models.
NOTE - It's recommended that you open these scripts in a code editor for a more detailed explanation of how to use them properly.
- Navigate to the folder scripts_n_notebooks/sequence_manipulation
- mutagenesis.py - Used to generate mutants from a sequence, either fetched from NCBI by accession number or entered manually as a raw sequence.
- chimeras.py - Used to generate chimeric sequences from sequences either fetched from NCBI by accession number or entered manually as raw sequences.
- Opsin chimeras are mutants where one or more transmembrane domains are copied from a different opsin to replace the original.
- Contact us for assistance with using this script.
- Navigate to the folder scripts_n_notebooks
- Select the folder vpod_ML_workflows to access notebooks used for training ML models.
- vpod_main_wf.ipynb - Primary notebook for users, with a full pipeline for everything from creating a local instance of VPOD using mySQL to formatting datasets and training ML models for λmax predictions.
- substests folder - Contains notebooks used for subtests outlined in Frazer et al. 2024.
- vpod_wf_iterate_train_all_subsets.ipynb - Iteratively trains each dataset with no modifications to the training process; simply streamlines training all datasets in one run.
- vpod_wf_wt_mut_test.ipynb - Trains and tests the predictive capabilities of the Wild-Type model (which lacks data from artificially mutated sequences) on all experimentally mutated opsins in VPOD.
- vpod_main_wf_msp_iterate.ipynb - Iteratively train and test models from each dataset on thirty unseen wild-type invertebrate opsins from a separately curated MSP dataset.
- vpod_wf_imp_sample_test_iterate.ipynb - Iteratively trains and tests models from each dataset on a randomly sampled subset of data, for comparison to predictions made by phylogenetic imputation.
- vpod_wf_iterate_all_sample_t1_ops.ipynb - Iteratively subsamples the T1 dataset, removing 'x' datapoints before training and using them as test data until all datapoints are sampled once.
- vpod_wf_iterate_epistasic_muts.ipynb - Iteratively subsamples the whole dataset of mutations that demonstrate epistatic interactions, removing 'x' datapoints before training and using them as test data until all datapoints are sampled once (can also be modified to remove all epistatic mutants at once, as detailed in Frazer et al. 2024).
- vpod_wf_iterate_model_perf_vs_tds.ipynb - Iteratively adds or subtracts 'x' datapoints from a dataset to track the correlation between dataset size and model performance.
- vpod_wf_iterate_subsample.ipynb - Iteratively subsamples a target dataset, removing 'x' datapoints before training and using them as test data until all datapoints are sampled once.
- vpod_wf_iterate_sws_comp.ipynb - Iteratively removes the same 25 randomly selected wild-type ultraviolet or short-wave sensitive opsins from the training data of the WDS, Vertebrate, WT, and UVS/SWS datasets to compare predictive capabilities of models trained on different data subsets.
- Select the folder figure_making to access the Jupyter notebook figuremaking.ipynb used to generate some of the figures used in Frazer et al. 2024.
- figures contains a collection of completed figures and drafts used in Frazer et al. 2024, separated by the version of the database used to generate them (i.e. VPOD_1.0 or VPOD_1.1).
- Select the opsin_wt_tree folder for all files used to make the wild-type opsin gene-tree, supplementary figure 10 (S10), from Frazer et al. 2024.
- We've also provided it as a .svg file in this same folder or click here to download.
- Select the folder phylogenetic_imputation to access all files used to predict λmax via phylogenetic imputation and compare with predictions made by ML, as detailed in Frazer et al. 2024.
- Phylogenetic_Imputation.Rmd - Used to load tree files and make λmax predictions via phylogenetic imputation [Requires RStudio].
- trees - Contains all the tree files and λmax meta-data necessary for predictions via phylogenetic imputation.
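Several of the subtest notebooks listed above hold out 'x' datapoints at a time until every datapoint has served as test data once. A minimal sketch of that splitting scheme, assuming a simple shuffle-and-chunk partition (the notebooks may implement it differently), could look like the following. `iterate_holdout_splits` and its parameters are hypothetical names, and model training is left as a comment.

```python
import random


def iterate_holdout_splits(n_points, x, seed=0):
    """Yield (train_indices, test_indices) pairs.

    Indices are shuffled once and cut into consecutive chunks of size x
    (the last chunk may be smaller), so each datapoint is held out once.
    """
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    for start in range(0, n_points, x):
        test_idx = indices[start:start + x]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


# Example: 10 datapoints, 3 held out per iteration.
splits = list(iterate_holdout_splits(n_points=10, x=3, seed=42))
for round_number, (train_idx, test_idx) in enumerate(splits, start=1):
    # A model would be trained on train_idx and evaluated on test_idx here.
    print(f"round {round_number}: train={len(train_idx)} test={sorted(test_idx)}")
```

Fixing the seed keeps the partition reproducible, which matters when the same held-out points need to be compared across models.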
Instructions for using VPOD and training ML models with deepBreaks.
- Follow the directions to install deepBreaks using the guide provided on the deepBreaks GitHub.
- Refer to the requirements.txt provided above and ensure all necessary packages are installed in a dedicated environment (using Conda is recommended).
- Navigate to scripts_n_notebooks/vpod_ml_workflows and open vpod_main_wf.ipynb.
- To train models using raw_database_files, start from the top of the document and follow the instructions provided in the notebook.
- To train models using formatted_data_subsets, scroll down to Step 3: deepBreaks and follow the instructions provided in the notebook.
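Before starting from a formatted data subset, it can help to confirm that the aligned FASTA file and its metadata file line up. The sketch below is one way to do that with the Python standard library; it is not part of the VPOD notebooks. It assumes that FASTA headers match an identifier column in the metadata TSV, and the column names `id` and `lambda_max` are hypothetical placeholders, so check the real files and adjust them. The toy sequences and values are invented.

```python
import csv
import io


def read_fasta(handle):
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].strip()
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def read_lambda_max(handle, id_column, value_column):
    """Read a tab-separated metadata table into {identifier: λmax}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: float(row[value_column]) for row in reader}


def check_subset(sequences, lambda_max):
    """Summarize whether sequences are aligned and have phenotypes."""
    lengths = {len(seq) for seq in sequences.values()}
    return {
        "n_sequences": len(sequences),
        "aligned": len(lengths) == 1,  # aligned sequences share one length
        "missing_phenotype": sorted(set(sequences) - set(lambda_max)),
    }


# Toy stand-ins for a formatted subset and its metadata (not VPOD data).
# With real files, pass open(path) handles instead of StringIO objects.
toy_fasta = io.StringIO(">seq1\nMNGT-EGP\nNFYV\n>seq2\nMNGTLEGP\nNFYV\n")
toy_meta = io.StringIO("id\tlambda_max\nseq1\t500.0\nseq2\t358.0\n")

sequences = read_fasta(toy_fasta)
phenotypes = read_lambda_max(toy_meta, id_column="id", value_column="lambda_max")
report = check_subset(sequences, phenotypes)
print(report)
```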
All data and code are covered under the GNU General Public License (GPL), Version 3, in accordance with Open Source Initiative (OSI) policies.
IF citing this GitHub and its contents use the following DOI provided by Zenodo...
IF citing the paper "Discovering genotype-phenotype relationships with machine learning and the Visual Physiology Opsin Database (VPOD)" use the following citation...
Seth A. Frazer, Mahdi Baghbanzadeh, Ali Rahnavard, Keith A. Crandall, & Todd H Oakley. (2024). Discovering genotype-phenotype relationships with machine learning and the Visual Physiology Opsin Database (VPOD). bioRxiv, 2024.02.12.579993. https://doi.org/10.1101/2024.02.12.579993 [pre-print]
Contact information for author questions or feedback.
Todd H. Oakley
Seth A. Frazer
Here is a link to a bibliography of the publications used to build VPOD_1.2 (Full version not yet released)
If you know of publications for training opsin ML models not included in the VPOD_1.2 database, please send them to us through this form