VPOD: A database of opsins and machine-learning models to predict λmax phenotypes.
Histogram distributions of vertebrate and invertebrate opsin light-sensitivity data (λmax) from VPOD_het_1.1, with scaled kernel density estimate (KDE) curves overlaid to better visualize the general shape and characteristics of the λmax distributions.
We introduce the Visual Physiology Opsin Database (VPOD), a newly compiled database for all heterologously expressed opsin genes with λmax phenotypes (wavelength of maximal absorbance; peak sensitivity). VPOD_1.1 contains 1123 unique opsin genotypes and corresponding λmax phenotypes collected across all animals from 90 separate publications.
We use VPOD data and deepBreaks (an ML tool designed for exploring genotype-phenotype associations) to show regression-based machine learning (ML) models often reliably predict λmax, account for non-additive effects of mutations on function, and identify functionally critical amino acid sites.
We provide an approach that lays the groundwork for future robust exploration of molecular-evolutionary patterns governing phenotype, with potential broader applications to any family of genes with quantifiable and comparable phenotypes.
Instructions for navigating VPOD data files, including raw and curated data used for training models, and for accessing scripts/notebooks used for sequence data manipulation, testing, and model training.
- Navigate to the folder vpod_data/VPOD_1.X (e.g. vpod_data/VPOD_1.0 or vpod_data/VPOD_1.1)
- Select the formatted_data_subsets folder to access subsets of the database suitable for direct model training without requiring mySQL or sequence alignment.
- The folder vpod_1.1_data_subsets_2024-05-02 contains all data subsets for VPOD_1.1.
- Files marked xxx.txt (ex. vert.txt) are the unaligned data subsets.
- Files marked xxx_aligned.txt (ex. vert_aligned.txt) are the aligned data subsets (not formatted; will become obsolete in later versions of VPOD).
- Files marked VPOD_xxx_het_1.1 (ex. VPOD_vert_het_1.1.fasta) are the fully aligned and formatted data subsets.
- Files marked xxx_meta (ex. wds_meta.tsv) are the corresponding metadata files for each subset (includes species, accession, gene names, λmax, etc).
- Select the raw_database_files folder for the raw components of the database, which you can load into a mySQL database to create your own formatted dataset using steps 0-2 of the vpod_main_wf.ipynb Jupyter notebook. For more information on the specific meaning of data columns, see Frazer et al. 2024.
- litsearch.csv - All literature search information relevant to the current version of VPOD.
- references.csv - All publication references relevant to the current version of VPOD.
- opsins.csv - All opsin sequence data and taxonomic meta-data.
- heterologous.csv - All opsin phenotype data (λmax) and experiment related meta-data.
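As a minimal sketch of how a formatted subset might be paired with its metadata file in Python, the snippet below parses FASTA text and joins each record to a λmax value from a tab-separated table. The column names `Accession` and `Lambda_Max`, and the idea that FASTA headers match an identifier column in the metadata, are assumptions made for illustration; check the actual headers in the files you download.

```python
import csv
import io


def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequences may wrap across several lines; concatenate them.
            records[header] += line
    return records


def join_lambda_max(fasta_text, meta_text, id_col="Accession", value_col="Lambda_Max"):
    """Pair each sequence with its λmax via an ID column shared with the FASTA headers.

    id_col and value_col are hypothetical column names, not confirmed VPOD headers.
    """
    seqs = parse_fasta(fasta_text)
    rows = csv.DictReader(io.StringIO(meta_text), delimiter="\t")
    lmax = {row[id_col]: float(row[value_col]) for row in rows}
    return {sid: (seq, lmax[sid]) for sid, seq in seqs.items() if sid in lmax}


# Toy data for illustration only (not real VPOD entries).
fasta_demo = ">seq1\nMNGT-EGP\nNFYV\n>seq2\nMNGTAEGP\n"
meta_demo = "Accession\tLambda_Max\nseq1\t500.0\nseq2\t360.0\n"
paired = join_lambda_max(fasta_demo, meta_demo)
```

To use real files, read each one into a string (for example with `open(path).read()`) and pass the two strings to `join_lambda_max`, adjusting the column names to match the metadata file.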
- Navigate to the folder results_files - All subfolders contain results specific to different subtests outlined in Frazer et al. 2024
- wt_model_predicting_all_mutants_test - Contains results of the Wild-Type model (which lacks data from artificially mutated sequences) predictions on all experimentally mutated opsins in VPOD.
- This folder contains results for several versions of VPOD (i.e. VPOD_1.0 and VPOD_1.1).
- mutant_pred_analysis.ipynb - Contains the code and results for a Wilcoxon signed-rank test of whether the distributions of squared prediction errors differ significantly between the whole-dataset model and the wild-type model on all mutant data. Results are displayed directly in the notebook.
- epistasis_pred_test - Contains predictions by, and comparisons between, our WT and WDS models on 111 'epistatic mutations' (non-additive), to test more generally the capability of our ML models to predict epistatic interactions between mutations.
- epistasis_analysis.ipynb - Contains the code and results for Wilcoxon signed-rank tests of whether the distributions of squared error differ significantly between the WDS-epi model and the WT model, the WDS-epi model and the expected additive mutation λmax values (EAV), and the WT model and the EAV, respectively.
- full_iter_sample_results - Contains results of our 'sample iterate test' on each dataset - where x number of datapoints are removed from the training data prior to training and used as test data. This is repeated until all points have been sampled once.
- imputation_tests - Contains results comparing the predictions of phylogenetic imputation and our ML models for the same data points (varies by dataset).
- main_model_results - Contains model training results for each dataset, separated by alignment method used (MAFFT, MUSCLE, and GBLOCKS following MUSCLE alignment) and then by database version (i.e. VPOD_1.0 or VPOD_1.1).
- msp_tests - Contains results for model predictions from each dataset on thirty unseen wild-type invertebrate opsins from a separately curated MSP dataset.
- perf_v_tds - Contains results tracking the correlation between dataset size and model R^2, to better understand how training data relate to model performance.
- sws_ops_prediction_comparison_test - Contains results comparing the predictive capabilities of models trained on different data subsets, obtained by randomly selecting and removing the same 25 wild-type ultraviolet or short-wave sensitive opsins from the training data of the WDS, Vertebrate, WT, and UVS/SWS models.
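The 'sample iterate test' described above can be sketched roughly as follows: shuffle the datapoint indices, hold out x of them at a time as test data, train on the rest, and repeat until every point has been held out once. This is only an illustration of the idea; the function name `sample_iterate_splits`, the fixed seed, and the handling of a final smaller batch are assumptions rather than the notebooks' actual behavior.

```python
import random


def sample_iterate_splits(n_points, x, seed=0):
    """Yield (train_idx, test_idx) pairs so every index is held out exactly once."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    for start in range(0, n_points, x):
        # The last batch may hold fewer than x points if x does not divide n_points.
        test_idx = indices[start:start + x]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


# Ten datapoints, three held out per round, so four rounds in total.
splits = list(sample_iterate_splits(n_points=10, x=3))
```

Each `(train_idx, test_idx)` pair would then be used to train a model on the training indices and record its predictions for the held-out indices.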
Instructions for accessing scripts and Jupyter notebooks used to create the database and train ML models.
NOTE - We recommend opening these scripts in a code editor for a more detailed explanation of how to use them properly.
- Navigate to the folder scripts_n_notebooks/sequence_manipulation
- mutagenesis.py - Used to generate mutants from a sequence retrieved from NCBI by accession number or from a manually entered raw sequence.
- chimeras.py - Used to generate chimeric sequences from sequences retrieved from NCBI by accession number or from manually entered raw sequences.
- Opsin chimeras are mutants where one or more transmembrane domains are copied from a different opsin to replace the original.
- Contact us for assistance with using this script.
- Navigate to the folder scripts_n_notebooks
- Select the folder vpod_ML_workflows to access notebooks used for training ML models.
- vpod_main_wf.ipynb - Primary notebook for users, with a full pipeline for everything from creating a local instance of VPOD using mySQL to formatting datasets and training ML models for λmax predictions.
- substests folder - Contains notebooks used for subtests outlined in Frazer et al. 2024.
- vpod_wf_iterate_train_all_subsets.ipynb - Iteratively trains each dataset with no modifications to the training process; simply streamlines training all datasets in one run.
- vpod_wf_wt_mut_test.ipynb - Trains and tests the predictive capabilities of the Wild-Type model (which lacks data from artificially mutated sequences) on all experimentally mutated opsins in VPOD.
- vpod_main_wf_msp_iterate.ipynb - Iteratively train and test models from each dataset on thirty unseen wild-type invertebrate opsins from a separately curated MSP dataset.
- vpod_wf_imp_sample_test_iterate.ipynb - Iteratively trains and tests models from each dataset on a randomly sampled subset of data, for comparison to predictions made by phylogenetic imputation.
- vpod_wf_iterate_all_sample_t1_ops.ipynb - Iteratively subsamples the T1 dataset, removing 'x' datapoints before training and using them as test data until all datapoints are sampled once.
- vpod_wf_iterate_epistasic_muts.ipynb - Iteratively subsamples the whole dataset of mutations that demonstrate epistatic interactions, removing 'x' datapoints before training and using them as test data until all datapoints are sampled once (can also be modified to remove all epistatic mutants at once, as detailed in Frazer et al. 2024).
- vpod_wf_iterate_model_perf_vs_tds.ipynb - Iteratively adds or subtracts 'x' datapoints from the dataset to track the correlation between dataset size and model performance.
- vpod_wf_iterate_subsample.ipynb - Iteratively subsamples the target dataset, removing 'x' datapoints before training and using them as test data until all datapoints are sampled once.
- vpod_wf_iterate_sws_comp.ipynb - Iteratively removes the same 25 randomly selected wild-type ultraviolet or short-wave sensitive opsins from the training data of the WDS, Vertebrate, WT, and UVS/SWS models to compare the predictive capabilities of models trained on different data subsets.
- Select the folder figure_making to access the Jupyter notebook figuremaking.ipynb used to generate some of the figures used in Frazer et al. 2024.
- figures contains a collection of completed figures and drafts used in Frazer et al. 2024, separated by the version of the database used to generate them (i.e. VPOD_1.0 or VPOD_1.1).
- Select the opsin_wt_tree folder for all files used to make the wild-type opsin gene-tree, supplementary figure 10 (S10), from Frazer et al. 2024.
- We've also provided it as a .svg file in this same folder or click here to download.
- Select the folder phylogenetic_imputation to access all files used to predict λmax via phylogenetic imputation and compare with predictions made by ML, as detailed in Frazer et al. 2024.
- Phylogenetic_Imputation.Rmd - Used to load tree files and make λmax predictions via phylogenetic imputation [Requires RStudio].
- trees - Contains all the tree files and λmax metadata necessary for predictions via phylogenetic imputation.
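To illustrate the kind of operation a mutagenesis script performs, here is a minimal sketch that applies a single amino acid substitution written in the common `T4S` style (original residue, 1-based position, replacement residue). The helper `apply_substitution` is hypothetical and is not the interface of mutagenesis.py; positions here are counted along the sequence you pass in, which may differ from the numbering convention the script uses.

```python
import re


def apply_substitution(sequence, mutation):
    """Apply a substitution such as 'T4S' (1-based position) to a protein sequence."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Unrecognized mutation format: {mutation!r}")
    original, replacement = match.group(1), match.group(3)
    position = int(match.group(2))
    index = position - 1
    # Refuse to mutate if the stated original residue is not at that position.
    if not 0 <= index < len(sequence) or sequence[index] != original:
        raise ValueError(f"{mutation}: residue {original} not found at position {position}")
    return sequence[:index] + replacement + sequence[index + 1:]


# Toy sequence for illustration only (not a real opsin).
mutant = apply_substitution("MNGTEGPNFYV", "T4S")
```

Checking the original residue before substituting guards against off-by-one errors and mismatched numbering, which are easy to introduce when mutations are copied from the literature.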
Instructions for using VPOD and training ML models with deepBreaks.
- Install deepBreaks by following the guide provided on the deepBreaks GitHub.
- Refer to the requirements.txt provided above and ensure all necessary packages are installed in a dedicated environment (using Conda is recommended).
- Navigate to scripts_n_notebooks/vpod_ml_workflows and open vpod_main_wf.ipynb.
- To train models using raw_database_files, start from the top of the document and follow the instructions provided in the notebook.
- To train models using formatted_data_subsets, scroll down to Step 3: deepBreaks and follow the instructions provided in the notebook.
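For readers unfamiliar with how aligned sequences can serve as input to a regression model, the sketch below one-hot encodes a small alignment so that each (position, residue) pair becomes a binary feature column. This is a generic illustration under our own assumptions, not a description of how deepBreaks preprocesses data internally; the function name `one_hot_encode` is hypothetical.

```python
def one_hot_encode(aligned_seqs):
    """Turn equal-length aligned sequences into binary rows, one column per (position, residue)."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("All aligned sequences must have the same length")
    # Only (position, residue) pairs that occur in the data become columns.
    columns = sorted({(pos, res) for seq in aligned_seqs for pos, res in enumerate(seq)})
    col_index = {col: i for i, col in enumerate(columns)}
    matrix = []
    for seq in aligned_seqs:
        row = [0] * len(columns)
        for pos, res in enumerate(seq):
            row[col_index[(pos, res)]] = 1
        matrix.append(row)
    return columns, matrix


# Toy alignment for illustration only; '-' marks a gap.
columns, X = one_hot_encode(["MNA-", "MNS-", "MKSG"])
```

Rows of `X` could then be paired with λmax values as the response variable in any regression library; invariant columns (such as position 0 in the toy alignment) carry no information and are typically dropped before fitting.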
All data and code are covered under the GNU General Public License (GPL), Version 3, in accordance with Open Source Initiative (OSI) policies.
If citing this GitHub and its contents, use the following DOI provided by Zenodo...
10.5281/zenodo.10667840
If citing the paper "Discovering genotype-phenotype relationships with machine learning and the Visual Physiology Opsin Database (VPOD)", use the following citation...
Seth A. Frazer, Mahdi Baghbanzadeh, Ali Rahnavard, Keith A. Crandall, & Todd H Oakley. (2024). Discovering genotype-phenotype relationships with machine learning and the Visual Physiology Opsin Database (VPOD). bioRxiv, 2024.02.12.579993. https://doi.org/10.1101/2024.02.12.579993 [pre-print]
Contact information for author questions or feedback.
Todd H. Oakley
Seth A. Frazer
Here is a link to a bibliography of the publications used to build VPOD_1.2 (Full version not yet released)
If you know of publications for training opsin ML models not included in the VPOD_1.2 database, please send them to us through this form