Contrastive learning harmonizing protein language models and natural language models

wukevin/proteinclip

ProteinCLIP

Introduction and background

Installation

To install ProteinCLIP, start by cloning this repository. Then create the requisite conda environment, activate it, and install ProteinCLIP in editable mode using pip. For example:

conda env create -f environment.yml
conda activate proteinclip
pip install -e ./

Note: we highly recommend the mamba package manager as an alternative to conda.

In addition to installation, you will likely need to download data files if you intend to train ProteinCLIP yourself; all datasets we use can be found at Zenodo.

Using ProteinCLIP

We provide pre-trained ProteinCLIP "adapter" models for the ESM2 family of models as well as ProtT5. These models are available under the pretrained directory and can be loaded using provided functions; see below for an example.

import numpy as np

from proteinclip import model_utils

m = model_utils.load_proteinclip("esm", 33)  # For ESM2, 33-layer model

# Create a synthetic example
# Size corresponds to embedding dimension of "parent" protein language model
model_input = np.random.randn(1280)
# ProteinCLIP expects input to be unit-normalized
model_input /= np.linalg.norm(model_input)
x = m.predict(model_input)
print(x.shape)  # (128,)
print(np.linalg.norm(x))  # 1.0; ProteinCLIP produces unit-norm vectors

Pre-trained adapters are available for the following protein language models:

  • ESM2, 36-layer: model_utils.load_proteinclip("esm", 36)
  • ESM2, 33-layer: model_utils.load_proteinclip("esm", 33)
  • ESM2, 30-layer: model_utils.load_proteinclip("esm", 30)
  • ESM2, 12-layer: model_utils.load_proteinclip("esm", 12)
  • ESM2, 6-layer: model_utils.load_proteinclip("esm", 6)
  • ProtT5: model_utils.load_proteinclip("t5")

These models are stored in the ONNX format, so feel free to write your own loaders as well. They are small and run inference quickly, even on CPU.
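Because ProteinCLIP expects unit-norm inputs and produces unit-norm outputs, comparing two proteins by cosine similarity reduces to a dot product. The sketch below illustrates this with NumPy only. The helper names `unit_normalize` and `cosine_similarity_matrix` are hypothetical (they are not part of the `proteinclip` package), and the 128-dimensional vectors are random stand-ins for real ProteinCLIP outputs.

```python
import numpy as np


def unit_normalize(embeddings, eps=1e-12):
    """Scale a single vector, or each row of a 2-D array, to unit L2 norm."""
    embeddings = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(embeddings, axis=-1, keepdims=True)
    # eps guards against division by zero for all-zero vectors
    return embeddings / np.maximum(norms, eps)


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b.

    For unit-norm rows this is simply a matrix of dot products.
    """
    return unit_normalize(a) @ unit_normalize(b).T


rng = np.random.default_rng(0)

# Stand-in for a batch of 4 protein language model embeddings (ESM2, 33-layer size)
raw_inputs = rng.standard_normal((4, 1280))
inputs = unit_normalize(raw_inputs)  # what ProteinCLIP expects as input

# Stand-in for ProteinCLIP outputs; real outputs would come from m.predict(...)
outputs = unit_normalize(rng.standard_normal((4, 128)))

similarities = cosine_similarity_matrix(outputs, outputs)
print(similarities.shape)  # (4, 4)
```

Each protein has similarity 1 with itself, so the diagonal of `similarities` is (up to floating-point error) all ones.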

Example training commands

Training ProteinCLIP

To train ProteinCLIP yourself, you can use the pre-computed embeddings that we have provided above, or you can compute your own embeddings and store them in HDF5 format as a mapping (UniProt ID -> embedding array). After you have obtained a protein embedding file, pass it to the training script as follows:

Example command:

python bin/train_protein_clip.py configs/clip_hparams.json /path/to/uniprot_sprot.dat.gz /path/to/protein_embedding.hdf5 --unitnorm -g text-embedding-3-large

Training should only take a couple of hours with pre-computed embeddings.
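If you compute your own embeddings, the (UniProt ID -> embedding array) file described above could be written with `h5py` roughly as follows. This is a minimal sketch, not the project's own tooling: it assumes one top-level dataset per UniProt ID, which is one natural reading of that description, and the exact layout expected by the training script should be checked against the pre-computed files. The function names `write_embeddings` and `read_embedding` are hypothetical, and the vectors are random placeholders rather than real protein embeddings.

```python
import os
import tempfile

import h5py
import numpy as np


def write_embeddings(path, embeddings):
    """Write a {uniprot_id: vector} mapping as one HDF5 dataset per ID.

    Assumption: top-level datasets keyed by UniProt ID.
    """
    with h5py.File(path, "w") as f:
        for uniprot_id, vector in embeddings.items():
            f.create_dataset(uniprot_id, data=np.asarray(vector, dtype=np.float32))


def read_embedding(path, uniprot_id):
    """Read a single embedding back as a NumPy array."""
    with h5py.File(path, "r") as f:
        return f[uniprot_id][()]


# Placeholder vectors sized for the ESM2 33-layer model (1280 dimensions)
rng = np.random.default_rng(0)
demo_embeddings = {
    "P69905": rng.standard_normal(1280),
    "P68871": rng.standard_normal(1280),
}

demo_path = os.path.join(tempfile.mkdtemp(), "protein_embedding.hdf5")
write_embeddings(demo_path, demo_embeddings)

loaded = read_embedding(demo_path, "P69905")
print(loaded.shape)  # (1280,)
```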

Training protein-protein interaction classifier

We provide a training command to automatically train a protein-protein interaction classifier using the data splits provided by Bernett et al. (1). The input to this training call is the directory of a training run of ProteinCLIP as described above; the relevant HDF5 embeddings for proteins will be loaded from it, as well as the CLIP architecture itself (as specified by the --clipnum argument).

Example command:

python bin/train_ppi.py configs/supervised_hparams.json -c ./protein_clip/version_0 --clipnum 1 -n ppi_classifier

Training should take a few minutes.

References

(1) Bernett, J., Blumenthal, D. B., & List, M. (2024). Cracking the black box of deep sequence-based protein–protein interaction prediction. Briefings in Bioinformatics, 25(2), bbae076.
