In-silico Saturation Mutagenesis implementation with 10x or more speedup for certain architectures.

kundajelab/fastISM

Quickstart

A Keras implementation of fast in-silico saturation mutagenesis (ISM) for convolution-based architectures. It speeds up ISM by 10x or more by restricting computation to those regions of each layer that are affected by a mutation in the input.
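
To illustrate why restricting computation helps, the sketch below traces which positions of each layer can change when a single input position is mutated. It is not fastISM's own code: it assumes a plain stack of stride-1, unpadded 1D convolutions, and the function names and kernel sizes are hypothetical, chosen only for illustration.

```python
def affected_range(start, stop, kernel_size, length):
    """Output positions of a stride-1, unpadded 1D convolution that depend
    on input positions start..stop (inclusive).

    Output position j reads inputs j .. j + kernel_size - 1, so it is
    affected whenever that window overlaps [start, stop].
    """
    out_length = length - kernel_size + 1
    lo = max(0, start - kernel_size + 1)
    hi = min(out_length - 1, stop)
    return lo, hi, out_length


def trace_mutation(position, seq_length, kernel_sizes):
    """Follow a single-position mutation through a stack of convolutions."""
    start, stop, length = position, position, seq_length
    trace = []
    for k in kernel_sizes:
        start, stop, length = affected_range(start, stop, k, length)
        trace.append((start, stop, length))
    return trace


# Hypothetical three-layer stack, 1000-long sequence, mutation at position 500.
layers = trace_mutation(500, 1000, [19, 11, 7])
for layer, (start, stop, length) in enumerate(layers, 1):
    print(f"conv {layer}: {stop - start + 1} of {length} positions affected")
```

Under these assumptions only a few dozen of roughly a thousand positions per layer need to be recomputed for each mutation, which is where the savings come from.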

Installation

Currently, fastISM is available to download from PyPI. Bioconda support is expected to be added in the future. fastISM requires TensorFlow 2.3.0 or above.

pip install fastism

Usage

fastISM provides a simple interface that takes Keras models as input. For any Keras model whose input is a sequence of dimensions (B, S, C), where

  • B: batch size
  • S: sequence length
  • C: number of characters in vocabulary (e.g. 4 for DNA/RNA, 20 for proteins)

perform ISM as follows:

from fastism import FastISM

fast_ism_model = FastISM(model)

for seq_batch in sequences:
    # seq_batch has dim (B, S, C)
    ism_seq_batch = fast_ism_model(seq_batch)
    # ism_seq_batch has dim (B, S, num_outputs) 

fastISM checks correctness when the model is initialised, which may take a few seconds depending on the size of your model. This ensures that the outputs of the fastISM model match those of an unoptimised implementation. You can turn the check off with FastISM(model, test_correctness=False). fastISM also supports introducing specific mutations, mutating different ranges of the input sequence, and models with multiple outputs. Check the Examples section of the documentation for more details. An executable tutorial is available on Colab.
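
For reference, an unoptimised ISM amounts to one full forward pass per mutated position. The sketch below shows that idea in plain Python for a single sequence. It is not the library's NaiveISM class: the function names are hypothetical, the toy model stands in for a Keras model, and mutating a position by setting it to zeros is an assumption made for this example.

```python
def naive_ism(model, seq):
    """Score every single-position mutation of one sequence.

    seq   : list of S rows, each a one-hot list of length C
    model : callable mapping such a sequence to a list of output values
    Returns a list of S output lists, one per mutated position.
    """
    num_chars = len(seq[0])
    results = []
    for i in range(len(seq)):
        mutated = [row[:] for row in seq]   # copy so the input is untouched
        mutated[i] = [0.0] * num_chars      # assumption: mutate by zeroing position i
        results.append(model(mutated))      # one full forward pass per position
    return results


def toy_model(seq):
    """Hypothetical stand-in for a Keras model: sums channel 0 ('A')."""
    return [sum(row[0] for row in seq)]


# "ACA" as a one-hot sequence with S = 3 and C = 4.
seq = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
print(naive_ism(toy_model, seq))  # [[1.0], [2.0], [1.0]]
```

The cost grows with S forward passes per sequence, which is the baseline that restricting computation to affected regions improves on.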

Benchmark

You can estimate the speedup obtained by comparing with a naive implementation of ISM.

# Test this code as is
>>> from fastism import FastISM, NaiveISM
>>> from fastism.models.basset import basset_model
>>> import tensorflow as tf
>>> import numpy as np
>>> from time import time

>>> model = basset_model(seqlen=1000)
>>> naive_ism_model = NaiveISM(model)
>>> fast_ism_model = FastISM(model)

>>> def time_ism(m, x):
...     t = time()
...     o = m(x)
...     print(time()-t)
...     return o

>>> x = tf.random.uniform((1024, 1000, 4),
...                       dtype=model.input.dtype)

>>> naive_out = time_ism(naive_ism_model, x)
144.013728
>>> fast_out = time_ism(fast_ism_model, x)
13.894407
>>> np.allclose(naive_out, fast_out, atol=1e-6) 
True
>>> np.allclose(fast_out, naive_out, atol=1e-6) 
True # np.allclose is not symmetric

See notebooks/ISMBenchmark.ipynb for benchmarking code that accounts for initial warm-up.
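
As a rough illustration of what accounting for warm-up can look like (this is a generic sketch, not the code from the notebook, and time_with_warmup is a hypothetical helper), the first calls can be run and discarded before timing:

```python
from time import perf_counter


def time_with_warmup(fn, x, warmup=1, repeats=3):
    """Return (output, best wall-clock seconds) for fn(x).

    The first `warmup` calls are run and discarded so that one-off costs
    (for example graph tracing or device initialisation) are not counted.
    """
    for _ in range(warmup):
        fn(x)
    best = float("inf")
    out = None
    for _ in range(repeats):
        t0 = perf_counter()
        out = fn(x)
        best = min(best, perf_counter() - t0)
    return out, best
```

With the models defined above, this could be called as fast_out, fast_secs = time_with_warmup(fast_ism_model, x), and likewise for naive_ism_model.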

Getting Help

fastISM supports the most commonly used subset of Keras for biological sequence-based models. Occasionally, you may find that some of the layers used in your model are not supported by fastISM. Refer to the Supported Layers section in the documentation for instructions on how to incorporate custom layers. In a few cases, the fastISM model may fail correctness checks, which likely indicates an issue in the fastISM code. In such cases, or if you run into any other bug, feel free to reach out to the author by posting an Issue on GitHub along with your architecture, and we'll try to work out a solution!

Citation

fastISM: Performant in-silico saturation mutagenesis for convolutional neural networks; Surag Nair, Avanti Shrikumar*, Jacob Schreiber*, Anshul Kundaje (Bioinformatics 2022) http://doi.org/10.1093/bioinformatics/btac135.

*equal contribution

Preprint available on bioRxiv.