Protein sequence representations for repeat detection

Install

To use pLM-Repeat, first set up the Python environment.

Create a new conda environment:

conda create -p your_path_to_new_env python=3.9
conda activate your_path_to_new_env

Install dependencies:

cd path_to_plmrepeat
pip install -r requirements.txt
pip install -e .

Introduction

The workflow of pLM-Repeat can be briefly described as follows:

1. Perform self-sequence alignment using pLM-BLAST (Kaminski et al., Bioinformatics, 2023). For more details on pLM-BLAST, please check the reference and its repository.
2. Add transitive traces.
3. Build the score matrix based on all traces.
4. Estimate possible repeat lengths from the cumulative score along each diagonal.
5. Determine the position of the representative repeat using a sliding window of the estimated length.
6. Compute a weighted embedding of the representative repeat.
7. Run a second pLM-BLAST search between the representative embedding and the full-length sequence embedding to extract repeat instances.
8. Output the alignment (score) matrix, the repeat ranges, and a multiple sequence alignment (experimental).
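The length-estimation and sliding-window steps can be illustrated with a toy sketch. This is not the pLM-Repeat implementation: the score matrix here is a stand-in built from exact letter matches rather than pLM-BLAST traces, and the function names (`toy_score_matrix`, `diagonal_sums`, `estimate_repeat_length`, `best_window`) are hypothetical. It only shows the general idea that a periodic sequence produces high cumulative scores on the off-diagonal whose offset equals the repeat length.

```python
def toy_score_matrix(sequence):
    """Stand-in for an alignment score matrix (assumption for this demo):
    1 where two different positions carry the same letter, 0 elsewhere."""
    n = len(sequence)
    return [[1 if i != j and sequence[i] == sequence[j] else 0 for j in range(n)]
            for i in range(n)]


def diagonal_sums(matrix):
    """Cumulative score on each upper off-diagonal, keyed by its offset k."""
    n = len(matrix)
    return {k: sum(matrix[i][i + k] for i in range(n - k)) for k in range(1, n)}


def estimate_repeat_length(matrix):
    """Pick the offset with the highest cumulative score (smallest offset on ties)."""
    sums = diagonal_sums(matrix)
    return max(sums, key=sums.get)


def best_window(matrix, length):
    """Slide a window of `length` rows and return the (start, end) span,
    end exclusive, whose rows carry the highest total score."""
    row_totals = [sum(row) for row in matrix]
    best_start, best_score = 0, float("-inf")
    for start in range(len(matrix) - length + 1):
        score = sum(row_totals[start:start + length])
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_start + length


sequence = "QABCABCABCW"  # toy sequence with three copies of "ABC"
matrix = toy_score_matrix(sequence)
length = estimate_repeat_length(matrix)
start, end = best_window(matrix, length)
print(length, sequence[start:end])  # prints: 3 ABC
```

In the real pipeline the scores come from embedding-based alignments, so the diagonals are noisier and several candidate lengths may be worth inspecting.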

Performance

Usage

Run pLM-Repeat to detect repeats

cd path_to_plmrepeat
plmrepeat-run --emb path_to_embedding_file --seq path_to_sequence_file --out path_to_output_dir --transitivity --draw

For example:

plmrepeat-run --emb ./example/2QJ6_A.emb --seq ./example/2QJ6_A.fasta --out ./example/2QJ6_A/ --transitivity --draw
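To process several proteins, the command above can be assembled programmatically. The sketch below is a hypothetical helper, not part of pLM-Repeat: it uses only the flags shown in the example and assumes that each embedding file `NAME.emb` sits next to a sequence file `NAME.fasta`, as in the `example` directory, and that results should go to a directory named after the file stem.

```python
from pathlib import Path


def build_command(emb, seq, out, transitivity=True, draw=True):
    """Return the plmrepeat-run argument list for one protein."""
    cmd = ["plmrepeat-run", "--emb", str(emb), "--seq", str(seq), "--out", str(out)]
    if transitivity:
        cmd.append("--transitivity")
    if draw:
        cmd.append("--draw")
    return cmd


def commands_for_directory(directory):
    """Pair every *.emb file with the *.fasta file of the same stem
    (a naming convention assumed from the example files)."""
    directory = Path(directory)
    commands = []
    for emb in sorted(directory.glob("*.emb")):
        seq = emb.with_suffix(".fasta")
        if seq.exists():
            commands.append(build_command(emb, seq, directory / emb.stem))
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository directory, or printed first to review what would be executed.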

Show the help message:

plmrepeat-run --help

A Colab demo is available at https://colab.research.google.com/drive/1ouBwciiXy7HPnddut15JAGAmREWaqaZ7. The notebook lets you run pLM-Repeat interactively, for example by selecting the optimal repeat length from the score/substitution matrix plot.

Output

Example on protein 4R36_A:

Residue range of detected repeats:

(31, 46), (49, 64), (67, 82), (97, 112), (121, 136), (139, 156), (157, 172), (175, 190)
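A short sketch of how such a range list could be post-processed, for instance to get the span of each repeat and the spacing between neighbours. The indexing convention (0- or 1-based, inclusive or exclusive end) is not stated here, so the code simply reports `end - start`; for this example that difference matches the ungapped lengths of the repeat rows in the alignment that follows (15 and 17 residues). The helper name `parse_ranges` is hypothetical.

```python
import ast


def parse_ranges(text):
    """Parse a string like "(31, 46), (49, 64)" into a list of (start, end) tuples."""
    return [tuple(pair) for pair in ast.literal_eval(f"[{text}]")]


ranges = parse_ranges(
    "(31, 46), (49, 64), (67, 82), (97, 112), "
    "(121, 136), (139, 156), (157, 172), (175, 190)"
)

# Span of each repeat and distance from the end of one repeat to the start of the next.
spans = [end - start for start, end in ranges]
gaps = [nxt[0] - cur[1] for cur, nxt in zip(ranges, ranges[1:])]

print(spans)  # [15, 15, 15, 15, 15, 17, 15, 15]
print(gaps)   # [3, 3, 15, 9, 3, 1, 3]
```

The larger gaps (15 and 9) mark stretches between detected repeats that the run did not assign to any repeat unit.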

Repeat alignment. Note that the multiple alignment is built by directly combining the pairwise alignments of the representative repeat vs. each detected repeat, so the concatenated multiple alignment may be far from optimal:

>representative repeat
ALIGNGCIVGNSTKMAGE
>repeat1
AKIGENVEIAPFVYI---
>repeat2
VVIGDNNKIMANANI---
>repeat3
SRIGNGNTIFPGAVI---
>repeat4
AEIGDNNLIRENVTI---
>repeat5
TIVGNNNLLMEGVHV---
>repeat6
ALIGNGCIVGNSTKMAG-
>repeat7
IIIDDNAIISANVLM---
>repeat8
CRVGGYVMIQGGCRF---
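The alignment is in a FASTA-like layout, so it can be read with a few lines of standard-library Python. The sketch below is only an illustration with hypothetical helper names (`parse_fasta`, `column_consensus`); it computes the ungapped length of each row and a simple majority residue per column, which is one way to see which positions are conserved across the repeats.

```python
from collections import Counter

ALIGNMENT = """
>representative repeat
ALIGNGCIVGNSTKMAGE
>repeat1
AKIGENVEIAPFVYI---
>repeat2
VVIGDNNKIMANANI---
>repeat3
SRIGNGNTIFPGAVI---
>repeat4
AEIGDNNLIRENVTI---
>repeat5
TIVGNNNLLMEGVHV---
>repeat6
ALIGNGCIVGNSTKMAG-
>repeat7
IIIDDNAIISANVLM---
>repeat8
CRVGGYVMIQGGCRF---
"""


def parse_fasta(text):
    """Return {header: aligned_sequence} for FASTA-formatted text."""
    records, name = {}, None
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records


def column_consensus(rows):
    """Most common non-gap residue in each column ('-' if the column is all gaps)."""
    consensus = []
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]
        consensus.append(Counter(residues).most_common(1)[0][0] if residues else "-")
    return "".join(consensus)


records = parse_fasta(ALIGNMENT)
ungapped = {name: len(seq.replace("-", "")) for name, seq in records.items()}
consensus = column_consensus(list(records.values()))
print(ungapped["repeat1"], ungapped["repeat6"])  # prints: 15 17
print(consensus[2:4])  # prints: IG
```

Columns three and four (I and G) are the most conserved in this example; ties in other columns are resolved arbitrarily by this simple majority rule.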

Contact

This repository is under active construction. If you encounter any issues when using the pipeline once it is ready, please get in touch via [email protected]. Many thanks!
