Hey everyone! I'm new to Biopython and wanted to make my first contribution. I was going over the tutorial, specifically the Seq class, and found that there's no separate class or module for protein sequences. I feel like a separate class would be useful in cases where you're only dealing with protein sequences.
I'm proposing a class called ProteinSeq (possibly added to a new package called "Protein"?) that inherits from Seq. It does the following:
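- Validates that the sequence contains only the 20 standard IUPAC amino acid letters.
- Returns protein-specific properties (molecular weight, isoelectric point, aromaticity, GRAVY, and amino acid composition).
- Converts from 1-letter to 3-letter amino acid codes.
- Initializes proteins from their 3-letter amino acid codes.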
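To make the validation and code-conversion points concrete, here is a minimal, dependency-free sketch of the idea. The hard-coded `ONE_TO_THREE` table and the helper names `validate`, `one_to_three`, and `three_to_one` are illustrative assumptions: the table stands in for `IUPACData.protein_letters_1to3`, and the functions only approximate what the proposed methods would do.

```python
# Hypothetical stand-in for Bio.Data.IUPACData.protein_letters_1to3,
# hard-coded here so the sketch runs without Biopython installed.
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}
# Reverse lookup, e.g. "Met" -> "M".
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def validate(seq):
    """Raise ValueError if seq has letters outside the 20 standard amino acids."""
    invalid = set(seq) - set(ONE_TO_THREE)
    if invalid:
        raise ValueError(f"Invalid characters in protein sequence: {sorted(invalid)}")


def one_to_three(seq):
    """Convert 1-letter codes to hyphen-separated 3-letter codes."""
    validate(seq)
    return "-".join(ONE_TO_THREE[aa] for aa in seq)


def three_to_one(seq3):
    """Convert hyphen-separated 3-letter codes (any letter case) to 1-letter codes."""
    try:
        return "".join(THREE_TO_ONE[part.capitalize()] for part in seq3.split("-"))
    except KeyError as exc:
        raise ValueError(f"Invalid 3-letter amino acid {exc.args[0]!r}") from None


print(one_to_three("MASL"))         # Met-Ala-Ser-Leu
print(three_to_one("Met-Ala-Lys"))  # MAK
```

The two conversions are inverses of each other for valid input, which is the round trip that the proposed `one_to_three_letter()` and `from_three_letter()` are meant to provide.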
I saw that Seq already handles the codon -> amino acid conversion through its translate() method, so ProteinSeq would simply inherit that method from its parent.
Please let me know what you think and whether this would be useful to add.
Here's the code:
from Bio.Seq import Seq
from Bio.SeqUtils import ProtParam
from Bio.Data import IUPACData


class ProteinSeq(Seq):
    """A typed sequence class specifically for protein sequences.

    New features:
    1. Validate protein alphabet --> _validate()
    2. Return protein-specific properties using the ProteinAnalysis class --> all the @property annotated functions
    3. Convert 1-letter AA code to 3-letter --> one_to_three_letter()
    4. Initialize protein from 3-letter AA code --> from_three_letter()

    Example usage (assuming a new package "Protein" is added):
        from Bio.Protein import ProteinSeq
        protein = ProteinSeq('PHAGY')
        print(protein)
    Output:
        PHAGY
    """

    # The 20 standard amino acid letters ("ACDEFGHIKLMNPQRSTVWY").
    valid_letters = set(IUPACData.protein_letters)

    def __init__(self, data, *args, validate=True, **kwargs):
        super().__init__(data, *args, **kwargs)
        if validate:
            self._validate()

    def _validate(self):
        """Validate that all characters are valid amino acids."""
        seq_set = set(str(self))
        if not seq_set.issubset(self.valid_letters):
            raise ValueError(
                f"ERROR: Invalid characters in protein sequence: {seq_set - self.valid_letters}"
            )

    @property
    def analysis(self):
        """Return a ProtParam analysis object."""
        return ProtParam.ProteinAnalysis(str(self))

    @property
    def molecular_weight(self):
        return self.analysis.molecular_weight()

    @property
    def isoelectric_point(self):
        return self.analysis.isoelectric_point()

    @property
    def aromaticity(self):
        return self.analysis.aromaticity()

    @property
    def gravy(self):
        return self.analysis.gravy()

    @property
    def amino_acid_composition(self):
        return self.analysis.count_amino_acids()

    def one_to_three_letter(self):
        """
        Convert 1-letter AA codes to 3-letter codes.

        Example Usage:
            three_letter_aa = ProteinSeq('MASL').one_to_three_letter()
            print(three_letter_aa)
        Output:
            Met-Ala-Ser-Leu
        """
        aa1 = str(self)
        aa3 = []
        for aa in aa1:
            if aa not in IUPACData.protein_letters:
                raise ValueError(f"ERROR: Invalid amino acid {aa}")
            # protein_letters_1to3 maps e.g. "M" -> "Met".
            aa3.append(IUPACData.protein_letters_1to3[aa])
        return "-".join(aa3)

    @classmethod
    def from_three_letter(cls, seq3):
        """
        Initialize ProteinSeq from 3-letter AA codes separated by hyphens.

        Example Usage:
            new_protein = ProteinSeq.from_three_letter("Met-Ala-Lys")
            print(new_protein)
        Output:
            MAK
        """
        parts = seq3.split("-")
        aa1 = []
        mapping = IUPACData.protein_letters_3to1
        for aa in parts:
            # Normalize case so "MET" and "met" both match "Met".
            aa_cap = aa.capitalize()
            if aa_cap not in mapping:
                raise ValueError(f"ERROR: Invalid 3-letter amino acid {aa}")
            aa1.append(mapping[aa_cap])
        return cls("".join(aa1))