Proposal: Optional ProteinSeq class for protein sequences #5115

@harisj739

Hey everyone! I'm new to Biopython and wanted to make my first contribution. I was going over the tutorial, specifically the Seq class, and found that there's no separate class or module for protein sequences. I feel like a separate class would be useful in cases where you're only dealing with protein sequences.

I'm proposing a class called ProteinSeq (possibly added to a new package called "Protein"?) that inherits from Seq. It does the following:

  • Ensures the protein is valid.
  • Returns protein-specific properties.
  • Converts from 1-letter to 3-letter amino acid codes.
  • Initializes proteins from their 3-letter amino acid codes.

I saw that Seq already handles codon -> amino acid conversion with its translate() method, so ProteinSeq doesn't reimplement translation and simply inherits that method.

Please let me know what you think and whether this would be useful to add.

Here's the code:

from Bio.Seq import Seq
from Bio.SeqUtils import ProtParam
from Bio.Data import IUPACData

class ProteinSeq(Seq):

    """A typed sequence class specifically for protein sequences.

    New features:
        1. Validate protein alphabet --> _validate()
        2. Return protein-specific properties using the ProteinAnalysis class --> all the @property annotated functions
        3. Convert 1-letter AA code to 3-letter --> one_to_three_letter()
        4. Initialize protein from 3-letter AA code --> from_three_letter()
      
    Example usage (assuming a new package "Protein" is added):
        from Bio.Protein import ProteinSeq
        protein = ProteinSeq('PHAGY')
        print(protein)

    Output:
        PHAGY
    """

    valid_letters = set(IUPACData.protein_letters)

    def __init__(self, data, *args, validate=True, **kwargs):
        super().__init__(data, *args, **kwargs)
        if validate:
            self._validate()

    def _validate(self):
        """Validate that all characters are valid amino acids."""
        seq_set = set(str(self))
        if not seq_set.issubset(self.valid_letters):
            # Sort the offending characters so the message is deterministic.
            invalid = sorted(seq_set - self.valid_letters)
            raise ValueError(
                f"Invalid characters in protein sequence: {invalid}"
            )

    @property
    def analysis(self):
        """Return a ProtParam analysis object."""
        return ProtParam.ProteinAnalysis(str(self))

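    @property
    def molecular_weight(self):
        """Return the molecular weight computed by ProteinAnalysis."""
        return self.analysis.molecular_weight()

    @property
    def isoelectric_point(self):
        """Return the isoelectric point (pI) computed by ProteinAnalysis."""
        return self.analysis.isoelectric_point()

    @property
    def aromaticity(self):
        """Return the aromaticity (relative frequency of Phe, Trp and Tyr)."""
        return self.analysis.aromaticity()

    @property
    def gravy(self):
        """Return the GRAVY (grand average of hydropathy) score."""
        return self.analysis.gravy()

    @property
    def amino_acid_composition(self):
        """Return a dict mapping each amino acid to its count."""
        return self.analysis.count_amino_acids()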

    def one_to_three_letter(self):
        """
        Convert 1-letter AA codes to 3-letter codes.
       
        Example Usage:
            three_letter_aa = ProteinSeq('MASL').one_to_three_letter()
            print(three_letter_aa)
        
        Output:
            Met-Ala-Ser-Leu
        """
        aa1 = str(self)
        aa3 = []
        for aa in aa1:
            if aa not in IUPACData.protein_letters:
                raise ValueError(f"ERROR: Invalid amino acid {aa}")
            # protein_letters_1to3 maps 1-letter codes to 3-letter codes, e.g. "M" -> "Met".
            aa3.append(IUPACData.protein_letters_1to3[aa])
        return "-".join(aa3)

    @classmethod
    def from_three_letter(cls, seq3):
        """
        Initialize ProteinSeq from 3-letter AA codes separated by hyphens.

        Example Usage:
            new_protein = ProteinSeq.from_three_letter("Met-Ala-Lys")
            print(new_protein)
        
        Output:
            MAK
        """
        parts = seq3.split("-")
        aa1 = []
        mapping = IUPACData.protein_letters_3to1

        for aa in parts:
            aa_cap = aa.capitalize()
            if aa_cap not in mapping:
                raise ValueError(f"ERROR: Invalid 3-letter amino acid {aa}")
            aa1.append(mapping[aa_cap])

        return cls("".join(aa1))
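
For anyone who wants to poke at the validation and conversion logic without installing Biopython, here's a rough standalone sketch. It is not part of the proposed class: the mapping is written out by hand for the 20 standard amino acids, and the names ONE_TO_THREE, THREE_TO_ONE, find_invalid_letters, one_to_three and three_to_one are made up for this illustration.

```python
# Standalone sketch (no Biopython needed). The table below is hand-written for
# the 20 standard amino acids and only mirrors the logic of _validate(),
# one_to_three_letter() and from_three_letter() above. All names are hypothetical.
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def find_invalid_letters(sequence):
    """Return the sorted characters that are not standard 1-letter codes."""
    return sorted(set(sequence) - set(ONE_TO_THREE))


def one_to_three(sequence):
    """Convert 1-letter codes to hyphen-separated 3-letter codes."""
    invalid = find_invalid_letters(sequence)
    if invalid:
        raise ValueError(f"Invalid characters in protein sequence: {invalid}")
    return "-".join(ONE_TO_THREE[aa] for aa in sequence)


def three_to_one(hyphenated):
    """Convert hyphen-separated 3-letter codes (any case) to 1-letter codes."""
    letters = []
    for code in hyphenated.split("-"):
        key = code.capitalize()  # "met" and "MET" both become "Met"
        if key not in THREE_TO_ONE:
            raise ValueError(f"Invalid 3-letter amino acid {code}")
        letters.append(THREE_TO_ONE[key])
    return "".join(letters)


print(one_to_three("MASL"))             # Met-Ala-Ser-Leu
print(three_to_one("met-ALA-Lys"))      # MAK
print(find_invalid_letters("PHAGYX*"))  # ['*', 'X']
```

One thing the last line shows: with a strict 20-letter alphabet, characters such as "X" or the stop symbol "*" count as invalid in this sketch, so whether the real class should accept them is probably worth discussing.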
