GenBank multi-line qualifier value parsed incorrectly in specific cases #5101

@BioGavin

Hi, I found a parsing issue when reading GenBank files generated by antiSMASH.

When a long qualifier value such as /domain_id is wrapped across multiple lines, Biopython inserts a space at each line break when it joins the lines during parsing, which is not desired in this case.

This behavior is acceptable for many qualifiers, but here the inserted whitespace changes the intended format of the value.

Here is the GBK file content:
[Image: GBK file content]

The code and extraction output are below:

from Bio import SeqIO

gbk = "QUAK01000239.1.region001.gbk"

for record in SeqIO.parse(gbk, "genbank"):
    for f in record.features:
        if f.type == "aSDomain":
            domain_id = f.qualifiers["domain_id"]
            print(domain_id)

[Image: extraction output]

I would like to know whether this behavior is caused by non-standard GenBank formatting in the antiSMASH output, or whether Biopython has a recommended approach for reliably handling such multi-line qualifier values.

Thanks!
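
For reference, here is a minimal sketch of a post-processing workaround, not a Biopython feature. It assumes that /domain_id values never contain meaningful whitespace, so any whitespace found in a parsed value came from line wrapping. The helper name `strip_wrapped_whitespace` and the sample value are hypothetical.

```python
import re


def strip_wrapped_whitespace(values):
    """Return qualifier values with every run of whitespace removed.

    Assumption: the qualifier (e.g. domain_id) never contains
    meaningful whitespace, so any space was introduced when the
    wrapped lines were joined.
    """
    return [re.sub(r"\s+", "", value) for value in values]


# Hypothetical parsed value; a real antiSMASH identifier will differ.
parsed = ["example_domain_identifier_part1_ part2"]
print(strip_wrapped_whitespace(parsed))
```

In the loop above, this would be applied as `strip_wrapped_whitespace(f.qualifiers["domain_id"])`, since each qualifier is returned as a list of strings. It should not be applied to free-text qualifiers such as /note or /product, where spaces are part of the value.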
