Bio::Tools::GFF doesn't write eukaryotic multi-exon genes correctly #369

Open
krobison13 opened this issue Apr 7, 2022 · 7 comments

Comments

@krobison13

When writing GFF, the same frame is assigned to every range in a multi-exon gene, rather than assigning 0, 1, or 2 to each range to specify its frame.

Twitter note including image of the offending loop

@hyphaltip
Member

hyphaltip commented Apr 8, 2022

thanks for reporting a bug rather than just twitter rant!

maybe others can look at this too @cjfields -- it's been 15+ years with that code. I think this is more about assumptions about features vs locations that are obscuring the problem you describe.

If one is reading and writing multi-exon genes as individual features, which is the typical way the frame is encoded, this all works as planned. But if a single feature is encoded as a split location, the frame isn't necessarily encoded in a multi-location GenBank file location.

Probably, if you wanted it to be computed from the data, that might be helpful, but it also makes assumptions about the Generic feature being a CDS. This goes back to pre-GFF3, when there were multiple interpretations of how parent/child relationships were encoded, from gff1->gff2->gff2.5/gtf etc.

I think much better validators and correctors for GFF (perhaps http://genometools.org/ ) have implemented a more dedicated logic.

maybe you can show the input data that you used - are you converting GenBank to GFF and expecting the frame to be computed, and the assumption that it is a CDS with a frame to be carried through?

@krobison13
Author

The zip file has a simple Genbank-formatted entry and a simple program that exposes the problem -- the correct sequence of frames is 0,0,2,1,0

gff-bug-reveal.zip
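
For readers following along, here is a minimal sketch of how a per-exon frame can be derived from a split CDS location. It is written in Python rather than Perl and does not use or reproduce the Bio::Tools::GFF code. It assumes the GFF3 definition of the column (phase: the number of bases to skip at the start of the feature to reach the first base of the next codon) and a complete CDS that begins at phase 0. The function name `cds_phases` and the exon coordinates are hypothetical, chosen only so that the result matches the 0,0,2,1,0 pattern mentioned above; they are not taken from the attached zip file.

```python
def cds_phases(exons, strand=1):
    """Return (start, end, phase) for each CDS exon, in genomic order.

    exons  : list of (start, end) tuples, 1-based inclusive coordinates
    strand : 1 for the plus strand, -1 for the minus strand
    Assumes the first exon in transcription order starts at phase 0.
    """
    # Walk the exons in transcription order: ascending on the plus
    # strand, descending on the minus strand.
    ordered = sorted(exons, reverse=(strand < 0))
    phase_by_exon = {}
    phase = 0
    for start, end in ordered:
        phase_by_exon[(start, end)] = phase
        length = end - start + 1
        # Bases left over after skipping `phase` bases and reading whole
        # codons decide how many bases the next exon must skip.
        phase = (3 - (length - phase) % 3) % 3
    return [(s, e, phase_by_exon[(s, e)]) for s, e in sorted(exons)]


# Hypothetical exon lengths 30, 31, 31, 31, 30 (153 bases, 51 codons).
example = [(1, 30), (101, 131), (201, 231), (301, 331), (401, 430)]
print([p for _, _, p in cds_phases(example)])  # [0, 0, 2, 1, 0]
```

The same function handles minus-strand genes by walking the exons from the highest coordinate down, which is one of the places where a writer that assigns a single frame to every sub-location would also go wrong.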

@cjfields
Member

cjfields commented Apr 8, 2022

Yeah I agree w/ @hyphaltip , I suspect there's bit rot from prior logical assumptions that have changed over time. I also vaguely recall Bio::Tools::GFF was to be deprecated in preference to Bio::DB::SeqFeature, though I'm not sure that is still the case.

Would it be worth looking into Bio::Tools::GFF or should we check Bio::DB::SeqFeature? If @scottcain is around, maybe he would know? I think there was a GenBank-to-GFF conversion script for Bio::DB::SeqFeature (maybe within the GBrowse2 code?); we could check to see if it gives the correct frames.

@scottcain
Member

Ugh. Bio::Tools::GFF was old and janky a long time ago and should probably be marked as such, since I don't think it is likely to have improved with age. It is hard to remember the logic that went into that bit of code (I don't recall if I wrote it--I hope not--but I certainly might have!). I think @cjfields is right about there having been a GB to GFF3 script, but I don't recall where it lived.

There is a script with GBrowse, https://github.com/GMOD/GBrowse/blob/master/bin/load_genbank.pl, but it loads into a Bio::DB::GFF database (so, GFF2 and mysql or postgres). I don't have the time to do the code archeology to determine if it handles strand better.

@hyphaltip
Member

hyphaltip commented Apr 12, 2022 via email

@carandraug
Member

> I think @cjfields is right about there having been a GB to GFF3 script, but I don't recall where it lived.

There is bin/bp_genbank2gff3 which is in this repo and part of the BioPerl distribution.

There was also a bin/bp_genbank2gff, which was moved to the Bio-DB-GFF distribution.

@cjfields
Member

Apologies to @krobison13 about the wait, but all of us 'old-timers' are pretty time constrained these days.

Coming back around to this, I think we should deprecate Bio::Tools::GFF, particularly if there are better options, but we should definitely point users in the right direction regardless of what we decide.
