GFF source methods

From WormBaseWiki

GFF source and feature

GFF2 description at the Sanger Institute

In the WormBase GFF files, genes are represented in several ways, each specified by a different source and feature (the second and third columns).

Gene spans

This is the largest extent of a gene's transcripts, from the beginning of the most 5' transcript's 5' UTR to the end of the most 3' transcript's 3' UTR. Each gene is represented as a single line.

  • source = gene; feature = gene.

e.g. nlp-36 in WormBase

CHROMOSOME_III  gene gene    9488630 9489091 .  +   .  Gene "WBGene00007185" ; Position "0.567795" ; Locus "nlp-36"
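As a minimal sketch of how the source and feature columns can be read from such a line, the following Python splits one GFF2 record into its nine columns. The function name `parse_gff2_line` and the returned dictionary layout are illustrative assumptions, not part of any WormBase tool. It also assumes whitespace-separated columns (as rendered on this page; the actual files are normally tab-delimited) and that attribute values contain no semicolons.

```python
def parse_gff2_line(line):
    """Split one GFF2 line into named columns (illustrative helper)."""
    # The first eight columns are single tokens; the ninth (attributes) is kept whole.
    cols = line.strip().split(None, 8)
    seqname, source, feature, start, end, score, strand, phase = cols[:8]

    attributes = {}
    if len(cols) > 8:
        # Attributes look like: Tag "value" ; Tag value ; ...
        # Assumption: no ';' occurs inside a quoted value.
        for part in cols[8].split(";"):
            part = part.strip()
            if not part:
                continue
            tag, _, value = part.partition(" ")
            attributes.setdefault(tag, []).append(value.strip().strip('"'))

    return {
        "seqname": seqname,
        "source": source,      # second column
        "feature": feature,    # third column
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


gene_line = ('CHROMOSOME_III  gene gene    9488630 9489091 .  +   .  '
             'Gene "WBGene00007185" ; Position "0.567795" ; Locus "nlp-36"')
record = parse_gff2_line(gene_line)
print(record["source"], record["feature"], record["attributes"]["Locus"])
# → gene gene ['nlp-36']
```

Attribute values are collected in lists because a tag such as Confirmed_EST can appear more than once on a line.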

CDS

A CDS is the coding sequence of a gene from the start codon to the stop codon (so it does not include the UTRs). A gene may have one or more CDSs. Each CDS is represented as a single line describing the start and end coordinates.

  • source = curated; feature = CDS.
CHROMOSOME_III  curated   CDS     9488634 9488986 .  +  .  CDS "B0464.3" ; WormPep "CE:CE00017" ;  Locus "nlp-36" ;  Status "Confirmed" ;  Gene "WBGene00007185" ;

The individual exons and introns are described with a single line per exon or intron. For the example three-exon gene:

  • source = curated; feature = exon;
CHROMOSOME_III  curated exon    9488634 9488695 .       +       .       CDS "B0464.3"
CHROMOSOME_III  curated exon    9488749 9488839 .       +       .       CDS "B0464.3"
CHROMOSOME_III  curated exon    9488891 9488986 .       +       .       CDS "B0464.3"
  • source = curated; feature = intron;
CHROMOSOME_III  curated intron  9488696 9488748 .       +       .       CDS "B0464.3" ; Confirmed_EST FM248941 ; Confirmed_EST OSTF051A3_1
CHROMOSOME_III  curated intron  9488840 9488890 .       +       .       CDS "B0464.3" ; Confirmed_EST yk1241c10.5 ; Confirmed_EST OSTF051A3_1
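The intron coordinates above are simply the gaps between consecutive exons. A short sketch, assuming 1-based inclusive coordinates as in the lines above (the function name `introns_from_exons` is hypothetical):

```python
def introns_from_exons(exons):
    """Return the (start, end) gaps between consecutive exons."""
    ordered = sorted(exons)
    # Each intron starts one base after an exon ends and
    # finishes one base before the next exon starts.
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]


cds_exons = [(9488634, 9488695), (9488749, 9488839), (9488891, 9488986)]
print(introns_from_exons(cds_exons))
# → [(9488696, 9488748), (9488840, 9488890)]
```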

Exons are also represented with their coding phase (eighth column):

  • source = curated; feature = coding_exon; (note: the coordinates are the same as for the exon lines)
CHROMOSOME_III  curated coding_exon     9488634 9488695 .       +       0       CDS "B0464.3"
CHROMOSOME_III  curated coding_exon     9488749 9488839 .       +       1       CDS "B0464.3"
CHROMOSOME_III  curated coding_exon     9488891 9488986 .       +       0       CDS "B0464.3"
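The phase values in the eighth column can be reproduced from the exon lengths. The sketch below assumes the phase is the number of bases to skip at the start of the exon before the next complete codon begins; that reading matches the 0, 1, 0 values shown here, but it is an assumption rather than something this page states. The function name `coding_phases` is hypothetical.

```python
def coding_phases(exons, strand="+"):
    """Compute a phase for each coding exon, in transcription order."""
    # On the minus strand the first coding exon has the highest coordinates.
    ordered = sorted(exons, reverse=(strand == "-"))
    phases = []
    coding_bases = 0  # coding bases seen in earlier exons
    for start, end in ordered:
        # Bases left over from the previous exon determine how many
        # bases of this exon complete the split codon.
        phases.append((3 - coding_bases % 3) % 3)
        coding_bases += end - start + 1
    return phases


cds_exons = [(9488634, 9488695), (9488749, 9488839), (9488891, 9488986)]
print(coding_phases(cds_exons))
# → [0, 1, 0]
```

The first exon is 62 bases long, which leaves two bases of an incomplete codon, so the second exon starts with phase 1; the first two exons together hold 153 bases (51 codons), so the third exon starts in phase 0.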

Coding_transcript

Each CDS can have one or more Coding_transcripts. Where a CDS has multiple transcripts, they vary only in their UTRs. Coding_transcripts are the best equivalent to a full-length mRNA that we can build from the available evidence. They run from the transcription start site (e.g. SL1) to the polyA site. The full extent of each Coding_transcript is defined on a single line; this CDS has two Coding_transcripts:

  • source = Coding_transcript; feature = protein_coding_primary_transcript
CHROMOSOME_III  Coding_transcript    protein_coding_primary_transcript   9488630 9489087 .   +   .   Transcript "B0464.3.1"
CHROMOSOME_III  Coding_transcript    protein_coding_primary_transcript   9488632 9489091 .   +   .   Transcript "B0464.3.2"

Compare these coordinates with the full gene span (above), which is larger than either transcript: it extends from 9488630 to 9489091, the outer extremities of the two Coding_transcripts.
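The relationship between the transcript extents and the gene span can be checked with a few lines of Python. This is only an illustration using the two transcripts listed above; the function name `gene_span` is hypothetical.

```python
def gene_span(transcripts):
    """Return the outermost (start, end) covered by a set of transcripts."""
    starts = [start for start, _ in transcripts.values()]
    ends = [end for _, end in transcripts.values()]
    return min(starts), max(ends)


transcripts = {
    "B0464.3.1": (9488630, 9489087),
    "B0464.3.2": (9488632, 9489091),
}
print(gene_span(transcripts))
# → (9488630, 9489091)
```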

Each Coding_transcript is composed of the following feature types: exons, coding_exons, introns, five_prime_UTR and three_prime_UTR.

Exons - source = Coding_transcript; feature = exon

CHROMOSOME_III  Coding_transcript       exon    9488630 9488695 .       +       .       Transcript "B0464.3.1"
CHROMOSOME_III  Coding_transcript       exon    9488749 9488839 .       +       .       Transcript "B0464.3.1"
CHROMOSOME_III  Coding_transcript       exon    9488891 9489087 .       +       .       Transcript "B0464.3.1"

and, as with the CDS, there is a "coding_exon" equivalent

Coding exons - source = Coding_transcript; feature = coding_exon

CHROMOSOME_III  Coding_transcript       coding_exon     9488634 9488695 .       +       0       Transcript "B0464.3.1" ; CDS "B0464.3"
CHROMOSOME_III  Coding_transcript       coding_exon     9488749 9488839 .       +       1       Transcript "B0464.3.1" ; CDS "B0464.3"
CHROMOSOME_III  Coding_transcript       coding_exon     9488891 9488986 .       +       0       Transcript "B0464.3.1" ; CDS "B0464.3"

Introns - source = Coding_transcript; feature = intron

CHROMOSOME_III  Coding_transcript       intron  9488696 9488748 .       +       .       Transcript "B0464.3.1" ; Confirmed_EST FM248941 ; Confirmed_EST OSTF051A3_1
CHROMOSOME_III  Coding_transcript       intron  9488840 9488890 .       +       .       Transcript "B0464.3.1" ; Confirmed_EST yk1241c10.5 ; Confirmed_EST OSTF051A3_1

UTRs

  • source = Coding_transcript; feature = five_prime_UTR
  • source = Coding_transcript; feature = three_prime_UTR

Each non-coding exon is represented as a single line

CHROMOSOME_III  Coding_transcript       five_prime_UTR  9488630 9488633 .       +       .       Transcript "B0464.3.1"
CHROMOSOME_III  Coding_transcript       three_prime_UTR 9488987 9489087 .       +       .       Transcript "B0464.3.1"
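The UTR lines can be derived by subtracting the CDS extent from the transcript's exons. The sketch below assumes a plus-strand transcript, as in this example; on the minus strand the two labels would swap. The function name `utr_segments` is hypothetical.

```python
def utr_segments(exons, cds_start, cds_end):
    """Split transcript exons into 5' and 3' UTR segments (plus strand only)."""
    five_prime, three_prime = [], []
    for start, end in sorted(exons):
        if start < cds_start:
            # Part of the exon lying upstream of the start codon.
            five_prime.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            # Part of the exon lying downstream of the stop codon.
            three_prime.append((max(start, cds_end + 1), end))
    return five_prime, three_prime


# Exons of Transcript "B0464.3.1" and the extent of CDS "B0464.3"
transcript_exons = [(9488630, 9488695), (9488749, 9488839), (9488891, 9489087)]
five, three = utr_segments(transcript_exons, 9488634, 9488986)
print(five, three)
# → [(9488630, 9488633)] [(9488987, 9489087)]
```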