General feature format
The general feature format (gene-finding format, generic feature format, GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences. The filename extension associated with such files is .GFF.
There are two versions of the GFF file format in general use:
- General Feature Format Version 2 (Sanger Institute)
- Generic Feature Format Version 3 (Sequence Ontology Project)
Servers that generate this format:
Server | Example file |
---|---|
UniProt | |
Clients that use this format:
Name | Description | Links |
---|---|---|
GBrowse | GMOD genome viewer | GBrowse |
IGB | Integrated Genome Browser | Integrated Genome Browser |
Jalview | A multiple sequence alignment editor & viewer | Jalview |
STRAP | Underlines sequence features in multiple alignments | |
JBrowse | JBrowse is a fast, embeddable genome browser built completely with JavaScript and HTML5 | JBrowse.org |
ZENBU | A collaborative, omics data integration and interactive visualization system | |
GFF Versions
GFF Version 2 has a number of deficiencies, notably that it can only represent two-level feature hierarchies and thus cannot handle the three-level hierarchy of gene → transcript → exon. GFF3 addresses this and other deficiencies. For example, it supports arbitrarily many hierarchical levels, and gives specific meanings to certain tags in the attributes field.
The Gene transfer format (GTF) is a refinement of GFF Version 2 and is sometimes referred to as GFF2.5.[1]
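The multi-level hierarchy that GFF3 supports can be illustrated with a minimal Python sketch. It assumes that the ninth field uses GFF3's semicolon-separated `key=value` syntax and that parent and child features are linked through the `ID` and `Parent` tags; the function names and the example records are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def parse_gff3_attributes(field):
    """Split a GFF3 attributes field ("key=value;key=value") into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = value
    return attrs


def build_children_index(lines):
    """Map each parent ID to the IDs of its direct children."""
    children = defaultdict(list)
    for line in lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        attrs = parse_gff3_attributes(line.rstrip("\n").split("\t")[8])
        # A feature may list several parents, separated by commas.
        for parent in filter(None, attrs.get("Parent", "").split(",")):
            children[parent].append(attrs.get("ID"))
    return dict(children)


# Hypothetical records, made up for illustration.
example = [
    "chr1\tdemo\tgene\t1000\t2000\t.\t+\t.\tID=gene1",
    "chr1\tdemo\tmRNA\t1000\t2000\t.\t+\t.\tID=mrna1;Parent=gene1",
    "chr1\tdemo\texon\t1000\t1200\t.\t+\t.\tID=exon1;Parent=mrna1",
    "chr1\tdemo\texon\t1500\t2000\t.\t+\t.\tID=exon2;Parent=mrna1",
]
index = build_children_index(example)
print(index)
```

Because each feature only names its direct parent, the same index can describe any number of levels; the three-level gene → transcript → exon case is simply two lookups deep.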
GFF general structure
All GFF formats (GFF2, GFF3 and GTF) are tabular files with 9 fields per line, separated by tabs. They all share the same structure for the first 7 fields, while differing in the definition of the eighth field and in the content and format of the ninth field. The general structure is as follows:
Position index | Position name | Description |
---|---|---|
1 | sequence | The name of the sequence where the feature is located. |
2 | source | Keyword identifying the source of the feature, like a program (e.g. Augustus or RepeatMasker) or an organization (like TAIR). |
3 | feature | The feature type name, like "gene" or "exon". In a well-structured GFF file, all child features follow their parent in a single block (so all exons of a transcript are placed after their parent "transcript" feature line and before any other parent transcript line). In GFF3, all features and their relationships should be compatible with the standards released by the Sequence Ontology Project. |
4 | start | Genomic start of the feature, 1-based and inclusive. This is in contrast with 0-based half-open sequence formats, like BED files. |
5 | end | Genomic end of the feature, 1-based and inclusive. Numerically, this is the same end coordinate as in 0-based half-open sequence formats, like BED files. |
6 | score | Numeric value that generally indicates the confidence of the source on the annotated feature. A value of "." (a dot) is used to define a null value. |
7 | strand | Single character that indicates the sense strand of the feature; it can take the values "+" (positive, or 5'->3'), "-" (negative, or 3'->5') or "." (undetermined). |
8 | frame (GTF, GFF2) or phase (GFF3) | Frame or phase of CDS features; it can be either one of 0, 1, 2 (for CDS features) or "." (for everything else). Frame and phase are not the same; see the following subsection. |
9 | attributes | All the other information pertaining to this feature. This is the field whose format, structure and content vary the most among the three file formats. |
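As a rough illustration of the table above, the following Python sketch splits one line into its nine fields and converts the 1-based inclusive coordinates into a 0-based half-open interval of the kind used by BED. The names `GffRecord`, `parse_gff_line` and `to_bed_interval` are hypothetical, as is the example line; the ninth field is left unparsed because its syntax differs between GFF2, GTF and GFF3.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    sequence: str
    source: str
    feature: str
    start: int                      # 1-based, inclusive
    end: int                        # 1-based, inclusive
    score: Optional[float]          # None when the file holds "."
    strand: str
    frame_or_phase: Optional[int]   # None when the file holds "."
    attributes: str                 # kept as raw text


def parse_gff_line(line):
    """Parse one tab-separated GFF/GTF feature line into a GffRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seq, source, feature, start, end, score, strand, eighth, attrs = fields
    return GffRecord(
        seq, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if eighth == "." else int(eighth),
        attrs,
    )


def to_bed_interval(rec):
    """Return (sequence, start, end) in 0-based half-open coordinates."""
    # Only the start shifts by one; the end value is numerically unchanged.
    return (rec.sequence, rec.start - 1, rec.end)


# Hypothetical feature line, made up for illustration.
record = parse_gff_line("chr1\tdemo\tCDS\t101\t250\t.\t+\t0\tID=cds1")
print(record)
print(to_bed_interval(record))
```

The feature length is `end - start + 1` in GFF coordinates and `end - start` in the converted interval; both give 150 bases for this example.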
The 8th field: frame or phase of CDS features
In GFF2 and GTF, the 8th field indicates the frame of the feature, that is, whether the first base of the CDS segment is the first (frame 0), second (frame 1) or third (frame 2) base of a codon of the ORF. The formula to derive this attribute is therefore (sum of the lengths of the previous CDS features) mod 3.
CDS stands for "coding sequence"; the exact meaning of the term is defined by the Sequence Ontology (SO). In GFF3, the 8th field instead indicates the phase of the CDS feature, which, according to the GFF3 specification maintained by the Sequence Ontology project (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md), is:
"where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon."
It is therefore the reverse of the frame: phase = (3 - (sum of the lengths of the previous CDS features) mod 3) mod 3 = (3 - frame) mod 3.
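The two formulas can be checked with a short Python sketch. It follows the definitions of frame and phase exactly as given in this section, and assumes that the CDS segments are listed in the order in which they are translated and that the first segment starts on the first base of a codon. The function name and the segment lengths are hypothetical.

```python
def frame_and_phase(cds_lengths):
    """Return a (frame, phase) pair for each CDS segment, in translation order."""
    results = []
    preceding = 0  # total length of the CDS segments before the current one
    for length in cds_lengths:
        frame = preceding % 3    # position of the first base within its codon
        phase = (3 - frame) % 3  # bases to skip to reach the next codon start
        results.append((frame, phase))
        preceding += length
    return results


# Hypothetical CDS segment lengths, made up for illustration.
print(frame_and_phase([100, 49, 30]))
# prints [(0, 0), (1, 2), (2, 1)]
```

In this example the first segment leaves one base of an unfinished codon, so the second segment starts on the second base of a codon (frame 1) and two bases must be skipped to reach the next codon (phase 2).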
Validation
The modENCODE project hosts an online GFF3 validation tool with generous limits of 286.10 MB and 15 million lines.
The GenomeTools software collection contains a gff3validator tool that can be used offline to validate, and possibly tidy, GFF3 files. An online validation service is also available.
See also
- Distributed Annotation System
- Variant Call Format
- Sequence alignment