FASTA format

From Wikipedia, the free encyclopedia

In bioinformatics, FASTA format is a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows sequence names and comments to precede the sequences.

The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like Python and Perl.

Format

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends when another line starting with ">" appears, which marks the start of the next sequence. A simple example of one sequence in FASTA format:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
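
The layout described above can be illustrated with a minimal Python sketch. The function name `parse_record` is hypothetical, the example sequence is shortened for brevity, and the code assumes a single record with no comment lines; it is not a complete FASTA parser.

```python
def parse_record(text):
    """Split one FASTA record into (identifier, description, sequence)."""
    lines = [line.rstrip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must begin with '>'")
    # The first word after '>' is the identifier; the rest is the description.
    header = lines[0][1:]
    identifier, _, description = header.partition(" ")
    # Sequence data may be wrapped over several lines; join them back together.
    sequence = "".join(lines[1:])
    return identifier, description, sequence


# Shortened copy of the example record (only the first and last sequence lines).
record = """>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
IENY
"""
identifier, description, sequence = parse_record(record)
print(identifier)
print(description)
print(len(sequence))
```

Splitting on the first space follows the rule that the identifier is the first word after ">"; everything else on the line is treated as free-text description.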

FASTA files can be converted to or from multi-FASTA format (several sequences in one file) using freely available converter tools.

Header line

The header line, which begins with '>', gives a name and/or a unique identifier for the sequence, and often a good deal of other information as well. Many sequence databases use standardized headers, which helps when extracting information from the header automatically. The header line may contain more than one header, separated by a ^A (Control-A) character (as in [1]).

In the original Pearson FASTA format, one or more comments, each distinguished by a semicolon at the beginning of the line, may occur after the header. Most databases and bioinformatics applications follow the NCBI FASTA specification instead and do not recognize these comments. An example of a multiple-sequence FASTA file follows:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Sequence representation

After the header line and any comments, one or more lines describing the sequence follow; each line of a sequence should have fewer than 80 characters. Sequences may be protein sequences or nucleic acid sequences, and they can contain gaps or alignment characters (see sequence alignment). Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed, but some databases use them to indicate the position in the sequence.
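
These conventions can be sketched in a few lines of Python. The helper names `normalize_sequence` and `format_record` are hypothetical; dropping digits is only one way a reader might handle position counters, and the default width of 70 is an assumption chosen to stay under the 80-character recommendation.

```python
def normalize_sequence(raw):
    """Upper-case a sequence and drop digits and whitespace."""
    # Digits (position counters in some databases) and whitespace are removed;
    # letters, '-' and '*' are kept as they are.
    return "".join(
        ch for ch in raw.upper() if not ch.isdigit() and not ch.isspace()
    )


def format_record(header, sequence, width=70):
    """Return a FASTA record whose sequence lines are at most `width` long."""
    if width < 1:
        raise ValueError("width must be positive")
    # Slice the sequence into fixed-size chunks, one per output line.
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + header] + body)


raw = "  1 acgtacgtac gtacgtacgt\n 21 ac-gtnnnn"
clean = normalize_sequence(raw)
print(clean)
print(format_record("demo", clean, width=10))
```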

The nucleic acid codes supported are:

 Code  Meaning                      Mnemonic
 A     Adenine
 C     Cytosine
 G     Guanine
 T     Thymine
 U     Uracil
 R     G or A                       puRine
 Y     T or C                       pYrimidine
 K     G or T                       Ketone
 M     A or C                       aMino group
 S     G or C                       Strong interaction
 W     A or T                       Weak interaction
 B     G, T or C                    not A (B comes after A)
 D     G, A or T                    not C (D comes after C)
 H     A, C or T                    not G (H comes after G)
 V     G, C or A                    not T, not U (V comes after U)
 N     A, G, C or T                 aNy
 X     masked
 -     gap of indeterminate length
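
The table above maps naturally onto a lookup structure. The following Python sketch is illustrative only; the names `IUPAC_NUCLEOTIDES`, `invalid_symbols` and `matches` are hypothetical, and treating X and the gap character as matching no base is an assumption made for this example.

```python
# Each code maps to the bases it can stand for, as listed in the table.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}
EXTRA_SYMBOLS = {"X", "-"}  # masked position and gap


def invalid_symbols(sequence):
    """Return the set of characters not listed in the nucleic acid table."""
    allowed = set(IUPAC_NUCLEOTIDES) | EXTRA_SYMBOLS
    return {ch for ch in sequence.upper() if ch not in allowed}


def matches(code, base):
    """True if the (possibly ambiguous) code can stand for the given base."""
    return base.upper() in IUPAC_NUCLEOTIDES.get(code.upper(), "")


print(invalid_symbols("acgtRYN-x"))  # empty set: every symbol is in the table
print(invalid_symbols("ACGTZ"))      # Z is not a nucleic acid code
print(matches("R", "A"), matches("R", "C"))
```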

The amino acid codes supported are:

 Code  Meaning
 A     Alanine
 B     Aspartic acid or Asparagine
 C     Cysteine
 D     Aspartic acid
 E     Glutamic acid
 F     Phenylalanine
 G     Glycine
 H     Histidine
 I     Isoleucine
 K     Lysine
 L     Leucine
 M     Methionine
 N     Asparagine
 O     Pyrrolysine
 P     Proline
 Q     Glutamine
 R     Arginine
 S     Serine
 T     Threonine
 U     Selenocysteine
 V     Valine
 W     Tryptophan
 Y     Tyrosine
 Z     Glutamic acid or Glutamine
 X     any
 *     translation stop
 -     gap of indeterminate length
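
As a small illustration of how the amino acid table might be used, the Python sketch below spells out a peptide letter by letter. The names `AMINO_ACIDS` and `describe` are hypothetical, and raising an error on an unlisted letter is a design choice for this example rather than a requirement of the format.

```python
AMINO_ACIDS = {
    "A": "Alanine", "B": "Aspartic acid or Asparagine", "C": "Cysteine",
    "D": "Aspartic acid", "E": "Glutamic acid", "F": "Phenylalanine",
    "G": "Glycine", "H": "Histidine", "I": "Isoleucine", "K": "Lysine",
    "L": "Leucine", "M": "Methionine", "N": "Asparagine", "O": "Pyrrolysine",
    "P": "Proline", "Q": "Glutamine", "R": "Arginine", "S": "Serine",
    "T": "Threonine", "U": "Selenocysteine", "V": "Valine", "W": "Tryptophan",
    "Y": "Tyrosine", "Z": "Glutamic acid or Glutamine", "X": "any",
    "*": "translation stop", "-": "gap of indeterminate length",
}


def describe(sequence):
    """Spell out each residue; raise ValueError on a letter not in the table."""
    names = []
    for position, letter in enumerate(sequence.upper(), start=1):
        if letter not in AMINO_ACIDS:
            raise ValueError(f"unknown code {letter!r} at position {position}")
        names.append(AMINO_ACIDS[letter])
    return names


print(describe("mu*"))
```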

Sequence identifiers

The NCBI defined a standard for the unique identifier used for the sequence (SeqID) in the header line. The formatdb man page has this to say on the subject: "formatdb will automatically parse the SeqID and create indexes, but the database identifiers in the FASTA definition line must follow the conventions of the FASTA Defline Format."

However, the NCBI does not give a definitive description of the FASTA defline format. An attempt to summarize the conventions is given below.

 GenBank                           gi|gi-number|gb|accession|locus
 EMBL Data Library                 gi|gi-number|emb|accession|locus
 DDBJ, DNA Database of Japan       gi|gi-number|dbj|accession|locus
 NBRF PIR                          pir||entry
 Protein Research Foundation       prf||name
 SWISS-PROT                        sp|accession|name
 Brookhaven Protein Data Bank (1)  pdb|entry|chain
 Brookhaven Protein Data Bank (2)  entry:chain|PDBID|CHAIN|SEQUENCE
 Patents                           pat|country|number 
 GenInfo Backbone Id               bbs|number 
 General database identifier       gnl|database|identifier
 NCBI Reference Sequence           ref|accession|locus
 Local Sequence identifier         lcl|identifier

The vertical bars in the above list are not separators in the sense of the Backus-Naur form, but are part of the format.
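
A bar-delimited identifier of this kind can be taken apart by reading a database tag and then the number of fields listed for it above. The Python sketch below is an illustration under assumptions: `SEQID_FIELDS` and `parse_seqid` are hypothetical names, the field layouts are copied from the list above (with `None` marking the empty field in the pir and prf forms), the second Brookhaven form is not handled, and parsing stops at the first unrecognized tag.

```python
# Field names that follow each database tag, taken from the list above.
SEQID_FIELDS = {
    "gi": ["gi-number"],
    "gb": ["accession", "locus"],
    "emb": ["accession", "locus"],
    "dbj": ["accession", "locus"],
    "pir": [None, "entry"],
    "prf": [None, "name"],
    "sp": ["accession", "name"],
    "pdb": ["entry", "chain"],
    "pat": ["country", "number"],
    "bbs": ["number"],
    "gnl": ["database", "identifier"],
    "ref": ["accession", "locus"],
    "lcl": ["identifier"],
}


def parse_seqid(seqid):
    """Return {tag: {field: value}} for a bar-delimited identifier."""
    tokens = seqid.split("|")
    parsed = {}
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        if tag not in SEQID_FIELDS:
            break  # stop at the first token that is not a known tag
        names = SEQID_FIELDS[tag]
        values = tokens[i + 1:i + 1 + len(names)]
        # Keep only named fields that actually carry a value.
        parsed[tag] = {n: v for n, v in zip(names, values) if n and v}
        i += 1 + len(names)
    return parsed


print(parse_seqid("gi|5524211|gb|AAD44166.1|"))
print(parse_seqid("sp|P02769|ALBU_BOVIN"))
```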

File extension

There is no standard file extension for a text file containing FASTA-formatted sequences. FASTA files often have extensions such as .fa, .mpfa, .fna, .fsa, .fas or .fasta.

HUPO-PSI Format

The HUPO-PSI format is meant to solve several pitfalls of the traditional FASTA format:

  • Definition lines vary widely for no good reason. This causes problems for end users who want to use these files with protein identification tools. The creators of these tools face the significant challenge of either supporting all of these variations or enabling users to cope with them.
  • The same database processed by different search engines yields different identifiers, which are difficult to map to each other (P00761 vs. ALBU_HUMAN).
  • The same protein can have very different identifiers in different databases (P00761 vs. gi|3446572|sp|p00761 vs. IPI:12345678).
  • The information extracted from FASTA headers is heterogeneous, which causes parsing problems; this information should come from the database.
  • The description and availability of taxonomy information (Latin names, common names, NCBI TaxID) are inconsistent.

Header block

The header block contains information about the included database(s). All lines in the block start with the '#' character, and each line carries one header term from the list below:

 Term                      Description                                                    Value
 #\DbComponent=            Count increment                                                Integer
 #\Name=                   Name of the database                                           CV from database provider (UniprotKnowledgeBase)
 #\PrimaryIdentifierType=  Identifier to be used as prefix for individual protein entries CV
 #\Decoy=                  Is it a decoy database?                                        ?: true/false or description
 #\Version=                Database version, according to the database provider           According to the database provider
 #\ReleaseDate=            The date of the source database
 #\NumberOfEntries=        Number of entries                                              Integer
 #\Sequence_type=          Sequence type                                                  DNA, AA, RNA, EST, etc.

Example Header Block:

#\DbComponent=1
#\Name=UniProt_SwissProt
#\PrimaryIdentifierType=sp_ac
#\Version=52.3
#\ReleaseDate=20070425
#\NumberOfEntries=248942
#\Sequence_type=Protein_sequence

#\DbComponent=2
#\Name=ENSEMBL
#\PrimaryIdentifierType=sp_ac
#\Version=12.45.3.2
#\ReleaseDate=20070425
#\NumberOfEntries=1234567
#\Sequence_type=Protein_sequence
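
A header block of this shape can be read into one dictionary per database component. The Python sketch below is a rough illustration: `parse_header_block` is a hypothetical name, all values are kept as strings, and the component term is matched case-insensitively as a precaution.

```python
def parse_header_block(lines):
    """Group header-block term lines into one dict per database component."""
    components = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("#\\") or "=" not in line:
            continue  # not a header-block term line
        term, _, value = line[2:].partition("=")
        if term.lower() == "dbcomponent" or not components:
            # A DbComponent line (or the very first term) opens a new component.
            components.append({})
        components[-1][term] = value
    return components


# Raw string so that the backslashes are kept literally.
block = r"""#\DbComponent=1
#\Name=UniProt_SwissProt
#\Version=52.3
#\NumberOfEntries=248942

#\DbComponent=2
#\Name=ENSEMBL
#\NumberOfEntries=1234567
"""
components = parse_header_block(block.splitlines())
for component in components:
    print(component["Name"], component["NumberOfEntries"])
```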

Sequence header line

Description of the individual entry header line, with examples:

  • The header starts with ">", followed by the database prefix and the primary AC; the prefix is useful if more than one database is concatenated. This field is mandatory. Example: >sp_ac|P000761
  • All non-sequence information is given as \term=value pairs, where the terms are controlled vocabulary descriptors. Example: \ID=ALBU_HUMAN
  • The order of the additional fields is not important.
  • A value can be a list, whose elements are written as (value_1)(value_2). Example: \ALTERNATE_AC=(P00786)(Q22222)
  • A value can be enclosed in double quotes if needed. Example: \DE="Human serum albumin"
  • "|" can be used as a separator for the individual fields of a value. Example: \MODRES=(1|Acetyl)
  • Ctrl-A as separator for multi-header entries? (NCBInr use case)

 Header field term  Definition                                      Format
 ALT_AC             Alternative AC
 ID                 SwissProt_ID
 DE                 Protein description
 ALT_DE             Alternative description
 NCBITAXID          NCBI taxonomy identifier (9606)                 Integer
 TAX_LATIN          Taxonomy in Latin name (Homo sapiens)
 TAX_COM            Taxonomy in common name format (human)
 MODRES             Modified residue (PTM)                          (position|modification) (PSI_MOD)
 VARIANT            Residue mutation                                (position|original residue|final residue)

Example Protein Entry:

>sp_ac|P02769_WOSIG0 \ID=ALBU_BOVIN \DE="Serum albumin precursor (Allergen Bos d 6) (BSA)" \NCBITAXID=9913 \MODRES=(1|Acetyl) \VARIANT=(196|A|T) \LENGTH=589
RGVFRRDTHKSEIAHRFKDLGEEHFKGLVLIAFSQYLQQCPFDEHVKLVNELTEFAKTCV
ADESHAGCEKSLHTLFGDELCKVASLRETYGDMADCCEKQEPERNECFLSHKDDSPDLPK
LKPDPNTLCDEFKADEKKFWGKYLYEIARRHPYFYAPELLYYANKYNGVFQECCQAEDKG
ACLLPKIETMREKVLASSARQRLRCASIQKFGERALKAWSVARLSQKFPKAEFVEVTKLV
TDLTKVHKECCHGDLLECADDRADLAKYICDNQDTISSKLKECCDKPLLEKSHCIAEVEK
DAIPENLPPLTADFAEDKDVCKNYQEAKDAFLGSFLYEYSRRHPEYAVSVLLRLAKEYEA
TLEECCAKDDPHACYSTVFDKLKHLVDEPQNLIKQNCDQFEKLGEYGFQNALIVRYTRKV
PQVSTPTLVEVSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTK
CCTESLVNRRPCFSALTPDETYVPKAFDEKLFTFHADICTLPDTEKQIKKQTALVELLKH
KPKATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPKLVVSTQTALA
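
The header line of such an entry can be split into its prefix, accession and \term=value fields. The Python sketch below is a hedged illustration rather than a reference parser: `parse_psi_header` is a hypothetical name, and the code assumes that a backslash followed by `TERM=` never occurs inside a quoted value.

```python
import re


def parse_psi_header(line):
    """Split a PSI-style header into (prefix, accession, fields)."""
    if not line.startswith(">"):
        raise ValueError("header line must begin with '>'")
    # Every field is introduced by a backslash followed by TERM=.
    parts = re.split(r"\s*\\(?=\w+=)", line[1:].strip())
    prefix, _, accession = parts[0].partition("|")
    fields = {}
    for part in parts[1:]:
        term, _, value = part.partition("=")
        value = value.strip()
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            fields[term] = value[1:-1]  # quoted free text
        elif value.startswith("(") and value.endswith(")"):
            # (a|b)(c) becomes [["a", "b"], ["c"]]
            items = re.findall(r"\(([^()]*)\)", value)
            fields[term] = [item.split("|") for item in items]
        else:
            fields[term] = value
    return prefix, accession, fields


header = (
    r'>sp_ac|P02769_WOSIG0 \ID=ALBU_BOVIN '
    r'\DE="Serum albumin precursor (Allergen Bos d 6) (BSA)" '
    r'\NCBITAXID=9913 \MODRES=(1|Acetyl) \VARIANT=(196|A|T) \LENGTH=589'
)
prefix, accession, fields = parse_psi_header(header)
print(prefix, accession)
for term, value in fields.items():
    print(term, value)
```

Quoted values are checked before list values so that a description containing parentheses, as in the example entry, is kept as plain text.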
