Stockholm format
From Wikipedia, the free encyclopedia
Stockholm format is a multiple sequence alignment format used by Pfam and Rfam to disseminate protein and RNA sequence alignments.[1][2][3] The alignment editors Ralee and Belvu support Stockholm format, as do the probabilistic database search tools Infernal and HMMER. A simple example of an Rfam alignment in Stockholm format is shown below:
# STOCKHOLM 1.0
AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//
A minimal well-formed Stockholm file should contain a header stating the format and version identifier, currently '# STOCKHOLM 1.0', followed by the sequences and their corresponding unique sequence names:
<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>
'<seqname>' stands for "sequence name", typically in the form "name/start-end" or just "name". Finally, the "//" line indicates the end of the alignment. Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".
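The rules above can be sketched as a small reader. This is a minimal illustration rather than a reference implementation: the function names parse_stockholm and split_name are hypothetical, mark-up lines are simply skipped, and the sketch assumes each sequence occupies a single line (a repeated name is reported as an error).

```python
import re

HEADER = "# STOCKHOLM 1.0"

def parse_stockholm(text):
    """Return {seqname: aligned sequence} for a minimal Stockholm file."""
    # Blank lines are ignored; surrounding whitespace is stripped.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != HEADER:
        raise ValueError("missing '# STOCKHOLM 1.0' header")
    seqs = {}
    for ln in lines[1:]:
        if ln == "//":
            return seqs  # end of the alignment
        if ln.startswith("#"):
            continue  # mark-up line, not handled in this sketch
        parts = ln.split()
        if len(parts) != 2:
            raise ValueError("expected '<seqname> <aligned sequence>': " + ln)
        name, aligned = parts
        if name in seqs:
            # Assumption of this sketch: one line per sequence name.
            raise ValueError("duplicate sequence name: " + name)
        seqs[name] = aligned
    raise ValueError("missing '//' terminator")

def split_name(seqname):
    """Split 'name/start-end' into (name, start, end); plain names give (name, None, None)."""
    m = re.fullmatch(r"(.+)/(\d+)-(\d+)", seqname)
    if m:
        return m.group(1), int(m.group(2)), int(m.group(3))
    return seqname, None, None
```

Applied to the Rfam example at the top of the article, parse_stockholm would return four entries keyed by names such as AF035635.1/619-641, and split_name would separate that key into the accession and its coordinates.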
The alignment mark-up:
Mark-up lines may include any characters except whitespace. Use underscore ("_") instead of space.
#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Sequence AND per-Column markup, exactly 1 char per column>
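The four mark-up line types differ only in whether a sequence name precedes the feature. A minimal sketch of splitting one such line into its fields follows; the function name parse_markup and the returned tuple layout are assumptions made for illustration.

```python
def parse_markup(line):
    """Split one mark-up line into (kind, seqname, feature, text).

    seqname is None for the per-file (#=GF) and per-column (#=GC) types.
    """
    kind = line.split(None, 1)[0] if line.strip() else ""
    if kind in ("#=GF", "#=GC"):
        parts = line.split(None, 2)  # tag, feature, remainder
        if len(parts) != 3:
            raise ValueError("incomplete mark-up line: " + line)
        return kind, None, parts[1], parts[2].rstrip()
    if kind in ("#=GS", "#=GR"):
        parts = line.split(None, 3)  # tag, seqname, feature, remainder
        if len(parts) != 4:
            raise ValueError("incomplete mark-up line: " + line)
        return kind, parts[1], parts[2], parts[3].rstrip()
    raise ValueError("not a mark-up line: " + line)
```

For example, parse_markup("#=GC SS_cons .AAA....<<<<aaa....>>>>") yields the kind "#=GC", no sequence name, the feature "SS_cons" and the per-column string.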
Magic or recommended features:
#=GF
(See Pfam documentation, under "Description of fields")
For embedding trees:
#=GF NH <tree in New Hampshire eXtended format>
#=GF TN <Unique identifier for the next tree>
- Notes: A tree may be stored on multiple #=GF NH lines.
- If multiple trees are stored in the same file, each tree must be preceded by a #=GF TN line with a unique tree identifier. If only one tree is included, the #=GF TN line may be omitted.
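The tree rules can be illustrated with a short sketch that groups #=GF NH lines under the preceding #=GF TN identifier. The function name collect_trees is hypothetical, the input is assumed to be (feature, text) pairs taken from the #=GF lines in file order, and joining multi-line trees without a separator is an assumption of this sketch.

```python
def collect_trees(gf_lines):
    """Group NH text under the preceding TN identifier.

    Returns {tree_id: tree_text}; a single tree without a TN line is
    stored under the key None.
    """
    trees = {}
    current_id = None
    for feature, text in gf_lines:
        if feature == "TN":
            if text in trees:
                raise ValueError("tree identifier is not unique: " + text)
            current_id = text
            trees[current_id] = ""
        elif feature == "NH":
            # A tree may span several NH lines; the pieces are joined as-is.
            trees[current_id] = trees.get(current_id, "") + text
    if None in trees and len(trees) > 1:
        raise ValueError("multiple trees, but one has no #=GF TN line")
    return trees
```

The tree strings are returned unparsed; interpreting the New Hampshire eXtended text is left to a dedicated tree library.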
#=GC
The same features as for #=GR with "_cons" appended, meaning "consensus". Example: "SS_cons".
#=GS
Rfam and Pfam use these features:
Feature                Description
---------------------  -----------
AC <accession>         ACcession number
DE <freetext>          DEscription
DR <db>; <accession>;  Database Reference
OS <organism>          OrganiSm (species)
OC <clade>             Organism Classification (clade, etc.)
LO <look>              Look (Color, etc.)
#=GR
Feature  Description            Markup letters
-------  -----------            --------------
SS       Secondary Structure    For RNA [.,;<>(){}[]AaBb...], For protein [HGIEBTSCX]
SA       Surface Accessibility  [0-9X] (0=0%-10%; ...; 9=90%-100%)
TM       TransMembrane          [Mio]
PP       Posterior Probability  [0-9*] (0=0.00-0.05; 1=0.05-0.15; *=0.95-1.00)
LI       LIgand binding         [*]
AS       Active Site            [*]
IN       INtron (in or after)   [0-2]
- Note: Do not use multiple lines with the same #=GR label. Only one unique feature assignment can be made for each sequence.
- "X" in SA and SS means "residue with unknown structure".
- In SS for proteins, the letters are taken from DSSP: H=alpha-helix, G=3/10-helix, I=pi-helix, E=extended strand, B=residue in isolated beta-bridge, T=turn, S=bend, C=coil/loop.
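A per-column mark-up string can be checked against the letters listed for its feature. The following sketch is illustrative only: the name check_gr is hypothetical, SS is left out because its RNA alphabet is open-ended, and treating "." and "-" as acceptable filler in the mark-up string is an assumption of this sketch rather than something stated in the table.

```python
# Markup letters for the #=GR features with a closed alphabet.
GR_ALPHABETS = {
    "SA": set("0123456789X"),
    "TM": set("Mio"),
    "PP": set("0123456789*"),
    "LI": set("*"),
    "AS": set("*"),
    "IN": set("012"),
}

def check_gr(feature, markup, aligned, filler=".-"):
    """Return a list of problems found in one #=GR line (empty if none)."""
    problems = []
    # Per-column mark-up needs exactly one character per alignment column.
    if len(markup) != len(aligned):
        problems.append("length %d does not match %d columns"
                        % (len(markup), len(aligned)))
    allowed = GR_ALPHABETS.get(feature)
    if allowed is not None:
        # 'filler' characters are an assumption of this sketch.
        bad = sorted(set(markup) - allowed - set(filler))
        if bad:
            problems.append("unexpected letters for %s: %s"
                            % (feature, "".join(bad)))
    return problems
```

Features that are not in the dictionary are only checked for length, so unknown or custom features pass through without complaint.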
Recommended placements:
- #=GF Above the alignment
- #=GC Below the alignment
- #=GS Above the alignment or just below the corresponding sequence
- #=GR Just below the corresponding sequence
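The recommended placements can be sketched as a small writer. The function name write_stockholm and its argument layout are hypothetical; #=GS lines are placed above the alignment, which is one of the two recommended options, and padding the labels so the columns line up is a readability choice of this sketch, not a requirement stated here.

```python
def write_stockholm(seqs, gf=(), gs=(), gr=(), gc=()):
    """Serialise an alignment using the recommended mark-up placements.

    seqs: list of (seqname, aligned)     gf: list of (feature, text)
    gs:   list of (seqname, feature, text)
    gr:   list of (seqname, feature, markup)
    gc:   list of (feature, markup)
    """
    labels = [name for name, _ in seqs]
    labels += ["#=GR %s %s" % (n, f) for n, f, _ in gr]
    labels += ["#=GC %s" % f for f, _ in gc]
    width = max(len(label) for label in labels) + 1 if labels else 0

    out = ["# STOCKHOLM 1.0"]
    out += ["#=GF %s %s" % (f, t) for f, t in gf]         # above the alignment
    out += ["#=GS %s %s %s" % (n, f, t) for n, f, t in gs]  # above the alignment
    for name, aligned in seqs:
        out.append(name.ljust(width) + aligned)
        for n, f, markup in gr:
            if n == name:  # just below the corresponding sequence
                out.append(("#=GR %s %s" % (n, f)).ljust(width) + markup)
    for f, markup in gc:                                  # below the alignment
        out.append(("#=GC %s" % f).ljust(width) + markup)
    out.append("//")
    return "\n".join(out) + "\n"
```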
Size limits:
- No size limits on any field.
- However, a simple parser that uses fixed field sizes should work safely on Pfam alignments with these limits:
- Line length: 10000.
- <seqname>: 255.
- <feature>: 255.
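A file can be checked against these limits before it is handed to a fixed-size parser. The sketch below is an illustration under stated assumptions: the function name exceeds_limits and the reason strings are hypothetical, and lengths are counted in characters.

```python
MAX_LINE = 10000
MAX_NAME = 255
MAX_FEATURE = 255

def exceeds_limits(text):
    """Return (line number, reason) pairs for lines beyond the limits."""
    problems = []
    for number, line in enumerate(text.splitlines(), 1):
        if len(line) > MAX_LINE:
            problems.append((number, "line length"))
        parts = line.split()
        if not parts or parts[0] == "//":
            continue
        if parts[0] in ("#=GS", "#=GR"):
            name, feature = parts[1:2], parts[2:3]
        elif parts[0] in ("#=GF", "#=GC"):
            name, feature = [], parts[1:2]
        elif parts[0].startswith("#"):
            continue  # header or other comment line
        else:
            name, feature = parts[0:1], []  # ordinary sequence line
        if name and len(name[0]) > MAX_NAME:
            problems.append((number, "seqname"))
        if feature and len(feature[0]) > MAX_FEATURE:
            problems.append((number, "feature"))
    return problems
```

An empty result means a parser with these fixed field sizes should be able to hold every line of the file.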
References
- Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A (2005). "Rfam: annotating non-coding RNAs in complete genomes". Nucleic Acids Res 33 (Database issue): D121-4. PMID 15608160.
- Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR (2003). "Rfam: an RNA family database". Nucleic Acids Res 31 (1): 439-41. PMID 12520045.
- Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A (2008). "The Pfam protein families database". Nucleic Acids Res 36 (Database issue): D281-8. PMID 18039703.