Stockholm format

From Wikipedia, the free encyclopedia

Stockholm format is a multiple sequence alignment format used by Pfam and Rfam to disseminate protein and RNA sequence alignments.[1][2][3] The alignment editors RALEE and Belvu support Stockholm format, as do the probabilistic database search tools Infernal and HMMER. A simple example of an Rfam alignment in Stockholm format is shown below:

# STOCKHOLM 1.0

AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//

A minimal well-formed Stockholm file should contain a header stating the format and version identifier, currently '# STOCKHOLM 1.0', followed by the sequences and their corresponding unique sequence names:

<seqname> <aligned sequence>
<seqname> <aligned sequence>
<seqname> <aligned sequence>

'<seqname>' stands for "sequence name", typically in the form "name/start-end" or just "name". Finally, the "//" line indicates the end of the alignment. Sequence letters may include any characters except whitespace. Gaps may be indicated by "." or "-".
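The rules above can be sketched as a small reader. This is a minimal illustration, not a reference implementation: the function name parse_minimal_stockholm is hypothetical, mark-up lines are simply skipped, and a repeated sequence name is treated as an error on the assumption that the file holds one non-interleaved alignment.

```python
def parse_minimal_stockholm(text):
    """Return {seqname: aligned sequence} for one alignment.

    Assumptions of this sketch: a single alignment per input, no
    interleaved blocks, and all '#' lines after the header are ignored.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0].split() != ["#", "STOCKHOLM", "1.0"]:
        raise ValueError("missing '# STOCKHOLM 1.0' header")

    sequences = {}
    for line in lines[1:]:
        if line == "//":          # end of the alignment
            return sequences
        if line.startswith("#"):  # mark-up or comment line, skipped here
            continue
        fields = line.split()
        if len(fields) != 2:      # sequences may not contain whitespace
            raise ValueError(f"expected '<seqname> <aligned sequence>': {line!r}")
        name, aligned = fields
        if name in sequences:
            raise ValueError(f"duplicate sequence name: {name}")
        sequences[name] = aligned
    raise ValueError("missing '//' terminator")
```

Because gaps are ordinary characters ("." or "-"), the reader does not need to treat them specially; every value in the returned dictionary should have the same length.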

The alignment mark-up:

Mark-up lines may include any characters except whitespace. Use underscore ("_") instead of space.

#=GF <feature> <Generic per-File annotation, free text>
#=GC <feature> <Generic per-Column annotation, exactly 1 char per column>
#=GS <seqname> <feature> <Generic per-Sequence annotation, free text>
#=GR <seqname> <feature> <Generic per-Sequence AND per-Column markup, exactly 1 char per column>
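The four line types differ only in whether a sequence name precedes the feature. One possible way to split them is sketched below; the function name parse_markup_line and the returned tuple layout are hypothetical choices for illustration.

```python
# Number of whitespace-separated fields per line type, counting the tag itself.
FIELD_COUNTS = {"#=GF": 3, "#=GC": 3, "#=GS": 4, "#=GR": 4}


def parse_markup_line(line):
    """Split a mark-up line into (kind, seqname, feature, value).

    seqname is None for per-file (GF) and per-column (GC) lines.
    The last field keeps its internal spaces, so free text survives.
    """
    stripped = line.strip()
    tag = stripped.split(None, 1)[0] if stripped else ""
    if tag not in FIELD_COUNTS:
        raise ValueError(f"not a mark-up line: {line!r}")

    count = FIELD_COUNTS[tag]
    fields = stripped.split(None, count - 1)
    if len(fields) != count:
        raise ValueError(f"too few fields for {tag}: {line!r}")

    if count == 3:
        _, feature, value = fields
        seqname = None
    else:
        _, seqname, feature, value = fields
    return tag[2:], seqname, feature, value
```

For per-column kinds (GC and GR) a caller would normally also check that the value has exactly one character per alignment column.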

Magic or recommended features:

#=GF

(See Pfam documentation, under "Description of fields")

For embedding trees:

#=GF NH <tree in New Hampshire eXtended format>
#=GF TN <Unique identifier for the next tree>

Notes: A tree may be stored on multiple #=GF NH lines.
If multiple trees are stored in the same file, each tree must be preceded by a #=GF TN line with a unique tree identifier. If only one tree is included, the #=GF TN line may be omitted.
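The tree rules can be illustrated with a short collector. It is a sketch under assumptions: the input is a list of (feature, text) pairs already extracted from the #=GF lines in file order, the pieces of a multi-line tree are joined without a separator, and a lone tree without a #=GF TN line is stored under the key None. The name collect_trees is hypothetical.

```python
def collect_trees(gf_lines):
    """Group #=GF NH lines into trees keyed by the preceding #=GF TN id."""
    trees = {}
    current_id = None  # stays None when the file has no TN line
    for feature, text in gf_lines:
        if feature == "TN":
            if text in trees:
                raise ValueError(f"duplicate tree identifier: {text}")
            current_id = text
            trees[current_id] = ""
        elif feature == "NH":
            # A tree may span several NH lines; append each piece in order.
            trees[current_id] = trees.get(current_id, "") + text
    return trees
```

Other #=GF features are ignored by this function, so the full list of per-file annotations can be passed in unchanged.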

#=GC

The same features as for #=GR with "_cons" appended, meaning "consensus". Example: "SS_cons".

#=GS

Rfam and Pfam use these features:

      Feature                    Description
      ---------------------      -----------
      AC <accession>             ACcession number
      DE <freetext>              DEscription
      DR <db>; <accession>;      Database Reference
      OS <organism>              OrganiSm (species)
      OC <clade>                 Organism Classification (clade, etc.)
      LO <look>                  Look (Color, etc.)

#=GR

      Feature   Description            Markup letters
      -------   -----------            --------------
      SS        Secondary Structure    For RNA [.,;<>(){}[]AaBb...], 
                                       For protein [HGIEBTSCX]
      SA        Surface Accessibility  [0-9X] 
                    (0=0%-10%; ...; 9=90%-100%)
      TM        TransMembrane          [Mio]
      PP        Posterior Probability  [0-9*] 
                    (0=0.00-0.05; 1=0.05-0.15; *=0.95-1.00)
      LI        LIgand binding         [*]
      AS        Active Site            [*]
      IN        INtron (in or after)   [0-2]

Note: Do not use multiple lines with the same #=GR label. Only one unique feature assignment can be made for each sequence.

"X" in SA and SS means "residue with unknown structure".

In SS the protein letters are taken from DSSP: H=alpha-helix, G=3/10-helix, I=pi-helix, E=extended strand, B=residue in isolated beta-bridge, T=turn, S=bend, C=coil/loop.
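The mark-up alphabets in the table lend themselves to a simple check. The sketch below is illustrative only: the names GR_ALPHABETS, check_gr_markup and sa_percent_range are hypothetical, SS is left out because its alphabet depends on whether the alignment is RNA or protein, and accepting "." and "-" in gap columns is an assumption that the table does not state.

```python
# Allowed mark-up letters per #=GR feature, copied from the table above.
GR_ALPHABETS = {
    "SA": set("0123456789X"),
    "TM": set("Mio"),
    "PP": set("0123456789*"),
    "LI": set("*"),
    "AS": set("*"),
    "IN": set("012"),
}


def check_gr_markup(feature, markup, aligned_seq, gap_chars=".-"):
    """Return a list of problems found in one #=GR mark-up string."""
    problems = []
    if len(markup) != len(aligned_seq):  # exactly 1 char per column
        problems.append(
            f"length {len(markup)} != alignment width {len(aligned_seq)}")
    allowed = GR_ALPHABETS.get(feature)
    if allowed is not None:  # unknown features are not checked
        bad = sorted(set(markup) - allowed - set(gap_chars))
        if bad:
            problems.append("unexpected characters: " + "".join(bad))
    return problems


def sa_percent_range(symbol):
    """Map an SA symbol to its (low, high) accessibility range in percent."""
    if symbol == "X":  # residue with unknown structure
        return None
    if len(symbol) != 1 or symbol not in "0123456789":
        raise ValueError(f"not an SA symbol: {symbol!r}")
    digit = int(symbol)
    return (10 * digit, 10 * digit + 10)
```

An empty list from check_gr_markup means the mark-up passed both checks; the caller decides whether a problem is fatal.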

Recommended placements:

#=GF Above the alignment
#=GC Below the alignment
#=GS Above the alignment or just below the corresponding sequence
#=GR Just below the corresponding sequence
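A writer that follows these placements might look like the sketch below. The function name write_stockholm and its argument layout are hypothetical, #=GS lines are placed above the alignment (one of the two recommended positions), and padding every label to a common width is a readability choice, not a requirement stated here.

```python
def write_stockholm(sequences, gf=(), gs=(), gr=(), gc=()):
    """Serialise one alignment using the recommended line placements.

    sequences: list of (seqname, aligned sequence)
    gf: list of (feature, text)            gs: list of (seqname, feature, text)
    gr: list of (seqname, feature, markup) gc: list of (feature, markup)
    """
    gr_labels = [f"#=GR {name} {feature}" for name, feature, _ in gr]
    gc_labels = [f"#=GC {feature}" for feature, _ in gc]
    names = [name for name, _ in sequences]
    # Pad labels so that sequences and per-column mark-up line up.
    width = max((len(label) for label in names + gr_labels + gc_labels),
                default=0) + 1

    out = ["# STOCKHOLM 1.0"]
    out += [f"#=GF {feature} {text}" for feature, text in gf]            # above
    out += [f"#=GS {name} {feature} {text}" for name, feature, text in gs]
    for name, aligned in sequences:
        out.append(name.ljust(width) + aligned)
        for gr_name, feature, markup in gr:                              # just below
            if gr_name == name:
                out.append(f"#=GR {gr_name} {feature}".ljust(width) + markup)
    for feature, markup in gc:                                           # below
        out.append(f"#=GC {feature}".ljust(width) + markup)
    out.append("//")
    return "\n".join(out) + "\n"
```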

Size limits:

No size limits on any field.

However, a simple parser that uses fixed field sizes should work safely on Pfam alignments with these limits:

- Line length: 10000.
- <seqname>: 255.
- <feature>: 255.
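A parser with fixed field sizes could check its input against these limits before reading it. The sketch below treats each figure as a maximum number of characters, which is an interpretation; the constant and function names are hypothetical.

```python
MAX_LINE = 10000
MAX_SEQNAME = 255
MAX_FEATURE = 255


def find_limit_violations(text):
    """Return (line number, field) pairs that exceed the fixed-size limits."""
    violations = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > MAX_LINE:
            violations.append((number, "line length"))
        fields = line.split()
        if not fields:
            continue

        seqname = feature = None
        tag = fields[0]
        if tag in ("#=GF", "#=GC"):
            feature = fields[1] if len(fields) > 1 else None
        elif tag in ("#=GS", "#=GR"):
            seqname = fields[1] if len(fields) > 1 else None
            feature = fields[2] if len(fields) > 2 else None
        elif not tag.startswith("#") and tag != "//":
            seqname = tag  # ordinary '<seqname> <aligned sequence>' line

        if seqname is not None and len(seqname) > MAX_SEQNAME:
            violations.append((number, "seqname"))
        if feature is not None and len(feature) > MAX_FEATURE:
            violations.append((number, "feature"))
    return violations
```

An empty result means a fixed-size reader with these buffers should be able to load the file; it says nothing about whether the file is otherwise well formed.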

References

1. Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A (2005). "Rfam: annotating non-coding RNAs in complete genomes". Nucleic Acids Res 33 (Database issue): D121-4. PMID 15608160.
2. Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR (2003). "Rfam: an RNA family database". Nucleic Acids Res 31 (1): 439-41. PMID 12520045.
3. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A (2008). "The Pfam protein families database". Nucleic Acids Res 36 (Database issue): D281-8. PMID 18039703.