ProbCons

From Wikipedia, the free encyclopedia

ProbCons is an open source probabilistic consistency-based multiple alignment of amino acid sequences. It is an efficient protein multiple sequence alignment program, which has demonstrated a statistically significant improvement in accuracy compared to several leading alignment tools.[1]

Algorithm

The following describes the basic outline of the ProbCons algorithm. [2]

Step 1: Reliability of an alignment edge

For every pair of sequences compute the probability that letters x_{i} and y_{i} are paired in a^{*} an alignment that is generated by the model.

{\begin{aligned}P(x_{i}\sim y_{i}|x,y)&{\stackrel  {def}{=}}Pr[x_{i}\sim y_{i}{\text{ in some a }}|x,y]\\&=\sum _{{{\text{alignment a with }}x_{i}-y_{i}}}Pr[a|x,y]\\&=\sum _{{{\text{alignment a}}}}{\mathbf  {1}}\{x_{i}-y_{i}\in a\}Pr[a|x,y]\end{aligned}}

(Where {\mathbf  {1}}\{x_{i}\sim y_{i}\in a\} is equal to 1 if x_{i} and y_{i} are in the alignment and 0 otherwise.)

Step 2: Maximum expected accuracy

The accuracy of an alignment a^{*} with respect to another alignment a is defined as the number of common aligned pairs divided by the length of the shorter sequence.

Calculate expected accuracy of each sequence:

{\begin{aligned}E_{{Pr[a|x,y]}}(acc(a^{*},a))&=\sum _{{a}}Pr[a|x,y]acc(a^{*},a)\\&={\frac  {1}{min(|x|,|y|)}}\cdot \sum _{{a}}{\mathbf  {1}}\{x_{i}\sim y_{i}\in a\}Pr[a|x,y]\\&={\frac  {1}{min(|x|,|y|)}}\cdot \sum _{{x_{i}-y_{i}}}P(x_{i}\sim y_{j}|x,y)\end{aligned}}

This yields a maximum expected accuracy (MEA) alignment:

E(x,y)=\arg \max _{{a^{*}}}\;E_{{Pr[a|x,y]}}(acc(a^{*},a))

Step 3: Probabilistic Consistency Transformation

All pairs of sequences x,y from the set of all sequences {\mathcal  {S}} are now re-estimated using all intermediate sequences z:

P'(x_{i}-y_{i}|x,y)={\frac  {1}{|{\mathcal  {S}}|}}\sum _{{z}}\sum _{{1\leq k\leq |z|}}P(x_{i}\sim z_{i}|x,z)\cdot P(z_{i}\sim y_{i}|z,y)

This step can be iterated.

Step 4: Computation of guide tree

Construct a guide tree by hierarchical clustering using MEA score as sequence similarity score. Cluster similarity is defined using weighted average over pairwise sequence similarity.

Step 5: Compute MSA

Finally compute the MSA using progressive alignment or iterative alignment.

See also

References

  1. Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S (2005). "PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment". Genome Research 15 (2): 330–340. doi:10.1101/gr.2821705. PMC 546535. PMID 15687296. 
  2. Lecture "Bioinformatics II" at University of Freiburg

External links

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.