Sequence mining

From Wikipedia, the free encyclopedia

Sequence mining is concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence. The values are usually presumed to be discrete; time series mining, which typically deals with continuous-valued measurements, is therefore closely related but usually considered a different activity. Sequence mining is a special case of structured data mining.

There are two different kinds of sequence mining: string mining and itemset mining. String mining is widely used in biology to examine gene and protein sequences, and is primarily concerned with sequences that have a single member at each position. A variety of prominent algorithms exist for aligning a query sequence with those stored in databases. The alignment may match a query against one subject at a time (e.g. BLAST) or match multiple query sequences with each other (e.g. ClustalW). Itemset mining is used more often in marketing and CRM applications, and is concerned with sequences that have multiple symbols at each position. Itemset mining is also a popular approach to text mining.
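
The difference between the two kinds of sequence can be illustrated with a minimal Python sketch. The sample data and the helper names `contains_pattern` and `as_itemsets` are hypothetical and chosen only for illustration; the containment test shown is one common way to decide whether a pattern occurs in a sequence, not a description of how BLAST or ClustalW work.

```python
# String mining: exactly one symbol at each position (here, a DNA fragment).
dna_sequence = "GATTACA"

# Itemset mining: each position holds a set of symbols, for example the
# items bought together on successive visits by a hypothetical customer.
purchase_sequence = [
    frozenset({"bread", "milk"}),
    frozenset({"beer"}),
    frozenset({"bread", "diapers", "beer"}),
]


def contains_pattern(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` in order.

    Each element of the pattern must be a subset of a distinct element
    of the sequence, and the matched elements must appear in the same
    order. Gaps between matched positions are allowed.
    """
    position = 0
    for wanted in pattern:
        # Greedily advance to the earliest position that contains `wanted`.
        while position < len(sequence) and not wanted <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def as_itemsets(text):
    """View a string as an itemset sequence with one symbol per position."""
    return [frozenset(symbol) for symbol in text]
```

Under these assumptions, `contains_pattern(purchase_sequence, [frozenset({"milk"}), frozenset({"beer"})])` holds because milk is bought before beer, while the reversed pattern does not occur. A string is simply the special case in which every itemset has size one.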

There are several key problems within this field. These include building efficient databases and indexes for sequence information, extracting the frequently occurring patterns, comparing sequences for similarity, and recovering missing sequence members.
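
As an illustration of comparing sequences for similarity, the sketch below computes the edit (Levenshtein) distance between two sequences. The article does not name a particular measure; edit distance is assumed here as one simple, widely used choice, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Return the minimum number of single-symbol insertions, deletions
    and substitutions needed to turn sequence `a` into sequence `b`."""
    # previous[j] is the distance between the processed prefix of `a`
    # and the first j symbols of `b`.
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            substitution_cost = 0 if symbol_a == symbol_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # delete symbol_a
                    current[j - 1] + 1,                   # insert symbol_b
                    previous[j - 1] + substitution_cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]
```

A smaller distance indicates more similar sequences; for example, `edit_distance("kitten", "sitting")` is 3. Biological alignment tools generally use richer scoring schemes than this unit-cost model.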

Two common techniques applied to sequence databases for frequent itemset mining are the influential Apriori algorithm and the more recent FP-Growth technique. However, nothing in these techniques restricts them to sequences per se.
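
The level-wise idea behind Apriori can be sketched as follows. This is a simplified illustration under stated assumptions, not a reference implementation: the function name `apriori`, the parameter `min_support` (an absolute transaction count) and the sample transactions are all hypothetical. Each transaction is treated as an unordered set, which reflects the point that the technique does not depend on sequence order.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset contained in
    at least `min_support` transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for transaction in transactions:
        for item in transaction:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {
            a | b for a, b in combinations(list(frequent), 2) if len(a | b) == k
        }
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c
            for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        # Count the surviving candidates against the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
frequent_itemsets = apriori(baskets, min_support=3)
```

With a minimum support of 3, the four single items and four of the six possible pairs are frequent in this sample, and no three-item set reaches the threshold. FP-Growth aims at the same result without generating candidates explicitly; it is not shown here.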
