Tokenization

From Wikipedia, the free encyclopedia

In computer science, tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

Take, for example, the following string. Unlike a human reader, a computer cannot intuitively 'see' that it contains 9 words; to a computer it is only a sequence of 43 characters.

The quick brown fox jumps over the lazy dog

A process of tokenization could be used to split the sentence into word tokens. Although the following example is given as XML, there are many ways to store tokenized input:

<sentence>
  <word>The</word>
  <word>quick</word>
  <word>brown</word>
  <word>fox</word>
  <word>jumps</word>
  <word>over</word>
  <word>the</word>
  <word>lazy</word>
  <word>dog</word>
</sentence>
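The same word-splitting can be sketched in a few lines of Python, assuming whitespace-delimited words (real tokenizers must also handle punctuation, contractions, and other edge cases):

```python
sentence = "The quick brown fox jumps over the lazy dog"

# Splitting on whitespace yields the nine word tokens.
tokens = sentence.split()

print(len(sentence))  # 43 characters
print(len(tokens))    # 9 tokens
print(tokens)
```

Each token could then be wrapped in `<word>` elements, stored in a list, or passed directly to the next stage of processing.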

Alternate meanings

Compression

The term is also used when, during the parsing of source code in some programming languages, symbols are converted into a shorter representation that uses less memory. For example, most BASIC interpreters saved memory by replacing a keyword such as print with a single number that occupies far less room than the spelled-out text. In fact, most lossless compression systems use a form of tokenization, although it is typically not referred to as such.
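The keyword-replacement scheme can be illustrated with a small sketch. The token values below are hypothetical; real BASIC interpreters each used their own token tables, and handled strings and numbers with more care than this:

```python
# Hypothetical one-byte token codes for a few BASIC keywords.
# Actual values differed between interpreters.
KEYWORDS = {"PRINT": 0x99, "GOTO": 0x89, "IF": 0x8B, "THEN": 0xC4}

def tokenize_line(line):
    """Replace each known keyword with its one-byte token code."""
    out = bytearray()
    for word in line.split():
        if word.upper() in KEYWORDS:
            out.append(KEYWORDS[word.upper()])
        else:
            out.extend(word.encode("ascii"))
        out.append(ord(" "))
    return bytes(out[:-1])  # drop the trailing space

source = 'PRINT "HELLO"'
encoded = tokenize_line(source)
print(len(source), len(encoded))  # the tokenized form is shorter
```

Here the five-byte keyword PRINT shrinks to a single byte, which is the memory saving the paragraph above describes.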

Human perception

In human cognition, tokenization often refers to the process of converting a sensory stimulus into a cognitive "token" suitable for internal processing. A stimulus that is not correctly tokenized may go unprocessed, or may be incorrectly merged with other stimuli.
