Tokenization

From Wikipedia, the free encyclopedia

In computer science, tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

Take, for example, the following string. Unlike a human reader, a computer cannot intuitively 'see' that it contains 9 words; to a computer it is only a sequence of 43 characters.

The quick brown fox jumps over the lazy dog

A process of tokenization could be used to split the sentence into word tokens. Although the following example is given as XML, there are many ways to store tokenized input:

<sentence>
  <word>The</word>
  <word>quick</word>
  <word>brown</word>
  <word>fox</word>
  <word>jumps</word>
  <word>over</word>
  <word>the</word>
  <word>lazy</word>
  <word>dog</word>
</sentence>
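The same word-splitting can be sketched in a few lines of Python, assuming whitespace-delimited words (real tokenizers must also handle punctuation, contractions, and other edge cases):

```python
sentence = "The quick brown fox jumps over the lazy dog"

# Splitting on whitespace yields the nine word tokens.
tokens = sentence.split()

print(len(sentence))  # 43 characters
print(len(tokens))    # 9 tokens
print(tokens)
```

Each token could then be wrapped in `<word>` elements, stored in a list, or passed directly to the next stage of processing.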

Alternate meanings

Compression

The term is also used when, during the parsing of source code in some programming languages, symbols are converted into a shorter representation that uses less memory. For example, most BASIC interpreters saved memory by replacing a keyword such as print with a single number that occupies far less room than the spelled-out text. In fact, most lossless compression systems use a form of tokenization, although it is typically not referred to as such.
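The keyword-replacement scheme can be illustrated with a small sketch. The token values below are hypothetical; real BASIC interpreters each used their own token tables, and handled strings and numbers with more care than this:

```python
# Hypothetical one-byte token codes for a few BASIC keywords.
# Actual values differed between interpreters.
KEYWORDS = {"PRINT": 0x99, "GOTO": 0x89, "IF": 0x8B, "THEN": 0xC4}

def tokenize_line(line):
    """Replace each known keyword with its one-byte token code."""
    out = bytearray()
    for word in line.split():
        if word.upper() in KEYWORDS:
            out.append(KEYWORDS[word.upper()])
        else:
            out.extend(word.encode("ascii"))
        out.append(ord(" "))
    return bytes(out[:-1])  # drop the trailing space

source = 'PRINT "HELLO"'
encoded = tokenize_line(source)
print(len(source), len(encoded))  # the tokenized form is shorter
```

Here the five-byte keyword PRINT shrinks to a single byte, which is the memory saving the paragraph above describes.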

Human perception

In human cognition, tokenization often refers to the process of converting a sensory stimulus into a cognitive "token" suitable for internal processing. A stimulus that is not correctly tokenized may go unprocessed, or may be incorrectly merged with other stimuli.
