Tokenization
From Wikipedia, the free encyclopedia
In computer science, tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
Take, for example, the following string. Unlike a human reader, a computer cannot intuitively 'see' that it contains 9 words; to a computer, it is only a series of 43 characters.
The quick brown fox jumps over the lazy dog
A process of tokenization could be used to split the sentence into word tokens. Although the following example is given as XML there are many ways to store tokenized input:
<sentence>
  <word>The</word>
  <word>quick</word>
  <word>brown</word>
  <word>fox</word>
  <word>jumps</word>
  <word>over</word>
  <word>the</word>
  <word>lazy</word>
  <word>dog</word>
</sentence>
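A minimal sketch of this kind of word tokenization, assuming whitespace is the only token boundary (real tokenizers must also handle punctuation, contractions, and Unicode). The function names `tokenize` and `to_xml` are illustrative, not from any standard library:

```python
def tokenize(sentence):
    """Split a sentence into word tokens on whitespace."""
    return sentence.split()

def to_xml(words):
    """Wrap word tokens in the XML layout shown above."""
    inner = " ".join(f"<word>{w}</word>" for w in words)
    return f"<sentence> {inner} </sentence>"

words = tokenize("The quick brown fox jumps over the lazy dog")
print(len(words))     # 9 word tokens
print(to_xml(words))
```

Splitting on whitespace is only one possible classification rule; the same input could equally be tokenized into characters, sentences, or punctuation-aware word units.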
Alternate meanings

Compression
The term is also used when, during the parsing of the source code of some programming languages, symbols are converted into a shorter representation that uses less memory. Most BASIC interpreters used this technique to save space: a keyword such as PRINT would be replaced by a single number that occupies far less memory than the spelled-out command. In fact, most lossless compression systems use a form of tokenization, although it is typically not referred to as such.
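The keyword-to-number mapping can be sketched as follows. The token values and the `tokenize_line` helper here are hypothetical, chosen only for illustration; each real BASIC interpreter used its own token table:

```python
# Hypothetical keyword-to-byte-code table (values are illustrative,
# not taken from any particular BASIC interpreter).
KEYWORD_CODES = {"PRINT": 0x99, "GOTO": 0x89, "FOR": 0x81}

def tokenize_line(line):
    """Replace each keyword with its one-byte code; keep everything else as ASCII."""
    out = bytearray()
    for i, part in enumerate(line.split(" ")):
        if i:
            out.append(ord(" "))
        code = KEYWORD_CODES.get(part.upper())
        if code is not None:
            out.append(code)        # keyword stored as a single byte
        else:
            out.extend(part.encode("ascii"))
    return bytes(out)

raw = 'PRINT "HELLO"'
packed = tokenize_line(raw)
print(len(raw), len(packed))   # 13 9 — 'PRINT' shrinks from 5 bytes to 1
```

The saving compounds across a whole program, which mattered on machines with only a few kilobytes of RAM.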
Human perception
In human cognition, tokenization often refers to the process of converting a sensory stimulus into a cognitive "token" suitable for internal processing. A stimulus that is not correctly tokenized may not be processed at all, or may be incorrectly merged with other stimuli.
See also
- token (disambiguation)