Token (parser)

From Wikipedia, the free encyclopedia

In computing, a token is a categorized block of text; the underlying block of text itself, usually an indivisible sequence of characters, is known as a lexeme. A lexical analyser reads in lexemes and categorizes them according to their function, giving them meaning. This assignment of meaning is known as tokenization. A token can look like anything, an English word, an operator symbol, a punctuation mark; it just needs to be a useful part of the structured text.

Consider the tokenization of the expression sum = 3 + 2;:

  lexeme   token type
  ------   ----------
  sum      IDENT
  =        ASSIGN_OP
  3        NUMBER
  +        ADD_OP
  2        NUMBER
  ;        SEMICOLON

Tokens are frequently defined by regular expressions, which are understood by a lexical-analyser generator such as lex. The lexical analyser reads in a stream of characters, identifies the lexemes in the stream, and categorises them into tokens. This is called "tokenizing." If the lexer finds an invalid token, it reports an error.
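
As a minimal sketch of this regular-expression approach (plain Python rather than lex; the tokenize helper and the TOKEN_SPEC and MASTER names are hypothetical, and the pattern names simply mirror the table above):

import re

# One regular expression per token type, tried in the order listed.
TOKEN_SPEC = [
    ("NUMBER",    r"\d+"),
    ("IDENT",     r"[A-Za-z_]\w*"),
    ("ASSIGN_OP", r"="),
    ("ADD_OP",    r"\+"),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),       # whitespace is matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(text):
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"invalid token at position {pos}")   # lexer error
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())
        pos = m.end()

print(list(tokenize("sum = 3 + 2;")))
# [('IDENT', 'sum'), ('ASSIGN_OP', '='), ('NUMBER', '3'),
#  ('ADD_OP', '+'), ('NUMBER', '2'), ('SEMICOLON', ';')]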

Tokenizing is followed by parsing. From there, the parsed data may be loaded into data structures for general use, interpretation, or compiling.
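
Continuing the sketch, a toy parser might fold a stream of NUMBER and ADD_OP tokens into a nested-tuple structure standing in for a syntax tree (parse_expression is a hypothetical name, and the code assumes the tokenize function above):

def parse_expression(tokens):
    # expression := NUMBER (ADD_OP NUMBER)*
    tokens = list(tokens)
    kind, value = tokens.pop(0)
    assert kind == "NUMBER"
    node = int(value)
    while tokens and tokens[0][0] == "ADD_OP":
        op = tokens.pop(0)[1]
        kind, value = tokens.pop(0)
        assert kind == "NUMBER"
        node = (op, node, int(value))    # build a left-nested tree
    return node

print(parse_expression(tokenize("3 + 2")))   # ('+', 3, 2)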

Consider a text describing a calculation: "46 - number_of(cows);". The lexemes here are "46", "-", "number_of", "(", "cows", ")", and ";". The lexical analyser will categorize the lexeme "46" as a number, "-" as an operator character, and "number_of" as an identifier, a token separate from "cows". Even the lexeme ";" carries special meaning in some languages (such as C).

The whitespace lexemes are sometimes ignored later, by the syntax analyser. A token does not need to be valid in order to be recognized as a token: "cows" may be nonsense to the language, and "number_of" may be nonsense as well, but they are tokens nonetheless in this example.
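
Reusing the tokenize sketch from above, with hypothetical SUB_OP, LPAREN, and RPAREN patterns spliced into TOKEN_SPEC, the calculation text comes apart as follows; the lexer recognizes "cows" and "number_of" without knowing whether the language gives them any meaning:

TOKEN_SPEC[2:2] = [("SUB_OP", r"-"), ("LPAREN", r"\("), ("RPAREN", r"\)")]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

print(list(tokenize("46 - number_of(cows);")))
# [('NUMBER', '46'), ('SUB_OP', '-'), ('IDENT', 'number_of'), ('LPAREN', '('),
#  ('IDENT', 'cows'), ('RPAREN', ')'), ('SEMICOLON', ';')]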

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

Take, for example, the following string. Unlike humans, a computer cannot intuitively 'see' that there are 9 words. To a computer this is only a series of 43 characters.

The quick brown fox jumps over the lazy dog

A process of tokenization could be used to split the sentence into word tokens. Although the following example is given as XML, there are many ways to store tokenized input:

<sentence>
  <word>The</word>
  <word>quick</word>
  <word>brown</word>
  <word>fox</word>
  <word>jumps</word>
  <word>over</word>
  <word>the</word>
  <word>lazy</word>
  <word>dog</word>
</sentence>
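
For word tokens like these, the demarcation step can be as simple as splitting on whitespace; a one-line sketch in Python:

sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.split()   # demarcate on runs of whitespace
print(len(words), words)
# 9 ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']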

Alternate meanings

Compression

The term is also used when, during the parsing of source code in some programming languages, symbols are converted into a shorter representation that uses less memory. Most BASIC interpreters used this to save room: a command such as print would be replaced by a single number stored in place of the keyword's text. In fact, most lossless compression systems use a form of tokenization, although it is typically not referred to as such.
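
A minimal sketch of this keyword-crunching idea (the byte values here are invented, not those of any particular BASIC):

# Hypothetical one-byte codes for keywords; real BASICs kept similar tables.
KEYWORD_CODES = {"PRINT": 0x91, "GOTO": 0x89, "IF": 0x8B, "THEN": 0xA7}

def crunch(line):
    # Replace each keyword with its one-byte code; store the rest as plain ASCII.
    out = bytearray()
    for word in line.split(" "):
        if word.upper() in KEYWORD_CODES:
            out.append(KEYWORD_CODES[word.upper()])
        else:
            out.extend(word.encode("ascii"))
        out.append(ord(" "))
    return bytes(out[:-1])   # drop the trailing space

print(crunch('PRINT "HELLO"'))   # b'\x91 "HELLO"' -- four bytes shorter than the source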

Human perception

In human cognition, tokenization often refers to the process of converting a sensory stimulus into a cognitive "token" suitable for internal processing. A stimulus that is not correctly tokenized may not be processed, or may be incorrectly merged with other stimuli.
