Token (parser)
In computing, a token is a categorized block of text. The block of text itself, an indivisible sequence of characters, is known as a lexeme. A lexical analyzer initially reads in lexemes and categorizes them according to function, giving them meaning; this assignment of meaning is known as tokenization. A token can look like anything: an English word, an operator symbol, or apparent gibberish; it just needs to be a useful part of the structured text.
Consider the expression sum = 3 + 2; and the following table of its lexemes and token types:
lexeme | token type
sum    | IDENT
=      | ASSIGN_OP
3      | NUMBER
+      | ADD_OP
2      | NUMBER
;      | SEMICOLON
Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer generator such as lex. The lexical analyzer reads in a stream of lexemes and categorizes them into tokens; this is called "tokenizing". If the lexer finds input that cannot form a valid token, it reports an error.
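A minimal sketch of such a regex-based tokenizer, written here in Python for illustration (lex itself generates C; the token type names mirror the table above, while the SKIP category for whitespace is an added assumption):

import re

# Illustrative token specification: (token type, regular expression) pairs.
# The type names follow the table above; SKIP (whitespace) is an addition.
TOKEN_SPEC = [
    ("NUMBER",    r"\d+"),
    ("IDENT",     r"[A-Za-z_]\w*"),
    ("ASSIGN_OP", r"="),
    ("ADD_OP",    r"\+"),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Read lexemes from text and categorize each one, yielding (lexeme, token type)."""
    pos = 0
    while pos < len(text):
        match = MASTER_RE.match(text, pos)
        if match is None:
            # Input that cannot form a valid token: report an error.
            raise SyntaxError(f"invalid lexeme at position {pos}: {text[pos]!r}")
        if match.lastgroup != "SKIP":          # whitespace is recognized, then discarded
            yield (match.group(), match.lastgroup)
        pos = match.end()

print(list(tokenize("sum = 3 + 2;")))
# [('sum', 'IDENT'), ('=', 'ASSIGN_OP'), ('3', 'NUMBER'),
#  ('+', 'ADD_OP'), ('2', 'NUMBER'), (';', 'SEMICOLON')]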
Tokenizing is followed by parsing. From there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.
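As a toy illustration of that step (a hand-written parse, not any particular parser), the token stream produced above might be loaded into a simple nested-tuple data structure:

tokens = [("sum", "IDENT"), ("=", "ASSIGN_OP"), ("3", "NUMBER"),
          ("+", "ADD_OP"), ("2", "NUMBER"), (";", "SEMICOLON")]

def parse_assignment(toks):
    # Assumes the fixed statement shape IDENT ASSIGN_OP NUMBER ADD_OP NUMBER SEMICOLON.
    (name, _), _assign, (left, _), _plus, (right, _), _semi = toks
    return ("assign", name, ("add", int(left), int(right)))

print(parse_assignment(tokens))   # ('assign', 'sum', ('add', 3, 2))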
Consider a text describing a calculation: "46 - number_of(cows);". The lexemes here might be "46", "-", "number_of", "(", "cows", ")", and ";". The lexical analyzer will denote the lexeme "46" as a number, "-" as a character, and "number_of" as a separate token. Even the lexeme ";" in some languages (such as C) has some special meaning.
Whitespace lexemes are sometimes ignored later by the syntax analyzer. A token does not need to be valid in order to be recognized as a token: "cows" may be nonsense to the language, and "number_of" may be nonsense as well, but they are tokens nonetheless in this example.
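A self-contained Python sketch of this example (the MINUS, LPAREN, and RPAREN type names are illustrative assumptions, not taken from the article) shows whitespace being discarded while "number_of" and "cows" are still categorized as ordinary identifier tokens:

import re

# Illustrative specification for the calculation example; MINUS, LPAREN and
# RPAREN are assumed names, and SKIP marks whitespace to be discarded.
SPEC = [
    ("NUMBER",    r"\d+"),
    ("IDENT",     r"[A-Za-z_]\w*"),
    ("MINUS",     r"-"),
    ("LPAREN",    r"\("),
    ("RPAREN",    r"\)"),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in SPEC))

def tokenize(text):
    pos = 0
    while pos < len(text):
        match = PATTERN.match(text, pos)
        if match is None:
            raise SyntaxError(f"invalid lexeme at position {pos}: {text[pos]!r}")
        if match.lastgroup != "SKIP":          # whitespace lexemes are ignored here
            yield (match.group(), match.lastgroup)
        pos = match.end()

print(list(tokenize("46 - number_of(cows);")))
# [('46', 'NUMBER'), ('-', 'MINUS'), ('number_of', 'IDENT'), ('(', 'LPAREN'),
#  ('cows', 'IDENT'), (')', 'RPAREN'), (';', 'SEMICOLON')]
# "cows" and "number_of" are tokens even though the language may attach no meaning to them.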
See also
- Token decoration
- Token
References
- Sebesta, R. W. (2006). Concepts of Programming Languages (7th ed.). Boston: Pearson/Addison-Wesley. p. 177.