Lexerless parsing
From Wikipedia, the free encyclopedia
Lexerless parsing (also called "scannerless parsing") refers to the use of a single formalism to express both the lexical and context-free syntax used to parse a language.
Advantages
- Only a single metasyntax is required
- Non-regular lexical structure is handled easily
- "Token classification" is not required (eliminating the need for "the lexer hack")
- A clear lexer-parser distinction is not required; languages without one are handled easily (TeX, most wiki grammars, Makefiles)
- Grammars are compositional (can be merged without human intervention)
- No "keywordification": keywords (such as "while" in C) can be used as identifiers wherever the keyword form would not be allowed
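The "single metasyntax" point can be sketched with a hand-written character-level parser: lexical rules (numbers, identifiers, layout) and context-free rules (expressions) are expressed in one grammar, with no token stream in between. The toy grammar and the `Parser` class below are illustrative assumptions, not taken from any particular implementation.

```python
# A minimal scannerless recursive-descent parser. Lexical syntax
# (number, ident, layout) and context-free syntax (expr, term) are
# productions of one hypothetical grammar, parsed straight from characters:
#   expr   ::= term ("+" term)*
#   term   ::= number | ident
#   number ::= [0-9]+      (lexical rule, same formalism)
#   ident  ::= [a-z]+      (lexical rule, same formalism)

class Parser:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def skip_ws(self):                 # layout is just another production
        while self.peek() == " ":
            self.pos += 1

    def expr(self):
        node = self.term()
        self.skip_ws()
        while self.peek() == "+":
            self.pos += 1
            node = ("+", node, self.term())
            self.skip_ws()
        return node

    def term(self):
        self.skip_ws()
        start = self.pos
        if self.peek().isdigit():      # number ::= [0-9]+
            while self.peek().isdigit():
                self.pos += 1
            return ("num", int(self.text[start:self.pos]))
        while self.peek().isalpha():   # ident ::= [a-z]+
            self.pos += 1
        return ("id", self.text[start:self.pos])

print(Parser("x + 42 + y").expr())
# → ('+', ('+', ('id', 'x'), ('num', 42)), ('id', 'y'))
```

Because there is no fixed token stream, nothing forces "while" to be classified as a keyword globally; whether a character run is a keyword or an identifier is decided by whichever production is being parsed at that point.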
Required Extensions
Unfortunately, when parsed at the character level, most popular programming languages are no longer strictly context-free. Visser identified five key extensions to classical context-free syntax which handle almost all common non-context-free constructs arising in practice:
- Follow restrictions, a limited form of "longest match"
- Reject productions, a limited form of negative matching (as found in boolean grammars)
- Preference attributes to handle the "dangling else" construct in C-like languages
- Per-production transitions rather than per-nonterminal transitions, in order to facilitate:
  - Associativity attributes, which prevent a self-reference in a particular production of a nonterminal from deriving that same production
  - Precedence/priority rules, which prevent self-references in higher-precedence productions from deriving lower-precedence productions
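Two of these extensions can be illustrated in a hand-written character-level matcher (the function and the keyword set below are hypothetical, for illustration only). The scanning loop enforces a follow restriction (an identifier may not be immediately followed by a letter, which yields longest match), and the final check plays the role of a reject production, as in SDF's `Identifier ::= "while" {reject}`.

```python
# Illustrative sketch of follow restrictions and reject productions
# at the character level; names and the keyword set are assumptions.

KEYWORDS = {"while", "if"}          # reject set for the Identifier nonterminal

def match_identifier(text, pos):
    """Try to match Identifier at text[pos]; return (lexeme, new_pos) or None."""
    start = pos
    while pos < len(text) and text[pos].isalpha():
        pos += 1
    if pos == start:
        return None                 # no letters at all: no match
    # Follow restriction (longest match): because the loop consumed every
    # following letter, an Identifier can never be followed by a letter,
    # so "ifx" matches as one identifier rather than "if" + "x".
    lexeme = text[start:pos]
    if lexeme in KEYWORDS:          # reject production: keyword is not an Identifier
        return None
    return lexeme, pos

assert match_identifier("ifx = 1", 0) == ("ifx", 3)   # longest match wins
assert match_identifier("if x", 0) is None            # rejected keyword
```

In a lexer-based parser both behaviours are hard-wired into the tokenizer; here they are per-production annotations, so another production in the same grammar can still match "while" as an identifier where no keyword interpretation exists.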
Implementations
- dparser generates ANSI C code for lexerless GLR parsers
- the SGLR component of ASF+SDF generates and interprets parse tables for lexerless grammars
Related Material
- ↑ Grammar composition is possible because lexerless parsers handle the entire class of context-free grammars, which is closed under composition.