Scannerless parsing
In computer science, scannerless parsing (also called lexerless parsing) refers to the use of a single formalism to express both the lexical (word level syntax) and phrase level syntax used to parse a language: the lexical step and parsing step are combined.
This parsing strategy is suitable when a clear lexer–parser distinction is unneeded or unwanted. Examples of when this is appropriate include TeX, most wiki grammars, makefiles, simple per application control languages, and Perl 6.
Advantages
- Only one metasyntax is needed
- Non-regular lexical structure is handled easily
- "Token classification" is unneeded which removes the need for design accommodations such as "the lexer hack" and language keywords (such as "while" in C)
- Grammars can be compositional (can be merged without human intervention)
Disadvantages
- Since the lexical scanning and syntactic parsing processing is combined, the resulting parser tends to be harder to understand and debug for more complex languages
- Most parsers of character-level grammars are nondeterministic
- There is no guarantee that the language being parsed is unambiguous
Required extensions
Unfortunately, when parsed at the character level, most popular programming languages are no longer strictly context-free. Visser identified five key extensions to classical context-free syntax which handle almost all common non-context-free constructs arising in practice:
- Follow restrictions, a limited form of "longest match"
- Reject productions, a limited form of negative matching (as found in boolean grammars)
- Preference attributes to handle the dangling else construct in C-like languages
- Per-production transitions rather than per-nonterminal transitions in order to facilitate:
- Associativity attributes, which prevent a self-reference in a particular production of a nonterminal from producing that same production
- Precedence/priority rules, which prevent self-references in higher-precedence productions from producing lower-precedence productions
Implementations
- SGLR is a parser for the modular Syntax Definition Formalism SDF, and is part of the ASF+SDF Meta-Environment and the Stratego/XT program transformation system.
- JSGLR, a pure Java implementation of SGLR, also based on SDF.
- TXL supports character-level parsing.
- dparser generates ANSI C code for scannerless GLR parsers.
- Spirit allows for both scannerless and scanner-based parsing.
- SBP is a scannerless parser for boolean grammars (a superset of context-free grammars), written in Java.
- Laja is a two phase scannerless parser generator with support for mapping the grammar rules into objects, written in Java.
- Wormhole has an option to generate scannerless GLR in various languages.
- The Perl 6 Grammars feature of the general purpose programming language Perl 6.
- PyParsing is a scannerless parser written in pure Python.
Notes
- ^ This is because parsing at the character level makes the language recognized by the parser a single context-free language defined on characters, as opposed to a context-free language of sequences of strings in regular languages. Some lexerless parsers handle the entire class of context-free languages, which is closed under composition.
Further reading
- Visser, E. (August 1997). Scannerless Generalized-LR Parsing (in English). The Netherlands: University of Amsterdam. Retrieved 22 November 2012.