Scannerless parsing

In computer science, scannerless parsing (also called lexerless parsing) refers to the use of a single formalism to express both the lexical and context-free syntax used to parse a language.

This parsing strategy is suitable when a clear lexer–parser distinction is unneeded. Examples of when this is appropriate include TeX, most wiki grammars, makefiles, and simple per application control languages.

1 Advantages
2 Disadvantages
3 Required extensions
4 Implementations
5 Notes
6 Further reading

Advantages

Only one metasyntax is needed
Non-regular lexical structure is handled easily
"Token classification" is unneeded which removes the need for design accommodations such as "the lexer hack" and language keywords (such as "while" in C)
Grammars can be compositional (can be merged without human intervention) ^[1]

Disadvantages

Since the lexical scanning and syntactic parsing processing is combined, the resulting parser tends to be harder to understand and debug for more complex languages
Most parsers of character-level grammars are nondeterministic
There is no guarantee that the language being parsed is unambiguous

Required extensions

Unfortunately, when parsed at the character level, most popular programming languages are no longer strictly context-free. Visser identified five key extensions to classical context-free syntax which handle almost all common non-context-free constructs arising in practice:

Follow restrictions, a limited form of "longest match"
Reject productions, a limited form of negative matching (as found in boolean grammars)
Preference attributes to handle the dangling else construct in C-like languages
Per-production transitions rather than per-nonterminal transitions in order to facilitate:
- Associativity attributes, which prevent a self-reference in a particular production of a nonterminal from producing that same production
- Precedence/priority rules, which prevent self-references in higher-precedence productions from producing lower-precedence productions

Implementations

SGLR is a parser for the modular Syntax Definition Formalism SDF, and is part of the ASF+SDF Meta-Environment and the Stratego/XT program transformation system.
JSGLR, a pure Java implementation of SGLR, also based on SDF.
TXL supports character-level parsing.
dparser generates ANSI C code for scannerless GLR parsers.
Spirit allows for both scannerless and scanner-based parsing.
SBP is a scannerless parser for boolean grammars (a superset of context-free grammars), written in Java.
Laja is a two phase scannerless parser generator with support for mapping the grammar rules into objects, written in Java.
Wormhole has an option to generate scannerless GLR in various languages.

Notes

^ This is because parsing at the character level makes the language recognized by the parser a single context-free language defined on characters, as opposed to a context-free language of sequences of strings in regular languages. Some lexerless parsers handle the entire class of context-free languages, which is closed under composition.