Scannerless parsing

From Wikipedia, the free encyclopedia

Scannerless parsing (also called "lexerless parsing") refers to the use of a single formalism to express both the lexical and context-free syntax used to parse a language.

This parsing strategy is suitable when a clear lexer-parser distinction is not required. examples of when this is appropriate include TeX, most wiki grammars, makefiles, and simple per application control languages.

Contents

[edit] Advantages

  • Only a single metasyntax is required
  • Non-regular lexical structure is handled easily
  • "Token classification" is not required which eliminates the need for design accommodations such as "the lexer hack" and language keywords (such as "while" in C)
  • Grammars are compositional (can be merged without human intervention) [1]

[edit] Disadvantages

  • since the lexical scanning and syntactic parsing processing is combined, the resulting parser tends to be harder to understand and debug for more complex languages

[edit] Required Extensions

Unfortunately, when parsed at the character level, most popular programming languages are no longer strictly context-free. Visser identified five key extensions to classical context-free syntax which handle almost all common non-context-free constructs arising in practice:

  • Follow restrictions, a limited form of "longest match"
  • Reject productions, a limited form of negative matching (as found in boolean grammars)
  • Preference attributes to handle the dangling else construct in C-like languages
  • Per-production transitions rather than per-nonterminal transitions in order to facilitate:
    • Associativity attributes, which prevent a self-reference in a particular production of a nonterminal from producing that same production
    • Precedence/priority rules, which prevent self-references in higher-precedence productions from producing lower-precedence productions

[edit] Implementations

  • dparser generates ANSI C code for scannerless GLR parsers
  • the SGLR component of ASF+SDF generates and interprets parse tables for scannerless grammars
  • SGLR is also a component of the Stratego/XT program transformation system
  • Spirit allows for both scannerless and scanner-based parsing
  • SBP is a scannerless parser for boolean grammars (a superset of context-free grammars), written in Java.

[edit] Related Material

Visser, E. (1997b). Scannerless generalized-LR parsing. Technical Report P9707, Programming Research Group, University of Amsterdam

  1. ^  since lexerless parsers handle the entire class of context-free languages, which is closed under composition.