Talk:LR parser
From Wikipedia, the free encyclopedia
It's a bit long now. Perhaps there should be a separate article about LR(0) parsers that deals with the LR(0) algorithm. -- Jan Hidders 10:01 Aug 18, 2002 (PDT)
This article is one of the better explanations of LR parsing I have seen. My only complaint is the dual use of the "*" character as (1) the current position in an LR(0) item, and (2) the multiplication operator. That leads to items like "E * * B" (an actual example from the text), which is confusing. Perhaps we could just use a period? --P3d0 22:35, 23 Aug 2003 (UTC)
(Scratch that; apparently Konqueror renders · as an asterisk. --P3d0 23:16, 23 Aug 2003 (UTC))
Looks like the "note on SLR and LALR" overlaps a bit with the section on conflicts. Perhaps they could be combined. --P3d0 01:38, 26 Oct 2003 (UTC)
It's a really minor thing, but could we maybe change the grammars which use nonterminals "E" and "B" to use some other letters? They often look awfully similar on my screen. Marnanel 05:53, 4 Feb 2004 (UTC)
[edit] Programming Languages not LR
I'm looking for a reference for the claim that Perl and C cannot be parsed by an LR parser. Does somebody know one? -- Jan Hidders 14:12, 21 Jul 2004 (UTC)
- The one thing that I know of is the dangling else problem. Because of this problem, not only can C not be parsed by a (pure) LR parser, its usual grammar is not even unambiguous. I'm sure a Google search for dangling else will produce a large number of references, and if you're lucky, some of them may even have references with regard to other reasons C cannot be parsed by an LR parser (if there are indeed other reasons; I would imagine there are, but I don't know). CyborgTosser (Only half the battle) 07:51, 17 Apr 2005 (UTC)
-
- It's possible to create an unambiguous grammar for languages with an if-else construct, but it requires introducing a new nonterminal symbol: all_instructions_but_if_with_else. However, such a grammar becomes unnatural and less readable, and that's why most parsers use other ambiguity-handling techniques, like associativity declarations.
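A sketch of such a rewrite, in the same `::=` notation used elsewhere on this page (the nonterminal names matched/unmatched are illustrative, chosen here instead of the longer name above; "matched" means every 'if' in it already has its 'else'):

matched ::= 'if' expr 'then' matched 'else' matched
matched ::= other-statement
unmatched ::= 'if' expr 'then' statement
unmatched ::= 'if' expr 'then' matched 'else' unmatched
statement ::= matched
statement ::= unmatched

With this split, an 'else' can only attach to the nearest unmatched 'if', which resolves the dangling-else ambiguity at the cost of the duplicated rules.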
- Typedefs are another issue with C. They make the language context-sensitive. It's discussed a little bit here [1], but I think there's a better reference somewhere. --Werdnagreb 19:06, 5 May 2005 (UTC)
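To illustrate the typedef point, here is a toy sketch (made-up names, not a real C front end) of the usual workaround, the "lexer hack": the parser feeds the symbol table back to the lexer, because the token sequence alone cannot decide between a declaration and an expression.

```python
def classify(tokens, typedef_names):
    """Toy 'lexer hack': the same tokens 'T * x ;' are a pointer
    declaration if T is a typedef name, and a multiplication
    expression otherwise.  A pure context-free LR grammar cannot
    make this distinction without symbol-table feedback."""
    name = tokens[0]
    if name in typedef_names:
        return 'declaration'   # T * x;  ->  declares pointer x
    return 'expression'        # T * x;  ->  multiplies T by x

print(classify(['T', '*', 'x', ';'], {'T'}))   # declaration
print(classify(['T', '*', 'x', ';'], set()))   # expression
```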
- The article now refers to Perl and C++. C++ is probably a better example than C, as I know C parsers get around both of the points noted above with fairly innocuous kludges, whereas C++ is trickier. It might be worth mentioning that early versions of Fortran were even bigger offenders, although the problems go beyond difficulty of parsing. CyborgTosser (Only half the battle) 01:07, 8 May 2005 (UTC)
[edit] Minor improvements to the article
This is, indeed, a great explanation of the LR(0) algorithm. Nevertheless, I'd like to suggest a few changes that may make it better:
a) Change the term "begin state" to "start state", as it seems to be the most correct form to reference the, err, start state of a grammar.
b) The section "Conflicts in the constructed tables" needs some editing, it seems. For example, it begins with the phrase "The automaton that is constructed is by the way it is constructed always a deterministic automaton", which, although correct, sounds confusing (at least to me). A better wording might be: "The steps presented so far always construct a deterministic automaton." or "The automaton constructed by looking for transitions between item sets will always be a deterministic one."
c) Also (still in the section referred to above), we find several references to the grammar not being LL(0). Because we're discussing LR(0), it looks like a typo. It would be better if an explanation were given of how the non-LL(0)-ness of the grammar affects the LR(0) parser.
d) Finally -- and maybe this is just my system -- I guess that using • instead of · to represent the dot would make it more visible.
Great article. --Branco 13:39, 23 October 2005 (UTC)
- Go for it. --P3d0 22:57, 13 July 2007 (UTC)
[edit] Discussion of LR(1) Construction Needed
Although LR(0) item set construction is important because it is the basis for the SLR(1) parser construction algorithm as well as most LALR(1) construction algorithms, it should be pointed out that very few parsers are based on LR(0) itself. Indeed, LALR(1) parsers such as those produced by Yacc and GNU bison seem to be the most commonly used LR(k)-based parsers. Discussing LR(1) item set construction would not only describe the LR(k) methodology when lookahead is a component of an item (as it always is except for the trivial case of k = 0), it would also provide the basis for understanding the LALR(1) construction method and why an LR(1) grammar may not be LALR(1) or why an LALR(1) grammar may not be SLR(1).
It might also be pointed out that although a grammar might be LR(k) for some k > 1, there will always be LR(1), LALR(1), and SLR(1) grammars for the same language. --Raaronson 16:58, 27 December 2005 (UTC)
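To make the "lookahead is a component of an item" point concrete, here is a minimal sketch of LR(1) closure for the small grammar used in the article (S → E, E → E*B | E+B | B, B → 0 | 1). The representation and helper names are my own; the grammar has no ε-rules, which keeps the FIRST computation simple.

```python
# An LR(1) item is (lhs, rhs, dot, lookahead); the lookahead terminal
# is exactly what an LR(0) item lacks.
GRAMMAR = {
    'S': [('E',)],
    'E': [('E', '*', 'B'), ('E', '+', 'B'), ('B',)],
    'B': [('0',), ('1',)],
}
TERMINALS = {'*', '+', '0', '1', '$'}

def compute_first():
    """FIRST sets by fixpoint iteration (valid here: no epsilon rules)."""
    first = {t: {t} for t in TERMINALS}
    first.update({nt: set() for nt in GRAMMAR})
    changed = True
    while changed:
        changed = False
        for nt, prods in GRAMMAR.items():
            for prod in prods:
                new = first[prod[0]] - first[nt]
                if new:
                    first[nt] |= new
                    changed = True
    return first

def closure1(items):
    """LR(1) closure: for an item A -> a . B b with lookahead x, add
    B -> . g with every lookahead in FIRST(b x)."""
    first = compute_first()
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot, la) in list(result):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                beta = rhs[dot + 1:]
                lookaheads = first[beta[0]] if beta else {la}
                for prod in GRAMMAR[rhs[dot]]:
                    for b in lookaheads:
                        item = (rhs[dot], prod, 0, b)
                        if item not in result:
                            result.add(item)
                            changed = True
    return result

items = closure1({('S', ('E',), 0, '$')})
print(len(items))   # 16: the same core items appear under lookaheads $, *, +
```

Merging items that differ only in lookahead is precisely the LALR(1) step, and the merge is where an LR(1) grammar can stop being LALR(1).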
[edit] Discussion of construction of grammars for LR parsers?
It might be worthwhile to have a section about what sort of grammars work best with LR parsers: yes, they can parse all LR grammars, which is a strictly defined set, but some grammars work better than others. E.g.:
list ::= item ',' list
list ::= item '.'
is an LR(1) grammar that would need to shift the entire input onto the stack before reducing it, whereas:
list ::= list-items '.'
list-items ::= item
list-items ::= list-items ',' item
is a grammar for the same language that allows the parser to reduce after each item. I just added a note to left recursion that mentions this, although not in as much detail. The same reference [2] is probably useful here. JulesH 08:06, 4 August 2006 (UTC)
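The stack-depth difference can be seen with a toy hand-coded shift/reduce trace of the two grammars above (this is an illustration I wrote for the point, not a generated LR parser):

```python
def simulate(tokens, left_recursive):
    """Trace the parse stack for the two list grammars and report the
    final stack plus the peak stack depth."""
    stack, peak = [], 0
    for tok in tokens:
        stack.append(tok)
        peak = max(peak, len(stack))
        if left_recursive:
            # a handle appears on top of the stack after every item
            if stack == ['item']:
                stack[:] = ['L']                  # list-items ::= item
            elif stack[-3:] == ['L', ',', 'item']:
                stack[-3:] = ['L']                # list-items ::= list-items ',' item
    if left_recursive:
        if stack[-2:] == ['L', '.']:
            stack[-2:] = ['list']                 # list ::= list-items '.'
    else:
        # right recursion: no reduction is possible until the final '.'
        if stack[-2:] == ['item', '.']:
            stack[-2:] = ['list']                 # list ::= item '.'
        while stack[-3:] == ['item', ',', 'list']:
            stack[-3:] = ['list']                 # list ::= item ',' list
    return stack, peak

tokens = ['item', ','] * 3 + ['item', '.']
print(simulate(tokens, left_recursive=False))  # (['list'], 8)
print(simulate(tokens, left_recursive=True))   # (['list'], 3)
```

With right recursion the peak depth grows linearly with the input; with left recursion it stays constant, which is why left recursion is preferred for LR parsers.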
[edit] removed sentence
I removed from section 1.2:
In practice the parser will usually try to recover from a syntax error by ignoring unexpected symbols and/or inserting missing symbols, and continue parsing, but this is outside the scope of this article.
This paragraph reads as if it were taken from a tutorial. Well, I imagine it's only because of the "outside of scope" thing. Whatever. --euyyn 22:14, 4 September 2006 (UTC)
- It's also not outside the scope of this article. If someone wants to describe common techniques for error recovery, that's perfectly on-topic. --P3d0 22:55, 13 July 2007 (UTC)
[edit] Web resource for LR(k) parser construction
These are what I found when searching the Internet for my project on implementing an LR(1) yacc:
- Practical LR(k) Parser Construction [3]. By David R. Tribble, 12/12/2004. This is a long and complete introduction, although many places are labeled as "+INCOMPLETE".
- The Honalee LR(k) Algorithm [4]. By David R. Tribble, 4/27/2006.
BTW, in "E → E + B", maybe B can be changed to T so it looks different from E. "Item" is called "Configuration" in some early papers in the 1970s.
-- oxc 10:04, 7 December 2006 (UTC)
[edit] Closure of item sets
I am not convinced that the example is entirely consistent with the given definition of closure.
In particular, in the construction of the item sets the rule "S → E •" is added to Item set 3, and two others are added to complete the closure. However, this rule has an empty follow set, so the closure as defined (as I understand it) will not add a rule such as "E → E • * B" to the item set.
It is entirely possible that I have misunderstood some part of this -- if so perhaps this is an indicator that the explanation needs clearing up a little (or that I should pay attention in lectures :p ), but a comment either way by someone more knowledgeable than I would be appreciated. JoeKearney 19:52, 6 April 2007 (UTC)
- With a bit more thought, I believe that what we're trying to say is this:
-
If we're at the state of "S → E •" then we want (i.e. the closure includes) all rules to parse anything in the follow-set of E.
- In particular it includes rules 1 and 2: "E → E • * B" and "E → E • + B".
- I believe the definition of the closure doesn't fully account for this, but I'm not certain. If I have a spark of inspiration, or (more likely) if someone would care to point one way or the other I'll edit to make this more clear. JoeKearney 20:19, 6 April 2007 (UTC)
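For what it's worth, here is a minimal sketch of the textbook LR(0) closure for the article's grammar (S → E, E → E*B | E+B | B, B → 0 | 1; the item representation is mine). It supports the point above: closure of {"S → E •"} adds nothing, because the dot is at the end, so the extra items in Item set 3 must come from the goto (kernel) step rather than from closure itself.

```python
# Grammar from the article, as lhs -> list of right-hand sides.
GRAMMAR = {
    'S': [('E',)],
    'E': [('E', '*', 'B'), ('E', '+', 'B'), ('B',)],
    'B': [('0',), ('1',)],
}

def closure(items):
    """Textbook LR(0) closure: for each item A -> alpha . B beta with a
    nonterminal B right after the dot, add B -> . gamma for every rule
    B -> gamma.  An item is (lhs, rhs, dot_position)."""
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot) in list(result):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

# Dot at the end: closure adds no new items.
print(len(closure({('S', ('E',), 1)})))   # 1
# Dot at the start: closure pulls in all E and B productions.
print(len(closure({('S', ('E',), 0)})))   # 6
```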
[edit] Parser is NOT a finite state machine
The article states that "The parser is a finite state machine". This statement is false, as the parser has a stack, which means it has a potentially infinite number of states. Perhaps the parser is a pushdown automaton, but I am not sure enough to change it. Someone who does know for sure, please change this. 62.131.189.89 07:55, 22 May 2007 (UTC)
- I don't think your assertion is correct. A state within a state machine is a node in a directed graph that represents the progress of the finite state machine up to that point. Unless the machine has an infinite number of nodes, it is indeed a finite state machine. — Loadmaster 20:07, 11 July 2007 (UTC)
-
- No, he's right. An LR parser requires a pushdown automaton (PDA), which is an FSM plus a stack. A PDA can't be modelled by an FSM. It would need an "infinite state machine". --P3d0 22:54, 13 July 2007 (UTC)
-
-
- My take on this has always been a little different. It is true that a straightforward execution of an FSM cannot be used to recognize sentences of an LR language (except for the special case of a regular language). But the parser does use an FSM to recognize viable prefixes of right sentential forms. When a final state of this FSM has been reached, the last N symbols scanned are the handle of the right sentential form and are replaced by a new (non-terminal) symbol, creating a new right sentential form (the next step in the derivation in reverse). This replacement transformation is determined by tables that have been generated for the grammar.
-
- Conceptually, the FSM is then re-executed and the newly transformed sentential form re-scanned from the beginning looking for the next handle. This scanning and transforming is repeated until the transformed input consists only of the goal symbol.
-
- If you take a closer look at what happens with two successive executions of the FSM, you realize that the restarted FSM will make the same state transitions as the prior execution until it encounters the non-terminal symbol that was just replaced for the handle detected during the prior execution of the FSM. As an efficiency, then, the entire re-scan can be avoided by remembering (by using a stack) what those state transitions had been. So an LR language can be recognized by repeated executions of an FSM with input transformation performed between the executions (which, admittedly, is an altogether different thing than a simple application of an FSM). The use of a stack can be viewed as an optimization. -- Raaronson (talk) 13:01, 13 March 2008 (UTC)
-
[edit] Recursive ascent parsing
Prefaced the description of the stack machine with a remark about the alternative, functional, recursive-ascent view of LR parsing, and added a corresponding reference 81.5.0.80 18:45, 11 July 2007 (UTC).