Talk:Bottom-up parsing
From Wikipedia, the free encyclopedia
Description of Yacc
I wrote this up a few weeks ago, and perhaps some of it might be incorporated into this page. I hereby give permission for its incorporation, in whole or in part, under the terms of the GNU FDL. - Dominus 16:28, 25 May 2004 (UTC)
It's easy to understand if you understand the parsing algorithm. Otherwise, it's difficult. I learned the parsing algorithm from running the C debugger on yacc parsers, but I wouldn't recommend that.
I'll try to summarize what's happening.
The parser has two main parts. It has a state machine, and it has a stack. Items on the stack have three parts:
1. a grammar symbol 2. the value of the symbol 3. the state that the parser was in when it pushed the symbol onto the stack
The parser repeats the following loop over and over:
1. look ahead at the next token on the input stream 2. Consult a table to decide what to do next.
The table is indexed by the identity of the upcoming token, and by the current state of the state machine.
There are essentially four different things that the parser can do at each step. The simplest thing to do is to halt successfully. The next simplest is to signal a parse error.
Of the interesting behaviors, the simpler is to do a "shift", which just means that the next symbol is removed from the input stream, and pushed onto the stack. The current state is pushed onto the stack at the same time.
The other thing the parser can do is 'reduce', which corresponds to replacing the right-hand side of a grammar production with the left-hand side. In this case, the symbols on the stack that correspond to the RHS of the production are popped, and a value for the LHS is computed. The parser examines the state of the new top symbol, and uses it to index another table, which tells it what new state to go into. Then it pushes the LHS symbol and its value and the new state onto the stack.
Here's the simplest example grammar I can think of:
%token NUMBER
%% expr : NUMBER '+' expr { $$ = $1 + $3 } | NUMBER { $$ = $1 } ;
I'll now explain the bison .output file:
Grammar rule 1 expr -> NUMBER '+' expr rule 2 expr -> NUMBER
This is for debugging; it says what the gammar rules are. Also, later on it'll refer to the rules by number.
Terminals, with rules where they appear
$ (-1) '+' (43) 1 error (256) NUMBER (257) 1 2
'terminals' are 'terminal symbols' which are produced by the lexer. 'error' is a special fake symbol that the parser sometimes inserts when there is an error. It's an advanced feature, so ignore it. '$' represents the end-of-input condition. Inside of bison, tokens are represented by integers. The lexer function will return one of these integers to indicate what sort of token it has found; it will also set a global variable (I think named 'yylval') to contain the value of the token, if it has one. In this example grammar, NUMBER has a value which is a number.
Nonterminals, with rules where they appear
expr (5) on left: 1 2, on right: 1
The grammar has only one nonterminal symbol, which is 'expr'.
The rest of the .output file describes the state machine and what it will do on every possible input from every possible state. Initially, the state machine is in state 0.
state 0
NUMBER shift, and go to state 1
expr go to state 4
Each state has two or three sections. Here the first section is omitted for state 0; we will see it in the explanation of the next state. The second section says what to do if the upcoming token has a certain type. In this case, if the machine is in state 0 and the upcoming token is a NUMBER, the token is shifted (pushed onto the stack) and the state machine goes into state 1. The description doesn't say so explicitly, but if the upcoming token is anything else, the parser will signal an error.
This is what we expect, since valid inputs to this parser must always begin with a NUMBER.
The third section says what to do next if the parser finds itself back here after reducing an 'expr': it should go into state 4 and continue.
state 1
expr -> NUMBER . '+' expr (rule 1) expr -> NUMBER . (rule 2)
'+' shift, and go to state 2
$default reduce using rule 2 (expr)
Here we see all three sections. The first section describes the place in the input that the parser expects to be while in this state. The '.' indicates the current read position in the input stream. The notations here say that when the parser is in state 1, it is either because it is in the middle of processing rule 1 of the grammar, and has already read in the first NUMBER, and is expecting to see a '+' sign followed by an 'expr', or else it has finished processing rule 2, and has already read in the NUMBER.
If the next input token is a '+', the parser concludes that it's actually in the 'rule 1' case rather than the 'rule 2' case. It pushes the '+' onto the stack and continues in state 2. If it sees anything else coming up (such as '$', the end-of-input token) it "reduces using rule 2", which, as you will recall, is
rule 2 expr -> NUMBER
This says that an 'expr' can be a plain NUMBER. The parser has seen the NUMBER and has it on the stack. It pops the number off and pushes an 'expr' symbol on the stack in its place. At this point it runs the associated semantic action to determine the value of the 'expr'. The semantic action is:
{ $$ = $1 }
$1 here refers to the value of the NUMBER; $$ is the resulting value. We said that the value of this expression is the same as the value of the number it comprises.
state 2
expr -> NUMBER '+' . expr (rule 1)
NUMBER shift, and go to state 1
expr go to state 3
The parser will be in state 2 after having seen the NUMBER and '+' tokens of an expression of the form 'NUMBER + expr'. If it sees another number, it pushes the number onto the stack and returns to state 1. If it sees anything else coming up, , it signals an error.
Let's pause for a moment and look at an example input. Say the input is
NUMBER 3 '+' NUMBER 4 $
The parser starts in state 0. The next token is 'NUMBER', so the parser shifts the NUMBER onto the stack and goes into state 1.
In state 1, the next token is '+', so the parser shifts the '+' onto the stack and goes into state 2.
In state 2, the next token is 'NUMBER', so the parser shifts the NUMBER onto the stack, and goes back into state 1.
At this point, the stack has
top-> NUMBER (4) (state 2) '+' (state 1) NUMBER (3) (state 0)
Now we're in state 1, and the next token is '$', which isn't explicitly listed among the shiftable tokens. So the parser takes the '$default' action, which is "reduce using rule 2".
Rule 2 has:
rule 2 expr -> NUMBER
The parser pops the stack and get the value of 4 from the popped symbol. It runs the associated semantic action, { $$ = $1 }. $1 is 4, so $$ is also 4. It pushes the new 'expr' symbol onto the stack with an associated semantic value of 4. The stack has:
top-> expr (4) (state 2) '+' (state 1) NUMBER (3) (state 0)
Finally, the parser looks at the state of the last-popped symbol, which was 2, and checks the state 2 tables to see what transition is appropriate when it has just reduced an 'expr'. State 2 says:
expr go to state 3
so the parser does. It is now in state 3.
state 3
expr -> NUMBER '+' expr . (rule 1)
$default reduce using rule 1 (expr)
The upcoming token is '$', so the parser chooses the default action, which is to reduce using rule 1. Here's rule 1:
rule 1 expr -> NUMBER '+' expr
There are three symbols on the RHS, so the parser pops three items off the stack; they are indeed a number, a '+', and an 'expr'. $1 becomes the value of the NUMBER (3). $2 becomes the value of the '+' (garbage). $3 becomes the value of the expr (4). The semantic action associated with rule 1 is executed. It is
{ $$ = $1 + $3 }
so the value of the new symbol is 7. A new symbol of type 'expr' (the LHS of the rule) and value 7 is pushed on the stack:
top-> expr (7) (state 3)
Moreover, as in the last reduction, the parser looks at the state of the last-popped symbol, which was 0, and checks the state 0 tables to see what transition is appropriate when it has just reduced an 'expr'. State 0 says:
expr go to state 4
so the parser does.
In state 4, the upcoming token is still '$', the end-of-input token, and state 4 says:
state 4
$ go to state 5
The parser is now in state 5. From there, it goes to state 6.
state 5
$ go to state 6
In state 6, the parser 'accepts', which means that it signals success:
state 6
$default accept
The value of the entire expression is the value at the top of the stack, in this case 7.
There are a couple of other points of interest. Sometimes one writes a grammar which is ambiguous. For example, consider:
expr = expr '+' expr | NUMBER ;
If the input is (NUMBER 3) + (NUMBER 4) + (NUMBER 5), then there is an ambiguity in parsing. Here's what the parser will do. First it will shift the NUMBER (3) and immediately reduce it to an 'expr'. Then it will shift the '+', and then it will shift the NUMBER (4) and immeditely reduce that to an 'expr'. The stack now has:
top -> expr (4) '+' expr (3)
The upcoming token is another '+'. The parser has two possible strategies. It could reduce the tokens on the stack immediately, or it could shift the upcoming '+' immediately and reduce later.
The 'reduce first' strategy goes through the following stacks:
(reduce by rule expr -> expr '+' expr)
top -> expr (7)
top -> '+' expr (7)
top -> NUMBER (5) '+' expr (7)
(reduce by rule expr -> NUMBER)
top -> expr (5) '+' expr (7)
(reduce by rule expr -> expr '+' expr)
top -> expr (12)
The 'shift first' strategy foes through the following stacks:
top -> '+' expr (4) '+' expr (3)
top -> NUMBER (5) '+' expr (4) '+' expr (3)
(reduce by rule expr -> NUMBER)
top -> expr (5) '+' expr (4) '+' expr (3)
(reduce by rule expr -> expr '+' expr)
top -> expr (9) '+' expr (3)
(reduce by rule expr -> expr '+' expr)
top -> expr (12)
In this case the results are the same. But if the operator had been subtraction instead of addition, the results would have been different. The "reduce first" strategy would have said that "3 - 4 - 5" was -6, and the "shift first" strategy would have said that "3 - 4 - 5" was 4. In essence, the first strategy interprets the ambiguous expressions "3 + 4 + 5" as "(3 + 4) + 5" and the second as "3 + (4 + 5)". This ambiguity is called a "shift/reduce" conflict. By defeault, Yacc resolves it in favor of the shift. Yacc has directives
%left '+'
which says that '+' should be left-associative, which juts means that this shift/reduce conflict is resolved in favor of reduction instead.
%right '+'
says that '+' should be righty-associative, so this shift/reduce conflict is resolved in favor of 'shift'. (This is the default, but the %right directive also suppresses the warning that you would otherwise get.) Finally, there's a
%nonassoc '+'
which says that the shift/reduce conflict should be resolved by generating a parse error.
I hope this clears up enough of the details that you can trace through some more complicated bison .output files and understand what it going on. Equipped with this description, yo umay also be able to step through the py parser's runs using the Perl debugger and follow what is happening with the stack and the state machine.
The 'py' approach was very successful, and it would probably be easy to do in PHP. I recommend it.
Let me know if you have any questions.