In computer science, bottom-up parsing is a strategy for analyzing unknown information that attempts to identify the most fundamental units first, and then to infer higher-order structures from them. The name comes from the concept of a parse tree, in which the most fundamental units are at the bottom, and structures composed of them are in successively higher layers, until at the top of the tree a single unit, or start symbol, comprises all of the information being analyzed.
Bottom-up parsing occurs in the analysis of both natural languages and computer languages. It is contrasted with top-down parsing, in which complex units are divided into simpler parts until all of the information is parsed.
Contents |
In linguistics, an example of bottom-up parsing would be analyzing a sentence by identifying words first, and then using properties of the words to infer grammatical relations and phrase structures to build a parse tree of the complete sentence.
The English sentence "John hit the ball" may be analysed by bottom-up parsing. In this case, the most fundamental units (ignoring conjugation and declension) are the words: "John", "hit", "the" and "ball".
A simplified grammar for English, sufficient for this sentence, is:
Proceeding bottom-up from these words, we find:
Rule used |
Words parsed | Sentence now |
---|---|---|
5 | In this case, "John" and "ball" are nouns, "the" (the definite article) is a determiner, and "hit" is a verb. | N V Det N |
3 | "the ball" is a noun phrase. | N V NP |
4 | "hit the ball" is a verb phrase. | N VP |
2 | "John" is a noun phrase by itself. | NP VP |
1 | "John hit the ball" is a complete sentence. | S |
Other parsings are possible; for example, "ball" could have been parsed as a noun phrase on its own. However, no other parsing can lead to a complete sentence using these simplified rules, as once "ball" is chosen to be a noun phrase, there is no rule that lets us parse "the". A brute-force search can be used to find all possible parsings of the sentence; if more than one parsing is possible, the sentence is ambiguous -- not uncommon in natural languages.
In programming language compilers, bottom-up parsing is a parsing method that works by identifying terminal symbols first, and combines them successively to produce nonterminals. The productions of the parser can be used to build a parse tree of a program written in human-readable source code that can be compiled to assembly language or pseudocode.
Different computer languages require different parsing techniques, although it is not uncommon to use a parsing technique that is more powerful than that actually required.
It is common for bottom-up parsers to take the form of general parsing engines, which can either parse, or generate a parser for, a specific programming language given a specification of its grammar. Perhaps the most well known generalized parser generators are YACC and GNU bison.
This trivial grammar defines simple addition expressions:
(The plus sign in rule 5 signifies repetition: "one or more digits". To distinguish it from this special use, the plus sign in rule 1 is placed inside quotation marks.)
The fundamental units of this grammar are single letters, the plus sign, and groups of one or more digits. So, in the input "a + x + 12", the fundamental units are "a", "+", "x", "+", and "12". A parsing strategy might look like this:
Rule used |
Symbols parsed | Expression now |
---|---|---|
5 | "12" is a number. | a + x + N |
2, 3 | "a", "x" and "12" are addends. | A + A + A |
1 | "a + x" is a complete sum. | S + A |
4 | The sum "a + x" is also an addend. | A + A |
1 | "a + x + 12" is a complete sum. | S |
Note that even though a complete expression was found in the third step, there was still some input remaining, so the parsing had to continue.
As with the linguistics example above, other parsings are possible; specifically, at the third step, "x + 12" could also have been legitimately parsed as a sum. This would have also resulted in a successfully completed parsing; thus, the grammar given is ambiguous. Unlike natural languages, computer languages cannot be ambiguous, so bottom-up parsing for computer languages must have additional rules, such as "the leftmost possible replacement is the one chosen". Additionally, a computer language parser will rarely examine the entire input at once, since the input may be a very large computer program; almost all parsers read the input from start to end (usually called "left to right"). However, the ability to check one or more upcoming symbols ("lookahead") before applying a rule increases the expressive power of a grammar.
The common classes of bottom-up parsers are:
Bottom-up parsing can also be done by backtracking.
The most common bottom-up parsers are the shift-reduce parsers. These parsers examine the input tokens and either shift (push) them onto a stack or reduce elements at the top of the stack, replacing a right-hand side by a left-hand side.
Often an action (or parse) table is constructed which helps the parser determine what to do next. The following is a description of what can be held in an action table.
Actions
A shift-reduce parser uses a stack to hold the grammar symbols while awaiting reduction. During the operation of the parser, symbols from the input are shifted onto the stack. If a prefix of the symbols on top of the stack matches the right-hand side (RHS) of a grammar rule which is the correct rule to use within the current context, then the parser reduces the RHS of the rule to its left-hand side (LHS), replacing the RHS symbols on top of the stack with the nonterminal occurring on the LHS of the rule. This shift-reduce process continues until the parser terminates, reporting either success or failure. It terminates with success when the input is legal and is accepted by the parser. It terminates with failure if an error is detected in the input.
The parser is a stack automaton which is in one of several discrete states. In reality, in the case of LR parsing, the parse stack contains states, rather than grammar symbols. However, since each state corresponds to a unique grammar symbol, the state stack can be mapped onto the grammar symbol stack mentioned earlier.
In step 2.1 above we are "shifting" the input symbols to one side as we move through them; hence a parser which operates by repeatedly applying steps 2.1 and 2.2 above is known as a shift-reduce parser.
A shift-reduce parser is most commonly implemented using a stack, where we proceed as follows:
Take the grammar:
Sentence --> NounPhrase VerbPhrase NounPhrase --> Art Noun VerbPhrase --> Verb | Adverb Verb Art --> the | a | ... Verb --> jumps | sings | ... Noun --> dog | cat | ...
And the input:
Then the bottom up parsing is:
Stack Input Sequence () (the dog jumps) (the) (dog jumps) SHIFT word onto stack (Art) (dog jumps) REDUCE using grammar rule (Art dog) (jumps) SHIFT.. (Art Noun) (jumps) REDUCE.. (NounPhrase) (jumps) REDUCE (NounPhrase jumps) () SHIFT (NounPhrase Verb) () REDUCE (NounPhrase VerbPhrase)() REDUCE (Sentence) () SUCCESS