Table of Contents

This file describe the general concept of a compiler. I'm going to relearn it in the hard way.

1 Parser

Three category of parsers:

Cocke–Younger–Kasami algorithm (CYK) and the Earley parser. They can parse any grammar, but slow.
bottom up
build parse tree from bottom to top
top down
build parse tree from top to bottom

There are many different methods of each top-down and bottom-up parsers. The most efficient top-down and bottom-up methods work only for subclasses of grammars. Some important subclasses is LL and LR grammar.

LL grammar
often used by hand-implemented parser
LR grammar
accept more grammars, but usually construct by tools

Some terms:

parse tree
a graphical representation of a derivation that filters out the order in which productions are applied to replace non-terminals.

Of course, regular expression is a subset of context free grammar.

About how to divide lexer rules and parser rules: use regular expression.

  • Regular expression is most useful for describing structure of constructs, such as ID and keywords.
  • Grammar is most useful for describing nested structures such as parentheses, begin-end, if-then-else.

1.1 AST

  • In Syntax Tree, interior nodes represent programming constructs.
  • In parse tree, interior nodes represent non-terminals
  • many non-terminals of a grammar represent programming constructs, but others are helpers of one sort of another, such as term, factors.
  • In syntax tree, these helpers typically are not needed and hence dropped.

To conclude:

concrete syntax tree
the parse tree
concrete syntax
the underlying grammar of a parse tree

Some new understanding of them (quotes from dragon book):

A parse tree pictorially shows how the start symbol of a grammar derives a string in the language.
Abstract syntax trees, or simply syntax trees, differ from parse trees because superficial distinctions of form, unimportant for translation, do not appear in syntax trees.

That is to say, CST exactly reflect the grammar, and AST removes a lot of things but keep the structure. This includes:

  • remove non-value-carrying leaves
  • remove unary productions
  • compress spines caused by left or right recursive grammar rules into explicit list nodes.

The implementation of AST is different for different parser designers. For example as the Clang fold said1:

Clang’s AST is different from ASTs produced by some other compilers in that it closely resembles both the written C++ code and the C++ standard. For example, parenthesis expressions and compile time constants are available in an unreduced form in the AST. This makes Clang’s AST a good fit for refactoring tools.

var a = 2,
    b = (a + 2) * 3;

Here, the ( ) around a + 2 is not represented in the AST, because the structure of the tree, combined with operator precedence rules, absolutely implies that it must exist, and moreover a + 2 * 3 would have been a different tree structure. 2 , 3

Estools 4 is a good collection of repos that have many interesting discussion threads about AST and CST.

Also, a post from semantic design 5, mentioned:

Typical parser generators force the grammar engineer to specify not only the grammar, but also to explicitly specify how to procedurally build the AST as well.

In contrast, DMS automatically builds a syntax tree, either a concrete tree ("CST", mainly for debugging, containing all the language tokens), or what amounts to an AST (for production, in which non-value carrying terminals are eliminated, useless unary productions are removed, and lists are formed).

A very good comparison on a lecture note 6.

1.2 TODO Write/Fix the grammar

1.2.1 TODO Ambiguity

1.2.2 Left Recursion

Top down parser cannot handle left recursion.

The grammar

A ::= A alpha | beta

Can be rewritten as:

A ::= beta R
R ::= alpha | epsilon

1.2.3 Left Factoring

a technique to rewrite production rule to achieve the effect that we wait until enough input has been seen to make decision. It makes grammar more suitable for predictive or top-down parsing.

1.3 Top Down Parsing

Recusrive descent parsing
  • general form of top-down parsing
  • may require backtracking
Predictive parsing
  • a special case of recursive-descent parsing
  • do not require backtracking
  • By look ahead fixed number (usually 1) of tokens
A class of grammar, for which we can construct a predictive parser by looking k symbols ahead.

The general recursive descent parsing problem is:

void A() {
  choose an A-production
  for (i = 1 to k) {
    if (xi is nonterminal) call X();
    else if (xi = input symbol) advance_to_next_symbol();
    else error();

This is non-deterministic since it begins with choose a production. To augment backtracking to the algorithm, we need:

  • try different productions
  • at error, return to the line of choose production
  • we need a local variable to store where is the input symbol when choosing production.

Left recursive grammar can cause a recursive-descent parser (even the one with backtracking) into an infinite-loop. Because it try to expand A without consuming any input.

1.3.1 TODO LL(1)

What's the LL?

Scanning input from Left to right
producing Leftmost derivation
lookahead 1 symbol

It is rich enough to cover most programming constructs. However, left-recursive and ambiguous can not.

The parser will construct a predictive parsing table. To solve LL(1), we use non-recursive predictive parsing. Do not need recursive call (really??), because it constructs a parsing table. It is table-driven.

  • algorithm 1: construct predictive parsing table
  • algorithm 2: table driven predictive parsing

1.3.2 Recursive Decent Parser v.s. LR Parser generator

Well, In a word, this is actually important. See what the clang guys say 7

Clang is the "C Language Family Front-end", which means we intend to support the most popular members of the C family. We are convinced that the right parsing technology for this class of languages is a hand-built recursive-descent parser. Because it is plain C++ code, recursive descent makes it very easy for new developers to understand the code, it easily supports ad-hoc rules and other strange hacks required by C/C++, and makes it straight-forward to implement excellent diagnostics and error recovery.

1.4 Bottom Up Parsing

shift-reduce parsing
a general style of bottom-up parsing
LR grammar
the largest class of grammars for which shift-reduce parsers can be built

The bottom up parsing can think as reducing a string to the start symbol. At each reduction step, a substring is replaced by a non-terminal. Thus the key decisions are:

  • when to reduce
  • what production to apply

1.4.1 shift-reduce parsing

Think about a stack holding current string, and the input holding the rest input tokens.

move from input to stack
replace a substring at the top of the stack

The conflict here:

shift/reduce conflict
don't know to shift or reduce.
reduce/reduce conflict
don't know which production rule to use

Grammar that contains these conflicts are non-LR grammar.

1.4.2 LR(k) Parsing

left to right scanning
producing rightmost derivation
number of lookahead (when omitted, assume 1)

LR parsers are table driven, like the non-recursive LL parsers.

LR Grammar
a grammar for which we can construct a LR parser for it.

Over LL parsing, it is better because:

  • LR parsers can be constructed to recognize virtually all programming language constructs for which context-free grammars can be written.
  • the most general non-backtracking shift-reduce parsing, and can be implemented as efficient as others
  • can detect syntactic error as soon as it is possible to do so on a left-to-right scan of input
  • LR grammar is super set of LL grammar

The drawback: hard to construct by hand. Simple LR Parsing (SLR)
LR(0) Item
each production rule will be written in a dot format: put one dot somewhere in the rule. This will result in many items.
Set of LR(0) Items
a set of the items
Canonical LR(0) collection
a collection of sets of LR(0) Items, that is typically used (others are useless).

To construct Canonical LR(0) collection, introduce the CLOSURE and GOTO functions:

where I is a set of items, if \(A \rightarrow \alpha \cdot B \beta\) is in CLOSURE(I), and \(B \rightarrow \gamma\), \(B \rightarrow \cdot \gamma\) is in the set.
where I is a set of items, X is a grammar symbol. Produce a closure, if \(A \rightarrow \alpha \cdot X \beta\), \(A \rightarrow \alpha X \cdot \beta\) is in GOTO(T,X).

Now the algorithm to construct canonical LR(0) items

void items(G') {
  repeat until no new {
    for (each set I in C) {
      for (each grammar symbol X) {
        add GOTO(I,X) to C}}}}

Now we can define LR(0) Automata:

the canonical LR(0) collection
GOTO function

Set up for parsing: Now we have the components:

the remaining input
the stack holds the states. Note that each state corresponding to exactly one symbol (yes, but why??). So we can always convert to the symbols from states.
parsing table
contains two parts: ACTION and GOTO
state i, next terminal a. The result is
shift j
shift the terminal and go to state j
reduce \(A \rightarrow \beta\)
reduce β (on the top of stack) to A
GOTO(i, A)=j
map state i and non-terminal A to state j

Parsing algorithm:

action = shift s
do it
action = reduce \(A \rightarrow \beta\)
do the reduction by popping out \(|\beta|\) states, and then push state GOTO(, A).

The algorithm can be written as:

while (true) {
  s =;
  a = next input;
  if (ACTION(s,a) = shift t) {
  } else if (ACTION(s,a) = reduce A to beta) {
    t =
    output production A->beta
  } else if (ACTION=accept | error) {}

Algorithm for construct SLR parsing table:

  1. get canonical LR(0) collection
  2. ACTION(i,a) =
    shift j
    if \(A \rightarrow \alpha \cdot a \beta\) is in Ii, and GOTO(Ii,a)=Ij
    reduce A to α
    if \(A \rightarrow \alpha \cdot\) in Ii and a in FOLLOW(A).
    if \(S' \rightarrow S \cdot\) is in Ii and a = $ LR(1)

So we now allow lookahead. By this we can handle more grammars than LR(0). There're two methods:

canonical-LR (LR)
construct based on LR(1) items, a much larger set than LR(0) items. The parsing table is much bigger, so not good in practice.
lookahead-LR (LALR)
based on LR(0) (??? should be LR(1) here?) sets of items, but has many fewer states than LR(1) items. The parsing table is no bigger than SLR tables. The modern choice.

2 Syntax Directed Translation

Syntax Directed Definition (SDD)
a context-free grammar together with attributes and rules. Attributes are associated with grammar symbols, and rules are associated with productions.
synthesized attribute
for a non-terminal A at a parse tree node N is defined by a semantic rule associated with the production at N. This includes N and its children.
inherited attribute
for a non-terminal B at a parse tree node N is defined by a semantic rule associated with the production at the parent of N. This includes N's parent, N, and N's siblings.
an SDD is S-attributed if every attribute is synthesized. We can evaluate it in any bottom-up fashion, e.g. a post order traversal.
an SDD is L-attributed if each attribute is either synthesized or, inherited but only depends on the value of the parent and the symbols to the left of it on its siblings. This rule says the evaluation should go from left to right, but not right to left.
Syntax Directed Translation Scheme (SDT)
a context free grammar with program fragment embedded within production bodies. (This is the typical grammar file for a parser generator like ANTLR!)

Any SDT can be implemented by

  1. build the parse tree
  2. perform the actions in a left-to-right depth-first order, that is during a pre-order traversal.

Typically SDT's are implemented during parsing, without building a parse tree. We focus on two important classes of SDD:

  1. grammar is LR-parsable and SDD is S-attributed, using Postfix Translation Scheme. This scheme essentially do a bottom-up parsing and evaluate the attributes in place (right at ends of productions).
  2. grammar is LL-parsable and SDD is L-attributed. The L-attributed SDD is more general, but we must assume the grammar is LL-parsable. Otherwise it is "impossible to perform translation in connection with either an LL or an LR parser". The solution is to evaluate in a pre-order traversal of the tree.


The construction of both top-down and bottom-up parsers needs these two functions.

\(\alpha\) is a string of grammar symbols. The set of terminals that \(\alpha\) can begin with. E.g A::=cB, FIRST(A)=c
non-terminal A, to be the set of terminals that can appear immediately to the right of A.

4 Error Recovery

discard input symbols until synchronizing tokens are found. This is typically delimiters, such as semicolon or braces.
perform local correction, such as remove extra semicolon, replace coma with semicolon. This is not good.
use common errors
there are some algorithms to choose a minimal sequence of changes to obtain a globally least cost correction. (What are they??) [Dragon P196]

5 Tools

5.1 Elsa and Elkhound

Elkhound is an ancient parser generator, and Elsa is the C++ parser built upon it. It is clean docs, maybe clean code, worth to check out.

It implements the Generalized LR (GLR) parsing, which works with any context-free grammars. LR parsers (like bison) requires the grammar to be LALR(1).

Parsing with arbitrary context-free grammars has two key advantages: (1) unbounded lookahead, and (2) support for ambiguous grammars. Both of them are achieved by allowing multiple potential parses to coexist for as long as necessary.

The downside, since it is more general, is slower performance.

5.2 Semantic Design Inc

5.3 Parser generator

yacc & lex
generate LALR
bison & flex
open source for yacc, so also LALR
top down parser generator, generates recursive-descent parser

6 Samples

int main() {
  int a,b;
  int* arp[5];


Author: Hebi Li

Created: 2017-06-14 Wed 22:43