Compiler
This file describes the general concepts of a compiler. I'm going to relearn it the hard way.
1 Tmp
1.1 CYK parser
Given a program of length \(n\) measured in tokens, it constructs a triangular table of width \(n\). (CYK requires the grammar to be in Chomsky normal form first.)
For the input a a b b (so \(n = 4\)), the triangle table:

T
T T
T T T
T T T T
a a b b
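To make the table concrete, here is a minimal CYK recognizer in C — my own sketch, not from any particular source — for the toy CNF grammar S -> A T | A B, T -> S B, A -> a, B -> b, which generates a^n b^n. table[i][len-1] is a bitmask of the non-terminals that derive the substring of length len starting at position i; the non-zero entries form exactly the triangle above.

#include <stdio.h>
#include <string.h>

enum { S, T, A, B };                        /* non-terminal indices */
struct { int lhs, r1, r2; } rules[] = {     /* binary CNF rules lhs -> r1 r2 */
    {S, A, T}, {S, A, B}, {T, S, B}
};

int main(void) {
    const char *w = "aabb";
    int n = (int)strlen(w), table[16][16] = {0};

    for (int i = 0; i < n; i++)             /* length-1 substrings: A -> a, B -> b */
        table[i][0] = (w[i] == 'a') ? 1 << A : 1 << B;

    for (int len = 2; len <= n; len++)      /* fill longer substrings bottom-up */
        for (int i = 0; i + len <= n; i++)
            for (int split = 1; split < len; split++)
                for (int r = 0; r < 3; r++) /* 3 = number of binary rules */
                    if ((table[i][split - 1] & (1 << rules[r].r1)) &&
                        (table[i + split][len - split - 1] & (1 << rules[r].r2)))
                        table[i][len - 1] |= 1 << rules[r].lhs;

    puts(table[0][n - 1] & (1 << S) ? "recognized" : "rejected");
    return 0;
}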
1.2 Earley's algorithm
Invented by Jay Earley, it is used to parse CFGs. It can parse all CFGs, while LR and LL can only handle restricted classes. It executes in \(O(n^3)\) where \(n\) is the length of the string, in \(O(n^2)\) for unambiguous grammars, and in linear time for almost all LR(k) grammars. It performs particularly well on left-recursive grammars.
Notation: for every input position, it generates a state set (called S(k) for input position k), each state consisting of:
- the production rule in use
- the current position in that production rule (the dot)
- the origin position in the string (where matching of this rule began)
The parser repeatedly executes three operations:
- Prediction: for every state in S(k) of the form \(X \to \alpha \cdot Y \beta\) and every production \(Y \to \gamma\), add \(Y \to \cdot \gamma\) to S(k)
- Scanning: for every state in S(k) of the form \(X \to \alpha \cdot a \beta\), where a is the next input symbol, add \(X \to \alpha a \cdot \beta\) to S(k+1)
- Completion: for every state in S(k) of the form \(X \to \gamma \cdot\) with origin position j, and every state in S(j) of the form \(Y \to \alpha \cdot X \beta\), add \(Y \to \alpha X \cdot \beta\) to S(k)
procedure PREDICTOR((A → α•Bβ, i), j, grammar)
    for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
        ADD-TO-SET((B → •γ, j), chart[j])

procedure SCANNER((A → α•Bβ, i), j)
    if B ⊂ PARTS-OF-SPEECH(word[j]) then
        ADD-TO-SET((B → word[j]•, i), chart[j+1])

procedure COMPLETER((B → γ•, j), k)
    for each (A → α•Bβ, i) in chart[j] do
        ADD-TO-SET((A → αB•β, i), chart[k])
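To make the three operations concrete, here is a small Earley recognizer in C. It is a sketch under heavy simplifications of my own: single-character symbols, the toy grammar E -> E + n | n (note: left-recursive), and fixed-size charts.

#include <stdio.h>
#include <string.h>

#define MAXITEMS 256

typedef struct { int rule, dot, origin; } Item;

static const char *lhs[] = { "E", "E" };    /* left-hand side of each rule */
static const char *rhs[] = { "E+n", "n" };  /* rule bodies, one char per symbol */
enum { NRULES = 2 };

static Item chart[32][MAXITEMS];            /* chart[k] is the state set S(k) */
static int nitems[32];

static void add(int k, Item it) {           /* add an item to S(k), skipping duplicates */
    for (int i = 0; i < nitems[k]; i++)
        if (chart[k][i].rule == it.rule && chart[k][i].dot == it.dot &&
            chart[k][i].origin == it.origin)
            return;
    chart[k][nitems[k]++] = it;
}

int main(void) {
    const char *input = "n+n";
    int n = (int)strlen(input);

    for (int r = 0; r < NRULES; r++)        /* seed S(0): predict the start symbol */
        add(0, (Item){ r, 0, 0 });

    for (int k = 0; k <= n; k++) {
        for (int i = 0; i < nitems[k]; i++) {   /* S(k) may grow while we walk it */
            Item it = chart[k][i];
            char next = rhs[it.rule][it.dot];   /* symbol after the dot */
            if (next == '\0') {                 /* COMPLETION: rule finished */
                for (int j = 0; j < nitems[it.origin]; j++) {
                    Item w = chart[it.origin][j];
                    if (rhs[w.rule][w.dot] == lhs[it.rule][0])
                        add(k, (Item){ w.rule, w.dot + 1, w.origin });
                }
            } else if (next == 'E') {           /* PREDICTION: nonterminal after dot */
                for (int r = 0; r < NRULES; r++)
                    add(k, (Item){ r, 0, k });
            } else if (k < n && next == input[k]) {  /* SCANNING: terminal matches */
                add(k + 1, (Item){ it.rule, it.dot + 1, it.origin });
            }
        }
    }

    for (int i = 0; i < nitems[n]; i++)     /* completed rule (lhs is always E here) spanning 0..n? */
        if (chart[n][i].origin == 0 && rhs[chart[n][i].rule][chart[n][i].dot] == '\0') {
            puts("recognized");
            return 0;
        }
    puts("rejected");
    return 0;
}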
2 GNU C Preprocessor
The following are done separately:
- Three transformations
- all comments are replaced with single spaces
- backslash-newline combo is removed
- predefined macros are replaced
- include header files
- macro expansion
- conditional compilation
- line control
It seems that these are not necessarily sequential; you can invoke any of them independently.
2.1 Invoking
-C
: do not discard comments
-nostdinc
: do not search the standard system directories for header files; use only -I directories
-D, -U, -undef
: predefine or un-predefine macros. Note that this affects pre-defined macros only; it will not affect macros defined in the code
-dM
: instead of the normal preprocessed output, print all the #define directives (including both pre-defined and user-defined)
-dD
: preprocessed output plus the user-defined #define directives
-dI
: preprocessed output plus the #include directives
-M, -MM, -MD, -MMD
: there are also some options dealing with source code dependencies
2.1.1 TODO The line control might be very useful for Helium
2.1.2 TODO nostdinc, what is the standard?
2.2 Macro
The simple form is a single identifier. The input is scanned sequentially. Each time a substitution happens, the result is added to the front of the remaining text, allowing nested expansion. This also means that if a nested macro is later redefined to another value, the new value is used, because expansion happens at the point of use.
In terms of macros with arguments, the substitution looks for commas to separate the parameters. A parameter can be an arbitrary expression, and it can even contain commas: parentheses are matched for each parameter, i.e. the parameter must have balanced parentheses, and a comma inside parentheses does not end the parameter. Brackets and braces, however, are not checked for balance. Like the simple macros, the output is added to the front of the remaining text and expanded accordingly.
When defining a macro with arguments, the open parenthesis of the argument list must follow the macro name immediately, without any space.
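A small illustration: the TABLESIZE/BUFSIZE pair is the classic example from the cpp manual; the MIN part is my own.

#define TABLESIZE BUFSIZE
#define BUFSIZE 1024
int table[TABLESIZE];           /* rescanning: TABLESIZE -> BUFSIZE -> 1024 */

#undef BUFSIZE
#define BUFSIZE 37
int small_table[TABLESIZE];     /* now TABLESIZE -> BUFSIZE -> 37: expansion
                                   happens at the point of use */

#define MIN(a, b) ((a) < (b) ? (a) : (b))
int f(int x, int y);
int use(void) { return MIN(f(1, 2), 3); }   /* the comma inside f(1, 2) is
                                               protected by its parentheses, so
                                               MIN still gets two arguments */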
3 Chomsky Hierarchy
3.1 Attribute grammar
The evaluation of a node can benefit from the attributes. Attributes are categorized into synthesized attributes and inherited attributes. This is mentioned in the syntax-directed translation part of the dragon book, so I don't really think it is syntax related.
3.2 Affix grammar
This seems to be a synonym of attribute grammar. Wikipedia has an example of using it to enforce number agreement between subject and verb, e.g. John likes children. The grammar does make perfect sense to me though.
Sentence → Subject+number Predicate+number
Subject+number → Noun+number
Predicate+number → Verb+number Object
Object → Noun+number
Noun+singular → John
Noun+singular → Mary
Noun+plural → children
Noun+plural → parents
Verb+singular → likes
Verb+plural → like
Verb+singular → helps
Verb+plural → help
3.3 two-level grammar
A two-level grammar is used to generate another formal grammar. This is like metaprogramming, or abstraction. One advantage is that it can define a grammar with an infinite number of rules, since the rules are generated.
For the context-sensitive language \(\{a^n b^n c^n\}\), the two-level grammar is:
N ::= 1 | N1
X ::= a | b | c
Together with the schema:
S ::= <a^N> <b^N> <c^N>
<X^{N1}> ::= <X^N> X
<X^1> ::= X
3.4 Van Wijngaarden grammar
It is named after its inventor, Adriaan van Wijngaarden, and was used for the language ALGOL 68. It is also called a vW-grammar or W-grammar. Also note that it is very powerful, but impractical.
It is:
- affix grammar
- two-level grammar
The ALGOL 68 grammar:
a) program : open symbol, standard prelude, library prelude option, particular program, exit, library postlude option, standard postlude, close symbol.
b) standard prelude : declaration prelude sequence.
c) library prelude : declaration prelude sequence.
d) particular program : label sequence option, strong CLOSED void clause.
e) exit : go on symbol, letter e letter x letter i letter t, label symbol.
f) library postlude : statement interlude.
g) standard postlude : strong void clause train
The idea is simple: provide non-terminal symbols with attributes (or affixes) that pass information between nodes of the parse tree. This seems essentially to be using syntax to specify semantics. E.g. variable := value is incomplete; instead it is REF TYPE variable := TYPE value.
3.5 Generalized LR parser
LR cannot handle nondeterministic and ambiguous grammars. GLR tolerates shift/reduce and reduce/reduce conflicts. It works by doing a breadth-first search whenever multiple choices can be taken: all the choices are explored simultaneously, until a branch reaches an error state or merges into another branch (when both have generated the same set of symbols).
3.6 About C Grammar
This is an easily misunderstood problem. There's an old 1989 discussion on comp.lang.misc.
So here's my understanding: C language is context sensitive:
- an identifier must be previously defined, to be able to be a typedef-name
- this is different from an undeclared variable: an identifier is always parsed into category "identifier", regardless of whether it has been declared. However, undefined typedefs are ambiguous.
- typedef-name : identifier
- else must match the closest else-less if
- continue and break must be in correct context
On the other hand, the C grammar in the back of K&R is context-free, because it is in BNF format. However, it accepts more than the C language.
In other words, the K&R grammar generates the valid C programs plus some invalid ones, and the invalid ones are detected by the semantic analyzer.
A word about ambiguity: it does not affect the language a grammar generates; it only means that the same sentence can be generated (or recognized) in multiple ways.
Yacc generates LALR(1) parsers. The C grammar is not a no-feedback yacc-able grammar. The thing is, when the parser is trying the rule "typedef-name: identifier", the lexer should answer "typedef-name" iff the identifier happens to be a typedef name. We also need to make sure that it tries this rule before the rule "primary: identifier". So the implementation is to add a predicate "identifier-is-really-typedef" to the action of the rule "typedef-name".
K&R has the following sentence:
With one further change, namely deleting the production typedef-name: identifier and making typedef-name a terminal symbol, this grammar is acceptable to the YACC parser-generator.
This is the fundamental reason why the yacc grammar differs from the K&R grammar. Making typedef-name a terminal symbol also means that C would have an infinite number of terminals.
David Gudeman in that mailing archive said one sentence:
The grammar is intended to show syntactic categories with semantic attachment.
And someone said:
we generally do not write grammars that accept only strictly correct inputs. Instead, we take some shortcuts and patch things up via semantics.
Chris Torek:
Strictly speaking (as long as you stick with `typedef produces a new type specifier'), C's grammar is not context free; but it is indeed not seriously `screwed up'. Perfect generators are not necessary, and a language `close enough to C' exists that can be parsed simply.
Since typedef produces new "terminals", can we really say the C grammar has an infinite number of "terminals", so that the Chomsky-hierarchy definition of a context-free grammar does not hold for C?
The C grammar in the back of K&R is by no means the exact C grammar, nor the grammar used in compiler implementations.
The typedef-name problem is also called The Lexer Hack. There must be a feedback loop from the parser to insert the typedef-name into the symbol table, and the lexer uses this information to determine whether an identifier is a typedef-name or an ordinary identifier token. The parser action must insert into this table before the , = ; declarator terminator is consumed, because the lexer is allowed to look ahead 1 token, and that token might be an identifier.
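A minimal sketch of that feedback loop in C. All names here are hypothetical, invented for illustration; a real implementation would live inside the generated lexer and the parser actions.

#include <stdio.h>
#include <string.h>

enum { IDENTIFIER, TYPEDEF_NAME };          /* hypothetical token kinds */

static const char *typedefs[64];            /* toy symbol table of typedef names */
static int ntypedefs;

void parser_saw_typedef(const char *name) { /* called from the parser's action */
    typedefs[ntypedefs++] = name;
}

int classify_identifier(const char *name) { /* called from the lexer */
    for (int i = 0; i < ntypedefs; i++)
        if (strcmp(typedefs[i], name) == 0)
            return TYPEDEF_NAME;
    return IDENTIFIER;
}

int main(void) {
    parser_saw_typedef("size_t");           /* e.g. after: typedef unsigned long size_t; */
    printf("%d %d\n", classify_identifier("size_t"), classify_identifier("x"));
    return 0;                               /* prints "1 0" */
}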
4 Parser
Three categories of parsers:
- universal
- Cocke–Younger–Kasami algorithm (CYK) and the Earley parser. They can parse any grammar, but are slow.
- bottom up
- build parse tree from bottom to top
- top down
- build parse tree from top to bottom
There are many different methods for top-down and for bottom-up parsing. The most efficient top-down and bottom-up methods work only for subclasses of grammars. Important subclasses are the LL and LR grammars.
- LL grammar
- often used by hand-implemented parsers
- LR grammar
- accepts more grammars, but such parsers are usually constructed by tools
Some terms:
- parse tree
- a graphical representation of a derivation that filters out the order in which productions are applied to replace non-terminals.
Of course, regular expression is a subset of context free grammar.
About how to divide the lexer rules and the parser rules:
- Regular expressions are most useful for describing the structure of lexical constructs, such as identifiers and keywords.
- Grammars are most useful for describing nested structures such as parentheses, begin-end, and if-then-else.
4.1 AST
- In a syntax tree, interior nodes represent programming constructs.
- In a parse tree, interior nodes represent non-terminals
- many non-terminals of a grammar represent programming constructs, but others are helpers of one sort or another, such as term and factor.
- In a syntax tree, these helpers are typically not needed and hence dropped.
To conclude:
- concrete syntax tree
- the parse tree
- concrete syntax
- the underlying grammar of a parse tree
Some new understanding of them (quotes from dragon book):
- CST
- A parse tree pictorially shows how the start symbol of a grammar derives a string in the language.
- AST
- Abstract syntax trees, or simply syntax trees, differ from parse trees because superficial distinctions of form, unimportant for translation, do not appear in syntax trees.
That is to say, CST exactly reflect the grammar, and AST removes a lot of things but keep the structure. This includes:
- remove non-value-carrying leaves
- remove unary productions
- compress spines caused by left or right recursive grammar rules into explicit list nodes.
The implementation of the AST differs among parser designers. For example, as the Clang folks said [1]:
Clang’s AST is different from ASTs produced by some other compilers in that it closely resembles both the written C++ code and the C++ standard. For example, parenthesis expressions and compile time constants are available in an unreduced form in the AST. This makes Clang’s AST a good fit for refactoring tools.
var a = 2, b = (a + 2) * 3;
Here, the ( ) around a + 2 is not represented in the AST, because the structure of the tree, combined with operator precedence rules, absolutely implies that it must exist; moreover, a + 2 * 3 would have been a different tree structure. [2, 3]
Estools [4] is a good collection of repos that have many interesting discussion threads about AST and CST.
Also, a post from Semantic Designs [5] mentioned:
Typical parser generators force the grammar engineer to specify not only the grammar, but also to explicitly specify how to procedurally build the AST as well.
In contrast, DMS automatically builds a syntax tree, either a concrete tree ("CST", mainly for debugging, containing all the language tokens), or what amounts to an AST (for production, in which non-value carrying terminals are eliminated, useless unary productions are removed, and lists are formed).
There is a very good comparison in a lecture note [6].
4.2 TODO Write/Fix the grammar
4.2.1 TODO Ambiguity
4.2.2 Left Recursion
Top down parser cannot handle left recursion.
The grammar
A ::= A alpha | beta
Can be rewritten as:
A ::= beta R
R ::= alpha R | epsilon
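A minimal C sketch of why this matters, assuming the toy instance E ::= E + n | n, rewritten as E ::= n R and R ::= + n R | epsilon. With the original left-recursive rule, E() would call E() without consuming input and recurse forever; the rewritten grammar descends cleanly.

#include <stdio.h>

const char *p;                 /* cursor into the input */

int expect(char c) { if (*p == c) { p++; return 1; } return 0; }

int R(void) {                  /* R ::= + n R | epsilon */
    if (*p == '+') { p++; if (!expect('n')) return 0; return R(); }
    return 1;                  /* epsilon: match nothing */
}

int E(void) {                  /* E ::= n R */
    return expect('n') && R();
}

int main(void) {
    p = "n+n+n";
    printf("%s\n", E() && *p == '\0' ? "accepted" : "rejected");
    return 0;
}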
4.2.3 Left Factoring
A technique to rewrite production rules to achieve the effect that we wait until enough input has been seen to make a decision. It makes a grammar more suitable for predictive or top-down parsing.
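For example (the standard textbook case; a sketch, not from these notes), these two rules share the prefix if E then S:

S ::= if E then S else S | if E then S

Left-factored, the decision about else is deferred until it is actually seen:

S ::= if E then S S'
S' ::= else S | epsilon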
4.3 Top Down Parsing
- Recursive descent parsing
- general form of top-down parsing
- may require backtracking
- Predictive parsing
- a special case of recursive-descent parsing
- do not require backtracking
- by looking ahead a fixed number (usually 1) of tokens
- LL(k)
- A class of grammar, for which we can construct a predictive parser by looking k symbols ahead.
The general recursive-descent parsing procedure is:
void A() {
    choose an A-production, A -> X1 X2 ... Xk;
    for (i = 1 to k) {
        if (Xi is a nonterminal)
            call procedure Xi();
        else if (Xi == current input symbol)
            advance to the next input symbol;
        else
            error();
    }
}
This is non-deterministic since it begins with choosing a production. To add backtracking to the algorithm, we need:
- to try different productions
- at an error, to return to the line that chooses a production
- a local variable to store the input position at the time the production was chosen.
A left-recursive grammar can send a recursive-descent parser (even one with backtracking) into an infinite loop, because it tries to expand A without consuming any input.
4.3.1 TODO LL(1)
What's the LL?
- L
- Scanning input from Left to right
- L
- producing Leftmost derivation
- 1
- lookahead 1 symbol
It is rich enough to cover most programming constructs, though left-recursive and ambiguous grammars cannot be LL(1).
The parser constructs a predictive parsing table. To parse LL(1) grammars, we use non-recursive predictive parsing. It does not need recursive calls (really??), because it constructs a parsing table; it is table-driven. A sketch follows the list below.
- algorithm 1: construct predictive parsing table
- algorithm 2: table driven predictive parsing
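Here is a minimal table-driven sketch in C, reusing the toy LL(1) grammar from above (E ::= n R, R ::= + n R | epsilon). The parsing table M is encoded directly in the if/else chain; a real generator would emit an actual table indexed by (non-terminal, lookahead).

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *input = "n+n$";         /* $ marks the end of input */
    char stack[64] = "$E";              /* parsing stack; top is the last char */
    int top = 1, ip = 0;

    while (top >= 0) {
        char X = stack[top], a = input[ip];
        if (X == a) {                   /* match a terminal (or $ against $) */
            top--; ip++;
        } else if (X == 'E') {          /* M[E][n] = E -> n R */
            if (a != 'n') { puts("error"); return 1; }
            top--; stack[++top] = 'R'; stack[++top] = 'n';
        } else if (X == 'R') {
            if (a == '+') {             /* M[R][+] = R -> + n R */
                top--; stack[++top] = 'R'; stack[++top] = 'n'; stack[++top] = '+';
            } else if (a == '$') {      /* M[R][$] = R -> epsilon */
                top--;
            } else { puts("error"); return 1; }
        } else { puts("error"); return 1; }
    }
    puts(ip == (int)strlen(input) ? "accepted" : "error");
    return 0;
}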
4.3.2 Recursive Descent Parser vs. LR Parser Generator
Well, in a word, this is actually important. See what the clang guys say [7]:
Clang is the "C Language Family Front-end", which means we intend to support the most popular members of the C family. We are convinced that the right parsing technology for this class of languages is a hand-built recursive-descent parser. Because it is plain C++ code, recursive descent makes it very easy for new developers to understand the code, it easily supports ad-hoc rules and other strange hacks required by C/C++, and makes it straight-forward to implement excellent diagnostics and error recovery.
4.4 Bottom Up Parsing
- shift-reduce parsing
- a general style of bottom-up parsing
- LR grammar
- the largest class of grammars for which shift-reduce parsers can be built
Bottom-up parsing can be thought of as reducing a string to the start symbol. At each reduction step, a substring is replaced by a non-terminal. Thus the key decisions are:
- when to reduce
- what production to apply
4.4.1 shift-reduce parsing
Think of a stack holding the current string, and the input holding the remaining input tokens.
- shift
- move from input to stack
- reduce
- replace a substring at the top of the stack
The conflicts here:
- shift/reduce conflict
- don't know whether to shift or reduce.
- reduce/reduce conflict
- don't know which production rule to use
Grammars that contain these conflicts are non-LR grammars. A worked trace is shown below.
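A worked trace may help (my own example, using the toy grammar E ::= E + n | n and input n + n; $ marks both the stack bottom and the end of input):

Stack      Input    Action
$          n+n$     shift
$ n        +n$      reduce by E ::= n
$ E        +n$      shift
$ E +      n$       shift
$ E + n    $        reduce by E ::= E + n
$ E        $        accept

At the step with stack $ n the parser must decide to reduce rather than shift; at $ E + n it must pick E ::= E + n rather than E ::= n. Resolving such decisions from the current state and one lookahead token is exactly what the LR tables below encode.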
4.4.1.1 The dangling else problem
This is the canonical example of a shift/reduce conflict, and the C grammar for selection-statement has it. The bison way to deal with it is to "choose shift, unless otherwise directed by operator precedence declarations". This naturally groups the else with the nearest if. But bison will STILL report 2 shift/reduce conflicts. You can suppress the warning with %expect 2, but this is not recommended.
A more elegant way is to use precedence or associativity. To do this, either make else have higher precedence than then, OR make them equal in precedence but give both then and else right associativity. In the case where there is no then keyword, as in C, the terminal is the right parenthesis.
Below is the code giving else higher precedence. Note that %token only declares that the token exists, with no precedence; %precedence does both the declaring and the precedence assignment. It follows the same rule: the lower the position, the higher the precedence.
%precedence "then"
%precedence "else"
Give right associativity:
%right "then" "else"
One more way to resolve this is to slightly modify the grammar:
statement = ...
  | selection-statement
statement-with-else = ...
  | selection-statement-with-else
selection-statement = ...
  | IF ( expression ) statement
  | IF ( expression ) statement-with-else ELSE statement
selection-statement-with-else = ...
  | IF ( expression ) statement-with-else ELSE statement-with-else
4.4.2 LR(k) Parsing
- L
- left to right scanning
- R
- producing rightmost derivation
- k
- number of lookahead (when omitted, assume 1)
LR parsers are table driven, like the non-recursive LL parsers.
- LR Grammar
- a grammar for which we can construct a LR parser for it.
Compared with LL parsing, it is better because:
- LR parsers can be constructed to recognize virtually all programming language constructs for which context-free grammars can be written.
- it is the most general non-backtracking shift-reduce parsing method, and can be implemented as efficiently as the others
- it can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input
- the LR grammars are a superset of the LL grammars
The drawback: hard to construct by hand.
4.4.2.1 Simple LR Parsing (SLR)
- LR(0) Item
- each production rule is written in a dot format: put one dot somewhere in the rule body. One rule therefore yields several items.
- Set of LR(0) Items
- a set of such items
- Canonical LR(0) collection
- the collection of those sets of LR(0) items that are actually used (the other possible sets are useless).
To construct Canonical LR(0) collection, introduce the CLOSURE and GOTO functions:
- CLOSURE(I)
- where I is a set of items: if \(A \rightarrow \alpha \cdot B \beta\) is in CLOSURE(I) and \(B \rightarrow \gamma\) is a production, then \(B \rightarrow \cdot \gamma\) is also in CLOSURE(I).
- GOTO(I,X)
- where I is a set of items and X is a grammar symbol: GOTO(I,X) is the closure of the set of all items \(A \rightarrow \alpha X \cdot \beta\) such that \(A \rightarrow \alpha \cdot X \beta\) is in I.
Now the algorithm to construct the canonical LR(0) collection:
void items(G') {
    C = { CLOSURE({S' -> .S}) };
    repeat until no new set is added {
        for (each set of items I in C) {
            for (each grammar symbol X) {
                if (GOTO(I,X) is nonempty and not in C)
                    add GOTO(I,X) to C;
            }
        }
    }
}
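A small worked example (my own), for the grammar S ::= ( S ) | x augmented with S' ::= S:

I0 = CLOSURE({S' -> .S}) = { S' -> .S,  S -> .(S),  S -> .x }
I1 = GOTO(I0, S) = { S' -> S. }
I2 = GOTO(I0, () = { S -> (.S),  S -> .(S),  S -> .x }
I3 = GOTO(I0, x) = { S -> x. }
I4 = GOTO(I2, S) = { S -> (S.) }
I5 = GOTO(I4, )) = { S -> (S). }

GOTO(I2, () = I2 and GOTO(I2, x) = I3, so no further sets are added.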
Now we can define LR(0) Automata:
- state
- the canonical LR(0) collection
- transition
- GOTO function
Set up for parsing: Now we have the components:
- input
- the remaining input
- stack
- the stack holds the states. Note that each state corresponds to exactly one symbol (yes, but why??). So we can always recover the symbols from the states.
- parsing table
- contains two parts: ACTION and GOTO
- ACTION(i,a)
- state i, next terminal a. The result is
- shift j
- shift the terminal and go to state j
- reduce \(A \rightarrow \beta\)
- reduce β (on the top of stack) to A
- accept
- error
- GOTO(i, A)=j
- map state i and non-terminal A to state j
Parsing algorithm:
- action = shift s
- do it
- action = reduce \(A \rightarrow \beta\)
- do the reduction by popping out \(|\beta|\) states, and then push state GOTO(stack.top, A).
The algorithm can be written as:
while (true) {
    s = stack.top;
    a = next input symbol;
    if (ACTION(s,a) == shift t) {
        stack.push(t);
        advance the input;
    } else if (ACTION(s,a) == reduce A -> beta) {
        stack.pop(|beta| states);
        t = stack.top;
        stack.push(GOTO(t,A));
        output the production A -> beta;
    } else if (ACTION(s,a) == accept) {
        break;
    } else {
        error();
    }
}
Algorithm to construct the SLR parsing table:
- get the canonical LR(0) collection
- ACTION(i,a) =
- shift j
- if \(A \rightarrow \alpha \cdot a \beta\) is in \(I_i\) and GOTO(\(I_i\), a) = \(I_j\)
- reduce \(A \rightarrow \alpha\)
- if \(A \rightarrow \alpha \cdot\) is in \(I_i\) and a is in FOLLOW(A)
- accept
- if \(S' \rightarrow S \cdot\) is in \(I_i\) and a = $
4.4.2.2 LR(1)
So we now allow lookahead. With it we can handle more grammars than LR(0). There are two methods:
- canonical-LR (LR)
- constructed from LR(1) items, a much larger set than the LR(0) items. The parsing table is much bigger, so it is not good in practice.
- lookahead-LR (LALR)
- based on the LR(0) sets of items (this is right, despite the doubt in my original note: LALR can be built from the LR(0) items plus computed lookaheads, or equivalently by merging LR(1) states that share the same core), so it has many fewer states than the LR(1) construction. The parsing table is no bigger than the SLR tables. The modern choice.
5 Syntax Directed Translation
- Syntax Directed Definition (SDD)
- a context-free grammar together with attributes and rules. Attributes are associated with grammar symbols, and rules are associated with productions.
- synthesized attribute
- for a non-terminal A at a parse tree node N is defined by a semantic rule associated with the production at N. This includes N and its children.
- inherited attribute
- for a non-terminal B at a parse tree node N is defined by a semantic rule associated with the production at the parent of N. This includes N's parent, N, and N's siblings.
- S-attributed
- an SDD is S-attributed if every attribute is synthesized. We can evaluate it in any bottom-up fashion, e.g. a post order traversal.
- L-attributed
- an SDD is L-attributed if each attribute is either synthesized, or inherited but depending only on attributes from the parent and from siblings to its left. This rule means evaluation can go from left to right, but not from right to left.
- Syntax Directed Translation Scheme (SDT)
- a context free grammar with program fragment embedded within production bodies. (This is the typical grammar file for a parser generator like ANTLR!)
Any SDT can be implemented by
- build the parse tree
- perform the actions in a left-to-right depth-first order, that is during a pre-order traversal.
Typically SDT's are implemented during parsing, without building a parse tree. We focus on two important classes of SDD:
- grammar is LR-parsable and the SDD is S-attributed, using a Postfix Translation Scheme. This scheme essentially does bottom-up parsing and evaluates the attributes in place (right at the ends of productions); see the sketch after this list.
- grammar is LL-parsable and the SDD is L-attributed. The L-attributed SDD is more general, but we must assume the grammar is LL-parsable; otherwise it is "impossible to perform translation in connection with either an LL or an LR parser". The solution is to evaluate during a pre-order traversal of the tree.
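A sketch of a postfix translation scheme, in the same bison style as the snippets above (the grammar, the NUM token, and the actions are illustrative, not from any particular source). Every action sits at the right end of its production, so it runs exactly when the LR parser reduces by that production, and only synthesized attributes ($$, $1, $3) are needed:

expr : expr '+' term  { $$ = $1 + $3; }   /* synthesized attribute: the value */
     | term           { $$ = $1; }
     ;
term : NUM            { $$ = $1; }
     ;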
6 FIRST and FOLLOW
The construction of both top-down and bottom-up parsers needs these two functions.
- FIRST(\(\alpha\))
- \(\alpha\) is a string of grammar symbols; FIRST(\(\alpha\)) is the set of terminals that strings derived from \(\alpha\) can begin with. E.g. for A ::= c B, FIRST(A) = {c}
- FOLLOW(A)
- for a non-terminal A, the set of terminals that can appear immediately to the right of A in some derivation. A worked example follows.
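A small worked example (reusing the toy grammar from earlier): for E ::= n R and R ::= + n R | epsilon, FIRST(E) = {n} and FIRST(R) = {+, epsilon}. FOLLOW(E) = {$}, since E is the start symbol and appears in no rule body; FOLLOW(R) = FOLLOW(E) = {$}, since R occurs only at the right end of rule bodies. These are exactly the sets used to fill the predictive parsing table sketched above.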
7 Error Recovery
- panic-mode
- discard input symbols until a synchronizing token is found. These are typically delimiters, such as semicolons or braces.
- phrase-level
- perform local corrections, such as removing an extra semicolon or replacing a comma with a semicolon. This is not good.
- error-production
- augment the grammar with productions that match common errors
- global-correction
- there are some algorithms to choose a minimal sequence of changes to obtain a globally least cost correction. (What are they??) [Dragon P196]
8 Tools
8.1 Elsa and Elkhound
Elkhound is an ancient parser generator, and Elsa is the C++ parser built upon it. It has clean docs, maybe clean code; worth checking out.
It implements Generalized LR (GLR) parsing, which works with any context-free grammar. Ordinary LR parser generators (like bison) require the grammar to be LALR(1).
Parsing with arbitrary context-free grammars has two key advantages: (1) unbounded lookahead, and (2) support for ambiguous grammars. Both of them are achieved by allowing multiple potential parses to coexist for as long as necessary.
The downside, since it is more general, is slower performance.
8.2 Semantic Design Inc
A commercial parser front end.
8.3 Parser generator
- yacc & lex
- generate LALR
- bison & flex
- open-source counterparts of yacc & lex, so also LALR
- antlr
- a top-down parser generator; generates recursive-descent parsers
9 Samples
int main() {
    int a, b;
    b = a + b;
    int *arp[5];
    b = (a + b) * a;
}
10 Reference
- A recursive descent parser generator in Common Lisp: http://www.informatimago.com/develop/lisp/com/informatimago/small-cl-pgms/rdp/
- cl-python: A python compiler in Common Lisp. This is VERY good. https://github.com/metawilm/cl-python