Compiler
This file describes the general concepts of a compiler. I'm going to relearn it the hard way.
1 Tmp
1.1 CYK parser
Given a program of length \(n\) measured in tokens, it constructs a triangular table of width \(n\). (CYK requires the grammar to be in Chomsky normal form first.)
For the input a a b b (so \(n = 4\)), the triangle table:

T
T T
T T T
T T T T
a a b b
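To make the table concrete, here is a minimal CYK recognizer in C — my own sketch, not from any particular source — for the toy CNF grammar S -> A T | A B, T -> S B, A -> a, B -> b, which generates a^n b^n. table[i][len-1] is a bitmask of the non-terminals that derive the substring of length len starting at position i; the non-zero entries form exactly the triangle above.

#include <stdio.h>
#include <string.h>

enum { S, T, A, B };                        /* non-terminal indices */
struct { int lhs, r1, r2; } rules[] = {     /* binary CNF rules lhs -> r1 r2 */
    {S, A, T}, {S, A, B}, {T, S, B}
};

int main(void) {
    const char *w = "aabb";
    int n = (int)strlen(w), table[16][16] = {0};

    for (int i = 0; i < n; i++)             /* length-1 substrings: A -> a, B -> b */
        table[i][0] = (w[i] == 'a') ? 1 << A : 1 << B;

    for (int len = 2; len <= n; len++)      /* fill longer substrings bottom-up */
        for (int i = 0; i + len <= n; i++)
            for (int split = 1; split < len; split++)
                for (int r = 0; r < 3; r++) /* 3 = number of binary rules */
                    if ((table[i][split - 1] & (1 << rules[r].r1)) &&
                        (table[i + split][len - split - 1] & (1 << rules[r].r2)))
                        table[i][len - 1] |= 1 << rules[r].lhs;

    puts(table[0][n - 1] & (1 << S) ? "recognized" : "rejected");
    return 0;
}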
1.2 Earley's algorithm
Invented by Jay Earley, it is used to parse CFGs. It can parse all CFGs, while LR and LL can only handle restricted classes. It executes in \(O(n^3)\) where \(n\) is the length of the string, in \(O(n^2)\) for unambiguous grammars, and in linear time for almost all LR(k) grammars. It performs particularly well on left-recursive grammars.
Notation: for every input position, it generates a state set (called S(k) for input position k), each state consisting of:
- the production rule in use
- the current position in that production rule (the dot)
- the origin position in the string (where matching of this rule began)
The parser repeatedly executes three operations:
- Prediction: for every state in S(k) of the form \(X \to \alpha \cdot Y \beta\) and every production \(Y \to \gamma\), add \(Y \to \cdot \gamma\) to S(k)
- Scanning: for every state in S(k) of the form \(X \to \alpha \cdot a \beta\), where a is the next input symbol, add \(X \to \alpha a \cdot \beta\) to S(k+1)
- Completion: for every state in S(k) of the form \(X \to \gamma \cdot\) with origin position j, and every state in S(j) of the form \(Y \to \alpha \cdot X \beta\), add \(Y \to \alpha X \cdot \beta\) to S(k)
procedure PREDICTOR((A → α•Bβ, i), j, grammar)
    for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
        ADD-TO-SET((B → •γ, j), chart[j])

procedure SCANNER((A → α•Bβ, i), j)
    if B ⊂ PARTS-OF-SPEECH(word[j]) then
        ADD-TO-SET((B → word[j]•, i), chart[j+1])

procedure COMPLETER((B → γ•, j), k)
    for each (A → α•Bβ, i) in chart[j] do
        ADD-TO-SET((A → αB•β, i), chart[k])
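To make the three operations concrete, here is a small Earley recognizer in C. It is a sketch under heavy simplifications of my own: single-character symbols, the toy grammar E -> E + n | n (note: left-recursive), and fixed-size charts.

#include <stdio.h>
#include <string.h>

#define MAXITEMS 256

typedef struct { int rule, dot, origin; } Item;

static const char *lhs[] = { "E", "E" };    /* left-hand side of each rule */
static const char *rhs[] = { "E+n", "n" };  /* rule bodies, one char per symbol */
enum { NRULES = 2 };

static Item chart[32][MAXITEMS];            /* chart[k] is the state set S(k) */
static int nitems[32];

static void add(int k, Item it) {           /* add an item to S(k), skipping duplicates */
    for (int i = 0; i < nitems[k]; i++)
        if (chart[k][i].rule == it.rule && chart[k][i].dot == it.dot &&
            chart[k][i].origin == it.origin)
            return;
    chart[k][nitems[k]++] = it;
}

int main(void) {
    const char *input = "n+n";
    int n = (int)strlen(input);

    for (int r = 0; r < NRULES; r++)        /* seed S(0): predict the start symbol */
        add(0, (Item){ r, 0, 0 });

    for (int k = 0; k <= n; k++) {
        for (int i = 0; i < nitems[k]; i++) {   /* S(k) may grow while we walk it */
            Item it = chart[k][i];
            char next = rhs[it.rule][it.dot];   /* symbol after the dot */
            if (next == '\0') {                 /* COMPLETION: rule finished */
                for (int j = 0; j < nitems[it.origin]; j++) {
                    Item w = chart[it.origin][j];
                    if (rhs[w.rule][w.dot] == lhs[it.rule][0])
                        add(k, (Item){ w.rule, w.dot + 1, w.origin });
                }
            } else if (next == 'E') {           /* PREDICTION: nonterminal after dot */
                for (int r = 0; r < NRULES; r++)
                    add(k, (Item){ r, 0, k });
            } else if (k < n && next == input[k]) {  /* SCANNING: terminal matches */
                add(k + 1, (Item){ it.rule, it.dot + 1, it.origin });
            }
        }
    }

    for (int i = 0; i < nitems[n]; i++)     /* completed rule (lhs is always E here) spanning 0..n? */
        if (chart[n][i].origin == 0 && rhs[chart[n][i].rule][chart[n][i].dot] == '\0') {
            puts("recognized");
            return 0;
        }
    puts("rejected");
    return 0;
}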
2 GNU C Preprocessor
The following are done separately:
- Three transformations
- all comments are replaced with single spaces
- backslash-newline combo is removed
- predefined macros are replaced
- include header files
- macro expansion
- conditional compilation
- line control
It seems that these are not necessarily sequential; you can invoke any of them independently.
2.1 Invoking
-C
: do not discard comments
-nostdinc
: do not search the standard system directories for header files; use only -I directories
-D, -U, -undef
: predefine or un-predefine macros. Note that this affects pre-defined macros only; it will not affect macros defined in the code
-dM
: instead of the normal preprocessed output, print all the #define directives (including both pre-defined and user-defined)
-dD
: preprocessed output plus the user-defined #define directives
-dI
: preprocessed output plus the #include directives
-M, -MM, -MD, -MMD
: there are also some options dealing with source code dependencies
2.1.1 TODO The line control might be very useful for Helium
2.1.2 TODO nostdinc, what is the standard?
2.2 Macro
The simple form is a single identifier. The input is scanned sequentially. Each time a substitution happens, the result is added to the front of the remaining text, allowing nested expansion. This also means that if a nested macro is later redefined to another value, the new value is used, because expansion happens at the point of use.
In terms of macros with arguments, the substitution looks for commas to separate the parameters. A parameter can be an arbitrary expression, and it can even contain commas: parentheses are matched for each parameter, i.e. the parameter must have balanced parentheses, and a comma inside parentheses does not end the parameter. Brackets and braces, however, are not checked for balance. Like the simple macros, the output is added to the front of the remaining text and expanded accordingly.
When defining a macro with arguments, the open parenthesis of the argument list must follow the macro name immediately, without any space.
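A small illustration: the TABLESIZE/BUFSIZE pair is the classic example from the cpp manual; the MIN part is my own.

#define TABLESIZE BUFSIZE
#define BUFSIZE 1024
int table[TABLESIZE];           /* rescanning: TABLESIZE -> BUFSIZE -> 1024 */

#undef BUFSIZE
#define BUFSIZE 37
int small_table[TABLESIZE];     /* now TABLESIZE -> BUFSIZE -> 37: expansion
                                   happens at the point of use */

#define MIN(a, b) ((a) < (b) ? (a) : (b))
int f(int x, int y);
int use(void) { return MIN(f(1, 2), 3); }   /* the comma inside f(1, 2) is
                                               protected by its parentheses, so
                                               MIN still gets two arguments */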
3 Chomsky Hierarchy
3.1 Attribute grammar
The evaluation of a node can benefit from the attributes. Attributes are categorized into synthesized attributes and inherited attributes. This is mentioned in the syntax-directed translation part of the dragon book, so I don't really think it is syntax related.
3.2 Affix grammar
This seems to be a synonym of attribute grammar. Wikipedia has an example of using it to enforce number agreement between subject and verb, e.g. John likes children. The grammar does make perfect sense to me though.
Sentence → Subject+number Predicate+number
Subject+number → Noun+number
Predicate+number → Verb+number Object
Object → Noun+number
Noun+singular → John
Noun+singular → Mary
Noun+plural → children
Noun+plural → parents
Verb+singular → likes
Verb+plural → like
Verb+singular → helps
Verb+plural → help
3.3 two-level grammar
A two-level grammar is used to generate another formal grammar. This is like metaprogramming, or abstraction. One advantage is that it can define a grammar with an infinite number of rules, since the rules are generated.
For the context-sensitive language \(\{a^n b^n c^n\}\), the two-level grammar is:
N ::= 1 | N1
X ::= a | b | c
Together with the schema:
S ::= <a^N> <b^N> <c^N>
<X^{N1}> ::= <X^N> X
<X^1> ::= X
3.4 Van Wijngaarden grammar
It is named after its inventor, Adriaan van Wijngaarden, and was used for the language ALGOL 68. It is also called a vW-grammar or W-grammar. Also note that it is very powerful, but impractical.
It is:
- affix grammar
- two-level grammar
The ALGOL 68 grammar:
a) program : open symbol, standard prelude, library prelude option, particular program, exit, library postlude option, standard postlude, close symbol.
b) standard prelude : declaration prelude sequence.
c) library prelude : declaration prelude sequence.
d) particular program : label sequence option, strong CLOSED void clause.
e) exit : go on symbol, letter e letter x letter i letter t, label symbol.
f) library postlude : statement interlude.
g) standard postlude : strong void clause train
The idea is simple: provide non-terminal symbols with attributes (or affixes) that pass information between nodes of the parse tree. This seems essentially to be using syntax to specify semantics. E.g. variable := value is incomplete; instead it is REF TYPE variable := TYPE value.
3.5 Generalized LR parser
LR cannot handle nondeterministic and ambiguous grammars. GLR tolerates shift/reduce and reduce/reduce conflicts. It works by doing a breadth-first search whenever multiple choices can be taken: all the choices are explored simultaneously, until a branch reaches an error state or merges into another branch (when both have generated the same set of symbols).
3.6 About C Grammar
This is an easily misunderstood problem. There's an old 1989 discussion on comp.lang.misc.
So here's my understanding: C language is context sensitive:
- an identifier must be previously defined, to be able to be a typedef-name
- this is different from an undeclared variable: an identifier is always parsed into category "identifier", regardless of whether it has been declared. However, undefined typedefs are ambiguous.
- typedef-name : identifier
- else must match the closest else-less if
- continue and break must be in correct context
On the other hand, the C grammar in the back of K&R is context-free, because it is in BNF format. However, it accepts more than the C language.
In other words, the K&R grammar generates the valid C programs plus some invalid ones, and the invalid ones are detected by the semantic analyzer.
A word about ambiguity: it does not affect the language a grammar generates; it only means that the same sentence can be generated (or recognized) in multiple ways.
Yacc generates LALR(1) parsers. The C grammar is not a no-feedback yacc-able grammar. The thing is, when the parser is trying the rule "typedef-name: identifier", the lexer should answer "typedef-name" iff the identifier happens to be a typedef name. We also need to make sure that it tries this rule before the rule "primary: identifier". So the implementation is to add a predicate "identifier-is-really-typedef" to the action of the rule "typedef-name".
K&R has the following sentence:
With one further change, namely deleting the production typedef-name: identifier and making typedef-name a terminal symbol, this grammar is acceptable to the YACC parser-generator.
This is the fundamental reason why the yacc grammar differs from the K&R grammar. Making typedef-name a terminal symbol also means that C would have an infinite number of terminals.
David Gudeman in that mailing archive said one sentence:
The grammar is intended to show syntactic categories with semantic attachment.
And someone said:
we generally do not write grammars that accept only strictly correct inputs. Instead, we take some shortcuts and patch things up via semantics.
Chris Torek:
Strictly speaking (as long as you stick with `typedef produces a new type specifier'), C's grammar is not context free; but it is indeed not seriously `screwed up'. Perfect generators are not necessary, and a language `close enough to C' exists that can be parsed simply.
Since typedef produces new "terminals", can we really say the C grammar has an infinite number of "terminals", so that the Chomsky-hierarchy definition of a context-free grammar does not hold for C?
The C grammar in the back of K&R is by no means the exact C grammar, nor the grammar used in compiler implementations.
The typedef-name problem is also called The Lexer Hack. There must be a feedback loop from the parser to insert the typedef-name into the symbol table, and the lexer uses this information to determine whether an identifier is a typedef-name or an ordinary identifier token. The parser action must insert into this table before the , = ; declarator terminator is consumed, because the lexer is allowed to look ahead 1 token, and that token might be an identifier.
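A minimal sketch of that feedback loop in C. All names here are hypothetical, invented for illustration; a real implementation would live inside the generated lexer and the parser actions.

#include <stdio.h>
#include <string.h>

enum { IDENTIFIER, TYPEDEF_NAME };          /* hypothetical token kinds */

static const char *typedefs[64];            /* toy symbol table of typedef names */
static int ntypedefs;

void parser_saw_typedef(const char *name) { /* called from the parser's action */
    typedefs[ntypedefs++] = name;
}

int classify_identifier(const char *name) { /* called from the lexer */
    for (int i = 0; i < ntypedefs; i++)
        if (strcmp(typedefs[i], name) == 0)
            return TYPEDEF_NAME;
    return IDENTIFIER;
}

int main(void) {
    parser_saw_typedef("size_t");           /* e.g. after: typedef unsigned long size_t; */
    printf("%d %d\n", classify_identifier("size_t"), classify_identifier("x"));
    return 0;                               /* prints "1 0" */
}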
4 Parser
Three categories of parsers:
- universal
- Cocke–Younger–Kasami algorithm (CYK) and the Earley parser. They can parse any grammar, but are slow.
- bottom up
- build parse tree from bottom to top
- top down
- build parse tree from top to bottom
There are many different methods for top-down and for bottom-up parsing. The most efficient top-down and bottom-up methods work only for subclasses of grammars. Important subclasses are the LL and LR grammars.
- LL grammar
- often used by hand-implemented parsers
- LR grammar
- accepts more grammars, but such parsers are usually constructed by tools
Some terms:
- parse tree
- a graphical representation of a derivation that filters out the order in which productions are applied to replace non-terminals.
Of course, regular expression is a subset of context free grammar.
About how to divide the lexer rules and the parser rules:
- Regular expressions are most useful for describing the structure of lexical constructs, such as identifiers and keywords.
- Grammars are most useful for describing nested structures such as parentheses, begin-end, and if-then-else.
4.1 AST
- In a syntax tree, interior nodes represent programming constructs.
- In a parse tree, interior nodes represent non-terminals
- many non-terminals of a grammar represent programming constructs, but others are helpers of one sort or another, such as term and factor.
- In a syntax tree, these helpers are typically not needed and hence dropped.
To conclude:
- concrete syntax tree
- the parse tree
- concrete syntax
- the underlying grammar of a parse tree
Some new understanding of them (quotes from dragon book):
- CST
- A parse tree pictorially shows how the start symbol of a grammar derives a string in the language.
- AST
- Abstract syntax trees, or simply syntax trees, differ from parse trees because superficial distinctions of form, unimportant for translation, do not appear in syntax trees.
That is to say, CST exactly reflect the grammar, and AST removes a lot of things but keep the structure. This includes:
- remove non-value-carrying leaves
- remove unary productions
- compress spines caused by left or right recursive grammar rules into explicit list nodes.
The implementation of the AST differs among parser designers. For example, as the Clang folks said [1]:
Clang’s AST is different from ASTs produced by some other compilers in that it closely resembles both the written C++ code and the C++ standard. For example, parenthesis expressions and compile time constants are available in an unreduced form in the AST. This makes Clang’s AST a good fit for refactoring tools.
var a = 2, b = (a + 2) * 3;
Here, the ( ) around a + 2 is not represented in the AST, because the structure of the tree, combined with operator precedence rules, absolutely implies that it must exist; moreover, a + 2 * 3 would have been a different tree structure. [2, 3]
Estools [4] is a good collection of repos that have many interesting discussion threads about AST and CST.
Also, a post from Semantic Designs [5] mentioned:
Typical parser generators force the grammar engineer to specify not only the grammar, but also to explicitly specify how to procedurally build the AST as well.
In contrast, DMS automatically builds a syntax tree, either a concrete tree ("CST", mainly for debugging, containing all the language tokens), or what amounts to an AST (for production, in which non-value carrying terminals are eliminated, useless unary productions are removed, and lists are formed).
There is a very good comparison in a lecture note [6].
4.2 TODO Write/Fix the grammar
4.2.1 TODO Ambiguity
4.2.2 Left Recursion
Top down parser cannot handle left recursion.
The grammar
A ::= A alpha | beta
Can be rewritten as:
A ::= beta R
R ::= alpha R | epsilon
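A minimal C sketch of why this matters, assuming the toy instance E ::= E + n | n, rewritten as E ::= n R and R ::= + n R | epsilon. With the original left-recursive rule, E() would call E() without consuming input and recurse forever; the rewritten grammar descends cleanly.

#include <stdio.h>

const char *p;                 /* cursor into the input */

int expect(char c) { if (*p == c) { p++; return 1; } return 0; }

int R(void) {                  /* R ::= + n R | epsilon */
    if (*p == '+') { p++; if (!expect('n')) return 0; return R(); }
    return 1;                  /* epsilon: match nothing */
}

int E(void) {                  /* E ::= n R */
    return expect('n') && R();
}

int main(void) {
    p = "n+n+n";
    printf("%s\n", E() && *p == '\0' ? "accepted" : "rejected");
    return 0;
}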
4.2.3 Left Factoring
A technique to rewrite production rules to achieve the effect that we wait until enough input has been seen to make a decision. It makes a grammar more suitable for predictive or top-down parsing.
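For example (the standard textbook case; a sketch, not from these notes), these two rules share the prefix if E then S:

S ::= if E then S else S | if E then S

Left-factored, the decision about else is deferred until it is actually seen:

S ::= if E then S S'
S' ::= else S | epsilon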
4.3 Top Down Parsing
- Recursive descent parsing
- general form of top-down parsing
- may require backtracking
- Predictive parsing
- a special case of recursive-descent parsing
- do not require backtracking
- by looking ahead a fixed number (usually 1) of tokens
- LL(k)
- A class of grammar, for which we can construct a predictive parser by looking k symbols ahead.
The general recursive-descent parsing procedure is:
void A() {
    choose an A-production, A -> X1 X2 ... Xk;
    for (i = 1 to k) {
        if (Xi is a nonterminal)
            call procedure Xi();
        else if (Xi == current input symbol)
            advance to the next input symbol;
        else
            error();
    }
}
This is non-deterministic since it begins with choosing a production. To add backtracking to the algorithm, we need:
- to try different productions
- at an error, to return to the line that chooses a production
- a local variable to store the input position at the time the production was chosen.
A left-recursive grammar can send a recursive-descent parser (even one with backtracking) into an infinite loop, because it tries to expand A without consuming any input.
4.3.1 TODO LL(1)
What's the LL?
- L
- Scanning input from Left to right
- L
- producing Leftmost derivation
- 1
- lookahead 1 symbol
It is rich enough to cover most programming constructs, though left-recursive and ambiguous grammars cannot be LL(1).
The parser constructs a predictive parsing table. To parse LL(1) grammars, we use non-recursive predictive parsing. It does not need recursive calls (really??), because it constructs a parsing table; it is table-driven. A sketch follows the list below.
- algorithm 1: construct predictive parsing table
- algorithm 2: table driven predictive parsing
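Here is a minimal table-driven sketch in C, reusing the toy LL(1) grammar from above (E ::= n R, R ::= + n R | epsilon). The parsing table M is encoded directly in the if/else chain; a real generator would emit an actual table indexed by (non-terminal, lookahead).

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *input = "n+n$";         /* $ marks the end of input */
    char stack[64] = "$E";              /* parsing stack; top is the last char */
    int top = 1, ip = 0;

    while (top >= 0) {
        char X = stack[top], a = input[ip];
        if (X == a) {                   /* match a terminal (or $ against $) */
            top--; ip++;
        } else if (X == 'E') {          /* M[E][n] = E -> n R */
            if (a != 'n') { puts("error"); return 1; }
            top--; stack[++top] = 'R'; stack[++top] = 'n';
        } else if (X == 'R') {
            if (a == '+') {             /* M[R][+] = R -> + n R */
                top--; stack[++top] = 'R'; stack[++top] = 'n'; stack[++top] = '+';
            } else if (a == '$') {      /* M[R][$] = R -> epsilon */
                top--;
            } else { puts("error"); return 1; }
        } else { puts("error"); return 1; }
    }
    puts(ip == (int)strlen(input) ? "accepted" : "error");
    return 0;
}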
4.3.2 Recursive Descent Parser vs. LR Parser Generator
Well, in a word, this is actually important. See what the clang guys say [7]:
Clang is the "C Language Family Front-end", which means we intend to support the most popular members of the C family. We are convinced that the right parsing technology for this class of languages is a hand-built recursive-descent parser. Because it is plain C++ code, recursive descent makes it very easy for new developers to understand the code, it easily supports ad-hoc rules and other strange hacks required by C/C++, and makes it straight-forward to implement excellent diagnostics and error recovery.
4.4 Bottom Up Parsing
- shift-reduce parsing
- a general style of bottom-up parsing
- LR grammar
- the largest class of grammars for which shift-reduce parsers can be built
Bottom-up parsing can be thought of as reducing a string to the start symbol. At each reduction step, a substring is replaced by a non-terminal. Thus the key decisions are:
- when to reduce
- what production to apply
4.4.1 shift-reduce parsing
Think of a stack holding the current string, and the input holding the remaining input tokens.
- shift
- move from input to stack
- reduce
- replace a substring at the top of the stack
The conflicts here:
- shift/reduce conflict
- don't know whether to shift or reduce.
- reduce/reduce conflict
- don't know which production rule to use
Grammars that contain these conflicts are non-LR grammars. A worked trace is shown below.
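A worked trace may help (my own example, using the toy grammar E ::= E + n | n and input n + n; $ marks both the stack bottom and the end of input):

Stack      Input    Action
$          n+n$     shift
$ n        +n$      reduce by E ::= n
$ E        +n$      shift
$ E +      n$       shift
$ E + n    $        reduce by E ::= E + n
$ E        $        accept

At the step with stack $ n the parser must decide to reduce rather than shift; at $ E + n it must pick E ::= E + n rather than E ::= n. Resolving such decisions from the current state and one lookahead token is exactly what the LR tables below encode.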
4.4.1.1 The dangling else problem
This is the canonical example of a shift/reduce conflict, and the C grammar for selection-statement has it. The bison way to deal with it is to "choose shift, unless otherwise directed by operator precedence declarations". This naturally groups the else with the nearest if. But bison will STILL report 2 shift/reduce conflicts. You can suppress the warning with %expect 2, but this is not recommended.
A more elegant way is to use precedence or associativity. To do this, either make else have higher precedence than then, OR make them equal in precedence but give both then and else right associativity. In the case where there is no then keyword, as in C, the terminal is the right parenthesis.
Below is the code giving else higher precedence. Note that %token only declares that the token exists, with no precedence; %precedence does both the declaring and the precedence assignment. It follows the same rule: the lower the position, the higher the precedence.
%precedence "then"
%precedence "else"
Give right associativity:
%right "then" "else"
One more way to resolve this is to slightly modify the grammar:
statement = ...
  | selection-statement
statement-with-else = ...
  | selection-statement-with-else
selection-statement = ...
  | IF ( expression ) statement
  | IF ( expression ) statement-with-else ELSE statement
selection-statement-with-else = ...
  | IF ( expression ) statement-with-else ELSE statement-with-else
4.4.2 LR(k) Parsing
- L
- left to right scanning
- R
- producing rightmost derivation
- k
- number of lookahead (when omitted, assume 1)
LR parsers are table driven, like the non-recursive LL parsers.
- LR Grammar
- a grammar for which we can construct a LR parser for it.
Compared with LL parsing, it is better because:
- LR parsers can be constructed to recognize virtually all programming language constructs for which context-free grammars can be written.
- it is the most general non-backtracking shift-reduce parsing method, and can be implemented as efficiently as the others
- it can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input
- the LR grammars are a superset of the LL grammars
The drawback: hard to construct by hand.
4.4.2.1 Simple LR Parsing (SLR)
- LR(0) Item
- each production rule is written in a dot format: put one dot somewhere in the rule body. One rule therefore yields several items.
- Set of LR(0) Items
- a set of such items
- Canonical LR(0) collection
- the collection of those sets of LR(0) items that are actually used (the other possible sets are useless).
To construct Canonical LR(0) collection, introduce the CLOSURE and GOTO functions:
- CLOSURE(I)
- where I is a set of items: if \(A \rightarrow \alpha \cdot B \beta\) is in CLOSURE(I) and \(B \rightarrow \gamma\) is a production, then \(B \rightarrow \cdot \gamma\) is also in CLOSURE(I).
- GOTO(I,X)
- where I is a set of items and X is a grammar symbol: GOTO(I,X) is the closure of the set of all items \(A \rightarrow \alpha X \cdot \beta\) such that \(A \rightarrow \alpha \cdot X \beta\) is in I.
Now the algorithm to construct the canonical LR(0) collection:
void items(G') {
    C = { CLOSURE({S' -> .S}) };
    repeat until no new set is added {
        for (each set of items I in C) {
            for (each grammar symbol X) {
                if (GOTO(I,X) is nonempty and not in C)
                    add GOTO(I,X) to C;
            }
        }
    }
}
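A small worked example (my own), for the grammar S ::= ( S ) | x augmented with S' ::= S:

I0 = CLOSURE({S' -> .S}) = { S' -> .S,  S -> .(S),  S -> .x }
I1 = GOTO(I0, S) = { S' -> S. }
I2 = GOTO(I0, () = { S -> (.S),  S -> .(S),  S -> .x }
I3 = GOTO(I0, x) = { S -> x. }
I4 = GOTO(I2, S) = { S -> (S.) }
I5 = GOTO(I4, )) = { S -> (S). }

GOTO(I2, () = I2 and GOTO(I2, x) = I3, so no further sets are added.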
Now we can define LR(0) Automata:
- state
- the canonical LR(0) collection
- transition
- GOTO function
Set up for parsing: Now we have the components:
- input
- the remaining input
- stack
- the stack holds the states. Note that each state corresponds to exactly one symbol (yes, but why??). So we can always recover the symbols from the states.
- parsing table
- contains two parts: ACTION and GOTO
- ACTION(i,a)
- state i, next terminal a. The result is
- shift j
- shift the terminal and go to state j
- reduce \(A \rightarrow \beta\)
- reduce β (on the top of stack) to A
- accept
- error
- GOTO(i, A)=j
- map state i and non-terminal A to state j
Parsing algorithm:
- action = shift s
- do it
- action = reduce \(A \rightarrow \beta\)
- do the reduction by popping out \(|\beta|\) states, and then push state GOTO(stack.top, A).
The algorithm can be written as:
while (true) {
    s = stack.top;
    a = next input symbol;
    if (ACTION(s,a) == shift t) {
        stack.push(t);
        advance the input;
    } else if (ACTION(s,a) == reduce A -> beta) {
        stack.pop(|beta| states);
        t = stack.top;
        stack.push(GOTO(t,A));
        output the production A -> beta;
    } else if (ACTION(s,a) == accept) {
        break;
    } else {
        error();
    }
}
Algorithm to construct the SLR parsing table:
- get the canonical LR(0) collection
- ACTION(i,a) =
- shift j
- if \(A \rightarrow \alpha \cdot a \beta\) is in \(I_i\) and GOTO(\(I_i\), a) = \(I_j\)
- reduce \(A \rightarrow \alpha\)
- if \(A \rightarrow \alpha \cdot\) is in \(I_i\) and a is in FOLLOW(A)
- accept
- if \(S' \rightarrow S \cdot\) is in \(I_i\) and a = $
4.4.2.2 LR(1)
So we now allow lookahead. With it we can handle more grammars than LR(0). There are two methods:
- canonical-LR (LR)
- constructed from LR(1) items, a much larger set than the LR(0) items. The parsing table is much bigger, so it is not good in practice.
- lookahead-LR (LALR)
- based on the LR(0) sets of items (this is right, despite the doubt in my original note: LALR can be built from the LR(0) items plus computed lookaheads, or equivalently by merging LR(1) states that share the same core), so it has many fewer states than the LR(1) construction. The parsing table is no bigger than the SLR tables. The modern choice.
5 Syntax Directed Translation
- Syntax Directed Definition (SDD)
- a context-free grammar together with attributes and rules. Attributes are associated with grammar symbols, and rules are associated with productions.
- synthesized attribute
- for a non-terminal A at a parse tree node N is defined by a semantic rule associated with the production at N. This includes N and its children.
- inherited attribute
- for a non-terminal B at a parse tree node N is defined by a semantic rule associated with the production at the parent of N. This includes N's parent, N, and N's siblings.
- S-attributed
- an SDD is S-attributed if every attribute is synthesized. We can evaluate it in any bottom-up fashion, e.g. a post order traversal.
- L-attributed
- an SDD is L-attributed if each attribute is either synthesized, or inherited but depending only on attributes from the parent and from siblings to its left. This rule means evaluation can go from left to right, but not from right to left.
- Syntax Directed Translation Scheme (SDT)
- a context free grammar with program fragment embedded within production bodies. (This is the typical grammar file for a parser generator like ANTLR!)
Any SDT can be implemented by
- build the parse tree
- perform the actions in a left-to-right depth-first order, that is during a pre-order traversal.
Typically SDT's are implemented during parsing, without building a parse tree. We focus on two important classes of SDD:
- grammar is LR-parsable and the SDD is S-attributed, using a Postfix Translation Scheme. This scheme essentially does bottom-up parsing and evaluates the attributes in place (right at the ends of productions); see the sketch after this list.
- grammar is LL-parsable and the SDD is L-attributed. The L-attributed SDD is more general, but we must assume the grammar is LL-parsable; otherwise it is "impossible to perform translation in connection with either an LL or an LR parser". The solution is to evaluate during a pre-order traversal of the tree.
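A sketch of a postfix translation scheme, in the same bison style as the snippets above (the grammar, the NUM token, and the actions are illustrative, not from any particular source). Every action sits at the right end of its production, so it runs exactly when the LR parser reduces by that production, and only synthesized attributes ($$, $1, $3) are needed:

expr : expr '+' term  { $$ = $1 + $3; }   /* synthesized attribute: the value */
     | term           { $$ = $1; }
     ;
term : NUM            { $$ = $1; }
     ;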
6 FIRST and FOLLOW
The construction of both top-down and bottom-up parsers needs these two functions.
- FIRST(\(\alpha\))
- \(\alpha\) is a string of grammar symbols; FIRST(\(\alpha\)) is the set of terminals that strings derived from \(\alpha\) can begin with. E.g. for A ::= c B, FIRST(A) = {c}
- FOLLOW(A)
- for a non-terminal A, the set of terminals that can appear immediately to the right of A in some derivation. A worked example follows.
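A small worked example (reusing the toy grammar from earlier): for E ::= n R and R ::= + n R | epsilon, FIRST(E) = {n} and FIRST(R) = {+, epsilon}. FOLLOW(E) = {$}, since E is the start symbol and appears in no rule body; FOLLOW(R) = FOLLOW(E) = {$}, since R occurs only at the right end of rule bodies. These are exactly the sets used to fill the predictive parsing table sketched above.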
7 Error Recovery
- panic-mode
- discard input symbols until a synchronizing token is found. These are typically delimiters, such as semicolons or braces.
- phrase-level
- perform local corrections, such as removing an extra semicolon or replacing a comma with a semicolon. This is not good.
- error-production
- augment the grammar with productions that match common errors
- global-correction
- there are some algorithms to choose a minimal sequence of changes to obtain a globally least cost correction. (What are they??) [Dragon P196]
8 Tools
8.1 Elsa and Elkhound
Elkhound is an ancient parser generator, and Elsa is the C++ parser built upon it. It has clean docs, maybe clean code; worth checking out.
It implements Generalized LR (GLR) parsing, which works with any context-free grammar. Ordinary LR parser generators (like bison) require the grammar to be LALR(1).
Parsing with arbitrary context-free grammars has two key advantages: (1) unbounded lookahead, and (2) support for ambiguous grammars. Both of them are achieved by allowing multiple potential parses to coexist for as long as necessary.
The downside, since it is more general, is slower performance.
8.2 Semantic Design Inc
A commercial parser front end.
8.3 Parser generator
- yacc & lex
- generate LALR
- bison & flex
- open-source counterparts of yacc & lex, so also LALR
- antlr
- a top-down parser generator; generates recursive-descent parsers
9 Samples
int main() {
    int a, b;
    b = a + b;
    int *arp[5];
    b = (a + b) * a;
}
10 Reference
- A recursive descent parser generator in Common Lisp: http://www.informatimago.com/develop/lisp/com/informatimago/small-cl-pgms/rdp/
- cl-python: A python compiler in Common Lisp. This is VERY good. https://github.com/metawilm/cl-python