Compiler Implementation
1 Linker
ldd
- show the dynamic libraries that an executable depends on
2 AddressSanitizer
- wiki: https://github.com/google/sanitizers/wiki
- flags: https://github.com/google/sanitizers/wiki/AddressSanitizerFlags
To use it:
cc -fsanitize=address a.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main() {
  char buf[10];
  // strcpy(buf, "hhhhhiiiiiooooo");
  char *hbuf = (char*)malloc(10*sizeof(char));
  // 10 characters plus the terminating NUL: 11 bytes written into 10
  strcpy(hbuf, "hhhhhddddd");
}
Note the heap buffer overflow: the string takes 11 bytes including the terminating NUL, but only 10 are allocated. With a normal compilation the program usually does not crash. Built with AddressSanitizer, it aborts and prints a report.
On Mac, the default behavior is that the program just hangs and does not finish. To make it terminate:
ASAN_OPTIONS=halt_on_error=1 ./a.out
3 C Preprocessor (CPP)
The CPP Manual : https://gcc.gnu.org/onlinedocs/cpp/
3.1 The processing
The following textual transformations are done before everything else:
- The input file is read into memory and broken into lines.
- Continued lines (lines ending with a backslash followed by a newline) are merged into one long line. There is NO way to prevent a backslash at the end of a line from being interpreted as a backslash-newline.
- All comments are replaced with single spaces.
Block comments (/* */) do not nest. For line comments (//), nesting does not matter. It is safe to put line comments inside block comments, or vice versa.
After these steps, the tokenization is performed. Then the true "preprocessing" is performed.
Preprocessing directives are lines in your program that start with #. Whitespace is allowed before and after the #.
The ‘#’ is followed by an identifier, the directive name.
The ‘#’ which begins a directive cannot come from a macro expansion.
Also, the directive name is not macro expanded.
The directives are mainly used for:
- Inclusion of header files.
- Macro expansion.
- Conditional compilation.
- Line control.
- Diagnostics.
Macros come in two kinds: object-like (e.g. BUFFER) and function-like (i.e. taking parameters).
For function-like macros, all arguments are completely macro-expanded before they are substituted into the macro body, except where the parameter is stringified or pasted (sections 3.3 and 3.4).
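A minimal sketch of the two kinds and of argument pre-expansion. The names BUFFER_SIZE, DOUBLE and doubled_twice are hypothetical, chosen only for illustration. The nested call works because the inner DOUBLE is expanded while the argument is still being processed, before it is substituted into the outer body.

```c
/* Object-like macro: a plain name replaced by its body. */
#define BUFFER_SIZE 16

/* Function-like macro: takes a parameter. */
#define DOUBLE(x) ((x) * 2)

/* The argument DOUBLE(BUFFER_SIZE) is fully expanded to ((16) * 2)
   first, then substituted into the outer DOUBLE. */
int doubled_twice(void)
{
    return DOUBLE(DOUBLE(BUFFER_SIZE));
}
```

Running `cc -E` on this file shows the fully expanded expression in the function body.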
3.1.1 Compiler options to separate the stages
Generally the compiler will do preprocessing, compilation, assembling, and linking in order.
cc -c a.c        # do not link => a.o
cc -S a.c        # do not assemble => a.s
cc -E a.c > a.i  # only do preprocessing; output goes to stdout, redirected here to a.i
3.1.2 self-referential macro
A self-referential macro is one whose name appears in its definition, e.g. #define foo (4 + foo).
The self-reference is not expanded again: foo expands to (4 + foo) and stops there.
The self-references that do not expand in the first scan are marked so that they will not expand in the second scan either.
In most cases, it is a bad idea to take advantage of this feature.
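As a small sketch of this behavior, assume a variable named foo is declared before the macro of the same name (the variable and the helper read_foo are hypothetical, added only for illustration). The macro expands once, and the inner foo then refers to the variable.

```c
/* A variable named foo, declared before the macro is defined. */
static int foo = 10;

/* Self-referential macro: the foo inside the body is not expanded
   again, so it names the variable above. */
#define foo (4 + foo)

int read_foo(void)
{
    return foo;   /* expands once to (4 + foo) */
}
```

The result is the variable's value plus 4, which is exactly the kind of surprise that makes this feature a bad idea in ordinary code.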
3.2 Predefined macros
3.2.1 Standard (in language specification)
__FILE__
- expands to the name of the current input file, in the form of a C string constant. This is the path by which the preprocessor opened the file, not the short name.
__LINE__
- expands to the current input line number
One typical use of these two macros is in log messages.
fprintf (stderr, "Internal error: "
                 "negative string length "
                 "%d at %s, line %d.",
         length, __FILE__, __LINE__);
An ‘#include’ directive changes the expansions of __FILE__ and __LINE__ to correspond to the included file. They revert when processing returns to the including file. A ‘#line’ directive changes __LINE__, and may change __FILE__ as well.
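Note, for debugging purposes, it is nice to have the current function name.
However, the preprocessor does not know what the function name is.
There do exist __func__ and __FUNCTION__, but they are not macros.
They are identifiers provided by the compiler that hold the function name as a string, so they cannot be concatenated with adjacent string literals the way __FILE__ can.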
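A sketch of a logging helper that combines the two macros with __func__. The names LOG_ERROR and check_length are hypothetical, and the code assumes a C99 or later compiler for __func__. Because __func__ is not a string literal, it is passed through a %s conversion instead of being pasted into the format string.

```c
#include <stdio.h>

/* __FILE__ and __LINE__ expand where LOG_ERROR is used, so the
   message points at the call site, not at this definition. */
#define LOG_ERROR(msg) \
    fprintf(stderr, "%s:%d: in %s: %s\n", \
            __FILE__, __LINE__, __func__, (msg))

/* Returns 0 for a valid length, -1 (after logging) otherwise. */
int check_length(int length)
{
    if (length < 0) {
        LOG_ERROR("negative string length");
        return -1;
    }
    return 0;
}
```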
__DATE__
- expands to a string constant describing the date on which the preprocessor is being run. The string constant contains eleven characters and looks like "Feb 12 1996". If the day of the month is less than 10, it is padded with a space on the left.
__TIME__
- expands to a string constant describing the time at which the preprocessor is being run. The string constant contains eight characters and looks like "23:59:01".
__STDC__
- expands to 1 when the compiler conforms to ISO Standard C, which is almost always the case. I think just assume this.
__STDC_VERSION__
- a long integer constant of the form yyyymmL, something like 199409L
__STDC_HOSTED__
- 1 in a hosted environment (the full standard library is available); should also be assumed to be 1
__cplusplus
- defined when a C++ compiler is used.
__OBJC__
- defined when an Objective-C compiler is used.
__ASSEMBLER__
- defined when preprocessing assembly language.
3.2.2 Common GNU C extension
I only list some interesting ones. For the full list, see the corresponding page in the GCC manual.
__COUNTER__
- expands to sequential integral values starting from 0.
__GNUC__
- integer, the major version of the compiler
__GNUC_MINOR__
- integer, the minor version of the compiler
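A short sketch of __COUNTER__, assuming a compiler that supports this extension (GCC and Clang do). The enumerator names are hypothetical. Each use of the macro yields the next integer; the values start at 0 only if nothing earlier in the translation unit has expanded __COUNTER__.

```c
/* Each expansion of __COUNTER__ yields the next integer, so the
   three enumerators receive consecutive values. */
enum {
    FIRST  = __COUNTER__,
    SECOND = __COUNTER__,
    THIRD  = __COUNTER__
};
```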
3.2.3 system specific
To find the macros that are defined on the current system:
cpp -dM -  # read from standard input
C-d        # send EOF, then see the result
3.2.3.1 MAC
#define OBJC_NEW_PROPERTIES 1
#define _LP64 1
#define __APPLE_CC__ 6000
#define __APPLE__ 1
#define __LP64__ 1
#define __MACH__ 1
#define __MMX__ 1
#define __clang__ 1
#define __clang_major__ 7
#define __clang_minor__ 3
#define __llvm__ 1
#define __x86_64 1
#define __x86_64__ 1
3.2.3.2 Ubuntu
#define __unix__ 1
#define __linux 1
#define __unix 1
#define __linux__ 1
#define unix 1
#define __x86_64__ 1
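These system-specific macros are typically used for conditional compilation. A minimal sketch, using only macros from the two listings above; the function name platform_name and the returned strings are hypothetical.

```c
/* Pick a platform label at preprocessing time, based on the
   system-specific predefined macros. */
const char *platform_name(void)
{
#if defined(__APPLE__)
    return "mac";
#elif defined(__linux__)
    return "linux";
#else
    return "unknown";
#endif
}
```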
3.3 Stringification
https://gcc.gnu.org/onlinedocs/cpp/Stringification.html
Parameters are not replaced inside string constants.
When a macro parameter is used with a leading ‘#’, the preprocessor replaces it with the literal text of the actual argument, converted to a string constant. Unlike normal parameter replacement, the argument is not macro-expanded first. This is called stringification.
Stringification in C involves more than putting double-quote characters around the fragment. The preprocessor backslash-escapes the quotes surrounding embedded string constants, and all backslashes within string and character constants, in order to get a valid C string constant with the proper contents.
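The rules above can be sketched with the usual two-level idiom. The macro names str and xstr follow the example in the CPP manual; the wrapper functions are hypothetical and exist only so that the results can be inspected. Since a stringified argument is not macro-expanded, an extra level of macro is needed to stringify the value of a macro.

```c
#define str(s)  #s
#define xstr(s) str(s)
#define foo 4

/* The argument is stringified as written: yields "foo". */
const char *unexpanded(void) { return str(foo); }

/* foo is expanded while passing through xstr: yields "4". */
const char *expanded(void) { return xstr(foo); }

/* Quotes and backslashes inside a string constant are escaped,
   so the result contains the characters "a\n" including quotes. */
const char *escaped(void) { return str("a\n"); }
```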
3.4 token-pasting
https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html
This is called token pasting or token concatenation.
When a macro is expanded, the two tokens on either side of each ## operator are combined into a single token, which then replaces the ## and the two original tokens in the macro expansion.
Two tokens that do not together form a valid token cannot be pasted together. CPP will report this (as a warning or an error, depending on the compiler).
struct command {
  char *name;
  void (*function) (void);
};

struct command commands[] = {
  { "quit", quit_command },
  { "help", help_command },
  ...
};
can be written as:
#define COMMAND(NAME) { #NAME, NAME ## _command }

struct command commands[] = {
  COMMAND (quit),
  COMMAND (help),
  ...
};
Another example:
#define paster( n ) printf_s( "token" #n " = %d", token##n )
int token9 = 9;
With these definitions, paster( 9 ) becomes
printf_s( "token" "9" " = %d", token9 );
// => printf_s( "token9 = %d", token9 );
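A self-contained sketch of the command table, so that the pasted names can be seen resolving to real functions. The handler bodies, their int return type, and the lookup function run_command are hypothetical additions for illustration; only the COMMAND macro comes from the text.

```c
#include <string.h>

struct command {
    const char *name;
    int (*function)(void);
};

/* Hypothetical handlers; the return values only identify them. */
static int quit_command(void) { return 0; }
static int help_command(void) { return 1; }

/* #NAME gives the string "quit"; NAME ## _command pastes the
   tokens into the identifier quit_command. */
#define COMMAND(NAME) { #NAME, NAME ## _command }

static const struct command commands[] = {
    COMMAND(quit),
    COMMAND(help),
};

/* Run the command with the given name; -1 if it is unknown. */
int run_command(const char *name)
{
    size_t i;
    for (i = 0; i < sizeof commands / sizeof commands[0]; i++) {
        if (strcmp(commands[i].name, name) == 0)
            return commands[i].function();
    }
    return -1;
}
```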
3.5 Line Markers
# linenum filename flags
They mean that the following line originated in file filename at line linenum.
After the file name come zero or more flags, which are ‘1’, ‘2’, ‘3’, or ‘4’. If there are multiple flags, they are separated by spaces and must be in ascending order.
1
- This indicates the start of a new file.
2
- This indicates returning to a file (after having included another file).
3
- This indicates that the following text comes from a system header file, so certain warnings should be suppressed.
4
- This indicates that the following text should be treated as being wrapped in an implicit extern "C" block.
They are treated like the corresponding #line directive, except that trailing flags are permitted.
4 Special Notations
4.1 Line Control
The #line directive has three forms:
#line linenum
- linenum is a non-negative decimal integer constant; it becomes the line number of the following line
#line linenum filename
- filename is a string constant; it becomes the reported file name as well
#line anything else
- anything else must be a macro that expands to one of the two forms above
The only things that change are __FILE__ and __LINE__.
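A minimal sketch of the second form. The file name "generated.y" and both function names are hypothetical. After the directive, the next source line is reported as line 100 of that file, which is how code generators point diagnostics back at their input.

```c
/* The line following the directive counts as line 100 of
   "generated.y" from the preprocessor's point of view. */
#line 100 "generated.y"
int line_after_directive(void) { return __LINE__; }
const char *file_after_directive(void) { return __FILE__; }
```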
5 GCC options
-include
- include a file before parsing
-include-pch
- include a precompiled header file (often named header.h.gch). Note that generally the include directive will look for the .h.gch version right before looking for the .h file in each directory.
6 Misc
nm a.o
list symbols from object files