Compiler Implementation

Table of Contents

1 Linker

ldd
show the dynamic library used by an executable

2 AddressSanitizer

To use it:

cc -fsanitize=address a.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main() {
  char buf[10];
  // strcpy(buf, "hhhhhiiiiiooooo");
  char *hbuf = (char*)malloc(10*sizeof(char));
  strcpy(hbuf, "hhhhhddddd");
}

Note the heap buffer overflow, it will not crash by normal compilation. However, it will crash and print out report after using address sanitizer.

On mac, the default behavior is just hang, does not finish. To make it terminates the program:

ASAN_OPTIONS=halt_on_error=1 ./a.out

3 C Preprocessor (CPP)

3.1 The processing

The following textural transformation is done before everything:

  1. The input file is read into memory and broken into lines.
  2. Continued lines (line ends with a backslash and newline) are merged into one long line. There's NO way to prevent a backslash at the end of a line from being interpreted as a backslash-newline.
  3. All comments are replaced with single spaces. Block comments (/* */) does not nest. Line comments (//) can nest because it doesn't matter. It is safe to put line comments inside block comments, or vice versa.

After these steps, the tokenization is performed. Then the true "preprocessing" is performed.

Preprocessing directives are lines in your program that start with #. Whitespace is allowed before and after the #. The ‘#’ is followed by an identifier, the directive name. The ‘#’ which begins a directive cannot come from a macro expansion. Also, the directive name is not macro expanded.

The primary directives do:

  • Inclusion of header files.
  • Macro expansion.
  • Conditional compilation.
  • Line control.
  • Diagnostics.

Macro has two kinds, object like (e.g. BUFFER) and function like (i.e. takes parameters). For function-like macros, all arguments to a macro are completely macro-expanded before they are substituted into the macro body.

3.1.1 Compiler option to separate them

Generally the compiler will do preprocessing, compilation, assembling, and linking in order.

cc -c a.c # do not do link ==> a.o
cc -S a.c # do not do assembling => a.s
cc -E a.c # only do preprocessing, output to stdout > a.i

3.1.2 self-referential macro

A self-referential macro is one whose name appears in its definition

The self-references that do not expand in the first scan are marked so that they will not expand in the second scan either. e.g. #define foo (4 + foo). In most cases, it is a bad idea to take advantage of this feature.

3.2 Predefined macros

3.2.1 Standard (in language specification)

__FILE__
expands to the name of the current input file, in the form of a C string constant This is the path by which the preprocessor opened the file, not the short name.
__LINE__
expands to the current input line number

One typical use of these two macros are in log message.

fprintf (stderr, "Internal error: "
         "negative string length "
         "%d at %s, line %d.",
         length, __FILE__, __LINE__);

An ‘#include’ directive changes the expansions of FILE and LINE to correspond to the included file. Revert back when coming back. A ‘#line’ directive changes LINE, and may change FILE as well.

Note, for debugging purpose, it is nice to have the current function name. However, the preprocessor does not know about what the function name is. There does exist a __func__ and __FUNCTION__, but they're not macros. They are strings.

__DATE__
expand to string constant, describing the date on which the preprocessor is being run. The string constant contains eleven characters and looks like "Feb 12 1996". If the day of the month is less than 10, it is padded with a space on the left.
__TIME__
The string constant contains eight characters and looks like "23:59:01".
__STDC__
most of the time equal to 1. I think just assume this.
__STDC_VERSION__
something like 199409L
__STDC_HOSTED__
should also be assumed to be 1
__cplusplus
defined when c++ compiler is used.
__OBJC__
defined when OBJ-C compiler is used.
__ASSEMBLER__
defined when running on assembly.

3.2.2 Common GNU C extension

I only list some interesting ones. For the full list see the page in gcc manual.

__COUNTER__
expands to sequential integral values starting from 0.
__GNUC__
int, major version
__GNUC_MINOR__
int, minor version

3.2.3 system specific

To find the macros that are defined in current system:

cpp -dM - # use standard input
C-d # EOF, see result
3.2.3.1 MAC
#define OBJC_NEW_PROPERTIES 1
#define _LP64 1
#define __APPLE_CC__ 6000
#define __APPLE__ 1
#define __LP64__ 1
#define __MACH__ 1
#define __MMX__ 1
#define __clang__ 1
#define __clang_major__ 7
#define __clang_minor__ 3
#define __llvm__ 1
#define __x86_64 1
#define __x86_64__ 1
3.2.3.2 Ubuntu
#define __unix__ 1
#define __linux 1
#define __unix 1
#define __linux__ 1
#define unix 1
#define __x86_64__ 1

3.3 Stringification

https://gcc.gnu.org/onlinedocs/cpp/Stringification.html

Parameters are not replaced inside string constants.

When a macro parameter is used with a leading ‘#’, the preprocessor replaces it with the literal text of the actual argument, converted to a string constant. Unlike normal parameter replacement, the argument is not macro-expanded first. This is called stringification.

Stringification in C involves more than putting double-quote characters around the fragment. The preprocessor backslash-escapes the quotes surrounding embedded string constants, and all backslashes within string and character constants, in order to get a valid C string constant with the proper contents.

3.4 token-pasting

https://gcc.gnu.org/onlinedocs/cpp/Concatenation.html

token pasting or token concatenation

When a macro is expanded, the two tokens on either side of each ## operator are combined into a single token, which then replaces the ## and the two original tokens in the macro expansion.

Two tokens that don't together form a valid token cannot be pasted together. CPP will give warning.

struct command
{
  char *name;
  void (*function) (void);
};
struct command commands[] =
  {
    { "quit", quit_command },
    { "help", help_command },
    ...
  };

can be wrote as:

#define COMMAND(NAME)  { #NAME, NAME ## _command }
struct command commands[] =
  {
    COMMAND (quit),
    COMMAND (help),
    ...
  };

Another example:

#define paster( n ) printf_s( "token" #n " = %d", token##n )
int token9 = 9;

becomes

printf_s( "token" "9" " = %d", token9 );
// =>
printf_s( "token9 = %d", token9 );

3.5 Line Markers

# linenum filename flags

They mean that the following line originated in file filename at line linenum.

After the file name comes zero or more flags, which are ‘1’, ‘2’, ‘3’, or ‘4’. If there are multiple flags, spaces separate them, and must be in ascending order.

1
This indicates the start of a new file.
2
This indicates returning to a file (after having included another file).
3
This indicates that the following text comes from a system header file, so certain warnings should be suppressed.
4
This indicates that the following text should be treated as being wrapped in an implicit extern "C" block.

They are treated like the corresponding #line directive, except that trailing flags are permitted.

4 Special Notations

4.1 Line Control

It can have three formats:

#line linum
a non-negative integer
#line linum filename
a string constant
#line anything else
This is just a dummy, anything else must be a macro, and expands to the above two format.

The only things that changed are __FILE__ and __LINE__.

5 GCC options

  • -include include file before parsing
  • -include-pch include precompiled header file (often names as header.h.gch) Note that generally the include directive will look for the .h.gch version right before looking for .h file in each directory.

6 Misc

  • nm a.o list symbols from object files

Author: Hebi Li

Created: 2017-03-27 Mon 14:36

Validate