Clang

1. Use As Command
2. Project Setup
- 2.1. Compilation Database
3. libclang
4. LibTooling
5. Lexer
6. Reference

1 Use As Command

-ast-dump: dump the AST
-ast-dump-filter: dump only the declarations whose qualified name contains the given substring
-ast-list: list the AST nodes

clang -Xclang -ast-dump -fsyntax-only a.c
clang -emit-ast a.c
clang-check -ast-list lib/parser.cpp | grep AddValue
clang-check a.cc -ast-dump -ast-dump-filter=StdStringA --

2 Project Setup

This section mainly covers the CMake configuration.

# llvm
find_package(LLVM REQUIRED CONFIG)
message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION}")
message(STATUS "Using LLVMConfig.cmake in: ${LLVM_DIR}")
add_definitions(${LLVM_DEFINITIONS})
include_directories(${LLVM_INCLUDE_DIRS})
set(LLVM_LINK_COMPONENTS support)
# clang
find_package(Clang REQUIRED CONFIG)
# linking
link_libraries(clang clangTooling clangFrontend clangFrontendTool)
link_libraries(libclang gtest)

2.1 Compilation Database

To create a compile_commands.json from a Makefile, use Bear (with Bear 3.x and later the syntax is bear -- make):

bear make

To create one from a CMake project, use

cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON

Passing -- at the end of the tool's command line tells it not to look for a compilation database. Use -p BuildDir to read the database from that folder.
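For orientation, a compilation database is a JSON array with one entry per compiled file. The entry below is only a sketch: the paths and flags are hypothetical, while the keys directory, command, and file are the ones the format defines.

```json
[
  {
    "directory": "/path/to/build",
    "command": "clang++ -Iinclude -std=c++17 -c ../src/a.cc -o a.o",
    "file": "../src/a.cc"
  }
]
```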

3 libclang

The documentation is simply the Doxygen page: http://clang.llvm.org/doxygen/group__CINDEX.html

This library is nice for parsing a file (given the command line arguments), getting the AST, and traversing it. During the traversal, you can get the kind of each AST node, the underlying tokens, and the raw text. However, you cannot get the AST class, i.e. you cannot get an IfStmt and call its getThen method. But this is enough for dumping the AST into a plain text file, right?

3.1 Source Location

At this stage, there are three important classes: the cursor, the token, and the location.

A source location can be generated from line and column number pair.

-- line,column -> location
clang_getLocation ::  CXTranslationUnit, CXFile, Line, Column -> CXSourceLocation
clang_getRange :: CXSourceLocation, CXSourceLocation -> CXSourceRange
-- location -> file name,line,column (#line is respected)
clang_getPresumedLocation :: CXSourceLocation -> FileName, Line, Column
-- predicates
clang_Location_isInSystemHeader :: CXSourceLocation -> Int
clang_Location_isFromMainFile :: CXSourceLocation -> Int

cursor <-> location

clang_getCursor :: CXTranslationUnit, CXSourceLocation -> CXCursor
clang_getCursorLocation :: CXCursor -> CXSourceLocation
clang_getCursorExtent :: CXCursor -> CXSourceRange

token <-> location

clang_tokenize :: CXTranslationUnit, CXSourceRange -> [CXToken]
-- get location
clang_getTokenLocation :: CXTranslationUnit, CXToken -> CXSourceLocation
clang_getTokenExtent :: CXTranslationUnit, CXToken -> CXSourceRange

token -> cursor: there is no direct call to get the tokens of a cursor; tokenize the cursor's extent instead

-- roughly equivalent to clang_getCursor() with the source range
clang_annotateTokens :: CXTranslationUnit, [CXToken] -> [CXCursor]

Finally, in the Clang C++ API (not libclang), you might want to convert a SourceLocation to a line,column pair. This goes through a somewhat roundabout pair of methods.

FullSourceLoc full = context->getFullLoc(loc);
full.getSpellingLineNumber();

getFullLoc :: ASTContext -> SourceLocation -> FullSourceLoc
getSpellingLineNumber :: FullSourceLoc -> Unsigned
getSpellingColumnNumber :: FullSourceLoc -> Unsigned

SourceManager can be used to get the main file ID.

getMainFileID :: SourceManager -> FileID

3.2 Translation unit manipulation

When using libclang, the first step is to parse the file, i.e. create a translation unit from a file and some command line options. There are two functions for this. The first additionally accepts parsing option flags and seems to be the better choice, but otherwise I don't think they differ.

clang_parseTranslationUnit :: FileName, CmdArgs -> CXTranslationUnit
clang_createTranslationUnitFromSourceFile :: FileName, CmdArgs -> CXTranslationUnit

A translation unit can also be created from an AST file. Such a file is typically emitted by -emit-ast, but can also be generated using the clang_saveTranslationUnit API. It is a binary file, and does not seem very interesting to me.

clang_createTranslationUnit :: ASTFileName -> CXTranslationUnit

There is also a utility to list the included files.

// this calls the visitor function on each included file
void clang_getInclusions (CXTranslationUnit tu, CXInclusionVisitor visitor, CXClientData client_data)
typedef void (*CXInclusionVisitor) (CXFile included_file, CXSourceLocation *inclusion_stack, unsigned include_len, CXClientData client_data)

libclang supports reading Compilation Database.

-- read database file
clang_CompilationDatabase_fromDirectory :: BuildDir -> Db
-- get commands by file
clang_CompilationDatabase_getCompileCommands :: (Db, Filename) -> CMDs
-- get all commands
clang_CompilationDatabase_getAllCompileCommands :: Db -> CMDs
clang_CompileCommands_getCommand :: (CMDs, Int) -> CMD
-- get the 3 components
clang_CompileCommand_getDirectory :: CMD -> CXString
clang_CompileCommand_getFilename :: CMD -> CXString
clang_CompileCommand_getNumArgs :: CMD -> Unsigned
clang_CompileCommand_getArg :: (CMD, Int) -> CXString

3.3 Token manipulation

You can get tokens from a translation unit. The text and kind of each token are available.

-- get text
clang_getTokenSpelling :: (CXTranslationUnit, CXToken) -> CXString
-- kind can be: CXToken_ prefixed Punctuation, Keyword, Identifier, Literal, Comment
clang_getTokenKind :: CXToken -> CXTokenKind

3.4 Cursors

3.4.1 Traversing

unsigned clang_visitChildren (CXCursor parent, CXCursorVisitor visitor, CXClientData client_data)
typedef enum CXChildVisitResult(* CXCursorVisitor)
 (CXCursor cursor, CXCursor parent, CXClientData client_data)

The visitor is a function that accepts the cursor, its parent, and optional client data (a void*), and returns a result indicating whether to continue the traversal (in my case I would want to stop at the expression level, for example). The result can have three values:

CXChildVisit_Break: terminate
CXChildVisit_Continue: to sibling, without visiting children (skipping children)
CXChildVisit_Recurse: depth first for children
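The three results above can be modeled without libclang. The sketch below is a toy: Node, VisitResult, visitChildren, and collectNames are hypothetical stand-ins, not libclang types, and the node names "expr" and "stop" are made up for illustration. It only shows how Break, Continue, and Recurse steer a depth-first walk.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins, NOT libclang types: a tree node and the three visit results.
enum class VisitResult { Break, Continue, Recurse };

struct Node {
    std::string name;
    std::vector<Node> children;
};

// Calls visit(child, parent) on each child of `parent`, following the
// Break / Continue / Recurse protocol. Returns true if the walk was cut
// short by Break.
template <typename Visitor>
bool visitChildren(const Node& parent, Visitor& visit) {
    for (const Node& child : parent.children) {
        switch (visit(child, parent)) {
        case VisitResult::Break:
            return true;                          // terminate everything
        case VisitResult::Continue:
            break;                                // skip children, go to sibling
        case VisitResult::Recurse:
            if (visitChildren(child, visit)) return true;
            break;
        }
    }
    return false;
}

// Records every visited name; does not descend into nodes named "expr"
// and stops the whole traversal at a node named "stop".
std::vector<std::string> collectNames(const Node& root) {
    std::vector<std::string> seen;
    auto visitor = [&seen](const Node& n, const Node&) {
        seen.push_back(n.name);
        if (n.name == "stop") return VisitResult::Break;
        if (n.name == "expr") return VisitResult::Continue;
        return VisitResult::Recurse;
    };
    visitChildren(root, visitor);
    return seen;
}
```

With a real cursor the shape is the same: the visitor inspects the cursor kind and returns CXChildVisit_Continue for expression cursors to stop at the expression level.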

3.4.2 Type information

These are the types (e.g. float, typedef) of a cursor. Important ones include (prefixed with CXType_): Void, Bool, Short, Int, Long, Float, Double, Record, Enum, Typedef.

clang_getCursorType :: CXCursor -> CXType
-- pretty print the type
clang_getTypeSpelling :: CXType -> CXString
-- type conversion
clang_getCanonicalType :: CXType -> CXType
clang_getTypedefName :: CXType -> CXString
clang_getPointeeType :: CXType -> CXType
clang_getTypeDeclaration :: CXType -> CXCursor
-- predicates
clang_isConstQualifiedType :: CXType -> Unsigned
clang_isVolatileQualifiedType :: CXType -> Unsigned
clang_isRestrictQualifiedType :: CXType -> Unsigned
-- for function type
clang_getResultType :: CXType -> CXType
clang_getNumArgTypes :: CXType -> Int
clang_getArgType :: (CXType, Unsigned) -> CXType
-- array
clang_getElementType :: CXType -> CXType
clang_getNumElements :: CXType -> Long Long
clang_getArrayElementType :: CXType -> CXType
clang_getArraySize :: CXType -> Long Long

3.4.3 Manipulation

clang_getTranslationUnitCursor :: (CXTranslationUnit) -> CXCursor
clang_Cursor_getTranslationUnit :: (CXCursor) -> CXTranslationUnit
-- cursor kinds can be, e.g. 
-- CXCursor_VarDecl, CXCursor_IfStmt
clang_getCursorKind :: (CXCursor) -> enum CXCursorKind
-- some predicates
clang_isDeclaration :: (enum CXCursorKind) -> unsigned
clang_isReference :: (enum CXCursorKind) -> unsigned
clang_isExpression :: (enum CXCursorKind) -> unsigned
clang_isStatement :: (enum CXCursorKind) -> unsigned
-- the two parents differ e.g. for a C++ member function defined outside its class
clang_getCursorSemanticParent :: (CXCursor cursor) -> CXCursor
clang_getCursorLexicalParent :: (CXCursor cursor) -> CXCursor
-- the cursor must be an inclusion directive
clang_getIncludedFile :: (CXCursor cursor) -> CXFile

4 LibTooling

4.1 In-memory code parsing

LibTooling can be used to parse code in memory or on disk. In-memory parsing seems to need less setup (i.e. which command line arguments to use), and is intended for testing and quick experiments. It is invoked through the function runToolOnCode, which takes the code as a string and an action. The function has several variations.

runToolOnCode :: FrontendAction -> Code -> bool
runToolOnCodeWithArgs :: FrontendAction -> Code -> Args -> bool
buildASTFromCode :: Code -> ASTUnit
buildASTFromCodeWithArgs :: Code -> Args -> ASTUnit

4.2 On-disk code parsing

Real tool building with LibTooling starts by creating an instance of ClangTool, with a compilation database and an array of source files as parameters. The tool can then run any number of actions, each of type ToolAction.

run :: ToolAction -> Int
buildASTs :: [ASTUnit]

4.3 Command Line

Compilation databases are supported. In general, a compilation database specifies which commands are used to compile each file. The commands can be given on the command line, or read from a file (typically through the -p option). There is of course a parser for this, called CommonOptionsParser.

_ :: Argc -> Argv -> CommonOptionsParser (Parser)
getCompilations :: Parser -> Db
getSourcePathList :: Parser -> [String]

Or, you can use the static functions of CompilationDatabase and JSONCompilationDatabase to create the DB directly. As far as I can tell, CommonOptionsParser relies on the same machinery internally. The CMD obtained from the DB contains the directory, filename, command line, and output, as expected.

loadFromDirectory :: BuildDir -> Db
loadFromFile :: FilePath -> Db
getAllFiles :: Db -> [String]
getCompileCommands :: Db -> String -> [CMD]

Once we have the compilation database, we basically know how to compile all the files in the project.

4.4 FrontendAction

The tool runs a front-end action. The class hierarchy is: SyntaxOnlyAction (concrete) derives from ASTFrontendAction (abstract), which derives from FrontendAction (abstract). Typically, when working on the AST, we create a class deriving from ASTFrontendAction and override its CreateASTConsumer method. The consumer it creates is then called on the AST.

The consumer derives from ASTConsumer and overrides HandleTranslationUnit. This function is called once the whole translation unit has been parsed, and receives the ASTContext. The entry point into the AST is the topmost decl, obtained with Context.getTranslationUnitDecl().

You can walk the AST manually, but Clang also provides a traversal helper class, RecursiveASTVisitor. You simply create an instance of your visitor and call TraverseDecl on the translation unit decl. The visitor itself implements what to do with each AST node: override the VisitXXX method for each type of AST node you care about.

Under the hood, TraverseXXX(x) calls WalkUpFromXXX(x), which in turn calls the VisitXXX methods for the node's class and its base classes, and then recursively visits the child nodes of x. Returning false from any TraverseXXX, WalkUpFromXXX, or VisitXXX terminates the traversal. By default this is a pre-order traversal; override shouldTraversePostOrder() to return true to switch to post-order.
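The override-a-hook pattern can be sketched without Clang headers. Everything below is a toy: Stmt, ToyVisitor, and CallCounter are hypothetical names, and the real RecursiveASTVisitor has many more node kinds plus the WalkUpFrom layer. What it does share with the real class is the CRTP dispatch, the pre-order visit, and the rule that a hook returning false aborts the traversal.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy AST, NOT Clang's: one node struct with two possible kinds.
struct Stmt {
    enum Kind { IfKind, CallKind } kind;
    std::vector<std::unique_ptr<Stmt>> children;
    explicit Stmt(Kind k) : kind(k) {}
};

// CRTP base in the spirit of RecursiveASTVisitor: Traverse() first calls the
// derived class's Visit* hook for the node (pre-order), then walks the
// children. Any hook returning false aborts the whole traversal.
template <typename Derived>
class ToyVisitor {
public:
    bool Traverse(const Stmt& s) {
        Derived& self = static_cast<Derived&>(*this);
        bool keepGoing =
            (s.kind == Stmt::IfKind) ? self.VisitIf(s) : self.VisitCall(s);
        if (!keepGoing) return false;
        for (const auto& child : s.children)
            if (!Traverse(*child)) return false;
        return true;
    }
    // Default hooks do nothing and let the traversal continue.
    bool VisitIf(const Stmt&) { return true; }
    bool VisitCall(const Stmt&) { return true; }
};

// A visitor that counts call nodes and stops at the second one.
class CallCounter : public ToyVisitor<CallCounter> {
public:
    int calls = 0;
    bool VisitCall(const Stmt&) { return ++calls < 2; }
};
```

A derived visitor only overrides the hooks it cares about, which is why a real visitor usually defines just one or two VisitXXX methods.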

4.5 Type

The raw type is whatever appeared in the source code. If a type is a typedef of another type (which may be a pointer), then the type as written does not record the pointer information.

4.5.1 Canonical type

Every instance of a type has a canonical type pointer.

If the type is a simple primitive type, the pointer points to itself.
If any part of the type involves a typedef, the pointer points to a type instance that is equivalent to it but without typedefs. You can check whether two types are the same by comparing this pointer.

You should not use isa/cast/dyn_cast on types (e.g. isa<PointerType>(expr->getType())), because the type at hand may not be canonical: a typedef of a pointer type is not itself a PointerType. Use the helper functions instead: expr->getType()->isPointerType().
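A toy model makes the isa pitfall concrete. ToyType, Kind, and sameType below are hypothetical and far simpler than Clang's Type hierarchy. The sketch assumes a declaration like typedef int *IntPtr; and shows that checking the node's own kind (the analogue of isa<PointerType>) misses the pointer, while a helper that looks through the canonical pointer finds it.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Toy model, NOT Clang's Type class.
enum class Kind { Builtin, Pointer, Typedef };

struct ToyType {
    Kind kind;
    std::string spelling;      // the type as written in the source
    const ToyType* canonical;  // typedef-free equivalent; itself if canonical

    // With no explicit canonical type, the node is its own canonical type.
    ToyType(Kind k, std::string s, const ToyType* canon = nullptr)
        : kind(k), spelling(std::move(s)), canonical(canon ? canon : this) {}

    // Copying would leave `canonical` pointing at the source object.
    ToyType(const ToyType&) = delete;
    ToyType& operator=(const ToyType&) = delete;

    // Helper in the spirit of isPointerType(): looks through typedefs.
    bool isPointerType() const { return canonical->kind == Kind::Pointer; }
};

// Two types are the same if their canonical pointers are identical.
bool sameType(const ToyType& a, const ToyType& b) {
    return a.canonical == b.canonical;
}
```

For example, with ToyType intPtr(Kind::Pointer, "int *") and ToyType alias(Kind::Typedef, "IntPtr", &intPtr), alias.kind is Typedef, yet alias.isPointerType() is true and sameType(alias, intPtr) holds.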

4.5.2 QualType

The type and its qualifiers (const, volatile, restrict) are separate. Together they form the QualType. It is designed to be small and passed by value. It is essentially a pair of (Type*, bits), where the bits store the qualifiers.

This way only one Type object is needed per underlying type: int, const int, and volatile const int all share the Type for int and differ only in the qualifier bits.

const Type* getTypePtr() const;
const Type& operator*() const;
const Type* operator->() const;

SplitQualType split() const;
class SplitQualType {
public:
  const Type *Ty;
  Qualifiers Quals;
};

bool isCanonical();
QualType getCanonicalType() const;
bool isNull();

bool isConstQualified();
bool isVolatileQualified();
bool isRestrictQualified();
bool hasLocalQualifiers();
bool hasQualifiers();

Qualifiers getQualifiers();

QualType withConst();
QualType withVolatile();
QualType withRestrict();

void dump();
std::string getAsString();

static std::string getAsString(SplitQualType split);
static std::string getAsString(const Type *ty, Qualifiers qs);
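The (Type*, bits) pairing can be sketched as a pointer with flags packed into its low bits. This is a toy, not Clang's implementation: ToyType and ToyQualType are hypothetical, and the sketch assumes the type object is aligned to at least 8 bytes so that the three low bits of its address are free.

```cpp
#include <cassert>
#include <cstdint>

// Toy model, NOT Clang's classes. alignas(8) guarantees that the 3 low bits
// of any ToyType address are zero, so they can hold qualifier flags.
struct alignas(8) ToyType {
    const char* name;
};

class ToyQualType {
public:
    enum Qual : std::uintptr_t { Const = 1, Restrict = 2, Volatile = 4, Mask = 7 };

    explicit ToyQualType(const ToyType* ty, std::uintptr_t quals = 0)
        : value_(reinterpret_cast<std::uintptr_t>(ty) | (quals & Mask)) {}

    // Strip the flag bits to recover the shared type pointer.
    const ToyType* getTypePtr() const {
        return reinterpret_cast<const ToyType*>(
            value_ & ~static_cast<std::uintptr_t>(Mask));
    }

    bool isConstQualified() const { return (value_ & Const) != 0; }
    bool isVolatileQualified() const { return (value_ & Volatile) != 0; }

    // Returns a new value with const added; the type object is untouched.
    ToyQualType withConst() const {
        return ToyQualType(getTypePtr(), (value_ & Mask) | Const);
    }

private:
    std::uintptr_t value_;  // pointer bits plus qualifier bits
};
```

The whole value is one machine word, which is why passing it by value is cheap, and int and const int end up sharing a single ToyType object.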

4.6 Clang AST

A declaration has two parts: the Decl class and the DeclContext. From a Decl, you can get its context with getDeclContext. Decl supports getting the location and kind.

-- getLocStart/getLocEnd were renamed getBeginLoc/getEndLoc in newer Clang releases
getLocStart :: () -> Loc
getLocEnd :: () -> Loc
getLocation :: () -> Loc
getKind :: () -> Kind

DeclContext is basically a container of declarations. It supports iterating over its child nodes. The classes deriving from it include BlockDecl, FunctionDecl, EnumDecl, RecordDecl, and TranslationUnitDecl. A note: in Clang in general, an XXX_range provides two methods, begin and end.

decls :: DeclContext -> decl_range
decls_begin :: DeclContext -> decl_iterator
decls_end :: DeclContext -> decl_iterator

The most important class under Decl is NamedDecl, which has two main subclasses: ValueDecl and TypeDecl.

-- NamedDecl
getIdentifier :: NamedDecl -> IdInfo
getName :: NamedDecl -> String
-- ValueDecl
class NamedDecl => ValueDecl
getType :: ValueDecl -> Type
class ValueDecl => EnumConstantDecl
getInitVal :: EnumConstantDecl -> Int
class ValueDecl => DeclaratorDecl
class DeclaratorDecl => FunctionDecl
getReturnTypeSourceRange :: FunctionDecl -> SourceRange
getNameInfo :: FunctionDecl -> NameInfo
getBody :: FunctionDecl -> Stmt
parameters :: FunctionDecl -> [ParmVarDecl]
getReturnType :: FunctionDecl -> QualType

class DeclaratorDecl => FieldDecl
class DeclaratorDecl => VarDecl
isStaticLocal :: -> bool
hasExternalStorage :: -> bool
hasGlobalStorage :: -> bool
hasInit :: -> bool
getInit :: -> Expr
getStorageClass :: -> StorageClass
-- TypeDecl
class NamedDecl => TypeDecl
class TypeDecl => TypedefNameDecl
class TypedefNameDecl => TypedefDecl
class TypeDecl => TagDecl -- struct, union, class, enum
getKindName :: TagDecl -> String
getTagKind :: TagDecl -> Kind
class TagDecl => EnumDecl
enumerators :: EnumDecl -> Range
class TagDecl => RecordDecl -- struct, union, class
fields :: RecordDecl -> Range

Every Stmt has a children method, seemingly regardless of whether it can have a child. The most common classes are listed here.

Class	Methods
BreakStmt
ReturnStmt	getRetValue
ContinueStmt
IfStmt	getInit	getCond	getThen	getElse
SwitchCase	getNextCase	getSubStmt
> CaseStmt	getLHS	getRHS
> DefaultStmt
SwitchStmt	getConditionVariable	getInit	getCond	getBody	getSwitchCaseList
LabelStmt	getDecl	getName	getSubStmt
GotoStmt	getLabel	getGotoLoc	getLabelLoc
DoStmt	getCond	getBody
ForStmt	getConditionVariable	getInit	getCond	getInc	getBody
WhileStmt	getConditionVariable	getCond	getBody
CompoundStmt	body
DeclStmt	decls

Expressions in Clang.

First, we can check the value category of the expression.

isLValue :: -> bool
isXValue :: -> bool
isGLValue :: -> bool

Class	Method				Note
CallExpr	getCallee	getArgs	getCallReturnType
BinaryOperator	getOpcode	getLHS	getRHS
CastExpr	getCastKind	getSubExpr
> ExplicitCastExpr
> ImplicitCastExpr
ParenExpr	getSubExpr				the parentheses around an if/while condition are NOT a ParenExpr
MemberExpr	getBase	getMemberDecl	getNameInfo	isArrow
UnaryOperator	getOpcode	getSubExpr	isPrefix
DeclRefExpr	getDecl	getNameInfo			A reference to a declared variable, function, enum
ConditionalOperator	getCond	getTrueExpr	getFalseExpr		?: ternary operator.

ImplicitCastExpr appears very often because it represents many kinds of cast. For example:

calling a function requires the cast FunctionToPointerDecay
using a value on the right-hand side requires the cast LValueToRValue

5 Lexer

Use the lexer when you want token-level information, such as the raw source code.

The getExpansionLocation family handles macros: it gives the location where the macro is expanded. Otherwise you will get a location in the macro definition file.

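There is a class called Rewriter, used as follows. When I stored the result in a StringRef, generating a TypedefNameDecl produced what looked like partly binary data. A likely cause is that getRewrittenText returns a std::string by value, so a StringRef bound to it dangles; the version here keeps the std::string. It might still cause other problems.

  Rewriter rewriter;
  rewriter.setSourceMgr(src_mgr, LangOptions());
  std::string str = rewriter.getRewrittenText(range);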

The most reliable way is to use Lexer. It has a static method getSourceText that produces the text for a CharSourceRange. You can get a CharSourceRange from a SourceRange using the getCharRange method. However, note that Clang has two concepts of source range: the token source range and the character source range. The token source range is what the AST typically uses, and each of its ends points at the start of a token. So a typical source range from the AST's perspective consists of the beginning locations of the first and last tokens. When converting this directly to a character source range, you are guaranteed to miss the last token. The correct way is to first move the end location to the end of its token, using getLocForEndOfToken. Be careful not to call this twice.

Lexer provides a method getAsCharRange that seems intended to do exactly this. There is probably a Clang bug, though: in Lexer::getAsCharRange, getLocForEndOfToken is called, but when the CharSourceRange is constructed, the end is adjusted by getLocWithOffset(-1), which causes getSourceText to miss one character. The manual conversion looks like this:

end = Lexer::getLocForEndOfToken(end, 0, mgr, LangOptions());
CharSourceRange char_range = CharSourceRange::getCharRange(begin, end);
StringRef text = Lexer::getSourceText(char_range, mgr, LangOptions());
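Why the end location must be moved can be seen in a toy model where locations are plain offsets into a string. Nothing below is Clang API: tokenLength, locForEndOfToken, and sourceText are hypothetical helpers, and the tokenizer only understands identifiers, numbers, and one-character punctuation.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Toy model, NOT Clang's Lexer: a "location" is an offset into the buffer.

// Length of the token starting at `pos`: a run of identifier characters,
// or a single character for anything else (e.g. punctuation).
std::size_t tokenLength(const std::string& src, std::size_t pos) {
    std::size_t end = pos;
    while (end < src.size() &&
           (std::isalnum(static_cast<unsigned char>(src[end])) || src[end] == '_'))
        ++end;
    return end == pos ? 1 : end - pos;
}

// Analogue of getLocForEndOfToken: the offset just past the token at `pos`.
std::size_t locForEndOfToken(const std::string& src, std::size_t pos) {
    return pos + tokenLength(src, pos);
}

// Text of the half-open character range [begin, end).
std::string sourceText(const std::string& src, std::size_t begin, std::size_t end) {
    return src.substr(begin, end - begin);
}
```

For the buffer "int x = foo + bar;" the token range of the expression runs from the start of foo to the start of bar. Treating those two offsets as a character range yields "foo + " (the last token is lost), while moving the end with locForEndOfToken first yields "foo + bar".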

6 Reference

An article as a tutorial: http://bastian.rieck.ru/blog/posts/2016/baby_steps_libclang_function_extents/
A repo of samples: https://github.com/eliben/llvm-clang-samples