1 Use As Command

# dump ast
clang -Xclang -ast-dump -fsyntax-only a.c
clang -emit-ast a.c
# list ast nodes
clang-check -ast-list lib/parser.cpp | grep AddValue
# filter to only dump part of the AST
clang-check -ast-dump -ast-dump-filter=StdStringA --

2 Project Setup

This mainly talks about CMake configuration.

# llvm
find_package(LLVM REQUIRED CONFIG)
message(STATUS "Using LLVMConfig.cmake in: ${LLVM_DIR}")
# clang
find_package(Clang REQUIRED CONFIG)
# linking
link_libraries(clang clangTooling clangFrontend clangFrontendTool)
link_libraries(libclang gtest)

2.1 Compilation Database

To create a compile_commands.json from a Makefile, use

bear make

To create one from a CMake project, configure with -DCMAKE_EXPORT_COMPILE_COMMANDS=ON.


Passing -- at the end of the tool's command line stops it from trying to find a compilation database. Use -p BuildDir to read the database from that folder.

3 libclang

The documentation is simply the doxygen page:

This library is good for parsing a file (given the command line arguments), getting the AST, and traversing it. During the traversal, you can get the kind of each AST node, the underlying tokens, and the raw text. However, you cannot get the AST class, i.e. you cannot get an IfStmt and call its getThen method. But this is enough for dumping the AST into a plain text file, right?

3.1 Source Location

At this stage, there are three important types: the cursor, the token, and the location.

A source location can be generated from a line and column number pair.

-- line,column -> location
clang_getLocation ::  CXTranslationUnit, CXFile, Line, Column -> CXSourceLocation
clang_getRange :: CXSourceLocation, CXSourceLocation -> CXSourceRange
-- location -> filename,line,column (#line is respected)
clang_getPresumedLocation :: CXSourceLocation -> FileName, Line, Column
-- predicates
clang_Location_isInSystemHeader :: CXSourceLocation -> Int
clang_Location_isFromMainFile :: CXSourceLocation -> Int
  • cursor <-> location
clang_getCursor :: CXTranslationUnit, CXSourceLocation -> CXCursor
clang_getCursorLocation :: CXCursor -> CXSourceLocation
clang_getCursorExtent :: CXCursor -> CXSourceRange
  • token <-> location
clang_tokenize :: CXTranslationUnit, CXSourceRange -> [CXToken]
-- get location
clang_getTokenLocation :: CXTranslationUnit, CXToken -> CXSourceLocation
clang_getTokenExtent :: CXTranslationUnit, CXToken -> CXSourceRange
  • token -> cursor: you cannot get tokens directly from a cursor, but tokens can be mapped to cursors
-- roughly equivalent to clang_getCursor() with the source location of each token
clang_annotateTokens :: CXTranslationUnit, [CXToken] -> [CXCursor]

Finally, you might want to convert a SourceLocation to a line,column pair. This is part of the Clang C++ API rather than libclang, and it is done through somewhat roundabout methods.

  • context->getFullLoc(loc) -> FullSourceLoc full
  • full.getSpellingLineNumber()
getFullLoc :: ASTContext -> SourceLocation -> FullSourceLoc
getSpellingLineNumber :: FullSourceLoc -> Unsigned

SourceManager can be used to get the main file ID.

getMainFileID :: SourceManager -> FileID

3.2 Translation unit manipulation

When using libclang, the first step is to parse the file, i.e. create a translation unit from a file and some command line options. There are two functions for this. The first seems to be the better one, but I don't think they differ much.

clang_parseTranslationUnit :: FileName, CmdArgs -> CXTranslationUnit
clang_createTranslationUnitFromSourceFile :: FileName, CmdArgs -> CXTranslationUnit

A translation unit can also be created from an AST file. Such a file is typically emitted by -emit-ast, but can also be generated using the clang_saveTranslationUnit API. This is a binary file, and does not seem so interesting to me.

clang_createTranslationUnit :: ASTFileName -> CXTranslationUnit

There's also a utility to get the included files.

// this calls the visitor function on each included file
void clang_getInclusions (CXTranslationUnit tu, CXInclusionVisitor visitor, CXClientData client_data)
typedef void (*CXInclusionVisitor) (CXFile included_file, CXSourceLocation *inclusion_stack, unsigned include_len, CXClientData client_data)

libclang supports reading a compilation database.

-- read database file
clang_CompilationDatabase_fromDirectory :: BuildDir -> Db
-- get commands by file
clang_CompilationDatabase_getCompileCommands :: (Db, Filename) -> CMDs
-- get all commands
clang_CompilationDatabase_getAllCompileCommands :: Db -> CMDs
clang_CompileCommands_getCommand :: (CMDs, Int) -> CMD
-- get the 3 components
clang_CompileCommand_getDirectory :: CMD -> CXString
clang_CompileCommand_getFilename :: CMD -> CXString
clang_CompileCommand_getNumArgs :: CMD -> Unsigned
clang_CompileCommand_getArg :: (CMD, Int) -> CXString

3.3 Token manipulation

You can get tokens from a translation unit. The text and kind of each token are available.

-- get text
clang_getTokenSpelling :: (CXTranslationUnit, CXToken) -> CXString
-- kind can be: CXToken_ prefixed Punctuation, Keyword, Identifier, Literal, Comment
clang_getTokenKind :: CXToken -> CXTokenKind

3.4 Cursors

3.4.1 Traversing

unsigned clang_visitChildren (CXCursor parent, CXCursorVisitor visitor, CXClientData client_data)
typedef enum CXChildVisitResult(* CXCursorVisitor)
 (CXCursor cursor, CXCursor parent, CXClientData client_data)

The visitor is a function that accepts the cursor, its parent, and optionally some client data (a void*), and returns a result indicating whether to continue the traversal (in my case I would want to stop at the expression level, for example). The result can have three values:

  • CXChildVisit_Break: terminate
  • CXChildVisit_Continue: to sibling, without visiting children (skipping children)
  • CXChildVisit_Recurse: depth first for children

3.4.2 Type information

This is the type (e.g. float, typedef) of a cursor. Important kinds include (prefixed with CXType_): Void, Bool, Short, Int, Long, Float, Double, Record, Enum, Typedef.

clang_getCursorType :: CXCursor -> CXType
-- pretty print the type
clang_getTypeSpelling :: CXType -> CXString
-- type conversion
clang_getCanonicalType :: CXType -> CXType
clang_getTypedefName :: CXType -> CXString
clang_getPointeeType :: CXType -> CXType
clang_getTypeDeclaration :: CXType -> CXCursor
-- predicates
clang_isConstQualifiedType :: CXType -> Unsigned
clang_isVolatileQualifiedType :: CXType -> Unsigned
clang_isRestrictQualifiedType :: CXType -> Unsigned
-- for function type
clang_getResultType :: CXType -> CXType
clang_getNumArgTypes :: CXType -> Int
clang_getArgType :: (CXType, Unsigned) -> CXType
-- array
clang_getElementType :: CXType -> CXType
clang_getNumElements :: CXType -> Long Long
clang_getArrayElementType :: CXType -> CXType
clang_getArraySize :: CXType -> Long Long

3.4.3 Manipulation

clang_getTranslationUnitCursor :: (CXTranslationUnit) -> CXCursor
clang_Cursor_getTranslationUnit :: (CXCursor) -> CXTranslationUnit
-- cursor kinds can be, e.g. 
-- CXCursor_VarDecl, CXCursor_IfStmt
clang_getCursorKind :: (CXCursor) -> enum CXCursorKind
-- some predicates
clang_isDeclaration :: (enum CXCursorKind) -> unsigned
clang_isReference :: (enum CXCursorKind) -> unsigned
clang_isExpression :: (enum CXCursorKind) -> unsigned
clang_isStatement :: (enum CXCursorKind) -> unsigned
-- seems to be related to C++ namespace
clang_getCursorSemanticParent :: (CXCursor cursor) -> CXCursor
clang_getCursorLexicalParent :: (CXCursor cursor) -> CXCursor
-- the cursor must be an inclusion directive
clang_getIncludedFile :: (CXCursor cursor) -> CXFile

4 LibTooling

4.1 In-memory code parsing

LibTooling can be used to parse code in memory or on disk. The in-memory code parsing seems to support less setup (i.e. which command line arguments to use), and is intended for testing initial results. It is invoked through the function runToolOnCode, with the code as a string and an action. The function has several variations.

runToolOnCode :: FrontendAction -> Code -> bool
runToolOnCodeWithArgs :: FrontendAction -> Code -> Args -> bool
buildASTFromCode :: Code -> ASTUnit
buildASTFromCodeWithArgs :: Code -> Args -> ASTUnit

4.2 On-disk code parsing

Real tool building with LibTooling starts by creating an instance of ClangTool, with a compilation database and an array of source files as parameters. The tool can then run any number of actions, each called a ToolAction.

run :: ToolAction -> ()
buildASTs :: [ASTUnit]

4.3 Command Line

Compilation databases are supported. In general, a compilation database specifies the commands used to compile each file. The commands can be given on the command line, or read from a file (typically through the -p option). There is of course a parser for this, called CommonOptionsParser.

_ :: Argc -> Argv -> CommonOptionsParser (Parser)
getCompilations :: Parser -> Db
getSourcePathList :: Parser -> [String]

Or, you can use the static functions to create the DB directly. I believe CommonOptionsParser is a wrapper around these. The CMD obtained from the DB contains the directory, filename, command line, and output, as expected.

loadFromDirectory :: BuildDir -> Db
loadFromFile :: FilePath -> Db
getAllFiles :: Db -> [String]
getCompileCommands :: Db -> String -> [CMD]

Once we have the compilation database, we basically know how to compile all the files in the project.

4.4 FrontendAction

The tool runs a front-end action. SyntaxOnlyAction (concrete) derives from ASTFrontendAction (abstract), which derives from FrontendAction (abstract). Typically, when we work on the AST, we create a class deriving from ASTFrontendAction and override its CreateASTConsumer method. The consumer it creates is then called on the AST.

The consumer derives from ASTConsumer and overrides HandleTranslationUnit. This function is called once the whole translation unit has been parsed, and it receives the ASTContext. The entry point of the AST is the topmost decl, obtained by Context.getTranslationUnitDecl().

You can handle the AST manually, but clang also provides a traversal helper class, RecursiveASTVisitor. You simply create an instance of your visitor and let it traverse the translation unit decl. The visitor itself implements what to do with each AST node: define a VisitXXX method for each type of AST node you care about.

Under the hood, the visitor calls TraverseXXX(x) to recursively visit the child nodes of x, and WalkUpFromXXX(x) to call VisitXXX for the class of x and each of its base classes. Returning false from any TraverseXXX, WalkUpFromXXX, or VisitXXX terminates the traversal. By default this is a pre-order traversal. Override shouldTraversePostOrder to return true to change it to post-order.

4.5 Type

The raw type is whatever appeared in the source code. If a type is a typedef of another type (possibly a pointer), then the raw type itself will not record the pointer information.

4.5.1 Canonical type

Every instance of type has a canonical type pointer.

  • If the type is a simple primitive type, the pointer points to itself
  • If any part of the type has typedef, the pointer will point to a type instance that is equivalent to it but without typedefs. You can check whether two types are the same by comparing this pointer.

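You should not use isa/cast/dyn_cast on types (e.g. isa<PointerType>(expr->getType())). The reason is that the type may not be canonical: a typedef of a pointer type is not itself a PointerType, so the check would miss it. Use the helper functions instead: expr->getType()->isPointerType().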

4.5.2 QualType

The type and its qualifiers (const, volatile, restrict) are separate. The combination of the two is the QualType. It is designed to be small and passed by value. It is essentially a pair of (Type*, bits) where the bits store the qualifiers.

This makes it possible to keep only one Type instance for each kind, shared by e.g. int, const int, and volatile const int.

const Type* getTypePtr() const;
const Type& operator*() const;
const Type* operator->() const;

SplitQualType split() const;
struct SplitQualType {
  const Type *Ty;
  Qualifiers Quals;
};

bool isCanonical();
QualType getCanonicalType() const;
bool isNull();

bool isConstQualified();
bool isVolatileQualified();
bool isRestrictQualified();
bool hasLocalQualifiers();
bool hasQualifiers();

Qualifiers getQualifiers();

QualType withConst();
QualType withVolatile();
QualType withRestrict();

void dump();
std::string getAsString();

static std::string getAsString(SplitQualType split);
static std::string getAsString(const Type *ty, Qualifiers qs);
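
To illustrate the (Type*, bits) idea without pulling in clang, here is a toy model. TypeModel and QualTypeModel are stand-ins I made up; they are not clang's classes, and clang's actual storage of the bits differs from this plain struct.

```cpp
#include <string>

// Toy stand-in for clang's Type: one instance per unqualified type.
struct TypeModel {
  std::string name;
};

// Qualifier bits, one bit each.
enum QualifierBit : unsigned { ConstBit = 1u, RestrictBit = 2u, VolatileBit = 4u };

// Toy stand-in for QualType: a small value holding a type pointer plus bits.
class QualTypeModel {
public:
  QualTypeModel(const TypeModel *type, unsigned quals = 0)
      : type_(type), quals_(quals) {}

  const TypeModel *getTypePtr() const { return type_; }
  bool isConstQualified() const { return (quals_ & ConstBit) != 0; }
  bool isVolatileQualified() const { return (quals_ & VolatileBit) != 0; }

  // Adding a qualifier copies the small value; the TypeModel is shared.
  QualTypeModel withConst() const { return QualTypeModel(type_, quals_ | ConstBit); }
  QualTypeModel withVolatile() const { return QualTypeModel(type_, quals_ | VolatileBit); }

  std::string getAsString() const {
    std::string text;
    if (isConstQualified()) text += "const ";
    if (isVolatileQualified()) text += "volatile ";
    return text + type_->name;
  }

private:
  const TypeModel *type_;
  unsigned quals_;
};
```

In this model, int, const int, and const volatile int are three different QualTypeModel values that all point at the same TypeModel.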

4.6 Clang AST

Declarations involve two parts: the Decl class and the DeclContext. From a Decl, you can get its context with getDeclContext. Decl supports getting the location and kind.

getLocStart :: () -> Loc
getLocEnd :: () -> Loc
getLocation :: () -> Loc
getKind :: () -> Kind

DeclContext is basically a container of declarations. It supports iterating over its child nodes. The classes deriving from it include BlockDecl, FunctionDecl, EnumDecl, RecordDecl, and TranslationUnitDecl. A note: in clang in general, an XXX_range provides two methods, begin and end.

decls :: DeclContext -> decl_range
decls_begin :: DeclContext -> decl_iterator
decls_end :: DeclContext -> decl_iterator

The most important class under Decl is NamedDecl, which has two main subclasses: ValueDecl and TypeDecl.

-- NamedDecl
getIdentifier :: NamedDecl -> IdInfo
getName :: NamedDecl -> String
-- ValueDecl
class NamedDecl => ValueDecl
getType :: ValueDecl -> Type
class ValueDecl => EnumConstantDecl
getInitVal :: EnumConstantDecl -> Int
class ValueDecl => DeclaratorDecl
class DeclaratorDecl => FunctionDecl
getReturnTypeSourceRange :: FunctionDecl -> SourceRange
getNameInfo :: FunctionDecl -> NameInfo
getBody :: FunctionDecl -> Stmt
parameters :: FunctionDecl -> [ParmVarDecl]
getReturnType :: FunctionDecl -> QualType

class DeclaratorDecl => FieldDecl
class DeclaratorDecl => VarDecl
isStaticLocal :: -> bool
hasExternalStorage :: -> bool
hasGlobalStorage :: -> bool
hasInit :: -> bool
getInit :: -> Expr
getStorageClass :: -> StorageClass
-- TypeDecl
class NamedDecl => TypeDecl
class TypeDecl => TypedefNameDecl
class TypedefNameDecl => TypedefDecl
class TypeDecl => TagDecl -- struct, union, enum
getKindName :: TagDecl -> String
getTagKind :: TagDecl -> Kind
class TagDecl => EnumDecl -- enum
enumerators :: EnumDecl -> Range
class TagDecl => RecordDecl -- struct, union
fields :: RecordDecl -> Range

Every Stmt has a children method, seemingly regardless of whether it can have a child. The most common classes are listed here.

Class          Methods
ReturnStmt     getRetValue
IfStmt         getInit, getCond, getThen, getElse
SwitchCase     getNextSwitchCase, getSubStmt
> CaseStmt     getLHS, getRHS
> DefaultStmt
SwitchStmt     getConditionVariable, getInit, getCond, getBody, getSwitchCaseList
LabelStmt      getDecl, getName, getSubStmt
GotoStmt       getLabel, getGotoLoc, getLabelLoc
DoStmt         getCond, getBody
ForStmt        getConditionVariable, getInit, getCond, getInc, getBody
WhileStmt      getConditionVariable, getCond, getBody
CompoundStmt   body
DeclStmt       decls

Expressions in Clang.

First, we can check the value category of the expression.

isLValue :: -> bool
isXValue :: -> bool
isGLValue :: -> bool

Class                 Methods                                              Note
CallExpr              getCallee, getArgs, getCallReturnType
BinaryOperator        getOpcode, getLHS, getRHS
CastExpr              getCastKind, getSubExpr
> ExplicitCastExpr
> ImplicitCastExpr
ParenExpr             getSubExpr                                           does NOT include conditionals
MemberExpr            getBase, getMemberDecl, getMemberNameInfo, isArrow
UnaryOperator         getOpcode, getSubExpr, isPrefix
DeclRefExpr           getDecl, getNameInfo                                 a reference to a declared variable, function, enum
ConditionalOperator   getCond, getTrueExpr, getFalseExpr                   the ?: ternary operator

ImplicitCastExpr appears very often because it represents many kinds of cast. For example

  • calling a function requires the cast FunctionToPointerDecay
  • using a value on the right-hand side requires the cast LValueToRValue

5 Lexer

Use lexer when you want to get the token level information, such as raw source code.

The getExpansionLoc family of methods handles macros. Without it, you will get a location in the file containing the macro definition.

There's a class called Rewriter, used as shown here. However, I saw what looked like a bug when generating the text of a TypedefNameDecl, where part of the result was seemingly binary data. It might cause other problems.

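  Rewriter rewriter;
  rewriter.setSourceMgr(src_mgr, LangOptions());
  // getRewrittenText returns a std::string; keeping it in a StringRef would
  // leave the StringRef pointing at a destroyed temporary
  std::string str = rewriter.getRewrittenText(range);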

The most reliable way is to use Lexer. It has a static method getSourceText that produces the text for a CharSourceRange. You can get a CharSourceRange from a SourceRange using CharSourceRange::getCharRange without trouble. However, note that clang has two concepts of source range: the token source range and the character source range. The token source range is what the AST typically uses, and each of its locations points at the start of a token. So a typical source range from the AST's perspective consists of the beginning locations of the first and last tokens. If you convert this directly to a character source range, you are guaranteed to miss the last token. The correct way is to first move the end location to the end of its token, using getLocForEndOfToken. Be careful not to call this twice.

The Lexer provides a method getAsCharRange that seems intended to do exactly this. There is probably a clang bug, though: Lexer::getAsCharRange calls getLocForEndOfToken, but when constructing the CharSourceRange, the end is adjusted by getLocWithOffset(-1), which causes getSourceText to miss one character.

end = Lexer::getLocForEndOfToken(end, 0, mgr, LangOptions());
CharSourceRange char_range = CharSourceRange::getCharRange(begin, end);
StringRef text = Lexer::getSourceText(char_range, mgr, LangOptions());
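
The difference between a token range and a character range can be shown with plain string offsets, without clang. In this toy model, offsets stand in for source locations, endOfToken is a made-up stand-in for Lexer::getLocForEndOfToken, and only identifiers, numbers, and single-character punctuation are handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// A token range: both offsets point at the first character of a token.
struct TokenRange {
  std::size_t begin;  // start of the first token
  std::size_t end;    // start of the last token
};

static bool isWordChar(char c) {
  return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
}

// Toy stand-in for getLocForEndOfToken: given the offset where a token
// starts, returns the offset one past the token's last character.
std::size_t endOfToken(const std::string &src, std::size_t begin) {
  if (begin >= src.size()) return begin;
  if (!isWordChar(src[begin])) return begin + 1;  // single-char punctuation
  std::size_t end = begin;
  while (end < src.size() && isWordChar(src[end])) ++end;
  return end;
}

// Naive conversion: treats the token range as a character range, so the
// text stops right before the last token.
std::string naiveText(const std::string &src, TokenRange range) {
  return src.substr(range.begin, range.end - range.begin);
}

// Correct conversion: move the end to the end of the last token first.
std::string tokenRangeText(const std::string &src, TokenRange range) {
  std::size_t end = endOfToken(src, range.end);
  return src.substr(range.begin, end - range.begin);
}
```

Calling endOfToken a second time on an already adjusted end would step onto the next token, which is why the adjustment should be applied exactly once.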

Author: Hebi Li

Created: 2018-01-04 Thu 21:28