Clang
Table of Contents
1 Use As Command
-ast-dump
- dump ast
-ast-dump-filter
- filter to only dump part of the AST
-ast-list
- list ast nodes
clang -Xclang -ast-dump -fsyntax-only a.c clang -emit-ast a.c clang-check -ast-list lib/parser.cpp | grep AddValue clang-check a.cc -ast-dump -ast-dump-filter=StdStringA --
2 Project Setup
This mainly talks about CMake configuration.
# llvm find_package(LLVM REQUIRED CONFIG) message(STATUS "Found LLVM ${LLVM_PACKAGE_VERSION}") message(STATUS "Using LLVMCOnfig.cmake in: ${LLVM_DIR}") add_definitions(${LLVM_DEFINITIONS}) include_directories(${LLVM_INCLUDE_DIRS}) set(LLVM_LINK_COMPONENTS support) # clang find_package(Clang REQUIRED CONFIG) # linking link_libraries(clang clangTooling clangFrontend clangFrontendTool) link_libraries(libclang gtest)
2.1 Compilation Database
To create a compilation_commands.json
from a Makefile, use
bear make
To create from a cmake project, use
cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=ON
The command line option --
at the end to invoke the tool will not
trying to find compilation database. Use -p BuildDir
to read
database from the folder.
3 libclang
The document is simply the doxygen page: http://clang.llvm.org/doxygen/group__CINDEX.html
This library is nice to parse the file (given the command line
arguments), get the AST, and traverse it. During the traversal, you
can get the kind of the AST node, the underline tokens, and raw
text. However, you cannot get the AST class, i.e. you cannot get a
IfStmt, and call its then
method. But this is enough for dumping the
AST into a plain text file, right?
3.1 Source Location
In this stage, there are three important classes: the cursor and the token, and location.
A source location can be generated from line and column number pair.
-- line,columne -> location clang_getLocation :: CXTranslationUnit, CXFile, Line, Column -> CXSourceLocation clang_getRange :: CXSourceLocation, CXSourceLocation -> CXSourceRange -- location -> line,column (#line is respected) clang_getPresumedLocation :: CXSourceLocation, FileName -> Line, Column -- predicates clang_Location_isInSystemHeader :: CXSourceLocation -> Int clang_Location_isFromMainFile :: CXSourceLocation -> Int
- cursor <-> location
clang_getCursor :: CXTranslationUnit, CXSourceLocation -> CXCursor clang_getCursorLocation :: CXCursor -> CXSourceLocation clang_getCursorExtent :: CXCursor -> CXSourceRange
- token <-> location
clang_tokenize :: CXTranslationUnit, CXSourceRange -> CXToken -- get location clang_getTokenLocation :: CXTranslationUnit, CXToken -> CXSourceLocation clang_getTokenExtent :: CXTranslationUnit, CXToken -> CXSourceRange
- token -> cursor: you cannot get token directly from a cursor
-- roughly equivalent to clang_getCursor() with the source range clang_annotateTokens :: CXTranslationUnit, CXToken -> CXCursor
Finally, you might want to know the conversion from SourceLocation to line,column pairs. It is done through pretty strange methods.
- context->getFullLoc(loc) -> FullSourceLoc full
- full.getSpellingLinenumber
getFullLoc :: ASTContext -> SourceLoc -> FullSourceLoc getSpellingLineNumber :: FullSourceLoc -> Unsigned
SourceManager
can be used to get the main file ID.
getMainFileID :: SourceManager -> FileID
3.2 Translation unit manipulation
When using libclang, the first thing is to parse the file, i.e. create a translation unit from a file, using some command line options. The first seems to be better, but I don't think they have difference.
clang_parseTranslationUnit :: FileName, CmdArgs -> CXTranslationUnit clang_createTranslationUnitFromSourceFile :: FileName, CmdArgs -> CXTranslationUnit
It can also be created from an AST file. It is typically emitted by
-emit-ast
, but can also be generated using the
clang_saveTranslationUnit
API. This is a binary file, and seems not
so interesting to me
clang_createTranslationUnit :: ASTFileName -> CXTranslationUnit
There's also utility to get the inclusion files.
// this call visitor function on each included file void clang_getInclusions (CXTranslationUnit tu, CXInclusionVisitor visitor, CXClientData client_data) typedef void (*CXInclusionVisitor) (CXFile included_file, CXSourceLocation *inclusion_stack, unsigned include_len, CXClientData client_data)
libclang supports reading Compilation Database.
-- read database file clang_CompilationDatabase_fromDirectory :: BuildDir -> Cb -- get commands by file clang_CompilationDatabase_getCompileCommands :: (Db, Filename) -> CMDs -- get all commands clang_CompilationDatabase_getAllCompileCommands :: Db -> CMDs clang_CompileCommands_getCommand :: (CMDs, Int) -> CMD -- get the 3 components clang_CompileCommand_getDirectory :: CMD -> CXString clang_CompileCommand_getFilename :: CMD -> CXString clang_CompileCommand_getNumArgs :: CMD -> Unsigned clang_CompileCommand_getArg :: (CMD, Int) -> CXString
3.3 Token manipulation
You can get tokens from translation unit. The text and kind of token is available.
-- get text clang_getTokenSpelling :: (CXTranslationUnit, CXToken) -> CXString -- kind can be: CXToken_ prefixed Punctuation, Keyword, Identifier, Literal, Comment clang_getTokenKind CXToken -> CXTokenKind
3.4 Cursors
3.4.1 Traversing
unsigned clang_visitChildren (CXCursor parent, CXCursorVisitor visitor, CXClientData client_data) typedef enum CXChildVisitResult(* CXCursorVisitor) (CXCursor cursor, CXCursor parent, CXClientData client_data)
Clearly the visitor is a function, accepting the cursor, parent, and optionally some data (it is void*), and return a result indicating continue traversal or not (in my case I would want to stop at the expression level, for example). The result can have three values:
CXChildVisit_Break
: terminateCXChildVisit_Continue
: to sibling, without visiting children (skipping children)CXChildVisit_Recurse
: depth first for children
3.4.2 type information
These are the type (e.g. float, typedef) for the a cursor. Important
ones include (prefixed with CXType_
): Void
, Bool
, Short
,
Int
, Long
, Float
, Double
, Record
, Enum
, Typedef
.
clang_getCursorType :: CXCursor -> CXType -- pretty print the type clang_getTypeSpelling :: CXType -> CXString -- type conversion clang_getCanonicalType :: CXType -> CXType clang_getTypedefName :: CXType -> CXString clang_getPointeeType :: CXType -> CXType clang_getTypeDeclaration :: CXType -> CXCursor -- predicates clang_isConstQualifiedType :: CXType -> Unsigned clang_isVolatileQualifiedType :: CXType -> Unsigned clang_isRestrictQualifiedType :: CXType -> Unsigned -- for function type clang_getResultType :: CXType -> CXType clang_getNumArgTypes :: CXType -> Int clang_getArgType CXType -> Unsigned -> CXType -- array clang_getElementType :: CXType -> CXType clang_getNumElements :: CXType -> Long Long clang_getArrayElementType :: CXType -> CXType clang_getArraySize :: CXType -> Long Long
3.4.3 manipulation
clang_getTranslationUnitCursor :: (CXTranslationUnit) -> CXCursor clang_Cursor_getTranslationUnit :: (CXCursor) -> CXTranslationUnit -- cursor kinds can be, e.g. -- CXCursor_VarDecl, CXCursor_IfStmt clang_getCursorKind :: (CXCursor) -> enum CXCursorKind -- some predicates clang_isDeclaration :: (enum CXCursorKind) -> unsigned clang_isReference :: (enum CXCursorKind) -> unsigned clang_isExpression :: (enum CXCursorKind) -> unsigned clang_isStatement :: (enum CXCursorKind) -> unsigned -- seems to be related to C++ namespace clang_getCursorSemanticParent :: (CXCursor cursor) -> CXCursor clang_getCursorLexicalParent :: (CXCursor cursor) -> CXCursor -- the cursor must be a include directive clang_getIncludedFile :: (CXCursor cursor) -> CXFile
4 LibTooling
4.1 In-memory code parsing
LIbTooling can be used to parse code in memory or disk. The in-memory
code parsing seems to support less setup (i.e. what command line
arguments to use), and is intended to test initial results. It is
invoked through a function runToolOnCode
, with the code as string
and an action. The function has several variations.
runToolOnCode :: FrontendAction -> Code -> bool runToolOnCodeWithArgs :: FrontendAction -> Code -> Args -> bool buildASTFromCode :: Code -> ASTUnit buildASTFromCodeWithArgs :: Code -> Args -> ASTUnit
4.2 On-disk code parsing
The real tool building of LibTooling starts by creating an instance of
ClangTool
, with compilation database and an array of source files as
parameters. The tool can than run any number of actions called
ToolAction
.
run :: ToolAction -> () buildASTs :: [ASTUnit]
4.3 Command Line
Compilation database is supported. In general, a compilation data base
specifies what are the commands used for the files to compile. This
can be specified in the command line, or read from a file (typically
through a -p
option). There's of course a parser for it, called
CommonOptionsParser
.
_ :: Argc -> Argv -> CommonOptionParser (Parser) getCompilations :: Parser -> Db getSourcePathList :: Parser -> [String]
Or, you can use the static functions to create the DB directly. I
believe this is a wrapper around CommonOptionsParser
. The CMD got
from the it contains directory, filename, command line, output, as
expected.
loadFromDirectory :: BuildDir -> Db loadFromFile :: FilePath -> Db getAllFiles :: Db -> [String] getCompileCommands :: Db -> String -> [CMD]
Once we got the compilation database, we basically knows how to compile all the files in the project.
4.4 FrontendAction
The Tool would run on some front-end action. FrontendAction
(abstract) derives ASTFrontendAction
(abstract) derives
SyntaxOnlyAction
(concrete). Typically, when we work on AST, we
create a class deriving from ASTFrontendAction
, and overwrite its
CreateASTConsumer
interface. The created consumer is called on the
AST.
The Consumer would derive from ASTConsumer
and override
HandleTranslationUnit
. Inside this function, we got the Translation
Unit. This function is called when the whole translation unit is
parsed. This provides the entry point of the AST by the top most
decl by Context.getTranslationUnitDecl()
.
You can handle the AST manually, but clang also provides a visitor
traversal helper class RecursiveASTVisitor
. You simply create a new
instance of the visitor, and let it visit the translation unit decl.
The visitor itself implement what to do with each AST node. Override
the list of VisitXXX
method for each type of AST node.
Under the hood, the visitor will automatically call WalkUpFromXXX(x)
to recursively visit child nodes of x returning false of TraverseXXX
or WalkUpFromXXX
will terminate the traversal. By default this will
be a pre-order traversal. Calling a method to change to post-order.
4.5 Type
The raw type will be whatever appeared in the source code. If a type is a typedef to another type (may be pointer), then the "type" will not record the pointer information.
4.5.1 canonical type
Every instance of type has a canonical type pointer.
- If the type is a simple primitive type, the pointer points to itself
- If any part of the type has typedef, the pointer will point to a type instance that is equivalent to it but without typedefs. You can check whether two types are the same by comparing this pointer.
You should not use isa/cast/dyncast on types
(e.g. isa<PointerType>(expr->getType())
). The reason is it is not
canonical. So use help functions instead:
expr->getType()->isPointerType()
.
4.5.2 QualType
The type and its qualifiers (const, volatile, restrict) are seperate. That is the QualType. It is designed to be small and pass-by-value. It is essentially a pair of (Type*, bits) where the bits stores the qualifiers.
This helps making only one type for each kind, e.g. int, const int, volatile const int.
const Type* getTypePtr() const; const Type& operator*() const; const Type* operator->() const; SplitQualType split() const; class SplitQualType { public: const Type *Ty; Qualifiers Quals; }; bool isCanonical(); QualType getCanonicalType() const; bool isNull(); bool isConstQualified(); bool isVolatileQualified(); bool isRestrictQualified(); bool hasLocalQualifiers(); bool hasQualifiers(); Qualifiers getQualifiers(); QualType withConst(); QualType withVolatile(); QualType withRestrict(); void dump(); std::string getAsString(); static std::string getAsString(SplitQualType split); static std::string getAsString(const Type *ty, Qualifiers qs);
4.6 Clang AST
Declarations contains two parts, the Decl
class, and the
DeclContext
. From the Decl, you can get the context by
getDeclContext
. Decl
supports getting location and kind.
getLocStart :: () -> Loc getLocEnd :: () -> Loc getLocation :: () -> Loc getKind :: () -> Kind
DeclContext
basically is a block of statements. It provides support
of iterating children nodes. Thus the classes deriving from it
includes: BlockDecl
, FunctionDecl
, EnumDecl
, RecordDecl
,
TranslationUnitDecl
. Some notes: in clang in general, the
XXX_range
will provide two method: begin
and end
.
decls :: DeclContext -> decl_range decls_begin :: DeclContext -> decl_iterator decls_end :: DeclContext -> decl_iterator
The most important class under Decl
is NamedDecl
, containing two
main classes: ValueDecl
and TypeDecl
.
-- NamedDecl getIdentifier :: NamedDecl -> IdInfo getName :: NamedDecl -> String -- ValueDecl class NamedDecl => ValueDecl getType :: ValueDecl -> Type class ValueDecl => EnumConstantDecl getInitVal :: EnumConstantDecl -> Int class ValueDecl => DeclaratorDecl class DeclaratorDecl => FunctionDecl getReturnTypeSourceRange :: FunctionDecl -> SourceRange getNameInfo :: FunctionDecl -> NameInfo getBody :: FunctionDecl -> Stmt parameters :: FunctionDecl -> ParmVarDecl getReturnType :: FunctionDecl -> QualType class DeclaratorDecl => FieldDecl class DeclaratorDecl => VarDecl isStaticLocal :: -> bool hasExternalStorage :: -> bool hasGlobalStorage :: -> bool hasInit :: -> bool getInit :: -> Expr getStorageClass :: -> StorageClass -- TypeDecl class NamedDecl => TypeDecl class TypeDecl => TypedefNameDecl class TypeDecl => TypedefDecl class TypeDecl => TagDecl getKindName :: TagDecl -> String getTagKind :: TagDecl -> Kind class TagDecl => EnumDecl -- struct, union, enum enumerators :: EnumDecl -> Range class TagDecl => RecordDecl -- struct, union fields :: RecordDecl -> Range
Every Stmt
has children
method, seemingly regardless of whether it
can have a child. Very common classes here.
Class | method | ||||
---|---|---|---|---|---|
BreakStmt | |||||
ReturnStmt | getRetValue | ||||
ContinueStmt | |||||
IfStmt | getInit | getCond | getThen | getElse | |
SwitchCase | getNextCase | getSubStmt | |||
> CaseStmt | getLHS | getRHS | |||
> DefaultStmt | |||||
SwitchStmt | getCondVar | getInit | getCond | getBody | getCaseList |
LabelStmt | getDecl | getName | getSubStmt | ||
GotoStmt | getLabel | getGotoLoc | getLabelLoc | ||
DoStmt | getCond | getBody | |||
ForStmt | getCondVar | getInit | getCond | getInc | getBody |
WhileStmt | getCondVar | getCond | getBody | ||
CompoundStmt | body | ||||
DeclStmt | decls |
Expressions in Clang.
First, we can check the value type of the expression.
isLValue :: -> bool isXValue :: -> bool isGLValue :: -> bool
Class | Method | Note | |||
---|---|---|---|---|---|
CallExpr | getCallee | getArgs | getReturnType | ||
BinaryOperator | getOpcode | getLHS | getRHS | ||
CastExpr | getCastKind | getSubExpr | |||
> ExplicitCastExpr | |||||
> ImplicitCastExpr | |||||
ParenExpr | getSubExpr | does NOT include conditionals | |||
MemberExpr | getBase | getMemberDecl | getNameInfo | isArrow | |
UnaryOperator | getOpcode | getSubExpr | isPrefix | ||
DeclRefExpr | getDecl | getNameInfo | A reference to a declared variable, function, enum | ||
ConditionalOperator | getCond | getTrueExpr | getFalseExpr | ?: ternary operator. |
ImplicitCastExpr appears very often because it represent many type of cast. For example
- call a function needs to use the cast FunctionToPointerDecay
- use a value in the righ hand side will need the cast LValueToRValue
5 Lexer
Use lexer when you want to get the token level information, such as raw source code.
The getExpansionLocation
family can support macro, otherwise you
will get a location in the macro definition file.
There's a class called Rewriter
. It is used like this. However, this
will have bug when generating a TypedefNameDecl
, resulting in part
of seemingly binary data. It might cause other problems.
Rewriter rewriter; rewriter.setSourceMgr(src_mgr, LangOptions()); StringRef str = rewriter.getRewrittenText(range);
The most reliable way is to use Lexer. It has a static method
getSourceText that generate text from a CharSourceRange
. You can get
CharSourceRange
from SourceRange
using getCharRange
method, no
problem. However, note that clang has two concepts of source range:
the token source range and the character source range. Token source
range is typically used in AST, and meant to be the start of
token. So a typical source range in AST's perspective will be the
beginning locations of the begin and end token. When converting this
directly to char source range, you are guaranteed to miss the last
token. The correct way is to first convert the end location to the
end of the token, using getLocForEndOfToken
. Be careful that don't
call this twice.
The Lexer provides a method getAsCharRange
that seems to want to do
exactly this. There is probably a clang bug though. In
Lexer::getAsCharRange
, the getLocForEndOfToken
is called, but when
constructing CharSourceRange
, the end is adjusted by
getLocWithOffset(-1)
, and cause the getSourceText to miss one
character.
end = Lexer::getLocForEndOfToken(end, 0, mgr, LangOptions()); CharSourceRange char_range = CharSourceRange::getCharRange(begin, end); StringRef text = Lexer::getSourceText(char_range, mgr, LangOptions());
6 Reference
- A article as tutorial: http://bastian.rieck.ru/blog/posts/2016/baby_steps_libclang_function_extents/
- a repo of samples: https://github.com/eliben/llvm-clang-samples