ASTImporter: Merging Clang ASTs

The ASTImporter class is part of Clang’s core library, the AST library. It imports nodes of an ASTContext into another ASTContext.

In this document, we assume basic knowledge about the Clang AST. See the Introduction to the Clang AST if you want to learn more about how the AST is structured. Knowledge about matching the Clang AST and the reference for the matchers are also useful.

Introduction

ASTContext holds long-lived AST nodes (such as types and decls) that can be referred to throughout the semantic analysis of a file. In some cases it is preferable to work with more than one ASTContext. For example, we may want to parse several different files inside the same Clang tool. It would be convenient to view the set of resulting ASTs as if they were one AST produced by parsing all the files together. ASTImporter provides a way to copy types or declarations from one ASTContext to another. We refer to the context from which we import as the “from” context or source context, and the context into which we import as the “to” context or destination context.

Existing clients of the ASTImporter library are Cross Translation Unit (CTU) static analysis and the LLDB expression parser. CTU static analysis imports a definition of a function if its definition is found in another translation unit (TU). This way the analysis can break out of the single TU limitation. LLDB’s expr command parses a user-defined expression, creates an ASTContext for it, and then imports the missing definitions from the AST that we got from the debug information (DWARF, etc).

Algorithm of the import

Importing one AST node copies that node into the destination ASTContext. Why do we have to copy the node? Isn’t it enough to insert the pointer to that node into the destination context? One reason is that the two contexts may have different lifetimes: if the “from” context is destroyed before the “to” context, the inserted pointer would dangle. Also, the Clang AST considers nodes (or certain properties of nodes) equivalent if they have the same address!

The import algorithm has to ensure that structurally equivalent nodes in different translation units are not duplicated in the merged AST. E.g. if we include the definition of the vector template (#include <vector>) in two translation units, then their merged AST should have only one node which represents the template. We also have to detect one definition rule (ODR) violations, for instance when both translation units contain a class definition with the same name, but one of the definitions contains a different number of fields. So, we look up existing definitions, and then we check the structural equivalency of those nodes. The following pseudo-code demonstrates the basics of the import mechanism:

// Pseudo-code(!) of import:
ErrorOrDecl Import(Decl *FromD) {
  Decl *ToDecl = nullptr;
  FoundDeclsList = Look up all Decls in the "to" Ctx with the same name of FromD;
  for (auto FoundDecl : FoundDeclsList) {
    if (StructurallyEquivalentDecls(FoundDecl, FromD)) {
      ToDecl = FoundDecl;
      Mark FromD as imported;
      break;
    } else {
      Report ODR violation;
      return error;
    }
  }
  if (FoundDeclsList is empty) {
    Import dependent declarations and types of FromD;
    ToDecl = create a new AST node in "to" Ctx;
    Mark FromD as imported;
  }
  return ToDecl;
}
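
To make the control flow of the pseudo-code concrete, here is a minimal, self-contained sketch of the same lookup-then-create logic on a toy model. It does not use the Clang API: ToyDecl, ToyContext, ToyImporter, ToyStatus and ToyImportResult are hypothetical names invented for this illustration, the toy assumes at most one declaration per name, and "structural equivalence" is reduced to comparing the field lists.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record declaration: a name plus
// (field name, field type) pairs in definition order.
struct ToyDecl {
  std::string Name;
  std::vector<std::pair<std::string, std::string>> Fields;
};

// Hypothetical stand-in for an ASTContext: it owns its declarations.
struct ToyContext {
  std::vector<std::unique_ptr<ToyDecl>> Decls;

  // Name lookup; the toy assumes at most one declaration per name.
  ToyDecl *lookup(const std::string &Name) const {
    for (const auto &D : Decls)
      if (D->Name == Name)
        return D.get();
    return nullptr;
  }
};

enum class ToyStatus { Ok, NameConflict };

struct ToyImportResult {
  ToyStatus Status;
  ToyDecl *To; // null when Status is NameConflict
};

class ToyImporter {
  ToyContext &ToCtx;
  // Maps each already imported "from" node to its "to" counterpart.
  std::map<const ToyDecl *, ToyDecl *> Imported;

public:
  explicit ToyImporter(ToyContext &Ctx) : ToCtx(Ctx) {}

  ToyImportResult import(const ToyDecl &FromD) {
    // A node that was imported before maps to the same "to" node again.
    auto It = Imported.find(&FromD);
    if (It != Imported.end())
      return {ToyStatus::Ok, It->second};

    if (ToyDecl *Found = ToCtx.lookup(FromD.Name)) {
      // Same name: reuse the node if equivalent, otherwise report a conflict.
      if (Found->Fields != FromD.Fields)
        return {ToyStatus::NameConflict, nullptr};
      Imported[&FromD] = Found;
      return {ToyStatus::Ok, Found};
    }

    // Nothing with this name yet: copy the node into the "to" context.
    ToCtx.Decls.push_back(std::make_unique<ToyDecl>(FromD));
    ToyDecl *Created = ToCtx.Decls.back().get();
    assert(ToCtx.lookup(FromD.Name) == Created);
    Imported[&FromD] = Created;
    return {ToyStatus::Ok, Created};
  }
};
```

Importing two equivalent ToyDecl objects named MyClass into an empty ToyContext yields a single node in the destination, while a third one with a different field list is rejected with ToyStatus::NameConflict. The real importer additionally has to import the dependencies of the node and to handle many kinds of declarations, which this sketch leaves out.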

Two AST nodes are structurally equivalent if they are

  • builtin types and refer to the same type, e.g. int and int are structurally equivalent,
  • function types and all their parameters have structurally equivalent types,
  • record types and all their fields in order of their definition have the same identifier names and structurally equivalent types,
  • variable or function declarations and they have the same identifier name and their types are structurally equivalent.

We could extend the definition of structural equivalency to templates similarly.
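
The rules for builtin, function and record types can be sketched as a recursive comparison. The following is only a toy model under stated assumptions, not Clang’s implementation: ToyType, ToyField, ToyTypePtr, the make* helpers and structurallyEquivalent are hypothetical names, function types compare only their parameters (as in the list above), and recursive types (e.g. a record that refers to itself) are not supported.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct ToyType;
using ToyTypePtr = std::shared_ptr<const ToyType>;

struct ToyField {
  std::string Name;
  ToyTypePtr Type;
};

// Hypothetical type node; only the members matching K are meaningful.
struct ToyType {
  enum Kind { Builtin, Function, Record };
  Kind K = Builtin;
  std::string BuiltinName;        // used when K == Builtin
  std::vector<ToyTypePtr> Params; // used when K == Function
  std::vector<ToyField> Fields;   // used when K == Record
};

ToyTypePtr makeBuiltin(std::string Name) {
  auto T = std::make_shared<ToyType>();
  T->K = ToyType::Builtin;
  T->BuiltinName = std::move(Name);
  return T;
}

ToyTypePtr makeFunction(std::vector<ToyTypePtr> Params) {
  auto T = std::make_shared<ToyType>();
  T->K = ToyType::Function;
  T->Params = std::move(Params);
  return T;
}

ToyTypePtr makeRecord(std::vector<ToyField> Fields) {
  auto T = std::make_shared<ToyType>();
  T->K = ToyType::Record;
  T->Fields = std::move(Fields);
  return T;
}

bool structurallyEquivalent(const ToyType &A, const ToyType &B) {
  if (A.K != B.K)
    return false;
  switch (A.K) {
  case ToyType::Builtin:
    // Builtin types are equivalent if they refer to the same type.
    return A.BuiltinName == B.BuiltinName;
  case ToyType::Function:
    // Function types: all parameters have structurally equivalent types.
    if (A.Params.size() != B.Params.size())
      return false;
    for (std::size_t I = 0; I < A.Params.size(); ++I) {
      assert(A.Params[I] && B.Params[I]);
      if (!structurallyEquivalent(*A.Params[I], *B.Params[I]))
        return false;
    }
    return true;
  case ToyType::Record:
    // Record types: same field names, in order, with equivalent types.
    if (A.Fields.size() != B.Fields.size())
      return false;
    for (std::size_t I = 0; I < A.Fields.size(); ++I) {
      assert(A.Fields[I].Type && B.Fields[I].Type);
      if (A.Fields[I].Name != B.Fields[I].Name ||
          !structurallyEquivalent(*A.Fields[I].Type, *B.Fields[I].Type))
        return false;
    }
    return true;
  }
  return false;
}
```

Note that equivalence is decided by structure, not by address: two separately created "int" nodes compare equal, which is exactly what is needed when the two nodes live in different contexts.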

If A and B are AST nodes and A depends on B, then we say that A is a dependant of B and B is a dependency of A. The words “dependant” and “dependency” are nouns in British English. Unfortunately, in American English, the adjective “dependent” is used for both meanings. In this document, the adjective “dependent” always refers to the dependencies, that is, the B node in the example.

API

Let’s create a tool which uses the ASTImporter class! First, we build two ASTs from virtual files; the content of the virtual files is synthesized from string literals:

std::unique_ptr<ASTUnit> ToUnit = buildASTFromCode(
    "", "to.cc"); // empty file
std::unique_ptr<ASTUnit> FromUnit = buildASTFromCode(
    R"(
    class MyClass {
      int m1;
      int m2;
    };
    )",
    "from.cc");

The first AST corresponds to the destination (“to”) context, which is empty, and the second to the source (“from”) context. Next, we define a matcher to match MyClass in the “from” context:

auto Matcher = cxxRecordDecl(hasName("MyClass"));
auto *From = getFirstDecl<CXXRecordDecl>(Matcher, FromUnit);

Now we create the Importer and do the import:

ASTImporter Importer(ToUnit->getASTContext(), ToUnit->getFileManager(),
                     FromUnit->getASTContext(), FromUnit->getFileManager(),
                     /*MinimalImport=*/true);
llvm::Expected<Decl *> ImportedOrErr = Importer.Import(From);

The Import call returns an llvm::Expected, so we must check for any error. Please refer to the error handling documentation for details.

if (!ImportedOrErr) {
  llvm::Error Err = ImportedOrErr.takeError();
  llvm::errs() << "ERROR: " << Err << "\n";
  consumeError(std::move(Err));
  return 1;
}

If there’s no error then we can get the underlying value. In this example we will print the AST of the “to” context.

Decl *Imported = *ImportedOrErr;
Imported->getTranslationUnitDecl()->dump();

Since we requested minimal import in the constructor of the importer, the AST will not contain the declarations of the members when we run the test tool.

TranslationUnitDecl 0x68b9a8 <<invalid sloc>> <invalid sloc>
`-CXXRecordDecl 0x6c7e30 <line:2:7, col:13> col:13 class MyClass definition
  `-DefinitionData pass_in_registers standard_layout trivially_copyable trivial literal
    |-DefaultConstructor exists trivial needs_implicit
    |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
    |-MoveConstructor exists simple trivial needs_implicit
    |-CopyAssignment trivial has_const_param needs_implicit implicit_has_const_param
    |-MoveAssignment exists simple trivial needs_implicit
    `-Destructor simple irrelevant trivial needs_implicit

We’d like to get the members too, so we use ImportDefinition to copy the whole definition of MyClass into the “to” context. Then we dump the AST again.

if (llvm::Error Err = Importer.ImportDefinition(From)) {
  llvm::errs() << "ERROR: " << Err << "\n";
  consumeError(std::move(Err));
  return 1;
}
llvm::errs() << "Imported definition.\n";
Imported->getTranslationUnitDecl()->dump();

This time the AST is going to contain the members too.

TranslationUnitDecl 0x68b9a8 <<invalid sloc>> <invalid sloc>
`-CXXRecordDecl 0x6c7e30 <line:2:7, col:13> col:13 class MyClass definition
  |-DefinitionData pass_in_registers standard_layout trivially_copyable trivial literal
  | |-DefaultConstructor exists trivial needs_implicit
  | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
  | |-MoveConstructor exists simple trivial needs_implicit
  | |-CopyAssignment trivial has_const_param needs_implicit implicit_has_const_param
  | |-MoveAssignment exists simple trivial needs_implicit
  | `-Destructor simple irrelevant trivial needs_implicit
  |-CXXRecordDecl 0x6c7f48 <col:7, col:13> col:13 implicit class MyClass
  |-FieldDecl 0x6c7ff0 <line:3:9, col:13> col:13 m1 'int'
  `-FieldDecl 0x6c8058 <line:4:9, col:13> col:13 m2 'int'

We can skip the call to ImportDefinition if we set up the importer to do a “normal” (not minimal) import.

ASTImporter Importer( ....  /*MinimalImport=*/false);

With normal import, all dependent declarations are imported together with their definitions. With minimal import, however, the dependent Decls are imported without their definitions, and we have to import the definition of each one separately if we need it later.

Putting this all together, here is what the source of the tool looks like:

#include "clang/AST/ASTImporter.h"
#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/Tooling/Tooling.h"

using namespace clang;
using namespace tooling;
using namespace ast_matchers;

template <typename Node, typename Matcher>
Node *getFirstDecl(Matcher M, const std::unique_ptr<ASTUnit> &Unit) {
  auto MB = M.bind("bindStr"); // Bind the to-be-matched node to a string key.
  auto MatchRes = match(MB, Unit->getASTContext());
  // We should have at least one match.
  assert(MatchRes.size() >= 1);
  // Get the first matched and bound node.
  Node *Result =
      const_cast<Node *>(MatchRes[0].template getNodeAs<Node>("bindStr"));
  assert(Result);
  return Result;
}

int main() {
  std::unique_ptr<ASTUnit> ToUnit = buildASTFromCode(
      "", "to.cc");
  std::unique_ptr<ASTUnit> FromUnit = buildASTFromCode(
      R"(
      class MyClass {
        int m1;
        int m2;
      };
      )",
      "from.cc");
  auto Matcher = cxxRecordDecl(hasName("MyClass"));
  auto *From = getFirstDecl<CXXRecordDecl>(Matcher, FromUnit);

  ASTImporter Importer(ToUnit->getASTContext(), ToUnit->getFileManager(),
                       FromUnit->getASTContext(), FromUnit->getFileManager(),
                       /*MinimalImport=*/true);
  llvm::Expected<Decl *> ImportedOrErr = Importer.Import(From);
  if (!ImportedOrErr) {
    llvm::Error Err = ImportedOrErr.takeError();
    llvm::errs() << "ERROR: " << Err << "\n";
    consumeError(std::move(Err));
    return 1;
  }
  Decl *Imported = *ImportedOrErr;
  Imported->getTranslationUnitDecl()->dump();

  if (llvm::Error Err = Importer.ImportDefinition(From)) {
    llvm::errs() << "ERROR: " << Err << "\n";
    consumeError(std::move(Err));
    return 1;
  }
  llvm::errs() << "Imported definition.\n";
  Imported->getTranslationUnitDecl()->dump();

  return 0;
}

We may extend the CMakeLists.txt under, say, clang/tools with the build and link instructions:

add_clang_executable(astimporter-demo ASTImporterDemo.cpp)
clang_target_link_libraries(astimporter-demo
  PRIVATE
  LLVMSupport
  clangAST
  clangASTMatchers
  clangBasic
  clangFrontend
  clangSerialization
  clangTooling
  )

Then we can build and execute the new tool.

$ ninja astimporter-demo && ./bin/astimporter-demo

Errors during the import process

Normally, either the source or the destination context contains the definition of a declaration. However, there may be cases when both contexts have a definition for a given symbol. If these definitions differ, then we have a name conflict; in C++ this is known as an ODR (one definition rule) violation. Let’s modify the tool we wrote previously and try to import a ClassTemplateSpecializationDecl with a conflicting definition:

int main() {
  std::unique_ptr<ASTUnit> ToUnit = buildASTFromCode(
      R"(
      // primary template
      template <typename T>
      struct X {};
      // explicit specialization
      template<>
      struct X<int> { int i; };
      )",
      "to.cc");
  ToUnit->enableSourceFileDiagnostics();
  std::unique_ptr<ASTUnit> FromUnit = buildASTFromCode(
      R"(
      // primary template
      template <typename T>
      struct X {};
      // explicit specialization
      template<>
      struct X<int> { int i2; };
      // field mismatch:  ^^
      )",
      "from.cc");
  FromUnit->enableSourceFileDiagnostics();
  auto Matcher = classTemplateSpecializationDecl(hasName("X"));
  auto *From = getFirstDecl<ClassTemplateSpecializationDecl>(Matcher, FromUnit);
  auto *To = getFirstDecl<ClassTemplateSpecializationDecl>(Matcher, ToUnit);

  ASTImporter Importer(ToUnit->getASTContext(), ToUnit->getFileManager(),
                       FromUnit->getASTContext(), FromUnit->getFileManager(),
                       /*MinimalImport=*/false);
  llvm::Expected<Decl *> ImportedOrErr = Importer.Import(From);
  if (!ImportedOrErr) {
    llvm::Error Err = ImportedOrErr.takeError();
    llvm::errs() << "ERROR: " << Err << "\n";
    consumeError(std::move(Err));
    To->getTranslationUnitDecl()->dump();
    return 1;
  }
  return 0;
}

When we run the tool, we get the following warning:

to.cc:7:14: warning: type 'X<int>' has incompatible definitions in different translation units [-Wodr]
      struct X<int> { int i; };
             ^
to.cc:7:27: note: field has name 'i' here
      struct X<int> { int i; };
                          ^
from.cc:7:27: note: field has name 'i2' here
      struct X<int> { int i2; };
                        ^

Note that to get these diagnostics we had to call enableSourceFileDiagnostics on the ASTUnit objects.

Since we could not import the specified declaration (From), we get an error in the return value. The AST does not contain the conflicting definition, so we are left with the original AST.

ERROR: NameConflict
TranslationUnitDecl 0xe54a48 <<invalid sloc>> <invalid sloc>
|-ClassTemplateDecl 0xe91020 <to.cc:3:7, line:4:17> col:14 X
| |-TemplateTypeParmDecl 0xe90ed0 <line:3:17, col:26> col:26 typename depth 0 index 0 T
| |-CXXRecordDecl 0xe90f90 <line:4:7, col:17> col:14 struct X definition
| | |-DefinitionData empty aggregate standard_layout trivially_copyable pod trivial literal has_constexpr_non_copy_move_ctor can_const_default_init
| | | |-DefaultConstructor exists trivial constexpr needs_implicit defaulted_is_constexpr
| | | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
| | | |-MoveConstructor exists simple trivial needs_implicit
| | | |-CopyAssignment trivial has_const_param needs_implicit implicit_has_const_param
| | | |-MoveAssignment exists simple trivial needs_implicit
| | | `-Destructor simple irrelevant trivial needs_implicit
| | `-CXXRecordDecl 0xe91270 <col:7, col:14> col:14 implicit struct X
| `-ClassTemplateSpecialization 0xe91340 'X'
`-ClassTemplateSpecializationDecl 0xe91340 <line:6:7, line:7:30> col:14 struct X definition
  |-DefinitionData pass_in_registers aggregate standard_layout trivially_copyable pod trivial literal
  | |-DefaultConstructor exists trivial needs_implicit
  | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
  | |-MoveConstructor exists simple trivial needs_implicit
  | |-CopyAssignment trivial has_const_param needs_implicit implicit_has_const_param
  | |-MoveAssignment exists simple trivial needs_implicit
  | `-Destructor simple irrelevant trivial needs_implicit
  |-TemplateArgument type 'int'
  |-CXXRecordDecl 0xe91558 <col:7, col:14> col:14 implicit struct X
  `-FieldDecl 0xe91600 <col:23, col:27> col:27 i 'int'

Error propagation

If a node has a dependency that must be imported first, and the import of that dependency fails, then the import error associated with the dependency propagates to the dependant node. Let’s modify the previous example and import a FieldDecl instead of the ClassTemplateSpecializationDecl.

auto Matcher = fieldDecl(hasName("i2"));
auto *From = getFirstDecl<FieldDecl>(Matcher, FromUnit);

In this case we can see (by calling getImportDeclErrorIfAny) that an error is associated with the specialization too, not just with the field:

llvm::Expected<Decl *> ImportedOrErr = Importer.Import(From);
if (!ImportedOrErr) {
  llvm::Error Err = ImportedOrErr.takeError();
  consumeError(std::move(Err));

  // check that the ClassTemplateSpecializationDecl is also marked as
  // erroneous.
  auto *FromSpec = getFirstDecl<ClassTemplateSpecializationDecl>(
      classTemplateSpecializationDecl(hasName("X")), FromUnit);
  assert(Importer.getImportDeclErrorIfAny(FromSpec));
  // Btw, the error is also set for the FieldDecl.
  assert(Importer.getImportDeclErrorIfAny(From));
  return 1;
}
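
The propagation rule itself can be illustrated without the Clang API. The sketch below is a toy model under stated assumptions: DepNode and PropagatingImporter are hypothetical names, the dependency graph is assumed to be acyclic, and a "conflict" is simply declared up front by name instead of being detected by a structural comparison.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical "from" node with the nodes it depends on.
struct DepNode {
  std::string Name;
  std::vector<const DepNode *> Dependencies;
};

class PropagatingImporter {
  std::set<std::string> ConflictingNames;        // imports declared to fail
  std::map<const DepNode *, std::string> Errors; // "from" node -> error text
  std::set<const DepNode *> Done;                // successfully imported

public:
  explicit PropagatingImporter(std::set<std::string> Conflicts)
      : ConflictingNames(std::move(Conflicts)) {}

  // Returns true on success. Assumes an acyclic dependency graph.
  bool import(const DepNode &From) {
    if (Errors.count(&From))
      return false;
    if (Done.count(&From))
      return true;
    // Dependencies are imported first.
    for (const DepNode *Dep : From.Dependencies) {
      assert(Dep);
      if (!import(*Dep)) {
        // The error of the dependency propagates to the dependant.
        std::string DepError = Errors.at(Dep);
        Errors[&From] = DepError;
        return false;
      }
    }
    if (ConflictingNames.count(From.Name)) {
      Errors[&From] = "NameConflict";
      return false;
    }
    Done.insert(&From);
    return true;
  }

  // Returns the recorded error, or null if the node has none.
  const std::string *getImportErrorIfAny(const DepNode &From) const {
    auto It = Errors.find(&From);
    return It == Errors.end() ? nullptr : &It->second;
  }
};
```

If the field i2 is modelled as a DepNode that depends on a conflicting node for X<int>, importing the field fails and both nodes end up with the same recorded error, which mirrors the two assertions in the example above.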

Polluted AST

We may recognize an error during the import of a dependent node. However, by that time, we have already created the dependant. In these cases we do not remove the existing erroneous node from the “to” context; instead, we associate an error with that node. Let’s extend the previous example with another class Y. This class has a forward declaration in the “to” context, but its definition is in the “from” context. We’d like to import the definition, but it contains a member function that uses X<int>, whose definition conflicts with the one in the “to” context:

std::unique_ptr<ASTUnit> ToUnit = buildASTFromCode(
    R"(
    // primary template
    template <typename T>
    struct X {};
    // explicit specialization
    template<>
    struct X<int> { int i; };

    class Y;
    )",
    "to.cc");
ToUnit->enableSourceFileDiagnostics();
std::unique_ptr<ASTUnit> FromUnit = buildASTFromCode(
    R"(
    // primary template
    template <typename T>
    struct X {};
    // explicit specialization
    template<>
    struct X<int> { int i2; };
    // field mismatch:  ^^

    class Y { void f() { X<int> xi; } };
    )",
    "from.cc");
FromUnit->enableSourceFileDiagnostics();
auto Matcher = cxxRecordDecl(hasName("Y"));
auto *From = getFirstDecl<CXXRecordDecl>(Matcher, FromUnit);
auto *To = getFirstDecl<CXXRecordDecl>(Matcher, ToUnit);

This time we create a shared_ptr for ASTImporterSharedState, which owns the associated errors for the “to” context. Note that there may be several different ASTImporter objects which import into the same “to” context but from different “from” contexts; they should share the same ASTImporterSharedState. (Also note that we have to include the corresponding ASTImporterSharedState.h header file.)

auto ImporterState = std::make_shared<ASTImporterSharedState>();
ASTImporter Importer(ToUnit->getASTContext(), ToUnit->getFileManager(),
                     FromUnit->getASTContext(), FromUnit->getFileManager(),
                     /*MinimalImport=*/false, ImporterState);
llvm::Expected<Decl *> ImportedOrErr = Importer.Import(From);
if (!ImportedOrErr) {
  llvm::Error Err = ImportedOrErr.takeError();
  consumeError(std::move(Err));

  // ... but the node had been created.
  auto *ToYDef = getFirstDecl<CXXRecordDecl>(
      cxxRecordDecl(hasName("Y"), isDefinition()), ToUnit);
  ToYDef->dump();
  // An error is set for "ToYDef" in the shared state.
  Optional<ImportError> OptErr =
      ImporterState->getImportDeclErrorIfAny(ToYDef);
  assert(OptErr);

  return 1;
}

If we take a look at the AST, we can see that the Decl with the definition is created, but the member function f is missing.

|-CXXRecordDecl 0xf66678 <line:9:7, col:13> col:13 class Y
`-CXXRecordDecl 0xf66730 prev 0xf66678 <:10:7, col:13> col:13 class Y definition
  |-DefinitionData pass_in_registers empty aggregate standard_layout trivially_copyable pod trivial literal has_constexpr_non_copy_move_ctor can_const_default_init
  | |-DefaultConstructor exists trivial constexpr needs_implicit defaulted_is_constexpr
  | |-CopyConstructor simple trivial has_const_param needs_implicit implicit_has_const_param
  | |-MoveConstructor exists simple trivial needs_implicit
  | |-CopyAssignment trivial has_const_param needs_implicit implicit_has_const_param
  | |-MoveAssignment exists simple trivial needs_implicit
  | `-Destructor simple irrelevant trivial needs_implicit
  `-CXXRecordDecl 0xf66828 <col:7, col:13> col:13 implicit class Y

We do not remove the erroneous nodes because by the time we recognize the error it is too late to remove the node: there may already be additional references to it in the AST. This is aligned with the overall design principle of the Clang AST: Clang AST nodes (types, declarations, statements, expressions, and so on) are generally designed to be immutable once created. Thus, clients of the ASTImporter library should always check whether there is an error associated with the node they inspect in the destination context. We recommend skipping the processing of those nodes which have an error associated with them.
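
The recommended client-side check can be sketched as a simple filter. This is a toy model, not the Clang API: ToNode, ToySharedState and declsSafeToProcess are hypothetical names, and the only thing borrowed from the real library is the idea of a shared table that maps nodes of the “to” context to import errors.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node that lives in the "to" context.
struct ToNode {
  std::string Name;
};

// Hypothetical counterpart of the shared importer state: it records
// which "to" nodes were created by an import that later failed.
class ToySharedState {
  std::map<const ToNode *, std::string> Errors;

public:
  void setImportDeclError(const ToNode *To, std::string Error) {
    assert(To);
    Errors[To] = std::move(Error);
  }

  // Returns the recorded error, or null if the node has none.
  const std::string *getImportDeclErrorIfAny(const ToNode *To) const {
    auto It = Errors.find(To);
    return It == Errors.end() ? nullptr : &It->second;
  }
};

// Keeps only the nodes without an associated import error.
std::vector<const ToNode *>
declsSafeToProcess(const std::vector<const ToNode *> &ToDecls,
                   const ToySharedState &State) {
  std::vector<const ToNode *> Result;
  for (const ToNode *D : ToDecls)
    if (!State.getImportDeclErrorIfAny(D))
      Result.push_back(D);
  return Result;
}
```

The erroneous node stays in the list of nodes of the “to” context (it is never removed), but a client that filters through declsSafeToProcess never looks at it.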

Using the -ast-merge Clang front-end action

The -ast-merge <pch-file> command-line switch can be used to merge from the given serialized AST file. This file represents the source context. When this switch is present, each top-level AST node of the source context is merged into the destination context. If the merge is successful, ASTConsumer::HandleTopLevelDecl is called for the Decl. As a result, we can execute the original front-end action on the extended AST.

Example for C

Let’s consider the following three files:

// bar.h
#ifndef BAR_H
#define BAR_H
int bar();
#endif /* BAR_H */

// bar.c
#include "bar.h"
int bar() {
  return 41;
}

// main.c
#include "bar.h"
int main() {
    return bar();
}

Let’s generate the AST files for the two source files:

$ clang -cc1 -emit-pch -o bar.ast bar.c
$ clang -cc1 -emit-pch -o main.ast main.c

Then, let’s check what the merged AST would look like if we consider only the bar() function:

$ clang -cc1 -ast-merge bar.ast -ast-merge main.ast /dev/null -ast-dump
TranslationUnitDecl 0x12b0738 <<invalid sloc>> <invalid sloc>
|-FunctionDecl 0x12b1470 </path/bar.h:4:1, col:9> col:5 used bar 'int ()'
|-FunctionDecl 0x12b1538 prev 0x12b1470 </path/bar.c:3:1, line:5:1> line:3:5 used bar 'int ()'
| `-CompoundStmt 0x12b1608 <col:11, line:5:1>
|   `-ReturnStmt 0x12b15f8 <line:4:3, col:10>
|     `-IntegerLiteral 0x12b15d8 <col:10> 'int' 41
|-FunctionDecl 0x12b1648 prev 0x12b1538 </path/bar.h:4:1, col:9> col:5 used bar 'int ()'

We can see that the prototype of the function and its definition are merged into the same redeclaration chain. What’s more, a third prototype declaration is merged into the chain. The functions are merged in such a way that prototypes are added to the redecl chain if they refer to the same type, but there can be only one definition. The first two declarations are from bar.ast, the third is from main.ast.
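
The merging rule for functions can be modelled in a few lines. The sketch below is a simplification under stated assumptions and is not how Clang represents redeclaration chains: ToyFunctionDecl, ToyRedeclChain and MergeResult are hypothetical names, types are compared as plain strings, and a second definition is simply not added to the chain.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical function declaration; HasBody marks a definition.
struct ToyFunctionDecl {
  std::string Name;
  std::string Type; // e.g. "int ()"
  bool HasBody;
};

enum class MergeResult { Added, ReusedDefinition, TypeMismatch };

// Hypothetical redeclaration chain of one function.
class ToyRedeclChain {
  std::vector<ToyFunctionDecl> Decls;

public:
  bool hasDefinition() const {
    for (const ToyFunctionDecl &D : Decls)
      if (D.HasBody)
        return true;
    return false;
  }

  std::size_t size() const { return Decls.size(); }

  MergeResult merge(const ToyFunctionDecl &From) {
    if (!Decls.empty()) {
      assert(Decls.front().Name == From.Name && "one chain per function name");
      // Only declarations that refer to the same type join the chain.
      if (Decls.front().Type != From.Type)
        return MergeResult::TypeMismatch;
    }
    // The chain may contain at most one definition.
    if (From.HasBody && hasDefinition())
      return MergeResult::ReusedDefinition;
    Decls.push_back(From);
    return MergeResult::Added;
  }
};
```

Merging the prototype and the definition of bar from bar.ast and then the prototype from main.ast produces a chain of three declarations with one definition, which matches the shape of the dump above.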

Now, let’s create an object file from the merged AST:

$ clang -cc1 -ast-merge bar.ast -ast-merge main.ast /dev/null -emit-obj -o main.o

Next, we may call the linker and execute the created binary file.

$ clang -o a.out main.o
$ ./a.out
$ echo $?
41
$

Example for C++

In the case of C++, the generation of the AST files and the way we invoke the front-end are a bit different. Assuming we have these three files:

// foo.h
#ifndef FOO_H
#define FOO_H
struct foo {
    virtual int fun();
};
#endif /* FOO_H */

// foo.cpp
#include "foo.h"
int foo::fun() {
  return 42;
}

// main.cpp
#include "foo.h"
int main() {
    return foo().fun();
}

We shall generate the AST files, merge them, create the executable and then run it:

$ clang++ -x c++-header -o foo.ast foo.cpp
$ clang++ -x c++-header -o main.ast main.cpp
$ clang++ -cc1 -x c++ -ast-merge foo.ast -ast-merge main.ast /dev/null -ast-dump
$ clang++ -cc1 -x c++ -ast-merge foo.ast -ast-merge main.ast /dev/null -emit-obj -o main.o
$ clang++ -o a.out main.o
$ ./a.out
$ echo $?
42
$