A derivation of a string for a grammar is a sequence of grammar rule applications that transform the start symbol into the string. A derivation proves that the string belongs to the grammar's language. (Source)
A strategy is followed that deterministically determines the next nonterminal to rewrite:
- in a leftmost derivation, it is always the leftmost nonterminal;
- in a rightmost derivation, it is always the rightmost nonterminal. (Source)
- Also known as parse tree
- Output of a parser
- Complete representation of the input in the leafs
- Names of Rules in the interior nodes
- How it is matched
- Also known as syntax tree
- Output of a parser
- Free of inessential nodes
- e.g. grouping parentheses are implicit in the tree structure
- Incomplete representation of the input in the leafs
- Names of Rules in the interior nodes
- How it is matched
Front end
- Preprocessing
- Lexical analysis (lexing)
- Syntax analysis (parsing)
- leads to an CST or AST
- Semantic analysis
- Generation of intermediate representation
Middle end
- Analysis
- Target independent optimizations
Back end
- Target specific optimizations
- Code generation or interpretation
- Assembly of object files (in case the generated code belongs to an assembly language)
- Linking
Often an existing toolchain for a language such as C/C++ is used as middle and back end. In this case the front end would transpile the custom language into C/C++.
This approach might provide a simple solution in case the custom language is close to C/C++. Otherwise the mapping might turn out to be complex (e.g. if memory management by garbage collection is required), which is a downside of the approach.
This problem is a motivation for the LLVM project.
Top-down parser
first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar Top-down parsing can be viewed as an attempt to find left-most derivations (Source)
Bottom-up parser
recognizes the text's lowest-level small details first, before its mid-level structures, and leaving the highest-level overall structure to last. (Source)
Shift-reduce parser
- Kind of bottom-up parser
Recursive descent parser
- Kind of top-down parser
- Uses mutually-recursive functions rather than tables
- Lower complexity (in many cases, no parser generator is required)
- Faster
Recursive ascent parser
- Kind of bottom-up parser
- Uses mutually-recursive functions rather than tables
- Lower complexity (in many cases, no parser generator is required)
- Faster
Table based parser
- Lower memory footprint
- Higher complexity (usually parser generator is required)
Packrat parser
Motivation
- Placing restrictions on a context-free language grammar keeps runtime complexity at bay
LL(k)
- Top-Down
- Processes input left to right
- Leftmost derivation
- k is a lookahead to amke a decision
LL
- Synonym for LL(1)
LL(*)
- LL(k) which is not restricted to a finite number k
LF(k)
- Also known as strong LL-parser
- Top-Down
LL(k)
- Top-Down
- Processes input left to right
- Leftmost derivation
- k is a lookahead
LR(k)
- Bottom-Up-Parser
- Processes input left to right
- Rightmost derivation
- k is a lookahead to amke a decision
LR
- Synonym for LR(1)
LALR(1)
- Look-Ahead LR parser
-
Also known as compiler compiler
-
Usually supports generation of a lexer and parser based on a given grammar
-
Among the most popular generators are Flex+Bison (LALR) and ANTLR (LL(*))
- Generates a lexer
- Supported target language is C++
License
Tool | License |
---|---|
Lex | various |
Flex | BSD |
- Generates a parser
- Supported target language is C++
- Table driven (smaller memory footprint)
License
Tool | License |
---|---|
Yacc | various |
Bison | GPL |
- Generates a lexer, parser, syntax analyzer, visitor (tree walker) and/or listener based on a given LL(*) grammer (
.g4
format) - Recursive descent parser
- Requires Java Runtime
- Supported target languages: Java, C#, Python, JavaScript, Go, C++, Swift
- An IDE tool, called ANTLRworks, is available
- May be used along with StringTemplates as shown in this article
References
- Let's build a compiler - wir entwickeln praxisnah einen Compiler mit Java und ANTLR4
- Antlr4 - Visitor vs Listener Pattern
- Supports labels for rules (
rule # label
) - Supports labels for rule elements (
label=token
)
Resources
Feature | Visitors | Listeners |
---|---|---|
Explicit call required to include children | Yes | No |
Use a return value | Yes | No |
Support for large parse trees | No | Yes |
References
Follow instuctions on antlr.org
antlr4 -o src Demo.g4
javac -cp lib/antlr.jar src/*.java
grun src/Demo expr
1+2*3
Emmit EOF (Linux: Ctrl+D, win: Ctrl+Z)
antlr4 -o src -Dlanguage=Cpp -package theNamespace Demo.g4
Resources
- Initially started by Chris Latner
- Motivated by the way JIT compilers worked at the time, which performed optimization during run-time. This would therefore be repeatet each time the application was executed unless the result is cached. Thus the working hypothesis was to move as much of the optimization to compile-time.
- Since typically programs are optimized by manipulating an intermediate representation (IR), the term Low Level Virtual Machine (LLVM) was coined.
- Since then LLVM developed to an umbrella project, featuring A core project as well as other subprojects concerned with the development of the
clang
front-end and various tools
Resources
- Optimizer
- JIT
- GC
The IL can be provided in 2 representations:
- human readable ascii representation
- binary bitcode representation
Bidirectional transformation between representations is possible using the following commands:
llvm-as
(transforms IL in ascii representation to bitcode representation)llvm-dis
(transforms IL in bitcode representation to ascii representation)
Language aspects:
;
introduces a line comment- identifiers are prefixed according to their nature:
@
for global identifiers (functions, global variables)%
for local identifiers (register names, types)$
for comdats!
for metadata#
for attribute groups^
for a entry of a module summary
- Pointer types are speficied by
<type> *
- Array types are speficied by
[ <size> x <type> ]
- Vector types are speficied by
< <size> x <type> >
- Structure types are specified by either
{ <type list> }
(regular) or<{ <type list> }>
(packed)
In order to compile for a specific target additional information must be provided:
- a datalayout which specifies
- Byte order (endianness)
- Type specific allocation sizes
- Type specific memory alignments
- Style of name mangling
- …
- a target triple which specifies
- Architecture
- Vendor
- Operating system
- optional: Environment
- The source filename
A module may also specify flags for upcoming stages in the toolchain such as e.g. linker options and magic lobal variables.
- A front-end that outputs the IL is required
- This may be achieved by utilization of the core libraries (C++) whilch provide classes such as e.g. an
IRBuilder
- This may be achieved by utilization of the core libraries (C++) whilch provide classes such as e.g. an
References
Commands (excerpt):
llvm-link
(linker for IL in bitcode representation)lli
(executes IL in bitcode representation using JIT compilation)lldb
(debugger)opt
(optimizer)llc
(compiles IL in any representation to assembly code)
See also LLVM Command Guide for a full listing
Build from source: Requirements: Make, CMake, Compiler Repository: https://github.com/llvm/llvm-project
Pre-build: http://releases.llvm.org/download.html
Based on multiple passes of different types:
- Module Pass
- Call Graph Pass
- Function Pass
- Basic Block Pass
- Immutable Pass, Region Pass, MachineFunction Pass
Within each of the mentioned passes:
- Analysis Pass
- Transform Pass
clang++ -S -emit-llvm source.cpp
Observations
- IL uses SSA
opt -dot-cfg-only input.ll > /dev/null
xdot cfg.main.dot
#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;
namespace {
struct DemoPass : public FunctionPass {
static char ID;
DemoPass() : FunctionPass(ID) {}
bool runOnFunction(Function &F) override {
errs().write_escaped(F.getName()) << '\n';
return false;
}
}
}
char DemoPass::ID = 0;
static RegisterPass<Hello> X("name_of_the_pass",
"Description of the pass",
false,
false);
opt -load path/to/DemoPass.so -name_of_the_pass < input.bc
- LLVM Compiler Infrastructure
- LLVM Tutorial: Table of Contents
- LLVM
- LLVM for Grad Students
- Introduction to LLVM Building simple program analysis tools and instrumentation
- YOW! 2016 Erik Corry - Building Your Own Compiler The Slightly Easier Way With LLVM
- My First LLVM Compiler
Build demo
- Install ANTLR into
demo
directory - Install LLVM from source
- Build demo using the following commands
cd demo
cmake .
cmake --build .