Recap
In the previous blog post, we explored how a lexer (or lexical analyzer) transforms raw source code into a stream of tokens. This token stream forms the backbone of the compiler’s input for further processing. However, this flat sequence of categorized words and symbols doesn’t yet reveal the actual structure or meaning of the code.
For example, from the lexer we may receive something like:
[KEYWORD: claim] [IDENTIFIER: x] [OPERATOR: =] [NUMBER: 10] [DELIMITER: ;]
But how do we know what this really means? Is it a variable declaration? An assignment? A statement?
This is where the parser steps in — the second stage of compilation. Let’s explore how this crucial component transforms a stream of tokens into a structured representation of code.
What Is a Parser?
A parser (or syntax analyzer) is a program that takes the token stream produced by a lexer and turns it into a structured representation—most commonly an Abstract Syntax Tree (AST).
The Parser’s Job:
- It validates that the code follows the grammar of the language.
- It captures hierarchical structure (like nesting and operator precedence).
- It identifies meaningful syntactic constructs like expressions, statements, functions, loops, etc.
Let’s think of the compiler like a language teacher:
- The lexer highlights parts of speech (noun, verb, adjective).
- The parser turns those parts into grammatically correct sentences.
In technical terms, the parser consumes terminals (tokens) and builds non-terminals (higher-level constructs like expressions or statements), all while adhering to rules defined by the language’s context-free grammar (CFG).
Why Do We Need a Parser?
You might ask, “If we already have tokens, why not work with them directly?”
Here’s why the parser is essential:
1. Structure and Hierarchy
A program is not just a linear stream of tokens. There are nested expressions, conditionals, loops, etc. The parser builds this hierarchy so that the compiler can understand what goes where.
2. Grammar Enforcement
Parsers ensure code follows language rules:
- Does the if statement have a condition?
- Is every create function followed by parameters and a block?
- Are parentheses balanced?
3. Bridge to Semantics
Once the syntax is validated, we can move to semantic checks like:
- Is this variable declared?
- Are the types compatible?
- Is the function being called correctly?
4. Tooling Support
Parsers power features like:
- Syntax highlighting
- Autocompletion
- Real-time syntax error detection in IDEs
Types of Parsers
Top-Down Parsers
These parse from the top-level rules downward, starting from the start symbol.
Recursive Descent Parser
- Manual implementation
- One function per non-terminal
- Easy to write and debug
- Can’t handle left-recursive grammars
Example:
int parseExpr() {
    if (lookahead == NUMBER) return match(NUMBER);  // Expr -> number
    if (lookahead == LPAREN) {                      // Expr -> ( Expr )
        match(LPAREN);
        int value = parseExpr();
        match(RPAREN);
        return value;
    }
    return -1;  // neither alternative matched: syntax error (handling omitted)
}
Bottom-Up Parsers
These work by reducing tokens into non-terminals, working from the leaves (tokens) up to the root of the grammar tree.
Shift-Reduce Parsing
- Used by parser generators like Bison, Yacc
- Supports left recursion, complex rules
- Requires grammar to be defined in a .y file
- Produces fast and efficient parsers
Context-Free Grammars (CFGs)
A CFG is a set of production rules that describe how tokens can be combined into valid language constructs.
Example Grammar (for arithmetic expressions):
Expr -> Expr + Term | Term
Term -> Term * Factor | Factor
Factor -> ( Expr ) | number
This tells us:
- An expression is either an expression plus a term, or just a term.
- A term is either a term times a factor, or just a factor.
- A factor is either a number or a parenthesized expression.
Terminology:
- Terminals: Actual tokens (e.g., +, *, number, (, )).
- Non-terminals: Grammar symbols like Expr, Term, Factor.
- Production rule: How a non-terminal can be formed.
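For example, here is how 3 + 4 * 5 derives from this grammar, expanding one non-terminal at a time. Notice how the grammar itself forces * to nest inside the + expression — this is exactly where operator precedence comes from:
Expr
-> Expr + Term           (Expr -> Expr + Term)
-> Term + Term           (Expr -> Term)
-> Factor + Term         (Term -> Factor)
-> 3 + Term              (Factor -> number)
-> 3 + Term * Factor     (Term -> Term * Factor)
-> 3 + Factor * Factor   (Term -> Factor)
-> 3 + 4 * Factor        (Factor -> number)
-> 3 + 4 * 5             (Factor -> number)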
Parser Construction
There are two common approaches to building a parser:
Option 1: Manual Parser (Recursive Descent)
Best for simple grammars. Requires you to write functions for each rule.
Pros:
- Full control
- Great for learning
Cons:
- Tedious for complex languages
- Doesn’t handle ambiguity well
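To make Option 1 concrete, here is a self-contained C++ sketch that parses the arithmetic grammar from the CFG section (and, for brevity, evaluates as it goes instead of building a tree). The left-recursive rules are rewritten as loops — the standard workaround for the limitation noted earlier. All names here are illustrative:
#include <cctype>
#include <cstdio>
#include <string>

static std::string src;   // the input expression
static size_t cur = 0;    // current position; src[cur] acts as the lookahead

static char peek() { return cur < src.size() ? src[cur] : '\0'; }
static char advance() { return src[cur++]; }

static int parseExpr();   // forward declaration: the rules are mutually recursive

// Factor -> ( Expr ) | number
static int parseFactor() {
    if (peek() == '(') {
        advance();                 // consume '('
        int value = parseExpr();
        advance();                 // consume ')' (error handling omitted)
        return value;
    }
    int value = 0;
    while (std::isdigit(peek())) value = value * 10 + (advance() - '0');
    return value;
}

// Term -> Term * Factor | Factor, with the left recursion turned into a loop
static int parseTerm() {
    int value = parseFactor();
    while (peek() == '*') { advance(); value *= parseFactor(); }
    return value;
}

// Expr -> Expr + Term | Term, likewise iterative
static int parseExpr() {
    int value = parseTerm();
    while (peek() == '+') { advance(); value += parseTerm(); }
    return value;
}

int main() {
    src = "3+4*(2+1)";
    std::printf("%d\n", parseExpr());   // prints 15
}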
Option 2: Using Bison (Shift-Reduce Parser)
Bison is a tool that generates a parser from a .y grammar file.
How It Works:
- Tokens come from the lexer.
- Bison matches them against the rules.
- When a rule matches, it reduces the symbols and performs an action (e.g., build an AST node).
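To make “shift” and “reduce” concrete, here is a hand-worked trace of parsing 3 + 4 * 5 against the arithmetic grammar from the CFG section:
Stack                  Remaining input   Action
(empty)                3 + 4 * 5         shift 3
3                      + 4 * 5           reduce Factor -> number
Factor                 + 4 * 5           reduce Term -> Factor
Term                   + 4 * 5           reduce Expr -> Term
Expr                   + 4 * 5           shift +
Expr +                 4 * 5             shift 4
Expr + 4               * 5               reduce Factor -> number
Expr + Factor          * 5               reduce Term -> Factor
Expr + Term            * 5               shift * (the grammar makes * bind tighter)
Expr + Term *          5                 shift 5
Expr + Term * 5        (empty)           reduce Factor -> number
Expr + Term * Factor   (empty)           reduce Term -> Term * Factor
Expr + Term            (empty)           reduce Expr -> Expr + Term
Expr                   (empty)           accept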
Using Bison: Turning Grammar into a Parser
As your language grows in complexity — from simple expressions to nested conditionals, function calls, and user-defined types — writing your own recursive descent parser becomes cumbersome. That’s when Bison, a powerful parser generator, shines.
Bison takes in a high-level description of your language grammar and generates C/C++ code to parse that grammar, complete with conflict resolution, state management, and error handling.
Creating a .y Grammar File
The heart of a Bison-based parser is the .y file (often named something like parser.y or flare.y). This file is divided into three sections separated by %% lines: declarations, grammar rules, and additional C/C++ code (the last section may be left empty, as in the example below):
%{
// C/C++ header includes and declarations go here
#include <iostream>
using namespace std;
%}
%token NUMBER IDENTIFIER PLUS MINUS MULTIPLY DIVIDE
%%
expr: expr PLUS term
    | expr MINUS term
    | term
    ;

term: term MULTIPLY factor
    | term DIVIDE factor
    | factor
    ;

factor: NUMBER
    | IDENTIFIER
    | '(' expr ')'
    ;
%%
Let’s break this down.
1. Defining Terminal Tokens (Lexical Tokens)
All the tokens that the lexer returns must be declared using the %token directive. These include:
- Operators: PLUS, MINUS, MULTIPLY, DIVIDE
- Identifiers: IDENTIFIER, for variable and function names
- Literals: NUMBER, STRING, etc.
These names must match what your lexer (e.g., yylex) returns. If your lexer returns PLUS, then %token PLUS must be declared.
You can even attach a semantic value type to a token, and use character literals directly as tokens:
%token <int> NUMBER
%token '+' '-' '*' '/'
…but using symbolic names is clearer and better for extensibility.
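Note that the <int> form requires telling Bison how semantic values are stored. The classic mechanism is a %union declaration (a minimal sketch; we won’t actually need this until we attach values to AST nodes):
%union {
    int   ival;    /* value of a NUMBER token */
    char* sval;    /* text of an IDENTIFIER token */
}
%token <ival> NUMBER
%token <sval> IDENTIFIER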
2. Defining Non-Terminal Symbols
Non-terminals like expr, term, factor, etc., are defined implicitly in the grammar rules section. These represent composite language structures.
Each non-terminal corresponds to a rule or a combination of rules — and that’s how your language syntax is shaped.
3. Writing Grammar Rules
Grammar rules are written in the format:
non_terminal: expansion1
    | expansion2
    ;
Each expansion is made of tokens and/or other non-terminals, defining how a valid sequence of tokens can form that structure.
Example:
expr: expr PLUS term
    | term
    ;
This means expr can be either another expr followed by a PLUS and a term, or just a term.
Note: In this phase, we’re not embedding C++ actions (e.g., { $$ = new AddNode(…); }). We’ll add semantic actions when building the AST in the next part.
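As a taste of where this is heading, a rule for the claim declaration from the recap could look like this (a hypothetical sketch; it assumes the lexer returns CLAIM, ASSIGN, and SEMICOLON tokens for claim, =, and ;):
stmt: CLAIM IDENTIFIER ASSIGN expr SEMICOLON
    ;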
Common Pitfalls to Watch Out For
Many beginners face issues when using Bison for the first time. Here are some common ones and how to avoid them:
Left Recursion vs Right Recursion
- Left recursion (like expr: expr '+' term) is supported and preferred in Bison: it parses with constant stack space and naturally produces left-associative structures.
- Right recursion works too, but its stack usage grows with the length of the input.
Stick to left recursion unless you have a good reason otherwise.
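Concretely, both forms below accept the same comma-separated input; the left-recursive one is the shape Bison handles best (shown side by side for comparison — a real grammar would keep only one):
/* Left recursion: preferred (constant stack, left-associative) */
list: item
    | list ',' item
    ;

/* Right recursion: accepted, but stack depth grows with input length */
list: item
    | item ',' list
    ;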
Unresolved Tokens
If Bison complains about an “undeclared token,” check that your %token declarations match exactly with what your lexer returns.
Shift/Reduce and Reduce/Reduce Conflicts
If you see a warning like:
1 shift/reduce conflict
Don’t panic: Bison resolves shift/reduce conflicts automatically, preferring the shift by default. Still, you should:
- Re-express your grammar to eliminate ambiguity.
- Use precedence and associativity declarations if needed:
%left '+' '-'
%left '*' '/'
These resolve ambiguities like a + b * c correctly as a + (b * c).
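In fact, with these declarations in place you can write a deliberately ambiguous grammar and let the precedence table resolve every conflict — a common Bison idiom (a sketch):
%left '+' '-'
%left '*' '/'
%%
expr: expr '+' expr
    | expr '-' expr
    | expr '*' expr
    | expr '/' expr
    | '(' expr ')'
    | NUMBER
    ;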
Naming Consistency
Make sure your yylex() function returns the same token names declared in %token.
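For instance, a tiny hand-written yylex() might look like this. Including the Bison-generated header (see the build step in the next section) guarantees the names line up; this sketch assumes Bison’s default int semantic value type:
#include <cctype>
#include <cstdio>
#include "flare.tab.hpp"   // token codes generated by bison -d

int yylex() {
    int c = std::getchar();
    while (c == ' ' || c == '\t' || c == '\n') c = std::getchar();  // skip whitespace
    if (std::isdigit(c)) {
        std::ungetc(c, stdin);
        std::scanf("%d", &yylval);  // yylval is declared in the generated header
        return NUMBER;              // exactly the name declared with %token
    }
    switch (c) {
        case '+': return PLUS;
        case '-': return MINUS;
        case '*': return MULTIPLY;
        case '/': return DIVIDE;
        case EOF: return 0;         // 0 tells the parser the input is finished
    }
    return c;                       // single-char literals like '(' pass through
}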
Compiling the Grammar with Bison
Once your .y file is written, compile it using:
bison -d -o flare.tab.cpp flare.y
This will generate two files:
- flare.tab.cpp — the actual parser code
- flare.tab.hpp — header file with token definitions and parser declarations
The -d flag tells Bison to also generate the .hpp header, which is important for interfacing with the lexer and other components. (The -o flag gives the output a C++-friendly name; by default, bison -d flare.y would produce flare.tab.c and flare.tab.h.)
Integration into the Compiler Pipeline
Here’s how this fits into your compiler:
- Lexer scans raw code and returns tokens via yylex().
- Parser (using flare.tab.cpp) receives the token stream and verifies syntax.
- In this stage, it only checks grammar rules — no AST is built yet.
- Later, we’ll modify this to construct the AST directly in the .y file using semantic actions.
Typical call in main.cpp:
extern int yyparse();
yyparse();
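A slightly fuller driver also defines yyerror(), which the generated parser calls whenever it hits a syntax error. A minimal sketch (yyin assumes a Flex-style lexer that reads from a FILE*; adapt to however your lexer takes input):
#include <cstdio>

extern int yyparse();    // generated by Bison in flare.tab.cpp
extern FILE* yyin;       // assumption: a Flex-style lexer reading from yyin

// Bison-generated parsers call this on syntax errors; you must provide it.
void yyerror(const char* msg) {
    std::fprintf(stderr, "Parse error: %s\n", msg);
}

int main(int argc, char** argv) {
    if (argc > 1) yyin = std::fopen(argv[1], "r");
    return yyparse();    // returns 0 if the input parsed successfully
}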
And link everything together:
g++ -o flare main.cpp flare.tab.cpp lexer.cpp
By using Bison, you offload the hard work of managing parser states, precedence, and conflict resolution, freeing yourself to focus on semantics and code generation. In the next part, we’ll extend the .y file to build an AST directly as parsing happens.
Let’s build smart — not from scratch.
Parsing a Simple Expression Language
Input Code:
3 + 4 * (2 - 1)
Token Stream:
[NUMBER: 3] [OP: +] [NUMBER: 4] [OP: *] [LPAREN] [NUMBER: 2] [OP: -] [NUMBER: 1] [RPAREN]
AST Built by Parser:
    +
   / \
  3   *
     / \
    4   -
       / \
      2   1
The parser’s job is to make sure the order of operations (precedence and associativity) is respected. In this case, * and - bind tighter than +. In the next post, we will see how to modify the grammar to build these AST nodes in C++.
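Just to visualize what such a tree could look like in memory, here is one minimal C++ sketch (purely illustrative; the actual node design is part of the next post):
#include <memory>
#include <string>

// Hypothetical node shape for the diagram above.
struct Node {
    std::string op;                     // "+", "*", "-" for inner nodes
    int value = 0;                      // used only by number leaves
    std::unique_ptr<Node> left, right;

    static std::unique_ptr<Node> num(int v) {
        auto n = std::make_unique<Node>();
        n->value = v;
        return n;
    }
    static std::unique_ptr<Node> bin(std::string o, std::unique_ptr<Node> l,
                                     std::unique_ptr<Node> r) {
        auto n = std::make_unique<Node>();
        n->op = std::move(o);
        n->left = std::move(l);
        n->right = std::move(r);
        return n;
    }
};

// The tree for 3 + 4 * (2 - 1), built by hand:
// auto ast = Node::bin("+", Node::num(3),
//                      Node::bin("*", Node::num(4),
//                                Node::bin("-", Node::num(2), Node::num(1))));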
Integration With Compiler Pipeline
Once parsing is complete, we have a structured tree (AST) that represents the source code.
From here:
- Semantic Analysis: Check types, declarations, scope.
- Code Generation: Translate the AST into machine code, bytecode, or LLVM IR.
Simplified Pipeline:
Source Code
↓
Lexer → Tokens
↓
Parser → AST
↓
Semantic Analyzer
↓
Code Generator
Summary
- A parser transforms flat token streams into structured syntax trees.
- It uses context-free grammar to guide this transformation.
- Parsers can be written by hand (top-down) or generated using tools like Bison (bottom-up).
- Parsing enables semantic analysis and code generation by exposing hierarchical structure.
Understanding parsing gives you the ability to define how your language works—and that’s the heart of compiler design.
Next Steps
In the next post, we’ll build the AST, laying the groundwork for:
- Variable and function resolution
- Type checking
- Scope management
This will lay the foundation for generating optimized and type-safe code.
