Parser Generation with Flex and Bison

Data Structures and Operations

// se_lib.h

#ifndef __SE_LIB__
#define __SE_LIB__

#include <cstddef>

typedef enum tag_e {
  TAG_SYMBOL,
  TAG_LIST
} tag_t;

typedef char const* symbol_t;

typedef struct list_s {
  struct expr_s* hd;
  struct list_s* tl;
} list_t;

typedef union body_u {
  symbol_t symbol;
  list_t* list;
} body_t;

typedef struct expr_s
{
  tag_t tag;
  body_t body;
} expr_t;

symbol_t make_symbol (char const* s, size_t len);
void free_symbol (symbol_t symbol);

expr_t* make_expr_symbol (symbol_t symbol);
expr_t* make_expr_list (list_t* list);
void free_expr (expr_t* expr);

list_t* make_list_nil (void);
list_t* make_list_cons (expr_t* hd, list_t* tl);
void free_list(list_t* list);

#endif
// se_lib.c

#include "se_lib.h"

symbol_t make_symbol (char const* s, size_t len)
{
  char* result = malloc (sizeof (char)*(len+1));
  strncpy (result, s, len);
  result[len] = '\0';
  return result;
}

void free_symbol (symbol_t symbol)
{
  free (symbol);
}

expr_t* make_expr_symbol (symbol_t symbol)
{
  expr_t* result = malloc (sizeof (expr_t));
  *result = { TAG_SYMBOL, { .symbol = symbol } };
  return result;
}

expr_t* make_expr_list (list_t* list)
{
  expr_t* result = malloc (sizeof (expr_t));
  *result = { TAG_LIST, { .list = list } };
  return result;
}

void free_expr (expr_t* expr)
{
  switch (expr->tag)
    {
    case TAG_SYMBOL :
      free_symbol (expr->body.symbol);
      break;
    case TAG_LIST :
      free_list (expr->body.list);
      break;
    };
    free (expr);
}

list_t* make_list_nil (void)
{
  return NULL;
}

list_t* make_list_cons (expr_t* hd, list_t* tl)
{
  list_t* result = malloc (sizeof (list_t));
  *result = { hd, tl };
  return result;
}

void free_list(list_t* list)
{
  if (NULL != list)
    {
      free_expr (list->hd);
      free_list (list->tl);
      free (list);
    }
}

Bison Parser

A Bison parser follows a grammar outline which consists of

  1. declarations

  2. grammar rules

  3. epilogue

These three parts are separated by a double-percent %%.

Pure Declaration

To define a pure (reentrant) parser we provide a pure declaration in the declaration part of the Bison parser.

%define api.pure full

Error Reporting

For Bison to perform error reporting we need to provide an error reporting function.

%code requires {
#include <cstdio>
void yyerror (char const* s);
}
void yyerror (char const* s)
{
  fprintf (stderr, "%s\n", s);
  abort ();
}

Node Types

The goal of a Bison parser is to generate an abstract syntax tree. Every node in that abstract syntax tree has a type. So, in the Bison declaration part of the Bison parser, we need to introduce these types. We do so using a union declaration. The union declaration consists of a list of semicolon ;-separated pairs. Left is the C name for the node type. Right is the Bison name for the node type.

%code requires {
#include "se_lib.h"
}
%union {
  expr_t*  expr;
  list_t*  list;
  symbol_t symbol;
}

This union declaration introduces three node types: expr, list, and symbol. We use these node types to declare the types for both terminal and non-terminal symbols which should appear in the abstract syntax tree.

Token Declarations

%token          LPAREN
%token          RPAREN
%token <symbol> ID

Syntax Definition

The core of our parser definition is the syntax in BNF.

e      : ID
       | LPAREN e_list RPAREN
       ;

e_list :
       | e e_list
       ;

The syntax definition consists of productions terminated by a semicolon ;. A production has a head and a body. Head and body are separated by a colon :. The head introduces a symbol. The body details how the symbol can be produced. The body itself consists of symbols. We distinguish between terminal and non-terminal symbols. We write terminal symbols in upper-case. Each terminal symbol represents a scanner token. Terminal symbols appear only in production bodies but never as a production head. We write non-terminal symbols in lower-case and terminal symbols in upper-case. Every non-terminal symbol (except the root non-terminal symbol) creates an obligation for the parser. We need a production for every non-terminal symbol except the root symbol.

Token Enumeration

In contrast, every terminal symbols creates an obligation for the scanner. In the parser we need a token declaration for every terminal symbol. Bison allows us to declare tokens via a token declaration.

Herein, we leave all tokens we intend to discard untagged, but tag all tokens which we want to keep with a data type.

%token LPAREN
%token RPAREN
%token <char const*> ID

Semantic Values

Next, we need to create a data structure for semantic values. The parser emits such a data structure whenever it encounters a match for a symbol. Bison allows us to introduce a data structure for semantic values via a union declaration.

%union semantic_value {
  char const* id_value;
  list const* list_value;
}

Symbol Types

type declaration

Now that we have a data structure to hold a semantic value, we need to extend the syntax definition with actions.

e      : ID                    { $$ = $1 }
       | LPAREN e_list RPAREN  { $$ = $1 }
       ;

e_list :
       | e e_list              { $$ = $1 }
       ;

Using Bison

// se.yy

%token LPAREN
%token RPAREN
%token ID

%union semantic_value {
  char const* id_value;
  list const* list_value;
}

%%

e       : LPAREN id_list RPAREN
        ;

id_list :
        | ID id_list
        ;

%%
bison se.yy

Scanner Definition

Using Flex

// se.ll

%%

%%
flex se.ll

Make Workflow