Chapter 13 Lexer and parser generators (ocamllex, ocamlyacc)
This chapter describes two program generators: ocamllex, that produces a lexical analyzer from a set of regular expressions with associated semantic actions, and ocamlyacc, that produces a parser from a grammar with associated semantic actions.
These program generators are very close to the well-known lex and yacc commands that can be found in most C programming environments. This chapter assumes a working knowledge of lex and yacc: while it describes the input syntax for ocamllex and ocamlyacc and the main differences with lex and yacc, it does not explain the basics of writing a lexer or parser description in lex and yacc. Readers unfamiliar with lex and yacc are referred to “Compilers: principles, techniques, and tools” by Aho, Sethi and Ullman (Addison-Wesley, 1986), or “Lex & Yacc”, by Levine, Mason and Brown (O’Reilly, 1992).
ocamllex
1 Overview ofThe ocamllex command produces a lexical analyzer from a set of regular expressions with attached semantic actions, in the style of lex. Assuming the input file is lexer.mll, executing
ocamllex lexer.mll
produces OCaml code for a lexical analyzer in file lexer.ml. This file defines one lexing function per entry point in the lexer definition. These functions have the same names as the entry points. Lexing functions take as argument a lexer buffer, and return the semantic attribute of the corresponding entry point.
Lexer buffers are an abstract data type implemented in the standard library module Lexing. The functions Lexing.from_channel, Lexing.from_string and Lexing.from_function create lexer buffers that read from an input channel, a character string, or any reading function, respectively. (See the description of module Lexing in chapter 25.)
When used in conjunction with a parser generated by ocamlyacc, the semantic actions compute a value belonging to the type token defined by the generated parsing module. (See the description of ocamlyacc below.)
1.1 Options
The following command-line options are recognized by ocamllex.
- -ml
- Output code that does not use OCaml’s built-in automata interpreter. Instead, the automaton is encoded by OCaml functions. This option improves performance when using the native compiler, but decreases it when using the bytecode compiler.
- -o output-file
- Specify the name of the output file produced by ocamllex. The default is the input file name with its extension replaced by .ml.
- -q
- Quiet mode. ocamllex normally outputs informational messages to standard output. They are suppressed if option -q is used.
- -v or -version
- Print version string and exit.
- -vnum
- Print short version number and exit.
- -help or --help
- Display a short usage summary and exit.
2 Syntax of lexer definitions
The format of lexer definitions is as follows:
{ header } let ident = regexp … [refill { refill-handler }] rule entrypoint [arg1… argn] = parse regexp { action } | … | regexp { action } and entrypoint [arg1… argn] = parse … and … { trailer }
Comments are delimited by (* and *), as in OCaml. The parse keyword, can be replaced by the shortest keyword, with the semantic consequences explained below.
Refill handlers are a recent (optional) feature introduced in 4.02, documented below in subsection 13.2.7.
2.1 Header and trailer
The header and trailer sections are arbitrary OCaml text enclosed in curly braces. Either or both can be omitted. If present, the header text is copied as is at the beginning of the output file and the trailer text at the end. Typically, the header section contains the open directives required by the actions, and possibly some auxiliary functions used in the actions.
2.2 Naming regular expressions
Between the header and the entry points, one can give names to frequently-occurring regular expressions. This is written let ident = regexp. In regular expressions that follow this declaration, the identifier ident can be used as shorthand for regexp.
2.3 Entry points
The names of the entry points must be valid identifiers for OCaml values (starting with a lowercase letter). Similarly, the arguments arg1… argn must be valid identifiers for OCaml. Each entry point becomes an OCaml function that takes n+1 arguments, the extra implicit last argument being of type Lexing.lexbuf. Characters are read from the Lexing.lexbuf argument and matched against the regular expressions provided in the rule, until a prefix of the input matches one of the rule. The corresponding action is then evaluated and returned as the result of the function.
If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.
However, if lexer rules are introduced with the shortest keyword in place of the parse keyword, then the “shortest match” rule applies: the shortest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is still selected. This feature is not intended for use in ordinary lexical analyzers, it may facilitate the use of ocamllex as a simple text processing tool.
2.4 Regular expressions
The regular expressions are in the style of lex, with a more OCaml-like syntax.
|
- ' regular-char ∣ escape-sequence '
- A character constant, with the same syntax as OCaml character constants. Match the denoted character.
- _
- (underscore) Match any character.
- eof
-
Match the end of the lexer input.
Note: On some systems, with interactive input, an end-of-file may be followed by more characters. However, ocamllex will not correctly handle regular expressions that contain eof followed by something else. - " { string-character } "
- A string constant, with the same syntax as OCaml string constants. Match the corresponding sequence of characters.
- [ character-set ]
- Match any single character belonging to the given character set. Valid character sets are: single character constants ' c '; ranges of characters ' c1 ' - ' c2 ' (all characters between c1 and c2, inclusive); and the union of two or more character sets, denoted by concatenation.
- [ ^ character-set ]
- Match any single character not belonging to the given character set.
- regexp1 # regexp2
- (difference of character sets) Regular expressions regexp1 and regexp2 must be character sets defined with [… ] (or a single character expression or underscore _). Match the difference of the two specified character sets.
- regexp *
- (repetition) Match the concatenation of zero or more strings that match regexp.
- regexp +
- (strict repetition) Match the concatenation of one or more strings that match regexp.
- regexp ?
- (option) Match the empty string, or a string matching regexp.
- regexp1 | regexp2
- (alternative) Match any string that matches regexp1 or regexp2
- regexp1 regexp2
- (concatenation) Match the concatenation of two strings, the first matching regexp1, the second matching regexp2.
- ( regexp )
- Match the same strings as regexp.
- ident
- Reference the regular expression bound to ident by an earlier let ident = regexp definition.
- regexp as ident
- Bind the substring matched by regexp to identifier ident.
Concerning the precedences of operators, # has the highest precedence, followed by *, + and ?, then concatenation, then | (alternation), then as.
2.5 Actions
The actions are arbitrary OCaml expressions. They are evaluated in a context where the identifiers defined by using the as construct are bound to subparts of the matched string. Additionally, lexbuf is bound to the current lexer buffer. Some typical uses for lexbuf, in conjunction with the operations on lexer buffers provided by the Lexing standard library module, are listed below.
- Lexing.lexeme lexbuf
- Return the matched string.
- Lexing.lexeme_char lexbuf n
- Return the nth character in the matched string. The first character corresponds to n = 0.
- Lexing.lexeme_start lexbuf
- Return the absolute position in the input text of the beginning of the matched string (i.e. the offset of the first character of the matched string). The first character read from the input text has offset 0.
- Lexing.lexeme_end lexbuf
- Return the absolute position in the input text of the end of the matched string (i.e. the offset of the first character after the matched string). The first character read from the input text has offset 0.
- entrypoint [exp1… expn] lexbuf
- (Where entrypoint is the name of another entry point in the same lexer definition.) Recursively call the lexer on the given entry point. Notice that lexbuf is the last argument. Useful for lexing nested comments, for example.
2.6 Variables in regular expressions
The as construct is similar to “groups” as provided by numerous regular expression packages. The type of these variables can be string, char, string option or char option.
We first consider the case of linear patterns, that is the case when all as bound variables are distinct. In regexp as ident, the type of ident normally is string (or string option) except when regexp is a character constant, an underscore, a string constant of length one, a character set specification, or an alternation of those. Then, the type of ident is char (or char option). Option types are introduced when overall rule matching does not imply matching of the bound sub-pattern. This is in particular the case of ( regexp as ident ) ? and of regexp1 | ( regexp2 as ident ).
There is no linearity restriction over as bound variables. When a variable is bound more than once, the previous rules are to be extended as follows:
- A variable is a char variable when all its occurrences bind char occurrences in the previous sense.
- A variable is an option variable when the overall expression can be matched without binding this variable.
For instance, in ('a' as x) | ( 'a' (_ as x) ) the variable x is of type char, whereas in ("ab" as x) | ( 'a' (_ as x) ? ) the variable x is of type string option.
In some cases, a successful match may not yield a unique set of bindings.
For instance the matching of aba
by the regular expression
(('a'|"ab") as x) (("ba"|'a') as y) may result in binding
either
x
to "ab"
and y
to "a"
, or
x
to "a"
and y
to "ba"
.
The automata produced ocamllex on such ambiguous regular
expressions will select one of the possible resulting sets of
bindings.
The selected set of bindings is purposely left unspecified.
2.7 Refill handlers
By default, when ocamllex reaches the end of its lexing buffer, it will silently call the refill_buff function of lexbuf structure and continue lexing. It is sometimes useful to be able to take control of refilling action; typically, if you use a library for asynchronous computation, you may want to wrap the refilling action in a delaying function to avoid blocking synchronous operations.
Since OCaml 4.02, it is possible to specify a refill-handler, a function that will be called when refill happens. It is passed the continuation of the lexing, on which it has total control. The OCaml expression used as refill action should have a type that is an instance of
(Lexing.lexbuf -> 'a) -> Lexing.lexbuf -> 'a
where the first argument is the continuation which captures the processing ocamllex would usually perform (refilling the buffer, then calling the lexing function again), and the result type that instantiates [’a] should unify with the result type of all lexing rules.
As an example, consider the following lexer that is parametrized over an arbitrary monad:
{ type token = EOL | INT of int | PLUS module Make (M : sig type 'a t val return: 'a -> 'a t val bind: 'a t -> ('a -> 'b t) -> 'b t val fail : string -> 'a t (* Set up lexbuf *) val on_refill : Lexing.lexbuf -> unit t end) = struct let refill_handler k lexbuf = M.bind (M.on_refill lexbuf) (fun () -> k lexbuf) } refill {refill_handler} rule token = parse | [' ' '\t'] { token lexbuf } | '\n' { M.return EOL } | ['0'-'9']+ as i { M.return (INT (int_of_string i)) } | '+' { M.return PLUS } | _ { M.fail "unexpected character" } { end }
2.8 Reserved identifiers
All identifiers starting with __ocaml_lex are reserved for use by ocamllex; do not use any such identifier in your programs.
ocamlyacc
3 Overview ofThe ocamlyacc command produces a parser from a context-free grammar specification with attached semantic actions, in the style of yacc. Assuming the input file is grammar.mly, executing
ocamlyacc options grammar.mly
produces OCaml code for a parser in the file grammar.ml, and its interface in file grammar.mli.
The generated module defines one parsing function per entry point in the grammar. These functions have the same names as the entry points. Parsing functions take as arguments a lexical analyzer (a function from lexer buffers to tokens) and a lexer buffer, and return the semantic attribute of the corresponding entry point. Lexical analyzer functions are usually generated from a lexer specification by the ocamllex program. Lexer buffers are an abstract data type implemented in the standard library module Lexing. Tokens are values from the concrete type token, defined in the interface file grammar.mli produced by ocamlyacc.
4 Syntax of grammar definitions
Grammar definitions have the following format:
%{ header %} declarations %% rules %% trailer
Comments are enclosed between /*
and */
(as in C) in the
“declarations” and “rules” sections, and between (*
and
*)
(as in OCaml) in the “header” and “trailer” sections.
4.1 Header and trailer
The header and the trailer sections are OCaml code that is copied as is into file grammar.ml. Both sections are optional. The header goes at the beginning of the output file; it usually contains open directives and auxiliary functions required by the semantic actions of the rules. The trailer goes at the end of the output file.
4.2 Declarations
Declarations are given one per line. They all start with a %
sign.
- %token constr … constr
- Declare the given symbols constr … constr as tokens (terminal symbols). These symbols are added as constant constructors for the token concrete type.
- %token < typexpr > constr … constr
-
Declare the given symbols constr … constr as tokens with an
attached attribute of the
given type. These symbols are added as constructors with arguments of
the given type for the token concrete type. The typexpr part is
an arbitrary OCaml type expression, except that all type
constructor names must be fully qualified (e.g. Modname.typename)
for all types except standard built-in types, even if the proper
open
directives (e.g.open Modname
) were given in the header section. That’s because the header is copied only to the .ml output file, but not to the .mli output file, while the typexpr part of a%token
declaration is copied to both. - %start symbol … symbol
-
Declare the given symbols as entry points for the grammar. For each
entry point, a parsing function with the same name is defined in the
output module. Non-terminals that are not declared as entry points
have no such parsing function. Start symbols must be given a type with
the
%type
directive below. - %type < typexpr > symbol … symbol
-
Specify the type of the semantic attributes for the given symbols.
This is mandatory for start symbols only. Other nonterminal symbols
need not be given types by hand: these types will be inferred when
running the output files through the OCaml compiler (unless the
-s
option is in effect). The typexpr part is an arbitrary OCaml type expression, except that all type constructor names must be fully qualified, as explained above for %token. - %left symbol … symbol
- %right symbol … symbol
- %nonassoc symbol … symbol
Associate precedences and associativities to the given symbols. All symbols on the same line are given the same precedence. They have higher precedence than symbols declared before in a
%left
,%right
or%nonassoc
line. They have lower precedence than symbols declared after in a%left
,%right
or%nonassoc
line. The symbols are declared to associate to the left (%left
), to the right (%right
), or to be non-associative (%nonassoc
). The symbols are usually tokens. They can also be dummy nonterminals, for use with the%prec
directive inside the rules.The precedence declarations are used in the following way to resolve reduce/reduce and shift/reduce conflicts:
- Tokens and rules have precedences. By default, the precedence of a rule is the precedence of its rightmost terminal. You can override this default by using the %prec directive in the rule.
- A reduce/reduce conflict is resolved in favor of the first rule (in the order given by the source file), and ocamlyacc outputs a warning.
- A shift/reduce conflict is resolved by comparing the precedence of the rule to be reduced with the precedence of the token to be shifted. If the precedence of the rule is higher, then the rule will be reduced; if the precedence of the token is higher, then the token will be shifted.
- A shift/reduce conflict between a rule and a token with the same precedence will be resolved using the associativity: if the token is left-associative, then the parser will reduce; if the token is right-associative, then the parser will shift. If the token is non-associative, then the parser will declare a syntax error.
- When a shift/reduce conflict cannot be resolved using the above method, then ocamlyacc will output a warning and the parser will always shift.
4.3 Rules
The syntax for rules is as usual:
nonterminal : symbol … symbol { semantic-action } | … | symbol … symbol { semantic-action } ;
Rules can also contain the %prec
symbol directive in the
right-hand side part, to override the default precedence and
associativity of the rule with the precedence and associativity of the
given symbol.
Semantic actions are arbitrary OCaml expressions, that
are evaluated to produce the semantic attribute attached to
the defined nonterminal. The semantic actions can access the
semantic attributes of the symbols in the right-hand side of
the rule with the $
notation: $1
is the attribute for the
first (leftmost) symbol, $2
is the attribute for the second
symbol, etc.
The rules may contain the special symbol error to indicate resynchronization points, as in yacc.
Actions occurring in the middle of rules are not supported.
Nonterminal symbols are like regular OCaml symbols, except that they cannot end with ' (single quote).
4.4 Error handling
Error recovery is supported as follows: when the parser reaches an error state (no grammar rules can apply), it calls a function named parse_error with the string "syntax error" as argument. The default parse_error function does nothing and returns, thus initiating error recovery (see below). The user can define a customized parse_error function in the header section of the grammar file.
The parser also enters error recovery mode if one of the grammar actions raises the Parsing.Parse_error exception.
In error recovery mode, the parser discards states from the stack until it reaches a place where the error token can be shifted. It then discards tokens from the input until it finds three successive tokens that can be accepted, and starts processing with the first of these. If no state can be uncovered where the error token can be shifted, then the parser aborts by raising the Parsing.Parse_error exception.
Refer to documentation on yacc for more details and guidance in how to use error recovery.
5 Options
The ocamlyacc command recognizes the following options:
- -bprefix
- Name the output files prefix.ml, prefix.mli, prefix.output, instead of the default naming convention.
- -q
- This option has no effect.
- -v
- Generate a description of the parsing tables and a report on conflicts resulting from ambiguities in the grammar. The description is put in file grammar.output.
- -version
- Print version string and exit.
- -vnum
- Print short version number and exit.
- -
- Read the grammar specification from standard input. The default output file names are stdin.ml and stdin.mli.
- -- file
- Process file as the grammar specification, even if its name starts with a dash (-) character. This option must be the last on the command line.
At run-time, the ocamlyacc-generated parser can be debugged by setting the p option in the OCAMLRUNPARAM environment variable (see section 11.2). This causes the pushdown automaton executing the parser to print a trace of its action (tokens shifted, rules reduced, etc). The trace mentions rule numbers and state numbers that can be interpreted by looking at the file grammar.output generated by ocamlyacc -v.
6 A complete example
The all-time favorite: a desk calculator. This program reads arithmetic expressions on standard input, one per line, and prints their values. Here is the grammar definition:
/* File parser.mly */ %token <int> INT %token PLUS MINUS TIMES DIV %token LPAREN RPAREN %token EOL %left PLUS MINUS /* lowest precedence */ %left TIMES DIV /* medium precedence */ %nonassoc UMINUS /* highest precedence */ %start main /* the entry point */ %type <int> main %% main: expr EOL { $1 } ; expr: INT { $1 } | LPAREN expr RPAREN { $2 } | expr PLUS expr { $1 + $3 } | expr MINUS expr { $1 - $3 } | expr TIMES expr { $1 * $3 } | expr DIV expr { $1 / $3 } | MINUS expr %prec UMINUS { - $2 } ;
Here is the definition for the corresponding lexer:
(* File lexer.mll *) { open Parser (* The type token is defined in parser.mli *) exception Eof } rule token = parse [' ' '\t'] { token lexbuf } (* skip blanks *) | ['\n' ] { EOL } | ['0'-'9']+ as lxm { INT(int_of_string lxm) } | '+' { PLUS } | '-' { MINUS } | '*' { TIMES } | '/' { DIV } | '(' { LPAREN } | ')' { RPAREN } | eof { raise Eof }
Here is the main program, that combines the parser with the lexer:
(* File calc.ml *) let _ = try let lexbuf = Lexing.from_channel stdin in while true do let result = Parser.main Lexer.token lexbuf in print_int result; print_newline(); flush stdout done with Lexer.Eof -> exit 0
To compile everything, execute:
ocamllex lexer.mll # generates lexer.ml ocamlyacc parser.mly # generates parser.ml and parser.mli ocamlc -c parser.mli ocamlc -c lexer.ml ocamlc -c parser.ml ocamlc -c calc.ml ocamlc -o calc lexer.cmo parser.cmo calc.cmo
7 Common errors
- ocamllex: transition table overflow, automaton is too big
The deterministic automata generated by ocamllex are limited to at most 32767 transitions. The message above indicates that your lexer definition is too complex and overflows this limit. This is commonly caused by lexer definitions that have separate rules for each of the alphabetic keywords of the language, as in the following example.
rule token = parse "keyword1" { KWD1 } | "keyword2" { KWD2 } | ... | "keyword100" { KWD100 } | ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] * as id { IDENT id}
To keep the generated automata small, rewrite those definitions with only one general “identifier” rule, followed by a hashtable lookup to separate keywords from identifiers:
{ let keyword_table = Hashtbl.create 53 let _ = List.iter (fun (kwd, tok) -> Hashtbl.add keyword_table kwd tok) [ "keyword1", KWD1; "keyword2", KWD2; ... "keyword100", KWD100 ] } rule token = parse ['A'-'Z' 'a'-'z'] ['A'-'Z' 'a'-'z' '0'-'9' '_'] * as id { try Hashtbl.find keyword_table id with Not_found -> IDENT id }
- ocamllex: Position memory overflow, too many bindings
- The deterministic automata generated by ocamllex maintain a table of positions inside the scanned lexer buffer. The size of this table is limited to at most 255 cells. This error should not show up in normal situations.