Foreach Tutorial (Camlp4 3.10)
Deprecation Warning: this tutorial describes technology that is considered obsolete. It's been replaced by extensions points and ppx rewriters
This is a tutorial that guides you, step by step, how to write syntax extension using Camlp4 in Objective CAML. The example we present in this tutorial should give you a practical idea how syntax extension works, but this is not meant to be comprehensive. There are many resources on the web for Camlp4 prior to version 3.10, but Camlp4 3.10 introduces incompatible changes. This tutorial targets Camlp4 3.10 only.
Throughout the tutorial, we intentionally introduce subtle bugs in the interest of simplicity, but these bugs will be later explained.
The reader is expected to be familiar with context-free parsing. Some experience with yacc or ocamlyacc will be helpful. Knowing the difference between LL, LR and LALR parsers is a plus.
Motivation
One of the greatest strengths of OCaml is Camlp4, a modular pre-processor pretty-printer that lets you add syntactic sugar to the language without rewriting the entire grammar from scratch. A syntactic sugar is a language construct that can be decomposed to an existing construct. For example, OCaml doesn't have a "for-each" syntax, which can be illustrated by the following Python code:
a_list = ["hello", "world"]
for s in a_list:
print s
This code outputs:
hello
world
However, that doesn't mean OCaml can't support iterators! As it turns out, many modules in OCaml that define a data structure also provide an "iter" higher-order function that works similarly:
let a_list = ["hello"; "world"] in
List.iter (fun s -> print_endline s) a_list
Besides List
, other modules that provide "iter" are
Array
, Hashtbl
, Map.Make(*)
,
Queue
, Set.Make(*)
, Stack
, and
even String
. Wouldn't it be nice if one can write:
let a_list = ["hello"; "world"] in
for s in a_list do
print_endline s
done
instead? Imagine a long for-each body; this style of writing puts the "each" variable and the list next to each other, which results in cleaner code.
The idea is to transform the latter code into former, leveraging the
iter
function provided by these modules.
Notice that we can't generally look up a module specific iter
function
by the type of the expression we want to iterate over, so we have to
specify a module like this:
let a_list = ["hello"; "world"] in
for s in List a_list do
print_endline s
done
More generally, the syntactic sugar takes the following form:
for v in M e do
seq
done
where module M
implements an iterator over the elements in
collection e; most of the time, e also has the type
M.t
, though this is not strictly required.
Towards the First Try
Without going into too much detail, we can write a simple extension as the following:
open Camlp4.PreCast
open Syntax
let () =
EXTEND Gram
expr: LEVEL "top"
[ [ "for"; v = a_LIDENT; "in"; m = a_UIDENT; e = expr; "do";
seq = sequence; "done" ->
<:expr< $uid:m$.iter (fun $lid:v$ -> $seq$) $e$ >>
] ];
END
For what follows, it is assumed that this code is written to a file pa_foreach.ml.
All the prerequisites are provided by the module
Camlp4.PreCast
, including the Syntax
module
that provides the terminals a_LIDENT
and
a_UIDENT
(for lowercase and uppercase identifiers
respectively), and the non-terminals expr
(expressions) and
sequence
(semicolon separated expressions). The
Gram
module provides functions to manipulate non-terminal
rules.
Suppose we save this syntax extension in the file called
pa_foreach.ml
. Notice that a syntax extension is just a
regular OCaml file with some fancy EXTEND
syntax. It can be
compiled using the command:
ocamlc -I +camlp4 camlp4lib.cma -pp camlp4orf -c pa_foreach.ml
FYI: Camlp4 is complicated to learn because a syntax extension consists
of two domain-specific languages that are foreign to OCaml: parsing and
code embedding. To add insult to injury, Camlp4 has a
revised syntax
dialect of OCaml that can be used, in addition to the
original OCaml dialect, both as the host (top-level) language and as the
embedded language. This tutorial uses original OCaml as the host
language and revised dialect as the embedded language. This choice is
reflected in the camlp4 executable that we use here,
camlp4orf
. See
Using Camlp4
for the explanation and additional options.
We can verify that it works:
# #load "camlp4o.cma";;
Line 1:
Error: Reference to undefined global `Stdlib__stream'
# (* Add directory where the syntax extension is compiled to the module
path (not required if pa_foreach.cmo is in the current directory). *)
#directory "site/learn/tutorials/camlp4_3.10/";;
# #load "pa_foreach.cmo";;
Line 1:
Error: Reference to undefined global `Camlp4'
# for s in List ["hello"; "world"] do print_endline s done;;
Error: Syntax error
Now we have a rudimentary syntax extension. However, it is quite restrictive. For example, it would be nice to do pattern matching like this:
for (k, v) in List [(0, "hello"); (1, "world")] do
Printf.printf "%d: %s\n" k v
done
It doesn't support Hashtbl.iter
yet because, so far, we
can only generate functions with one argument. Hashtbl.iter
passes two arguments to the function, the key and the value. We will
address these issues in the next section.
Supporting Multiple Patterns
We're now beginning to use Camlp4 features that are less documented. It is useful to keep in mind that syntax extension is modifying existing parsing rules, and we can't do without an understanding of how existing rules work. It is recommended that you take a quick look at two files in the source distribution under camlp4/Camlp4Parsers:
- Camlp4OCamlRevisedParser.ml; this provides the bulk of the syntax rules.
- Camlp4OCamlParser.ml
implicitly borrows most of its rules
from
Camlp4OCamlRevisedParser
, but clears some of the rules that are particular to the revised syntax and replaces them with the original OCaml syntax.
From these files, we can see that ipatt
rule supplies what
we want for matching patterns, and LIST1
modifier allows us
to take one or more patterns. There is also a LIST0
modifier for "zero or more" matching. Here is our second version,
pa_foreach2.ml, of the syntax extension:
open Camlp4.PreCast
open Syntax
let rec mkfun _loc patts e =
match patts with
| p :: patts ->
<:expr< fun $p$ -> $mkfun _loc patts e$ >>
| [] ->
e
let () =
EXTEND Gram
expr: LEVEL "top"
[ [ "for"; patts = LIST1 ipatt; "in"; m = a_UIDENT; e = expr; "do";
seq = sequence; "done" ->
let f = mkfun _loc patts seq in
<:expr< $uid:m$.iter $f$ $e$ >>
] ];
END
In order to generate a higher-order function according to a list of
patterns, we use the helper mkfun
that essentially
generates, for patterns patt1
... pattN
,
fun patt1 ->
...
fun pattN ->
seq
which is the desugared form for fun patt1 ... pattN -> seq
.
We can verify that pattern works:
ocamlc -I +camlp4 camlp4lib.cma -pp camlp4orf -c pa_foreach2.ml
# #load "pa_foreach2.cmo";;
Line 1:
Error: Reference to undefined global `Camlp4'
# for (k, v) in List [(0, "hello"); (1, "world")] do
Printf.printf "%d: %s\n" k v
done;;
Error: Syntax error
and that multiple arguments work:
# let tbl = Hashtbl.create 3;;
val tbl : ('_weak1, '_weak2) Hashtbl.t = <abstr>
# Hashtbl.add tbl 0 "hello";
Hashtbl.add tbl 1 "world";
Hashtbl.add tbl 2 "foobar";;
- : unit = ()
# for k v in Hashtbl tbl do
Printf.printf "%d: %s\n" k v
done;;
Error: Syntax error
There are still some subtle problems with this approach. In the next section, we'll discuss solutions to these problems.
Final Touch-up
The first problem we find out is that foreach body is supposed to be a sequence, but that doesn't work:
# for s in List ["hello"; "world"] do print_endline s; () done;;
Error: Syntax error
The problem is due to the desugared form, which becomes:
List.iter (fun s -> print_endline s; ()) ["hello"; "world"]
and this is ambiguous. Do we interpret the expression
fun s -> print_endline s; ()
as
(fun s -> print_endline s); ()
or fun s -> (print_endline (); ())
?
In order to avoid ambiguity, we wrap the sequence using do {…}
when it is a sequence. The function mksequence'
in
Camlp4OCamlRevisedParser.ml
provides some inspiration how
to handle this.
FYI: see this post by Nicolas Pouillard for a discussion of a similar issue.
We also notice that the original for-loop syntax ceases working:
# for i = 0 to 5 do print_int i done;;
012345- : unit = ()
It turns out that the new rule we add for for
doesn't play nice with
the original rule. The solution is to rewrite the original rule so it
becomes compatible with the new rule.
The code listing, pa_foreach3.ml, that addresses these issues can be found below:
open Camlp4.PreCast
open Syntax
let mksequence _loc = function
| <:expr< $_$; $_$ >> as e -> <:expr< $e$ >>
| e -> e
let rec mkfun _loc patts e =
match patts with
| p :: patts -> <:expr< fun $p$ -> $mkfun _loc patts e$ >>
| [] -> mksequence _loc e
let lident_of_patt p =
match p with
| <:patt< $lid:i$ >> -> i
| _ -> invalid_arg "lident_of_patt"
let () =
DELETE_RULE Gram
expr:
"for"; a_LIDENT; "="; sequence; direction_flag; sequence;
"do"; do_sequence
END;
EXTEND Gram
expr: LEVEL "top"
[ [ "for"; i = ipatt; "="; e1 = sequence; df = direction_flag;
e2 = sequence; "do"; seq = do_sequence ->
<:expr<
for $lident_of_patt i$ =
$mksequence _loc e1$ $to:df$ $mksequence _loc e2$
do $seq$ done
>>
| "for"; p = ipatt; patts = LIST0 ipatt; "in";
m = a_UIDENT; e = expr; "do"; seq = do_sequence ->
let patts = p :: patts in
let f = mkfun _loc patts seq in
<:expr< $uid:m$.iter $f$ $e$ >>
] ];
END
In order for alternative rules to "fall through" correctly, rules must share a common prefix, and the rest of the non-terminals must be distinct enough. The way we structured the "for" syntax before, we essentially have the following two rules:
EXTEND Gram
expr: LEVEL "top"
[ [ "for"; patts = LIST1 ipatt; "in"; m = a_UIDENT; e = expr;
"do"; seq = do_sequence ->
(* ... *)
| "for"; i = a_LIDENT; "="; e1 = sequence; df = direction_flag;
e2 = sequence; "do"; seq = do_sequence ->
(* ... *)
] ]
;
END
Once the "for" keyword is matched, the parser proceeds to match the next
token, but ipatt
also matches a_LIDENT
(a
lowercase identifier is also a legal pattern). At this point, the parser
fixes on a rule (whichever comes first) and no longer backtracks to the
alternative option.
The solution is to make the original for loop always match
ipatt
, but refuse anything except a lowercase identifier.
We can now verify that everything works as expected.
$ ocamlc -I +camlp4 camlp4lib.cma -pp camlp4orf -c pa_foreach3.ml
# #load "pa_foreach3.cmo";;
Line 1:
Error: Reference to undefined global `Camlp4'
# let tbl = Hashtbl.create 3;;
val tbl : ('_weak3, '_weak4) Hashtbl.t = <abstr>
# Hashtbl.add tbl 0 "hello";
Hashtbl.add tbl 1 "world";
Hashtbl.add tbl 2 "foo";;
- : unit = ()
# for k v in Hashtbl tbl do Printf.printf "%d: %s\n" k v done;;
Error: Syntax error
# for i = 0 to 5 do print_int i done;;
012345- : unit = ()
# for k, v in List [0, "hello"; 1, "world"; 2, "foo"] do
Printf.printf "%d: %s\n" k v
done;;
Error: Syntax error
Conclusion
We begin with a simple Camlp4 syntax extension for the for-each iterator syntax, then explored further refinements to that idea that works around parsing issues both in embedded code and in the rules.
This tutorial is written by Likai Liu. The text has been barely peer reviewed, so use this at your own risk. Questions and comments are welcome.