Chapter 6  The OCaml language

Foreword

This document is intended as a reference manual for the OCaml language. It lists the language constructs, and gives their precise syntax and informal semantics. It is by no means a tutorial introduction to the language: there is not a single example. A good working knowledge of OCaml is assumed.

No attempt has been made at mathematical rigor: words are employed with their intuitive meaning, without further definition. As a consequence, the typing rules have been left out, for lack of the mathematical framework required to express them, although they are definitely part of a full formal definition of the language.

Notations

The syntax of the language is given in BNF-like notation. Terminal symbols are set in typewriter font (like this). Non-terminal symbols are set in italic font (like that). Square brackets […] denote optional components. Curly brackets {…} denote zero, one or several repetitions of the enclosed components. Curly brackets with a trailing plus sign {…}+ denote one or several repetitions of the enclosed components. Parentheses (…) denote grouping.

6.1  Lexical conventions

Blanks

The following characters are considered blanks: space, horizontal tabulation, carriage return, line feed (newline) and form feed. Blanks are ignored, but they separate adjacent identifiers, literals and keywords that would otherwise be confused as one single identifier, literal or keyword.

Comments

Comments are introduced by the two characters (*, with no intervening blanks, and terminated by the characters *), with no intervening blanks. Comments are treated as blank characters. Comments do not occur inside string or character literals. Nested comments are handled correctly.
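As an illustration only (the names commented_out and not_a_comment are hypothetical), the following sketch shows a nested comment and a string literal whose contents merely look like a comment:

```ocaml
(* An ordinary comment. *)
(* An outer comment (* with a nested comment inside *) that ends here. *)

(* The nested comment below hides the whole "+ 1" part. *)
let commented_out = 1 (* + 1 (* disabled *) *) + 2

(* Inside a string literal, the comment delimiters are ordinary characters. *)
let not_a_comment = "(* not a comment *)"
```

Under these assumptions, commented_out evaluates to 3 and not_a_comment is a 19-character string.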

Identifiers

ident::= (letter∣ _) { letter∣ 0 … 9∣ _∣ ' }  
 
letter::= A … Z ∣  a … z

Identifiers are sequences of letters, digits, _ (the underscore character), and ' (the single quote), starting with a letter or an underscore. Letters contain at least the 52 lowercase and uppercase letters from the ASCII set. The current implementation also recognizes as letters all accented characters from the ISO 8859-1 (“ISO Latin 1”) set. All characters in an identifier are meaningful. The current implementation accepts identifiers up to 16000000 characters in length.
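A few identifiers that fit the grammar above, as a small sketch (all names are made up for illustration):

```ocaml
let x = 1
let x' = x + 1            (* the trailing quote is part of the name *)
let _unused_total = 0     (* an identifier may start with an underscore *)
let snake_case_2 = x' + 1 (* digits are allowed after the first character *)
```

Here x and x' are two distinct identifiers, since all characters in an identifier are meaningful.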

Integer literals

integer-literal::= [-] (0 … 9) { 0 … 9∣ _ }  
  ∣ [-] (0x∣ 0X) (0 … 9∣ A … F∣ a … f) { 0 … 9∣ A … F∣ a … f∣ _ }  
  ∣ [-] (0o∣ 0O) (0 … 7) { 0 … 7∣ _ }  
  ∣ [-] (0b∣ 0B) (0 … 1) { 0 … 1∣ _ }

An integer literal is a sequence of one or more digits, optionally preceded by a minus sign. By default, integer literals are in decimal (radix 10). The following prefixes select a different radix:

Prefix      Radix
0x, 0X      hexadecimal (radix 16)
0o, 0O      octal (radix 8)
0b, 0B      binary (radix 2)

(The initial 0 is the digit zero; the O for octal is the letter O.) The interpretation of integer literals that fall outside the range of representable integer values is undefined.

For convenience and readability, underscore characters (_) are accepted (and ignored) within integer literals.
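The radix prefixes and the underscore separator can be sketched as follows (the binding names are hypothetical); the first four bindings all denote the same integer:

```ocaml
let dec = 255
let hex = 0xFF
let oct = 0o377
let bin = 0b1111_1111     (* underscores are ignored *)
let million = 1_000_000
let neg = -0x10           (* a leading minus sign is allowed *)
```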

Floating-point literals

float-literal::= [-] (0 … 9) { 0 … 9∣ _ } [. { 0 … 9∣ _ }] [(e∣ E) [+∣ -] (0 … 9) { 0 … 9∣ _ }]

Floating-point decimals consist of an integer part, a decimal part and an exponent part. The integer part is a sequence of one or more digits, optionally preceded by a minus sign. The decimal part is a decimal point followed by zero, one or more digits. The exponent part is the character e or E followed by an optional + or - sign, followed by one or more digits. The decimal part or the exponent part can be omitted, but not both, to avoid ambiguity with integer literals. The interpretation of floating-point literals that fall outside the range of representable floating-point values is undefined.

For convenience and readability, underscore characters (_) are accepted (and ignored) within floating-point literals.
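A short sketch of the accepted shapes (binding names are hypothetical): a decimal part with no digits, an exponent without a decimal part, both parts together, and underscores in the integer part.

```ocaml
let a = 1.          (* decimal part with zero digits *)
let b = 1e3         (* exponent part only *)
let c = 1.5E-2      (* decimal part and signed exponent *)
let d = 1_000.25    (* underscores are ignored *)
```

Writing just 1 would denote an integer literal, which is why at least one of the two optional parts must be present.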

Character literals

char-literal::= ' regular-char '  
  ∣ ' escape-sequence '  
 
escape-sequence::= \ (\ ∣  " ∣  ' ∣  n ∣  t ∣  b ∣  r)  
  ∣ \ (0 … 9) (0 … 9) (0 … 9)  
  ∣ \x (0 … 9∣ A … F∣ a … f) (0 … 9∣ A … F∣ a … f)

Character literals are delimited by ' (single quote) characters. The two single quotes enclose either one character different from ' and \, or one of the escape sequences below:

Sequence    Character denoted
\\          backslash (\)
\"          double quote (")
\'          single quote (')
\n          linefeed (LF)
\r          carriage return (CR)
\t          horizontal tabulation (TAB)
\b          backspace (BS)
\space      space (SPC)
\ddd        the character with ASCII code ddd in decimal
\xhh        the character with ASCII code hh in hexadecimal
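To make the table concrete, here is a small sketch (binding names are hypothetical) using a symbolic escape, a decimal escape and a hexadecimal escape; the last two both denote the letter A:

```ocaml
let newline = '\n'        (* linefeed, ASCII code 10 *)
let upper_a_dec = '\065'  (* ASCII code 65 in decimal *)
let upper_a_hex = '\x41'  (* ASCII code 41 in hexadecimal *)
let quote = '\''          (* the single quote must be escaped *)
let backslash = '\\'      (* and so must the backslash *)
```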

String literals

string-literal::= " { string-character } "  
 
string-character::= regular-char-str  
  ∣ escape-sequence

String literals are delimited by " (double quote) characters. The two double quotes enclose a sequence of either characters different from " and \, or escape sequences from the table given above for character literals.

To allow splitting long string literals across lines, the sequence \newline blanks (a \ at end-of-line followed by any number of blanks at the beginning of the next line) is ignored inside string literals.
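As a sketch of the line-splitting rule (binding names are hypothetical), the two first strings below are equal: the backslash, the newline and the leading blanks of the next line are all dropped.

```ocaml
let one_line = "Hello, world"

(* The backslash at end of line and the indentation that follows are ignored. *)
let split = "Hello, \
             world"

(* Escape sequences from the character-literal table also work in strings. *)
let escaped = "tab:\there, quote:\""
```

Note that the space after the comma in split comes before the backslash, so it is kept.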

The current implementation places practically no restrictions on the length of string literals.

Naming labels

To avoid ambiguities, naming labels in expressions cannot just be defined syntactically as the sequence of the three tokens ~, ident and :, and have to be defined at the lexical level.

label-name ::= (a … z∣ _) { letter∣ 0 … 9∣ _∣ ' } 
 
label ::= ~ label-name :  
 
optlabel ::= ? label-name :

Naming labels come in two flavours: label for normal arguments and optlabel for optional ones. They are simply distinguished by their first character, either ~ or ?.

Despite label and optlabel being lexical entities in expressions, their expansions ~ label-name : and ? label-name : will be used in grammars, for the sake of readability. Note also that inside type expressions, this expansion can be taken literally, i.e. there are really 3 tokens, with optional spaces between them.
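The two flavours of naming labels can be sketched as follows. The function join and its parameter names are hypothetical; the point is only that ~prefix: and ?sep: are each written without intervening blanks in expressions, since they are single lexical entities there.

```ocaml
(* ?sep: is an optlabel with a default value, ~prefix: is a label. *)
let join ?sep:(s = ", ") ~prefix:p items = p ^ String.concat s items

let with_default = join ~prefix:"> " ["x"; "y"]
let with_sep = join ~sep:"-" ~prefix:"" ["x"; "y"]
```

When the optional argument is omitted, as in with_default, the default separator is used.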

Prefix and infix symbols

infix-symbol::= (= ∣  < ∣  > ∣  @ ∣  ^ ∣  | ∣  & ∣  + ∣  - ∣  * ∣  / ∣  $ ∣  %) { operator-char }  
 
prefix-symbol::= ! { operator-char }  
  ∣ (? ∣  ~) { operator-char }+  
 
operator-char::= ! ∣  $ ∣  % ∣  & ∣  * ∣  + ∣  - ∣  . ∣  / ∣  : ∣  < ∣  = ∣  > ∣  ? ∣  @ ∣  ^ ∣  | ∣  ~  
 

Sequences of “operator characters”, such as <=> or !!, are read as a single token from the infix-symbol or prefix-symbol class. These symbols are parsed as prefix and infix operators inside expressions, but otherwise behave like normal identifiers.
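A minimal sketch of user-defined operators built from the two tokens mentioned above. The definitions are hypothetical illustrations, not standard library operators: <=> is given a three-way comparison meaning and !! a double dereference.

```ocaml
(* An infix symbol, defined by wrapping it in parentheses. *)
let ( <=> ) a b = if a < b then -1 else if a > b then 1 else 0

(* A prefix symbol starting with ! *)
let ( !! ) r = !(!r)

let cmp = 1 <=> 2
let inner = !! (ref (ref 7))

(* Parenthesized, the symbol behaves like a normal identifier. *)
let signs = List.map (( <=> ) 0) [-1; 0; 1]
```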

Keywords

The identifiers below are reserved as keywords, and cannot be employed otherwise:

      and         as          assert      asr         begin       class
      constraint  do          done        downto      else        end
      exception   external    false       for         fun         function
      functor     if          in          include     inherit     initializer
      land        lazy        let         lor         lsl         lsr
      lxor        match       method      mod         module      mutable
      new         object      of          open        or          private
      rec         sig         struct      then        to          true
      try         type        val         virtual     when        while
      with

The following character sequences are also keywords:

    !=    #     &     &&    '     (     )     *     +     ,     -
    -.    ->    .     ..    :     ::    :=    :>    ;     ;;    <
    <-    =     >     >]    >}    ?     ??    [     [<    [>    [|
    ]     _     `     {     {<    |     |]    }     ~

Note that the following identifiers are keywords of the Camlp4 extensions and should be avoided for compatibility reasons.

    parser    value   <<    <:    >>    $     $$    $:

Ambiguities

Lexical ambiguities are resolved according to the “longest match” rule: when a character sequence can be decomposed into two tokens in several different ways, the decomposition retained is the one with the longest first token.
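One place where the rule is visible, sketched with a hypothetical record type cell: the characters <- written without blanks form the single keyword <-, so blanks are needed to obtain a comparison with a negative number.

```ocaml
type cell = { mutable v : int }

let c = { v = 5 }

(* Longest match: read as c.v <- 1 (an assignment), not as c.v < -1. *)
let () = c.v<-1

(* With a blank between < and -, this is a comparison with minus one. *)
let is_less = c.v < -1
```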

Line number directives

linenum-directive::= # {0 … 9}+  
  ∣ # {0 … 9}+ " { string-character } "  
 

Preprocessors that generate OCaml source code can insert line number directives in their output so that error messages produced by the compiler contain line numbers and file names referring to the source file before preprocessing, instead of after preprocessing. A line number directive is composed of a # (sharp sign), followed by a positive integer (the source line number), optionally followed by a character string (the source file name). Line number directives are treated as blank characters during lexical analysis.
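As a sketch, a preprocessor might emit output such as the following. The file name grammar.mly and the binding answer are hypothetical; the directive is assumed to start at the beginning of a line. With it in place, an error on the let line would be expected to be reported at line 100 of grammar.mly rather than at its position in the generated file.

```ocaml
# 100 "grammar.mly"
let answer = 6 * 7
```

The directive itself is treated as a blank, so it does not change the meaning of the program.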