Syntax Reference
This document describes the syntax of the Ohm language, which is a variant of parsing expression grammars (PEGs). If you have experience with PEGs, the Ohm syntax will mostly look familiar, but there are a few important differences to note:
- When naming rules, case matters: whitespace is implicitly skipped inside a rule application if the rule name begins with an uppercase letter. For further information, see Syntactic vs. Lexical Rules.
- Grammars are purely about recognition: they do not contain semantic actions (those are defined separately) or bindings. The separation of semantic actions is one of the defining features of Ohm: we believe that it improves modularity and makes both grammars and semantics easier to understand.
- Alternation expressions support case names, which are used in inline rule declarations. This makes semantic actions for alternation expressions simpler and less error-prone.
- Ohm does not (yet) support semantic predicates.
Ohm is closely related to OMeta, another PEG-based language for parsing and pattern matching. Like OMeta, Ohm supports a few features not supported by many PEG parsing frameworks:
- Rule applications can accept parameters. This makes it possible to write higher-order rules, such as the built-in `ListOf` rule.
- Grammars can be extended in an object-oriented way; see Defining, Extending, and Overriding Rules.
Terminology
Arithmetic {
Expr = "1 + 1"
}
This is a grammar named "Arithmetic", which has a single rule named "Expr". The right-hand side of Expr is known as a "rule body". A rule body may be any valid parsing expression.
Parsing Expressions
Here is a full list of the different kinds of parsing expressions supported by Ohm:
Terminals
"hello there"
Matches exactly the characters contained inside the quotation marks.
Special characters
Special characters (`"`, `\`, and `'`) can be escaped with a backslash; e.g., `"\""` will match a literal quote character in the input stream. Other valid escape sequences include: `\b` (backspace), `\f` (form feed), `\n` (line feed), `\r` (carriage return), and `\t` (tab), as well as `\x` followed by 2 hex digits and `\u` followed by 4 hex digits, for matching characters by code point.
The `\u{hexDigits}` escape sequence can be used to represent any Unicode code point, including code points above `0xFFFF`. E.g., `"\u{1F639}"` will match `'😹'`. (New in Ohm v16.3.0.)
NOTE: For grammars defined in a JavaScript string literal (i.e., not in a separate .ohm file), it's recommended to use a template literal with the String.raw tag. Without `String.raw`, you'll need to use double-escaping; e.g., `\\n` rather than `\n`.
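To see why `String.raw` matters here, the following sketch compares the grammar source that JavaScript produces with and without the tag. It uses only built-in JavaScript and never passes the strings to Ohm; the grammar name `Lines` and the variable names are hypothetical.

```javascript
// Grammar source written with String.raw: backslashes are kept as typed.
const rawSource = String.raw`
  Lines {
    eol = "\n"
  }
`;

// Same text in a plain template literal: JavaScript interprets \n itself,
// so the grammar source contains an actual line break between the quotes.
const cookedSource = `
  Lines {
    eol = "\n"
  }
`;

// Double-escaping in a plain literal yields the same source as String.raw.
const doubleEscapedSource = `
  Lines {
    eol = "\\n"
  }
`;

console.log(rawSource.includes('"\\n"'));       // true: backslash + n preserved
console.log(cookedSource.includes('"\\n"'));    // false: replaced by a line break
console.log(rawSource === doubleEscapedSource); // true
```

The first and third strings are what a separate .ohm file would contain; the second is the form the NOTE above warns about.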
Terminal Range
start..end
Matches exactly one code point whose value is between start and end (inclusive). E.g., `"a".."c"` will match `'a'`, `'b'`, or `'c'`. Note: start and end must be Terminal expressions containing a single character or code point. Prior to Ohm v16.3.0, terminal ranges only supported code points up to `0xFFFF`. As of v16.3.0, higher code points can be specified directly (e.g., `"😇".."😈"`) or with an escape code (`"\u{1F607}".."\u{1F608}"`).
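The difference between code points and 16-bit code units can be checked in plain JavaScript. The sketch below does not use Ohm; it only shows how the endpoints of the range `"\u{1F607}".."\u{1F608}"` look to the JavaScript engine. The helper `inRange` is hypothetical and merely mirrors the inclusive comparison described above.

```javascript
// Endpoints of the range, written with the \u{...} escape.
const rangeStart = '\u{1F607}';
const rangeEnd = '\u{1F608}';

// Each endpoint is one code point but two UTF-16 code units.
console.log(rangeStart.length);                      // 2
console.log([...rangeStart].length);                 // 1 (string iteration is by code point)
console.log(rangeStart.codePointAt(0).toString(16)); // "1f607"

// Hypothetical helper: inclusive comparison by code point value.
function inRange(ch, lo, hi) {
  const cp = ch.codePointAt(0);
  return cp >= lo.codePointAt(0) && cp <= hi.codePointAt(0);
}

console.log(inRange('\u{1F608}', rangeStart, rangeEnd)); // true
console.log(inRange('\u{1F609}', rangeStart, rangeEnd)); // false
console.log(inRange('b', 'a', 'c'));                     // true
```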
Rule Application
ruleName
Matches the body of the rule named ruleName. For example, the built-in rule `letter` will parse a string of length 1 that is a letter.
ruleName<expr>
Matches the body of the parameterized rule named ruleName, substituting the parsing expression expr as its first parameter. For parameterized rules with more than one parameter, the parameters are comma-separated, e.g. `ListOf<field, ";">`.
Repetition operators: *, +, ?
expr *
Matches the expression expr repeated 0 or more times. E.g., `"a"*` will match `''`, `'a'`, `'aa'`, ...
Inside a syntactic rule (any rule whose name begins with an uppercase letter), spaces before a match are automatically skipped. E.g., `"a"*` will match `" a a"` as well as `"aa"`. See the documentation on syntactic and lexical rules for more information.
expr +
Matches the expression expr repeated 1 or more times. E.g., `letter+` will match `'x'`, `'xA'`, ...
As with the `*` operator, spaces are skipped when used in a syntactic rule.
expr ?
Tries to match the expression expr, succeeding whether it matches or not. No input is consumed if it does not match.
Sequence
expr1 expr2
Matches the expression `expr1` followed by `expr2`. E.g., `"grade" letter` will match `'gradeA'`, `'gradeB'`, ...
As with the `*` and `+` operators, spaces are skipped when used in a syntactic rule. E.g., `"grade" letter` will match `' grade A'` as well as `'gradeA'`.
Alternation
expr1 | expr2
Matches the expression `expr1`, and if that does not succeed, matches the expression `expr2`. E.g., `letter | digit` will match `'a'`, `'9'`, ...
Lookahead: &
& expr
Succeeds if the expression `expr` can be matched, but does not consume anything from the input stream. Usually used as part of a sequence, e.g. `letter &digit` will match `'a9'`, but only consume 'a'. `&"a" letter+` will match any string of letters that begins with 'a'.
Negative Lookahead: ~
~ expr
Succeeds if the expression `expr` cannot be matched, and does not consume anything from the input stream. Usually used as part of a sequence, e.g., `~"\n" any` will consume any single character that is not a newline character.
Lexification: #
# expr
Matches expr as if in a lexical context. This can be used to prevent whitespace skipping before an expression that appears in the body of a syntactic rule. For further information, see Syntactic vs. Lexical Rules.
Comment
Inside an Ohm grammar, you can use both single-line (`//`) comments like
booleanLiteral = ("true" | "false") // TODO: Should we support "True"/"False" as well?
or
// For semantics on how decimal literals are constructed, see section 7.8.3
as well as multiline (`/* */`) comments like:
/*
Note: Punctuator and DivPunctuator (see https://es5.github.io/x7.html#x7.7) are
not currently used by this grammar.
*/
Built-in Rules
(See src/built-in-rules.ohm.)
any
: Matches the next Unicode character (i.e., a single code point) in the input stream, if one exists.
NOTE: A JavaScript string is a sequence of 16-bit code units. Some Unicode characters, such as emoji, are encoded as pairs of 16-bit values. For example, the string `'😆'` has length 2, but contains a single Unicode code point. Prior to Ohm v17, `any` always consumed a single 16-bit code unit, rather than a full Unicode character.
letter
: Matches a single character which is a letter (either uppercase or lowercase).
lower
: Matches a single lowercase letter.
upper
: Matches a single uppercase letter.
digit
: Matches a single character which is a digit from 0 to 9.
hexDigit
: Matches a single character which is either a digit or a letter from A-F.
alnum
: Matches a single letter or digit; equivalent to `letter | digit`.
space
: Matches a single whitespace character (e.g., space, tab, newline, etc.)
end
: Matches the end of the input stream. Equivalent to `~any`.
caseInsensitive<terminal>
: Matches _terminal_, but ignoring any differences in casing (based on the simple, single-character Unicode case mappings). E.g., `caseInsensitive<"ohm">` will match `'Ohm'`, `'OHM'`, etc.
ListOf<elem, sep>
: Matches the expression _elem_ zero or more times, separated by something that matches the expression _sep_. E.g., `ListOf<letter, ",">` will match `''`, `'a'`, and `'a, b, c'`.
NonemptyListOf<elem, sep>
: Like `ListOf`, but matches _elem_ at least one time.
listOf<elem, sep>
: Similar to `ListOf<elem, sep>`, but interpreted as a [lexical rule](#syntactic-lexical).
applySyntactic<ruleName>
: Allows the syntactic rule _ruleName_ to be applied in a lexical context, which is otherwise not allowed. Spaces are skipped _before_ and _after_ the rule application. _New in Ohm v16.1.0._
Grammar Syntax
Grammar Inheritance
grammarName <: supergrammarName { ... }
Declares a grammar named `grammarName` which inherits from `supergrammarName`.
Defining, Extending, and Overriding Rules
In the three forms below, the rule body may optionally begin with a `|` character, which will be ignored. Also note that in rule names, case is significant.
ruleName = expr
Defines a new rule named `ruleName` in the grammar, with the parsing expression `expr` as the rule body. Throws an error if a rule with that name already exists in the grammar or one of its supergrammars.
ruleName := expr
Defines a rule named `ruleName`, overriding a rule of the same name in a supergrammar. Throws an error if no rule with that name exists in a supergrammar.
New in 15.3.0: The super-splice operator (`...`) can be used to append and/or prepend cases to the supergrammar rule body. E.g., if the supergrammar defines `comment = multiLineComment`, then `comment := ... | singleLineComment` is equivalent to `comment := multiLineComment | singleLineComment`.
ruleName += expr
Extends a supergrammar rule named `ruleName`, throwing an error if no rule with that name exists in a supergrammar. The rule body will effectively be `expr | oldBody`, where `oldBody` is the rule body as defined in the supergrammar.
Note that as of v15.3.0, the super-splice operator (`...`) offers a more general form of rule extension. E.g., `keyword += "def"` can also be written `keyword := "def" | ...`.
Parameterized Rules
ruleName<arg1, ..., argN> = expr
Defines a new rule named `ruleName` which has N parameters. In the rule body expr, the parameter names (e.g. arg1) may be used as rule applications. E.g., `Repeat<x> = x x`.
Rule Descriptions
Rule declarations may optionally have a description, which is a parenthesized "comment" following the name of the rule in its declaration. Rule descriptions are used to produce better error messages for end users of a language when input is not recognized. For example:
ident (an identifier)
= ~keyword name
Inline Rule Declarations
expr -- caseName
When a parsing expression is followed by the characters `--` and a name, it signals an inline rule declaration. This is most commonly used in alternation expressions to ensure that each branch has the same arity. For example, the following declaration:
AddExp = AddExp "+" MulExp -- plus
| MulExp
is equivalent to:
AddExp = AddExp_plus
| MulExp
AddExp_plus = AddExp "+" MulExp
Syntactic vs. Lexical Rules
A syntactic rule is a rule whose name begins with an uppercase letter, and a lexical rule is one whose name begins with a lowercase letter. The difference between lexical and syntactic rules is that syntactic rules implicitly skip whitespace characters.
The definition of "whitespace character" is anything that matches the grammar's `space` rule. The default implementation of `space` matches ' ', '\t', '\n', '\r', and any other character that is considered whitespace in the ES5 spec.
How space skipping works
In the body of a syntactic rule, Ohm implicitly inserts applications of the `spaces` rule before each expression. (The `spaces` rule is defined as `spaces = space*`.) As an example, take this fragment of a JSON grammar:
Array = "[" "]" -- empty
| "[" Elements "]" -- nonEmpty
Elements = Element ("," Element)*
`Array` and `Elements` are both syntactic rules, since their names begin with a capital letter. Here's what a lexical version of these rules would look like, with explicit space skipping:
array = spaces "[" spaces "]" -- empty
| spaces "[" spaces elements spaces "]" -- nonEmpty
elements = spaces element (spaces "," spaces element)*
In terms of the language it accepts, this version of the rules (with explicit space skipping) is equivalent to the syntactic version above.
A few other details that are helpful to know:
- If the start rule is a syntactic rule, both leading and trailing spaces are skipped around the top-level application.
- When the body of a rule contains a repetition operator (e.g. `+` or `*`), spaces are skipped before each match. In other words, `Names = name+` is equivalent to `names = (spaces name)+`.
- The lexification operator (`#`) can be used in the body of a syntactic rule to prevent space skipping in specific places. For example:
KeyAndValue = #(letter alnum+) ":" #(digit+)
is equivalent to:
keyAndValue = letter alnum+ spaces ":" digit+
Note that no space skipping occurs inside or before the lexical context defined by the `#` character. That means that this rule will match `'count :33'`, but not `'count: 33'`.
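The two inputs above can be checked without Ohm by approximating the lexical rule `keyAndValue` with a regular expression. This is only a rough stand-in and not how Ohm matches: it assumes ASCII letters and digits, whereas Ohm's built-in `letter` and `space` rules are broader, and the name `keyAndValuePattern` is hypothetical.

```javascript
// letter alnum+ spaces ":" digit+, approximated with ASCII character classes.
const keyAndValuePattern = /^[A-Za-z][A-Za-z0-9]+\s*:[0-9]+$/;

console.log(keyAndValuePattern.test('count :33')); // true: spaces are allowed before ":"
console.log(keyAndValuePattern.test('count:33'));  // true
console.log(keyAndValuePattern.test('count: 33')); // false: no skipping before the digits
```

The only place whitespace is tolerated is directly before the `":"`, which corresponds to the single `spaces` application in the lexical version of the rule.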