Syntax Reference

This document describes the syntax of the Ohm language, which is a variant of parsing expression grammars (PEGs). If you have experience with PEGs, the Ohm syntax will mostly look familiar, but there are a few important differences to note:

  • When naming rules, case matters: whitespace is implicitly skipped inside a rule application if the rule name begins with an uppercase letter. For further information, see Syntactic vs. Lexical Rules.
  • Grammars are purely about recognition: they do not contain semantic actions (those are defined separately) or bindings. The separation of semantic actions is one of the defining features of Ohm; we believe that it improves modularity and makes both grammars and semantics easier to understand.
  • Alternation expressions support case names, which are used in inline rule declarations. This makes semantic actions for alternation expressions simpler and less error-prone.
  • Ohm does not (yet) support semantic predicates.

Ohm is closely related to OMeta, another PEG-based language for parsing and pattern matching. Like OMeta, Ohm supports a few features not supported by many PEG parsing frameworks, such as parameterized rules and grammar inheritance, both of which are described below.

Terminology​

Arithmetic {
  Expr = "1 + 1"
}

This is a grammar named "Arithmetic", which has a single rule named "Expr". The right-hand side of Expr is known as a "rule body". A rule body may be any valid parsing expression.

Parsing Expressions​

Here is a full list of the different kinds of parsing expressions supported by Ohm:

Terminals​

"hello there"

Matches exactly the characters contained inside the quotation marks.

Special characters​

Special characters (", \, and ') can be escaped with a backslash; e.g., "\"" will match a literal quote character in the input stream. Other valid escape sequences include: \b (backspace), \f (form feed), \n (line feed), \r (carriage return), and \t (tab), as well as \x followed by 2 hex digits and \u followed by 4 hex digits, for matching characters by code point.

The \u{hexDigits} escape sequence can be used to represent any Unicode code point, including code points above 0xFFFF. E.g., "\u{1F639}" will match '😹'. (New in Ohm v16.3.0.)

NOTE: For grammars defined in a JavaScript string literal (i.e., not in a separate .ohm file), it's recommended to use a template literal with the String.raw tag. Without String.raw, you'll need to use double-escaping (e.g., \\n rather than \n).

Terminal Range​

start..end

Matches exactly one code point whose value is between start and end (inclusive). E.g., "a".."c" will match 'a', 'b', or 'c'. Note: start and end must be Terminal expressions containing a single character or code point. Prior to Ohm v16.3.0, terminal ranges only supported code points up to 0xFFFF. As of v16.3.0, higher code points can be specified directly (e.g., "😇".."😈") or with an escape code ("\u{1F607}".."\u{1F608}").

Rule Application​

ruleName

Matches the body of the rule named ruleName. For example, the built-in rule letter will parse a string of length 1 that is a letter.

ruleName<expr>

Matches the body of the parameterized rule named ruleName, substituting the parsing expression expr as its first parameter. For parameterized rules with more than one parameter, the parameters are comma-separated, e.g. ListOf<field, ";">.

Repetition operators: *, +, ?​

expr *

Matches the expression expr repeated 0 or more times. E.g., "a"* will match '', 'a', 'aa', ...

Inside a syntactic rule (any rule whose name begins with an uppercase letter), spaces before a match are automatically skipped. E.g., "a"* will match " a a" as well as "aa". See the documentation on syntactic and lexical rules for more information.

expr +

Matches the expression expr repeated 1 or more times. E.g., letter+ will match 'x', 'xA', ...

As with the * operator, spaces are skipped when used in a syntactic rule.

expr ?

Tries to match the expression expr, succeeding whether it matches or not. No input is consumed if it does not match.

Sequence​

expr1 expr2

Matches the expression expr1 followed by expr2. E.g., "grade" letter will match 'gradeA', 'gradeB', ...

As with the * and + operators, spaces are skipped when used in a syntactic rule. E.g., "grade" letter will match ' grade A' as well as 'gradeA'.

Alternation​

expr1 | expr2

Matches the expression expr1, and if that does not succeed, matches the expression expr2. E.g., letter | digit will match 'a', '9', ...

Lookahead: &​

& expr

Succeeds if the expression expr can be matched, but does not consume anything from the input stream. Usually used as part of a sequence, e.g. letter &digit will match 'a9', but only consume 'a'. &"a" letter+ will match any string of letters that begins with 'a'.

Negative Lookahead: ~​

~ expr

Succeeds if the expression expr cannot be matched, and does not consume anything from the input stream. Usually used as part of a sequence, e.g., ~"\n" any will consume any single character that is not a newline character.

Lexification: #​

# expr

Matches expr as if in a lexical context. This can be used to prevent whitespace skipping before an expression that appears in the body of a syntactic rule. For further information, see Syntactic vs. Lexical Rules.

Comment​

Inside an Ohm grammar, you can use both single-line (//) comments like

booleanLiteral = ("true" | "false") // TODO: Should we support "True"/"False" as well?

or

// For semantics on how decimal literals are constructed, see section 7.8.3

as well as multiline (/* */) comments like:

/*
Note: Punctuator and DivPunctuator (see https://es5.github.io/x7.html#x7.7) are
not currently used by this grammar.
*/

Built-in Rules​

(See src/built-in-rules.ohm.)

any: Matches the next Unicode character (i.e., a single code point) in the input stream, if one exists.

NOTE: A JavaScript string is a sequence of 16-bit code units. Some Unicode characters, such as emoji, are encoded as pairs of 16-bit values. For example, the string '😆' has length 2, but contains a single Unicode code point. Prior to Ohm v17, any always consumed a single 16-bit code unit, rather than a full Unicode character.

letter: Matches a single character which is a letter (either uppercase or lowercase).

lower: Matches a single lowercase letter.

upper: Matches a single uppercase letter.

digit: Matches a single character which is a digit from 0 to 9.

hexDigit: Matches a single character which is either a digit or a letter from A-F.

alnum: Matches a single letter or digit; equivalent to letter | digit.

space: Matches a single whitespace character (e.g., space, tab, newline, etc.)

end: Matches the end of the input stream. Equivalent to ~any.

caseInsensitive<terminal>: Matches terminal, but ignoring any differences in casing (based on the simple, single-character Unicode case mappings). E.g., caseInsensitive<"ohm"> will match 'Ohm', 'OHM', etc.

ListOf<elem, sep>: Matches the expression elem zero or more times, separated by something that matches the expression sep. E.g., ListOf<letter, ","> will match '', 'a', and 'a, b, c'.

NonemptyListOf<elem, sep>: Like ListOf, but matches elem at least one time.

listOf<elem, sep>: Similar to ListOf<elem, sep>, but interpreted as a lexical rule (see Syntactic vs. Lexical Rules).

applySyntactic<ruleName>: Allows the syntactic rule ruleName to be applied in a lexical context, which is otherwise not allowed. Spaces are skipped before and after the rule application. (New in Ohm v16.1.0.)

Grammar Syntax​

Grammar Inheritance​

grammarName <: supergrammarName { ... }

Declares a grammar named grammarName which inherits from supergrammarName.

Defining, Extending, and Overriding Rules​

In the three forms below, the rule body may optionally begin with a | character, which will be ignored. Also note that in rule names, case is significant.

ruleName = expr

Defines a new rule named ruleName in the grammar, with the parsing expression expr as the rule body. Throws an error if a rule with that name already exists in the grammar or one of its supergrammars.

ruleName := expr

Defines a rule named ruleName, overriding a rule of the same name in a supergrammar. Throws an error if no rule with that name exists in a supergrammar.

New in 15.3.0: The super-splice operator (...) can be used to append and/or prepend cases to the supergrammar rule body. E.g., if the supergrammar defines comment = multiLineComment, then comment := ... | singleLineComment is equivalent to comment := multiLineComment | singleLineComment.

ruleName += expr

Extends a supergrammar rule named ruleName, throwing an error if no rule with that name exists in a supergrammar. The rule body will effectively be expr | oldBody, where oldBody is the rule body as defined in the supergrammar.

Note that as of v15.3.0, the super-splice operator (...) offers a more general form of rule extension. E.g., keyword += "def" can also be written keyword := "def" | ....

Parameterized Rules​

ruleName<arg1, ..., argN> = expr

Defines a new rule named ruleName which has N parameters. In the rule body expr, the parameter names (e.g. arg1) may be used as rule applications. E.g., Repeat<x> = x x.

Rule Descriptions​

Rule declarations may optionally have a description, which is a parenthesized "comment" following the name of the rule in its declaration. Rule descriptions are used to produce better error messages for end users of a language when input is not recognized. For example:

ident (an identifier)
  = ~keyword name

Inline Rule Declarations​

expr -- caseName

When a parsing expression is followed by the characters -- and a name, it signals an inline rule declaration. This is most commonly used in alternation expressions to ensure that each branch has the same arity. For example, the following declaration:

AddExp = AddExp "+" MulExp  -- plus
       | MulExp

is equivalent to:

AddExp = AddExp_plus
       | MulExp
AddExp_plus = AddExp "+" MulExp

Syntactic vs. Lexical Rules​

A syntactic rule is a rule whose name begins with an uppercase letter, and a lexical rule is one whose name begins with a lowercase letter. The difference between lexical and syntactic rules is that syntactic rules implicitly skip whitespace characters.

The definition of "whitespace character" is anything that matches the grammar's space rule. The default implementation of space matches ' ', '\t', '\n', '\r', and any other character that is considered whitespace in the ES5 spec.

How space skipping works​

In the body of a syntactic rule, Ohm implicitly inserts applications of the spaces rule before each expression. (The spaces rule is defined as spaces = space*.) As an example, take this fragment of JSON grammar:

Array = "[" "]"  -- empty
      | "[" Elements "]"  -- nonEmpty
Elements = Element ("," Element)*

Array and Elements are both syntactic rules, since their names begin with a capital letter. Here's what a lexical version of these rules would look like, with explicit space skipping:

array = spaces "[" spaces "]"  -- empty
      | spaces "[" spaces elements spaces "]"  -- nonEmpty
elements = spaces element (spaces "," spaces element)*

In terms of the language it accepts, this version of the rules (with explicit space skipping) is equivalent to the syntactic version above.

A few other details that are helpful to know:

  1. If the start rule is a syntactic rule, both leading and trailing spaces are skipped around the top-level application.
  2. When the body of a rule contains a repetition operator (e.g. + or *), spaces are skipped before each match. In other words, Names = name+ is equivalent to names = (spaces name)+.
  3. The lexification operator (#) can be used in the body of a syntactic rule to prevent space skipping in specific places. For example:
KeyAndValue = #(letter alnum+) ":" #(digit+)

is equivalent to:

keyAndValue = letter alnum+ spaces ":" digit+

Note that no space skipping occurs inside or before the lexical context defined by the # character. That means that this rule will match 'count :33', but not 'count: 33'.