A few weeks ago, we announced the Ohm v18 beta, which involved a complete rewrite of the core parsing engine. Since then, we've implemented even more performance improvements: v18 is now more than 50x faster for real-world grammars while using about 10% of the memory.
The new parsing engine works by compiling an Ohm grammar — which is a form of parsing expression grammars, or PEG — into a WebAssembly module that implements a parser. In this post, we'll dive into the technical details of how that works, and talk about some of the optimizations that made it even faster.
PExpr trees
In previous versions of Ohm (up to and including v17), the parsing engine used an approach called AST interpretation. We'll briefly explain how that works; it will be useful for understanding how the WebAssembly version is different.
When you instantiate a grammar with Ohm, it parses your grammar and converts it to an abstract syntax tree. You can think of this tree as a kind of program, which describes a parser for the language. The nodes of the tree are parsing expressions, or PExprs as they're called in the source code.
We'll use the following grammar as an example:
JSONLike {
Value = Object
| "true"
| "false"
| "null"
Object = "{" Members? "}"
Members = Member ("," Member)*
Member = string ":" Value
string = "\"" (~"\"" any)* "\""
}
This grammar is parsed by Ohm (using the "grammar grammar", or metagrammar, defined in ohm-grammar.ohm). The result is a Map<string, PExpr>:
{
'Value' => Alt(
Apply('Object'),
Term('true'),
Term('false'),
Term('null')
),
'Object' => Seq(
Term('{'),
Opt(Apply('Members')),
Term('}')
),
...
}
You can think of each rule ('Value', 'Object', etc.) as being like a function, and the function bodies are parsing expressions. Alt, Apply, Opt, Seq, and Term are all subclasses of the abstract PExpr class, and they all have an eval method. These methods are mostly pretty small and straightforward — here's the implementation for Alt:
pexprs.Alt.prototype.eval = function (state) {
for (let idx = 0; idx < this.terms.length; idx++) {
if (state.eval(this.terms[idx])) {
return true;
}
}
return false;
};
To evaluate an Alt, we just recursively evaluate its children. If one of them succeeds, then the Alt succeeds; otherwise, it fails. This approach is quite straightforward and simple, but it's also not very performant.
Wasm compilation
At a high level, the v18 code has two parts:
- Runtime support code written in AssemblyScript.
- The WebAssembly codegen piece, which is written in TypeScript.
The code generation phase begins with the same PExpr tree as v17, but instead of interpreting it, we compile it to WebAssembly. (We actually first convert it to a slightly lower-level intermediate representation, but let's not worry about that for now.) Then we link the generated WebAssembly code with the runtime support code to produce the final Wasm module.
Here's what that code generation looks like for Alt:
function emitAlt(exp: ir.Alt): void {
const {asm} = this;
const saved = asm.maybeSaveBacktrackPoint();
asm.block(w.blocktype.empty, () => {
for (const term of exp.children) {
this.emitPExpr(term);
asm.localGet('ret');
asm.condBreak(asm.depthOf('pexprEnd')); // `return true`
saved.pos.restore();
}
});
}
Unsurprisingly, it has a similar structure: we loop over all the children, and emit the code for each child. But, notice that this is a compile-time loop, not a run-time one. So the structure of the final code, expressed as pseudocode, looks like this:
try matching terms[0]
succeeded? return true
try matching terms[1]
succeeded? return true
// ...
return false
Note that in the generated WebAssembly code, we're also not dispatching to any kind of generic eval function, we just inline the code for each individual expression. The exception is rule application: by default, each rule gets compiled to its own function, so a rule application (like Apply('Object') in the JSONLike grammar) just compiles to a call.
Producing a recognizer (something that just accepts or rejects a given string, without producing a parse tree) in this way was the first major milestone for v18, and it didn't take long. I only targeted pure-PEG features, Ohm-specific things like parameterized rules and left recursion would be harder to deal with.
After adding support for construction of basic parse trees, the first version of the WebAssembly compiler was about 800 lines of JavaScript. After that, it was time to start tackling the Ohm-specific features.
Building syntax trees
In order to be able to do something useful with a valid input, we need to produce some kind of parse tree — or concrete syntax tree (CST), as we usually call them in Ohm. In v17, CST nodes are regular JavaScript objects, allocated on the heap and managed by the garbage collector. But, from a memory management perspective, CST nodes have a few interesting properties:
- The nodes themselves are relatively small, so the per-node memory management overhead is relatively large.
- There are a large number of nodes (counting Terminal nodes, around one per input character).
- The nodes are full of references (which need to be scanned during garbage collection).
- All nodes generally have the same lifetime: either the whole tree is in use, or all its nodes can be freed.
These properties make CST nodes well-suited for region-based memory management, also known as arena allocation. As Wikipedia describes it:
A region [...] is a collection of allocated objects that can be efficiently reallocated or deallocated all at once. Memory allocators using region-based managements are often called area allocators, and when they work by only "bumping" a single pointer, as bump allocators.
Bump allocation into Wasm linear memory
In v18 we use a bump allocator (provided by AssemblyScript's stub runtime) to allocate CST nodes in Wasm linear memory. This has lower overhead than heap-allocated JavaScript objects (only one 32-bit header field per object, vs 3–4 in most JS engines). We consider all CST nodes to be owned by the MatchResult they are associated with, so when the MatchResult is freed, we also reclaim the memory from all its CST nodes.
For references between nodes, we use a 32-bit offset into linear memory, rather than a full-width pointer. (This is the normal way to use references in 32-bit WebAssembly.) Overall, the approach is similar to what Adrian Sampson describes in Flattening ASTs (and Other Compiler Data Structures).
Node layout
The representation for the nodes themselves is also fairly compact.
Terminal nodes
Terminals are the most important thing to optimize, since a typical tree has approximately one terminal node per input character. So, rather than allocating a separate node for each terminal, we use a tagged 32-bit value: (matchLength << 1) | 1.
A regular reference to a full CST node is always 4-byte aligned: the offset is a multiple of 4, and thus the two low bits are always 0. So if the low bit is set, we can detect that it's not a true reference, and instead use the upper 31 bits as the payload — and for terminal nodes, the only thing we need to store is how many characters of input were consumed.
Other nodes
The other node types (nonterminal, list, opt) have the following layout:
Byte
0 matchLength: i32 (chars consumed)
4 typeAndDetails: i32
bits [1:0] = node type
bits [31:2] = ruleId (nonterm) or arity
8 childCount: i32
12 failureOffset: i32 (relative to startIdx)
16+ children: i32[] (child node "pointers")
Chunked bindings
When a parsing expression is successful, it produces a number of CST nodes, which we call bindings. If the parent expression succeeds, those nodes will become its children; but if it fails, they become garbage.
This bottom-up way of building the CST requires a stack-like structure for temporarily storing the bindings. The original implementation used an AssemblyScript Array<i32> — a managed, dynamically-resized array. This was convenient, but it meant that in some scenarios, pushing a binding could be quite expensive (allocate a new backing buffer, copy all elements, free the old one).
We replaced this with an unrolled doubly-linked list of fixed-size chunks:
prev: i32 next: i32 data: i32[128]
∅
▲
┌───┼───────────────────────────┐
│ prev next data (128×i32) │
│ │ │
└──────────┼────────────────────┘
▲ │
│ ▼
┌───────────────────────────────┐
│ prev next data (128×i32) │
│ │ │
└──────────┼────────────────────┘
▼
∅
Each chunk holds 128 binding slots. Two globals track the current position: bindingsChunk (pointer to the active chunk) and bindingsIdx (offset within it). Push is a single store instruction plus an index increment — only when bindingsIdx hits 128 does it advance to the next chunk, reusing an existing one if available from a previous backtrack.
The critical property for a PEG parser is that backtracking is cheap: it's just restoring two i32 values (the saved chunk pointer and index). No elements need to be zeroed, copied, or freed. The "abandoned" slots in subsequent chunks are simply ignored and will be overwritten on the next forward pass.
This change alone made parsing 15–16% faster on our benchmarks, purely from eliminating array resize copies and managed-object overhead.
Memoization
Ohm uses a technique called packrat parsing, in which rule applications are memoized: the first time a rule is applied at a given input position, the result is stored in a table. If the same rule is applied at the same position again, we just look up the result instead of re-evaluating the rule body.
Conceptually, the memo table is a 2D structure indexed by input position and rule ID:
pos 0 pos 1 pos 2 pos 3 pos 4 ...
┌───────┬───────┬───────┬───────┬───────┬─────
Value │ │ │ │ │ │
Object │ │ │ ✓ │ │ │
Members │ │ │ │ │ │
Member │ │ │ │ │ │
string │ │ ✓ │ │ ✓ │ │
└───────┴───────┴───────┴───────┴───────┴─────
In a naive implementation, the memo table would have numPositions × numRules entries. But in practice, it's very sparse — most rules are never attempted at most positions.
Block-sparse representation
To avoid wasting memory on empty entries, v18 uses a block-sparse memo table. The index is a flat array of block pointers: index[pos * numBlocks + blockIdx]. Each block holds 16 entries and is allocated lazily on first write.
pos 0 pos 1
┌────────┬────────┬─── ··· ┌────────┬────────┬─── ···
│ blk 0 │ blk 1 │ │ blk 0 │ blk 1 │
└───┬────┴───┬────┴─── ··· └────────┴────────┴─── ···
│ │
▼ ▼
┌────────┐ ┌────────┐
│16 slots│ │16 slots│ ← i32 MemoEntry values
└────────┘ └────────┘
(64 bytes) (0 = not yet allocated)
This means we only allocate memory for rules that are actually attempted at a given position, while keeping lookups fast.
(This two-level representation is common in packrat parsers; it was first described in Bryan Ford's thesis and later in Robert Grimm's Rats! parser generator.)
Memo entry encoding
Each memo entry is packed into a single i32:
- Success: a pointer to the CST node (bit 0 is always 0, since nodes are 4-byte aligned)
- Failure:
(failureOffset << 1) | 1 - Spaces:
(matchLength << 2) | 2— more on this below
This encoding lets us distinguish the three cases with a simple bit check, and avoids the need for any auxiliary data structures.
Parameterized rules
In Ohm, a rule can have parameters — parsing expressions that are substituted into the rule body. For example:
KeyVal<keyExp, valExp> = keyExp ":" valExp
The KeyVal rule takes two parameters, keyExp and valExp. These work much like function parameters in any old programming language: when we use the rule, we also need to supply the actual parameters (aka arguments). For example:
IdField = KeyVal<"\"id\"", number>
In the v17 interpreter, we handle parameters by maintaining a rule stack (which is much like the call stack in most programming languages).
In v18, we handle parameterized rules via static specialization. This means that we generate a separate rule body for every unique combination of parameters. So parameterized rules are more like macros: they are expanded at compile time, and no parameters exist at runtime. In the example above, it means that there is no generic KeyVal rule — it's as if we defined the rule like this:
KeyVal$0 = "\"id\"" ":" number
Not only does this simplify the runtime semantics, but it also works well with memoization. For parameterized rules, we can only use a memo entry if the parameters are identical: if KeyVal<"\"id\"", number> succeeds at position 0, it doesn't mean that KeyVal<"\"id\"", letter> will.
After specialization, two applications of the same rule with different parameters can simply be treated as two unique rule applications, and their memo entries will never be shared.
Optimized space skipping
One of Ohm's distinctive features is that syntactic rules (those starting with an uppercase letter) automatically skip whitespace between tokens. And we allow the grammar author to override the definition of whitespace — for example, to allow any comments to be treated as whitespace.
In v17, implicit space skipping is treated just like an explicit application of the spaces rule: CST nodes are allocated and the result is memoized.
v18 uses an optimized form of implicit space skipping: it avoids creating CST nodes altogether. During the walk phase, it can lazily materialize those nodes if and when they are required. For the common case where no one inspects the space-skipping nodes, this avoids a huge number of allocations.
Lazy spaces nodes also have a special representation in the memo table: (matchLength << 2) | 2. Since there is no CST node to point to, we just record the matchLength as a tagged, 32-bit value.
Optimizing space skipping can lead to a huge performance gain in some grammars. For example, here are the results from our official ES5 grammar on a 742KB source file:
Other optimizations
A few other important optimizations that are worth mentioning —
Single-use rule inlining
If a rule is referenced exactly once in the grammar, its body is emitted inline at the call site. This eliminates the function call overhead, and saves space in the memo table. (This is a standard optimization in packrat parsers.)
Preallocated nodes
Some CST nodes have a fixed structure, no matter where they appear in the tree:
- single-child nonterminals: simple rules that match a single code point, like
letterordigit. - empty iteration nodes (zero children and zero match length).
For nodes like this, we preallocate a singleton instance, and use that whenever it's needed, rather than allocating separate nodes for each instance.
Further reading
Very little of what's described here is novel; most of the techniques can be found in one of these papers:
- Bryan Ford:
- Robert Grimm: Better extensibility through modular syntax (2006)
For more details on Ohm's original implementation, see:
- Modular semantic actions (2016)
- Incremental packrat parsing (2017)
Try it out
npm install ohm-js@next # Runtime (production dependency)
npm install --save-dev @ohm-js/compiler@next # Compiler (dev dependency)
We'd love to hear your feedback. Give it a spin, and let us know what you think on Discord or GitHub Discussions.
Acknowledgements
I'd like to thank Adam B. and Project Substrate for the initial funding that kicked off this project, and Shopify for additional financial support.
And thanks to Alex Warth (PEG parsing guru) for the advice, ideas, and encouragement that made this work possible.
