This is the fourth post in a series about “things I’ve learned while making improvements to Sorbet’s parser.” In the last post, I talked about some tools and techniques that I’ve found useful while hacking on Sorbet’s Bison-based parser. This post continues that theme by explaining in a little more detail the primary tool Bison has for adding error recovery to a parser: the special `error` token.
You don’t really need to read the previous posts for this post to be useful, but in case you want to queue them up to read later, here’s the list:
- Part 1: Why Recover from Syntax Errors
- Part 2: What I Didn’t Do
- Part 3: Tools and Techniques for Debugging a (Bison) Parser
- Part 4: Bison’s `error` Token (this post)
- (coming soon) Part 5: Backtracking, aka Lexer Hacks
- (coming soon) Part 6: Falling Back on Indentation, aka More Lexer Hacks
That being said, if you’re also trying to hack on a Bison parser to make it recover from errors, I hate to say it but this post is not going to be a substitute for the official docs on Error Recovery. You’re going to want to spend some time skimming that section of the docs if you haven’t already.
Bison needs explicit annotations within a grammar to provide syntax error recovery. This is in contrast with parser tools like tree-sitter, which include error recovery automatically. (If you’re curious, I’ve written some assorted thoughts on tree-sitter.) Concretely, Bison requires inserting special `error` tokens in the production rules that should participate in error recovery.
To get the most out of Bison’s error recovery mode, it’s crucial to understand what it’s actually doing with those `error` tokens.
Bison’s error recovery algorithm
There’s a vague description of the algorithm in the docs, but I found that I had to make the algorithm more explicit before I could understand what was and wasn’t possible.
At a high level, this is what Bison does:
1. It encounters an error. By which we mean: given the current lookahead token, neither shifting that token nor reducing the current stack is a valid action.

2. It reports the error by calling the (user-defined) `yyerror` function. (In C++ parsers, this is called `parser::error`.) Importantly, this function is always called. (Other parser generators, for example Happy for Haskell, do not necessarily report an error when an `error` token is produced.) Even if a production rule eventually consumes the `error` token and successfully recovers from the parse error, an error will have been reported. Also note that it’s impossible to delay calling `yyerror` until it’s clear that no production rule matched the `error` token, since the `yyerror` function is called even before attempting to shift the `error` token, let alone reduce a rule that uses it. For similar reasons, this makes it more complicated to allow the eventual error rule to provide extra context in the error message. (Sorbet gets around this limitation by appending parse errors to a temporary queue, only flushing them to the user once parsing has completed. Sorbet sometimes mutates the last element of the queue inside semantic actions to improve the error message with specific information about the parse failure.)

3. Next, Bison looks to see whether it can shift the `error` token, given the current stack contents and parser state. It leaves the current lookahead token untouched for the time being.

4. If it can shift the `error` token, it does so. Bison has finished recovering from the syntax error. The parse continues, using the untouched lookahead token.

5. If it can’t shift the `error` token, Bison completely discards the object on the top of the stack. To make that clear, if the parser stack looked something like this:

   ```
   # def foo
   #   x.
   # end
   stack     = ['def', identifier, '\n', expression, '.']
   lookahead = 'end'
   ```

   and Bison found no matching error production rule, it would throw away the `'.'` token that it had already shifted onto the parser stack:

   ```
   stack     = ['def', identifier, '\n', expression]
   lookahead = 'end'
   ```

   It then loops back to the previous step, checking to see whether it’s now possible to shift the `error` token. This process repeats until Bison has gobbled up the whole stack or some production rule consumes the `error` token.

If Bison’s Location Tracking feature is on (which allows using `@1`, `@2`, etc. in semantic actions to get the locations associated with components of the rule), it’s worth knowing how the `error` token’s location is set. Bison sets the error location to span from the last thing it discarded from the stack all the way to the lookahead token that induced the error. If it discarded nothing, the range is just the location of the lookahead token. Using the example above, if the `'.'` token was the only token Bison needed to discard, the `error` token’s location would span from that `'.'` token all the way to the `'end'` lookahead token.
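To make that concrete, here’s a minimal sketch (the rule shape and the `driver.noteErrorRange` helper are hypothetical, not Sorbet’s grammar) of a semantic action reading the `error` token’s location:

```
stmt: expr '.' error
        {
          /* @3 is the error token's location: it spans from the first
             token Bison discarded during recovery through the offending
             lookahead, or just the lookahead if nothing was discarded. */
          driver.noteErrorRange(@3); /* hypothetical helper */
          $$ = $1;
        }
```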
Most Bison grammars have a catch-all `| error` production somewhere, like the one on Sorbet’s `stmts` rule.
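Sorbet’s actual rule has more alternatives and real semantic actions; this is just a minimal sketch of its shape, with the action bodies elided:

```
stmts: /* nothing */
         { /* ... */ }
     | stmt
         { /* ... */ }
     | stmts terms stmt
         { /* ... */ }
     | error
         {
           /* Catch-all: no matter what is on the stack, Bison can
              eventually discard enough of it to shift `error` here,
              letting the parse continue past the bad statement. */
           /* ... */
         }
     ;
```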
The nice thing about a rule like this is that it provides coarse-grained error recovery at a high level without requiring special cases for every production in the grammar. It works because no matter what happens to be on the stack, it’ll always eventually match (as long as we’re in the parser state corresponding to `stmts`), because eventually Bison will have discarded the entire stack.
It’ll definitely throw away a lot of stuff, but at least it’ll let the parse continue instead of failing to produce any parse result. For example, if there was no parse error further down in the file, and the error occurred near the top, this rule gets us lots of error recovery for little work. But yeah, it’s not great to throw that much stuff away.
We’re going to want to put more `error` tokens in more targeted places. For that, I’ve come up with a handful of strategies to make the most of Bison’s error recovery.
Figure out the most common edit paths
Even though Bison requires a lot of `error` annotations to get good parse results, you can get good bang for your buck by figuring out the most common edit paths. For example, here’s every intermediate edit when the user adds a keyword argument to a Ruby method call:
```
foo(a, x: x)       # contents before edit
foo(a, x: x,)
foo(a, x: x, y)
foo(a, x: x, y:)
foo(a, x: x, y: y) # edit finished
```
Ideally there’s an `error` production for every intermediate state, because adding a keyword argument to a method call is common. On the other hand, you can likely get away with not adding rules for uncommon syntax errors.
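For instance, here’s a hedged sketch of what targeted productions for those intermediate states might look like. This is not Sorbet’s actual grammar; the rule names (`call_args`, `pair`) and the `tLABEL` token are assumptions for illustration:

```
call_args: arg
         | call_args ',' arg
         | call_args ',' error   /* covers foo(a, x: x,): trailing comma mid-edit */
         ;

pair: tLABEL arg
    | tLABEL error               /* covers foo(a, x: x, y:): label with no value yet */
    ;
```

Note that in both rules the `error` token comes last and follows a terminal, which foreshadows the tips in a later section.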
If you want, you can take the guesswork out of what’s common and what’s not by measuring, assuming you have a corpus of syntax errors you can sample from. (For example, we gather usage metrics from every Sorbet user at Stripe.)
The semi-automated approach to measurement, which is what
I’ve personally used: when there’s a syntax error and the parse result
is “bad” according to some heuristic (like: the parse is completely
empty, or there was a partial parse result but it was too bad to find
any completion suggestions at the user’s cursor), log the bad source
buffer to a file, and then go triage the logged files, fixing the most
common errors first.
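As a hedged sketch of that heuristic in C++ (every name here is hypothetical; this is not Sorbet’s code):

```cpp
#include <string>

// Hypothetical stand-ins for whatever your parser pipeline produces.
struct ParseResult {
    bool hadSyntaxError = false;
    const void *ast = nullptr; // nullptr means the parse was completely empty
};

bool canSuggestCompletions(const ParseResult &);   // hypothetical heuristic
void appendToTriageLog(const std::string &source); // hypothetical log sink

// Log source buffers whose parse results look "bad" so a human can
// triage them later, fixing the most common classes of error first.
void maybeLogBadParse(const ParseResult &result, const std::string &source) {
    bool parseWasEmpty = (result.ast == nullptr);
    bool uselessPartialParse =
        result.hadSyntaxError && !canSuggestCompletions(result);
    if (parseWasEmpty || uselessPartialParse) {
        appendToTriageLog(source);
    }
}
```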
The annoying part about that approach is the manual triage work of opening up the logged buffers, identifying which part of the file had the parse error, and blaming it to some section of the parser. An idea I’ve had (but not implemented) for a more automatic approach: when there’s a syntax error that’s never recovered (or that’s handled by some “catch all” production rule), log the lookahead token and parser state where the error happened. Cross-reference parser states with what’s in the textual report on the parser to get approximate line numbers in the grammar that need to be updated. The states that show up most commonly are the ones in need of dedicated `error` rules.
The `error` token is usually last
With the most common edit paths in hand, I’ve usually had the most success by following two tips for crafting the error rules.
1. Put the `error` token as the very last token in the production rule. It can be tempting to try writing rules like this, where the `error` token is followed by some other stuff:

   ```
   | args ',' arg
       { /* ... */ }
   | args ',' error arg
       { /* ... */ }
   ```

   Sometimes this works, but in my experience, it’s much easier to reason about conflicts when the `error` token is the last token in a rule.

2. Put the `error` token only after terminals. There are almost never conflicts in the grammar when putting the `error` token after a `','` or `'='` token, but there usually are when putting it after something like an `args` non-terminal. Intuitively this makes sense, because the `args` production itself probably has a bunch of rules that consume an `error` token at the end, causing the conflicts. The non-terminal might even have a catch-all `| error` rule. (A sketch of the preferred shape follows this list.)
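Here’s a minimal sketch of a rule that follows both tips (the rule itself is hypothetical):

```
exprs: expr
         { /* ... */ }
     | exprs ',' expr
         { /* ... */ }
     | exprs ',' error
         {
           /* `error` comes last and follows the ',' terminal, so this
              alternative is unlikely to conflict with other rules.
              Recover with whatever expressions we already parsed. */
           $$ = $1;
         }
     ;
```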
In situations where I haven’t been able to trivially follow these tips, I’ve usually been able to go into the preceding non-terminal rule (like `args`) and sprinkle `error` tokens judiciously inside that rule until the outer rule can follow them.
Unfortunately, there have definitely been times where that hasn’t worked, which will be the topic of a future post.
Consider using `yyclearin`
After recovering from a parse error using the `error` token, the lookahead token will still be set to whatever it was that caused the error to happen in the first place.
If for whatever reason you think that attempting to continue the parse with that token would just screw things up again, you can use the `yyclearin` macro (in the C++ skeleton, this is available as `yyla.clear()` instead) to clear out the lookahead token, which will cause Bison to request another token from the lexer.
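For example (the rule is a hypothetical sketch, but `yyclearin` itself is Bison’s real macro):

```
  | exprs ',' error
      {
        /* The offending lookahead would likely just trigger another
           error, so discard it; Bison will ask the lexer for a fresh
           token before continuing the parse. */
        yyclearin;
        $$ = $1;
      }
```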
We’re not currently using this in Sorbet because I’ve replaced most places where it might have been useful with some even more powerful techniques (discussed in a future part), but I figured I may as well mention it.
Invent a special AST node for errors
Consider this parse error:
```
def foo
  x =
end
```
The rule for parsing an assignment with no error looks like this, and produces an `assign` node in the AST:
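```
arg: lhs '=' arg_rhs
       {
         $$ = driver.build.assign(self, $1, $2, $3);
       }
   | ...
```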
To handle the error case, we still have the `lhs` and the `'='`, but we don’t have the `arg_rhs`. The parser will detect that `'end'` is not a valid `arg_rhs`, and shift the `error` token for us to recognize:
```
arg: lhs '=' arg_rhs
       {
         $$ = driver.build.assign(self, $1, $2, $3);
       }
   | lhs '=' error
       {
         $$ = driver.build.assign(self, $1, $2, /* ... ? ... */);
       }
   | ...
```
It’s unclear what to use in place of `$3`, because `error` doesn’t have an associated semantic value. To fill the void, we can invent a special AST `error_node` type. (Slight fib; Sorbet actually creates a constant literal node with a magic name for backwards compatibility reasons. If you’re wondering “what’s up with that `endPos` stuff?” in the real implementation, there’s some discussion in the full source on GitHub.)
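Concretely, the recovered alternative might end up looking something like this (a sketch; `error_node` is the invented builder described above, not Sorbet’s real implementation):

```
| lhs '=' error
    {
      /* @3 is the error token's location; use it so the placeholder
         node gets a sensible source range. `error_node` is the
         hypothetical builder for the special AST node. */
      $$ = driver.build.assign(self, $1, $2,
                               driver.build.error_node(self, @3));
    }
```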
This special AST node allows phases downstream of the parser to pretend the parse succeeded. In particular, it’s easy to detect where the syntax error occurred when responding to completion requests (which is important, because in the above example, the syntax error is also where the user’s cursor is).
Read the generated parser’s source
To close, I’d like to point out that everything in this post that I didn’t find in the official docs, I taught myself by browsing the source code generated by Bison. Despite being generated, it’s actually pretty well commented, and with a bit of elbow grease you might even be able to get your IDE to let you use jump to def in it.
Some nice things about browsing the source:
- It’s never out of sync with the version of Bison you’re using (unlike the official docs, which only track the latest version).
- You can see exactly what happens and in what order. For example, reading the source is how I convinced a colleague that no, using `error` productions did not mean we would be preventing errors from being reported. It was faster to read the source than to attempt to find whether the docs mentioned this.
- You can see what fun, undocumented APIs are actually available to you. For example, the docs talk about `yylval` and `yylloc`, which are supposed to store the semantic value and location of the lookahead token. But in the C++ skeleton, these have been renamed (without documentation) to `yyla.value` and `yyla.location`, respectively.
Reading the generated parser’s source code reinforced my understanding of Bison’s parsing algorithm and made it easier to debug when things went wrong.
All this being said, I’ve run into plenty of limitations when attempting to improve Sorbet’s parser. In the next post, I’ll explain one such example: why using `error` tokens alone wasn’t enough, and how I tweaked Sorbet’s lexer to aid the parser in error recovery.
← Part 3: Tools and Techniques for Debugging a (Bison) Parser
(coming soon) Part 5: Backtracking, aka Lexer Hacks →