Parse Error Recovery in Sorbet: Part 2

What I Didn't Do

This is the second post in a series about “things I’ve learned while making improvements to Sorbet’s parser.” Specifically, it’s about approaches I considered but decided against.

Before we get started, I should say: I’m not, like, an expert at writing parsers. In fact of all the changes I’ve made to Sorbet, it’s definitely up there for “changes I’ve been least qualified to have made.” But at the end of the day my test cases passed 🙃 Take my experiences with as many or as few grains of salt as you’d like. This also means that if you want to suggest other alternatives or otherwise teach me something new, I’m all ears!

First, a little bit of history. Sorbet’s parser was originally a part of the TypedRuby project.TypedRuby was an aspirational Ruby type checker implemented in Rust that predated Sorbet. It is now abandoned.

In turn, TypedRuby sourced its parser by porting the grammar file in the whitequark parser from Racc (a Yacc-like parser generator for Ruby) to Bison (a Yacc-like parser generator for C/C++). Sorbet imported the source code of the TypedRuby parser and continued to modify it over time as Ruby syntax evolved. The lexer uses Ragel (also inherited from whitequark by way of TypedRuby) and tends to be quite stateful compared to other lexers I’ve seen—a point which we’ll come back to in future posts.

Importantly…

All of these claims about Sorbet’s parser were true when I started, and they haven’t changed. You’ll notice that in most cases the justification is “I don’t have time to do X” and not “doing X is wrong.” My biggest constraint in improving the parser has been making small, fast, iterative improvements. I wanted to be left with something to show even if I had to stop working on the parser sooner than expected. It’s possible that someone with more time or more patience will want to revisit one of these approaches in the future, and if you do I’d love to hear about it!

Anyways, that rules out the most common refrains from onlookers. But there was another, more unconventional approach I considered and decided against: using dead_end. dead_end isn’t a Ruby parser but rather a tool that hijacks Ruby’s syntax error reporting mechanismIt turns out, all (“all”) you have to do is is monkey patch require to rescue SyntaxError. Thanks Ruby 🙂

to improve the message for certain syntax errors. Specifically, it’ll try to show error messages in cases like this:

class A
  def foo
    # ... lots of code ...
  # ← dead_end error: missing `end` keyword

  def bar
  end
end # ← ruby default error: unexpected token "end of file"

Missing an end keyword is a super common class of Ruby syntax errors,One of my biggest Ruby syntax gripes is that it isn’t a curly brace language like C or JavaScript. Any sensibly editor will immediately insert the matching } after first typing {. But most Ruby editors will only insert the end matching some statement after a full line has been typed and <Enter> has been pressed, if anything. This means that unclosed if/while/do/def/class statements are abundantly common in Ruby, and this class of error (mismatched pairs) is trickier than the average error.

and dead_end already works particularly well at reporting them, so it was tempting to steal reuse either the code or the ideas.

Early on I had decided not to use the code directly (it’s written in Ruby, and I didn’t want to add a runtime dependency on Ruby to Sorbet). But in the end, I decided not to use its recovery algorithm either.

The algorithm is described in more detail here, but the tl;dr is that it uses indentation to search for mismatched snippets, expanding and discarding lines from the search frontier when it finds portions of a Ruby file that properly parse at a given indentation level.

The problem with taking that idea verbatim is that the end result is basically just a set of lines in the source file that contain the error. But knowing those lines, there’s still no parse result for those lines. For example:

def foo
  # ... code before ...
  if arbitrary_expression().
  # ... code after ...
end

dead_end could point to line 3 as the problem, but then I’d still have to parse that line to be able to e.g. service a completion request after the ., which is basically the situation we started with, because the parser would still be on the hook for the full complexity of what that arbitrary_expression() could represent. So I put the dead_end algorithm itself aside as well.

But! the general idea of using indentation to guide recovery proved out to be pretty useful—most Ruby editors will auto-indent and -dedent correctly for most edits—and there was another way to take advantage of it in Sorbet’s parser, along with some other tricks. The next few posts will discuss those tricks!

← Part 1: Why Recover from Syntax Errors

Part 3: Tools and Techniques for Debugging a (Bison) Parser →