I’m frequently faced wth problems like “find and replace this pattern, but only on specific lines,” especially lines that have type errors on them. I’ve built three new CLI tools that fit the need to operate on a specific set of lines in a codebase. In this post I’ll walk through a couple examples to show them in action.
For the impatient, the tools that I’ve built are:
grep, but search for a pattern only at the specified locations, printing the locations where a match was found.
Substitute a pattern with a replacement at the specified locations, editing the file in place.
Convert a unified diff (like the output of
git diff) into a list of locations affected by that diff.
With the quick intros out of the way, let’s dive into some examples.
Consider the file
locs.txt below which is a list of
(I call such
filename:line pairs “locations” or “locs.”)
Also consider that our project is huge, and has many more files
file_b.txt. To filter the
to only the lines that contain the pattern “hello”, we can use
The output means that only line 13 in
file_a.txt and line 10 in
hello, given our initial set of 5 locs. The
search completely ignored all other files in the project because they
weren’t mentioned in
locs.txt. Searching with
multi-grep scales with
the size of the input list, not with the size of the codebase being
This was a contrived example, but let’s keep plowing forward with the basics so we can apply them to a real example.
multi-grep is like
multi-sub is like
sed but with only
the substitute command (
s/find/replace/). Taking our previous example,
multi-sub finds and replaces a pattern on specific input lines:
In our previous example, only locations
file_b.txt:10 matched the pattern
hello. So after running this
multi-sub command, those two files will be updated in place. On both
hello will be replaced with
A larger example
With the basics out of the way, let’s tackle a real-world problem. I work with the output of Sorbet a lot, so I’ve used it for this next example (Sorbet is a type checker for Ruby). When Sorbet detects type errors in a program, it generates output like this:
The error messages look a lot better with colors! If you don’t believe me, you can try Sorbet in the browser and see for yourself.
The example above is inspired by real output that we saw at Stripe while iterating on Sorbet. In this specific case, one of my coworkers had improved Sorbet to track more information statically, which uncovered a bunch of new type errors.
On the Sorbet team, we have a policy that before landing changes like this, we modify Stripe’s monorepo to preemptively silence the new errors. Jordan Brown has a great article on the Flow blog justifying this technique, so I’ll skip the why and focus only on how to carry out codemods like this.
As seen above, Sorbet error output always looks like
filename.rb:line:Error message. With a little massaging, this will
feed directly into
multi-sub. Also notice that on the lines with
errors, the file contents always looked something like this:
To tell Sorbet to silence the errors on these lines, we’ll need to wrap
the variable in a call to
This instructs to Sorbet to forget all static type
information about the variable, thus silencing the error.
The key is to only perform this edit on lines with errors—
we’d hate to needlessly throw away type information by changing
unrelated lines! For things like this,
sed are often too
coarse-grained, because accessing a hash like this in Ruby is abundantly
multi-sub, we can write a really simple regex targetting these
hash lookups, but scope the regex to only lines in the error output:
Take a look through the four steps in the bash oneliner above:
- Type check the project, then
- filter out every line that doesn’t have a location, then
- chop of the error messages, and finally
multi-subto perform the substitution.
The net result is to update the files in place, performing the substitution only on the lines with errors. Altogether once more, but on one line, making use of a shell alias that I have to abbreviate the inner two steps:
So with a super short bash oneliner, we’ve done a mass codemod that fixes hundreds of errors at once, without having to silence more than necessary.
sed are like chainsaws, I like to think of
multi-sub like scalpels—ideal for performing surgery on a
codebase. Regular expressions are often super imprecise tools for
codemods. But by scoping down the regex to run only on specific lines,
it doesn’t matter. The added precision from explicit locations makes up
for how blunt regular expressions are.
I’ve built one more command in the same spirit as
multi-sub, except that instead of consuming locations, it emits them.
Specifically, given a diff it outputs one
filename:line pair for every
line that was affected by the diff.1 It’s ideal for consuming the
git show or
I frequently use
diff-locs to tweak codemods that I’ve already
committed. For example, we could go back and add a TODO comment above
Recapping the pipeline above:
git showto generate a diff, then
- convert the diff to a list of locations with
diff-locs, and finally
- insert a comment before each location with
diff-locs is particularly handy because after the first codemod, there
won’t be type errors anymore! So to get a list of locations to perform
the edit on, we’d have had to check out the commit before fixing the
errors, save the list of errors to a file, go back, and finally do the
edit we wanted to in the first place.
Instead, we can take advantage of the fact that all that information is already stored in git history, skipping a bunch of steps. (And asking git to show a diff is way faster than asking Sorbet to re-typecheck a whole project 😅)
Aside: The implementations
One thing I’d like to point out is that I took some care to make sure these commands weren’t eggregiously slow. I prototyped these commands with some hacky scripts, but after doing some rather large codemods I got annoyed with them taking minutes to finish.
Some things that make these new commands fast:
multi-sedre-use an already opened file to avoid reading extra information.
multi-grepis written in Standard ML,
multi-subis written in OCaml, and
diff-locsis written in Haskell—all languages which have great optimizing compilers. This means much better performance than a scripting language.
If you’re curious, you can read through their implementations on GitHub:
As always if you have questions or notice issues please don’t hesitate to reach out!
It defaults to only lines affected after the diff applies, but there’s an option to make it show both added and removed lines.↩