TL;DR: Semgrep has achieved remarkably fast scan times by prioritizing speed using methods like taint summaries and tree matching in OCaml. In addition, Semgrep’s design as a tool that searches for syntax makes it fast due to designs like purely textual single-file analysis, partial parsing, and optimizations like skipping files that cannot produce matches.
Program analysis is an extremely interesting discipline that aims to combine an impossible task (finding undesirable parts of programs) with practical usage (being useful to developers to fix their code). Practical usage takes many forms ranging from convenience of information and quality of findings to the speed at which the analysis is carried out.
At Semgrep, we have one motto which we stick to — “code analysis at ludicrous speed”. After almost 3 years of development, a question remains—what goes into making a code analysis product that can run at “ludicrous speed”, and have we achieved that goal with Semgrep?
Ludicrous graphs
How do we qualify “ludicrous speed”? Some results for Semgrep’s speed can be seen here, in graphic form:
Here is a graph of Semgrep’s scan time (in seconds) for the Django Python repository, over time. This data serves as a direct reflection of Semgrep’s growth over the past year, as various optimizations and engine upgrades have been carried out:
Figure 1: Semgrep scan time (in seconds) for Django Python repository - source
And for the lodash JavaScript repository:
Figure 2: Semgrep scan time (in seconds) for lodash JavaScript repository - source
Here’s the performance of Semgrep on all of its benchmarking repositories over time:
Figure 3: Semgrep scan time for different repositories
Over time, Semgrep has been making a consistent effort towards increased performance. In all benchmarked cases, that scan time using the latest Semgrep version takes place in less than 20 seconds, which is a significantly short enough period to run within a developer’s normal commit workflow.
Here’s some data validating Semgrep’s run-time (in Python, on the Django repository) against some other open-source Python analysis tools. All data are averaged and sourced from tests run on an M1 Mac machine.
Figure 4: Semgrep scan time as compared to other Python analysis tools on Django repository
In this graph, we see that Semgrep performs quite fast, beating out the other tools. It’s worth noting that pylint
and flake8
are linting tools, which primarily work in the realm of style enforcement, which notably is not concerned with the behavior of the program, like Semgrep. With features like taint analysis, constant propagation, and dataflow analysis, it’s a fair description that Semgrep performs more computationally intensive analysis than the other options. More than just curiosity, Semgrep’s speed has made it feasible to be run by existing organizations in production to shift left.
Static analysis at scale
What makes a static analysis (SAST) tool fast? Well, it’s useful to look at what may make a static analysis tool slow. SAST applications have to be able to process large amounts of source code, break it down into a format suitable for analysis, and then run detailed semantic scans on it.
This analysis may also be done in a dynamic fashion, where programs are instrumented to detect certain faults during their runtime, but this has the disadvantage of adding extra overhead to the analyzed program, as well as operating closer to the hardware level, as opposed to the language level. In this article, we will focus on static analysis, and stay closer to the language level, which will yield dividends later on with Semgrep.
Static analysis is inherently hard because it tries to find answers to questions about program behavior — about programs that may run for a very long time, if not forever. Given that it is static, this analysis must be done without running the program, so any actual evaluation is a non-starter. How can we make such a problem tractable?
The way that this generally takes place is in approximation. While we cannot run the program itself to find out what is actually happening, standard techniques allow us to gain (possibly imprecise) knowledge of the state of the program. In particular, dataflow analysis, a classic technique, involves an iterative scan over all possible program points to find out properties that may be true at those points.
In order to facilitate this dataflow analysis, static analysis tools need to know something about what paths the program may take during execution. This is achieved by computing a control-flow graph, which is a graph that connects the various parts of the code which may execute after each other. Given a project with many functions, conditionals, and program text in general, however, looping over the entire control-flow graph of a program is not a trivial task. How does this be done in a more optimal way?
Knowing is half the battle
The general mantra that Semgrep follows to facilitate its success is that it only picks battles that it can win.
Program analysis is a never-ending uphill slope because analyses can always be done more deeply, and more compute time can always be thrown into figuring out more things about the program's behavior. In practice, however, for the majority of applications, program analysis need not be particularly deep or theoretically based to be effective in general.
From the beginning of Semgrep, speed has been a major focus. To make sure we stay in line with that, from the beginning, we make sure that we only support features when we know that Semgrep can win, or in other words, that it can be done in a fast way.
A unique niche
In a way, philosophically, Semgrep’s original purpose was in line with this way of thinking. Semgrep occupies a unique niche as a tool that straddles the line between syntactic and semantic, however, it used to be more towards the former. It started as sgrep at Facebook by Semgrep’s own Yoann Padioleau, an open source tool that was to match program text in a semantically-aware way, but which lacked some of the modern-day features Semgrep possesses, such as constant propagation and taint analysis.
This original focus granted Semgrep a unique perspective on program analysis, as sgrep didn’t need to solve many of the problems that other SAST applications aimed to do. Since it was a matching tool, there was no need to be aware of code flow at all, which is the main bottleneck in terms of program analysis. Since it only focused on text, there was no need to have any kind of understanding of programs beyond single files—necessarily, there was no need even to require that analyzed code compiled. This also granted other advantages, such as partial parsing, where programs that are not syntactically correct can be parsed to trees that look “close enough”. This overall had the advantage of making Semgrep an extremely versatile and robust tool, from the get-go.
In addition, the unique capabilities of Semgrep as a tool for code review automation, beyond just vulnerability-finding, necessitated that it be able to run quickly. In general, code analysis can occur on a nightly basis, with the goal of reporting any possible errors by the beginning of the next morning, so that engineers can triage those findings. This gives a sizeable 12-hour period for static analysis, which permits a large range of complex inquiries. Semgrep’s role as a customizable tool that catches vulnerabilities on each pull request means that it isn’t working in nearly the same time frame—developers need to be able to interface with it as a regular part of their workflow of merging commits and writing code. This gives an upper limit of minutes, as opposed to hours, for Semgrep. So not only does Semgrep only pick battles it can win—it must.
Findings, faster
Speed is an admirable goal for any static analysis tool. A more interesting question, however, is how this is achieved.
In a similar sense, the fact that Semgrep lives so close to the syntax of a program helps again. One of the most helpful improvements made for Semgrep’s speed was in recognizing this. Whereas an arbitrary static analysis may not know specifically where a bug may occur and thus have to check all of a given program, Semgrep rules are typically written with some amount of the desired text to find—for instance, they may contain the name of a sensitive function or some particular primitive in a language.
This characteristic made it possible to speed up Semgrep scans by only searching files that are known to contain literal text which matches that in the patterns. For instance, consider the following Semgrep rule which looks for a hardcoded tmp
directory within an open
:
source: Semgrep rule
The pattern which this rule looks for involves a call to open
, which can only occur if the literal string open
occurs anywhere in the given file. This can easily be tested in linear time, resulting in Semgrep being able to skip files that are known to be impossible to produce a match in, due to this property, which significantly improves scan times. In this case, it’s interesting how purely syntactic searches can supplement more semantic searches!
Another significant speed benefit occurs from the inherent nature of the problem. Semgrep’s core engine is written in OCaml, a functional programming language because functional languages are ideal for the kind of structural decomposition on recursive data (the target program) that Semgrep’s main matching engine needs to do. This engine is used to provide raw matching data to the Python wrapper, which would then do the work of combining and analyzing the matches using the rule’s pattern combinators. This work is merely more structural decomposition on recursive data (the pattern), however, and another performance boost was gained upon porting that section of the logic to OCaml.
Semantics, speedily
Tree matching has a nearly negligible cost when compared to most deep program analysis techniques, such as pointer analysis or symbolic execution, so this was clearly a winning battle. As Semgrep grew more advanced, more features were added which caused it to err closer to the side of semantics, such as taint analysis and constant propagation.
These analyses are not necessarily ones that can be done quickly. Taint analysis, in particular, requires running dataflow analysis over the entire control flow of a program, which can potentially be huge, when considering how large an entire codebase may be, with all of its function calls and tricky control flow logic. To do taint analysis in this way would be to pick a losing battle.
Semgrep succeeds in that it only carries out single-file analysis, so the control flow graph never exceeds the size of a file. In addition, taint can be done incrementally. Functions have well-defined points where they begin and end, as well as generally well-defined entrances in terms of the data they accept (the function arguments). Thus, Semgrep collects taint summaries, which essentially (per function) encode the information about what taint may be possible, depending on the taint of the inputs that flow in.
So for instance, given a function in Python:
def foo(a, b):
sink(a)
return None
A taint summary for this function foo
will note that, if the input a
is tainted, then it will reach the tainted sink
sink. Regardless of how the function foo
is used, this is a fact about the function’s potential use. Then, at the call site to the function, if the input a
is tainted, then we know to report a finding.
This seems simple, but the end result is that we only ever need to run a dataflow analysis on the code of each function once. Never does a taint summary need to be collected for a given function more than once, meaning that we can simply stitch all these facts together at the end, making for speedy results. The core lesson is that, by collecting summaries and doing taint in an intelligent way, we pick the battle that we can win. It turns out that for our upcoming inter-file extension to Semgrep, by applying this approach in an inter-file manner, we can still reap the speed benefits. This lets us avoid running dataflow analysis on an entire control-flow graph, and instead do small, localized analyses.
Conclusion
At the end of the day, solving an undecidable problem at a pace that is useful to security engineers is a hard task. Harder still is solving an undecidable problem at a pace that is useful to the developer, such that it can fit into the normal cycle of writing code and pushing commits. Semgrep’s unique philosophy as a tool has made it capable of bridging this gap, and over time, has proven it to have a speed that is nothing short of ludicrous.