TL;DR: Semgrep has achieved remarkably fast scan times by prioritizing speed using methods like taint summaries and tree matching in OCaml. In addition, Semgrep’s design as a tool that searches for syntax makes it fast due to designs like purely textual single-file analysis, partial parsing, and optimizations like skipping files that cannot produce matches.
Program analysis is an extremely interesting discipline that aims to combine an impossible task (finding undesirable parts of programs) with practical usage (being useful to developers to fix their code). Practical usage takes many forms ranging from convenience of information and quality of findings to the speed at which the analysis is carried out.
At Semgrep, we have one motto which we stick to — “code analysis at ludicrous speed”. After almost 3 years of development, a question remains—what goes into making a code analysis product that can run at “ludicrous speed”, and have we achieved that goal with Semgrep?
Ludicrous graphs
How do we qualify “ludicrous speed”? Some results for Semgrep’s speed can be seen here, in graphic form:
Here is a graph of Semgrep’s scan time (in seconds) for the Django Python repository, over time. This data serves as a direct reflection of Semgrep’s growth over the past year, as various optimizations and engine upgrades have been carried out:
Figure 1: Semgrep scan time (in seconds) for Django Python repository - source
And for the lodash JavaScript repository:
Figure 2: Semgrep scan time (in seconds) for lodash JavaScript repository - source
Here’s the performance of Semgrep on all of its benchmarking repositories over time:
Figure 3: Semgrep scan time for different repositories
Over time, Semgrep has been making a consistent effort towards increased performance. In all benchmarked cases, that scan time using the latest Semgrep version takes place in less than 20 seconds, which is a significantly short enough period to run within a developer’s normal commit workflow.
Here’s some data validating Semgrep’s run-time (in Python, on the Django repository) against some other open-source Python analysis tools. All data are averaged and sourced from tests run on an M1 Mac machine.
Figure 4: Semgrep scan time as compared to other Python analysis tools on Django repository
In this graph, we see that Semgrep performs quite fast, beating out the other tools. It’s worth noting that pylint
and flake8
are linting tools, which primarily work in the realm of style enforcement, which notably is not concerned with the behavior of the program, like Semgrep. With features like taint analysis, constant propagation, and dataflow analysis, it’s a fair description that Semgrep performs more computationally intensive analysis than the other options. More than just curiosity, Semgrep’s speed has made it feasible to be run by existing organizations in production to shift left.
Static analysis at scale
What makes a static analysis (SAST) tool fast? Well, it’s useful to look at what may make a static analysis tool slow. SAST applications have to be able to process large amounts of source code, break it down into a format suitable for analysis, and then run detailed semantic scans on it.
This analysis may also be done in a dynamic fashion, where programs are instrumented to detect certain faults during their runtime, but this has the disadvantage of adding extra overhead to the analyzed program, as well as operating closer to the hardware level, as opposed to the language level. In this article, we will focus on static analysis, and stay closer to the language level, which will yield dividends later on with Semgrep.
Static analysis is inherently hard because it tries to find answers to questions about program behavior — about programs that may run for a very long time, if not forever. Given that it is static, this analysis must be done without running the program, so any actual evaluation is a non-starter. How can we make such a problem tractable?
The way that this generally takes place is in approximation. While we cannot run the program itself to find out what is actually happening, standard techniques allow us to gain (possibly imprecise) knowledge of the state of the program. In particular, dataflow analysis, a classic technique, involves an iterative scan over all possible program points to find out properties that may be true at those points.
In order to facilitate this dataflow analysis, static analysis tools need to know something about what paths the program may take during execution. This is achieved by computing a control-flow graph, which is a graph that connects the various parts of the code which may execute after each other. Given a project with many functions, conditionals, and program text in general, however, looping over the entire control-flow graph of a program is not a trivial task. How does this be done in a more optimal way?
Knowing is half the battle
The general mantra that Semgrep follows to facilitate its success is that it only picks battles that it can win.
Program analysis is a never-ending uphill slope because analyses can always be done more deeply, and more compute time can always be thrown into figuring out more things about the program's behavior. In practice, however, for the majority of applications, program analysis need not be particularly deep or theoretically based to be effective in general.
From the beginning of Semgrep, speed has been a major focus. To make sure we stay in line with that, from the beginning, we make sure that we only support features when we know that Semgrep can win, or in other words, that it can be done in a fast way.
A unique niche
In a way, philosophically, Semgrep’s original purpose was in line with this way of thinking. Semgrep occupies a unique niche as a tool that straddles the line between syntactic and semantic, however, it used to be more towards the former. It started as sgrep at Facebook by Semgrep’s own Yoann Padioleau, an open source tool that was to match program text in a semantically-aware way, but which lacked some of the modern-day features Semgrep possesses, such as constant propagation and taint analysis.
This original focus granted Semgrep a unique perspective on program analysis, as sgrep didn’t need to solve many of the problems that other SAST applications aimed to do. Since it was a matching tool, there was no need to be aware of code flow at all, which is the main bottleneck in terms of program analysis. Since it only focused on text, there was no need to have any kind of understanding of programs beyond single files—necessarily, there was no need even to require that analyzed code compiled. This also granted other advantages, such as partial parsing, where programs that are not syntactically correct can be parsed to trees that look “close enough”. This overall had the advantage of making Semgrep an extremely versatile and robust tool, from the get-go.
In addition, the unique capabilities of Semgrep as a tool for code review automation, beyond just vulnerability-finding, necessitated that it be able to run quickly. In general, code analysis can occur on a nightly basis, with the goal of reporting any possible errors by the beginning of the next morning, so that engineers can triage those findings. This gives a sizeable 12-hour period for static analysis, which permits a large range of complex inquiries. Semgrep’s role as a customizable tool that catches vulnerabilities on each pull request means that it isn’t working in nearly the same time frame—developers need to be able to interface with it as a regular part of their workflow of merging commits and writing code. This gives an upper limit of minutes, as opposed to hours, for Semgrep. So not only does Semgrep only pick battles it can win—it must.
Findings, faster
Speed is an admirable goal for any static analysis tool. A more interesting question, however, is how this is achieved.
In a similar sense, the fact that Semgrep lives so close to the syntax of a program helps again. One of the most helpful improvements made for Semgrep’s speed was in recognizing this. Whereas an arbitrary static analysis may not know specifically where a bug may occur and thus have to check all of a given program, Semgrep rules are typically written with some amount of the desired text to find—for instance, they may contain the name of a sensitive function or some particular primitive in a language.
This characteristic made it possible to speed up Semgrep scans by only searching files that are known to contain literal text which matches that in the patterns. For instance, consider the following Semgrep rule which looks for a hardcoded tmp
directory within an open
: