The birth of Semgrep Pro Engine

TL;DR: Semgrep Code specializes in SAST solutions to help developers secure their code. Of all our projects, adding interfile analysis in a way that achieves our developer-focused goals without the aid of the open-source community has been the hardest. To succeed, we had to develop against a focused benchmark of real vulnerabilities before iterating with users thoughtfully.

At Semgrep, our goal is to make security that works for developers. To that end, we’ve been developing Semgrep open source, a fast and easy-to-learn tool that allows you to search for patterns in your code with semantic understanding. With Semgrep, developers can work hand in hand with security engineers because they can understand, maintain, and improve the checks that run on their code.

One limitation of Semgrep open source is that it operates on one file at a time. We added this limitation intentionally. We believe that the future of security lies in using secure frameworks and defaults (for example, catching cases where dangerouslySetInnerHTML was used), so that code will be mostly safe from XSS in React no matter what happens in the other files. However, interfile analysis enables other ways of securing code that we also believe are important. Accordingly, we decided to create an option for interfile analysis so that people could choose the analysis that suited their purpose.

In adding interfile analysis, our main concern was that we wanted to make something truly useful for developers, not just an engine we thought was cool. This has always been our priority at Semgrep, which is why we work with developers and security researchers to add all our features. It turns out, though, that interfile analysis is highly technical even by our usual standard, and iterating with users the way we were used to was much harder.

Semgrep Pro Engine was released on February 14, introducing open access for interfile analysis in Semgrep for the first time. Here’s how we did it—and the mistakes we made along the way.

First try

We first began working on interfile analysis as an experiment within the program analysis team. Two of the engineers there implemented a proof of concept for interfile Java analysis. They identified three main features which became very powerful when interfile information was available:

Type inference
Constant propagation
Taint analysis
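
To make the three features concrete, here is a minimal sketch of where each one needs information from another file. It is not code from the proof of concept: the classes Tables, QueryBuilder, and Handler are hypothetical, and in a real project each would live in its own file.

```java
// Hypothetical classes for illustration; in a real project each would be its own file.

// Tables.java
final class Tables {
    // Constant propagation: an engine that reads this file knows USERS is always "users".
    static final String USERS = "users";
}

// QueryBuilder.java
final class QueryBuilder {
    // Built only from constants, so it is safe once Tables.USERS is resolved.
    static String countAll() {
        return "SELECT COUNT(*) FROM " + Tables.USERS;
    }

    // Built from a caller-supplied value: whether this is dangerous depends on
    // what callers in other files pass in.
    static String findByName(String name) {
        return "SELECT * FROM " + Tables.USERS + " WHERE name = '" + name + "'";
    }
}

// Handler.java
final class Handler {
    // Taint analysis: userInput flows into QueryBuilder.findByName, which is
    // defined in a different file, and comes back inside the SQL string.
    static String handle(String userInput) {
        // Type inference: the type of `query` (String) is only known from
        // findByName's signature in QueryBuilder.java.
        var query = QueryBuilder.findByName(userInput);
        return query;
    }
}
```

Looking at Handler.java alone, a tool cannot tell whether the returned string is attacker-controlled SQL. With the other two files available, it can separate the constant-only query from the one that embeds user input.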

With these implemented, they wanted to know whether these features were as impactful as we had hoped. Accordingly, we found a few potential users and showed them a demo with some of our small test cases. The users were really excited! They asked to try out our interfile engine on their code. We happily sent them a binary and left them to run it, using whatever rules they saw fit. In each case, they got identical results to Semgrep.

Ok, that’s good to know. Time to iterate! What kinds of code snippets were they expecting us to match?

At this point, we tended to get one of two responses:

No response
A handful of extremely different (and extremely difficult) cases.

Well, any examples were helpful. We just needed to figure out how to prioritize them. Could we get some more examples?

Consistently, when we asked for more examples, we’d get the same answer: “Sure, when we have time!”

(Spoiler alert: they didn’t have time.)

Hm. It is a lot to ask for. We don’t need all our examples to come from the same teams, though. Let’s tell more people about what we’re working on so they get excited and want to work with us!

Accordingly, we launched our DeepSemgrep closed beta.

Now, we had tons of interest! Many people wanted to test our engine. Ruby developers, Python developers, JavaScript developers, PHP developers, and C++ developers all wanted to test our engine. We had so many potential users!

But a potential user could only become a user if we supported their language, and adding interfile understanding of a language takes time. At this point, we only supported Java. Even worse, it was impossible to know whether a user would become a good development partner without having some support. We began to be distracted, trying to add languages depending on what users asked for that week.

The problem was that our way of getting feedback was less effective here, though we hadn’t realized it. The program analysis team had always developed features in conjunction with outside users. Since our tool was open source, it was easy to have a dialogue with existing users. They gave us an example of the rule and code they wanted to match, and we decided if it made sense to implement. Within the bounds of single-file analysis, a request was usually either clearly reasonable or clearly impossible.

Now, we were trying to change our bounds. What we should change those bounds to depended on what features would be most effective, which is not obvious. Consider the following things people commonly do in code:

Mutate values in loops
Inherit from classes
Alias variables

These are all extremely common operations, and they have also all been present in toy examples people have given us. However, it wasn’t clear to us if a precise analysis of these constructs was necessary to produce a significant improvement over Semgrep. After all, two items on the list above, mutating values in loops and aliasing variables, can each make it impossible to figure out whether a program will even halt.
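
Two of these constructs, aliasing and mutation in a loop, fit in a few lines of Java. The sketch below is a made-up example (AliasingDemo and its query text are hypothetical), meant only to show why a precise analysis of them is expensive.

```java
import java.util.List;

final class AliasingDemo {
    static String buildQuery(List<String> userValues) {
        StringBuilder query = new StringBuilder("SELECT * FROM items WHERE id IN (");

        // Aliasing: `alias` and `query` name the same object, so anything
        // appended through `alias` also changes what `query` holds.
        StringBuilder alias = query;

        // Mutation in a loop: the final contents depend on how many
        // iterations run and on every value appended along the way.
        for (int i = 0; i < userValues.size(); i++) {
            if (i > 0) {
                alias.append(", ");
            }
            alias.append(userValues.get(i));
        }
        alias.append(")");

        // No tainted value is ever appended to `query` by name, yet the
        // string returned here contains user input.
        return query.toString();
    }
}
```

A tool that ignores aliasing misses the flow entirely, while a tool that models it exactly has to reason about every iteration of the loop.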

We didn’t know what to do.

The importance of being organized

To break out of this bind, we decided to reframe the problem. We learned a lot from our previous customer interaction with Semgrep’s interfile analysis. We knew that we couldn’t rely on just customer feedback, and we were determined to get the direction we needed.

This time, we decided to turn the question around. Instead of starting from interfile analysis and asking users whether it would work for them, we narrowly defined our goal and aimed to design a tool to achieve that goal. In this case, the goal was “better results for our users”. To turn that into something we could develop from, we decided to create a benchmark with code that was representative of most open-source repositories.

What we did

The first thing we did was limit the scope of the benchmark. The benchmark was potentially a huge amount of work in itself. Our goal was to choose a subset of possibilities that was small enough that we could exhaustively explore it but significant enough that tailoring a tool towards the benchmark would produce a useful result. We ended up choosing two constraints:

The language was Java - a popular language
The vulnerability class was SQL injection (SQLi) - a common class that is similar to other injection-style vulnerabilities

From there, we pulled relevant repositories and annotated the lines that should or should not match. Furthermore, we annotated what type of analysis would help catch these vulnerabilities (constant propagation, interfile taint analysis, etc). This helped us narrow down which features to focus on developing further.

We had four sources for those repositories:

CVEs - we searched the National Vulnerability Database for vulnerabilities (CVEs) filed as SQL injection
Historical commits - we scraped GitHub for commits that included terms related to “fix … SQL injection …”
Purposefully vulnerable repositories - we added some commonly used purposefully vulnerable repositories, such as Vulnado
Known false positives - we added some repositories that SQLi Semgrep rules flagged on, but we identified them as false positives

Using these annotations, we produced a scorecard with the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) for each tool.

Now, we had a definition of better results—a better scorecard.
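
The tallying step can be sketched as follows. This is a simplified illustration, not the benchmark tooling itself: the Scorecard class, the "file:line" location keys, and the tally method are all assumptions made for the example.

```java
import java.util.Map;
import java.util.Set;

final class Scorecard {
    final int tp, fp, tn, fn;

    Scorecard(int tp, int fp, int tn, int fn) {
        this.tp = tp;
        this.fp = fp;
        this.tn = tn;
        this.fn = fn;
    }

    // annotations: location -> true if the line should match (vulnerable),
    //              false if it should not match (known safe).
    // findings:    locations a tool actually reported.
    static Scorecard tally(Map<String, Boolean> annotations, Set<String> findings) {
        int tp = 0, fp = 0, tn = 0, fn = 0;
        for (Map.Entry<String, Boolean> entry : annotations.entrySet()) {
            boolean shouldMatch = entry.getValue();
            boolean reported = findings.contains(entry.getKey());
            if (shouldMatch && reported) {
                tp++;          // vulnerable and found
            } else if (shouldMatch) {
                fn++;          // vulnerable but missed
            } else if (reported) {
                fp++;          // safe but flagged
            } else {
                tn++;          // safe and left alone
            }
        }
        return new Scorecard(tp, fp, tn, fn);
    }
}
```

Only annotated locations are counted in this sketch, so findings on unannotated lines are ignored rather than treated as false positives.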

There was one more thing we needed to do. As soon as we ran a benchmark, we realized we couldn’t just use the same rules as usual. Many of our current rules contained l33t h4cks to make up for the fact that the engine did not perform interfile analysis. For instance, they might use a function definition as a taint source, which helped us match more in OSS but would cause many false positives when run with interfile Semgrep. Accordingly, we created two sets of rules, one meant to run within a single file and one meant to run interfile, and benchmarked both. (Today, the latter are indicated with interfile: true in the metadata.)

From a benchmark to an engine

Time to test our interfile analysis? Not quite yet. Before we jumped to fixing bugs in the interfile engine, we wanted to get the big picture of what we needed to do to get good results. The benchmark was great for a score but had too much information to digest quickly. Instead, for each annotation, we asked the question, “What code operations would an effective tool need to understand to get this right?” We cataloged what operations were relevant (“line 47 calls a method from HttpRequestWrapper that it inherited from HttpRequest”), and grouped those operations into categories (“calling inherited methods for child class”).

For the record, this was a slog. Java call stacks sometimes look like this:

Figure 1: Java call stack example (source)

Out of hundreds of lines of code, we identified 11 classes of operations. Notably, very few of them were about deep computation. We didn’t need to compute if-branches or track values of an array. Only purposefully vulnerable repositories required that kind of analysis. In real repositories, it was more important to understand inheritance or improve our type inference. Even in the cases we deemed impractical to solve, the difficulty wasn’t needing very deep computation, but needing to understand a complicated repo with services in multiple languages.
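
As an illustration of the inheritance category, consider the sketch below. The classes BaseRequest, WrappedRequest, and OrderDao are hypothetical stand-ins, not code from the benchmark, and each would normally sit in its own file.

```java
// Hypothetical classes for illustration; in a real project each would be its own file.

// BaseRequest.java
class BaseRequest {
    private final String rawParam;

    BaseRequest(String rawParam) {
        this.rawParam = rawParam;
    }

    // The user-controlled value is exposed by a method on the parent class.
    String getParam() {
        return rawParam;
    }
}

// WrappedRequest.java
class WrappedRequest extends BaseRequest {
    WrappedRequest(String rawParam) {
        super(rawParam);
    }
    // getParam() is not overridden here, so calls on a WrappedRequest
    // resolve to BaseRequest.getParam().
}

// OrderDao.java
final class OrderDao {
    static String queryFor(WrappedRequest request) {
        // To know that `id` is user-controlled, a tool has to look up
        // getParam() on WrappedRequest, find nothing, and follow the
        // `extends` edge to BaseRequest in another file.
        String id = request.getParam();
        return "SELECT * FROM orders WHERE id = " + id;
    }
}
```

Nothing here needs deep computation. The hard part is only resolving which method a call refers to across files.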

This meant that the interfile engine was already set up well to solve these problems! It just needed a broader understanding of Java. We prioritized adding analysis for the operations the engine didn’t understand, implemented the ones that were practical, and reran the benchmark.

(Strictly speaking, we started fixing bugs before we went through all the code :) Sometimes engineers need treats too.)

With those changes, interfile Semgrep performed much better than before and, in fact, much better than Semgrep itself.

Semgrep matrix
TP: 3 TN: 12 FP: 0 FN: 27
Overall score: 10%

Interfile Semgrep matrix
TP: 17 TN: 12 FP: 0 FN: 13
Overall score: 56%
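
One reading of the overall score that fits both matrices is the share of annotated vulnerabilities that were found, TP / (TP + FN), rounded down to a whole percent. The sketch below assumes that formula; the BenchmarkScore class is hypothetical and only checks the arithmetic.

```java
final class BenchmarkScore {
    // Assumption: score = TP / (TP + FN), truncated to a whole percent.
    // Integer division performs the truncation.
    static int percent(int truePositives, int falseNegatives) {
        return (100 * truePositives) / (truePositives + falseNegatives);
    }

    public static void main(String[] args) {
        System.out.println("Semgrep: " + percent(3, 27) + "%");
        System.out.println("Interfile Semgrep: " + percent(17, 13) + "%");
    }
}
```

Under this reading, 3 of 30 vulnerabilities gives 10%, and 17 of 30 gives 56%. The TN and FP counts are the same in both matrices, so they do not affect the comparison.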

To the moon

Now we had something that worked. Now we had a solution for a problem that we could articulate. It was time, again, to talk to external users.

We decided to not only launch Java, the language we had focused on developing, but also to try to launch JavaScript and within-a-file interprocedural taint analysis in all languages. And in order to launch these new features, we needed more people. The Semgrep Pro Engine team expanded rapidly, and we also started reaching out to people to test the engine.

The best audience to test this tool was people who like testing stuff (and hacking stuff) — security researchers! So the security researchers on the Semgrep Pro Engine team reached out to their networks (and forced their colleagues to also reach out to their networks), to try to get people to test out our engine.

The feedback that we received was very valuable. We were able to fix our release process, add new rules, and validate the need for ~~speed~~ JavaScript with what we were hearing from our alpha testers. And most importantly, we learned that other people were able to use our engine to find real vulnerabilities.

A tale of two teams

So, we were done, right? Our engine worked. Time to release!

Not so fast, said our ~~pesky~~ extremely awesome and amazing product manager.

The security researchers that tested the interfile engine were great for validating how the engine worked. They weren’t great examples of our usual customer profile—people looking not for a semantic search tool but for a SAST product to augment their security team’s capabilities. They needed an engine, but they also needed rules. They also needed good defaults. They also needed things to integrate well into their pipelines. As we tested with them, we found that we still had much to do. This led to a flurry of activity by our engineers.

Security Research (SR) team

The security researchers were typing away in the dark, creating curated sets of taint rules to find vulnerabilities. The security team wanted to focus on specific frameworks and vulnerability classes so that the scope of the rules research wasn’t too broad and the work could be completed by the launch date. On the Java side, SR focused on injection-style vulnerabilities such as SQL injection, command injection, code execution, SSRF, XXE, and path traversal. For frameworks, we focused on the two most commonly used in Java: Spring and Servlets.

Furthermore, we compared Semgrep results with results from other tools on vulnerable repositories and filled in any gaps. As we did our research and wrote these rules, security research would occasionally stumble across some issues in the engine, and that’s where PA would come in.

Program Analysis (PA) team

<cut scene to another dark room with a bunch of engineers> On the other side, the PA engineers were busy fixing the bugs that security researchers filed.

The PA engineers were also busy analyzing the performance of Semgrep Pro Engine on a bunch of open-source repositories and improving engine stability. Most of our customers would run our engine in CI, where there are limits on memory and runtime. We wanted to make sure they got results, no matter what.

Ultimately, the tight collaboration between the Security Research and Program Analysis teams is what made the product and the launch successful. As our work on the engine progressed, so did our testers’ results. We were able to catch real SQL injection vulnerabilities in user code and open-source code using this engine! Govtech, a development partner, stated that two of our rules caught 6 SQL injection vulnerabilities in one of their Java Struts applications with JSP. These two rules, tainted-sql-from-http-request and formatted-sql-string, had high true positive rates reported by multiple customers.

Furthermore, the features and examples we gave during our demo of the engine received better feedback as we improved the user workflow. We were able to prioritize features such as JSP support and language support based on the conversations we had with others, which ultimately helped us build a coherent product roadmap.

Tell me what you got

After working on this engine for over 6 months, we successfully launched interfile analysis for Java and JavaScript on February 14.

Ultimately, this is what we had to deliver:

Java support: interfile engine and taint rules
JavaScript support: interfile engine and taint rules
Pro Engine in the Editor

Here’s an example of what we can catch now:

https://semgrep.dev/playground/s/ezAX

With taint mode in Java, we can do interfile and interprocedural analysis. The rule above, for instance, catches cases of command injection with the pattern:

mode: taint

pattern-sources:
  - patterns:
      - pattern-either:
          - pattern-inside: |
              $METHODNAME(..., @$REQ(...) $TYPE $SOURCE,...) {
                ...
              }
          - pattern-inside: |
              $METHODNAME(..., @$REQ $TYPE $SOURCE,...) {
                ...
              }
      - metavariable-regex:
          metavariable: $REQ
          regex: ^(RequestBody|PathVariable|RequestParam|RequestHeader|CookieValue|ModelAttribute)$
      - pattern: $SOURCE
pattern-sinks:
  - pattern: |
      (ProcessBuilder $PB).command(...);

We see that the sources consist of Spring user-inputted variables (RequestBody, RequestParam, etc.), and that the sink is the ProcessBuilder command(...) call, which sets the program and arguments of a system process to run. If a user-inputted variable gets passed into such a command, attackers could inject system commands and steal data from the server. This would count as a command injection vulnerability.

Previously, this rule could match examples like this:

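import org.springframework.web.bind.annotation.*;

class Test {
    @RequestMapping(value = "/test")
    String test(@RequestParam String input) {
        ProcessBuilder processBuilder = new ProcessBuilder();
        // User input is concatenated directly into the shell command
        String cmd = "/path/to/folder/ '" + input + "'";
        // MATCHES
        processBuilder.command("bash", "-c", cmd);
        return "done";
    }
}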

In this example, user input in the variable input is passed into the cmd string and later used to execute a ProcessBuilder command. This is definitely a command injection vulnerability.

However, Semgrep OSS cannot match cases like this where the pattern we want to catch is split between two files:

File 1:

package com.test;

import org.springframework.web.bind.annotation.*;
import org.springframework.boot.autoconfigure.*;

@RestController
public class Test1 {
    @RequestMapping(value = "/test1")
    String test1(@RequestParam String input) {
        // test2 returns void, so call it and return a separate response
        Test2.test2(input);
        return "done";
    }
}

File 2:

package com.test;

public class Test2 {
  public static void test2(String input) {
    ProcessBuilder processBuilder = new ProcessBuilder();
    String cmd = "/path/to/folder/ '" + input + "'";
    processBuilder.command("bash", "-c", cmd);
  }
}

But Semgrep Pro Engine can!!

Figure 2: Semgrep Pro Engine matches pattern across files

From the example above, it’s clear that we can capture interfile examples with the taint rules we have written. These taint rules were all crafted with the philosophy of “less is more”. We wanted to reduce false positives of these alerts as much as possible by making sure that if these rules matched, the matches were definitive vulnerabilities as often as possible. To achieve this, we ran them on thousands of open-source repositories and triaged results to ensure these rules wouldn’t cause alert fatigue in our potential application security users.

What’s next?

From just the above example, you can see that we’re not yet done with our work on the Pro Engine. Some features we are thinking about adding in the future include:

Multi-language analysis for Java + JSP
Interfile Python support
Faster scan times

If any of these appeal to you, please let us know. Meanwhile, we are working on supporting Golang and other features.

You’re a champ!

If you’ve actually gotten to the end of this post (reading everything), congrats! I admire you!

Overall, the journey from validating our interfile analysis engine to launching support for interfile Java and JavaScript analysis has been a long but entertaining one. Not only did we do our share of looking through infinite Java call stacks and fixing numerous engine errors, but we also put ourselves out of our comfort zone by talking to customers and fully launching this feature.

On the way, we’ve realized that creating a product that customers will use is not easy. It’s not just about building out something that is usable — in fact, the code was the easiest part. It’s about talking to people about what they want, failing to address their needs, fixing your product so that you do address their needs, talking to people again, and doing it all over in an eternal cycle.

If you’re planning on doing something similar in the future, we wish you the best of luck and hope that this blog post was helpful in illuminating what the journey is like.

For the rest of you, thank your product managers.

PS: To learn more about all features of Semgrep Pro Engine and example rules, please check out the product page.
