Testing autofix behavior of SAST rules

Automatically test the autofix behavior of custom Semgrep rules

Pieter De Cremer
August 3rd, 2022

Writing custom rules for Semgrep is very easy, and you can write automated tests to ensure these rules keep functioning as expected. A cool feature of Semgrep rules is autofix, which allows Semgrep to automatically transform code that violates the rule into a functionally equivalent version that is compliant. However, until now, there was no way to automatically test the autofix part of a rule. Introducing: autofix tests!

Figure 1: Autofix tests

Back in 2016, while I was still in university, I interned at a company that was building a tool comparable to Semgrep. My task was to write rules for this tool, which was still very early in development. Writing rules was not as easy as it is today with Semgrep, but I had gotten pretty good at it. I was hired to keep doing this while finishing my studies. In the internship report, my biggest criticism of the tool was that there was no automated way to verify that rules would still function in later versions of the tool. What's more, after graduating I went on to pursue a Ph.D. at this company (in Belgium, there is a special grant that allows you to pursue a Ph.D. at a company), and I still gave that same feedback in my thesis in 2021 (see Section 7.2.4).

Testing rules in Semgrep

You can imagine that I was happy to see that this functionality already existed in Semgrep when I joined r2c after graduating from my Ph.D. program! Not only does the functionality exist, but almost every rule written actually has a bunch of tests. It is easy to see why this functionality is used so frequently, because the implementation of these rule tests is very intuitive and straightforward, and it fits perfectly in a normal rule-writing workflow.

To write a rule, a rule-writer will start by writing a few code fragments that they expect to be marked, and some that they expect not to be marked by the rule. This then allows for an iterative process to test the rule and refine it until the desired behavior is achieved. To automate this testing process with Semgrep, the rule-writer only needs to add a few annotations in the comments of the code.

For example:

# ruleid: my-rule
action(secure=False)

# ok: my-rule
action(secure=True)

When the rule is saved in my-rule.yaml, this test code can be saved in my-rule.py, or whichever is the appropriate file extension for the target language.
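To make the idea behind these annotations concrete, here is a minimal sketch of how such a test file could be checked. It is not Semgrep's actual implementation: the `expected_lines` helper, the regular expression, and the `reported` set of finding lines are all hypothetical, and the sketch assumes that an annotation applies to the line directly below it.

```python
import re

# Matches annotation comments such as "# ruleid: my-rule" or "# ok: my-rule".
ANNOTATION = re.compile(r"#\s*(ruleid|ok):\s*(\S+)")


def expected_lines(test_source: str, rule_id: str) -> dict[str, set[int]]:
    """Return the line numbers that should ('ruleid') and should not ('ok') be flagged."""
    expected: dict[str, set[int]] = {"ruleid": set(), "ok": set()}
    for number, line in enumerate(test_source.splitlines(), start=1):
        match = ANNOTATION.search(line)
        if match and match.group(2) == rule_id:
            # Assumption: an annotation applies to the line directly below it.
            expected[match.group(1)].add(number + 1)
    return expected


TEST_SOURCE = """\
# ruleid: my-rule
action(secure=False)

# ok: my-rule
action(secure=True)
"""

expected = expected_lines(TEST_SOURCE, "my-rule")
reported = {2}  # hypothetical: lines on which the rule reported a finding

missed = expected["ruleid"] - reported      # annotated lines that were not flagged
unexpected = reported - expected["ruleid"]  # flagged lines without a ruleid annotation
print("pass" if not missed and not unexpected else "fail")
```

The test passes only when the flagged lines are exactly the lines annotated with `ruleid`, which is the behavior the annotations describe.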

Adding these annotations takes no effort at all. Kudos to the r2c team, I really believe this is an excellent implementation of a great feature! This allows for automated testing of the rules to ensure that the right code fragments are marked after new releases of Semgrep, or future updates to the rules themselves.

However, one thing that was still missing was the automated testing of autofix behavior. Since we, as members of the security research team at r2c, can spend 10% of our time working on any initiative we want, I decided to take on this task myself. And of course, I want to live up to r2c’s standards and make this an excellent feature that is as easy to use as possible.

Autofix testing in Semgrep

Even though autofix is still officially an experiment, people who are familiar with my research will know that it is a feature I very much believe in. I think that for developers to love a security tool, it should not just slap them on the wrist when they make a mistake, but actively help them fix it while losing as little productivity as possible. What better way to do this than to automatically fix the mistake for them?

So, what does a rule-writer do when they are writing a fix? They first copy the test code to a new file so that they can apply the fix without messing with the original. After applying the fix, they compare the result to the original test code. They might need to do this a few times to update the fix in order to achieve the desired behavior.

With the autofix testing feature, automating the process of testing a fix is as easy as using the right filename of the copy of the test code. To test the autofix behavior, all you need to do is create a file my-rule.fixed.py, or whichever was the file extension of the original test code. This file must contain the desired outcome of applying an autofix to the test code in my-rule.py. Semgrep detects this file and does the necessary comparisons. If the autofix does not behave as expected, Semgrep prints a clear line diff to help refine the rule.

For example:

0/1: 1 fix tests did not pass:
--------------------------------------------------------------------------------
    ✖ my-rule.fixed.py <> autofix applied to my-rule.py

    ---
    +++
    @@ -1 +1 @@
    -action(secure=True)
    +action(secure=False)
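The comparison behind such a report can be sketched with Python's standard `difflib` module. This is only an illustration of the idea, not how Semgrep implements it: the `fix_test_diff` helper and the two example strings are hypothetical stand-ins for the contents of my-rule.fixed.py and the result of applying the autofix to my-rule.py.

```python
import difflib


def fix_test_diff(expected_fixed: str, actual_fixed: str) -> list[str]:
    """Return a unified line diff; an empty list means the fix test passes."""
    return list(
        difflib.unified_diff(
            expected_fixed.splitlines(),
            actual_fixed.splitlines(),
            lineterm="",
        )
    )


# Hypothetical contents of my-rule.fixed.py (the desired outcome).
expected_fixed = "action(secure=True)"
# Hypothetical result of a faulty autofix applied to my-rule.py.
actual_fixed = "action(secure=False)"

for line in fix_test_diff(expected_fixed, actual_fixed):
    print(line)
```

When the two texts are identical, the diff is empty and the test passes. Otherwise, the diff lines show exactly where the autofix result deviates from the expected file.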

You can find more information and some tips in the docs!

Internal testing of Semgrep rules

Now that we are on the topic of testing rules, I would like to give you an exclusive sneak peek into the types of testing we do internally at r2c. The quality of our rules is one of the top priorities of the security research team. We want to cover a wide array of vulnerabilities in a large number of languages. So, we are continuously adding new rules.

However, it is also necessary to continuously update and improve the existing rules. Semgrep is still actively being developed. Bug fixes and feature additions every week make it possible to write more precise rules, and reduce false positives. On top of that, our research team grows as well. We hire new people, and we learn more about security every day. Finally, we also set increasingly higher standards for our rules, as well as the messages and metadata that need to be included.

So, how do we determine which rules require updating? First of all, there is no better source of data for detecting a suboptimal rule than you, the user! You can report false positives to the security team through the Semgrep App. Don’t forget to add a message to give us some additional information about why you think it is a false positive! You can also report false negatives through Semgrep shouldafound.

Even without your feedback, we can still test the performance of our rules in other ways, for example by scanning a large number of open source repositories. This gives us an idea of how many findings a rule produces and hence how noisy it is. A rule with many findings is probably inaccurate and likely produces a considerable number of false positives, but there is no way to triage all of the findings. To measure false positives and negatives more objectively, we can use purposefully vulnerable repositories such as Juice Shop and WebGoat. These applications exist in most programming languages and usually have a well-documented list of vulnerabilities, which makes it easy to compute false positive and false negative rates. The downside is that they are not always a good representation of actual code, and they often even include code constructs with the sole purpose of thwarting analysis tools. Just take a look at this file from the OWASP Java Benchmark:

private static String doSomething(HttpServletRequest request, String param)
    throws ServletException, IOException {
    String bar = "safe!";
    java.util.HashMap<String, Object> map26903 = new java.util.HashMap<String, Object>();
    map26903.put("keyA-26903", "a_Value"); // put some stuff in the collection
    map26903.put("keyB-26903", param); // put it in a collection
    map26903.put("keyC", "another_Value"); // put some stuff in the collection
    bar = (String) map26903.get("keyB-26903"); // get it back out
    bar = (String) map26903.get("keyA-26903"); // get safe value back out
    return bar;
}
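As for the false positive and false negative rates mentioned earlier, the computation itself is simple once every test case is labeled. The following is a minimal sketch, assuming a benchmark where each test case is documented as vulnerable or safe. The `error_rates` helper, the case names, and the scan result are hypothetical and not taken from any real benchmark.

```python
def error_rates(ground_truth: dict[str, bool], flagged: set[str]) -> tuple[float, float]:
    """Return (false positive rate, false negative rate) for labeled test cases."""
    vulnerable = {case for case, is_vulnerable in ground_truth.items() if is_vulnerable}
    safe = set(ground_truth) - vulnerable

    false_positives = flagged & safe        # safe cases that were flagged anyway
    false_negatives = vulnerable - flagged  # vulnerable cases that were missed

    fp_rate = len(false_positives) / len(safe) if safe else 0.0
    fn_rate = len(false_negatives) / len(vulnerable) if vulnerable else 0.0
    return fp_rate, fn_rate


# Hypothetical labels: True means the test case is documented as vulnerable.
ground_truth = {"case-1": True, "case-2": True, "case-3": False, "case-4": False}
# Hypothetical scan result: the test cases in which the rules reported a finding.
flagged = {"case-1", "case-3"}

fp_rate, fn_rate = error_rates(ground_truth, flagged)
print(f"false positive rate: {fp_rate:.0%}, false negative rate: {fn_rate:.0%}")
```

Here the false positive rate is the share of safe cases that were flagged, and the false negative rate is the share of vulnerable cases that were missed. Other definitions exist, so treat this as one possible convention.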

Conclusion

We are still continuously improving our testing suite, and I have many more things planned. For example, it would also be really fun to look at past CVEs in code and determine if these could have been prevented with the current Semgrep rules. This exact thing happened to my coworker Vasilii and me. When we were researching XXE in Java, a fresh CVE from that week could have been prevented with the rules we were developing at that exact time! But no shame on those developers: XML parsing in Java can get really confusing. That is exactly why this will be the topic of my next blog post, so stay tuned! Meanwhile, you can start using Semgrep to try out the autofix tests.
