Pain-free custom linting: why I moved from ESLint and Bandit to Semgrep

An inside look at writing program analysis using Semgrep

Ulziibayar Otgonbaatar
May 15th, 2020

tldr: Semgrep is an analysis tool that is easy to learn and easy to prototype rules with, and can be adopted across languages.

For anyone who is looking to write a rule or sophisticated analysis using a free analysis tool, I wanted to share my experience of writing AST-based visitor rules in contrast to Semgrep rules.

Having written multiple Flake8 rules for Python3, an ESLint plugin, and poked at Go-AST, I have become familiar with how many AST-based analysis engines and frameworks work. After writing about 10 AST-based visitors, I was struck by the non-intuitive nature of rule writing, regardless of whether it’s in Go, Python, or JavaScript. In contrast, I have written 40-50 rules in Semgrep in a matter of two months, and I am still amazed at the ease of writing rules with it.

For full disclosure, I work at r2c, and we open sourced Semgrep and actively develop it.

Writing code to analyze code != Writing code

When starting with an analysis, one usually has to program an AST-based visitor. If you’re not familiar with what an AST is or what a visitor means, feel free to check out this excellent blog post.

After writing a few visitors, it became obvious that the way I write my program is very different from the way I write program analysis for my program. When I write a visitor, I am essentially writing a graph algorithm that visits the nodes of that graph and applies some logic at each one.

One of the core advantages for me in writing analysis with Semgrep is that I don’t have to be in that mental model of graph algorithms.

I can actually reason about my analysis in the way I write my code.

To clarify the difference in mental models, consider writing an analysis that matches a variable assignment like my_var = myvar().

In a typical AST-based analysis, I’ll write a function that visits each statement in the AST of the program and programmatically specifies when to fire the rule.

...

def visit_Assign(self, node: ast.Assign):
    # Logic of the Flake8 rule goes here
...
...
module.exports = {
    create: function(context) {
        return {
            VariableDeclarator: function(node) {
                // Logic of the ESLint rule goes here
            },
        };
    }
};
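To make the visitor side of this comparison concrete, here is a minimal sketch that uses Python's standard ast module directly rather than Flake8's plugin machinery. The names AssignCallVisitor and find_call_assignments are hypothetical, chosen only for illustration. It records statements shaped like name = call(), which is roughly the case described above.

```python
import ast


class AssignCallVisitor(ast.NodeVisitor):
    """Collects (line, target, callee) for statements shaped like `x = f(...)`."""

    def __init__(self) -> None:
        self.matches = []

    def visit_Assign(self, node: ast.Assign) -> None:
        # Only a single plain-name target whose value is a call to a plain name.
        if (
            len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
        ):
            self.matches.append(
                (node.lineno, node.targets[0].id, node.value.func.id)
            )
        # Keep walking so assignments nested in other statements are visited too.
        self.generic_visit(node)


def find_call_assignments(source: str) -> list:
    """Hypothetical helper: parse the source and run the visitor over it."""
    visitor = AssignCallVisitor()
    visitor.visit(ast.parse(source))
    return visitor.matches
```

Even for this small case, most of the code is about node types and traversal rather than about the code shape being searched for.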

With Semgrep, I write my analysis in the way I would write my code.

my_var = $Y

Given this similarity between the mental models for writing code and for writing its analysis, Semgrep lends itself to being an easy-to-learn, easy-to-prototype rule-writing engine.

Overview of Semgrep

Without diving into the details, the core design decisions made in Semgrep are as follows:

  • Metavariables: used to track a variable across a specific code scope.

  • ... (ellipsis) operator: it abstracts away sequences so I don’t have to sweat the details of a particular code pattern. This means that even my simple rules can match very complex code blocks. Hence, less is more when writing Semgrep rules.

  • Smart matching: Semgrep uses different pattern matchers depending on the code pattern I write. If I target a function with a pattern like def $FOO(...): ... it will match function declarations. If I write a statement pattern like $FOO = exec(...), it will match only statements.

Personally speaking, Semgrep has the mantra of “learn once, write anywhere,” as I can very easily adapt my analysis for other languages. It’s worth noting that the core of the Semgrep engine was written at Facebook, a company that is known for the “learn once, write anywhere” mantra of React and React Native.

Semgrep vs AST based analysis frameworks

r2c previously talked about how hardcoded password checks are a common and noisy kind of rule. While most rules optimize for completeness, we find that precision is just as important, if not more so.

For the sake of argument, let’s say I were to write a rule to detect hardcoded passwords in Semgrep and compare the ease of development with other AST-based analysis frameworks.

Because I don’t have to write boilerplate code at all, the analysis written in Semgrep is significantly (5-10x) shorter. In addition, the expressive power of abstractions like metavariables and ellipsis operators in my analysis saves the additional code I need in other frameworks. And unlike other frameworks, because the matching engine of Semgrep smartly determines the type of visitor to use, I don’t have to programmatically write the types of nodes to visit explicitly. Given all of this, it’s easy to iterate and reduce false positive rates extremely quickly.

Lastly, simply by changing the target language of my rule, I can adapt a Go rule to be used for Python or JavaScript. In contrast, if you were to adapt a Bandit rule for JavaScript, you’d most likely have to rewrite it from scratch.

Semgrep vs grep-based tools

Grep-based tools like Ripgrep have been used extensively in code analysis. However, the structure-agnostic nature of grep tools makes analysis prone to false positives.

For example, if I simply want to find instances of sensitive function calls like exec(...), the Semgrep pattern exec(...) matches exec() called with any arguments or across multiple lines, but not the string "exec" in comments or hard-coded strings, because Semgrep is aware of the code structure.

Having to specify grep patterns that only fire inside function calls would be very complicated to say the least, and impossible to say the worst.
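The difference can be illustrated with a small sketch that compares a line-based regular expression against a structure-aware search over the same Python source. This uses Python's standard re and ast modules as stand-ins and is not how Semgrep itself is implemented. SOURCE, grep_lines, and call_lines are hypothetical names made up for the example.

```python
import ast
import re

# Hypothetical sample: one comment, one string, and one real multi-line call.
SOURCE = '''\
# never call exec() on user input
message = "exec(...) is dangerous"
exec(
    user_code,
    {},
)
'''


def grep_lines(source: str) -> list:
    """Line numbers where the text `exec(` appears, with structure ignored."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if re.search(r"\bexec\(", line)
    ]


def call_lines(source: str) -> list:
    """Line numbers of actual calls to a function named `exec`."""
    return sorted(
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "exec"
    )
```

On this sample the text search reports the comment and the string literal as well as the call, while the structure-aware search reports only the call, including its arguments spread over several lines.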

Semgrep niceties

Beyond pattern matching, Semgrep offers a very robust set of features for complex analysis. These sets of features make it extremely easy to do robust static analysis in less than half the time it takes using other static analysis tools.

Types

For any metavariable I use, I’m able to further hone my analysis with type hints. Currently, I may use int, float, string literals, and formatted strings.

For example, a check written as time.sleep($X: float) fires only when the argument is a float, and not on time.sleep(foo()).
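For comparison, here is a rough sketch of the same restriction written by hand against Python's standard ast module. It assumes that "float" means a float literal in the source, which is a simplification for illustration and not a statement about how Semgrep infers types. The function name float_sleep_lines is hypothetical.

```python
import ast


def float_sleep_lines(source: str) -> list:
    """Lines where time.sleep is called with a single float literal argument."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_time_sleep = (
            isinstance(func, ast.Attribute)
            and func.attr == "sleep"
            and isinstance(func.value, ast.Name)
            and func.value.id == "time"
        )
        if not is_time_sleep or len(node.args) != 1 or node.keywords:
            continue
        arg = node.args[0]
        # Assumption: only literal floats count; ints and call results do not.
        if isinstance(arg, ast.Constant) and isinstance(arg.value, float):
            hits.append(node.lineno)
    return sorted(hits)
```

The single type annotation on the metavariable replaces all of the explicit node and value checks in this sketch.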

Module path

Another advantage of Semgrep is that it’s smart about module paths, such that I can target the specific object I care about in my analysis.

For example, when I was writing a rule to target HttpResponse of the Django framework, I needed to not fire on usage of the vanilla Python HttpResponse. Semgrep module resolution lets me do this very easily.

return django.http.HttpResponse(...)
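To show what module resolution saves me from writing, here is a deliberately simplified sketch of doing it by hand with Python's standard ast module. It only tracks top-level style import statements, ignores scoping and relative imports, and is an assumption-laden illustration rather than a description of Semgrep's resolver. The names TARGET, dotted_name, and django_response_lines are hypothetical.

```python
import ast
from typing import Optional

TARGET = "django.http.HttpResponse"


def dotted_name(node: ast.AST) -> Optional[str]:
    """Turn an expression like a.b.c into the string 'a.b.c', else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def django_response_lines(source: str) -> list:
    """Lines that call HttpResponse as imported from django.http."""
    tree = ast.parse(source)
    # Map each local alias to the dotted path it was imported from.
    origins = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            for alias in node.names:
                origins[alias.asname or alias.name] = f"{node.module}.{alias.name}"
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    origins[alias.asname] = alias.name
    hits = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        name = dotted_name(node.func)
        if name is None:
            continue
        # Expand the leading alias, if any, to its imported path.
        head, _, rest = name.partition(".")
        full = origins.get(head, head) + ("." + rest if rest else "")
        if full == TARGET:
            hits.append(node.lineno)
    return sorted(hits)
```

With this sketch, both HttpResponse(...) after from django.http import HttpResponse and the fully qualified django.http.HttpResponse(...) are reported, while an HttpResponse imported from another module is not.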

Custom post-analysis filtering

Another feature I like about Semgrep is that, after the AST-based matching is done, I can narrow my analysis further based on the metavariables it captured. This is very useful for the types of analysis where I have some whitelisting or blacklisting logic over strings or other literal values.

The following is an example rule that takes advantage of post-analysis filtering.

patterns:
  - pattern-either:
      - pattern: |
          rsa.GenerateKey(..., $BITS)
      - pattern: |
          rsa.GenerateMultiPrimeKey(..., $BITS)
  - pattern-where-python: |
      int(vars['$BITS']) < 2048
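The idea behind the filter in the rule above can be sketched in a few lines of plain Python. This is only a model of the concept, not Semgrep's implementation: the match dictionaries below are hypothetical, hand-written stand-ins for captured metavariable bindings, and filter_matches and weak_key are made-up names.

```python
from typing import Callable, Dict, List

Bindings = Dict[str, str]


def filter_matches(
    matches: List[Bindings], predicate: Callable[[Bindings], bool]
) -> List[Bindings]:
    """Keep only matches whose captured metavariables satisfy the predicate."""
    kept = []
    for bindings in matches:
        try:
            if predicate(bindings):
                kept.append(bindings)
        except (KeyError, ValueError):
            # Unbound metavariable or non-numeric text: treat as not a finding.
            continue
    return kept


def weak_key(bindings: Bindings) -> bool:
    """Mirrors the rule's condition: the captured key size is below 2048 bits."""
    return int(bindings["$BITS"]) < 2048
```

For instance, given captured values of "1024", "4096", and the identifier "keySize" for $BITS, only the first would be kept, since 4096 is large enough and a non-literal cannot be compared in this simple model.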

Conclusion

Having written analyses in Flake8, ESLint, and Semgrep, I find that the amount of time Semgrep saves me is very significant. There’s no obvious degradation in the quality of analysis I can write, and the features built into Semgrep only amplify what I can express with my simple patterns. As a bonus, prototyping rules against real code using semgrep.live is very robust and functions like an IDE, which is a much better experience compared to https://astexplorer.net/ or https://python-ast-explorer.com/. Overall, without any bias or contention, I don’t want to go back to writing AST-based visitors now that I’ve found Semgrep.
