Semgrep 0.70 supports taint-sanitizer-sink rules for smarter data-flow based scanning, parsing Terraform files to cover more infrastructure-as-code, and --config=auto to automatically pick Semgrep Registry rules for your project’s language and frameworks.
Semgrep has a weekly release cadence, which lets us ship and iterate on features quickly. This cadence does, however, fall short when it comes to highlighting larger features the way they deserve. So today we’re looking back through the last few months’ worth of releases to highlight the biggest new features that landed.
In case you need a reminder: Semgrep is a fast, open-source, static analysis tool for finding bugs and enforcing code standards.
Taint mode: smarter, data-flow based security scanning
Taint checking is a versatile program analysis technique, useful for preventing many serious classes of issues, such as SQL injection or cross-site scripting (XSS). But for a simpler explanation, let’s just say you want to keep your program’s output innocent and clean of profanity, even if you pass your git commit log through it. After all, if you think about it, an ORM is also just a profanity filter for databases. Here’s how you would write a simple Semgrep pattern to catch user input being printed without a profanity filter, before taint mode was available:
pattern: |
  $MESSAGE = input(...)
  ...
  print($MESSAGE)
Simple, right? Deceptively simple, in fact. Truth is, this rule can be tripped up by the simplest things, such as print(text.lower()). This is because the above pattern finds only code where the variable coming from input(...) is the exact same variable we pass to print(...).
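To make the gap concrete, here is a small hypothetical Python file to scan with the pattern above. The function names are ours, made up for illustration; the comments describe what we would expect the pattern to do, based on the explanation above, rather than recorded scan output.

```python
# Hypothetical target file for the pattern above; all function names are illustrative.


def echo_direct():
    message = input("Say something: ")
    # The same variable flows from input(...) to print(...): the pattern should match.
    print(message)


def echo_lowercased():
    text = input("Say something: ")
    # The argument is text.lower(), not the bare variable: expected to be missed.
    print(text.lower())


def echo_renamed():
    text = input("Say something: ")
    shout = text.upper()
    # The user input now lives in a different variable: expected to be missed too.
    print(shout)
```

All three functions print unfiltered user input, but only the first has the exact shape the pattern describes.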
Before taint mode, we’d handle these cases by adding a bunch of logical operations: the pattern can start with either the input itself, or the input being passed through any function, and must end with either of those variables being printed. Handling just one case would make the rule ten lines long, and the outlook still wouldn’t be so good for false negatives. There are countless other functions or operations we could call on the user input which would evade matches, each of which might need to be added to the rule. And we haven’t even reached the part where we figure out the right places to add pattern-not lines to mark the variable safe after we ran a profanity filter function on it. Nor should we ever reach that point, because now we have taint mode! This replaces what was going to eventually grow to nearly 100 lines in our rule, with effectively three simple lines.
pattern-sources:
  - pattern: input(...)
pattern-sanitizers:
  - pattern: mask_bad_words(...)
pattern-sinks:
  - pattern: print(...)
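As a rough illustration, here is a hypothetical Python file that exercises the source, sanitizer, and sink named in the taint rule above. Only the name mask_bad_words comes from the rule; its body, the word list, and the echo_* functions are assumptions made up for this sketch. The comments say which calls we would expect taint mode to report, not what a recorded scan produced.

```python
import re

# Hypothetical word list; a real filter would use a proper dictionary.
BAD_WORDS = ("darn", "heck")


def mask_bad_words(text):
    """Replace each listed word with asterisks of the same length, ignoring case."""
    pattern = re.compile("|".join(re.escape(w) for w in BAD_WORDS), re.IGNORECASE)
    return pattern.sub(lambda m: "*" * len(m.group()), text)


def echo_unsafe():
    text = input("Say something: ")  # source: input(...)
    shout = text.upper()
    # User input reaches the sink without passing through the sanitizer,
    # so we would expect a finding here.
    print(shout)  # sink: print(...)


def echo_safe():
    text = input("Say something: ")  # source: input(...)
    # mask_bad_words(...) sits between source and sink, so no finding is expected.
    print(mask_bad_words(text.lower()))  # sink: print(...)
```

Note that echo_unsafe renames and transforms the input before printing it, which is exactly the kind of code the plain pattern could not follow.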
To explore this further, run and fork this rule yourself in the Semgrep Playground. Of course, this ain’t but a quaint example to acquaint you with taint mode, so to paint a less faint picture of it, we also published a much more in-depth article. See that article to read about more ways to apply it in code and real life examples of finding injection vulnerabilities.
Terraform: security scans for more infrastructure-as-code
We’re right at the one-year anniversary of the first Terraform rules being published in the Semgrep Registry, but until now, these rules were relying on Semgrep’s “generic pattern matching”, which can match basic Semgrep patterns on any text file. But to really put the semantic in Semgrep, we had to teach it how to parse HCL files (the language used to configure Terraform). This makes complex rules such as detecting privilege escalation in AWS not only possible, but also reliable.
Terraform and its HCL language are joining good company: Semgrep already supports other infrastructure-as-code tools. YAML support was released earlier this year, which covers Kubernetes, docker-compose, GitHub Actions, and in a kind of mind-bending way, Semgrep itself. JSON support is even more mature, scanning AWS IAM policies since 2020.
Give it a try by navigating to a Terraform project directory, and running:
semgrep --config=p/terraform
Or better yet…
--config=auto: so you don’t have to think about rules
So I just asked you to run semgrep --config=p/terraform on a Terraform project. Honestly though, a machine could’ve figured that one out. If only there were a tool that could read and understand codebases well enough to do this automatically…
Turns out, the answer has been staring us in the face all this time: just use Semgrep. Here’s the idea:
Semgrep can search the codebase for ‘inventory patterns’, such as import flask in a Python file, or the existence of a Dockerfile.
Without any of your source code leaving your machine, the Semgrep CLI will show these inventory findings to the Semgrep Registry, reporting essentially “I see Flask and Docker in this project”. Semgrep CLI will NOT send any source code, just as promised in the Semgrep CLI Philosophy manifesto.
The Semgrep Registry will search through its 1,500+ rules and bundle all the right ones for your project. In this case, it would select all the high-confidence Flask and Docker rules.
This lets you stop thinking about configuration altogether, and leave it to the Semgrep Registry. And since we frequently add new rules to the Registry, by using --config=auto you don’t have to worry about ensuring you’re using the latest and greatest available rules; Semgrep takes care of that for you. The Semgrep team will also use the pseudo-anonymized aggregate inventory data to decide what technologies to put more effort into. You can read more about our metric collection in Semgrep’s privacy policy.
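The inventory step described above can be sketched in a few lines of Python. This is not how Semgrep implements it: the take_inventory function, the regular expression, and the "flask" and "docker" tags are all assumptions made up for illustration. The point is only that the result is a short list of technology names, with no source code in it.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical inventory check: a line that imports the flask package.
FLASK_IMPORT = re.compile(r"^\s*(import flask\b|from flask\b)", re.MULTILINE)


def take_inventory(root):
    """Return a sorted list of technology tags detected under root."""
    tags = set()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if path.name == "Dockerfile":
            # The mere existence of a Dockerfile counts as an inventory finding.
            tags.add("docker")
        elif path.suffix == ".py":
            source = path.read_text(encoding="utf-8", errors="ignore")
            if FLASK_IMPORT.search(source):
                tags.add("flask")
    # Only these tags, never file contents, are returned to the caller.
    return sorted(tags)


if __name__ == "__main__":
    # Build a throwaway project so the sketch runs on its own.
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        (project / "app.py").write_text("import flask\napp = flask.Flask(__name__)\n")
        (project / "Dockerfile").write_text("FROM python:3.10\n")
        print(take_inventory(project))  # ['docker', 'flask']
```

A registry-side service could then map each tag to a bundle of rules, which is the part the Semgrep Registry handles in the description above.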
As of today, using semgrep --config=auto returns a static list of the highest confidence rules for generally-available languages in the Semgrep Registry. This approach works quite well with the current, small static list of rules. Over the coming months --config=auto will start dynamically adjusting to your inventory findings. This will eliminate the overhead that comes with bundling more rules than necessary, and we’ll be able to bundle more rules with confidence. We recommend adopting this flag for your one-off local scans today, and integrating it into your CI pipelines in a couple of months, as this feature evolves.
Performance
Our Semgrep CLI philosophy has the ambitious goal to get Semgrep’s performance as close as possible to ripgrep’s. Through a wide array of optimizations, including reimplementing many of Semgrep’s features in OCaml, we reached a 5x speedup compared to half a year ago on large repositories. On these repositories, Semgrep can now return identical results in half the time compared to Bandit with a single CPU core, or even faster with multi-core scanning.
Figure: Scan time in seconds with recent versions of Semgrep (lower is better).
But wait, there’s more…
We’ve also been hard at work on Semgrep App. In case you need a reminder: Semgrep App is an online dashboard that helps your team make the best of Semgrep. It’s free for unlimited users & repositories. What’s new there? Loads of things, including:
a new way to configure Semgrep across your organization with 10% of the effort
the ability to track findings end-to-end, and to dismiss false positives forever
connectivity to Jira, allowing you to create tickets directly from findings
Check out the new Semgrep App goodies here →
What’s next
We believe the future is bright and we have great things planned for Semgrep. If you’re new to Semgrep, please join us in the Semgrep community Slack and say hi! And thanks to everyone for creating such a supportive community and helping us move Semgrep forward! 💜