Structure Mode: Never write an invalid Semgrep rule again

Semgrep rule-writing is simple in principle, but it can be easy to make mistakes in practice (especially if you are newer to rule-writing). Structure mode is a new interface for structured rule-writing that makes writing invalid rules impossible, empowering both seasoned rule-writers and beginners alike.

Introduction

Semgrep has made something of a name for itself as a highly-customizable code scanning tool, with an emphasis on customizable. Semgrep Code (SAST) and Semgrep Secrets both give users the superpower to write rules with patterns that look like code, presented in YAML - meaning that there is no need to learn a vendor-specific DSL. These features are deliberate design decisions made with the explicit goal of making rule-writing as frictionless as possible.

Friction still exists, however. Have you ever written a rule that looks like this?

rules:
  - id: dont-call-eval-of-input
    message: "Message"
    severity: ERROR
    languages: [python]
    pattern:
      - "eval(input)"

This is a very basic almost-rule which looks correct, but isn’t actually accepted by Semgrep. The reason is that the pattern operator only takes in strings, and the rule-writer has passed in a list of YAML strings! In other words, due to a type error, the engine doesn’t understand the rule.
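For comparison, here is a minimal sketch of what a corrected version might look like, assuming the intent is to match a call to eval on the name input. The pattern key is given a single string rather than a list, and the sketch uses the plural languages key, which is the key name Semgrep rules use.

```yaml
rules:
  - id: dont-call-eval-of-input
    message: "Message"
    severity: ERROR
    languages: [python]
    # `pattern` takes one string, not a YAML list of strings
    pattern: "eval(input)"
```

The two versions differ by a single dash and a line break, which is exactly why this kind of type error is easy to miss.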

Another common mistake that people can make presents itself in this rule:

rules:
  - id: passing-functions
    message: "Message"
    severity: ERROR
    languages: [python]
    pattern: |
    def $FUNC(...):
      pass

At first blush, this looks like an innocuous rule which detects Python functions with pass as the body. However, there is a critical issue: the YAML is not indented properly!

Instead of pattern: being read as a key that maps to the contents of the pattern string, def $FUNC(...) is itself read as a separate YAML key that maps to pass, while pattern maps to the empty string. This is not at all obvious.
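A sketch of the likely intended rule, with the body of the block scalar indented deeper than the pattern key so that YAML treats it as the string's contents:

```yaml
rules:
  - id: passing-functions
    message: "Message"
    severity: ERROR
    languages: [python]
    pattern: |
      # everything indented under `pattern: |` belongs to the string
      def $FUNC(...):
        pass
```

The comment line inside the block is part of the pattern text and is only there for illustration; it assumes the pattern is parsed as Python, where a leading # line is a comment. Drop it in a real rule.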

Small paper-cuts like this can happen often when writing rules. This is especially apparent to new rule writers, who may not know the particulars of YAML formatting, or who may be unfamiliar with how Semgrep rules are specifically meant to be formatted in YAML. While it’s nice that YAML is a format that many people are already familiar with, the large search space of ways that YAML-writing can go wrong threatens to undermine our commitment to making rule writing easy and effective.

Simple mode

A long time ago, we introduced simple mode for precisely this reason. Simple mode is just that — simple.

Simple mode takes the form of a list of code patterns and conditions for how they may be combined. For instance, it may specify that matches can be one of several different code patterns, or all of them at once.

This is useful for specifying some simple kinds of searches, but fails to be sufficient for writing more advanced rules. For one, simple mode is simple in the sense of being limited—it only contains a subset of Semgrep’s rule functionality. It excludes more advanced operators like metavariable-regex, metavariable-pattern, focus-metavariable, and even taint mode rules, which are some of the bread and butter features used when writing Semgrep rules for purposes such as code quality or security.
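To make this concrete, here is a small illustrative rule that relies on metavariable-regex, one of the operators listed above. The rule id, message, and regular expression are made up for this example and are not taken from the semgrep-rules repository.

```yaml
rules:
  - id: call-to-eval-or-exec   # hypothetical rule id
    message: "Call to $FUNC"
    severity: WARNING
    languages: [python]
    patterns:
      # match any function call and bind the callee to $FUNC
      - pattern: $FUNC(...)
      # keep only the matches whose $FUNC text is eval or exec
      - metavariable-regex:
          metavariable: $FUNC
          regex: ^(eval|exec)$
```

The second child of patterns is not a code pattern at all but a condition on a metavariable, which is why a flat list of code patterns cannot express it.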

The following Semgrep rule flags rules in the semgrep-rules repository that use features not available in simple mode:

rules:
  - id: find-non-simple-rule-features
    message: |
      This rule should flag Semgrep rules which have
      features that are not included in Simple Mode.
    languages: [yaml]
    severity: WARNING
    pattern-either:
      - pattern: |
          metavariable-regex: $_
      - pattern: |
          metavariable-pattern: $_
      - pattern: |
          metavariable-comparison: $_
      - pattern: |
          metavariable-analyzer: $_
      - pattern: |
          metavariable-type: $_
      - pattern: |
          focus-metavariable: $_
      - pattern: |
          pattern-regex: $_
      - pattern: |
          mode: taint

803/1920 = 41.8% (nearly half!) of all rules in the semgrep-rules repository use features which are not expressible in simple mode — in the proprietary Semgrep rules repository, this number goes up to 65%. These figures indicate that these features are essential for writing higher-quality rules. For serious rule writers, simple mode just isn’t going to cut it.

In addition to lacking some pattern operators which are available in advanced mode, simple mode only allows a flat list of patterns. This means that it is incapable of expressing rules which have a nested structure, such as the following:

rules:
  - id: nested-rule
    message: "Message"
    severity: ERROR
    languages: [python]
    pattern-either:
      - patterns:
          - pattern: |
              foo(...)
          - pattern-not: |
              foo(1, ...)
      - pattern: |
          foo(..., 2)

This rule cannot be expressed in simple mode because of the nested patterns operator, which has two children of its own: foo(...) and foo(1, ...). Simple mode can’t express patterns which have their own children.

Structure editors

Recapping the problems we have seen so far:

Lacking fundamental pattern operators like metavariable-regex, focus-metavariable, ...
Inability to write nested patterns
Easy to make careless mistakes (formatting, etc)

It turns out that there is a solution which is capable of solving all three of these problems.

Let’s take a step back - this makes more sense in the context of programming languages, when we are dealing with data with a rich, predefined structure.

Take the Java programming language as an example. Variable declarations in Java look something like this:

<type> <identifier> = <expression>;

This is a prescribed structure — a simple Java variable declaration is invalid if it is just a subset of these things. For instance, writing int x 2;, without the = sign, is not a valid statement! All of these must be present to make a syntactically valid Java program, and therefore omitting any is a surefire error.

Regular text editors are unaware of this. When you are writing a text file, you are only dealing with a stream of characters, and nothing prevents you from entering whatever characters you like, correct or not. This means that when you write a Java file, you aren’t writing the Java program itself; you’re writing text which hopefully encodes the structure of an actual Java program.

In the world of programming languages, a structure editor is defined to be an editor which is aware of the structure, or the form of the data in question, instead of primarily interfacing through minute details like characters and whitespace. Instead of dealing with streams of characters, it deals with actual semantic elements of the program, such as a type, or expression.

So, for example, in the Hazel structure editor, partial programs can have “holes” in them which denote which areas of the program have yet to be filled in.

Another stellar example of a structure editor is the Scratch visual programming language for children, which restricts the allowable programs to only those which can be created by connecting colorful blocks.

While Scratch is for children, this idea is not. We can take this inspiration and build something which both reduces the error space for possible Semgrep programs, and allows us to write rules just as powerful as those you would normally write in advanced mode.

Enter structure mode.

A structured editor

Structure mode is a new rule editing mode that allows a UI-based approach to writing Semgrep rules. While it is primarily geared towards easing new Semgrep rule writers into the rule writing process, it has some nice features for the seasoned Semgrep rule writer also.

The view of a structure mode rule contains all the same inherent information as that of an advanced mode rule, with a few nice features:

Selected keys

Remembering whether to do a pattern-either or patterns-either is now a thing of the past. Note that the former key is correct, but the second is spelled wrong, with an extra “s”! With a clean drop-down interface specifying the core pattern operators, users now never have to write out the name of the operator that they want.

These pattern operators have also been given terser, more descriptive names for what it is that they do. Now, instead of pattern, patterns, pattern-either, pattern-inside, pattern-regex, and pattern-not, we have pattern, all, any, inside, regex, and not. Why use lot word when few word do trick?
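As a rough illustration of the renaming, here is an ordinary advanced mode rule with the corresponding structure mode name noted in a comment next to each operator. The YAML itself is plain advanced mode syntax; the comments only record the mapping described above and are not a claim about how structure mode stores rules.

```yaml
rules:
  - id: nested-rule
    message: "Message"
    severity: ERROR
    languages: [python]
    pattern-either:                     # shown as "any"
      - patterns:                       # shown as "all"
          - pattern: foo(...)           # shown as "pattern"
          - pattern-not: foo(1, ...)    # shown as "not"
      - pattern: foo(..., 2)            # shown as "pattern"
```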

Match badges

It’s easy to end up with a complex rule with some deeply nested sub-pattern that isn’t behaving as expected.

To make these rule writing snafus easier to recover from, structure mode features match badges, which are little badges paired with each pattern operator which show the number of matches associated with each.

This works similarly to advanced mode’s inspect rule feature, but without needing to leave the structure of the rule itself. Also similarly to inspect rule, you can hover over the row of a given pattern operator to see the matches which it specifically produces.

Convenience is king in a development workflow!

Pattern extensibility

When adding a new pattern to a nested operator such as a patterns or pattern-either, one has to manually edit the text to have a key with the right indentation and the right name. Moreover, when deleting the pattern, the tried and true practice is simply to select the operator in question and hit backspace!

A workflow as common as this might as well be entrenched in the editor. In Structure Mode, this is as easy as hitting a button — no indentation, no problem.

Separate conditions

Structure mode distinguishes between a pattern and a pattern constraint.

A pattern is one of six different fundamental operators, each of which describes zero or more ranges in the code being scanned. These patterns include pattern, any, all, inside, regex, and not*. These may be combined in prescribed ways, such as any and all using range union and intersection, but they still define ranges.

Pattern constraints describe what they sound like — constraints on patterns. They describe Boolean constraints that must hold for a match to survive. If the constraint does not hold, then the ranges are killed outright. Notably, they do not produce new ranges — they merely take existing ranges and decide whether or not to shut them off.

These can be accessed via the “filter” icon next to each pattern operator.

You can use this to write things like metavariable conditions too, like using metavariable-regex to compare the text of a metavariable to some regular expression. These have also been renamed for convenience.

*not behaves slightly differently from the other five, but it’s still valid to include here.
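In advanced mode YAML, the same split between patterns and constraints can be sketched as follows. The rule id and the allocate function are hypothetical, chosen only to show one range-producing pattern next to one constraint.

```yaml
rules:
  - id: large-allocation   # hypothetical rule id
    message: "Allocation of $SIZE exceeds 1024"
    severity: WARNING
    languages: [python]
    patterns:
      # a pattern: produces ranges, one per matching call
      - pattern: allocate($SIZE)   # `allocate` is a made-up function
      # a constraint: produces no ranges of its own, it only decides
      # whether each range found above survives
      - metavariable-comparison:
          metavariable: $SIZE
          comparison: $SIZE > 1024
```

In the text form both entries sit side by side in the same list, so nothing signals that they play different roles. Structure mode moves the second kind behind the filter icon.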

Advanced mode interoperability

Got a complex rule? Entering it manually into structure mode can be a pain. Thankfully, you don’t have to! Rules will automatically translate bi-directionally between advanced and structure modes, so that you can easily export a structure mode rule in YAML, or paste an existing rule into structure mode.

Drag and drop

When dealing with YAML, moving around elements of a rule can be an error-prone copy and paste job, often due to placing a pattern at the incorrect level of nesting.

With structure mode, such problems are solved with a simple click and drag.

Pattern disabling

Sometimes, when debugging a rule, lots of things can be going on at once. A common workflow is to start simple, and slowly add extra patterns and conditions to observe exactly when a match starts going wrong.

In a textual setting, this can be a lot of painful copy-paste and text deletion. To make these kinds of problems easier to deal with, structure mode comes with a “pattern disable” switch on each pattern, which lets you run a rule without a particular pattern in play.
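For reference, the closest textual equivalent is probably commenting a clause out by hand, as in this sketch based on the nested rule from earlier:

```yaml
rules:
  - id: nested-rule
    message: "Message"
    severity: ERROR
    languages: [python]
    patterns:
      - pattern: foo(...)
      # temporarily disabled while debugging:
      # - pattern-not: foo(1, ...)
```

This works, but the comment marker has to go on every line of the clause and the indentation has to be restored exactly afterwards. The disable switch gets the same effect without touching the text.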

Now this is rule writing.

Conclusions

At Semgrep, we’re big believers in the power of writing custom rules to tailor your security tooling to the needs of your codebase, your organization, and your priorities.

This means that, to the newcomer and seasoned rule writer alike, this process needs to be as frictionless as possible. Structure mode helps bridge this gap by combining the robustness of advanced mode and the accessibility of simple mode into one powerful interface for rule creation.

With Structure Mode (with the exception of invalid patterns) it is actually impossible to write an invalid rule. Gone are the days of wrestling with YAML errors and indentation mistakes — now, rule writing can be fast, fruitful, and frictionless.

What are you waiting for? Give it a try now in the Semgrep Playground.
