Experimental feature: generic pattern matching

Recently we added a new experimental feature to Semgrep: generic pattern matching. This post outlines how to use it and what to expect when matching code patterns.

Generic pattern matching allows Semgrep to match code patterns in languages that don’t yet have a Semgrep parser, in configuration files, or in other structured data (e.g., HTML or XML). For example, you may want to find unwanted permissions enabled in Terraform files, insecure redirects in nginx, or misconfigured blog engine settings.

Consider this rule that searches for allowed_origins = ["*"] in Terraform files:

rules:
- id: terraform-all-origins-allowed
  patterns:
  - pattern-inside: cors_rule { ... }
  - pattern: allowed_origins = ["*"]
  languages:
  - generic
  severity: WARNING
  message: CORS rule on bucket permits any origin

The above rule matches this Terraform code snippet:

resource "aws_s3_bucket" "b" {
  bucket = "s3-website-test-open.hashicorp.com"
  acl    = "private"

  cors_rule {
    allowed_headers = ["*"]
    allowed_methods = ["PUT", "POST"]
    allowed_origins = ["*"]  # <--- Matches here
    expose_headers  = ["ETag"]
    max_age_seconds = 3000
  }
}

General properties

Generic pattern matching has the following properties:

- A document is interpreted as a nested sequence of ASCII words, ASCII punctuation, and other bytes.
- ... (the ellipsis) allows skipping non-matching elements, up to 10 lines down from the last match.
- $X (a metavariable) matches any single word.
- The interpretation of a document can be inspected with the spacecat command.
- Indentation determines primary nesting in the document.
- Common ASCII braces (), [], and {} introduce secondary nesting, but only within single lines. Therefore, misinterpreted or mismatched braces don't disturb the structure of the rest of the document.
- The document must be at least as indented as the pattern: any indentation specified in the pattern must be honored in the document.
- Shorter matches are preferred over longer ones. This avoids matches like def bar def foo when the pattern is def ... foo, instead matching just def foo.
- Leading dots must match at the beginning of a block, allowing patterns like ... foo to match what comes before foo.
- In general, short patterns on structured data perform best.
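
To make the "shorter matches are preferred" property concrete, here is a toy sketch in Python. It is not how Semgrep implements matching: the helper name shortest_span is hypothetical, the document is simply split on whitespace, and the pattern is reduced to a first word and a last word with an ellipsis in between.

```python
from typing import List, Optional, Tuple


def shortest_span(tokens: List[str], first: str, last: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) indexes of the shortest span that begins with
    `first` and ends with `last`, or None if no such span exists.

    Hypothetical helper for illustration only; assumes first != last.
    """
    best: Optional[Tuple[int, int]] = None
    latest_first: Optional[int] = None  # index of the most recent `first` seen
    for i, tok in enumerate(tokens):
        if tok == first:
            latest_first = i
        elif tok == last and latest_first is not None:
            # Starting from the most recent `first` gives the shortest
            # span that ends at this occurrence of `last`.
            if best is None or i - latest_first < best[1] - best[0]:
                best = (latest_first, i)
    return best


tokens = "def bar def foo".split()
span = shortest_span(tokens, "def", "foo")
print(tokens[span[0]:span[1] + 1])  # ['def', 'foo']
```

The sketch picks def foo rather than def bar def foo, which is the behavior described for the pattern def ... foo.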

Example rules

This Semgrep ruleset, based on generic pattern matching, performs security checks for nginx configuration files. You can also browse all the generic pattern matching rules in the Semgrep registry.

Caveats and limitations

Generic pattern matching should work with any human-readable text, as long as it’s primarily based on ASCII symbols. In practice, it may work well with some languages and less well with others. It is often possible, and sometimes easy, to write code in unusual ways that prevent matching.

Note that generic pattern matching is not well suited to detecting malicious code. For example, in HTML one can write Hell&#x6F; instead of Hello. The pattern Hello will not match the encoded form, whereas it would if Semgrep had full HTML support.
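
The HTML example can be reproduced with a plain substring check in Python. This is only an analogy for a matcher that compares raw text: it uses the standard library's html.unescape to stand in for "full HTML support" and says nothing about Semgrep internals.

```python
import html

encoded = "Hell&#x6F;"  # HTML character reference for the letter "o"
pattern = "Hello"

# A comparison on the raw text does not see the word "Hello".
print(pattern in encoded)                 # False

# Once the entity is decoded, the same comparison succeeds.
print(pattern in html.unescape(encoded))  # True
```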

With respect to Semgrep operators and features:

- Metavariable support is limited to capturing a single “word”, which is a token of the form [A-Za-z0-9_]+. A metavariable can’t capture a sequence of tokens such as “hello, world”, since it contains 3 tokens:
  - hello
  - ,
  - world
- The ellipsis operator is supported and spans at most 10 lines.
- Pattern operators like either/not/inside are supported.
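
The token split described above can be approximated with a short Python sketch. The word definition [A-Za-z0-9_]+ comes from this post; treating every other non-whitespace character as its own token is an assumption made for illustration, and the function names tokenize and metavariable_can_capture are hypothetical rather than part of Semgrep.

```python
import re
from typing import List

# A "word" as defined in this post.
WORD_RE = re.compile(r"[A-Za-z0-9_]+")

# Assumption: any other non-whitespace character is a single-character token.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]")


def tokenize(text: str) -> List[str]:
    """Split text into words and single punctuation characters."""
    return TOKEN_RE.findall(text)


def metavariable_can_capture(text: str) -> bool:
    """True if text is exactly one word, the only thing $X can bind to."""
    return WORD_RE.fullmatch(text) is not None


print(tokenize("hello, world"))                  # ['hello', ',', 'world']
print(metavariable_can_capture("hello"))         # True
print(metavariable_can_capture("hello, world"))  # False
```

Under these assumptions, a value such as "*" in allowed_origins = ["*"] is made of punctuation tokens only, so a metavariable would not capture it. Patterns that target such values need to spell them out literally.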

Try it out

In addition to running rules already available in the Semgrep Registry, you can write custom Semgrep rules using generic pattern matching in the Semgrep live editor. Look for the generic pattern matching item in the editor menu.
