Recently we added a new experimental feature to Semgrep: generic pattern matching. This post outlines how to use it and what to expect when matching code patterns.
Generic pattern matching allows Semgrep to match code patterns in languages that don’t yet have a Semgrep parser, in configuration files, or in other structured data (e.g., HTML or XML). For example, you may want to find unwanted permissions enabled in Terraform files, insecure redirects in nginx, or misconfigured blog engine settings.
Consider this rule, which searches for allowed_origins = ["*"] in Terraform files:
rules:
  - id: terraform-all-origins-allowed
    patterns:
      - pattern-inside: cors_rule { ... }
      - pattern: allowed_origins = ["*"]
    languages:
      - generic
    severity: WARNING
    message: CORS rule on bucket permits any origin
The above rule matches this Terraform code snippet:
resource "aws_s3_bucket" "b" {
  bucket = "s3-website-test-open.hashicorp.com"
  acl    = "private"
  cors_rule {
    allowed_headers = ["*"]
    allowed_methods = ["PUT", "POST"]
    allowed_origins = ["*"] # <--- Matches here
    expose_headers  = ["ETag"]
    max_age_seconds = 3000
  }
}
General properties
Generic pattern matching has the following properties:
A document is interpreted as a nested sequence of ASCII words, ASCII punctuation, and other bytes.
... (ellipsis) allows skipping non-matching elements, up to 10 lines down from the last match.
$X (metavariable) matches any word.
The interpretation of a document can be inspected with the spacecat command.
Indentation determines primary nesting in the document.
Common ASCII braces (), [], and {} introduce secondary nesting, but only within single lines. Therefore, misinterpreted or mismatched braces don't disturb the structure of the rest of the document.
The document must be at least as indented as the pattern: any indentation specified in the pattern must be honored in the document.
Shorter matches are preferred over longer ones. This avoids matches like def bar def foo when the pattern is def ... foo, instead matching just def foo.
Leading dots must match at the beginning of a block, allowing patterns like ... foo to match what comes before foo.
In general, short patterns on structured data will perform the best.
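To make the "shorter matches are preferred" property concrete, here is a toy Python model of it. It is only a sketch and not how Semgrep is implemented: the helper shortest_match is hypothetical, it treats the document as a flat list of whitespace-separated words, and it ignores nesting, indentation, and the 10-line limit.

```python
def shortest_match(tokens, first, last):
    """Toy model of the pattern `first ... last`.

    Returns the shortest run of tokens that starts with `first` and ends
    with `last`, or None if there is no such run. Hypothetical helper,
    not part of Semgrep.
    """
    best = None
    for start, token in enumerate(tokens):
        if token != first:
            continue
        for end in range(start + 1, len(tokens)):
            if tokens[end] == last:
                span = tokens[start:end + 1]
                if best is None or len(span) < len(best):
                    best = span
                break  # the nearest `last` is the shortest span for this start
    return best


tokens = "def bar def foo".split()
print(shortest_match(tokens, "def", "foo"))  # ['def', 'foo']
```

With the pattern def ... foo, the model returns def foo rather than the longer def bar def foo, which is the behavior described in the list above.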
Example rules
This Semgrep ruleset, based on generic pattern matching, performs security checks for nginx configuration files. You can also browse all the generic pattern matching rules in the Semgrep registry.
Caveats and limitations
Generic pattern matching should work fine with any human-readable text, as long as it’s primarily based on ASCII symbols. In practice, it might work great with some languages and less well with others. In general, it’s possible or even easy to write code in weird ways that will prevent matching.
Note that it is not well suited to detecting malicious code. For example, in HTML one can write &#x48;&#x65;&#x6C;&#x6C;&#x6F; instead of Hello. The pattern Hello will not match this encoded form, whereas it would with full HTML support.
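The encoding problem can be reproduced with a few lines of Python. This sketch assumes the obfuscation uses hexadecimal character references and stands in for generic matching with a plain substring check, which is a simplification; the helper names encode_as_char_refs and literal_match are made up for illustration.

```python
import html


def encode_as_char_refs(text):
    """Rewrite each character as a hexadecimal HTML character reference."""
    return "".join(f"&#x{ord(ch):X};" for ch in text)


def literal_match(document, pattern):
    """Stand-in for text-level matching: the pattern must appear verbatim."""
    return pattern in document


encoded = "<p>" + encode_as_char_refs("Hello") + "</p>"

# The raw document never contains the literal text, so nothing matches.
print(literal_match(encoded, "Hello"))                 # False

# An HTML-aware tool would decode the references before matching.
print(literal_match(html.unescape(encoded), "Hello"))  # True
```

A browser renders both forms identically, which is why text-level matching alone is easy to evade.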
With respect to Semgrep operators and features:
metavariable support is limited to capturing a single “word”, which is a token of the form [A-Za-z0-9_]+. A metavariable can’t capture a sequence of tokens such as “hello, world”, since that input consists of 3 tokens: hello, the comma, and world
the ellipsis operator is supported and spans at most 10 lines
pattern operators such as pattern-either, pattern-not, and pattern-inside are supported
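The word and token rules above can be illustrated with a small Python tokenizer. This is a rough approximation under stated assumptions, not Semgrep's actual lexer: words follow the [A-Za-z0-9_]+ form given above, each ASCII punctuation character is assumed to be its own token, and whitespace and other bytes are simply skipped. The names tokenize and is_word are hypothetical.

```python
import re
import string

WORD = r"[A-Za-z0-9_]+"  # word form quoted in the text
# Assumption: every ASCII punctuation character is a separate token.
PUNCT = "[" + re.escape(string.punctuation) + "]"
TOKEN_RE = re.compile(WORD + "|" + PUNCT)


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return TOKEN_RE.findall(text)


def is_word(token):
    """True if a metavariable such as $X could capture this token."""
    return re.fullmatch(WORD, token) is not None


tokens = tokenize("hello, world")
print(tokens)                              # ['hello', ',', 'world']
print([t for t in tokens if is_word(t)])   # ['hello', 'world']
print(tokenize('allowed_origins = ["*"]'))
```

Under this model, $X could capture hello or world individually, but no single metavariable covers all three tokens of "hello, world".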
Try it out
In addition to running rules already available in the Semgrep Registry, you can write custom Semgrep rules using generic pattern matching in the Semgrep live editor. Look for the generic pattern matching item in the editor menu.