Generic pattern matching
Introduction
Semgrep can match generic patterns in languages that it doesn’t yet support. You can use generic pattern matching for languages that don’t yet have a parser, for configuration files, and for other structured data such as HTML or XML. Generic pattern matching is experimental.
As an example, consider this rule:
rules:
  - id: terraform-all-origins-allowed
    patterns:
      - pattern-inside: cors_rule { ... }
      - pattern: allowed_origins = ["*"]
    languages:
      - generic
    severity: WARNING
    message: CORS rule on bucket permits any origin
The above rule matches this code snippet:
resource "aws_s3_bucket" "b" {
  bucket = "s3-website-test-open.hashicorp.com"
  acl = "private"
  cors_rule {
    allowed_headers = ["*"]
    allowed_methods = ["PUT", "POST"]
    allowed_origins = ["*"] # <--- Matches here
    expose_headers = ["ETag"]
    max_age_seconds = 3000
  }
}
Generic pattern matching has the following properties:
- A document is interpreted as a nested sequence of ASCII words, ASCII punctuation, and other bytes.
- ... (ellipsis) allows skipping non-matching elements, up to 10 lines down from the last match.
- $X (metavariable) matches any word.
- $...X (ellipsis metavariable) matches a sequence of words, up to 10 lines down from the last match.
- Indentation determines primary nesting in the document.
- Common ASCII braces (), [], and {} introduce secondary nesting but only within single lines. Therefore, misinterpreted or mismatched braces don't disturb the structure of the rest of the document.
- The document must be at least as indented as the pattern: any indentation specified in the pattern must be honored in the document.
Caveats and limitations
Generic mode should work with any human-readable text that is primarily based on ASCII symbols. In practice, it works well with some languages and less well with others. It is often possible, and even easy, to write code in unusual ways that prevent generic mode from matching, so it is not suitable for detecting malicious code. For example, in HTML one can write &#x48;&#x65;&#x6C;&#x6C;&#x6F; instead of Hello, and generic mode will not match it if the pattern is Hello, whereas a matcher with full HTML support would.
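As a rough illustration of this limitation, the Python sketch below decodes HTML character references with the standard library's html.unescape and compares a text-level search before and after decoding. The naive_match helper is a hypothetical stand-in for a matcher that only sees raw characters; it is not Semgrep's implementation.

```python
import html

# "Hello" written as hexadecimal HTML character references.
encoded = "&#x48;&#x65;&#x6C;&#x6C;&#x6F;"

# A browser or a real HTML parser decodes the references to plain text.
decoded = html.unescape(encoded)

def naive_match(pattern: str, document: str) -> bool:
    """Hypothetical text-level matcher: it only sees the raw characters."""
    return pattern in document

print(decoded)                        # Hello
print(naive_match("Hello", encoded))  # False
print(naive_match("Hello", decoded))  # True
```

The pattern only matches once the references have been decoded, which is the step a text-level matcher does not perform.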
With respect to Semgrep operators and features:
- metavariable support is limited to capturing a single “word”, which is a token of the form [A-Za-z0-9_]+. A metavariable such as $X can’t capture a sequence of tokens such as hello, world (in this case there are 3 tokens: hello, a comma, and world).
- the ellipsis operator is supported and spans at most 10 lines
- pattern operators like either/not/inside are supported
- inline regular expressions for strings ("=~/word.*/") are not supported
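The notion of a “word” can be sketched in Python as follows. This is a minimal model that assumes a word is [A-Za-z0-9_]+ and that every other non-whitespace character is a token of its own; the function names are hypothetical and the code is not taken from Semgrep.

```python
import re

# Assumed token classes: a word is [A-Za-z0-9_]+; any other
# non-whitespace character is treated as a separate token.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+|[^\sA-Za-z0-9_]")
WORD_RE = re.compile(r"[A-Za-z0-9_]+")

def tokenize(text: str) -> list[str]:
    """Split text into words and single punctuation characters."""
    return TOKEN_RE.findall(text)

def fits_one_metavariable(text: str) -> bool:
    """True if text is a single word, the only thing $X can capture."""
    return WORD_RE.fullmatch(text) is not None

print(tokenize("hello, world"))               # ['hello', ',', 'world']
print(fits_one_metavariable("hello"))         # True
print(fits_one_metavariable("hello, world"))  # False
```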
Troubleshooting
Common pitfall #1: not enough ...
Rule of thumb:
If the pattern commonly matches many lines, use ... ... (20 lines), ... ... ... (30 lines), and so on, to make sure all the lines are matched.
Here's an innocuous pattern that should match the call to a function f():
f(...)
It matches the following code just fine:
f(
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9
)
But it will fail here because the function arguments span more than 10 lines:
f(
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10
)
The solution is to use multiple ... in the pattern:
f(... ...)
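The rule of thumb can be turned into a small helper. The Python sketch below assumes that each additional ... extends the reach by another 10 lines, as the rule of thumb suggests; ellipses_needed and call_pattern are hypothetical names, not part of Semgrep.

```python
import math

LINES_PER_ELLIPSIS = 10  # limit stated in this document

def ellipses_needed(line_gap: int) -> int:
    """Number of '...' needed to reach a token line_gap lines below the last match."""
    return max(1, math.ceil(line_gap / LINES_PER_ELLIPSIS))

def call_pattern(name: str, line_gap: int) -> str:
    """Build a pattern such as 'f(... ...)' for a call whose ')' is line_gap lines below '('."""
    dots = " ".join(["..."] * ellipses_needed(line_gap))
    return f"{name}({dots})"

print(call_pattern("f", 10))  # f(...)
print(call_pattern("f", 11))  # f(... ...)
print(call_pattern("f", 25))  # f(... ... ...)
```

In the two examples above, the closing parenthesis sits 10 and 11 lines below the opening line, which is why the first needs one ... and the second needs two.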
Common pitfall #2: not enough indentation
Rule of thumb:
If the target code is always indented, use indentation in the pattern.
In the following example, we want to match the system sections containing a name field:
# match here
[system]
  name = "Debian"
# DON'T match here
[system]
  max_threads = 2
[user]
  name = "Admin Overlord"
❌ This pattern will incorrectly catch the name field in the user section:
[system]
...
name = ...
✅ This pattern will catch only the name field in the system section:
[system]
  ...
  name = ...
Command line example
Sample pattern: exec(...)
Sample target file exec.txt contains:
import exec as safe_function
safe_function(user_input)
exec("ls")
exec(some_var)
some_exec(foo)
exec (foo)
exec (
bar
)
# exec(foo)
print("exec(bar)")
Output:
$ semgrep -l generic -e 'exec(...)' exec.txt
7:exec("ls")
--------------------------------------------------------------------------------
11:exec(some_var)
--------------------------------------------------------------------------------
19:exec (foo)
--------------------------------------------------------------------------------
23:exec (
24:
25: bar
26:
27:)
--------------------------------------------------------------------------------
31:# exec(foo)
--------------------------------------------------------------------------------
35:print("exec(bar)")
ran 1 rules on 1 files: 6 findings
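To see why some_exec(foo) is skipped while the comment and the string literal are reported, the following Python sketch approximates the pattern with a regular expression: exec as a whole word, optional whitespace, then an opening parenthesis. This is an approximation for illustration only and not how Semgrep matches; the line numbers it prints refer to the sample text embedded in the sketch.

```python
import re

SAMPLE = '''import exec as safe_function
safe_function(user_input)
exec("ls")
exec(some_var)
some_exec(foo)
exec (foo)
exec (
bar
)
# exec(foo)
print("exec(bar)")
'''

# Approximation only: "exec" not preceded by a word character,
# then optional whitespace, then "(".
EXEC_CALL = re.compile(r"(?<![A-Za-z0-9_])exec\s*\(")

def find_exec_lines(text: str) -> list[int]:
    """Return the 1-based line numbers where a match starts."""
    return [text.count("\n", 0, m.start()) + 1 for m in EXEC_CALL.finditer(text)]

print(find_exec_lines(SAMPLE))  # [3, 4, 6, 7, 10, 11]
```

The sketch finds six matches, in line with the six findings above: text inside comments and strings is treated like any other text, while some_exec is a different word.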
Semgrep Registry rules for generic pattern matching
You can peruse existing generic rules in the Semgrep registry. In general, short patterns on structured data will perform the best.
Cheat sheet
The generic tab of the Semgrep cheat sheet shows examples of what will and will not match.
Hidden bonus
In the Semgrep code the generic pattern matching implementation is called spacegrep because it tokenizes based on whitespace (and because it sounds cool 😎).