Generic pattern matching

Introduction

Semgrep can match generic patterns in languages that it does not yet support. Use generic pattern matching for languages that do not have a parser, configuration files, or other structured data such as XML. Generic pattern matching can also be helpful in files containing multiple languages, even if the languages are otherwise supported, such as HTML with embedded JavaScript or PHP code. In those cases, you can also consider Extract mode (experimental), but generic patterns may be more straightforward and still effective.

As an example of generic matching, consider this rule:

rules:
  - id: dynamic-proxy-scheme
    pattern: proxy_pass $$SCHEME:// ...;
    paths:
      include:
        - "*.conf"
        - "*.vhost"
        - sites-available/*
        - sites-enabled/*
    languages:
      - generic
    severity: MEDIUM
    message: >-
      The protocol scheme for this proxy is dynamically determined.
      This can be dangerous if the scheme is injected by an
      attacker because it may forcibly alter the connection scheme.
      Consider hardcoding a scheme for this proxy.
    metadata:
      references:
        - https://github.com/yandex/gixy/blob/master/docs/en/plugins/ssrf.md
      category: security
      technology:
        - nginx
      confidence: MEDIUM

The preceding rule matches this code snippet:

server {
  listen              443 ssl;
  server_name         www.example.com;
  keepalive_timeout   70;

  ssl_certificate     www.example.com.crt;
  ssl_certificate_key www.example.com.key;

  location ~ /proxy/(.*)/(.*)/(.*)$ {
    # ruleid: dynamic-proxy-scheme
    proxy_pass $1://$2/$3;
  }

  location ~* ^/internal-proxy/(?<proxy_proto>https?)/(?<proxy_host>.*?)/(?<proxy_path>.*)$ {
    internal;

    # ruleid: dynamic-proxy-scheme
    proxy_pass $proxy_proto://$proxy_host/$proxy_path ;
    proxy_set_header Host $proxy_host;
  }

  location ~ /proxy/(.*)/(.*)/(.*)$ {
    # ok: dynamic-proxy-scheme
    proxy_pass http://$1/$2/$3;
  }

  location ~ /proxy/(.*)/(.*)/(.*)$ {
    # ok: dynamic-proxy-scheme
    proxy_pass https://$1/$2/$3;
  }
}

Generic pattern matching has the following properties:

A document is interpreted as a nested sequence of ASCII words, ASCII punctuation, and other bytes.
... (ellipsis operator) allows skipping non-matching elements, up to 10 lines down from the last match.
$X (metavariable) matches any word.
$...X (ellipsis metavariable) matches a sequence of words, up to 10 lines down from the last match.
Indentation determines primary nesting in the document.
Common ASCII braces (), [], and {} introduce secondary nesting but only within single lines. Therefore, misinterpreted or mismatched braces don't disturb the structure of the rest of the document.
The document must be at least as indented as the pattern: any indentation specified in the pattern must be honored in the document.
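
As a hedged illustration of the difference between $X and $...X, the two rules below are a sketch written against the nginx snippet above. The rule ids and messages are hypothetical. The generic_ellipsis_max_span option (described under Handling line-based input) is used on the assumption that a server name should be captured from a single line only.

```yaml
rules:
  # Hypothetical rule: $PORT captures a single word, such as 443.
  - id: nginx-tls-listen-port
    pattern: listen $PORT ssl;
    languages:
      - generic
    severity: WARNING
    message: "TLS listener found on port $PORT"

  # Hypothetical rule: $...NAME captures a sequence of tokens, such as
  # www.example.com, which consists of several words and punctuation marks.
  - id: nginx-server-name
    pattern: server_name $...NAME;
    options:
      # keep the ellipsis metavariable on one line
      generic_ellipsis_max_span: 0
    languages:
      - generic
    severity: WARNING
    message: "Server name declared: $...NAME"
```

Against the snippet above, the first rule would be expected to bind $PORT to 443, assuming that runs of spaces within a line are not significant. A plain $NAME in the second rule would not work, because www.example.com is not a single word.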

Caveats and limitations of generic mode

Semgrep can reliably understand the syntax of natively supported languages. Generic mode has no such understanding of syntax, which makes it usable for unsupported languages but also brings specific limitations.

Caution: The quality of results in the generic mode can vary depending on the language you use it for.

The generic mode works fine with any human-readable text, as long as it is primarily based on ASCII symbols. Since the generic mode does not understand the syntax of the language you are scanning, the quality of the result may differ from language to language or even depend on specific code. As a consequence, the generic mode works well for some languages, but it does not always give consistent results. Generally, it's possible or even easy to write code in weird ways that prevent generic mode from matching.

Example: In XML, one can write Hell&#x6F; instead of Hello. If a rule pattern in generic mode is Hello, Semgrep is unable to match Hell&#x6F;, whereas it could if it had full XML support.

With respect to Semgrep operators and features:

Metavariable support is limited to capturing a single “word”, which is a token of the form [A-Za-z0-9_]+. They can’t capture sequences of tokens such as hello, world (in this case, there are three tokens: hello, ,, and world).
The ellipsis operator is supported and spans, at most, 10 lines.
The pattern operators like either/not/inside are supported.
Inline regular expressions for strings ("=~/word.*/") are not supported.

Troubleshooting

Common pitfall #1: not enough `...`

Rule of thumb:

If the pattern commonly matches many lines, use ... ... (20 lines), or ... ... ... (30 lines), to ensure that all lines are matched.

Here's an innocuous pattern that should match the call to a function f():

f(...)

It matches the following code just fine:

f(
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9
)

But it fails here because the function arguments span more than 10 lines:

f(
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10
)

The solution is to use multiple ... in the pattern:

f(... ...)
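
Put into a complete rule, the fix might look like the following sketch. The rule id and message are placeholders and are not taken from the Semgrep registry.

```yaml
rules:
  - id: call-to-f   # hypothetical rule id
    # two ellipses: each one can skip up to 10 lines
    pattern: f(... ...)
    languages:
      - generic
    severity: WARNING
    message: Call to f() found.
```

Following the rule of thumb above, f(... ...) tolerates argument lists of up to about 20 lines, and a third ... would extend that to about 30.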

Common pitfall #2: not enough indentation

Rule of thumb:

If the target code is always indented, use indentation in the pattern.

In the following example, the goal is to match the system sections containing a name field:

# match here
[system]
  name = "Debian"
# DON'T match here
[system]
  max_threads = 2
[user]
  name = "Admin Overlord"

❌ This pattern incorrectly catches the name field in the user section:

[system]
...
name = ...

✅ This pattern catches only the name field in the system section:

[system]
  ...
  name = ...
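
In a rule file, the indentation of the pattern has to survive YAML parsing. A block scalar (|) preserves the relative indentation of its lines, so a sketch of the full rule, with a hypothetical id and message, could look like this:

```yaml
rules:
  - id: system-section-with-name   # hypothetical rule id
    # the block scalar keeps the two-space indentation below [system]
    pattern: |
      [system]
        ...
        name = ...
    languages:
      - generic
    severity: WARNING
    message: Found a name field in a system section.
```

The leading spaces shared by all three pattern lines are stripped by YAML. Only the extra indentation of the last two lines relative to [system] reaches Semgrep.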

Handling line-based input

This section explains how to use Semgrep's generic mode to match single lines of code using an ellipsis metavariable. Many simple configuration formats are collections of key and value pairs delimited by newlines. For example, suppose you want to extract the password value from the following made-up input:

username = bob
password = p@$$w0rd
server = example.com

Unfortunately, the following pattern does not match the whole line. In generic mode, metavariables only capture a single word (alphanumeric sequence):

password = $PASSWORD

This pattern matches the input file but assigns only the value p to $PASSWORD instead of the full value p@$$w0rd.

To match an arbitrary sequence of items and capture their value in the example:

Use a named ellipsis by changing the pattern to the following:
```
password = $...PASSWORD
```

This still leads Semgrep to capture too much information. The value assigned to $...PASSWORD is now p@$$w0rd and server = example.com. In generic mode, an ellipsis extends until the end of the current block or up to 10 lines below, whichever comes first. To prevent this behavior, continue with the next step.

In the Semgrep rule, specify the following key:
```
generic_ellipsis_max_span: 0
```

This option forces the ellipsis operator to match patterns within a single line. Example of the resulting rule:

id: password-in-config-file
pattern: |
  password = $...PASSWORD
options:
  # prevent ellipses from matching multiple lines
  generic_ellipsis_max_span: 0
message: |
  password found in config file: $...PASSWORD
languages:
  - generic
severity: WARNING
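
Because pattern operators such as either are supported in generic mode, the same technique can be extended to several key names. The following is a sketch only: the rule id and the alternative key names passwd and secret are illustrative assumptions and are not part of the example input above.

```yaml
rules:
  - id: credential-in-config-file   # hypothetical rule id
    pattern-either:
      - pattern: password = $...VALUE
      - pattern: passwd = $...VALUE    # assumed alternative key name
      - pattern: secret = $...VALUE    # assumed alternative key name
    options:
      # prevent ellipses from matching multiple lines
      generic_ellipsis_max_span: 0
    message: |
      credential found in config file: $...VALUE
    languages:
      - generic
    severity: WARNING
```

Using the same metavariable name in every alternative lets one message cover all of them.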

Ignoring comments

By default, the generic mode does not know about comments or code that can be ignored. The following example is scanning for CSS code that sets the text color to blue. The target code is the following:

color: /* my fave color */ blue;

Set options.generic_comment_style to c to ignore C-style comments such as the one in this example. The Semgrep rule is:

id: css-blue-is-ugly
pattern: |
  color: blue
options:
  # ignore comments of the form /* ... */
  generic_comment_style: c
message: |
  Blue is ugly.
languages:
  - generic
severity: WARNING

Command line example

Sample pattern: exec(...)

Sample target file exec.txt contains:

import exec as safe_function
safe_function(user_input)

exec("ls")

exec(some_var)

some_exec(foo)

exec (foo)

exec (
    bar
)

# exec(foo)

print("exec(bar)")

Output:

$ semgrep -l generic -e 'exec(...)' exec.txt
7:exec("ls")
--------------------------------------------------------------------------------
11:exec(some_var)
--------------------------------------------------------------------------------
19:exec (foo)
--------------------------------------------------------------------------------
23:exec (
24:
25:    bar
26:
27:)
--------------------------------------------------------------------------------
31:# exec(foo)
--------------------------------------------------------------------------------
35:print("exec(bar)")
ran 1 rules on 1 files: 6 findings
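
The last two findings come from a comment and a string literal, which generic mode does not distinguish from code. If the target uses shell-style # comments, the first of these could likely be suppressed with generic_comment_style. The sketch below assumes that shell is an accepted value for that option (alongside c, shown earlier) and uses a hypothetical rule id:

```yaml
rules:
  - id: exec-call   # hypothetical rule id
    pattern: exec(...)
    options:
      # assumption: "shell" makes generic mode ignore # ... comments
      generic_comment_style: shell
    languages:
      - generic
    severity: WARNING
    message: Call to exec() found.
```

The match inside print("exec(bar)") would remain, because generic mode does not treat string literals specially.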

Semgrep Registry rules for generic pattern matching

You can peruse existing generic rules in the Semgrep registry. In general, short patterns on structured data perform best.

Cheat sheet

For examples of what matches and what doesn't match, see the generic tab of the Semgrep cheat sheet.

Hidden bonus

In the Semgrep code, the generic pattern matching implementation is called spacegrep because it tokenizes based on whitespace (and because it sounds cool 😎).
