Rule syntax

tip

Getting started with rule writing? Try the Semgrep Tutorial 🎓

This document describes Semgrep’s YAML rule syntax.

Schema

Required

All required fields must be present at the top level of a rule, immediately underneath the rules key.

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique, descriptive identifier, e.g., no-unused-variable |
| message | string | Message highlighting why this rule fired and how to remediate the issue |
| severity | string | One of: INFO, WARNING, or ERROR |
| languages | array | See language extensions and tags |
| pattern* | string | Find code matching this expression |
| patterns* | array | Logical AND of multiple patterns |
| pattern-either* | array | Logical OR of multiple patterns |
| pattern-regex* | string | Find code matching this PCRE-compatible pattern in multiline mode |
info

Only one of pattern, patterns, pattern-either, or pattern-regex is required.

Language extensions and tags

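| Language | Extensions | Tags |
| --- | --- | --- |
| Bash | .sh | bash |
| C | .c | c |
| C++ | .cpp, .h | cpp |
| C# | .cs | csharp, cs, C# |
| Generic | | generic |
| Go | .go | go, golang |
| Hack | .h, .hack | hack |
| HTML | .htm, .html | html |
| Java | .java | java |
| JavaScript | .js, .jsx | js, jsx, javascript |
| JSON | .json | json, JSON, Json |
| JSX | .js, .jsx | js, jsx, javascript |
| Kotlin | .kt, .kts, .ktm | kotlin |
| Lua | .lua | lua |
| OCaml | .ml, .mli | ocaml, ml |
| PHP | .php | php |
| Python | .py, .pyi | python, python2, python3, py |
| R | .r, .rda, .rds | r |
| Ruby | .rb | ruby, rb |
| Rust | .rs | rust, Rust, rs |
| Scala | .scala, .sc | scala |
| Solidity | .sol | solidity |
| Terraform | .tf | hcl |
| TypeScript | .ts, .tsx | ts, tsx, typescript |
| TSX | .ts, .tsx | ts, tsx, typescript |
| YAML | .yaml | yaml |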

Optional

| Field | Type | Description |
| --- | --- | --- |
| options | object | Options object to enable/disable certain matching features |
| fix | object | Simple search-and-replace autofix functionality |
| metadata | object | Arbitrary user-provided data; attach data to rules without affecting Semgrep’s behavior |
| paths | object | Paths to include or exclude when running this rule |

The below optional fields must reside underneath a patterns or pattern-either field.

| Field | Type | Description |
| --- | --- | --- |
| pattern-inside | string | Keep findings that lie inside this pattern |

The below optional fields must reside underneath a patterns field.

| Field | Type | Description |
| --- | --- | --- |
| metavariable-regex | map | Search metavariables for Python re compatible expressions; regex matching is unanchored |
| metavariable-pattern | map | Matches metavariables with a pattern formula |
| metavariable-comparison | map | Compare metavariables against basic Python expressions |
| pattern-not | string | Logical NOT - remove findings matching this expression |
| pattern-not-inside | string | Keep findings that do not lie inside this pattern |
| pattern-not-regex | string | Filter results using a PCRE-compatible pattern in multiline mode |

Operators

pattern

The pattern operator looks for code matching its expression. This can be basic expressions like $X == $X or unwanted function calls like hashlib.md5(...).
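As a minimal sketch, a rule that uses only the pattern operator could look like the following. The rule id and message are illustrative, not taken from an official rule set.

```yaml
rules:
  - id: md5-usage              # illustrative id
    languages: [python]
    severity: ERROR
    message: Found hashlib.md5 usage
    # The ellipsis matches any arguments passed to the call.
    pattern: hashlib.md5(...)
```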

patterns

The patterns operator performs a logical AND operation on one or more child patterns. This is useful for chaining multiple patterns together that all must be true.
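A hedged sketch of a logical AND: the rule below only reports a call when both child patterns hold for the same code. The rule id, message, and the choice of subprocess.call as the target are assumptions made for illustration.

```yaml
rules:
  - id: shell-call-in-request-handler   # illustrative id
    languages: [python]
    severity: WARNING
    message: subprocess call with shell=True inside a function that takes a request
    patterns:
      # Both children must match: the call must lie inside such a function ...
      - pattern-inside: |
          def $HANDLER(request, ...):
            ...
      # ... and the call itself must pass shell=True.
      - pattern: subprocess.call(..., shell=True, ...)
```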

patterns operator evaluation strategy

Note that the order in which the child patterns are declared in a patterns operator has no effect on the final result. A patterns operator is always evaluated in the same way:

  1. Semgrep evaluates all positive patterns, that is pattern-insides, patterns, pattern-regexes, and pattern-eithers. Each range matched by each one of these patterns is intersected with the ranges matched by the other operators. The result is a set of positive ranges. The positive ranges carry metavariable bindings. For example, in one range $X can be bound to the function call foo(), and in another range $X can be bound to the expression a + b.
  2. Semgrep evaluates all negative patterns, that is pattern-not-insides, pattern-nots, and pattern-not-regexes. This gives a set of negative ranges which are used to filter the positive ranges. This results in a subset of the positive ranges computed in the previous step.
  3. Semgrep evaluates all conditionals, that is metavariable-regexes, metavariable-patterns, and metavariable-comparisons. These conditional operators can only examine the metavariables bound in the positive ranges in step 1, that passed through the filter of negative patterns in step 2. Note that metavariables bound by negative patterns are not available here.
  4. Semgrep applies all focus-metavariables, by computing the intersection of each positive range with the range of the metavariable on which we want to focus. Again, the only metavariables available to focus on are those bound by positive patterns.
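The four steps above can be illustrated with a sketch in which the children are deliberately listed in reverse order. Under the evaluation strategy described here, the result should be the same as for any other ordering. The rule id, message, and the set_ naming convention are hypothetical.

```yaml
rules:
  - id: evaluation-order-demo           # illustrative id
    languages: [python]
    severity: INFO
    message: call to a setter-like function with argument $ARG
    patterns:
      # Step 4: narrow each surviving range to the code bound to $ARG.
      - focus-metavariable: $ARG
      # Step 3: conditional on a metavariable bound by the positive pattern.
      - metavariable-regex:
          metavariable: $FUNC
          regex: '\Aset_'
      # Step 2: negative pattern, used to filter the positive ranges.
      - pattern-not: $FUNC(None)
      # Step 1: positive pattern, binds $FUNC and $ARG.
      - pattern: $FUNC($ARG)
```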

pattern-either

The pattern-either operator performs a logical OR operation on one or more child patterns. This is useful for chaining multiple patterns together where any may be true.

This rule looks for usage of the Python standard library functions hashlib.md5 or hashlib.sha1. Depending on their usage, these hashing functions are considered insecure.
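A rule of the kind described here can be sketched as follows. The rule id and message are illustrative.

```yaml
rules:
  - id: insecure-crypto-usage           # illustrative id
    languages: [python]
    severity: ERROR
    message: Insecure cryptographic hashing function
    # A finding is reported if either child pattern matches.
    pattern-either:
      - pattern: hashlib.sha1(...)
      - pattern: hashlib.md5(...)
```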

pattern-regex

The pattern-regex operator searches files for substrings matching the given PCRE pattern. This is useful for migrating existing regular expression code search functionality to Semgrep. PCRE (Perl-Compatible Regular Expressions) is a full-featured regex library that is widely compatible with Perl and also with the regex libraries of Python, JavaScript, Go, Ruby, and Java. Since Semgrep 0.95.0, patterns are compiled in multiline mode: ^ and $ match at the beginning and end of each line, in addition to the beginning and end of the input.

⚠️ PCRE supports only a limited number of Unicode character properties. For example, \p{Egyptian_Hieroglyphs} is supported but \p{Bidi_Control} isn't.

The pattern-regex operator can be combined with other pattern operators:
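A minimal sketch of such a combination, assuming the goal is to flag IP-address literals passed as the host of a boto3 client. The rule id and message are illustrative.

```yaml
rules:
  - id: boto-client-ip                  # illustrative id
    languages: [python]
    severity: ERROR
    message: boto client using IP address
    patterns:
      # Restrict regex matches to the inside of this call.
      - pattern-inside: boto3.client(host="...")
      # Single quotes keep the backslashes literal in YAML.
      - pattern-regex: '\d+\.\d+\.\d+\.\d+'
```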

It can also be used as a standalone, top-level operator:
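A sketch of a standalone use, where the regular expression is the whole rule. The rule id and message are illustrative.

```yaml
rules:
  - id: legacy-eval-search              # illustrative id
    languages: [javascript]
    severity: ERROR
    message: Insecure code execution
    # Matches the literal text "eval(" anywhere in the file.
    pattern-regex: 'eval\('
```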

info

Single (') and double (") quotes behave differently in YAML syntax. Single quotes are typically preferred when using backslashes (\) with pattern-regex.

Note that if the regex uses groups, the metavariables $1, $2, etc. will be bound to the content of the captured group.

pattern-not-regex

The pattern-not-regex operator filters results using a PCRE regular expression in multiline mode. This is most useful when combined with regular-expression only rules, providing an easy way to filter findings without having to use negative lookaheads. pattern-not-regex will work with regular pattern clauses, too.

The syntax for this operator is the same as pattern-regex.

This operator will filter findings that have any overlap with the supplied regular expression. For example, if you use pattern-regex to detect Foo==1.1.1 and it also detects Foo-Bar==3.0.8 and Bar-Foo==3.0.8, you can use pattern-not-regex to filter the unwanted findings.
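The Foo example above could be sketched as follows, assuming the targets are plain-text dependency files scanned in generic mode. The rule id and message are illustrative.

```yaml
rules:
  - id: detect-only-foo-package         # illustrative id
    languages: [generic]
    severity: ERROR
    message: Found Foo package
    patterns:
      - pattern-regex: Foo
      # Drop findings that overlap Foo-Bar style names ...
      - pattern-not-regex: Foo-
      # ... and Bar-Foo style names.
      - pattern-not-regex: -Foo
```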

focus-metavariable

The focus-metavariable operator puts the focus, or zooms in, on the code region matched by a single metavariable or a list of metavariables. For example, to find all function arguments annotated with the type bad you may write the following pattern:

pattern: |
  def $FUNC(..., $ARG : bad, ...):
    ...

This works but it matches the entire function definition. Sometimes, this is not desirable. If the definition spans hundreds of lines they are all matched. In particular, if you are using Semgrep App and you have triaged a finding generated by this pattern, the same finding shows up again as new if you make any change to the definition of the function!

To specify that you are only interested in the code matched by a particular metavariable, in our example $ARG, use focus-metavariable.
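A minimal sketch of a complete rule using focus-metavariable with the pattern above. The rule id and message are illustrative.

```yaml
rules:
  - id: find-bad-args                   # illustrative id
    languages: [python]
    severity: WARNING
    message: Parameter $ARG is annotated with the type bad
    patterns:
      # Matches the whole function definition and binds $ARG.
      - pattern: |
          def $FUNC(..., $ARG : bad, ...):
            ...
      # Reports only the region bound to $ARG instead of the whole function.
      - focus-metavariable: $ARG
```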

Note that focus-metavariable: $ARG is not the same as pattern: $ARG. Using pattern: $ARG finds every use of the parameter bound to $ARG (for example, a parameter named x) inside the function, which is not what we want. (Note that pattern: $ARG does not match the formal parameter declaration, because in this context $ARG only matches expressions.)

In short, focus-metavariable: $X is not a pattern in itself, it does not perform any matching, it only focuses the matching on the code already bound to $X by other patterns. Whereas pattern: $X matches $X against your code (and in this context, $X only matches expressions)!

info

To make a list of multiple focus metavariables, see the Using multiple focus metavariables documentation.

metavariable-regex

The metavariable-regex operator searches metavariables for a PCRE regular expression. This is useful for filtering results based on a metavariable’s value. It requires the metavariable and regex keys and can be combined with other pattern operators.
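As a sketch, the rule below keeps only calls whose method name contains the text insecure. The module object and the method names are hypothetical. Because matching is unanchored, a call such as module.insecure1(...) would be expected to match.

```yaml
rules:
  - id: insecure-methods                # illustrative id
    languages: [python]
    severity: ERROR
    message: module using insecure method call
    patterns:
      - pattern: module.$METHOD(...)
      - metavariable-regex:
          metavariable: $METHOD
          # Unanchored: matches any method name containing "insecure".
          regex: (insecure)
```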

Regex matching is unanchored. For anchored matching, use \A for start-of-string anchoring and \Z for end-of-string anchoring. The next example, using the same expression as above but anchored, will find no matches:
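A sketch of an anchored variant, assuming the target code only calls hypothetical methods such as module.insecure1(...) and module.insecure2(...). The anchored expression requires the method name to be exactly insecure, so none of those calls would match.

```yaml
rules:
  - id: insecure-methods-anchored       # illustrative id
    languages: [python]
    severity: ERROR
    message: module using insecure method call
    patterns:
      - pattern: module.$METHOD(...)
      - metavariable-regex:
          metavariable: $METHOD
          # \A and \Z anchor the match to the whole metavariable content.
          regex: '\Ainsecure\Z'
```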

info

Include quotes in your regular expression when using metavariable-regex to search string literals. See this snippet for more details. String matching functionality can also be used to search string literals.

metavariable-pattern

The metavariable-pattern operator matches metavariables with a pattern formula. This is useful for filtering results based on a metavariable’s value. It requires the metavariable key, and exactly one key of pattern, patterns, pattern-either, or pattern-regex. This operator can be nested as well as combined with other operators.

For example, it can be used to filter out matches that do not match certain criteria:
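A hedged sketch of such a filter: the rule reports an HTTPS server whose options object does not disable legacy protocol versions. The rule id and message are illustrative, and the exact option combination is an assumption chosen for the example.

```yaml
rules:
  - id: disallow-old-tls-versions       # illustrative id
    languages: [javascript]
    severity: WARNING
    message: HTTPS server created without disabling legacy SSL/TLS versions
    patterns:
      - pattern: |
          $CONST = require('crypto');
          ...
          $OPTIONS = $OPTS;
          ...
          https.createServer($OPTIONS, ...);
      # Keep the match only if the code bound to $OPTS is NOT this object.
      - metavariable-pattern:
          metavariable: $OPTS
          patterns:
            - pattern-not: >
                {secureOptions: $CONST.SSL_OP_NO_SSLv2 | $CONST.SSL_OP_NO_SSLv3
                | $CONST.SSL_OP_NO_TLSv1}
```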

info

In this case it is possible to start a patterns AND operation with a pattern-not, because there is an implicit pattern: ... that matches the content of the metavariable.

It is also useful in combination with pattern-either:
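A minimal sketch of metavariable-pattern with pattern-either, assuming the goal is to flag hashlib.new calls whose first argument is one of two string literals. The rule id and message are illustrative.

```yaml
rules:
  - id: weak-hash-by-name               # illustrative id
    languages: [python]
    severity: WARNING
    message: hashlib.new called with a weak algorithm name
    patterns:
      - pattern: hashlib.new($ALGO, ...)
      # $ALGO must match at least one of the listed string literals.
      - metavariable-pattern:
          metavariable: $ALGO
          pattern-either:
            - pattern: '"md5"'
            - pattern: '"sha1"'
```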

tip

It is possible to nest metavariable-pattern inside metavariable-pattern!

info

The metavariable should be bound to an expression, a statement, or a list of statements, for this test to be meaningful. A metavariable bound to a list of function arguments, a type, or a pattern, will always evaluate to false.

metavariable-pattern with nested language

If the metavariable's content is a string, then it is possible to use metavariable-pattern to match this string as code by specifying the target language via the language key.

For example, we can match JavaScript code inside HTML:
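A sketch of this technique, assuming the HTML is scanned in generic mode and the script body is captured with an ellipsis metavariable. The rule id and message are illustrative.

```yaml
rules:
  - id: console-log-in-html-script      # illustrative id
    languages: [generic]
    severity: WARNING
    message: console.log call inside an HTML script element
    patterns:
      # $...JS captures the text between the script tags.
      - pattern: |
          <script ...>$...JS</script>
      # Re-parse the captured text as JavaScript and match inside it.
      - metavariable-pattern:
          metavariable: $...JS
          language: javascript
          patterns:
            - pattern: |
                console.log(...)
```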

We can also use this feature to filter regex matches:
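A sketch of filtering regex matches, assuming a Gemfile-like target where each dependency line has the form gem "name", "version". The capture groups bind $1 and $2, and the nested pattern keeps only names containing google. The rule id and message are illustrative.

```yaml
rules:
  - id: google-gem-dependency           # illustrative id
    languages: [generic]
    severity: INFO
    message: "Google dependency: $1 $2"
    patterns:
      - pattern-regex: 'gem "(.*)", "(.*)"'
      # Match the first capture group as generic text.
      - metavariable-pattern:
          metavariable: $1
          language: generic
          patterns:
            - pattern: google
```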

metavariable-comparison

The metavariable-comparison operator compares metavariables against a basic Python comparison expression. This is useful for filtering results based on a metavariable's numeric value.

The metavariable-comparison operator is a mapping which requires the metavariable and comparison keys. It can be combined with other pattern operators:
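A minimal sketch, assuming a hypothetical set_port function whose argument is an integer literal. The rule id and message are illustrative.

```yaml
rules:
  - id: superuser-port                  # illustrative id
    languages: [python]
    severity: ERROR
    message: module setting superuser port
    patterns:
      - pattern: set_port($ARG)
      # Keep only matches where the bound value is below 1024.
      - metavariable-comparison:
          metavariable: $ARG
          comparison: $ARG < 1024
```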

This will catch code like set_port(80) or set_port(443), but not set_port(8080).

Comparison expressions support simple arithmetic as well as composition with boolean operators to allow for more complex matching. This is particularly useful for checking that metavariables are divisible by particular values, such as enforcing that a particular value is even or odd:
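A sketch of a composed comparison, again assuming a hypothetical set_port function. It keeps only ports that are both below 1024 and even.

```yaml
rules:
  - id: superuser-port-even             # illustrative id
    languages: [python]
    severity: ERROR
    message: module setting an even-numbered superuser port
    patterns:
      - pattern: set_port($ARG)
      - metavariable-comparison:
          metavariable: $ARG
          # Arithmetic (%) combined with the boolean operator "and".
          comparison: $ARG < 1024 and $ARG % 2 == 0
```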

Building off of the previous example this will still catch code like set_port(80) but will no longer catch set_port(443) or set_port(8080).

The comparison key accepts a Python expression using:

  • Boolean, string, integer, and float literals.
  • Boolean operators not, or, and and.
  • Arithmetic operators +, -, *, /, and %.
  • Comparison operators ==, !=, <, <=, >, and >=.
  • Function int() to convert strings into integers.
  • Function str() to convert numbers into strings.
  • Lists and the in infix operator.
  • Function re.match() to match a regular expression (without the optional flags argument).

You can use Semgrep metavariables such as $MVAR, which Semgrep evaluates as follows:

  • If $MVAR binds to a literal, then that literal is the value assigned to $MVAR.
  • If $MVAR binds to a code variable that is a constant, and constant propagation is enabled (as it is by default), then that constant is the value assigned to $MVAR.
  • Otherwise the code bound to $MVAR is kept unevaluated, and its string representation can be obtained using the str() function, as in str($MVAR). For example, if $MVAR binds to the code variable x, str($MVAR) evaluates to the string literal "x".

Legacy metavariable-comparison keys

info

You can avoid the use of the legacy keys described below (base: int and strip: bool) by using the int() function, as in int($ARG) > 0o600 or int($ARG) > 2147483647.

The metavariable-comparison operator also takes optional base: int and strip: bool keys. These keys set the integer base the metavariable value should be interpreted as and remove quotes from the metavariable value, respectively.

For example, base:
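A sketch of the base key, assuming a hypothetical set_permissions function that takes an octal permission value. The rule id and message are illustrative.

```yaml
rules:
  - id: excessive-permissions           # illustrative id
    languages: [python]
    severity: ERROR
    message: module setting excessive permissions
    patterns:
      - pattern: set_permissions($ARG)
      - metavariable-comparison:
          metavariable: $ARG
          comparison: $ARG > 0o600
          # Interpret the metavariable value as a base-8 integer.
          base: 8
```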

This will interpret metavariable values found in code as octal, so 0700 will be detected, but 0400 will not.

For example, strip:
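A sketch of the strip key, assuming the rule targets int() calls on quoted numeric strings. The rule id and message are illustrative.

```yaml
rules:
  - id: int-overflow                    # illustrative id
    languages: [python]
    severity: ERROR
    message: Potential integer overflow
    patterns:
      - pattern: int($ARG)
      - metavariable-comparison:
          metavariable: $ARG
          comparison: $ARG > 2147483647
          # Remove surrounding quotes before comparing.
          strip: true
```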

This will remove quotes (', ", and `) from both ends of the metavariable content. So "2147483648" will be detected but "2147483646" will not. This is useful when you expect strings to contain integer or float data.

pattern-not

The pattern-not operator is the opposite of the pattern operator. It finds code that does not match its expression. This is useful for eliminating common false positives.
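A minimal sketch, assuming a hypothetical db_query function that accepts a verify keyword argument. Calls that pass verify=True are removed from the findings.

```yaml
rules:
  - id: unverified-db-query             # illustrative id
    languages: [python]
    severity: ERROR
    message: Found unverified db query
    patterns:
      - pattern: db_query(...)
      # Exclude calls that already pass verify=True.
      - pattern-not: db_query(..., verify=True, ...)
```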

pattern-inside

The pattern-inside operator keeps matched findings that reside within its expression. This is useful for finding code inside other pieces of code like functions or if blocks.
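A sketch of pattern-inside, assuming the goal is to flag return statements that carry a value inside a class constructor. The rule id and message are illustrative.

```yaml
rules:
  - id: return-in-init                  # illustrative id
    languages: [python]
    severity: ERROR
    message: return should never appear inside a class __init__ function
    patterns:
      - pattern: return ...
      # The return must lie inside a class body ...
      - pattern-inside: |
          class $CLASS:
            ...
      # ... and inside an __init__ method.
      - pattern-inside: |
          def __init__(...):
            ...
```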

pattern-not-inside

The pattern-not-inside operator keeps matched findings that do not reside within its expression. It is the opposite of pattern-inside. This is useful for finding code that’s missing a corresponding cleanup action like disconnect, close, or shutdown. It’s also useful for finding problematic code that isn't inside code that mitigates the issue.
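A sketch of a rule that looks for a missing cleanup action, here a file that is opened without a later close on the same variable. The rule id and message are illustrative.

```yaml
rules:
  - id: open-never-closed               # illustrative id
    languages: [python]
    severity: ERROR
    message: file object opened without corresponding close
    patterns:
      - pattern: $F = open(...)
      # Discard the finding when a close() on the same $F follows.
      - pattern-not-inside: |
          $F = open(...)
          ...
          $F.close()
```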

The above rule looks for files that are opened but never closed, possibly leading to resource exhaustion. It looks for the open(...) pattern and not a following close() pattern.

The $F metavariable ensures that the same variable name is used in the open and close calls. The ellipsis operator allows for any arguments to be passed to open and any sequence of code statements in between the open and close calls. The rule ignores how open is called and what happens before the close call; it only checks that close is called.

pattern-where-python

danger

This feature was deprecated in Semgrep v0.61.0.

The pattern-where-python operator is the most flexible operator. It allows for writing custom Python logic to filter findings. This is useful when none of the other operators provide the functionality needed to create a rule.

danger

Use caution with this operator. It allows for arbitrary Python code execution.

As a defensive measure, the --dangerously-allow-arbitrary-code-execution-from-rules flag must be passed to use rules containing pattern-where-python.

Example:

rules:
  - id: use-decimalfield-for-money
    patterns:
      - pattern: $FIELD = django.db.models.FloatField(...)
      - pattern-inside: |
          class $CLASS(...):
            ...
      - pattern-where-python: "'price' in vars['$FIELD'] or 'salary' in vars['$FIELD']"
    message: "use DecimalField for currency fields to avoid float-rounding errors"
    languages: [python]
    severity: ERROR

The above rule looks for use of Django’s FloatField model when storing currency information. FloatField can lead to rounding errors and should be avoided in favor of DecimalField when dealing with currency. Here the pattern-where-python operator allows us to utilize the Python in statement to filter findings that look like currency.

Metavariable matching

Metavariable matching operates differently for logical AND (patterns) and logical OR (pattern-either) parent operators. Behavior is consistent across all child operators: pattern, pattern-not, pattern-regex, pattern-inside, pattern-not-inside.

Metavariables in logical ANDs

Metavariable values must be identical across sub-patterns when performing logical AND operations with the patterns operator.

Example:

rules:
  - id: function-args-to-open
    patterns:
      - pattern-inside: |
          def $F($X):
            ...
      - pattern: open($X)
    message: "Function argument passed to open() builtin"
    languages: [python]
    severity: ERROR

This rule matches the following code:

def foo(path):
    open(path)

The example rule doesn’t match this code:

def foo(path):
    open(something_else)

Metavariables in logical ORs

Metavariable matching does not affect the matching of logical OR operations with the pattern-either operator.

Example:

rules:
  - id: insecure-function-call
    pattern-either:
      - pattern: insecure_func1($X)
      - pattern: insecure_func2($X)
    message: "Insecure function use"
    languages: [python]
    severity: ERROR

The above rule matches both examples below:

insecure_func1(something)
insecure_func2(something)

insecure_func1(something)
insecure_func2(something_else)

Metavariables in complex logic

Metavariable matching still affects subsequent logical ORs if the parent is a logical AND.

Example:

patterns:
  - pattern-inside: |
      def $F($X):
        ...
  - pattern-either:
      - pattern: bar($X)
      - pattern: baz($X)

The above rule matches both examples below:

def foo(something):
    bar(something)

def foo(something):
    baz(something)

The example rule doesn’t match this code:

def foo(something):
    bar(something_else)

options

Enable, disable, or modify the following matching features:

| Option | Default | Description |
| --- | --- | --- |
| ac_matching | true | Matching modulo associativity and commutativity: Boolean AND/OR are treated as associative, and bitwise AND/OR/XOR as both associative and commutative. |
| attr_expr | true | Expression patterns (e.g., f($X)) will match attributes (e.g., @f(a)). |
| commutative_boolop | false | Treat Boolean AND/OR as commutative even if not semantically accurate. |
| constant_propagation | true | Constant propagation, including intra-procedural flow-sensitive constant propagation. |
| generic_comment_style | none | In generic mode, assume that comments follow the specified syntax. They are then ignored for matching purposes. Allowed values for comment styles are c for traditional C-style comments (/* ... */), cpp for modern C or C++ comments (// ... or /* ... */), and shell (# ...). By default, the generic mode does not recognize any comments. Available since Semgrep 0.96. |
| generic_ellipsis_max_span | 10 | In generic mode, this is the maximum number of newlines that an ellipsis pattern ... can match or, equivalently, the maximum number of lines covered by the match minus one. The default value is 10 (newlines) for performance reasons. Increase it with caution. Note that the same effect as 20 can be achieved without changing this setting by writing ... ... in the pattern instead of .... Setting it to 0 is useful with line-oriented languages (for example INI or key-value pairs in general) to force a match to not extend to the next line of code. Available since Semgrep 0.96. |
| vardef_assign | true | Assignment patterns (for example $X = $E) match variable declarations (for example var x = 1;). |
| xml_attrs_implicit_ellipsis | true | Any XML/JSX/HTML element patterns have implicit ellipsis for attributes (for example, <div /> matches <div foo="1">). |

The full list of available options can be consulted here. Note that options not included in the table above are considered experimental, and they may change or be removed without notice.
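As a hedged sketch, options are set per rule under the options key. The example below turns off constant propagation, so the string pattern is expected to match only a literal "md5" argument and not a variable that merely holds that constant. The rule id and message are illustrative.

```yaml
rules:
  - id: md5-by-literal-name             # illustrative id
    languages: [python]
    severity: WARNING
    message: hashlib.new called with the literal algorithm name md5
    options:
      # Default is true; disabled here for this rule only.
      constant_propagation: false
    pattern: hashlib.new("md5", ...)
```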

fix

The fix top-level key allows for simple autofixing of a pattern by suggesting an autofix for each match. Run semgrep with --autofix to apply the changes to the files.

Example:

rules:
  - id: use-dict-get
    patterns:
      - pattern: $DICT[$KEY]
    fix: $DICT.get($KEY)
    message: "Use `.get()` method to avoid a KeyError"
    languages: [python]
    severity: ERROR

metadata

To note extra information on a rule, such as a related CVE or the name of the security engineer who wrote the rule, use the metadata: key.

Example:

rules:
  - id: eqeq-is-bad
    patterns:
      - [...]
    message: "useless comparison operation `$X == $X` or `$X != $X`"
    metadata:
      cve: CVE-2077-1234
      discovered-by: Ikwa L'equale

The metadata will also be shown in Semgrep’s output if you’re running it with --json.

paths

Excluding a rule in paths

To ignore a specific rule on specific files, set the paths: key with one or more filters.

Example:

rules:
  - id: eqeq-is-bad
    pattern: $X == $X
    paths:
      exclude:
        - "*.jinja2"
        - "*_test.go"
        - "project/tests"
        - project/static/*.js

When invoked with semgrep -f rule.yaml project/, the above rule will run on files inside project/, but no results will be returned for:

  • any file with a .jinja2 file extension
  • any file whose name ends in _test.go, such as project/backend/server_test.go
  • any file inside project/tests or its subdirectories
  • any file matching the project/static/*.js glob pattern
note

The glob syntax is from Python's wcmatch and is used to match against the given file and all its parent directories.

Limiting a rule to paths

Conversely, to run a rule only on specific files, set a paths: key with one or more of these filters:

rules:
  - id: eqeq-is-bad
    pattern: $X == $X
    paths:
      include:
        - "*_test.go"
        - "project/server"
        - "project/schemata"
        - "project/static/*.js"
        - "tests/**/*.js"

When invoked with semgrep -f rule.yaml project/, this rule will run on files inside project/, but results will be returned only for:

  • files whose name ends in _test.go, such as project/backend/server_test.go
  • files inside project/server, project/schemata, or their subdirectories
  • files matching the project/static/*.js glob pattern
  • all files with the .js extension, arbitrary depth inside the tests folder

If you are writing tests for your rules, you will need to add any test file or directory to the included paths as well.

note

When mixing inclusion and exclusion filters, the exclusion ones take precedence.

Example:

paths:
  include: "project/schemata"
  exclude: "*_internal.py"

The above rule returns results from project/schemata/scan.py but not from project/schemata/scan_internal.py.

Other examples

This section contains more complex rules that perform advanced code searching.

Complete useless comparison

rules:
  - id: eqeq-is-bad
    patterns:
      - pattern-not-inside: |
          def __eq__(...):
            ...
      - pattern-not-inside: assert(...)
      - pattern-not-inside: assertTrue(...)
      - pattern-not-inside: assertFalse(...)
      - pattern-either:
          - pattern: $X == $X
          - pattern: $X != $X
          - patterns:
              - pattern-inside: |
                  def __init__(...):
                    ...
              - pattern: self.$X == self.$X
      - pattern-not: 1 == 1
    message: "useless comparison operation `$X == $X` or `$X != $X`"

The above rule makes use of many operators. It uses pattern-either, patterns, pattern, and pattern-inside to carefully consider different cases, and uses pattern-not-inside and pattern-not to whitelist certain useless comparisons.

Full specification

The full configuration-file format is defined as a jsonschema object.

