Overview
The philosophy for Semgrep has always been to build a lightweight, fast tool optimized for enforcing good coding practices. Because of this, though r2c has continually made the engine smarter, Semgrep rules only run on a single file, and Semgrep taint rules (source to sink analysis) run within a single function. This allows Semgrep to run as fast as most linters and work even on incomplete code.
Sometimes, however, running on a single file is too limited for finding complex bugs. That is why we have created a proprietary extension to Semgrep called DeepSemgrep. It leverages global analysis to return better results using the exact same rules. Unlike many SAST tools, it does not need to build the code, so it also works on incomplete code.
DeepSemgrep trades analysis time for more accurate results:
Fewer false negatives, for instance by finding more matches to a pattern with inter-file constant propagation.
Reduced false positives, for instance taint tracking can determine whether tainted user input may reach an unsafe SQL statement through a long chain of function calls.
DeepSemgrep is available for Semgrep Team and Enterprise tiers.
We're thrilled to partner with early users and push Semgrep's analysis capabilities even further. If you'd like to join the private beta, request access here.
We have focused on extending three analyses to be inter-file and interprocedural:
Type inference with class inheritance analysis (typed metavariables are already interprocedural in Semgrep)
Constant propagation
Taint analysis
This post includes a quick start guide to DeepSemgrep and a comparison between Semgrep and DeepSemgrep.
Quick start guide to DeepSemgrep
DeepSemgrep performs a global analysis of all the files in a project, resolving names globally and extracting key data such as the type of each variable and method, or the known values of constant class fields. The global data is passed to the Semgrep engine, which then uses it to refine its findings.
To run deep analyses via the CLI, simply pass --deep to Semgrep:
$ semgrep --deep
In the Editor, you should see a "DeepSemgrep" toggle switch if you have DeepSemgrep enabled (currently in private beta for Team and Enterprise tier users). Here are some examples that illustrate the global analyses currently implemented by DeepSemgrep.
Constant propagation
When enforcing guardrails, Semgrep rules can be used to flag dangerous functions that may receive potentially non-constant data, which could be user-controlled, thus posing a security risk. These rules follow the template below, where we find all calls to some dangerous function, except those calls where dangerous only receives a constant string.
rules:
- id: dangerous-call
  patterns:
  - pattern: dangerous(...)
  - pattern-not: dangerous("...")
  message: Call of dangerous on non-constant value
  languages: [java]
  severity: WARNING
The Semgrep engine can perform constant folding within a single file. But, in the following Java example, there is a constant, EMPLOYEE_TABLE_NAME, that is defined in a Constants class in another file. The Semgrep engine cannot see the constant value of Constants.EMPLOYEE_TABLE_NAME by itself, as it only performs intra-file analyses, and the dangerous-call rule will incorrectly flag dangerous("Select * FROM " + EMPLOYEE_TABLE_NAME).
DeepSemgrep will not return a false positive in this case. DeepSemgrep looks into Constants.java and picks up the constant value of EMPLOYEE_TABLE_NAME. This constant value is passed to the Semgrep engine, which then knows that "Select * FROM " + EMPLOYEE_TABLE_NAME is a constant string.
Figure 1: Constant propagation in DeepSemgrep
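To make the scenario concrete, here is a minimal sketch of a two-file layout like the one described above. Only Constants and EMPLOYEE_TABLE_NAME come from the example itself; the file names in the comments, the EmployeeQueries class, the stub body of dangerous, and the "employees" value are assumptions made for illustration. Both classes are shown in one block for brevity, and the constant is referenced through its class name.

```java
// Constants.java (hypothetical file name)
final class Constants {
    // A compile-time constant defined away from where it is used.
    static final String EMPLOYEE_TABLE_NAME = "employees"; // value is an assumption

    private Constants() {}
}

// EmployeeQueries.java (hypothetical file name)
final class EmployeeQueries {
    // Stand-in for the function guarded by the dangerous-call rule.
    static String dangerous(String query) {
        return query;
    }

    static String listEmployees() {
        // Both operands are constants, but one of them lives in another file,
        // so an intra-file analysis cannot tell that the argument is constant.
        return dangerous("Select * FROM " + Constants.EMPLOYEE_TABLE_NAME);
    }
}
```

Under these assumptions, the argument always evaluates to the same string, which is the fact an inter-file constant propagation needs to establish before the pattern-not clause can suppress the finding.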
Typed metavariables
Following the disclosure of the Log4Shell vulnerability in Apache Log4j, the Semgrep community quickly came up with a rule for it, shown below. This rule uses typed metavariables to find objects of the Logger class, and flags any dangerous-looking call to any of its methods.
rules:
- id: log4j2_tainted_argument
  patterns:
  - pattern-either:
    - pattern: (Logger $LOGGER).$METHOD($ARG);
    - pattern: (Logger $LOGGER).$METHOD($ARG,...);
  - pattern-inside: |
      import org.apache.log4j.$PKG;
      ...
  - pattern-not: (Logger $LOGGER).$METHOD("...");
  message: log4j $LOGGER.$METHOD tainted argument
  languages: [java]
  severity: WARNING
Unfortunately, if a project defines a wrapper logger class MyLogger that extends org.apache.log4j.Logger as exemplified below, Semgrep will not report this! Semgrep is unaware of the inheritance relationship between classes, even if the information is contained within a single file.
Using the DeepSemgrep extension, however, Semgrep will flag logger.error(user_input), because it builds a class inheritance tree and it is aware that MyLogger extends org.apache.log4j.Logger.
Figure 2: Typed metavariables in DeepSemgrep
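The idea of consulting a class inheritance tree can be pictured with a toy model. The sketch below is not DeepSemgrep's implementation; ClassHierarchy, addClass, and isSubtypeOf are hypothetical names chosen for illustration, and the model only tracks a single direct superclass per class.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a class inheritance table (hypothetical, for illustration only).
final class ClassHierarchy {
    // Maps each known class name to the name of its direct superclass.
    private final Map<String, String> superclassOf = new HashMap<>();

    void addClass(String name, String superclass) {
        superclassOf.put(name, superclass);
    }

    // Walks up the superclass chain from `type`, looking for `target`.
    boolean isSubtypeOf(String type, String target) {
        String current = type;
        int steps = 0;
        // The step bound guards against accidental cycles in the table.
        while (current != null && steps <= superclassOf.size()) {
            if (current.equals(target)) {
                return true;
            }
            current = superclassOf.get(current);
            steps++;
        }
        return false;
    }
}
```

With MyLogger registered as extending org.apache.log4j.Logger, isSubtypeOf("MyLogger", "org.apache.log4j.Logger") returns true. That is the kind of fact that would let a typed metavariable such as (Logger $LOGGER) accept a receiver declared as MyLogger.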
Taint tracking
Last but not least, DeepSemgrep extends taint mode to perform inter-file and inter-procedural taint tracking.
Using a taint rule like the one below, we want to find data flowing from get_user_input() into vulnerable_function().
rules:
- id: unsafe-data-processing
  mode: taint
  pattern-sources:
  - pattern: get_user_input(...)
  pattern-sinks:
  - pattern: vulnerable_function(...)
  message: User input reaches vulnerable function
  languages: [java]
  severity: WARNING
Without DeepSemgrep, this rule only looks at each function or class method in isolation, so it is fairly limited in what it can find. With DeepSemgrep, get_user_input() and vulnerable_function() may be called in different packages and classes, but if there is a flow of data from the former to the latter, DeepSemgrep will find it!
Figure 3: Taint tracking in DeepSemgrep
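As a rough illustration of such a flow, the sketch below spreads the source and the sink across separate classes. Only get_user_input and vulnerable_function come from the rule; the class names, the file names in the comments, the helper methods readName, process, and handle, and the stub return values are all assumptions. The classes are shown in one block for brevity.

```java
// InputReader.java (hypothetical file name)
final class InputReader {
    // Stand-in for the taint source named in the rule.
    static String get_user_input() {
        return "alice"; // fixed value so the sketch is deterministic
    }

    static String readName() {
        return get_user_input(); // source call
    }
}

// Processor.java (hypothetical file name)
final class Processor {
    // Stand-in for the sink named in the rule.
    static String vulnerable_function(String data) {
        return "processed:" + data;
    }

    static String process(String data) {
        return vulnerable_function(data); // sink call
    }
}

// RequestHandler.java (hypothetical file name)
final class RequestHandler {
    static String handle() {
        // The value returned by the source is handed to the method that calls the sink.
        String name = InputReader.readName();
        return Processor.process(name);
    }
}
```

The source call and the sink call never appear in the same method here, so an analysis that looks at one function at a time has nothing to connect. Following the value through readName, handle, and process is the kind of inter-file, interprocedural flow the rule is meant to catch.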
Semgrep vs. DeepSemgrep
While Semgrep is a popular tool and very powerful in its own right, its intra-file (within-a-single-file) nature limits its usefulness on codebases that spread logic across many files. For example, in most object-oriented programming styles, classes are expected to live in different files, including classes that inherit from each other, which makes it difficult to write intra-file rules that cover all the cases. Though you can work around this limitation, many of our users and customers have asked us to expand the engine to handle it natively.
Although DeepSemgrep is proprietary, it uses the exact same rules and pattern syntax as the open-source Semgrep (and vice versa).
Feature summary
Feature | Semgrep | DeepSemgrep |
---|---|---|
All existing Semgrep features (join mode, within-file taint mode, etc.) | yes | yes |
Analyze across multiple files | no | yes |
→ Interfile constant propagation | no | yes |
→ Interfile type inference | no | yes |
→ Interfile taint tracking | no | yes |
License | LGPL 2.1 | proprietary |
Rule syntax & schema | no difference | no difference |
Languages supported | 24+ languages | Java and Ruby |
Conclusion
Our goal with DeepSemgrep is to create an engine that keeps rule-writing as simple as it is with Semgrep while understanding your entire program instead of a single file. We're thrilled to partner with early users and push Semgrep's analysis capabilities further. Check out the DeepSemgrep documentation for more examples. If you'd like to join the private beta, request access here.