Data-flow analysis engine overview
Semgrep provides an intra-procedural data-flow analysis engine that enables several Semgrep capabilities. The engine provides the following data-flow analyses:
- Constant propagation allows Semgrep to, for example, match `return 42` against `return x` when `x` can be reduced to `42` by constant folding. There is also a specific experimental feature of constant propagation, called symbolic propagation.
- Taint tracking (also known as taint analysis) enables you to write simple rules that catch complex injection bugs, such as those that can result in cross-site scripting (XSS).
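As an illustration of the two analyses listed above, here is a minimal Python sketch of the kind of target code they reason about. The function names `get_user_input` (standing in for a taint source) and `render_html` (standing in for a sink) are hypothetical and are not part of Semgrep or of any library. Whether a particular rule reports a finding depends on how that rule is written.

```python
# Constant propagation: a pattern such as `return 42` can match `return x`
# when the analysis can reduce x to the constant 42.
def direct_constant():
    x = 42
    return x


def folded_constant():
    base = 40
    x = base + 2  # reducing this expression to 42 requires constant folding
    return x


# Taint tracking: data flows from a source to a sink through intermediate
# variables. Both helper functions below are hypothetical stand-ins.
def get_user_input():
    # Hypothetical source: pretend this string came from an HTTP request.
    return "<script>alert(1)</script>"


def render_html(fragment):
    # Hypothetical sink: writes the fragment into HTML without escaping it.
    return "<div>" + fragment + "</div>"


def handler():
    name = get_user_input()      # tainted value enters here
    greeting = "Hello, " + name  # taint propagates through concatenation
    return render_html(greeting) # tainted value reaches the sink
```

All of the flows in this sketch stay inside a single function body, which is the scope an intra-procedural analysis works with.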
In principle, all data-flow features are available for any of Semgrep's supported languages. Interfile (cross-file) analysis also supports data-flow analysis. For more details, see the Perform cross-file analysis documentation.
Ensure that you understand the design trade-offs and limitations of the data-flow engine. For further details, see also the data-flow status.
Semgrep provides no user-friendly way of specifying a new data-flow analysis. Please let us know if you have suggestions. If you can code in OCaml, your contribution is welcome. See the Contributing documentation for more details.
Design trade-offs
Semgrep strives for simplicity and delivers a lightweight and fast static analysis. Besides being intra-procedural, the engine makes the following trade-offs:
- No path sensitivity: All potential execution paths are considered, even though some may not be feasible.
- No pointer or shape analysis: Aliasing that happens in non-trivial ways, such as through arrays or pointers, may not be detected. Individual elements in arrays or other data structures are not tracked. The data-flow engine supports limited field sensitivity for taint tracking, but not yet for constant propagation.
- No soundness guarantees: Semgrep ignores the effects of `eval`-like functions on the program state. It doesn’t make worst-case sound assumptions, but rather "reasonable" ones.
Expect both false positives and false negatives. You can remove false positives in different ways, for example, by using pattern-not and pattern-not-inside. We want to provide you with a way of eliminating false positives, so create an issue if you run into any problems. We are happy to trade false negatives for simplicity and fewer false positives, but you are welcome to open a feature request if Semgrep misses some difficult bug you want to catch.
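To make the first two trade-offs concrete, here is a small Python sketch of code where they can matter. It assumes that the parameter `data` stands for attacker-controlled input. The function names are hypothetical, and what a given rule actually reports depends on the rule and on the engine version.

```python
def infeasible_branch(data):
    debug = False
    value = "safe"
    if debug:         # never true at run time
        value = data  # a path-insensitive analysis still considers this path
    return value      # so `value` may be treated as possibly holding `data`


def first_element(data):
    items = ["safe", data]
    # Element 0 never holds `data`, but individual elements are not tracked,
    # so an analysis cannot tell items[0] apart from items[1].
    return items[0]
```

At run time both functions always return the string "safe", so a finding on either return value would be a false positive.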
Data-flow status
In principle, the data-flow analysis engine (which provides taint tracking, constant propagation, and symbolic propagation) can run on any language supported by Semgrep. However, the level of support is lower than for the regular Semgrep matching engine.
When Semgrep analyzes code, it creates an abstract syntax tree (AST), which is then translated into an analysis-friendly intermediate language (IL). Semgrep then runs a mostly language-agnostic analysis on the IL. However, this translation is not complete.
Some language features may not be analyzed correctly by the data-flow engine. Semgrep does not fail when it finds an unsupported construct. Instead, the analysis continues and the construct is ignored. This can result in Semgrep not matching some code that should be matched (false negatives) or matching code that should not be matched (false positives).
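The effect of ignoring an unsupported construct can be sketched with a toy pipeline in Python. This is not Semgrep's IL or its implementation. The tuple-based IL, the function names `lower_to_il` and `propagate_constants`, and the choice of which statements count as supported are all assumptions made for illustration.

```python
import ast


def lower_to_il(source):
    """Lower `name = constant` and `name = name` into a toy IL.

    Every other statement is skipped rather than treated as an error,
    and its line number is recorded.
    """
    il, skipped = [], []
    for stmt in ast.parse(source).body:
        simple_target = (
            isinstance(stmt, ast.Assign)
            and len(stmt.targets) == 1
            and isinstance(stmt.targets[0], ast.Name)
        )
        if simple_target and isinstance(stmt.value, ast.Constant):
            il.append(("const", stmt.targets[0].id, stmt.value.value))
        elif simple_target and isinstance(stmt.value, ast.Name):
            il.append(("copy", stmt.targets[0].id, stmt.value.id))
        else:
            skipped.append(stmt.lineno)  # unsupported: ignore and continue
    return il, skipped


def propagate_constants(il):
    """Straight-line constant propagation over the toy IL."""
    env = {}
    for op, target, value in il:
        if op == "const":
            env[target] = value
        elif value in env:        # copy from a variable with a known constant
            env[target] = env[value]
        else:                     # copy from an unknown variable
            env.pop(target, None)
    return env


SAMPLE = "x = 42\ny = x\nx, w = 0, 1\nz = x\n"
il, skipped = lower_to_il(SAMPLE)
constants = propagate_constants(il)
```

In this sketch the tuple assignment on line 3 of `SAMPLE` is not lowered, so the toy analysis concludes that `z` is 42 even though the program would assign 0 to it. A rule relying on that constant would then match code it should not match, or miss code it should match.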
Please help us improve by reporting any issues you encounter on the Semgrep GitHub page.