Remove duplicate findings
Semgrep scans both mainline (trunk) and non-mainline branches. The scope of a scan depends on whether Semgrep is called on a mainline or a non-mainline branch.
- Full scan: Scans the repository in its entirety. It is recommended to perform full scans on mainline branches, such as master or main. Full scans are typically performed on a scheduled basis or on merge to a default branch.
- Diff-aware scan: Diff-aware scans are performed on non-mainline branches, such as in pull requests and merge requests. Diff-aware scans traverse the repository's files based on the commit where the branch diverged from the mainline branch.
How Semgrep distinguishes between new and duplicate findings
Semgrep generates a finding whenever it scans a repository and one of its rules matches a piece of code. Since Semgrep usually scans a repository multiple times, it needs a way to track the same finding in a file over time. Semgrep does this using two types of fingerprints: match_based_id and syntactic_id.
The calculations used to determine whether findings are new are subject to change at any time as Semgrep improves its deduplication logic.
match_based_id
Using the match_based_id, Semgrep can determine if a given finding in a file is the same as a finding identified during a different scan, even if the code snippet that the rule matched has been moved to a different location in the file. This allows Semgrep to avoid generating a new finding and to deduplicate its records accordingly, even across multiple branches associated with the project. It also means that Semgrep can cross-correlate findings, so a finding that has been triaged in one branch will be flagged as triaged if it's identified in another branch.
Semgrep generates the match_based_id for a finding using the following information:
- The file path
- The name of the rule that generated the finding
- The rule pattern with the metavariables' values substituted in
This information is combined and then hashed. At this point, Semgrep appends the index, a value generated by determining the number of times the rule involved matched code in the file. Note that the index is appended to the hash, not combined with the other finding information before hashing. This is done to preserve information on how findings are related. For example, finding0 with match_based_id = 123_0 and finding1 with match_based_id = 123_1 indicate that both were generated from the same rule matching the same code pattern in the same file.
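The hash-then-append step can be sketched in a few lines of Python. This is an illustration only, not Semgrep's implementation: the function name, the use of SHA-256, the separator, and the sample file path and rule name are all assumptions made for the example.

```python
import hashlib


def match_based_id(file_path: str, rule_name: str, interpolated_pattern: str, index: int) -> str:
    """Hypothetical sketch of a match_based_id-style fingerprint."""
    # Combine the three inputs and hash them. SHA-256 and the NUL separator
    # are assumptions; the real hash function is not described here.
    payload = "\0".join([file_path, rule_name, interpolated_pattern])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # The index is appended AFTER hashing, so related findings share a prefix.
    return f"{digest}_{index}"


# Two matches of the same rule and pattern in the same (made-up) file.
first = match_based_id("app/views.py", "example-rule", 'sink("foo")', 0)
second = match_based_id("app/views.py", "example-rule", 'sink("foo")', 1)

# Same hash portion, different index suffix.
print(first.rsplit("_", 1)[0] == second.rsplit("_", 1)[0])
```

Because the index sits outside the hash, the shared prefix shows that the two findings came from the same rule, pattern, and file, which would not be visible if the index were hashed along with the other inputs.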
Figure. Semgrep AppSec Platform groups together the same finding identified as present on multiple branches.
To generate the match_based_id, Semgrep uses the rule pattern with the metavariables' values substituted in. These values originate from the code itself. Therefore, a code change that still produces the same substituted rule pattern doesn't prevent Semgrep from recognizing that the finding is the same as an existing finding rather than a new one.
For example, if the original file scanned is:
a = 1
b = 2
spcd.get("foo")
c = 3
d = 4
sink("foo")
The rule pattern identified and used in generating the match_based_id is:
spcd.get($X)
...
sink($X)
Which, with metavariables substituted in, becomes:
spcd.get("foo")
...
sink("foo")
If the following change is made to the original file:
a = 1
b = 2
spcd.get("foo")
c = 3
c_1 = 5
d = 4
sink("foo")
The rule pattern identified and used in generating the match_based_id doesn't change:
spcd.get("foo")
...
sink("foo")
This means that the match_based_id itself doesn't change, allowing Semgrep to identify that the two findings are the same and to deduplicate them. Furthermore, this process enables Semgrep to ignore lines that do not impact code function.
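The example above can be reproduced with a toy script. The matcher below is a deliberately simplified stand-in that only understands this one pattern; it is not how Semgrep matches code, and the helper names are hypothetical. It shows that both versions of the file yield the same substituted pattern.

```python
import re
from typing import Dict, Optional

# The rule pattern from the example, with "..." as the ellipsis operator.
PATTERN = "spcd.get($X)\n...\nsink($X)"


def find_binding(source: str) -> Optional[Dict[str, str]]:
    """Toy matcher: find spcd.get(ARG) followed later by sink(ARG)."""
    lines = [line.strip() for line in source.splitlines()]
    for i, line in enumerate(lines):
        call = re.fullmatch(r"spcd\.get\((.+)\)", line)
        if call is None:
            continue
        arg = call.group(1)
        # Any number of unrelated lines may sit between the two calls.
        if f"sink({arg})" in lines[i + 1:]:
            return {"$X": arg}
    return None


def interpolate(pattern: str, bindings: Dict[str, str]) -> str:
    """Substitute each metavariable's value into the rule pattern."""
    for name, value in bindings.items():
        pattern = pattern.replace(name, value)
    return pattern


original = 'a = 1\nb = 2\nspcd.get("foo")\nc = 3\nd = 4\nsink("foo")\n'
edited = 'a = 1\nb = 2\nspcd.get("foo")\nc = 3\nc_1 = 5\nd = 4\nsink("foo")\n'

before = interpolate(PATTERN, find_binding(original))
after = interpolate(PATTERN, find_binding(edited))
print(before == after)
```

Since the substituted pattern is one of the inputs to the fingerprint, and the file path and rule name are also unchanged, the resulting fingerprint stays the same after the edit.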
syntactic_id
Semgrep generates the syntactic_id for a finding using the following information:
- The file path
- The name of the rule that generated the finding
- The code syntax, or the literal piece of code that matched the rule
- The index, a value generated by determining the number of times the rule involved matched code in the file
This information is combined and then hashed for privacy before being stored.
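For contrast with match_based_id, here is a minimal sketch of a fingerprint built from the four inputs listed above. The function name, SHA-256, the separator, and the sample values are assumptions for illustration; the real calculation may differ.

```python
import hashlib


def syntactic_id(file_path: str, rule_name: str, matched_code: str, index: int) -> str:
    """Hypothetical sketch of a syntactic_id-style fingerprint."""
    # All four inputs, including the index, go into the hash, so only the
    # digest is stored and the matched code itself is not recoverable from it.
    payload = "\0".join([file_path, rule_name, matched_code, str(index)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Made-up file path and rule name; the matched code is the literal text.
snippet = 'spcd.get("foo")\nc = 3\nd = 4\nsink("foo")'
id_a = syntactic_id("app/views.py", "example-rule", snippet, 0)
id_b = syntactic_id("app/views.py", "example-rule", snippet, 1)

# With the index hashed in, the two IDs have no shared, readable prefix.
print(id_a != id_b)
```

In this sketch, any change to the literal matched text produces a different ID, which is the main difference from a fingerprint based on the substituted rule pattern.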
The syntactic_id is primarily used by Semgrep for internal debugging purposes, since no code is stored except in cases where you have provided code access permissions to Semgrep.
Update findings by rescanning the project
Semgrep's correlation of findings across branches based on their unique fingerprint allows for automatic consolidation of findings and makes it simpler to triage findings.
If a finding is marked as fixed in one branch (such as main) but is still open in another (such as production), possibly because there hasn't been a follow-up scan on that branch, and the code fixes are present in both branches, initiate scans through your CI job or SCM tool on the branches with open findings. Semgrep will reconcile the findings and mark them as fixed.
Remove duplicate findings using Semgrep API
Semgrep API does not automatically group findings with the same match-based ID across branches. If you use Semgrep API to receive or pull findings data, set the dedup flag to true to deduplicate findings across refs or branches. Refer to List all findings in the Semgrep API docs for more information.
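As a rough sketch, a request with the dedup flag set might be built as follows. The base URL, endpoint path, bearer-token header, deployment slug, and token value are assumptions for illustration; confirm them against the Semgrep API docs before use. Only the dedup flag and its true value come from the text above.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed base URL and path; verify both in the Semgrep API docs.
API_BASE = "https://semgrep.dev/api/v1"


def build_findings_request(deployment_slug: str, token: str, dedup: bool = True) -> Request:
    """Build (but do not send) a findings request with the dedup flag."""
    query = urlencode({"dedup": "true" if dedup else "false"})
    url = f"{API_BASE}/deployments/{deployment_slug}/findings?{query}"
    # Bearer-token authentication is an assumption in this sketch.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder slug and token; replace with your own values.
request = build_findings_request("example-org", "YOUR_API_TOKEN")
print(request.full_url)
```

The request object is only constructed here. Sending it, for example with urllib.request.urlopen, requires a valid token and network access.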