Remove duplicate findings

Semgrep scans both mainline (trunk) and non-mainline branches. The scope of the scan depends on whether Semgrep is called on a mainline or a non-mainline branch.

Full scan: Scans the repository in its entirety. It is recommended to perform full scans on mainline branches, such as master or main. Full scans are typically performed on a scheduled basis or on merge to a default branch.
Diff-aware scan: Diff-aware scans are performed on non-mainline branches, such as in pull requests and merge requests. Diff-aware scans traverse the repository's files based on the commit where the branch diverged from the mainline branch.

How Semgrep distinguishes between new and duplicate findings

Semgrep generates a finding whenever it scans a repository and one of its rules matches a piece of code. Since Semgrep usually scans a repository multiple times, it needs a way to track the same finding in a file over time. Semgrep does this using two types of fingerprints: match_based_id and syntactic_id.

info

The calculations used to determine whether findings are new are subject to change at any time as Semgrep improves its deduplication logic.

`match_based_id`

Using the match_based_id, Semgrep can determine whether a given finding in a file is the same as a finding identified during a different scan, even if the code snippet that the rule matched has been moved to a different location in the file. This allows Semgrep to avoid generating a new finding and to deduplicate its records accordingly, even across multiple branches associated with the project. It also means that Semgrep can cross-correlate findings, so a finding that has been triaged in one branch is flagged as triaged if it's identified in another branch.
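The cross-branch behavior can be pictured with a small sketch. This is not Semgrep's implementation: the branch names, the ID values, and the triage state names ("ignored", "open") are assumptions chosen for illustration. It only shows why keying records by fingerprint, rather than by branch, gives deduplication and shared triage state.

```python
from collections import defaultdict

# (branch, match_based_id) pairs reported by separate scans.
# All values here are made up for the example.
scan_results = [
    ("main", "123_0"),
    ("feature/login", "123_0"),
    ("feature/login", "456_0"),
]

# Keep one record per fingerprint, listing every branch where it was seen.
branches_by_id = defaultdict(set)
for branch, match_based_id in scan_results:
    branches_by_id[match_based_id].add(branch)

# Triage state is stored per fingerprint, not per branch (an assumption of
# this sketch), so triaging once applies wherever the finding appears.
triage = {"123_0": "ignored"}

def triage_state(match_based_id):
    """Return the triage state for a fingerprint, defaulting to open."""
    return triage.get(match_based_id, "open")

for match_based_id, branches in sorted(branches_by_id.items()):
    print(match_based_id, sorted(branches), triage_state(match_based_id))
```

Three scan results collapse into two records, and the finding seen on both branches carries a single triage state.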

Semgrep generates the match_based_id for a finding using the following information:

The file path
The name of the rule that generated the finding
The rule pattern with the metavariables' values substituted in

This information is combined and then hashed. At this point, Semgrep appends the index, a value generated by determining the number of times the rule involved matched code in the file. Note that the index is appended to the hash, not combined with the other finding information before hashing. This is done to preserve information on how findings are related. For example, finding0 with match_based_id = 123_0 and finding1 with match_based_id = 123_1 indicate that both were generated from the same rule matching the same code pattern in the same file.
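The following is a minimal sketch of that scheme, not Semgrep's actual code. The hash function (SHA-256), the separator used to combine the inputs, the file path, and the rule name are all assumptions. The sketch also assumes that the index counts earlier matches that produced the same hash.

```python
import hashlib
from collections import defaultdict

def match_based_ids(findings):
    """Assign an illustrative match_based_id to each finding.

    Each finding is a (path, rule_name, substituted_pattern) tuple, listed in
    the order in which the matches appear in the file.
    """
    seen = defaultdict(int)  # number of earlier matches per hash
    ids = []
    for path, rule_name, substituted_pattern in findings:
        # Combine the three inputs, then hash them. The separator and the
        # choice of SHA-256 are assumptions made for this sketch.
        combined = "\x00".join((path, rule_name, substituted_pattern))
        digest = hashlib.sha256(combined.encode("utf-8")).hexdigest()
        # The index is appended after hashing, so related findings keep
        # the same hash prefix and differ only in the suffix.
        index = seen[digest]
        seen[digest] += 1
        ids.append(f"{digest}_{index}")
    return ids

# Two matches of the same (hypothetical) rule on the same code in one file.
findings = [
    ("app/views.py", "hypothetical-rule", 'spcd.get("foo")\n...\nsink("foo")'),
    ("app/views.py", "hypothetical-rule", 'spcd.get("foo")\n...\nsink("foo")'),
]
first, second = match_based_ids(findings)
print(first[-2:], second[-2:])
```

Because the index stays outside the hash, the two IDs share everything before the underscore, which is what lets related findings be recognized as such.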

Figure. Semgrep AppSec Platform groups together the same finding identified as present on multiple branches.

To generate the match_based_id, Semgrep uses the rule pattern with the metavariables' values substituted in. These values originate from the code itself. Therefore, a code change that still produces the same substituted rule pattern doesn't hinder Semgrep's ability to recognize that the finding is a duplicate of an existing finding.

For example, if the original file scanned is:

a = 1
b = 2
spcd.get("foo")
c = 3
d = 4
sink("foo")

The rule pattern identified and used in generating the match_based_id is:

spcd.get($X)
...
sink($X)

Which, with metavariables substituted in, becomes:

spcd.get("foo")
...
sink("foo")

If the following change is made to the original file:

a = 1
b = 2
spcd.get("foo")
c = 3
c_1 = 5
d = 4
sink("foo")

The rule pattern, with metavariables substituted in, that is used in generating the match_based_id doesn't change:

spcd.get("foo")
...
sink("foo")

This means that the match_based_id itself doesn't change, allowing Semgrep to identify that the two findings are the same and to deduplicate them. Furthermore, this process enables Semgrep to ignore changes, such as the added line c_1 = 5, that fall outside the code matched by the rule pattern.
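The example above can be reproduced with a short sketch. The regular expression here is only a stand-in for Semgrep's matcher, and the hash function, file path, and rule name are assumptions. The point is that both versions of the file bind $X to the same value, so the substituted pattern, and therefore the hash, is identical.

```python
import hashlib
import re

PATTERN = "spcd.get($X)\n...\nsink($X)"

ORIGINAL = 'a = 1\nb = 2\nspcd.get("foo")\nc = 3\nd = 4\nsink("foo")\n'
CHANGED = 'a = 1\nb = 2\nspcd.get("foo")\nc = 3\nc_1 = 5\nd = 4\nsink("foo")\n'

def bind_metavariables(source):
    """Toy stand-in for Semgrep's matcher: find the value bound to $X."""
    m = re.search(r"spcd\.get\((?P<x>[^)]*)\).*?sink\((?P=x)\)", source, re.DOTALL)
    return {"$X": m.group("x")} if m else None

def substitute(pattern, bindings):
    """Replace each metavariable such as $X with the code it matched."""
    return re.sub(r"\$[A-Z_][A-Z_0-9]*", lambda m: bindings[m.group(0)], pattern)

def fingerprint(path, rule_name, source):
    """Hash the path, rule name, and substituted pattern (SHA-256 assumed)."""
    substituted = substitute(PATTERN, bind_metavariables(source))
    combined = "\x00".join((path, rule_name, substituted))
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

before = fingerprint("example.py", "hypothetical-rule", ORIGINAL)
after = fingerprint("example.py", "hypothetical-rule", CHANGED)
print(before == after)
```

The added line falls under the ellipsis and never enters the hashed text. Changing the argument from "foo" to another value would change the substituted pattern and produce a different hash.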

`syntactic_id`

Semgrep generates the syntactic_id for a finding using the following information:

The file path
The name of the rule that generated the finding
The code syntax, or the literal piece of code that matched the rule
The index, a value generated by determining the number of times the rule involved matched code in the file

This information is combined and then hashed for privacy before being stored.
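As a rough sketch under the same caveats (SHA-256, the separator, the file path, and the rule name are assumptions, not Semgrep's actual choices), the difference from match_based_id is that the index is part of the hashed input rather than a suffix:

```python
import hashlib

def syntactic_id(path, rule_name, matched_code, index):
    """Illustrative syntactic_id: all four inputs are hashed together."""
    # The index goes into the hash input, so two matches that differ only
    # in index produce unrelated-looking IDs.
    combined = "\x00".join((path, rule_name, matched_code, str(index)))
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

code = 'spcd.get("foo")\nc = 3\nd = 4\nsink("foo")'
id_0 = syntactic_id("example.py", "hypothetical-rule", code, 0)
id_1 = syntactic_id("example.py", "hypothetical-rule", code, 1)

# Any edit to the literal matched code also changes the ID.
edited = code.replace("d = 4", "c_1 = 5\nd = 4")
id_edited = syntactic_id("example.py", "hypothetical-rule", edited, 0)
print(id_0 != id_1, id_0 != id_edited)
```

Only the hash is kept in this model, which is consistent with the privacy goal described above: the literal code cannot be read back from the stored value.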

info

The syntactic_id is primarily used by Semgrep for internal debugging purposes, since no code is stored except in cases where you have provided code access permissions to Semgrep.

Update findings by rescanning the project

Semgrep's correlation of findings across branches based on their unique fingerprint allows for automatic consolidation of findings and makes it simpler to triage findings.

A finding can appear as fixed in one branch (such as main) but still open in another (such as production), even though the code fix is present in both branches. This is possibly because there hasn't been a follow-up scan on the branch where the finding is still open. In that case, initiate scans through your CI job or SCM tool on the branches with open findings. Semgrep then reconciles the findings and marks them as fixed.
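One plausible way to picture this reconciliation is sketched below. It is a simplified model, not Semgrep's actual logic: the storage layout, the status names, and the rule "an open finding that a fresh scan no longer reports is marked fixed" are assumptions made for illustration.

```python
def reconcile(stored, branch, scanned_ids):
    """Return updated statuses for `branch` after a fresh scan.

    stored maps (branch, match_based_id) to "open" or "fixed".
    scanned_ids is the set of fingerprints the new scan reported.
    """
    updated = dict(stored)
    for (stored_branch, finding_id), status in stored.items():
        # An open finding that the new scan no longer reports is treated
        # as fixed on that branch.
        if stored_branch == branch and status == "open" and finding_id not in scanned_ids:
            updated[(stored_branch, finding_id)] = "fixed"
    for finding_id in scanned_ids:
        # Anything the new scan reports is open on that branch.
        updated[(branch, finding_id)] = "open"
    return updated

# The fix is already in both branches, but production has not been rescanned.
stored = {("main", "123_0"): "fixed", ("production", "123_0"): "open"}

# Rescanning production no longer reports 123_0.
after_rescan = reconcile(stored, "production", scanned_ids=set())
print(after_rescan[("production", "123_0")])
```

Until the rescan runs, the stale open status on production is the only record available, which is why the status differs between branches even though the code does not.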