Semgrep Privacy Policy

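Semgrep may collect aggregate metrics to help improve the product. This document describes the principles behind data collection, how to control when metrics are sent, and what data is and is not collected.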

Principles

These principles inform our decisions around data collection:

  1. Transparency: Collect and use data in a way that is clearly explained to the user and benefits them
  2. User control: Put users in control of their data at all times
  3. Limited data: Collect what is needed, pseudoanonymize where possible, and delete when no longer necessary

Automatic collection, opt-in, and opt-out

$ semgrep --config=myrule.yaml  # → no metrics (loading rules from local file)
$ semgrep --config=p/python     # → metrics enabled (fetching rules from the Registry)

Semgrep does not enable metrics when running with only local configuration files or command-line search patterns.

Semgrep does enable metrics if rules are loaded from the Semgrep Registry. This helps maintainers improve the correctness and performance of registry rules.

Metrics may also be configured to be sent on every run, or never sent.

To configure metrics, pass the --metrics option to Semgrep:

  • --metrics auto: (default) metrics are sent whenever rules are pulled from the Semgrep Registry
  • --metrics on: metrics are sent on every Semgrep run
  • --metrics off: metrics are never sent

Instead of the --metrics option, collection can also be controlled by setting the SEMGREP_SEND_METRICS environment variable to any of auto, on, or off.
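
The three modes above can be summarized as a small decision function. The following is a minimal sketch, not Semgrep's actual implementation: the function names resolve_metrics_mode and should_send_metrics are hypothetical, and the precedence of the command-line option over the environment variable is an assumption that this document does not state.

```python
import os

VALID_MODES = {"auto", "on", "off"}


def resolve_metrics_mode(cli_value=None, environ=None):
    """Pick the metrics mode.

    Assumption: the --metrics option wins over SEMGREP_SEND_METRICS,
    and "auto" is used when neither is set.
    """
    environ = os.environ if environ is None else environ
    mode = cli_value or environ.get("SEMGREP_SEND_METRICS") or "auto"
    if mode not in VALID_MODES:
        raise ValueError(f"metrics mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    return mode


def should_send_metrics(mode, uses_registry):
    """Decide whether a run sends metrics, following the three documented modes."""
    if mode == "on":
        return True           # metrics are sent on every run
    if mode == "off":
        return False          # metrics are never sent
    return uses_registry      # "auto": only when rules come from the Registry


if __name__ == "__main__":
    mode = resolve_metrics_mode()
    print(mode, should_send_metrics(mode, uses_registry=False))
```

Under this sketch, a run that uses only local rule files sends nothing in auto mode, while an integrator that passes on sends metrics regardless of where the rules come from.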

Note that certain Semgrep integrators turn on metrics for every run. For example, both the Semgrep CI agent and GitLab's Semgrep SAST analyzer use --metrics on by default.

Data NOT collected

We strive to balance our desire to collect data for improving Semgrep with our users' need for privacy and security. After all, we are a security tool! The following never leave your environment and are not sent or shared with anyone.

  • Source code
  • Filenames, file contents, or commit hashes
  • User-identifiable data about Semgrep’s findings in your code, including finding messages
  • Private rules

Data collected

Semgrep collects data to improve the user experience. Four types of data are collected:

Environmental

Environmental data provide contextual data about Semgrep’s runtime environment, as well as information that helps debug any issues users may be facing; e.g.

  • How long the command took to run
  • The version of Semgrep
  • Value of the CI environment variable, if set
  • Pseudoanonymized hash of the scanned project’s name
  • Pseudoanonymized hash of the rule definitions run
  • Pseudoanonymized hash of the config option

Performance

Performance data enable understanding of which rules and types of files are slow in the aggregate, so that r2c can improve the Semgrep program-analysis engine and query optimizer and debug slow rules; e.g.

  • Runtime duration
  • Total number of rules
  • Total number of files
  • Project size in bytes

Errors

High-level error and warning classes that Semgrep encounters when run; e.g.

  • Semgrep’s return code
  • The number of errors
  • Compile-time-constant error names, e.g., MaxFileSizeExceeded, SystemOutOfMemory, UnknownFileEncoding

Value

Semgrep reports data that indicate how useful a run is for the end user; e.g.

  • Number of raised findings
  • Number of ignored findings
  • Pseudoanonymized hashes of the rule definitions that yield findings

Pseudoanonymization

Certain identifying data (e.g. project URLs) are pseudoanonymized before being sent to the r2c backend.

"Pseudoanonymized" means the data are transformed using a deterministic cryptographically secure hash. When the input data are unknown, this hash is expensive to reverse. However, when input data are known, a reverse dictionary of identifiers to hashes can be built. Hence, data are anonymous only when the source values are unknown.

We use a deterministic hash to:

  • Track performance and value improvements over successive runs on projects
  • Remove test data from our metrics

Using a deterministic hash, however, implies:

  • An entity that independently knows the value of an input datum AND who has access to r2c's metrics data could access metrics for that known datum
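
The trade-off described above can be illustrated with a short sketch. It assumes SHA-256 with no salt, which is consistent with the 64-character hex examples in this document but is not confirmed by it; the function name pseudoanonymize and the second project URL are hypothetical.

```python
import hashlib


def pseudoanonymize(value: str) -> str:
    """Deterministic one-way hash of an identifier (SHA-256 is an assumption)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


project_url = "git@github.com:returntocorp/semgrep.git"

# Deterministic: the same input always yields the same hash, so successive
# runs on one project can be grouped without storing the URL itself.
first_run = pseudoanonymize(project_url)
second_run = pseudoanonymize(project_url)
print(first_run == second_run)

# Reverse dictionary: anyone who already knows candidate inputs can map
# hashes back to them. Unknown inputs stay opaque.
known_urls = [
    project_url,
    "git@github.com:example/other-project.git",  # hypothetical URL
]
reverse_lookup = {pseudoanonymize(url): url for url in known_urls}


def identify(hashed: str):
    """Return the original value if it is in the known set, else None."""
    return reverse_lookup.get(hashed)


print(identify(first_run))
print(identify(pseudoanonymize("git@example.com:private/unknown.git")))
```

The last lookup returns None because the private URL is not in the known set, which is the sense in which the data are anonymous only when the source values are unknown.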

r2c will:

  • Treat collected metrics data as secret, using application-security best practices, including (but not limited to)
    • Encryption in transit and at rest
    • Strict access control to data-storage systems
    • Application-security-policy requirements for third parties (e.g. cloud-service providers; see "data sharing" below)
  • Only correlate hashed data to input data when these inputs are already known to r2c (e.g. publicly available project URLs for open-source projects, or projects that log in to the Semgrep Registry)

Description of fields

Category | Field | Description | Use Case | Example Datum | Type
--- | --- | --- | --- | --- | ---
Environment | Timestamp | Time when the event fired | Understanding tool usage over time | 2021-05-10T21:05:06+00:00 | String
Environment | Version | Semgrep version being used | Reproduce and debug issues with specific versions | 0.51.0 | String
Environment | Project URL | Project URL (sent only if config=auto) | Fetch pre-configured rules for the org or project by name | git@github.com:returntocorp/semgrep.git | String
Environment | Project hash | One-way hash of the project URL | Understand performance and accuracy improvements | c65437265631ab2566802d4d90797b27fbe0f608dceeb9451b979d1671c4bc1a | String
Environment | Rules hash | One-way hash of the rule definitions | Understand performance improvements | b03e452f389e5a86e56426c735afef13686b3e396499fc3c42561f36f6281c43 | String
Environment | Config hash | One-way hash of the config argument | Understand performance and accuracy improvements | ede96c41b57de3e857090fb3c486e69ad8efae3267bac4ac5fbc19dde7161094 | String
Environment | CI | Notes if Semgrep is running in CI and the name of the provider | Reproduce and debug issues with specific CI providers | GitLabCI v0.13.12 | String
Performance | Duration | How long the command took to run | Understanding aggregate performance improvements and regressions | 14.13 | Number
Performance | Total Rules | Count of rules | Understand how duration is affected by #rules | 137 | Number
Performance | Total Files | Count of files | Understand how duration is affected by #files | 4378 | Number
Performance | Total Bytes | Summation of target file size | Understand how duration is related to total size of all target files | 40838239 | Number
Performance | Rule Stats | Performance statistics (w/ rule hashes) for slowest rules | Debug rule performance | [{"ruleHash": "7c43c962dfdbc52882f80021e4d0ef2396e6a950867e81e5f61e68390ee9e166","parseTime": 0,"matchTime": 0.05480456352233887,"runTime": 0.20836973190307617,"bytesScanned": 0}] | StatsClass[]
Performance | File Stats | Performance statistics for slowest files | Debug rule performance | [{"size": 6725,"numTimesScanned": 147,"parseTime": 0.013289928436279297,"matchTime": 0.05480456352233887,"runTime": 0.20836973190307617}] | StatsClass[]
Errors | Exit Code | Numeric exit code | Debug commonly occurring issues and aggregate error counts | 1 | Number
Errors | Number of Errors | Count of errors | Understanding avg #errors | 2 | Number
Errors | Number of Warnings | Count of warnings | Understanding avg #warnings | 1 | Number
Errors | Errors | Array of Error Classes (compile-time-constant) | Understand most common errors users encounter | ["UnknownLanguage", "MaxFileSizeExceeded"] | ErrorClass[]
Errors | Warnings | Array of Warning Classes (compile-time-constant) | Understand most common warnings users encounter | ["TimeoutExceeded"] | WarningClass[]
Value | Rule hashes with findings | Map of rule hashes to number of findings | Understand which rules are providing value to the user; diagnose high false-positive rates | {"7c43c962dfdbc52882f80021e4d0ef2396e6a950867e81e5f61e68390ee9e166": 4} | Object
Value | Total Findings | Count of all findings | Understand if rules are super noisy for the user | 7 | Number
Value | Total Nosems | Count of all nosem annotations that tell Semgrep to ignore a finding | Understand if rules are super noisy for the user | 3 | Number

Sample metrics

This is a sample blob of the aggregate metrics described above:

{
  "environment": {
    "version": "0.51.0",
    "ci": "true",
    "configNamesHash": "ede96c41b57de3e857090fb3c486e69ad8efae3267bac4ac5fbc19dde7161094",
    "projectHash": "c65437265631ab2566802d4d90797b27fbe0f608dceeb9451b979d1671c4bc1a",
    "rulesHash": "b03e452f389e5a86e56426c735afef13686b3e396499fc3c42561f36f6281c43"
  },
  "performance": {
    "runTime": 37.1234233823,
    "numRules": 2,
    "numTargets": 573,
    "totalBytesScanned": 33938923,
    "ruleStats": [{
      "ruleHash": "7c43c962dfdbc52882f80021e4d0ef2396e6a950867e81e5f61e68390ee9e166",
      "parseTime": 0,
      "matchTime": 0.05480456352233887,
      "runTime": 0.20836973190307617,
      "bytesScanned": 0
    }],
    "fileStats": [{
      "size": 6725,
      "numTimesScanned": 147,
      "parseTime": 0.013289928436279297,
      "matchTime": 0.05480456352233887,
      "runTime": 0.20836973190307617
    }]
  },
  "errors": {
    "returnCode": 1,
    "errors": ["UnknownLanguage"],
    "warnings": ["MaxFileSizeExceeded", "TimeoutExceeded"]
  },
  "value": {
    "ruleHashesWithFindings": {"7c43c962dfdbc52882f80021e4d0ef2396e6a950867e81e5f61e68390ee9e166": 4},
    "numFindings": 7,
    "numIgnored": 3
  }
}
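
To make the "value" section of the blob more concrete, here is a minimal sketch of how such a summary could be assembled without including rule text, file names, or finding messages. It is not Semgrep's implementation: the input shape (a list of dicts with rule_definition and ignored keys), the helper names, the use of SHA-256, and the choice to exclude ignored findings from the per-rule counts are all assumptions made for illustration.

```python
import hashlib
import json
from collections import Counter


def rule_hash(rule_definition: str) -> str:
    """One-way hash of a rule definition (SHA-256 is an assumption)."""
    return hashlib.sha256(rule_definition.encode("utf-8")).hexdigest()


def summarize_value(findings):
    """Build a "value" section from a list of findings.

    Each finding is a dict with hypothetical keys:
      "rule_definition": str, the text of the rule that matched
      "ignored": bool, True when a nosem annotation suppressed it
    Only hashes and counts are kept in the result.
    """
    counts = Counter()
    num_ignored = 0
    for finding in findings:
        if finding["ignored"]:
            num_ignored += 1
        else:
            counts[rule_hash(finding["rule_definition"])] += 1
    return {
        "ruleHashesWithFindings": dict(counts),
        "numFindings": sum(counts.values()),
        "numIgnored": num_ignored,
    }


if __name__ == "__main__":
    sample_findings = [
        {"rule_definition": "rules: [{id: rule-a}]", "ignored": False},
        {"rule_definition": "rules: [{id: rule-a}]", "ignored": False},
        {"rule_definition": "rules: [{id: rule-b}]", "ignored": True},
    ]
    print(json.dumps(summarize_value(sample_findings), indent=2))
```

The returned keys mirror the field names in the sample blob; everything else about the input is invented for the example.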

Registry fetches

Certain Registry resources require logging in to the Semgrep Registry. You can log in using your project URL or a Semgrep.dev API token. When you use these resources, the Semgrep Registry servers record your project's identity.

Data sharing

We use some third-party companies and services to help administer and provide Semgrep, for example, for hosting, customer support, product usage analytics, and database management. These third parties are permitted to handle data only to perform these tasks in a manner consistent with this document and are obligated not to disclose or use it for any other purpose.

We do not share the information provided to us with other organizations, or sell it to them, without explicit consent, except as described in this document.