Semgrep Assistant metrics and methodology

Metrics for evaluating Semgrep Assistant's performance are derived from two sources:

User feedback on Assistant recommendations within the product
Internal triage and benchmarking conducted by Semgrep's security research team

This methodology evaluates Assistant from both user and expert perspectives, giving Semgrep's product and engineering teams a holistic view of Assistant's real-world performance.¹

User feedback

User feedback shows the aggregated and anonymized performance of Assistant across more than 1000 customers, providing a comprehensive real-world dataset.

Users are prompted in-line to "thumbs up" or "thumbs down" Assistant suggestions as they receive them in their PR or MR. This reduces sampling bias, because both developers and AppSec engineers can provide feedback.

Results as of Aug 21, 2025:

Customers in dataset	3500+
Findings analyzed	6,500,000+
Average reduction in findings²	60%
Human-agree rate	96%
Median time to resolution	22% faster than baseline
Average time saved per finding	30 minutes
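
As a rough illustration of how two of these figures could be aggregated, the sketch below computes a human-agree rate and an average reduction in findings from per-customer counts. It is not Semgrep's actual pipeline: the `CustomerStats` type, its field names, and the sample numbers are hypothetical, and it assumes that the human-agree rate is the share of feedback votes that are thumbs up and that the reduction is averaged per customer.

```python
from dataclasses import dataclass


@dataclass
class CustomerStats:
    """Hypothetical per-customer counts; not a Semgrep data structure."""
    thumbs_up: int
    thumbs_down: int
    findings_total: int
    findings_filtered: int  # findings Assistant filtered out as noise


def agree_rate(customers):
    """Share of all feedback votes that were thumbs up (assumed definition)."""
    up = sum(c.thumbs_up for c in customers)
    votes = up + sum(c.thumbs_down for c in customers)
    return up / votes if votes else 0.0


def average_reduction(customers):
    """Mean of each customer's filtered/total ratio (assumed averaging)."""
    ratios = [
        c.findings_filtered / c.findings_total
        for c in customers
        if c.findings_total
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Made-up sample data, for illustration only.
sample = [
    CustomerStats(thumbs_up=9, thumbs_down=1, findings_total=100, findings_filtered=50),
    CustomerStats(thumbs_up=3, thumbs_down=1, findings_total=200, findings_filtered=140),
]

print(f"agree rate: {agree_rate(sample):.1%}")
print(f"average reduction: {average_reduction(sample):.1%}")
```

Averaging per customer, rather than pooling all findings, keeps a single very large customer from dominating the result. Whether Semgrep averages this way is not stated in this document.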

Internal benchmarks

Internal benchmarks for Assistant use a process in which a rotating team of security engineers conducts periodic reviews of findings and their Assistant-generated triage recommendations or remediation guidance. This is the same process used to evaluate Semgrep's SAST engine and rule performance.

Internal benchmarks for Assistant run on the same dataset that Semgrep's security research team uses to analyze Semgrep rule performance. This means the dataset is not cherry-picked for findings that are easier for AI to analyze, and it represents real-world performance across a variety of contexts.

Findings analyzed	2000+
False positive confidence rate³	96%
Remediation guidance confidence rate⁴	80%
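
The false positive confidence rate measures how often Assistant is correct when it identifies a false positive, which is different from how many false positives it catches. The sketch below illustrates that distinction with made-up review data. The function names and the `reviews` format are hypothetical, and the "coverage" metric is included only for contrast; it is not a figure reported here.

```python
def false_positive_confidence(reviews):
    """Of the findings Assistant flagged as false positives, the share
    that the expert reviewer also judged to be false positives."""
    flagged = [expert for assistant, expert in reviews if assistant]
    return sum(flagged) / len(flagged) if flagged else 0.0


def false_positive_coverage(reviews):
    """Of the findings the expert judged to be false positives, the share
    that Assistant also flagged. Shown for contrast only."""
    actual = [assistant for assistant, expert in reviews if expert]
    return sum(actual) / len(actual) if actual else 0.0


# Each pair is (assistant_says_false_positive, expert_says_false_positive).
# Made-up data, for illustration only.
reviews = [
    (True, True), (True, True), (True, True),  # flagged and confirmed
    (True, False),                             # flagged, but a real issue
    (False, True), (False, True),              # false positives Assistant missed
    (False, False),                            # real issue, correctly not flagged
]

print(f"confidence: {false_positive_confidence(reviews):.0%}")
print(f"coverage: {false_positive_coverage(reviews):.0%}")
```

In this sample, Assistant is right on three of its four false positive calls, yet it flags only three of the five findings the expert considers false positives. A high confidence rate therefore says nothing on its own about coverage.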

¹ Learn more about how Semgrep achieved these numbers in "How we built an AppSec AI that security researchers agree with 96% of the time".
² The average percentage of SAST findings that Assistant filters out as noise.
³ False positive confidence rate measures how often Assistant is correct when it identifies a false positive. A high confidence rate means users can trust Assistant when it identifies a false positive; it does not mean that Assistant catches all false positives.
⁴ Remediation guidance is rated on a binary scale of "helpful" / "not helpful".
