Three months ago, we launched the private beta of Semgrep Assistant, which was our best guess at how AI can actually be helpful in cybersecurity. We're now looking back on the past three months and asking big questions. Is AI useful in the context of cybersecurity? Did AI hype just blind us all? How do you build AI features that are not just gimmicks? Okay, spoiler alert: we think Assistant is pretty damn good, so we're making it available to all Semgrep users (who use github.com) today. You can enable it right now if you want.
A recap: Semgrep is a static code analysis tool that alerts you about security issues and bugs. Our big AI idea was: reviewing security alerts takes human effort, but maybe AI can do some of that thinking for us. Assistant looks at every finding and considers if there's any reason the finding might be wrong and safe to ignore. Otherwise, it recommends an update to the code to fix the bug. Anything Assistant outputs is added as a threaded reply in GitHub pull request comments and Slack notifications. This gives security engineers and developers a head start on addressing findings.
Data from our private beta users
Looking over the past 30 days across all private beta organizations, Semgrep Assistant marked 230 findings as likely false positives. These messages include buttons to 'Agree' or 'Disagree'. Of the 230 cases, we got 35 'Agree' and 2 'Disagree' ratings from our users, which means 95% of ratings are 'Agree'. This looks nice at first glance, but to be fair, the response rate is quite low: only 37 of the 230 findings, or about 16%, received any rating. People might also be more likely to leave positive feedback than negative, especially if they are not fully certain.
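For anyone who wants to check the arithmetic, here is a minimal sketch using only the counts quoted above. The variable names are ours, chosen for illustration.

```python
# Counts quoted in the text for the past 30 days of the private beta.
flagged_as_false_positive = 230
agree = 35
disagree = 2

ratings = agree + disagree

# Share of ratings that were 'Agree'.
agree_rate = agree / ratings

# Share of flagged findings that received any rating at all.
response_rate = ratings / flagged_as_false_positive

summary = f"{agree_rate:.0%} agree, {response_rate:.0%} response rate"
print(summary)
```

The second number is the reason for the caveat: a high agree rate on a small, self-selected sample is encouraging but not conclusive.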
We went a bit further and tried to find out how well Assistant works based on user behavior. After all, a real true positive finding should be fixed, and a real false positive finding should be ignored, right? If Assistant's verdicts are correct, that's what users will actually do. We looked at the data on our largest private beta customer and, comparing the findings Assistant marked as true vs. false positive, found that they're 1.5x more likely to fix the true positives and 2.2x more likely to ignore the false positives.
These metrics have been improving over time as we got better at using the tool and GPT-4 itself got better, too. Speaking of GPT-4's changes, you might have heard some anecdotal complaints that its answers have worsened over time. Our dataset does not point to any such trends. In fact, the pass rate on the test dataset we use for development slightly increased when OpenAI released June's revision, gpt-4-0613.
Here are some strategies we learned on our quest to improve these metrics.
There's no shame in using big, specific prompts
In software development, you usually try to avoid being overly specific about code, preferring elegant and generic solutions instead. For prompt engineering, we had to unlearn this habit. Putting more specific instructions at GPT-4's disposal seems to help it immensely. Instead of lofty, generic descriptions of how to triage findings, we've seen a lot of improvement from treating it more like a junior security engineer whom you'd try to teach everything you know, piece by piece, trusting them to know when to apply which piece of advice.
An example of such advice: injection attacks are a lot less likely in test files since users cannot really supply malicious data to those pieces of code. This is a line we just outright added to the prompt, and GPT-4 now just picks up on cases where it's applicable. It felt a bit dirty at first, with it being a special case, but this addition is responsible for a surprising amount of 'Agree' feedback, and who are we to argue with those numbers? And after all, it's just a line of text in the prompt, which is a lot less complex than special case handling in code normally gets.
But do make prompts focused when you can
While the top priority is increasing the amount of knowledge available to the model, sometimes you can easily filter out irrelevant pieces of advice. We saw an interesting case with the above injection-attack example, where GPT-4 got distracted by a code quality finding coming from a test file, and labeled it 'not exploitable'. Technically true, but that doesn't mean the code quality finding can just be ignored; the concern there was never about security.
We realized there's an easy fix to this. Semgrep rules contain metadata about the category of finding they're pointing out. Instead of just passing this information to the model, we can outright change the pieces of advice we include based on the rule category. Now, we have unique instructions for reviewing security, performance, and code quality findings.
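Here is a minimal sketch of how that category-based selection could look. Everything in it is an assumption made for illustration: the category keys, the advice wording (apart from the test-file line paraphrased from above), and the `build_triage_prompt` helper are hypothetical and not Assistant's actual prompt or code.

```python
from typing import Dict, List

# Advice that applies to every finding (hypothetical wording).
GENERAL_ADVICE: List[str] = [
    "Decide whether the finding is a true positive or a false positive, and explain why.",
]

# Advice included only for findings from rules in the given category.
# Keys and wording are illustrative assumptions.
ADVICE_BY_CATEGORY: Dict[str, List[str]] = {
    "security": [
        "Injection attacks are much less likely in test files, because users "
        "cannot supply malicious data to that code.",
    ],
    "performance": [
        "Judge the finding by its runtime cost, not by whether it is exploitable.",
    ],
    "code-quality": [
        "Judge the finding by maintainability; being unexploitable is not a "
        "reason to ignore it.",
    ],
}


def build_triage_prompt(category: str, file_path: str, code: str) -> str:
    """Assemble a triage prompt that contains only the advice for this category."""
    advice = GENERAL_ADVICE + ADVICE_BY_CATEGORY.get(category, [])
    lines = [
        "You are reviewing a static analysis finding.",
        f"Rule category: {category}",
        f"File: {file_path}",
        "Advice:",
    ]
    lines += [f"- {item}" for item in advice]
    lines += ["Code:", code]
    return "\n".join(lines)


prompt = build_triage_prompt("code-quality", "tests/test_app.py", "x = 1")
print(prompt)
```

With this shape, a code quality finding in a test file never sees the injection-attack advice, so the model has nothing to get distracted by.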
You can trick GPT-4 into revealing its confidence
This one's for those excited about the autofix capabilities. Writing correct code is a much more complex task than reading a finding and matching the right pieces of information to each other. We thought it would be too ambitious of us to try to get high quality autofixes in all cases. Because of this, our first step was to have something in place to filter out incorrect suggestions before trying to improve their quality. You can now set your minimum expectation for the quality of autofixes.
If you've ever tried prompting GPT-4 about confidence, you'll know this wouldn't just work straight away and definitely requires some grade-A trickery. The model is built on predicting text token by token, generally favoring the most likely options. It does not, however, have a semantic understanding of what the token probabilities mean for its confidence. Furthermore, the model is constantly rewarded during training for confident (and correct) answers, while presumably, there are not many cases where it would be rewarded for giving a wrong answer with low confidence. All this is what leads to hallucinations, Bing Chat arguing with users, and the fact that if you ask the model to include a confidence rating along with its answer, it pretty much always gives a 70+ percent rating.
So how do we extract confidence from it? The trick is to break down what 'confidence' in a fix means into gathering more focused data points that indirectly contribute to the likelihood of a fix being good. Once GPT-4 outputs a fix recommendation, we take note of that and move on to a brand new prompt. In this separate prompt, we provide the old and the new code and ask a series of questions: after applying this fix, how likely is it that the code fails to compile due to a syntax error? That the code won't work without making corresponding changes in another file? And so on. We take the likelihood of these issues arising, and combine it into a confidence rating for the autofix itself. This way, we can avoid posting suggestions that probably wouldn't work when applied.
With that said, we recommend keeping even low confidence autofixes enabled. These fixes are often still a good starting point and can save valuable engineer time, even if it's only half of the required fix.
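To make the combining step concrete, here is a minimal sketch under stated assumptions. The text above does not say how the individual likelihoods are combined; this example assumes the issues are independent and multiplies the chances that each one does not occur. The function names, the issue names, and the threshold values are all hypothetical.

```python
from typing import Dict


def combine_confidence(issue_likelihoods: Dict[str, float]) -> float:
    """Estimate the chance that a fix hits none of the listed issues.

    Each value is the model's estimated likelihood (0.0 to 1.0) that a
    specific problem occurs after applying the fix. Treating the issues as
    independent is a simplifying assumption of this sketch.
    """
    confidence = 1.0
    for issue, likelihood in issue_likelihoods.items():
        if not 0.0 <= likelihood <= 1.0:
            raise ValueError(f"likelihood for {issue!r} must be between 0 and 1")
        confidence *= 1.0 - likelihood
    return confidence


def should_post(confidence: float, minimum_confidence: float) -> bool:
    """Post the autofix only if it meets the user's minimum expectation."""
    return confidence >= minimum_confidence


# Answers to the follow-up questions for one suggested fix (made-up values).
answers = {
    "syntax_error": 0.1,
    "needs_changes_in_other_files": 0.2,
}
confidence = combine_confidence(answers)
print(round(confidence, 2), should_post(confidence, minimum_confidence=0.5))
```

A product is only one option; taking the minimum or a weighted average would work with the same inputs. The point is that each focused question is easier for the model to answer than "how confident are you?".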
Where do we go from here?
A very common request has been getting Assistant to write Semgrep rules. We have built a prototype of this feature which asks the user for three things as input: a rule description, an example of bad code that should be matched, and good code that should not. This tool can often correctly generate rules that are around a 3 out of 10 on a complexity scale. We'd like to aim higher than that, so we'll continue testing this internally. We're also considering smaller scale AI features that might help with rule writing, such as including AI-written explanations alongside error messages, with suggestions on how to fix them. An additional challenge for rule writing applications is GPT-4's training data cutoff date; most content about Semgrep rule writing has been published after September 2021, so we need to include a lot of the fundamentals in every prompt.
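As a rough illustration of the three inputs and the "fundamentals in every prompt" idea, here is a minimal sketch. The `RuleRequest` type, the `build_rule_prompt` helper, and the primer text are assumptions made for this example; the prototype's real prompt is not shown in this post and would need far more background than this.

```python
from dataclasses import dataclass

# A tiny stand-in for the rule-writing fundamentals the model may not have
# seen in its training data. Illustrative only.
RULE_WRITING_PRIMER = (
    "Semgrep rules are written in YAML. "
    "Each rule has an id, a message, a severity, a list of languages, "
    "and a pattern or a combination of patterns. "
    "In a pattern, `...` matches any sequence of arguments or statements, "
    "and metavariables such as `$X` match any expression."
)


@dataclass
class RuleRequest:
    """The three inputs the prototype asks the user for."""
    description: str
    bad_code: str   # code the rule should match
    good_code: str  # code the rule should not match


def build_rule_prompt(request: RuleRequest) -> str:
    """Put the primer first, then the user's description and examples."""
    return "\n\n".join([
        RULE_WRITING_PRIMER,
        f"Write a Semgrep rule for: {request.description}",
        f"The rule must match this code:\n{request.bad_code}",
        f"The rule must not match this code:\n{request.good_code}",
    ])


request = RuleRequest(
    description="flag eval() on user input",
    bad_code="eval(user_input)",
    good_code="ast.literal_eval(user_input)",
)
print(build_rule_prompt(request))
```

The good and bad examples also double as test cases: a generated rule can be checked against both before it is shown to the user.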
Another thing we could focus on is plugging GPT into multiple parts of a system to create a positive feedback loop of increased data quality across the product. For example, we mentioned that a rule's category can help construct a more focused prompt when triaging. Many custom rules written by customers lack this kind of metadata. We'd like Semgrep Assistant to automatically classify rules and suggest changes to customers' rules that add this data. Even better, we'd like auto-triage results to feed into pattern-writing, to adapt the rules themselves to filter out common false positives. And when the rule itself is more specific, that will make triaging its findings easier.
Auto-triage itself has a lot of room to grow in scope. Instead of just reacting to notifications, we could use the same understanding of triaging to work through a whole backlog of findings and give security teams the top 5 issues to address instead of just a list of 10,000 findings sorted by rule priority. This part could even help security teams prioritize across multiple backlogs, such as both Semgrep Code and Semgrep Supply Chain.
There are several possible directions; please let us know on our community Slack which one you're most excited about! Until then, we'll be taking a nap on our comfy pile of prompts.
Semgrep is a fast, open-source, code scanning tool for finding bugs, detecting dependency vulnerabilities, and enforcing code standards.