client.chat.completions.create
: the most effective and least reliable four words I've written in my career in software. I've used these four words a lot this past year; today, Semgrep Assistant's various AI features are powered by 14 separate prompt chains. The sheer variance of behaviors in production gave us quite a lot to figure out — here's a summary of what we, the AI team at Semgrep, have learned, so you can also tell if your LLM product or feature actually works.
Laboratory, Feedback, and Behavior
The three categories we have for quality metrics are Behavior, Feedback, and Laboratory.
Behavior metrics are ideally the reason you tried using AI to begin with. If you're not just adding AI features to ride the hype for marketing purposes, you should have a clear hypothesis about how your feature will improve users' lives, such as code fix suggestions decreasing the mean time to resolution on security vulnerabilities. However, it can be hard to isolate the impact of an AI enhancement within a larger user workflow. If AI's part is small, then the expected impact will be small, and you'll therefore need much larger sample sizes to see if there was a significant improvement. In these cases, it is acceptable to not target Behavior metrics.
Feedback metrics are explicit quantifiable feedback on AI outputs reported by users. We recommend a pair of thumbs up / thumbs down buttons. While numeric scoring (1-5 stars, perhaps) promises higher resolution, you'll likely need larger sample sizes than you have access to because users interpret the meaning of the range differently, and that variance would only balance out over hundreds of records. If you're thinking you have enough users for hundreds of records, do consider that ideally you'll segment this collected feedback by model and prompts, both of which can change any month and require starting a new dataset, so that you can see if users actually consider your new prompts better.
A common pitfall with Feedback metrics is that the ratings you collect will always skew one way or the other. Your user interface can cause bias; we used to record feedback in pull request comments with slash-commands, which is fairly high friction. We hypothesize that people excited about a correct AI answer would happily type out the command, but anyone unimpressed by a wrong answer would not bother to type out feedback. Bias can also come from your user base: AI early adopters might be more forgiving when AI is wrong, while users whose managers made the decision to use Semgrep might be quite critical. So, what can you do? Start by interviewing and observing users to see which way your dataset's bias leans. More importantly, run interface experiments until you get a more balanced mixture of feedback; negative ratings are what make improving your product possible! It's also easier to recognize shifts in quality after a change when feedback is more balanced.
Now, on to the big one: Laboratory metrics. These include evals, which are basically test suites for AI outputs — test suites that are possibly flaky-by-design. The category also covers your team reviewing and scoring outputs. We consider both of these Laboratory metrics because they are reproducible by your team without waiting on anything external to happen. This is what makes Laboratory metrics the most rewarding for an engineering team; the feedback loop is orders of magnitude shorter than observing what happens in production. Somehow this cruel universe made it so that the shorter the feedback loop, the harder it is to define and build the infrastructure for a quality metric, which is why the rest of this article goes in depth on Laboratory metrics specifically.
Setting up your lab
Now we know we need to write some code that queries an LLM the way your application would and scores the responses. For some features this scoring is exactly what you'd put in a normal test suite: assert that the output matches or contains a given string, especially if you have the LLM output structured data as JSON. This is applicable, for example, to Semgrep's component tagging feature, which classifies vulnerable source code based on the code's responsibility, such as user authentication. In these cases there is one known correct answer that we want to see. But other features are not so easy. When returning English-language instructions on how to fix a finding, we cannot expect the model to always use the same phrasing, and there might be multiple correct answers to begin with. In more complex cases, we'll ask an LLM to grade responses, which is the final bit of complication in this whole process. Here's a map of where we are, and how much unreliability each branch introduces to the system:
The further we delve, the fuzzier it all gets
Now we can further specify what our system needs to do:
Query the LLM the same way as production does
Grab the AI answer, and a human-written gold standard answer for the same prompt
Ask another LLM to compare the two answers and rate how much they agree[1] (a sketch of this grading step follows the list)
Optionally, ask yet another LLM to rate more attributes of the AI answer, such as its writing style, against a style guide
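To make that comparison step concrete, here is a minimal sketch of an LLM-as-judge check using the same client.chat.completions.create call this article opened with. The judge prompt wording and the grade_agreement helper are illustrative only, not our production grader; as described later, we delegate this step to promptfoo's built-in assertions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI-generated answer against a human-written gold standard.
Reply with exactly one word: AGREE if the two answers recommend the same fix, DISAGREE otherwise.

Gold standard answer:
{gold}

AI answer:
{candidate}
"""


def grade_agreement(gold: str, candidate: str, model: str = "gpt-4o") -> bool:
    """Ask a judge model whether the AI answer agrees with the human gold standard."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the judge as deterministic as possible
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(gold=gold, candidate=candidate)}],
    )
    return response.choices[0].message.content.strip().upper() == "AGREE"
```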
Query like production
How do you run your production prompts while testing? There's a tradeoff to be made here, akin to unit tests versus end-to-end tests. Either you set up your entire application and simulate actual requests, observing what the whole application outputs, which we'd consider end-to-end testing; or you isolate just the part that sends prompts and receives AI replies, which we'd consider more on the unit testing side — although the "unit" here includes a third-party, non-deterministic black box.
We tried the end-to-end approach first, but found it too heavyweight for our purposes. It requires a lot of supporting infrastructure to run each test case, has slow startup times, and forces us to maintain test executor scaffolding even though better open source tooling already exists. Tying everything to the app also meant that to compare quality with different hyperparameters (such as the model and temperature), we needed to change the runtime code of the app. This had the downstream effect of making results hard to cache. If we update only one test case in our dataset, there should theoretically be no need to re-run all the other cases. But if code changes in the app can alter AI outputs, any change in the repository has to bust cached LLM responses, wasting a lot of time and money on running every prompt from scratch.
There's always a relevant XKCD
What's the alternative? We could replicate just the code running that LLM request at the top of the diagram. That box should need only two things (though in an existing system it often needs more): the prompt templates and the template variables. Your runtime code must follow some core rules that are super easy to inadvertently break when you first start hacking with LLMs.
Rule 1: By the time you start rendering prompts, your template variables must be immutable and serializable. You want the ability to snapshot an execution by dumping all variables to JSON, so that you can resume from that point elsewhere with a different prompt, or different hyperparameters. It's okay for this JSON artifact to contain extraneous data — even if you don't use a value currently, you might want to try using it in an alternative prompt. Dynamically assembling prompt text in application code is also unacceptable. For instance, we used to have a constant list of acceptable categories that the LLM can respond with for our component tagging feature, and we'd just add a message generated from this list to the prompt. To make this reproducible outside the application, where the constant list is not importable, we now give this text a dedicated placeholder in the templates, and each execution snapshots the constant value in its JSON.
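As a minimal sketch of Rule 1 (footnote [3] mentions we use Pydantic for this; the field names below are hypothetical and assume Pydantic v2, not our real schema), a frozen model gives you both immutability and the JSON snapshot essentially for free:

```python
from pydantic import BaseModel, ConfigDict


class ComponentTagVars(BaseModel):
    """Template variables for one generation, snapshot-able as JSON (Rule 1)."""

    model_config = ConfigDict(frozen=True)  # mutating a field raises an error

    file_path: str
    file_contents: str
    allowed_tags_text: str  # the rendered constant list, snapshotted with each execution


# Snapshot the execution when rendering prompts...
snapshot = ComponentTagVars(
    file_path="src/auth/login.py",
    file_contents="def login(): ...",
    allowed_tags_text="- user authentication\n- payments\n- logging",
).model_dump_json()

# ...and resume from it elsewhere with a different prompt or hyperparameters.
restored = ComponentTagVars.model_validate_json(snapshot)
```

Because the snapshot is plain JSON, it can be replayed against a different prompt, model, or temperature without importing anything from the application.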
Rule 2: When rendering prompts, all control flow must be done by the templating engine. That is to say, you are not allowed to use if conditions or for loops in your application code to decide what text gets rendered. All these decisions must be made by the templating engine, based on the variables snapshot from above. One case where your faith in this rule will be tested is when you generate messages based on constants in your backend. For instance, when we ask LLMs to tag a file with what it's responsible for, we tell them the list of tags our backend recognizes. The straightforward solution is to render a message from that list of tags when sending off a request, but this would again introduce a dependency on our app code when running evals. So we instead follow Rule 1 and do the slightly surprising thing of saving the rendered list of tags as a template variable with each generation.
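Here's a rough illustration of Rule 2 with Jinja2; the template text and variable names are invented for the example, but the point is that the if and for live in the template and are driven purely by the snapshotted variables:

```python
from jinja2 import Environment

# The `if` and `for` live in the template; the app only supplies snapshotted variables.
TEMPLATE = """Explain how to fix this finding.
{% if documentation %}
Relevant documentation:
{{ documentation }}
{% endif %}
{% for decision in past_triage_decisions %}
A teammate previously triaged a similar finding: {{ decision }}
{% endfor %}"""

rendered = Environment().from_string(TEMPLATE).render(
    documentation="",  # a falsy value simply drops the section; no `if` in app code
    past_triage_decisions=["Ignored: only reachable from test code"],
)
print(rendered)
```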
Okay. Now what?
Tooling, and why you should use promptfoo
Following the two rules above, you should have something that resembles a test suite's parameters. Now you need a test runner: something that knows how to call an LLM and how to check if its results are correct. We wrote our own scripts at first, then realized we did not want to start building UIs to compare responses, and we did not want to write plumbing for every new LLM provider that could plausibly outperform our current one just to be able to test it.
Thankfully, there are plenty of vendors out there selling shovels. We quickly discarded anything with no source code published on GitHub — we're partial to business models like our very own Semgrep project's, but especially in the early years of a product category, it's way too early to blindly commit to a proprietary system. There were very few projects designed primarily to score evals, and of the ones we've seen, promptfoo had the healthiest community. In retrospect we made an excellent choice back then: prestigious organizations such as Shopify are standardizing on promptfoo, and the project secured $5M in funding led by A16Z just last month.
The main development loop and benefits of promptfoo look like this for us:
Decide on a set of possibly beneficial changes to the LLM feature, for example: a) change the prompt's system message, b) include/exclude a piece of context in the form of a template variable, c) switch to a different model provider, d) try a new fine-tuned model, or e) adjust hyperparameters such as temperature.
Express these variants in a promptfooconfig.yaml file (a sketch of such a config follows this list).
Run our wrapper script around promptfoo (more on the wrapper in the next section).
After around a minute, review each variant’s correctness rate, style score, cost, and latency on promptfoo’s self-hosted web UI.
The view of promptfoo after running your first eval
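To give a concrete flavor of step 2, here is a small promptfooconfig.yaml sketch based on promptfoo's documented configuration keys; the file paths, variables, and assertion values are invented for illustration and are not our actual configuration:

```yaml
# Compare two prompt variants across two providers in a single run.
prompts:
  - file://prompts/fix-current.txt
  - file://prompts/fix-new-system-message.txt

providers:
  - id: openai:gpt-4o
    config:
      temperature: 0.2
  - openai:gpt-4o-mini

tests:
  - vars:
      finding: "SQL query built via string concatenation"
      code: "cursor.execute('SELECT * FROM users WHERE id = ' + user_id)"
    assert:
      - type: factuality
        value: "Use a parameterized query instead of concatenating user input."
      - type: llm-rubric
        value: "The answer is concise and written in an instructional tone."
```

Every extra prompt file or provider entry adds a column to the comparison matrix, which is exactly what makes the side-by-side review in step 4 worthwhile.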
Anatomy of an actual promptfoo run
Promptfoo comes with built-in management of the three inputs it needs: the prompts, the template variables, and the providers. It expects all three to be found via a YAML file. However, we found it hard to keep all three in sync between a test workspace and our actual app, so things get a bit trickier. Of the three, provider configuration is easiest: we just keep it committed in promptfooconfig.yaml files.
The prompts themselves are a bit harder, but because of the guarantees provided by Rule 2 above, a simple wrapper script we call ev (or Eevee) can grab those for us just in time before a promptfoo test execution. Eevee knows which feature we’re testing, and it knows the directory structure of our application. It can then look up the correct templates directory, which would contain files such as 00-system-identity.j2, 10-user-finding.j2, and 20-system-documentation.j2. The filenames encode the order of the messages, and the roles that should be attached to each.[2] After collecting these templates, Eevee drops them in the format promptfoo expects in a temporary directory. By sheer luck, even though our backend is in Python and promptfoo is a TypeScript project, the templating engines they each use (Jinja2 and Nunjucks respectively) have identical syntax for every templating feature we use.
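Simplified heavily, the collection step looks something like the sketch below; the regular expression and helper name are invented for illustration, and the real Eevee does more than this:

```python
import re
import shutil
from pathlib import Path

# Filenames like "10-user-finding.j2" encode the message order and role.
FILENAME_RE = re.compile(r"^(?P<order>\d+)-(?P<role>system|user|assistant)-(?P<name>.+)\.j2$")


def collect_templates(feature_dir: Path, workspace: Path) -> list[dict]:
    """Copy one feature's templates into a promptfoo workspace, ordered by numeric prefix."""
    messages = []
    for path in sorted(feature_dir.glob("*.j2")):
        match = FILENAME_RE.match(path.name)
        if match is None:
            continue  # skip files that don't follow the naming convention
        shutil.copy(path, workspace / path.name)
        messages.append({"role": match["role"], "file": path.name})
    return messages
```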
Gathering template variables is the hardest part, and of course we have two different ways to go about it. For features that don't rely on the LLM seeing realistic account history, we keep around a staging-like application database filled with test data generated from a large selection of open source repositories on GitHub. We then import the application code responsible for translating app database entries into our immutable template variables dictionary[3] (remember Rule 1?) and run that code pointed at the shared database pre-seeded with test cases. The repository contains a file which maps row IDs in this database to the specific strings we expect the LLM to reply with for each row. Eevee then saves a new promptfooconfig.yaml alongside the previously-rendered prompts, where it combines our committed provider configuration with all these variables under the tests: key.
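Conceptually, that last step is just a merge: committed provider configuration, plus the rendered prompts, plus one tests: entry per seeded row. A rough sketch, with hypothetical function and field names and an exact-match assertion standing in for our real checks:

```python
from pathlib import Path

import yaml


def write_promptfoo_config(workspace: Path, provider_config: dict, cases: list[dict]) -> None:
    """Merge committed providers, rendered prompts, and seeded cases into promptfooconfig.yaml."""
    config = {
        **provider_config,  # e.g. the committed `providers:` section
        "prompts": sorted(f"file://{p.name}" for p in workspace.glob("*.j2")),
        "tests": [
            {
                "vars": case["template_vars"],  # the immutable snapshot built from the database
                "assert": [{"type": "equals", "value": case["expected"]}],
            }
            for case in cases
        ],
    }
    (workspace / "promptfooconfig.yaml").write_text(yaml.safe_dump(config, sort_keys=False))
```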
We use another method to gather template variables when the feature being tested relies on referencing realistic account activity. For example, when we have an LLM triage findings, we make it reference past triage decisions and comments that humans on the same account have made. It's unrealistic for us to create, by hand, a diverse set of ~100 accounts, each with its own activity history going back years, and we don't trust ourselves to generate diverse enough data synthetically, either. Instead, we capture a random sample of immutable template variable dictionaries in production, and keep those in an S3 bucket for up to 6 months. We then look at the prompts those would generate, and save human-written gold standard responses alongside them. When Eevee tests these sorts of features, it pulls test cases directly from S3 as test vars, and generates model-graded assertions comparing the LLM output to what we wrote ourselves. Specifically, we use a factuality assertion to ask GPT-4o to compare the AI and human answers and give a thumbs up or thumbs down on whether they agree with each other. We also add some additional assertions to each test case to score the answer on style criteria, to ensure that the answer is not only correct but also presented appropriately for the context in which it will appear.
We've been using this system for at least half a year to make decisions about which models to use for specific features, letting us upgrade with confidence the day a new one is released. We've also been able to rapidly test variations of fine-tuned models, changes to prompt text, and including or excluding different pieces of context.
What’s next
That should cover your needs for… the next 2-3 months at the rate AI product development is evolving. Watch this space for more tips on how to evaluate even more complex features, such as when you use the consensus of multiple models, or when you look at logprobs rather than just the text output of the LLM.
Footnotes
[1]: In some cases, we have even more complex testing requirements, such as scanning generated code with Semgrep to ensure our code scanner does not flag any vulnerabilities in the new code. Such assertions are highly specialized, so this article does not go into detail on them.
[2]: Just to be technically correct, as of publishing we’re using an equivalent but more complex way to mark the order and role of the messages. It involves dynamically importing application code and running its rendering methods with blank variables. But surely you don’t care about the specifics of our technical debt?
[3]: To enforce immutability, these are actually implemented as Pydantic models.