Guardrails for PromQL using Semgrep

Michael Hoffmann is a Site Reliability Engineer at Aiven, working on all things logs and metrics.

Motivation

Prometheus is a popular open-source monitoring system backed by the CNCF. It has a vibrant community, a rich ecosystem of exporters and libraries for instrumenting applications in almost any language, and built-in support for alerting.

At Aiven, we use Thanos as a distributed Prometheus setup with long-term storage capabilities, which has served us well.

Prometheus comes with its own well-documented and feature-rich query language called PromQL. It’s well adopted, and the internet contains plenty of PromQL examples, guides, and query builders.

But writing good expressions is surprisingly hard in my experience! It usually requires a fair bit of domain knowledge about the metrics and how they are collected. Some basic knowledge of the query engine helps too: just consider the famous “rate then sum, never sum then rate” rule.
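The reason behind that rule is that rate() corrects for counter resets per series, and summing first hides the reset inside an aggregate that then appears to have restarted as a whole. The following is a minimal sketch of that effect. The helper `increase` is a simplified stand-in (no extrapolation, made-up sample values), not Prometheus's actual implementation.

```python
def increase(samples):
    """Total increase of a counter, treating any drop as a reset to zero.

    Simplified stand-in for what rate() does per series (no extrapolation).
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter restarted, so the whole current value is new.
        total += cur - prev if cur >= prev else cur
    return total


# Two instances of the same counter; instance b restarts after its third scrape.
a = [100, 110, 120, 130, 140]
b = [500, 510, 520, 5, 15]

# Correct: handle resets per series, then aggregate.
rate_then_sum = increase(a) + increase(b)

# Wrong: aggregate first, so the restart of b looks like a reset of the whole sum.
sum_then_rate = increase([x + y for x, y in zip(a, b)])

print(rate_then_sum, sum_then_rate)
```

In this made-up data the per-series calculation sees an increase of 75, while the summed series reports 195, because the aggregate's drop is misread as a reset of everything.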

For the sake of the article, let's go over two examples (only looking at the “expr” field of the alerting rules for brevity).

Example 1:

Let's say you collect metrics for multiple teams, disambiguated by a tenant_id label for each team. Now consider the following expression:

expr: |
    sum(rate(http_requests_total{route="/health", status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{route="/health"}[5m])) 
    > 0.1

Because the tenant_id label matcher is missing, the expression aggregates over the requests of every tenant instead of just our own, which changes the meaning of the alert in an unexpected way.
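To make the change in meaning concrete, here is a small sketch with invented numbers: two hypothetical tenants, one failing half of its health checks and one healthy but much busier. The tenant names and rates are assumptions for illustration only.

```python
# Hypothetical per-tenant request rates in requests per second.
rates = {
    "team-a": {"errors": 5.0, "total": 10.0},    # half of team-a's health checks fail
    "team-b": {"errors": 0.0, "total": 1000.0},  # team-b is healthy and much busier
}
THRESHOLD = 0.1  # same threshold as the alert expression


def error_ratio(tenants):
    """Error ratio over the given tenants, mirroring sum(errors) / sum(total)."""
    errors = sum(rates[t]["errors"] for t in tenants)
    total = sum(rates[t]["total"] for t in tenants)
    return errors / total


all_tenants_ratio = error_ratio(rates)   # what the expression without tenant_id computes
team_a_ratio = error_ratio(["team-a"])   # what the alert was meant to compute

print(all_tenants_ratio > THRESHOLD, team_a_ratio > THRESHOLD)
```

With these numbers the unscoped ratio stays far below the threshold although team-a is clearly unhealthy. The opposite can happen too: another tenant's errors can fire an alert that was meant for our own service.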

Example 2:

Matrix selectors and subqueries return a range vector, so if you want to execute a function that takes a range vector like “sum_over_time” you might well write:

expr: |
    sum_over_time(my_metric{tenant_id="my-team"}[5m:]) > 10

By writing [5m:] instead of [5m], you used a subquery instead of a matrix selector. The two constructs have different semantics, which can lead to surprising issues down the road, mainly because of lookback and possibly different collection intervals.
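A rough simulation shows how far the two can drift apart. It assumes a 15 second scrape interval, a 1 minute evaluation interval (which a subquery without an explicit step falls back to), and a 5 minute lookback. The window boundary handling is simplified and is not meant to mirror any particular Prometheus version.

```python
SCRAPE_INTERVAL = 15  # seconds between raw samples (assumed)
EVAL_INTERVAL = 60    # subquery step when none is given (assumed evaluation interval)
LOOKBACK = 300        # assumed lookback for instant selectors, in seconds
WINDOW = 300          # the 5m range

# A series that reports the value 1 on every scrape.
samples = {t: 1 for t in range(0, 1201, SCRAPE_INTERVAL)}


def matrix_selector(now):
    """Raw samples with now - WINDOW < t <= now."""
    return [v for t, v in samples.items() if now - WINDOW < t <= now]


def instant_value(at):
    """Most recent raw sample at or before `at`, within the lookback window."""
    candidates = [t for t in samples if at - LOOKBACK < t <= at]
    return samples[max(candidates)] if candidates else None


def subquery(now):
    """Instant values re-evaluated at every step-aligned timestamp in the window."""
    steps = [t for t in range(now - WINDOW + 1, now + 1) if t % EVAL_INTERVAL == 0]
    values = (instant_value(t) for t in steps)
    return [v for v in values if v is not None]


now = 1200
raw_sum = sum(matrix_selector(now))   # like sum_over_time(m[5m])
subquery_sum = sum(subquery(now))     # like sum_over_time(m[5m:])

print(raw_sum, subquery_sum)
```

Under these assumptions the matrix selector sums 20 raw samples while the subquery sums only 5 re-evaluated points, so a threshold such as `> 10` behaves very differently depending on which form was written.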

Catching problems like these manually in review is too fragile. Ideally, we want a tool that catches them automatically and is configurable enough to encode any in-house rules we might have around our metrics.

Enter Semgrep!

Semgrep to the rescue!

Semgrep is a static analysis tool that lets you write patterns that resemble the syntax of the language being analyzed (and far more; please read the documentation for details). And it recently got PromQL support!

Installing the Semgrep CLI is as simple as

pip install semgrep

Now, having our PromQL examples from above in mind, consider the following rules:

rules:
- id: extract-alerting-rule-expression-to-promql
  mode: extract
  languages: [yaml]
  pattern: |
    expr: $QUERY
  extract: $QUERY
  dest-language: promql

rules:
- id: selectors-should-have-tenant-id
  languages: [promql]
  severity: ERROR
  patterns:
  - pattern-either:
    - pattern: |
        {...}
    - pattern: |
        {...}[$D]
  - pattern-not: |
      {..., tenant_id="...", ...}
  - pattern-not: |
      {..., tenant_id="...", ...}[$D]
  message: Selector expressions should contain an equality match on the tenant_id label.

rules:
- id: use-matrix-selector-instead-of-subquery
  languages: [promql]
  severity: ERROR
  patterns:
  - pattern-either:
    - pattern: |
        {...}[$R:]
    - pattern: |
        {...}[$R:$S]
  message: Matrix selectors should be preferred over subqueries.

We have defined three rules. The first extracts all PromQL expressions by looking for them under the expr key of a YAML file and feeds them into rules that target the PromQL language. The remaining two define the patterns we want to catch. To keep them in a single rule.yaml, as the scan command below assumes, list all three under one rules: key.

Let's go over some patterns we used here:

{...} matches any vector selector
{..., tenant_id="...", ...} matches vector selectors with the label tenant_id
{..., tenant_id="...", ...}[$D] matches any matrix selector with the label tenant_id
{...}[$R:$S] and {...}[$R:] together match a vector selector in a subquery

We can now run Semgrep with those rules to find our offending expressions:

semgrep scan -c rule.yaml target.yaml

┌─────────────┐
│ Scan Status │
└─────────────┘

  Scanning 1 file tracked by git with 3 Code rules:

  Scanning 1 file.

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
┌─────────────────┐
│ 3 Code Findings │
└─────────────────┘

target.yaml
    selectors-should-have-tenant-id
       Selector expressions should contain an equality match on the tenant_id label.
         3┆ sum(rate(http_requests_total{route="/health", status=~"5.."}[5m]))
         ⋮┆----------------------------------------
         5┆ sum(rate(http_requests_total{route="/health"}[5m]))
         ⋮┆----------------------------------------

    use-matrix-selector-instead-of-subquery
       Matrix selectors should be preferred over subqueries.
         8┆ sum_over_time(my_metric{tenant_id="my-team"}[5m:]) > 10

┌──────────────┐
│ Scan Summary │
└──────────────┘
Some files were skipped or only partially analyzed.
  Partially scanned: 1 files only partially analyzed due to parsing or internal Semgrep errors

Ran 3 rules on 1 file: 3 findings.

With these rules running in CI, expressions that fall into this class of problem are caught before they are merged.
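One way to wire this into a CI job is a small wrapper that fails the build when Semgrep reports findings. This is only a sketch: the helper names are made up, and it assumes the `semgrep` CLI is on the PATH and that the `--error` flag (exit non-zero on findings) is available in your version.

```python
import subprocess


def semgrep_command(rule_file, targets):
    """Build a `semgrep scan` invocation that exits non-zero when there are findings."""
    # --error makes findings fail the process, which in turn fails the CI job.
    return ["semgrep", "scan", "--error", "-c", rule_file, *targets]


def check(rule_file, targets):
    """Run the scan and return True when no rule matched (requires semgrep on PATH)."""
    result = subprocess.run(semgrep_command(rule_file, targets))
    return result.returncode == 0


# Example usage in a CI script (not executed here):
#   if not check("rule.yaml", ["target.yaml"]):
#       raise SystemExit(1)
```

Keeping the command construction separate from the execution makes it easy to review which flags the job actually passes.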

The grammar is already pretty expressive. Here are some things that are possible but that we did not use in these examples:

$F(...) will match any function call and capture the function name into $F
(...)[$R:$S] will match any subquery and capture its range and step into $R and $S
sum without (..., foo, ...) (...) will match all sums that sum at least over the foo label
All of this can be mixed and matched and combined with Semgrep's rule syntax
We can compare captured ranges using parse_promql_duration in metavariable comparisons (if you want to ban overly long ranges, for example)
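To illustrate what comparing captured ranges amounts to, here is a standalone sketch that converts PromQL-style durations to seconds and checks them against a limit. It is not Semgrep's parse_promql_duration; the function names and the one hour limit are assumptions made for this example.

```python
import re

# Seconds per unit, following the units PromQL accepts in durations (a year is 365 days).
UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}
TOKEN = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def duration_seconds(text):
    """Convert a duration such as '1h30m' to seconds; raise ValueError on anything else."""
    pos, total = 0, 0.0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"not a duration: {text!r}")
        total += int(match.group(1)) * UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError("empty duration")
    return total


def range_too_long(captured, limit="1h"):
    """True when a captured range such as the value of $R exceeds the limit."""
    return duration_seconds(captured) > duration_seconds(limit)


print(range_too_long("5m"), range_too_long("2h"))
```

Comparing parsed durations rather than raw strings matters because "90m" and "1h30m" denote the same range but differ textually.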

PromQL support in Semgrep is still experimental; feel free to help us out either by reviewing or providing feedback in the Semgrep Community Slack.

I also want to thank the Semgrep community, which has been very welcoming and helpful. Adding PromQL support was a very enjoyable experience all around!
