Guardrails for PromQL using Semgrep

In this blog post, our rockstar community member and contributor, Michael Hoffmann, shares how to make PromQL alerting and recording rules more reliable and consistent using Semgrep.

Michael Hoffmann is a Site Reliability Engineer at Aiven, working on all things logs and metrics.


Motivation

Prometheus is a popular open-source monitoring system backed by the CNCF. It has a vibrant community, a rich ecosystem of exporters and libraries for instrumenting applications in almost any language, and built-in support for alerting.

At Aiven, we use Thanos as a distributed Prometheus setup with long-term storage capabilities, which has served us well.

Prometheus comes with its own well-documented and feature-rich query language called PromQL. It’s well adopted, and the internet contains plenty of PromQL examples, guides, and query builders.

But writing good expressions is surprisingly hard in my experience! It usually requires a fair bit of domain knowledge about the metrics and how they are collected. Some basic knowledge of the query engine is also required: just consider the famous “rate then sum, never sum then rate” rule.
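To illustrate the “rate then sum” rule, here is a small sketch of two recording rules. The rule names and the job label are hypothetical and only serve as an illustration; the point is the order of rate() and sum().

```yaml
# Hypothetical recording rules, for illustration only.
- record: job:http_requests:rate5m
  # rate then sum: rate() sees every counter series on its own,
  # so counter resets are handled per series before aggregating.
  expr: sum by (job) (rate(http_requests_total[5m]))
- record: job:http_requests:rate5m_wrong
  # sum then rate: summing the raw counters first hides individual
  # counter resets from rate(), which can produce misleading results.
  expr: rate(sum by (job) (http_requests_total)[5m:])
```

Note that the second form only parses at all because of the subquery syntax [5m:], which is a hint that something unusual is going on.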

For the sake of the article, let's go over two examples (only looking at the “expr” field of the alerting rules for brevity).


Example 1:

Let's say you collect metrics for multiple teams, disambiguated by a tenant_id label for each team. Now consider the following expression:


expr: |
    sum(rate(http_requests_total{route="/health", status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{route="/health"}[5m]))
    > 0.1


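By forgetting to specify the tenant_id label matcher, we changed the meaning of the alert in an unexpected way: the expression now aggregates over every tenant, so it describes the combined error ratio rather than a single team's traffic.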
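For comparison, a version that is scoped to a single tenant could look like the following sketch. The tenant value "my-team" is an assumption borrowed from the second example; use whatever value identifies your team.

```yaml
# Same alert expression, restricted to one tenant (value is an assumption).
expr: |
    sum(rate(http_requests_total{tenant_id="my-team", route="/health", status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{tenant_id="my-team", route="/health"}[5m]))
    > 0.1
```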


Example 2:

Matrix selectors and subqueries return a range vector, so if you want to execute a function that takes a range vector like “sum_over_time” you might well write:

expr: |
    sum_over_time(my_metric{tenant_id="my-team"}[5m:]) > 10

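By using [5m:] instead of [5m], you used a subquery instead of a matrix selector. The two constructs have different semantics: a matrix selector reads the raw samples in the range, while a subquery re-evaluates the inner expression at a fixed step. This can lead to surprising issues down the road, mainly because of lookback and possibly different collection intervals.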
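The matrix selector form of the same expression differs only by the missing colon, which is exactly why it is easy to overlook in review:

```yaml
# Matrix selector: sum_over_time() operates on the raw samples of the last 5 minutes.
expr: |
    sum_over_time(my_metric{tenant_id="my-team"}[5m]) > 10
```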

Catching problems like these manually in review is too fragile. Ideally, we want a tool that catches them automatically and is configurable enough to encode any in-house rules we might have around our metrics.

Enter Semgrep!


Semgrep to the rescue!


Semgrep is a static analysis tool that lets you write patterns that resemble the syntax of the language being analyzed (and far more; please read the documentation for details). And it recently got PromQL support!

Installing the Semgrep CLI is as simple as:

pip install semgrep


Now, having our PromQL examples from above in mind, consider the following rules:

rules:
- id: extract-alerting-rule-expression-to-promql
  mode: extract
  languages: [yaml]
  pattern: |
    expr: $QUERY
  extract: $QUERY
  dest-language: promql

rules:
- id: selectors-should-have-tenant-id
  languages: [promql]
  severity: ERROR
  patterns:
  - pattern-either:
    - pattern: |
        {...}
    - pattern: |
        {...}[$D]
  - pattern-not: |
      {..., tenant_id="...", ...}
  - pattern-not: |
      {..., tenant_id="...", ...}[$D]
  message: Selector expressions should contain an equality match on the tenant_id label.

rules:
- id: use-matrix-selector-instead-of-subquery
  languages: [promql]
  severity: ERROR
  patterns:
  - pattern-either:
    - pattern: |
        {...}[$R:]
    - pattern: |
        {...}[$R:$S]
  message: Matrix selectors should be preferred over subqueries.


We have defined three rules. The first extracts all PromQL expressions by looking for them under the expr key of a YAML file and feeds them into rules that target the PromQL language. The remaining two rules define the patterns that we want to catch.
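To make the extraction step concrete, here is a sketch of a Prometheus rule file that these rules could be run against. The group and alert names are made up for illustration; only the expressions come from the two examples above. The extract rule would hand the value of each expr key to the PromQL rules.

```yaml
# Hypothetical Prometheus rule file; group and alert names are made up.
groups:
- name: example-alerts
  rules:
  - alert: HealthEndpointErrorRatio
    # Both selectors lack a tenant_id matcher.
    expr: |
      sum(rate(http_requests_total{route="/health", status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{route="/health"}[5m]))
      > 0.1
  - alert: MyMetricTooHigh
    # Uses a subquery ([5m:]) where a matrix selector ([5m]) was intended.
    expr: |
      sum_over_time(my_metric{tenant_id="my-team"}[5m:]) > 10
```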


Let's go over some patterns we used here:

  • {...} matches any vector selector

  • {..., tenant_id="...", ...} matches vector selectors with an equality matcher on the label tenant_id

  • {..., tenant_id="...", ...}[$D] matches any matrix selector with an equality matcher on the label tenant_id

  • {...}[$R:$S] and {...}[$R:] together match a vector selector in a subquery


We can now use Semgrep together with those rules to find our offending expressions:

semgrep scan -c rule.yaml target.yaml

┌─────────────┐
│ Scan Status │
└─────────────┘
  Scanning 1 file tracked by git with 3 Code rules:
  Scanning 1 file.
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

┌─────────────────┐
│ 3 Code Findings │
└─────────────────┘

target.yaml
    selectors-should-have-tenant-id
       Selector expressions should contain an equality match on the tenant_id label.
         3┆ sum(rate(http_requests_total{route="/health", status=~"5.."}[5m]))
         ⋮┆----------------------------------------
         5┆ sum(rate(http_requests_total{route="/health"}[5m]))
         ⋮┆----------------------------------------

    use-matrix-selector-instead-of-subquery
       Matrix selectors should be preferred over subqueries.
         8┆ sum_over_time(my_metric{tenant_id="my-team"}[5m:]) > 10

┌──────────────┐
│ Scan Summary │
└──────────────┘
Some files were skipped or only partially analyzed.
  Partially scanned: 1 files only partially analyzed due to parsing or internal Semgrep errors

Ran 3 rules on 1 file: 3 findings.


With these rules running in CI, we can rest assured that this class of problem is solved.

The grammar is already pretty expressive. Here are some things we did not use in these examples of PromQL and Semgrep synergy, but that are possible:

  • $F(...) will match any function call and capture the function name into $F

  • (...)[$R:$S] will match any subquery and capture its range and step into $R and $S

  • sum without (..., foo, ...) (...) will match all sums that sum at least over the foo label

  • all of this can be mixed, matched, and combined with Semgrep's rule syntax

  • we can compare captured ranges using parse_promql_duration in metavariable comparisons (if you want to ban overly long ranges, for example)
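As a rough sketch of the last point, a rule that flags long ranges might look like the following. The rule id, the one-hour threshold, and the message are hypothetical. The exact way parse_promql_duration is called here (wrapping the metavariable in str() and comparing against a second parsed duration) is an assumption and has not been verified, so please check the Semgrep documentation before relying on it.

```yaml
rules:
- id: matrix-selector-range-too-long   # hypothetical rule id
  languages: [promql]
  severity: WARNING
  patterns:
  - pattern: |
      {...}[$R]
  - metavariable-comparison:
      metavariable: $R
      # Assumption: parse_promql_duration accepts a duration string and
      # returns a comparable number; the call shape is not verified.
      comparison: parse_promql_duration(str($R)) > parse_promql_duration("1h")
  message: Ranges longer than 1h are discouraged in matrix selectors.
```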

PromQL support in Semgrep is still experimental; feel free to help us out either by reviewing or providing feedback in the Semgrep Community Slack.

I also want to thank the Semgrep community, which has been very welcoming and helpful. Adding PromQL support was a very enjoyable experience all around!
