Michael Hoffmann is a Site Reliability Engineer at Aiven, working on all things logs and metrics.
Motivation
Prometheus is a popular open-source monitoring system backed by the CNCF. It has a vibrant community, a rich ecosystem of exporters and libraries for instrumenting applications in almost any language, and built-in support for alerting.
At Aiven, we use Thanos as a distributed Prometheus setup with long-term storage capabilities, which has served us well.
Prometheus comes with its own well-documented and feature-rich query language called PromQL. It’s well adopted, and the internet contains plenty of PromQL examples, guides, and query builders.
But writing good expressions is surprisingly hard in my experience! It usually requires a fair bit of domain knowledge about the metrics and how they are collected. Some basic knowledge of the query engine is also somewhat required - just consider the famous “rate then sum, never sum then rate” rule.
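To make that rule concrete, here is a small sketch using the http_requests_total counter that also appears in the examples further down; the keys preferred and discouraged are only labels for this illustration, not part of any real rule file. Because rate() needs a range vector, summing first forces a subquery, and a counter reset in a single series then shows up as a drop in the sum, which rate() misreads as a reset of the whole aggregate.

```yaml
# Preferred: compute the per-series rate first, then aggregate.
preferred: |
  sum(rate(http_requests_total[5m]))
# Discouraged: aggregate the raw counters first, then take the rate.
# A reset of one series looks like a reset of the entire sum.
discouraged: |
  rate(sum(http_requests_total)[5m:])
```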
For the sake of the article, let's go over two examples (only looking at the “expr” field of the alerting rules for brevity).
Example 1:
Let's say you collect metrics for multiple teams, disambiguated by a tenant_id label for each team. Now consider the following expression:
expr: |
  sum(rate(http_requests_total{route="/health", status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{route="/health"}[5m]))
  > 0.1
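By forgetting to specify the tenant_id label matcher, we changed the meaning of the alert in an unexpected way: the error ratio is now computed over the requests of all tenants combined, so one team's traffic can trigger or mask an alert that was meant for another team.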
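One possible fix, assuming the alert is meant for a single team, is to add the matcher to both selectors. The value my-team is a placeholder borrowed from the second example, not a real tenant.

```yaml
expr: |
  # Both selectors are restricted to the same (placeholder) tenant.
  sum(rate(http_requests_total{tenant_id="my-team", route="/health", status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{tenant_id="my-team", route="/health"}[5m]))
  > 0.1
```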
Example 2:
Matrix selectors and subqueries return a range vector, so if you want to execute a function that takes a range vector like “sum_over_time” you might well write:
expr: |
  sum_over_time(my_metric{tenant_id="my-team"}[5m:]) > 10
By using [5m:] instead of [5m], you used a subquery instead of a matrix selector. The two constructs have different semantics, which can lead to surprising issues down the road, mainly because of lookback and possibly different collection intervals.
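For comparison, the matrix selector form of the same expression only drops the trailing colon, so the two are easy to confuse in review:

```yaml
expr: |
  # [5m] selects the raw samples of the last five minutes (matrix selector).
  sum_over_time(my_metric{tenant_id="my-team"}[5m]) > 10
```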
Catching problems like these manually in review is too fragile. Ideally, we want a tool that catches them automatically and is configurable enough to enforce any in-house rules we might have around our metrics.
Enter Semgrep!
Semgrep to the rescue!
Semgrep is a static analysis tool that lets you write patterns resembling the syntax of the language being analyzed (and far more; please read the documentation for details). And it recently got PromQL support!
Installing the Semgrep CLI is as simple as
pip install semgrep
Now, having our PromQL examples from above in mind, consider the following rules:
rules:
  # Extract the value of every expr key and hand it to the PromQL rules.
  - id: extract-alerting-rule-expression-to-promql
    mode: extract
    languages: [yaml]
    pattern: |
      expr: $QUERY
    extract: $QUERY
    dest-language: promql

  # Flag vector and matrix selectors without a tenant_id equality matcher.
  - id: selectors-should-have-tenant-id
    languages: [promql]
    severity: ERROR
    patterns:
      - pattern-either:
          - pattern: |
              {...}
          - pattern: |
              {...}[$D]
      - pattern-not: |
          {..., tenant_id="...", ...}
      - pattern-not: |
          {..., tenant_id="...", ...}[$D]
    message: Selector expressions should contain an equality match on the tenant_id label.

  # Flag subqueries applied directly to a vector selector.
  - id: use-matrix-selector-instead-of-subquery
    languages: [promql]
    severity: ERROR
    patterns:
      - pattern-either:
          - pattern: |
              {...}[$R:]
          - pattern: |
              {...}[$R:$S]
    message: Matrix selectors should be preferred over subqueries.
We have defined three rules. The first extracts all PromQL expressions by looking for them in the expr key of a YAML file and feeds them into rules that target the PromQL language. The remaining two rules define the patterns that we want to catch.
Let's go over some patterns we used here:
{...} matches any vector selector
{..., tenant_id="...", ...} matches vector selectors with the label tenant_id
{..., tenant_id="...", ...}[$D] matches any matrix selector with the label tenant_id
{...}[$R:$S] and {...}[$R:] together match a vector selector in a subquery
We can now use Semgrep together with those rules to find our offending expressions:
semgrep scan -c rule.yaml target.yaml
┌─────────────┐
│ Scan Status │
└─────────────┘
Scanning 1 file tracked by git with 3 Code rules:
Scanning 1 file.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
┌─────────────────┐
│ 3 Code Findings │
└─────────────────┘
target.yaml
selectors-should-have-tenant-id
Selector expressions should contain an equality match on the tenant_id label.
3┆ sum(rate(http_requests_total{route="/health", status=~"5.."}[5m]))
⋮┆----------------------------------------
5┆ sum(rate(http_requests_total{route="/health"}[5m]))
⋮┆----------------------------------------
use-matrix-selector-instead-of-subquery
Matrix selectors should be preferred over subqueries.
8┆ sum_over_time(my_metric{tenant_id="my-team"}[5m:]) > 10
┌──────────────┐
│ Scan Summary │
└──────────────┘
Some files were skipped or only partially analyzed.
Partially scanned: 1 files only partially analyzed due to parsing or internal Semgrep errors
Ran 3 rules on 1 file: 3 findings.
With these rules running in CI, we can rest assured that this class of problems is caught before it reaches production.
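As a rough sketch of what such a CI job could look like, here is a GitHub Actions workflow. The choice of GitHub Actions, the workflow and job names, and the Python version are all assumptions for illustration; the file names rule.yaml and target.yaml are taken from the scan command above. The --error flag makes Semgrep exit with a non-zero status when there are findings, which fails the job.

```yaml
# Hypothetical workflow file, e.g. .github/workflows/promql-lint.yaml
name: promql-lint
on: [pull_request]
jobs:
  semgrep:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install semgrep
      # Same invocation as above, plus --error to fail on findings.
      - run: semgrep scan --error -c rule.yaml target.yaml
```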
The grammar is already pretty expressive. Here are some things we did not use in these examples of PromQL and Semgrep synergy, but that are possible:
$F(...) will match any function call and capture the function name into $F
(...)[$R:$S] will match any subquery and capture its range and step into $R and $S
sum without (..., foo, ...) (...) will match all sums that sum at least over the foo label
all of this can be mixed and matched and combined with Semgrep's rule syntax
we can compare captured ranges using parse_promql_duration in metavariable comparisons (if you want to ban too long ranges, for example)
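To illustrate the last point, a rule banning overly long ranges might look roughly like the sketch below. The rule id, the one-day threshold, and the message are made up for this example, and the exact way parse_promql_duration is called inside the comparison is an assumption, so check the Semgrep documentation before relying on it.

```yaml
rules:
  - id: avoid-very-long-ranges   # hypothetical rule id
    languages: [promql]
    severity: WARNING
    patterns:
      # Capture the range of any matrix selector into $R.
      - pattern: |
          {...}[$R]
      - metavariable-comparison:
          metavariable: $R
          # Assumption: parse_promql_duration accepts both a metavariable
          # and a duration string and returns comparable numbers.
          comparison: parse_promql_duration($R) > parse_promql_duration("1d")
    message: Ranges longer than one day should be avoided.
```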
PromQL support in Semgrep is still experimental; feel free to help us out either by reviewing or providing feedback in the Semgrep Community Slack.
I also want to thank the Semgrep community, which has been very welcoming and helpful. Adding PromQL support was a very enjoyable experience all around!