Improving ReDoS detection and finding more bugs using Dlint and r2c

Improving regular expression denial-of-service detection

Matt Schwager
April 13th, 2020
Share

get this check now in Bento: $ pip3 install bento-cli && bento init

Before getting started, if you haven’t seen the first post in this series I highly recommend checking it out. It will give you a good understanding of the ReDoS bug class and we will build on its concepts here. See Bento check: Catch catastrophic backtracking ReDoS bugs.

This post will highlight how we can not only use the r2c distributed analysis platform to analyze code and triage findings, but also to improve our tooling. As the creator of Dlint, that seems like the most appropriate tool to take a look at. From there we will break down another important ReDoS finding (CVE-2020-6817) and wrap up with next steps. Note Dlint has recently been added to Bento’s list of supported tools.

Improving Dlint

Dlint is a tool for encouraging best coding practices and helping ensure Python code is secure. Previously we took a look at how we can use the r2c platform to run custom Dlint static analysis jobs. When performing this type of analysis it’s natural to find false positives. That is, findings the tool detected, but it wasn’t supposed to. After triaging a job's findings the platform provides a convenient way to see the results in aggregate:

7/123 true positives isn’t great, but hey, it’s a start. ReDoS detection is difficult for a variety of reasons and it’s tough to anticipate all false positives, so we might as well run the tool and look for deficiencies. This is where the scale of the r2c platform really shines. Running a tool against thousands of repositories will quickly tell you if your rules are working as intended. In this situation we decided to tackle one class of false positives: subexpressions that aren't "backtrackable."

Backtracking occurs when a regular expression engine encounters a character it can’t match and begins to move backwards. Backtracking is a necessary ingredient for catastrophic backtracking and ReDoS. It’s the reason why one of these two very similar expressions causes ReDoS and the other doesn't:

>>> import re
>>> re.search('(a+)+b', 'a' * 64 + 'c')
...Spins...
>>> import re
>>> re.search('(a+)+', 'a' * 64 + 'c')
<_sre.SRE_Match object; span=(0, 64), match='aaa...'>

When b in the first expression fails to match, the engine begins to backtrack. In this case it’s catastrophic because the engine exponentially alternates in and out of the nested quantifier. The issue here is Dlint was detecting both of these expressions as catastrophic. To remedy this, a change was made in the ReDoS detection algorithm to ensure subexpressions are backtrackable before reporting a finding. This was accomplished by ensuring each catastrophic subexpression has a non-optional subexpression following it. Or, per above, something like b comes after something like (a+)+. For more information on the fix please see the above change.

After a change like this we'd like to confirm the fix and get hard numbers on what’s changed. The r2c platform provides exactly this:

Here we fixed 28 / 123 = 22.8% of our false positives. Not too bad! Subsequent platform jobs automatically move findings from false positive to true negative if it detects the tooling has been fixed!

There’s still plenty of work to be done to reduce false positives. If you’d like to help out, take a look at this issue.

CVE-2020-6817

The r2c platform is continuously evaluating large input sets (groups of code repositories). This allows us to constantly be on the hunt for new bugs and security vulnerabilities. Embracing this mentality is crucial in modern DevSecOps practices. With that, we can always be improving our tooling, running analysis jobs, and finding new issues.

After running a Dlint job against a new input set there was one finding that piqued my interest:

To confirm the finding we can pull down the repository and run Dlint manually:

$ python -m flake8 --select=DUO138 bleach/
bleach/sanitizer.py:595:20: DUO138 catastrophic "re" usage - denial-of-service possible
bleach/sanitizer.py:604:16: DUO138 catastrophic "re" usage - denial-of-service possible

Manual inspection reveals that the finding on line 604 is a false positive and the finding on line 595 is a true positive. Next we have to confirm that the vulnerability is exploitable. First, let's understand what the bleach library actually does.

Bleach is a fantastic library developed by Mozilla that performs HTML sanitization. Or, as they put it:

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes.

The confirmed finding is in the bleach.BleachSanitizerFilter.sanitize_css function. This means we'll likely be using CSS to reach the code path. The main interface to this code path is the bleach.clean function. HTML elements with style attributes eventually make their way to the sanitize_css function. Let's take a closer look at the ReDoS bug:

def sanitize_css(self, style):

    ...

    gauntlet = re.compile(
        r"""^([-/:,#%.'"\s!\w]|\w-\w|'[\s\w]+'\s*|"[\s\w]+"|\([\d,%\.\s]+\))*$""",
        flags=re.U
    )

    for part in parts:
        if not gauntlet.match(part):
            return ''

Let’s break down the regular expression further. We’ll be considering alternation statements with overlapping character space. To do so let’s look at the alternation statements individually:

        ..1..      ..2..     ..3..      ..4..        ..5..
^([-/:,#%.'"\s!\w]|\w-\w|'[\s\w]+'\s*|"[\s\w]+"|\([\d,%\.\s]+\))*$

The issue here is that alternation statements 2, 3, and 4 have complete character overlap with statement 1. Since the alternation subexpression is also nested inside a quantifier we can cause catastrophic backtracking. Let’s consider statements 1 and 2: [-/:,#%.'"\s!\w] and \w-\w. It's clear that statement 2 completely overlaps with statement 1. \w and - are in both statements. This means our ReDoS payload can be something like:

'b-b' * 64

Remember, \w means word characters. But first we must also cause the engine to backtrack. Remember the Dlint improvements above? We can accomplish this via the $ at the very end of the expression. $ matches the end of the string, so we need something at the end of our payload that doesn't match the end of the string. This causes the engine to backtrack. This is a little tricky because that character also must not match anything in any of the alternation statements. Matching an alternation statement means the character will be consumed and there will be no backtracking. Looking through all the alternation statements reveals some symbols like $, ^, and * are not included in any statement. So our full payload is:

'b-b' * 64 + '^'

Remember the vulnerability is in the sanitize_css function. We must pass our payload into the CSS style sanitization functionality. Looking through the code reveals that the following will work:

>>> payload = "'" + 'b-b' * 64 + '^' + "'"
>>> bleach.clean('<a style=' + payload + '></a>', attributes={'a': ['style']})
...Spins...

Note the attributes={'a': ['style']}. style attributes are not parsed by default, so we must enable this functionality first. This is a mitigating factor that means the functionality is not vulnerable by default. Also note that a is not special here. Any HTML element included in bleach.sanitizer.ALLOWED_TAGS or passed in via the tags keyword argument that also enables the style attribute will be vulnerable.

So what does this actually mean? This means we can cause a DoS on calls to the bleach.clean function if style attributes are enabled for the HTML element. Since bleach is often used to sanitize untrusted HTML input this is not good. Fortunately the Mozilla security team was exemplary to work with. They were always quick to respond and had no problem understanding and fixing the bug. For more information see CVE-2020-6817 and the mozilla/bleach security advisory.

Next Steps

Since the last post, Dlint and r2c’s distributed analysis platform also helped find the following ReDoS bugs:

References

About

Semgrep lets security teams partner with developers and shift left organically, without introducing friction. Semgrep gives security teams confidence that they are only surfacing true, actionable issues to developers, and makes it easy for developers to fix these issues in their existing environments.