Finding More Zero Days Through Variant Analysis

Guest blog post from Eugene Lim adapted from his book "From Day Zero to Zero Day: A Hands-On Guide to Vulnerability Research". The book is available now for purchase from No Starch Press and other retailers.

July 10th, 2025

Vulnerability research has increased in difficulty over time, as developers implement system-level mitigations and write more secure code. New vulnerabilities are regularly discovered, but it’s a far cry from the Wild West of the past. Accordingly, you’ll now have to invest significantly more time and expertise to discover impactful vulnerabilities in the most popular software. For example, large open source applications like LibreOffice can easily contain millions of lines of code.

Automated source code analysis tools like Semgrep can cut down the time needed to analyze that code, but you still have to triage all the results and understand the context of each finding. For example, an unsafe memcpy in one file may have been mitigated earlier on by a size check elsewhere. You can tweak the rules to reduce false positives by narrowing down the search criteria, but that risks increasing the number of false negatives as well, causing you to miss actual vulnerabilities.

Fortunately, many other researchers have walked the same path as you. While they may not publish their research, breaking down every single detail about the vulnerabilities they’ve discovered, open source software has two key pieces of evidence you can access: the patched code diff and the public vulnerability advisory, typically published as a Common Vulnerabilities and Exposures (CVE) record. By analyzing these sources of information, you may be able to parlay a previous vulnerability into multiple new ones.

Single-Repository Variant Analysis

Vulnerabilities don’t often exist in isolation. If a developer made a mistake in their code that caused a vulnerability, they likely made that mistake elsewhere in the codebase, too. Additionally, vulnerability researchers may not be interested in enumerating all possible variants of a vulnerability; they may instead explore one particular exploit path and be content with finding something. Finally, in their rush to patch one bug, developers may fail to perform deeper root cause analysis of why that vulnerability occurred, and then fail to build secure guardrails to prevent future occurrences. These factors can give rise to a surprisingly rich source of vulnerabilities and facilitate a less resource-intensive approach to vulnerability research.

Avenues to pursue include:

  • Variants: A particular code pattern that caused a vulnerability exists elsewhere in the code, creating more vulnerabilities.

  • Insufficient patches: A patch for a vulnerability does not adequately resolve the root cause, leaving various bypasses available for the vulnerability to still be exploited.

  • Regression: A vulnerability is patched in the code but, due to lack of regression testing or secure guardrails, is revived when future changes in the code weaken or remove the patch.

Thanks to the previously mentioned vulnerability advisory and patch code diff, you know exactly how and why the original vulnerability occurs. With some root cause analysis, you can quickly pivot to scanning the code for similar vulnerable patterns. After that, you can triage the results based on whether they repeat the original vulnerability, rather than starting afresh in your analysis each time. The rules you write can also be far more specific, targeting patterns that would not make sense in a general ruleset.

Integer Overflow in Expat

You can try out this method with a collection of integer overflow vulnerability variants in Expat, a C library for parsing XML files. Given the ubiquity of XML files, Expat has applications in countless other software, including Firefox and Python (see Software Using Expat).

As such, a vulnerability in Expat has significant downstream impact, especially since you can use the library in ways the original developers may not have expected. If you look at the CVEs for Expat, you’ll notice that it has suffered from multiple integer overflows, including CVE-2022-22822 through CVE-2022-22827. If you browse to the individual pages for any of those vulnerabilities, you’ll see a link under the “References” section to the merged commit on GitHub that patched the vulnerability. For the shared patch for CVE-2022-22822 through CVE-2022-22827 titled “[CVE-2022-22822 to CVE-2022-22827] lib: Prevent more integer overflows”, the pull request comment notes that the patch is related to pull requests 534 and 538. In turn, those pull requests patch earlier integer overflows in CVE-2021-46143 and CVE-2021-45960.

Root Cause Analysis of Vulnerabilities

To practice single-repository variant analysis, try to rediscover the variants CVE-2022-22822 through CVE-2022-22827 by writing a code analysis rule based on CVE-2021-46143. The first step in writing a rule is performing root cause analysis to understand how the vulnerability occurred and determine which patterns to target.

Take a look at the patch for CVE-2021-46143. The pull request is titled “[CVE-2021-46143] lib: Prevent integer overflow on m_groupSize in function doProlog.” The “Files changed” section lists only two updated files. The changelog adds the following lines:


#532 #538  CVE-2021-46143 (ZDI-CAN-16157) --

Fix integer overflow on variable m_groupSize in function doProlog leading to realloc acting as free.

Impact is denial of service or more.


This helpfully informs you that the integer overflow in CVE-2021-46143 leads to “realloc acting as free”. The realloc standard library function takes two arguments, void *ptr and size_t size. As noted on its manual page, the function tries to change the size of the allocated memory that ptr points at to size, but if size is zero, it frees the memory instead.

You can glean further information in the diff for the other updated file, expat/lib/xmlparse.c:

@@ -5019,6 +5046,11 @@ doProlog
       if (parser->m_prologState.level >= parser->m_groupSize) {
           if (parser->m_groupSize) {
               {
+                /* Detect and prevent integer overflow */
+            ❶ if (parser->m_groupSize > (unsigned int)(-1) / 2u) {
+                    return XML_ERROR_NO_MEMORY;
+                }
+
               char *const new_connector = (char *)REALLOC(
                   parser, parser->m_groupConnector, parser->m_groupSize *= 2);
               if (new_connector == NULL) {
@@ -5029,6 +5061,16 @@ doProlog
           }
 
           if (dtd->scaffIndex) {
+              /* Detect and prevent integer overflow.
+               * The preprocessor guard addresses the "always false" warning
+               * from -Wtype-limits on platforms where
+               * sizeof(unsigned int) < sizeof(size_t), e.g. on x86_64. */
+#if UINT_MAX >= SIZE_MAX
+      ❷ if (parser->m_groupSize > (size_t)(-1) / sizeof(int)) {
+              return XML_ERROR_NO_MEMORY;
+          }
+#endif
+
           int *const new_scaff_index = (int *)REALLOC(
               parser, dtd->scaffIndex, parser->m_groupSize * sizeof(int));
           if (new_scaff_index == NULL)

This tells you exactly where the patch occurs and, more importantly, what it patches. In this case, it adds two comparison checks on parser->m_groupSize to ensure that it’s no larger than (unsigned int)(-1) / 2u ❶ or (size_t)(-1) / sizeof(int) ❷. Each bound is the maximum value of the type divided by the factor (2 or sizeof(int)) that parser->m_groupSize is multiplied by before the result is passed as the size argument to the REALLOC macro.
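
Both guards follow the same idea: compare the operand against the type's maximum divided by the multiplier, and only then multiply. Here's a minimal sketch of that idea as a standalone helper. The name checked_mul_size is hypothetical and isn't part of Expat; it only illustrates the shape of the check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Expat): stores count * elem_size in *out
 * and returns 1, or returns 0 and leaves *out untouched if the product
 * would not fit in size_t. */
int checked_mul_size(size_t count, size_t elem_size, size_t *out) {
    /* Same shape as the patch: compare against the maximum divided by the
     * multiplier instead of multiplying first and checking afterwards. */
    if (elem_size != 0 && count > SIZE_MAX / elem_size) {
        return 0;
    }
    *out = count * elem_size;
    return 1;
}
```

Dividing the maximum rather than multiplying the operand matters because the check itself must not overflow. Once the multiplication has wrapped, the result alone no longer tells you that anything went wrong.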

Take a moment to analyze the REALLOC macro. In the C programming language, macros are named fragments of code. To find the definition of the REALLOC macro, search for #define REALLOC in the code:

#define REALLOC(parser, p, s) (parser->m_mem.realloc_fcn((p), (s)))

When compiling the code, the C preprocessor expands all occurrences of REALLOC and their arguments to (parser->m_mem.realloc_fcn((p), (s))). However, this doesn’t confirm whether m_mem.realloc_fcn is equivalent to the realloc standard library function. If you search for realloc_fcn in the code, you’ll find the following:

parserCreate(const XML_Char *encodingName,
             const XML_Memory_Handling_Suite *memsuite, const XML_Char *nameSep,
             DTD *dtd) {
    XML_Parser parser;

    if (memsuite) { ❶
        XML_Memory_Handling_Suite *mtemp;
        parser = (XML_Parser)memsuite->malloc_fcn(sizeof(struct XML_ParserStruct));
        if (parser != NULL) {
            mtemp = (XML_Memory_Handling_Suite *)&(parser->m_mem);
            mtemp->malloc_fcn = memsuite->malloc_fcn;
            mtemp->realloc_fcn = memsuite->realloc_fcn;
            mtemp->free_fcn = memsuite->free_fcn;
        }
    } else {
        XML_Memory_Handling_Suite *mtemp;
        parser = (XML_Parser)malloc(sizeof(struct XML_ParserStruct));
        if (parser != NULL) {
            mtemp = (XML_Memory_Handling_Suite *)&(parser->m_mem);
            mtemp->malloc_fcn = malloc;
            mtemp->realloc_fcn = realloc; ❷
            mtemp->free_fcn = free;
        }
    }

Unless you pass an alternative memory handling suite to parserCreate❶, realloc_fcn is assigned as realloc❷. This may seem like a long detour to confirm your suspicions, but it’s important to be thorough. After all, the REALLOC macro could be a safe wrapper around the realloc function, a common practice by many developers.
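
To see how a macro like this dispatches through a function pointer, consider the following reduced sketch. All of the demo_ names and DEMO_REALLOC are hypothetical stand-ins rather than Expat's actual definitions; the recording function exists only to make the forwarded size observable.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-ins for Expat's memory suite and parser. */
typedef struct {
    void *(*realloc_fcn)(void *ptr, size_t size);
} demo_mem_suite;

typedef struct {
    demo_mem_suite m_mem;
} demo_parser;

/* Same shape as the REALLOC macro above: a thin call through the pointer. */
#define DEMO_REALLOC(parser, p, s) ((parser)->m_mem.realloc_fcn((p), (s)))

/* Records the size it was asked for, then defers to the real realloc. */
size_t last_requested_size;

void *recording_realloc(void *ptr, size_t size) {
    last_requested_size = size;
    return realloc(ptr, size);
}

/* Uses the caller's suite if one is given; otherwise falls back to realloc. */
void demo_parser_init(demo_parser *parser, const demo_mem_suite *suite) {
    parser->m_mem.realloc_fcn = suite ? suite->realloc_fcn : realloc;
}
```

In this sketch, the macro adds no validation of its own. Whatever size expression the caller writes reaches the underlying allocator unchanged, which is the property you just confirmed for Expat.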

Returning to the patch for CVE-2021-46143, you may wonder how the comparison checks prevent an integer overflow, or what an integer overflow means in this context. As a quick experiment, compile and run the following C code:

#include <stdio.h>

int main() {
    printf("SIZE_MAX: %zu\n", ((size_t)(-1)));
    printf("no overflow: %zu\n", ((size_t)(-1) / sizeof(int)) * sizeof(int));
    printf("overflow: %zu\n", ((size_t)(-1) / sizeof(int) + 1) * sizeof(int));
    return 0;
}

You should get the following output:

SIZE_MAX: 18446744073709551615
no overflow: 18446744073709551612
overflow: 0

There’s a maximum number that unsigned integer types can represent, which in binary is 11111..., up to the number of bits for that type. Since unsigned integers can’t be negative, casting -1 to an unsigned integer type wraps around to the maximum for that type, which has the same all-ones binary representation as -1 in two’s complement.

In binary arithmetic, multiplying by two is represented by “shifting left” by 1 bit, and dividing by two (rounding down) is the converse. For example, multiplying 7 (111 in binary) by two results in 1110, which corresponds to 14. If the operation exceeds the number of bits for the type in question, it truncates the most significant bits. As such, the unsigned integer overflow here occurs when the multiplication ends up with 1000000..., which it truncates to 000000..., representing 0. Integer overflows are a common vulnerability class that can lead to all sorts of unexpected behavior if the wrapped value is then used by other functions; in the case of Expat, it can lead to freeing memory instead of reallocating it.
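
The doubling case from the first guard can be reproduced in isolation. This is a small sketch with hypothetical function names; it relies only on the fact that unsigned arithmetic in C wraps around when the result doesn't fit.

```c
#include <limits.h>

/* Doubles n with ordinary unsigned arithmetic; the result wraps around
 * (modulo UINT_MAX + 1) when the true product does not fit. */
unsigned int double_wrapping(unsigned int n) {
    return n * 2u;
}

/* The same operation expressed as a 1-bit left shift. */
unsigned int shift_left_one(unsigned int n) {
    return n << 1;
}

/* The smallest value whose doubling wraps to 0: only the top bit is set. */
unsigned int top_bit_only(void) {
    return UINT_MAX / 2u + 1u;
}
```

Passing 7 to either function yields 14, while passing top_bit_only() yields 0. Any m_groupSize above UINT_MAX / 2 therefore produces a truncated product, which is the condition the first guard in the patch rejects.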

To complete the root cause analysis, you must understand how to reach this vulnerable code path, or sink. Fortunately, the pull request comment also links to the corresponding issue, titled “[CVE-2021-46143] Crafted XML file can cause integer overflow on m_groupSize in function doProlog”. The issue notes that an anonymous white hat researcher reported the vulnerability via the Zero Day Initiative (ZDI), which facilitates zero-day vulnerability disclosures and provides financial rewards. Additionally, it states that “the issue is an integer overflow (in multiplication) near a call to realloc that takes a 2 GiB size craft XML file, and then will cause denial of service or more.” Finally, the issue comment includes a snippet of the vulnerability disclosure’s analysis section.

This is an integer overflow vulnerability that exists in expat library. The vulnerable function is doProlog:


doProlog(XML_Parser parser, const ENCODING *enc, const char *s, const char *end,
         int tok, const char *next, const char **nextPtr, XML_Bool haveMore,
         XML_Bool allowClosingDoctype, enum XML_Account account) {
         
#ifdef XML_DTD
    static const XML_Char externalSubsetName[] = {ASCII_HASH, '\0'};
#endif /* XML_DTD */
    static const XML_Char atypeCDATA[]
    --snip--
        case XML_ROLE_GROUP_OPEN:
            if (parser->m_prologState.level >= parser->m_groupSize) {
                if (parser->m_groupSize) {
                    {
                        char *const new_connector = (char *)REALLOC(
                            parser, parser->m_groupConnector, parser->m_groupSize *= 
                            2);// (1)
                        if (new_connector == NULL) {
                            parser->m_groupSize /= 2;
                            return XML_ERROR_NO_MEMORY;
                        }
                        parser->m_groupConnector = new_connector;
                    }

This provides you with the final piece of the puzzle: the attack vector, a large crafted XML file. In order for m_groupSize to reach such a large number, it must include enough tokens that match the XML_ROLE_GROUP_OPEN case in the XML file.
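
A quick back-of-the-envelope sketch shows why the input has to be so large. The helpers below are hypothetical, and uint32_t stands in for a 32-bit unsigned int, which is an assumption about the platform. They count how often a size can double before it wraps to 0 and report the last size reached before that happens.

```c
#include <stdint.h>

/* Counts how many times a 32-bit size can be doubled before it wraps to 0.
 * uint32_t stands in for unsigned int here (an assumption). */
unsigned int doublings_until_wrap(uint32_t size) {
    unsigned int count = 0;
    while (size != 0) {
        size *= 2u;
        count++;
    }
    return count;
}

/* The last non-zero size reached before the wrap, or 0 if size starts at 0. */
uint32_t last_size_before_wrap(uint32_t size) {
    uint32_t last = size;
    while (size != 0) {
        last = size;
        size *= 2u;
    }
    return last;
}
```

If you assume a starting group size of 32 (an assumption to check against the Expat source, not something stated above), the size wraps on the 27th doubling, and the last size before the wrap is 2,147,483,648. Since each doubling is triggered only once the nesting level catches up with the current size, that figure is in line with the 2 GiB file mentioned in the issue, provided each level costs at least one byte of input.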

It isn’t necessary to re-create the proof of concept during root cause analysis, but doing so can be helpful in improving your understanding of the vulnerability. Try reproducing CVE-2021-46143 by creating an XML file that would trigger it. Hint: Look at the pull request and related issue for CVE-2021-45960, which include more detail about the proof of concept and a link to a script to create it. You can adapt this for CVE-2021-46143.

Although Expat extensively documents its vulnerability remediation process, more often than not you’ll have only scraps of information from published vulnerability advisories. Depending on the criticality of a bug, developers may choose to obfuscate a vulnerability patch by burying it inside a much larger update or fix it at a higher level in the code. Additionally, vulnerability advisory descriptions may be deliberately unclear to prevent malicious actors from deducing the real vulnerability and exploiting it via n-day attacks on unpatched users. Nevertheless, it’s usually easier to patch diff a known vulnerability and analyze it than to discover a brand-new vulnerability. Root cause analysis of disclosed vulnerabilities is a skill that yields rich rewards for the careful researcher.

Variant Pattern Matching with Semgrep Rules

Now that you understand the root cause of the vulnerability, you can write a pattern to find other variants of it in the code. To recap the key features of CVE-2021-46143:

  1. An integer overflow occurs when multiplying some variable of an unsigned integer type beyond its maximum.

  2. The overflowed integer is passed as the third argument to the REALLOC macro, which leads to an unintended free if the variable overflows to 0.

  3. The variable, which can take the form parser->m_groupSize, is attacker-controlled via the XML file.

Typically, for single-repository variant analysis, you can afford to be more specific with your patterns because the developer’s style often repeats throughout the code. Start with an almost-exact match of the original vulnerable code, then slowly generalize the rule until you begin finding variants. This iterative approach allows you to make sure you aren’t overgeneralizing from the start and keeps your scope small. As such, it’s better to begin with pattern matching rather than a full data flow analysis rule. In this case, focus on the sink of the vulnerability rather than the source-to-sink flow.

For CVE-2021-46143, the sink is the REALLOC macro’s third argument, which the developers patched by adding a comparison check right before the two REALLOC invocations:

char *const new_connector = (char *)REALLOC(
    parser, parser->m_groupConnector, parser->m_groupSize *= 2);
int *const new_scaff_index = (int *)REALLOC(
    parser, dtd->scaffIndex, parser->m_groupSize * sizeof(int));

When drafting Semgrep rules, it’s helpful to use Semgrep Playground due to its support for Semgrep Pro features and convenient user interface for debugging rules. Begin drafting your rule by placing these two REALLOC invocations in the test code section of the Playground. In the rule section, switch to the “advanced” tab and start with a skeleton rule that matches the first invocation exactly:

rules:
  - id: CVE-2021-46143
    pattern: REALLOC(parser, parser->m_groupConnector, parser->m_groupSize *= 2);
    message: Detected variant of CVE-2021-46143.
    languages: [c]
    severity: ERROR

Click Run and confirm that the rule matches the line where the first REALLOC invocation appears. Next, generalize the rule to match both invocations. You might do this by abstracting away the last two arguments with the ellipsis operator, since those are the only differences between the first and second invocations:

    pattern: REALLOC(parser, ...);

While this works, it greatly increases the number of false positives because it fails to differentiate between safe and vulnerable REALLOC invocations.

Recall that the root cause of this vulnerability is an integer overflow in the third argument passed to REALLOC (and consequently realloc), caused by a multiplication (parser->m_groupSize *= 2 and parser->m_groupSize * sizeof(int)). As such, you should match this pattern by using metavariables:

    patterns:
      - pattern-either:
          - pattern: REALLOC(parser, $POINTER, $SIZE * $CONSTANT);
          - pattern: REALLOC(parser, $POINTER, $SIZE *= $CONSTANT);

Notice the proper usage of the patterns, pattern-either, and pattern operators. You cannot nest the two pattern operators directly under patterns because patterns performs a logical AND operation, meaning that the code must match both patterns rather than either of them. To perform a logical OR operation instead, use pattern-either.
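
If you want more test code for the Playground than the two original invocations, a small self-contained file like the following can help. Everything in it is a hypothetical fixture: fixture_parser and its fields aren't Expat's, and REALLOC is redefined locally only so that the file compiles on its own. The comments state which shape each call is meant to exercise; treat them as expectations to confirm in the Playground, not as guaranteed results.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for Expat's parser struct and REALLOC macro. */
typedef struct {
    void *(*realloc_fcn)(void *ptr, size_t size);
    unsigned int size;
} fixture_parser;

#define REALLOC(parser, p, s) ((parser)->realloc_fcn((p), (s)))

/* Shape 1: multiplication inside the size argument ($SIZE * $CONSTANT). */
int *grow_ints(fixture_parser *parser, int *items) {
    return (int *)REALLOC(parser, items, parser->size * sizeof(int));
}

/* Shape 2: compound assignment in the size argument ($SIZE *= $CONSTANT). */
char *grow_bytes(fixture_parser *parser, char *bytes) {
    return (char *)REALLOC(parser, bytes, parser->size *= 2);
}

/* Neither shape: the size is passed through with no arithmetic. */
char *resize_bytes(fixture_parser *parser, char *bytes, size_t size) {
    return (char *)REALLOC(parser, bytes, size);
}
```

Keeping one call that should not match alongside the two that should makes it easier to notice when a later generalization of the rule starts flagging every REALLOC invocation.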

After completing this basic rule, you can now test it on the vulnerable commit of Expat. Save the rule to a file called cve-2021-46143-variant-1.yml, then check out the commit and run the Semgrep rule on it with the following commands:

$ git clone https://github.com/libexpat/libexpat
$ cd libexpat
$ git checkout 0adcb34c
$ semgrep -f ../cve-2021-46143-variant-1.yml .

If all goes well, you should get results like the following, though the exact output depends on your version of Semgrep:


Scanning 18 files.
18/18 tasks 0:00:00

Results                                                                                                         
Findings:

  expat/lib/xmlparse.c 
     CVE-2021-46143
        Detected variant of CVE-2021-46143.

       3271┆ temp = (ATTRIBUTE *)REALLOC(parser, (void *)parser->m_atts,
       3272┆                             parser->m_attsSize * sizeof(ATTRIBUTE));
          ⋮┆----------------------------------------
       3279┆ temp2 = (XML_AttrInfo *)REALLOC(parser, (void *)parser->m_attInfo,
       3280┆                                 parser->m_attsSize * sizeof(XML_AttrInfo));
          ⋮┆----------------------------------------
       5049┆ char *const new_connector = (char *)REALLOC(
       5050┆     parser, parser->m_groupConnector, parser->m_groupSize *= 2);
          ⋮┆----------------------------------------
       5059┆ int *const new_scaff_index = (int *)REALLOC(
       5060┆     parser, dtd->scaffIndex, parser->m_groupSize * sizeof(int));
          ⋮┆----------------------------------------
       6130┆ temp = (DEFAULT_ATTRIBUTE *)REALLOC(parser, type->defaultAtts,
       6131┆                                    (count * sizeof(DEFAULT_ATTRIBUTE)));
          ⋮┆----------------------------------------
       7131┆ temp = (CONTENT_SCAFFOLD *)REALLOC(
       7132┆     parser, dtd->scaffold, dtd->scaffSize * 2 * sizeof(CONTENT_SCAFFOLD));

Scan Summary

Some files were skipped or only partially analyzed.
  Scan was limited to files tracked by git.
  Partially scanned: 1 files only partially analyzed due to a parsing or internal Semgrep error
  Scan skipped: 6 files matching .semgrepignore patterns
  For a full list of skipped files, run semgrep with the --verbose flag.

Ran 1 rule on 18 files: 6 findings.

The rule correctly identifies the original two vulnerabilities as well as four additional potential variants. The variants all use some potentially attacker-controlled value multiplied by the size of a data structure.

Take a closer look at the first variant, which occurs at line 3271 of xmlparse.c:

/* Precondition: all arguments must be non-NULL;
   Purpose:
   - normalize attributes
   - check attributes for well-formedness
   - generate namespace aware attribute names (URI, prefix)
   - build list of attributes for startElementHandler
   - default attributes
   - process namespace declarations (check and report them)
   - generate namespace aware element name (URI, prefix)
*/
static enum XML_Error
storeAtts(XML_Parser parser, const ENCODING *enc, const char *attStr,
          TAG_NAME *tagNamePtr, BINDING **bindingsPtr,
          enum XML_Account account) {
    DTD *const dtd = parser->m_dtd; /* save one level of indirection */
    ELEMENT_TYPE *elementType;
    int nDefaultAtts;
    const XML_Char **appAtts; /* the attribute list for the application */
    int attIndex = 0;
    int prefixLen;
    int i;
    int n;
    XML_Char *uri;
    int nPrefixes = 0;
    BINDING *binding;
    const XML_Char *localPart;

    /* lookup the element type name */
    elementType = (ELEMENT_TYPE *)lookup(parser, &dtd->elementTypes, 
                   tagNamePtr->str, 0);
    if (! elementType) {
        const XML_Char *name = poolCopyString(&dtd->pool, tagNamePtr->str);
        if (! name)
            return XML_ERROR_NO_MEMORY;
        elementType = (ELEMENT_TYPE *)lookup(parser, &dtd->elementTypes,
                       name, sizeof(ELEMENT_TYPE));
        if (! elementType)
            return XML_ERROR_NO_MEMORY;
        if (parser->m_ns && ! setElementTypePrefix(parser, elementType))
            return XML_ERROR_NO_MEMORY;
    }
  ❶ nDefaultAtts = elementType->nDefaultAtts;

    /* get the attributes from the tokenizer */
  ❷ n = XmlGetAttributes(enc, attStr, parser->m_attsSize, parser->m_atts);
    if (n + nDefaultAtts > parser->m_attsSize) {
        int oldAttsSize = parser->m_attsSize;
        ATTRIBUTE *temp;
#ifdef XML_ATTR_INFO
        XML_AttrInfo *temp2;
#endif
      ❸ parser->m_attsSize = n + nDefaultAtts + INIT_ATTS_SIZE;
        temp = (ATTRIBUTE *)REALLOC(parser, (void *)parser->m_atts,
                                  ❹ parser->m_attsSize * sizeof(ATTRIBUTE));

With experience, you’ll build intuition about what a particular snippet does without having to enumerate everything, which will come in handy as you encounter more complex source code. Expat is considered a fairly straightforward codebase, with most of the logic contained in a single file. Even with imperfect information, you can pick up a few clues regarding whether the code is vulnerable. First, the storeAtts function, in which the potential variant occurs, is commented with details about what it does. In short, it appears to handle parsing XML attributes, which would indeed be attacker-controlled if the library was handling untrusted XML documents. More specifically, you’ll be interested in parser->m_attsSize rather than sizeof(ATTRIBUTE), because while both are used in the third argument to REALLOC (the sink) ❹, the former is potentially attacker-controlled, while the latter is a fixed value.

Going back a few lines, note that parser->m_attsSize is set to the sum of several variables ❸. You can ignore INIT_ATTS_SIZE, which is a constant. Meanwhile, nDefaultAtts is set to another value ❶, and you can make the reasonable guess based on the variable names that this value is equal to the number of default attributes for the type of element being parsed. This appears to be less likely to be attacker-controllable, as it relies on fixed defaults, but you can file it away for further investigation. Finally, n is set to the return value of a function ❷ that, according to the comment, gets the attributes from the tokenizer. If you look up XmlGetAttributes, you’ll find that it’s actually a macro defined in expat/lib/xmltok.h:

#define XmlGetAttributes(enc, ptr, attsMax, atts) \
    (((enc)->getAtts)(enc, ptr, attsMax, atts))

The macro essentially calls the getAtts member function of the enc struct instance on the same arguments. Searching for getAtts provides the actual implementation of the function in expat/lib/xmltok_impl.c. While you can fully analyze the code yourself, the comment above the function definition is sufficient to tell you what it does:

/* This must only be called for a well-formed start-tag or empty
   element tag. Returns the number of attributes. Pointers to the
   first attsMax attributes are stored in atts.
*/

Fortunately, this suggests that n is indeed an attacker-controllable value, since it is the number of attributes in the XML element that’s being parsed. Although attsMax initially caused some concern because it could potentially limit the number of attributes returned, the comment tells you that it limits only the number of attributes stored in atts. You can confirm this by observing that the function increments the return value nAtts regardless of whether it has exceeded attsMax:

case BT_QUOT:
    if (state != inValue) {
        if (nAtts < attsMax)
            atts[nAtts].valuePtr = ptr + MINBPC(enc);
        state = inValue;
        open = BT_QUOT;
    } else if (open == BT_QUOT) {
        state = other;
        if (nAtts < attsMax)
            atts[nAtts].valueEnd = ptr;
        nAtts++;
    }
    break;

For example, although the code checks whether nAtts < attsMax, nAtts++; falls outside the if statement’s body and executes regardless of the result of the check. In C, an if statement without braces governs only the single statement that immediately follows it.
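
The same store-capped, count-uncapped behavior can be reproduced in a few lines. The function below is a hypothetical analogue rather than Expat code: it copies at most dst_max values but returns how many values the input contained.

```c
#include <stddef.h>

/* Hypothetical analogue of the behavior described above: copies at most
 * dst_max values into dst but returns how many values src contained. */
size_t copy_capped(const int *src, size_t src_len, int *dst, size_t dst_max) {
    size_t n = 0;
    for (size_t i = 0; i < src_len; i++) {
        if (n < dst_max)
            dst[n] = src[i]; /* governed by the if */
        n++;                 /* not governed by the if: runs every iteration */
    }
    return n;
}
```

Calling copy_capped with five inputs and room for two stores only the first two values yet returns 5. A caller that sizes its next allocation from the return value is therefore working with a number the input fully controls.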

This confirms that the eventual value passed as the third argument to REALLOC is partially attacker-controllable, making this a valid vulnerability. We took the long road here, but as highlighted earlier, you could skip various steps in the sink-to-source analysis by making reasonable guesses based on variable names and developer comments. That’s a judgment call you’ll have to make based on the size of the codebase and the amount of time you can spend on it. Looking at the pull request that fixed CVE-2022-22822 through CVE-2022-22827, you’ll see that it added validation checks prior to the REALLOC invocation in storeAtts:

+    /* Detect and prevent integer overflow */
+    if ((nDefaultAtts > INT_MAX - INIT_ATTS_SIZE)
+        || (n > INT_MAX - (nDefaultAtts + INIT_ATTS_SIZE))) {
+        return XML_ERROR_NO_MEMORY;
+    }
+
     parser->m_attsSize = n + nDefaultAtts + INIT_ATTS_SIZE;
+
+    /* Detect and prevent integer overflow.
+     * The preprocessor guard addresses the "always false" warning
+     * from -Wtype-limits on platforms where
+     * sizeof(unsigned int) < sizeof(size_t), e.g. on x86_64. */
+#if UINT_MAX >= SIZE_MAX
+    if ((unsigned)parser->m_attsSize > (size_t)(-1) / sizeof(ATTRIBUTE)) {
+        parser->m_attsSize = oldAttsSize;
+        return XML_ERROR_NO_MEMORY;
+    }
+#endif
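
The first added check guards the addition rather than the multiplication. A standalone sketch of that check might look like the following; checked_atts_size is a hypothetical name, and the sketch assumes all three inputs are non-negative.

```c
#include <limits.h>

/* Hypothetical helper mirroring the first added check: computes
 * n + n_default + init_size into *out, or returns 0 if the sum would
 * exceed INT_MAX. All inputs are assumed to be non-negative. */
int checked_atts_size(int n, int n_default, int init_size, int *out) {
    if (n_default > INT_MAX - init_size
        || n > INT_MAX - (n_default + init_size)) {
        return 0;
    }
    *out = n + n_default + init_size;
    return 1;
}
```

The order of the two comparisons matters. The second one computes n_default + init_size, which is only safe once the first comparison has established that this partial sum fits in an int.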

The description for CVE-2022-22827 states that “storeAtts in xmlparse.c in Expat (aka libexpat) before 2.4.3 has an integer overflow.” This confirms that your rule was able to detect a real variant of CVE-2021-46143. Based on the other results, the rule also correctly identifies integer overflows in defineAttribute (CVE-2022-22824) and nextScaffoldPart (CVE-2022-22826), but it fails to identify the ones in addBinding (CVE-2022-22822), build_model (CVE-2022-22823), and lookup (CVE-2022-22825). The latter two are missed because the overflowed integer is passed to a malloc call instead of realloc. For addBinding, the offending code is:

XML_Char *temp = (XML_Char *)REALLOC(
    parser, b->uri, sizeof(XML_Char) * (len + EXPAND_SPARE));

Note that if you copy this code into a separate file, false_negative.c, and scan it with your Semgrep rule, it’ll detect the vulnerability. The rule misses it in xmlparse.c for a different reason: recall the “Partially scanned: 1 files only partially analyzed due to a parsing or internal Semgrep error” message from the Semgrep output earlier. The Semgrep engine does not yet fully support all aspects of the C language syntax. Due to trade-offs in its design (performance, coverage) and nuances of the C programming language, it can fail to properly parse parts of the code.

Use Semgrep’s dump-ast feature to understand how Semgrep represents the code internally:

$ semgrep --lang c --dump-ast false_negative.c
❶ Call(
        N(
            Id(("REALLOC", ()),
                {id_info_id=3; id_hidden=false; id_resolved=Ref(
                 None); id_type=Ref(None); id_svalue=Ref(
                 None); })),
        [Arg(
             N(
                 Id(("parser", ()),
                     {id_info_id=4; id_hidden=false; id_resolved=Ref(
                      None); id_type=Ref(None); id_svalue=Ref(
                      None); })));

In the abbreviated output, the AST for the REALLOC invocation starts with that node ❶. Semgrep does not differentiate between macro invocations and function calls, which is why the root of this tree is the Call element.

Despite its limitations, using a code scanning engine like Semgrep allows you to scan for patterns that go beyond what a simple regex can do. For example, consider another scenario in which a variable is first assigned the result of a multiplication or addition operation and then passed to REALLOC as the third argument, rather than the operation appearing directly in the third argument. This creates the same integer overflow vulnerability but calls for a more generic pattern. To check for this, use the pattern-inside operator as well as metavariables:

rules:
  - id: CVE-2021-46143
    patterns:
  ❶ - pattern-either:
          - pattern-inside: |
             ❷ (int $SIZE) = $VARIABLE * $CONSTANT;
                ...
          - pattern-inside: |
                (int $SIZE) *= $CONSTANT;
                ...
          - pattern-inside: |
                (int $SIZE) = $VARIABLE + $CONSTANT;
                ...
          - pattern-inside: |
                (int $SIZE) += $CONSTANT;
                ...
  ❸ - pattern: REALLOC(parser, $POINTER, $SIZE);
    message: Detected variant of CVE-2021-46143.
    languages: [c]
    severity: ERROR

In the new rule, observe how the various permutations of pattern-inside are nested under pattern-either❶, paying attention to the Boolean operations. The rule uses typed metavariables ❷ to increase its accuracy, since the integer overflow should technically apply only to integer variables. It also uses the same $SIZE metavariable in both the pattern-inside and pattern❸ operators to match them up.
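
To make the targeted shape concrete, here's a hypothetical C example in which the size is computed into an int variable first and only then handed to the allocator. None of these names come from Expat. The helper also shows one way a validation check placed ahead of the call can rule out the overflow, which is the kind of detail to look for when triaging matches.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical example: doubles size until it can hold needed bytes.
 * Returns 0 instead of doubling past INT_MAX. */
int next_buffer_size(int size, int needed, int *out) {
    if (size <= 0)
        size = 1;
    while (size < needed) {
        if (size > INT_MAX / 2)
            return 0; /* doubling would overflow int */
        size *= 2;
    }
    *out = size;
    return 1;
}

/* The size is computed first and only then passed to realloc, which is the
 * indirect shape described above. */
char *grow_buffer(char *buf, int size, int needed, int *new_size) {
    int bytes_to_allocate;
    if (!next_buffer_size(size, needed, &bytes_to_allocate))
        return NULL;
    char *temp = (char *)realloc(buf, (size_t)bytes_to_allocate);
    if (temp != NULL)
        *new_size = bytes_to_allocate;
    return temp;
}
```

A purely syntactic rule can't tell on its own whether a guard like the one in next_buffer_size is present, so each match still needs a manual look at the code leading up to the call.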

If you run this rule on the repository, you should get two new results:

Scanning 19 files.
19/19 tasks 0:00:00

Results                                                                                                         
Findings:

  expat/lib/xmlparse.c 
     CVE-2021-46143
        Detected variant of CVE-2021-46143.

        1938┆ temp = (char *)REALLOC(parser, parser->m_buffer, bytesToAllocate);
          ⋮┆----------------------------------------
        2573┆ char *temp = (char *)REALLOC(parser, tag->buf, bufSize);

Scan Summary

Some files were skipped or only partially analyzed.
  Scan was limited to files tracked by git.
  Partially scanned: 1 files only partially analyzed due to a parsing or internal Semgrep error
  Scan skipped: 6 files matching .semgrepignore patterns
  For a full list of skipped files, run semgrep with the --verbose flag.

Ran 1 rule on 18 files: 2 findings.

By analyzing these results, you’ll discover that one of them is in fact yet another integer overflow that was discovered later (CVE-2022-25315). Take some time to understand why one finding is a true positive while the other is a false positive. Hint: Are there any validation checks before the REALLOC invocation?

Conclusion

Let’s quickly recap the path you took to discover variants of CVE-2021-46143. First, you performed a root cause analysis of the original vulnerability by checking the diffs of the patch as well as metadata like patch notes. You then wrote an exact match pattern of the vulnerability sink, before iteratively generalizing the rule to catch more variants. You can tweak your rules to be as strict or loose as you want. For example, you can exclude all matches that include a validation check by using the pattern-not-inside operator. However, each design choice creates a trade-off between higher rates of false positives and false negatives.

If you enjoyed this snippet, check out https://fromdayzerotozeroday.com/ for more information about “From Day Zero to Zero Day” by Eugene Lim.
