Scaling Semgrep rule coverage by spidering language documentation

How we made it easy to follow .NET coding best practices by scraping the MSDN documentation for recommendations and concerns.

Semgrep’s expanded coverage of the .NET standard library

.NET developers, rejoice!

r2c's Security Research team has been hard at work expanding our C# rule coverage, and we are happy to announce that Semgrep's C# rules now cover more of the .NET standard library: XML External Entities, Cross-Site Request Forgery in ASP.NET, Cross-Site Scripting in Razor, path traversal, Razor template injection, SQL injection, HTTP wildcard bindings, Debug/Trace configurations in production, and more.

Add or update Semgrep coverage for your C# projects using Semgrep App, view the rules on the Semgrep Registry or in the semgrep-rules repo, or scan your C# code locally via the Semgrep CLI:

$ semgrep --config=auto path/to/repo

Please let us know areas you'd like to see additional .NET support!

In addition to user feedback, the security research team is always looking for new ways to improve our rule coverage. This brings us to our next topic...

Expanding Semgrep’s C# coverage of MSDN advisories with Go and Colly

Problem statement

Static Application Security Testing (SAST) aims to prevent known hazards, pitfalls, and mistakes. You can’t prevent what you don’t know about, so how do you go about generating a robust list of known antipatterns to prevent?

The OWASP Top 10 list is an excellent source of vulnerability classes to search for, but what about language-specific issues?

Microsoft's .NET documentation is excellent, comprehensive, and consistent. From my prior experience, I know that BinaryFormatter is a class with significant security vulnerabilities. Taking a quick look at Microsoft’s public-facing documentation, we see the following:

BinaryFormatter is insecure and can't be made secure. For more information, see the BinaryFormatter security guide.

These advisories are helpfully in big, yellow boxes:

In order to generate a bunch of interesting test cases for Semgrep, we just need to…read the documentation for all of .NET.

Sounds like a job for automation!

Initial investigation

So, can we get a computer to read Microsoft’s .NET documentation for us and fish out the interesting stuff? Let’s take a closer look at the source of BinaryFormatter’s alert box:

Neat. It looks like Microsoft uses a standard HTML class for these. Let’s spot check a few and see if this is the case. Floating-point computing has some finicky edge cases, so let’s go have a look at the Double class:

Yahtzee! Microsoft uses consistent container names for significant advisories, which makes them relatively easy to scrape: a consistent class name means we can grab them with, e.g., a CSS selector.

Teaching computers to read

Web scrapers/spiders have gotten really good, and there are libraries for spidering in most mature languages. I knew I was looking at a large number of pages to spider, and I had some prior projects that could be altered to solve my problem, so I went with Colly, a scraper/spider library in Go.

The boilerplate of writing a Go CLI application with Colly is left as an exercise to the reader. The interesting, meaty bits of the scraper are entirely contained within the scraper function here:

func MsdnCrawl(target string) ([]string, error) {
    results := make([]string, 0)
    // cache records links we have already queued, so each page is visited once
    cache := map[string]bool{}
    // API doc links carry a .NET version query parameter;
    // the dot is escaped so it only matches a literal "."
    docrefRegexp := regexp.MustCompile(`view=net-[[:digit:]]\.[[:digit:]]`)
    // we are collecting domain-specific keywords, restrict to target domain
    targetUrl, urlErr := url.Parse(target)
    if urlErr != nil {
        // hand the error back to the caller instead of exiting the process
        return nil, urlErr
    }
    c := colly.NewCollector(
        colly.AllowedDomains(targetUrl.Hostname()),
        //colly.MaxDepth(8),
    )

    c.OnHTML("div.WARNING", func(h *colly.HTMLElement) {
        // found a MSDN warning box, append to list of interesting URLs
        results = append(results, h.Request.URL.String())
    })
    c.OnHTML("a", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        // must be API docs that we have not seen before
        _, exists := cache[link]
        if !exists && docrefRegexp.MatchString(link) {
            cache[link] = true
            e.Request.Visit(link)
        }
    })
    err := c.Visit(target)
    return results, err
}

Selecting warning boxes

Simply put, the primary objective of this spider is "record any page with a yellow warning box".

With a little bit of trial and error, picking the correct Colly selector was straightforward. From there, all we needed to do was define an OnHTML listener:

    c.OnHTML("div.WARNING", func(h *colly.HTMLElement) {
        // found a MSDN warning box, append to list of interesting URLs
        results = append(results, h.Request.URL.String())
    })
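
To make the selector's meaning concrete without a Colly dependency, here is a minimal standard-library sketch of the same check: does a page contain a div whose class list includes WARNING? It is not how the spider does it (Colly evaluates the CSS selector for us); the helper name hasWarningBox and the sample markup are assumptions for illustration, and encoding/xml in lenient mode only copes with reasonably well-formed HTML.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// hasWarningBox reports whether doc contains a <div> whose class attribute
// includes the class "WARNING" (the same condition as the "div.WARNING" selector).
// Hypothetical helper: the spider in this post relies on Colly instead.
func hasWarningBox(doc string) (bool, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	// Lenient settings so the XML tokenizer tolerates common HTML quirks.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return false, nil // reached the end without finding a warning box
		}
		if err != nil {
			return false, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "div" {
			continue
		}
		for _, attr := range start.Attr {
			if attr.Name.Local != "class" {
				continue
			}
			// A class attribute may hold several space-separated class names.
			for _, class := range strings.Fields(attr.Value) {
				if class == "WARNING" {
					return true, nil
				}
			}
		}
	}
}

func main() {
	// Sample markup is made up for the example; real pages are much larger.
	warning := `<html><body><div class="WARNING"><p>BinaryFormatter is insecure.</p></div></body></html>`
	note := `<html><body><div class="NOTE"><p>Just a note.</p></div></body></html>`

	for _, page := range []string{warning, note} {
		found, err := hasWarningBox(page)
		fmt.Println(found, err)
	}
}
```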

Reading docs at scale

Now that we have a reliable warning box selector, we just need to make sure we get to every .NET documentation page once (and only once!). We accomplish this by:

  • Adding a Colly listener for anchor tags.

    c.OnHTML("a", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        ...
    })
  • Checking the cache map for the link.

    _, exists := cache[link]
  • Verifying that the link is, in fact, .NET documentation. On docs.microsoft.com, .NET docs will have a trailing .NET version query parameter.

    // the dot is escaped so it only matches a literal "."
    docrefRegexp := regexp.MustCompile(`view=net-[[:digit:]]\.[[:digit:]]`)
    ...
    if !exists && docrefRegexp.MatchString(link) {
        cache[link] = true
        e.Request.Visit(link)
    }
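
The three steps above boil down to one decision per link: have we seen it, and does it look like .NET API documentation? A minimal standard-library sketch of that decision, pulled out of the Colly callback so it can be run on its own, might look like the following. The linkFilter type, its method names, and the sample links are assumptions made for the example; the regular expression follows the one in the post, with the dot escaped so it matches only a literal ".".

```go
package main

import (
	"fmt"
	"regexp"
)

// docrefRegexp matches the .NET version query parameter on API doc links.
var docrefRegexp = regexp.MustCompile(`view=net-[[:digit:]]\.[[:digit:]]`)

// linkFilter is a hypothetical helper that remembers which links were accepted.
type linkFilter struct {
	seen map[string]bool
}

func newLinkFilter() *linkFilter {
	return &linkFilter{seen: map[string]bool{}}
}

// shouldVisit returns true the first time it sees a link that looks like
// .NET API documentation, and false for repeats or unrelated links.
func (f *linkFilter) shouldVisit(link string) bool {
	if f.seen[link] || !docrefRegexp.MatchString(link) {
		return false
	}
	f.seen[link] = true
	return true
}

func main() {
	f := newLinkFilter()
	// Illustrative links only; they are not taken from a real crawl.
	links := []string{
		"/en-us/dotnet/api/system.double?view=net-6.0",
		"/en-us/dotnet/api/system.double?view=net-6.0", // repeat, skipped
		"/en-us/azure/",                                // not API docs, skipped
	}
	for _, link := range links {
		fmt.Printf("%v\t%s\n", f.shouldVisit(link), link)
	}
}
```

One design note: the cache is keyed on the raw href, so the same page reached through two differently written links would be visited twice. Normalizing links before the lookup would tighten this, at the cost of a little more code.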

From there, we just need to kick off the Colly spider from the root of the .NET API docs and away we go!

Results

The spider turned up approximately 60 documentation pages worth investigating, most of which were correctness-related. The security research team at r2c is currently implementing checks for many of these advisories in Semgrep, such as the limitations of the Double.Epsilon field. Floating-point equality is tricky! From the MSDN docs:

Double.Epsilon is sometimes used as an absolute measure of the distance between two Double values when testing for equality. However, Double.Epsilon measures the smallest possible value that can be added to, or subtracted from, a Double whose value is zero. For most positive and negative Double values, the value of Double.Epsilon is too small to be detected. Therefore, except for values that are zero, we do not recommend its use in tests for equality.

Furthermore, the same documentation notes that on ARM systems the value of Double.Epsilon is too small to be detected, so it equates to zero.

Future work

The results from scraping Microsoft documentation are promising: well-formatted, consistent documentation is an excellent place to programmatically mine for language nuance and advisories.

Future areas of inquiry might include:

  • "Fuzzy" matching against DOM element text via keywords. This will empower spiders to handle structurally inconsistent language or library documentation.

  • "Documentation sentiment" analysis. A binary classifier for "cautionary language" might turn up new paths of inquiry, in addition to more robustly handling documentation without standardized formatting.
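
As a rough idea of what the first bullet could look like, here is a keyword-based sketch that scores a block of documentation text by how many cautionary phrases it contains. Everything here is an assumption for illustration: the cautionScore name, the keyword list, and the idea of counting distinct hits. It is a crude stand-in for fuzzy matching, not a classifier.

```go
package main

import (
	"fmt"
	"strings"
)

// cautionKeywords is a hypothetical, hand-picked list of cautionary phrases.
// Partial stems such as "vulnerab" match "vulnerable" and "vulnerability".
var cautionKeywords = []string{
	"insecure",
	"not recommend",
	"deprecated",
	"vulnerab",
	"caution",
	"warning",
}

// cautionScore counts how many distinct keywords appear in text,
// ignoring case. A score above zero marks the text for human review.
func cautionScore(text string) int {
	lower := strings.ToLower(text)
	score := 0
	for _, keyword := range cautionKeywords {
		if strings.Contains(lower, keyword) {
			score++
		}
	}
	return score
}

func main() {
	samples := []string{
		"BinaryFormatter is insecure and can't be made secure.",
		"Therefore, except for values that are zero, we do not recommend its use in tests for equality.",
		"Returns the larger of two numbers.",
	}
	for _, sample := range samples {
		fmt.Printf("%d\t%s\n", cautionScore(sample), sample)
	}
}
```

A spider could run a check like this over paragraph text on pages that lack a standard warning container, trading precision for coverage.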
