Redefining security coverage for Python with framework-native analysis

We’ve supercharged Semgrep Code’s Python support with new, framework-specific analysis capabilities. The engine now tracks implicit data flows in popular frameworks like Django, FastAPI, and Flask, providing accurate detection of impactful security issues (OWASP Top Ten) for nearly 100 common Python libraries.

Chushi Li
September 5th, 2024

TL;DR: Frameworks like Flask introduce significant challenges for code scanning tools due to obfuscated data and control flows. 

For most SAST products, framework coverage starts and ends with rule support. Semgrep Code now has framework-specific analysis capabilities built into the engine, meaning it can reason about Python source code in the context of specific frameworks. This ensures that implicit flows are captured and analyzed effectively.

As a result, our latest update for Python support includes highly differentiated coverage for Django, Flask, FastAPI, and nearly 100 of the most popular Python libraries, with benchmarks showing an 84% true positive rate (before Semgrep Assistant's AI processing). Benchmark details are at the bottom of this post!


SAST tools struggle with frameworks

When analyzing an application’s source code, static application security testing (SAST) tools must reason about how values flow within the app. 

Control flow tracks how a program moves through functions and statements, while data flow monitors how data travels and transforms within these paths. Tracking these flows is critical as many security vulnerabilities arise from improper handling of user input and its journey through the application. 

It's easy for a SAST tool to follow control flow in raw Python - consider the following example:

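import aiosqlite
from flask import request

def run_before_request():
    # Source: attacker-controlled form field
    return request.form['username']

async def hello_route():
    # Explicit call: the return value flows straight into `user`
    user = run_before_request()
    db = await aiosqlite.connect("db.sqlite")
    # Sink: user input interpolated into the SQL string
    cursor = await db.execute(f"select * from Users where name = '{user}'")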

It's clear how data moves from request.form to the user variable, and finally into the SQL query. A SAST tool can trace each step of execution, ensuring that user input isn’t reaching sensitive areas like file system access or database queries without proper validation.
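To make the risk concrete, here is a minimal sketch that contrasts the interpolated query with a parameterized one. It uses the standard-library sqlite3 module and an in-memory database in place of aiosqlite so that it runs on its own. The table contents and the find_user_unsafe and find_user_safe helper names are assumptions made for illustration, not part of the original example.

```python
import sqlite3

def make_db():
    # In-memory database with a hypothetical Users table, for illustration only.
    db = sqlite3.connect(":memory:")
    db.execute("create table Users (name text)")
    db.executemany("insert into Users values (?)", [("alice",), ("bob",)])
    return db

def find_user_unsafe(db, user):
    # User input is interpolated into the SQL text: the flow a SAST tool should flag.
    return db.execute(f"select * from Users where name = '{user}'").fetchall()

def find_user_safe(db, user):
    # User input is passed as a bound parameter and is never parsed as SQL.
    return db.execute("select * from Users where name = ?", (user,)).fetchall()

db = make_db()
payload = "x' or '1'='1"                      # classic injection string
unsafe_rows = find_user_unsafe(db, payload)   # matches every row in the table
safe_rows = find_user_safe(db, payload)       # matches nothing: no user has that name
```

With the interpolated query, the payload rewrites the WHERE clause and every row comes back. With the bound parameter, the same payload is treated as a literal name.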

In the real world, an application's code is rarely this straightforward - developers use frameworks, which introduce an additional layer of abstraction that makes control and data flow more difficult for SAST tools to trace.

In a framework like Flask, decorators and global objects come into play, leading to implicit, rather than explicit, control flow. For example:

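import aiosqlite
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def run_before_request():
    # Runs before every request; stores the input on the request-scoped g object
    g.user = request.form['username']

@app.route("/hello")
async def hello_route():
    # No explicit call to run_before_request, yet g.user is already set
    db = await aiosqlite.connect("db.sqlite")
    cursor = await db.execute(f"select * from Users where name = '{g.user}'")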

The @app.route decorator marks the hello_route function as a route, and the @app.before_request decorator registers run_before_request to run before every request in this app. There is no explicit call to run_before_request in hello_route. The run_before_request function returns no data; instead, it stores data on the global object g, which holds data for the duration of a single request.

All this means that the framework itself determines the order and flow of execution - which is not readily apparent from simply looking at function calls in source code.
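The mechanism behind this can be sketched with a toy framework in plain Python. MiniApp, its dispatch method, and the ctx dictionary are hypothetical stand-ins invented for this sketch; they are not Flask's implementation. They only mimic how decorators register functions that the framework later calls in an order of its own choosing.

```python
class MiniApp:
    """Toy stand-in for a web framework (hypothetical, not Flask's implementation)."""

    def __init__(self):
        self.before_hooks = []   # functions registered with @app.before_request
        self.routes = {}         # path -> handler registered with @app.route

    def before_request(self, func):
        self.before_hooks.append(func)
        return func

    def route(self, path):
        def register(func):
            self.routes[path] = func
            return func
        return register

    def dispatch(self, path, form):
        # The framework, not the handler, decides what runs and in what order.
        ctx = {"form": form, "g": {}}
        for hook in self.before_hooks:
            hook(ctx)
        return self.routes[path](ctx)


app = MiniApp()

@app.before_request
def run_before_request(ctx):
    ctx["g"]["user"] = ctx["form"]["username"]

@app.route("/hello")
def hello_route(ctx):
    # No call to run_before_request here, yet ctx["g"]["user"] is populated.
    return f"select * from Users where name = '{ctx['g']['user']}'"

query = app.dispatch("/hello", {"username": "alice"})
```

The only place where run_before_request and hello_route are connected is inside dispatch, which is framework code. A tool that reads just the application's function calls never sees that edge.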

Without understanding the implicit control flow of a framework, a SAST tool may miss critical security issues. Conversely, an analysis that compensates by over-approximating flows will generate far too many false positives.

TL;DR: An effective security product must be able to determine if attacker-controlled data is capable of reaching security-sensitive functions. Frameworks make this difficult for SAST tools.

Framework-native analysis with Semgrep

To address these challenges, our team incorporated key framework-specific analysis capabilities into the Semgrep Pro Engine.

Overall, the approach for Python involved:

  1. Framework-Aware Analysis: Semgrep Code understands that certain functions or handlers are executed automatically by the framework in a specific order. The engine goes beyond tracking explicit function calls and reasons about how the specific framework will stitch function calls together at runtime.

  2. Tracking Data in Global Objects: Since frameworks often use global or context-specific objects (like g in the Flask example), Semgrep Code recognizes these objects and tracks the data stored in them. This ensures that data passed implicitly through these objects doesn’t evade security checks.

  3. Framework-Specific Rules: This isn’t anything new! Semgrep Code’s Pro Rules have always been a big part of why our language and framework coverage stands out in the space. For a SAST tool to claim support for a framework, it needs to be able to find security problems in real applications using that framework. Many frameworks provide functions for security-sensitive actions (for example, sending back HTML responses, which could lead to XSS vulnerabilities if the response contains user input). At Semgrep, our goal with coverage is to comprehensively and accurately detect common, OWASP Top Ten issues in source code. Our process for prioritizing framework-specific rules aligns with this goal.
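The first two ideas in the list above can be illustrated with a short sketch built on Python's standard ast module. This is not how the Semgrep Pro Engine is implemented; it is a deliberately simplified illustration, and the helper names (decorator_name, g_attrs, explicit_calls) are assumptions made for this example. It parses the earlier Flask snippet, shows that explicit calls alone reveal no link between the two functions, and then adds an implicit hook-to-route edge along with the attributes of g that are written in the hook and read in the route.

```python
import ast

SOURCE = '''
app = Flask(__name__)

@app.before_request
def run_before_request():
    g.user = request.form['username']

@app.route("/hello")
async def hello_route():
    db = await aiosqlite.connect("db.sqlite")
    cursor = await db.execute(f"select * from Users where name = '{g.user}'")
'''

def decorator_name(dec):
    # "@app.before_request" is an Attribute; "@app.route(...)" is a Call wrapping one.
    target = dec.func if isinstance(dec, ast.Call) else dec
    return target.attr if isinstance(target, ast.Attribute) else None

def g_attrs(func, ctx_type):
    # Attribute names accessed on the global "g" in the given context (Store or Load).
    return {
        node.attr
        for node in ast.walk(func)
        if isinstance(node, ast.Attribute)
        and isinstance(node.value, ast.Name)
        and node.value.id == "g"
        and isinstance(node.ctx, ctx_type)
    }

def explicit_calls(func):
    # Functions called directly by name: the only edges a framework-unaware pass sees.
    return {
        node.func.id
        for node in ast.walk(func)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }

funcs = [n for n in ast.parse(SOURCE).body
         if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
hooks = [f for f in funcs if "before_request" in map(decorator_name, f.decorator_list)]
routes = [f for f in funcs if "route" in map(decorator_name, f.decorator_list)]

# Framework-aware step: add an implicit edge from every hook to every route,
# and record which attributes of g carry data across that edge.
implicit_flows = [
    (hook.name, route.name, sorted(g_attrs(hook, ast.Store) & g_attrs(route, ast.Load)))
    for hook in hooks
    for route in routes
]
explicit = {f.name: explicit_calls(f) for f in funcs}
```

Here explicit maps both functions to an empty set, while implicit_flows links run_before_request to hello_route through g.user. A real engine also has to model request scoping, hook ordering, and whether the stored value is actually attacker-controlled, all of which this sketch ignores.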

Semgrep doesn’t just understand Python source code/syntax - it understands it in the context of specific frameworks. This level of abstraction lets us build out excellent coverage for the most commonly used Python frameworks and libraries. 

Benchmark results

Our benchmarking process for language support was developed by our security research team, who use it internally to build out language coverage, monitor rule performance, and iterate on existing coverage.

The process involves scanning open-source repositories and manually triaging the findings. These repositories include a few purposely vulnerable repositories, but are mostly actively maintained, real-world applications.

Python coverage benchmark (as of 2024-09-04)

  • Benchmark true positive rate for latest ruleset (before AI/Assistant processing): 84% over ~1000 findings

  • Repositories scanned: 192

  • Lines of code scanned: ~20 million
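For readers who want the arithmetic spelled out, here is a small sketch of the headline figure. It assumes that "true positive rate" means the share of manually triaged findings that were confirmed as real issues. The counts are illustrative only, chosen to match the rounded figures above; the exact split is not published in this post.

```python
def true_positive_rate(true_positives, false_positives):
    """Share of triaged findings that were confirmed as real issues."""
    triaged = true_positives + false_positives
    if triaged == 0:
        raise ValueError("no triaged findings")
    return true_positives / triaged

# Illustrative counts only: the post reports roughly 1000 findings, not an exact split.
rate = true_positive_rate(true_positives=840, false_positives=160)
```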

It’s worth noting that while SAST benchmarks should be met with a healthy dose of skepticism, the metrics above come from an internal process designed to give our research team an accurate picture of coverage. Every benchmarking process is imperfect and actual results will vary across projects, but there is no “marketing slant” to these numbers.

Conclusion

Python’s flexibility and the power/popularity of frameworks like Flask introduce significant challenges to traditional SAST tooling, especially when it comes to control and data flow analysis.

Semgrep’s framework-specific approach ensures that implicit flows are captured and analyzed effectively for the most popular frameworks.

This differentiated Python coverage, combined with easily customizable rules, lightning-fast scans, and developer-first workflows, makes Semgrep a standout choice for Python shops - whether they have an AppSec team of 1 or 1000.

Try Semgrep for free - you can get up and running in minutes and scan locally, in CI, or in our cloud.

About

Semgrep lets security teams partner with developers and shift left organically, without introducing friction. Semgrep gives security teams confidence that they are only surfacing true, actionable issues to developers, and makes it easy for developers to fix these issues in their existing environments.