Today we’re extremely excited to launch GA support for C and C++, exclusive to Semgrep Code! For the past six months, our amazing program analysis and security research teams have been hard at work bringing static analysis for C/C++ into the modern era of application security.
For those who are unfamiliar with Semgrep, here are just a few of the things you can do with Semgrep Code:
Scan C/C++ projects and get accurate, low-noise results in minutes (not hours).
Easily identify and prioritize true positives with the help of Semgrep Assistant, and surface them to developers in their native workflows (via PR comments, Jira tickets, etc.).
Cut down your vulnerability backlog, and work towards eliminating classes of vulnerabilities instead of chasing down their individual instances (over and over again).
Now that we’ve launched with great feedback from some of our largest customers, we’re excited to share more details on how we approached parsing C/C++ and how we were able to mitigate some of the challenges posed by the preprocessor.
First attempt: limitations with Semgrep Open Source
The demand for a modern tool that can analyze C/C++ codebases quickly and precisely has been clear to us for a while - Semgrep OSS actually had experimental support for C and C++ all the way back in 2021 (in tech startup years this is a really long time).
However, the limitations of our OSS engine and its strict adherence to single-function analysis made true support for C/C++ infeasible due to the complexity of the languages.
As C and C++ source code is very ambiguous and inherently multi-file (analysis often relies on information spread across header and preprocessor files), adding C/C++ support that met our standards for accuracy and scan speed necessitated the use of our Pro engine, and required a ton of additional investment into parsing abilities, analysis techniques, and rule-writing - all of which brought us to where we are today!
The first PR for C/C++ support in Semgrep Open Source
The traditional (slow) way to analyze C/C++
C/C++ are difficult to analyze because they are compiled languages that rely on the C preprocessor (cpp) to run first and expand macros before compilation (think of a macro as a placeholder in source code with instructions on what to replace itself with).
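To make the "placeholder" idea concrete, here is a minimal sketch in C. The names BUF_SIZE, SQUARE, area_as_written, and area_after_expansion are made up for illustration and do not come from any real project. The first function is what a programmer writes; the second is roughly what the compiler receives once cpp has substituted the macros.

```c
#include <assert.h>

/* Object-like macro: every use of BUF_SIZE is replaced by 64. */
#define BUF_SIZE 64

/* Function-like macro: SQUARE(n) is replaced by ((n) * (n)). */
#define SQUARE(x) ((x) * (x))

/* What the programmer writes: the macros are still visible here. */
int area_as_written(int side) {
    assert(side >= 0 && side <= BUF_SIZE);
    return SQUARE(side);
}

/* Roughly what the compiler sees after cpp has run: the macro names
   are gone and only their replacement text remains. */
int area_after_expansion(int side) {
    return ((side) * (side));
}
```

Both functions compute the same value; the only difference is whether the macro names are still present in the text being parsed.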
Since C/C++ grammar is very ambiguous (see Jim Roskind’s note on ambiguity in C, or the blog post Full Proof that C++ Grammar is Undecidable), it’s extremely difficult for SAST tools to parse C/C++ source code if any preprocessor directives are included (which is almost always the case).
The traditional method used by most SAST tools to parse C/C++ involves integrating and understanding the languages' many build systems (Make, CMake, Bazel, Buck, Ninja) and the various C preprocessor (cpp) flags used to compile projects (-I for include paths and -D for macros; we'll explain these more in-depth shortly). The multiplicative complexity here should be apparent!
This method requires first compiling the project and analyzing the artifacts produced during compilation (primarily the macro-expanded .c and .cpp source).
There are pros and cons to this approach, but the major disadvantage is speed - compiling C/C++ is notorious for taking ages. Add in the analysis complexity that comes from the various build systems and compilation flags and it's apparent why SAST tools using this method can take hours to scan C/C++ projects.
Another downside to this approach is the added complexity that comes with the various required build steps of these tools (CodeQL's autobuilder, SonarQube's build-wrapper, etc.):
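As a small illustration of why compilation flags matter to a build-based tool, consider the sketch below. MAX_CLIENTS and can_accept are hypothetical names chosen for this example. The value of the macro is not necessarily in the source at all; it may come from the build command.

```c
#include <assert.h>

/* Default used only when the build does not pass -DMAX_CLIENTS=<n>. */
#ifndef MAX_CLIENTS
#define MAX_CLIENTS 16
#endif

/* Returns 1 if another client may connect, 0 otherwise. */
int can_accept(int connected) {
    assert(connected >= 0);
    return connected < MAX_CLIENTS;
}
```

Compiling with something like `cc -DMAX_CLIENTS=128 -Iinclude server.c` (the file and directory names are placeholders) changes the limit to 128 and tells cpp to look for headers under include/. A tool that analyzes the macro-expanded output has to recover the exact flags from the build system to know which of these programs it is looking at.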
An error message during a build step - not only do they slow down scans, they can be complicated to configure and troubleshoot
The Semgrep way
Semgrep scans are ludicrously fast because they run on source code, and the same holds true for our C/C++ support. Semgrep bypasses the traditional complexities associated with the preprocessor that slow down and complicate C/C++ analysis, and parses source code prior to macro expansion.
Parsing as-is means that Semgrep doesn’t even require a buildable project, and can be set up and running in CI in minutes. This is aligned with our commitment to rapid analysis and security that can keep up with the speed of development.
The question is, how does Semgrep understand and parse C/C++ source code so effectively without requiring a build or compile step?
Some of our shiny new C/C++ rules - check them out in the registry
The preprocessor predicament
Writing a parser for C/C++ grammar is not trivial because the language is massive and contains many ambiguities. Fortunately, we use the excellent tree-sitter grammars for C and C++. Tree-sitter is a parser generator and incremental parsing library that resolves many of the ambiguities in C/C++ thanks to its Generalized LR (GLR) technology.
Creating a parser specifically for the C preprocessor (cpp) is not complicated as it’s a relatively simple macro processing language (cpp directives are only weakly related to the actual grammar of C). However, developing a parser that simultaneously understands both C/C++ code and its preprocessor directives is challenging.
This difficulty arises because directives like #ifdefs and macros can theoretically be placed anywhere in the code, like comments or spaces. Parsers typically ignore comments and spaces during the initial scanning phase to avoid complexity - if they didn't, every rule in C/C++ would need modifications to account for the potential presence of preprocessor directives interspersed between other code elements like statements, expressions, and declarations.
Thankfully, in real-world scenarios programmers tend to use cpp directives in a highly disciplined manner. This is not out of principle; it’s a natural result of programming in C, since code that is difficult for machines to parse would also be challenging for humans to understand. As a result, programmers apply cpp idioms and directives in very specific locations and manners, greatly simplifying the parsing process.
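The difference between disciplined and undisciplined use is easier to see side by side. The sketch below is illustrative only: STRICT_MODE, cap_disciplined, and cap_undisciplined are hypothetical names. Both functions behave identically, but only the first keeps each directive around complete statements.

```c
#include <assert.h>

/* Disciplined: the directive wraps whole statements, so each branch is
   a complete piece of syntax and the function can be parsed as-is. */
int cap_disciplined(int value, int limit) {
    assert(limit >= 0);
#ifdef STRICT_MODE              /* hypothetical configuration flag */
    if (value > limit) {
        return limit;
    }
#else
    if (value > 2 * limit) {
        return limit;
    }
#endif
    return value;
}

/* Undisciplined: the directive splits a single `if` condition. Neither
   branch is a complete statement by itself, so the code is only
   well-formed after the preprocessor has picked one of them. */
int cap_undisciplined(int value, int limit) {
    assert(limit >= 0);
    if (
#ifdef STRICT_MODE
        value > limit
#else
        value > 2 * limit
#endif
    ) {
        return limit;
    }
    return value;
}
```

A grammar that allows directives between statements covers the first form; the second form is legal C, but it is the kind of code most programmers avoid because it is also hard to read.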
Navigating preprocessor directives and macros
#ifdef
ifdefs are usually used at the beginning and end of a file, or between entire functions and statements. It is thus possible to extend the tree-sitter C/C++ grammar to allow for either occurrence. Here is a recent example of an extension to tree-sitter grammar that allows ifdefs between enums.
Semgrep uses this extension to parse and analyze both branches of an ifdef. This means Semgrep actually parses more code than tools that require a build step (CodeQL, Coverity) despite being orders of magnitude faster.
This is because tools that require compilation will only exercise one configuration of the project. Projects like the Linux kernel contain thousands of possible configurations, all of which must be analyzed in order to provide complete coverage. Take the example of an Intel machine with all of the features ‘on’: gcc compiles only 54% of the Linux kernel C files, as the code of other architectures is not considered.
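Here is a hedged sketch of what "one configuration" means in practice. LEGACY_PLATFORM and copy_name are hypothetical names invented for this example. In a default build the compiler discards the first branch entirely, so a tool that analyzes compilation output never sees the unbounded copy; a tool that parses the source as-is has both branches in front of it.

```c
#include <assert.h>
#include <string.h>

/* Copies `src` into `dst` (capacity `cap`) and returns the copied length. */
size_t copy_name(char *dst, size_t cap, const char *src) {
    assert(dst != NULL && src != NULL && cap > 0);
#ifdef LEGACY_PLATFORM   /* hypothetical flag, e.g. passed as -DLEGACY_PLATFORM */
    /* Only compiled in the legacy configuration: unbounded copy. */
    strcpy(dst, src);
    return strlen(dst);
#else
    /* Default configuration: bounded copy with explicit termination. */
    size_t n = strlen(src);
    if (n >= cap) {
        n = cap - 1;
    }
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
#endif
}
```

Covering the legacy branch with a build-based tool would require a second build with the flag set, and one more build for every other combination of flags the project supports.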
#include
includes are normally used at the beginning of a file, but also sometimes inside class/struct definitions. As was the case with #ifdef, it’s pretty simple to extend C/C++ grammar to allow for these occurrences. Note that since our approach doesn’t expand #include, type information like typedefs and class names is not known to Semgrep when parsing a file (since those types are defined in deeply nested files).
This results in even more parsing ambiguities because traditional C/C++ parsers need to know whether an identifier corresponds to a type or not. Fortunately, the GLR technology of tree-sitter allows us to try multiple parse trees that get resolved after reading more tokens to a single tree.
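A classic instance of this ambiguity is the token sequence `value * out;`. The sketch below uses made-up names (value, out, as_declaration, as_expression) to show that the same tokens are a pointer declaration when `value` is a type and a multiplication when it is a variable. Without the header that defines `value`, a parser cannot tell which reading applies until it has seen more context.

```c
#include <assert.h>

/* Reading 1: `value` names a type, so `value * out;` declares `out`
   as a pointer to int. */
int as_declaration(void) {
    typedef int value;
    int target = 7;
    value * out;               /* declaration: `out` has type `int *` */
    out = &target;
    assert(*out == target);    /* sanity check: `out` points at `target` */
    return *out;
}

/* Reading 2: `value` names a variable, so the same tokens form a
   multiplication expression. */
int as_expression(void) {
    int value = 6;
    int out = 7;
    /* A bare `value * out;` here would be a multiplication whose result
       is discarded; we return it so the effect is observable. */
    return value * out;
}
```

In a real codebase the typedef would usually live in a header, which is exactly the information that is missing when #include is not expanded.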
This is why parsing as-is is significantly faster than the traditional approach. A C++ file that’s 10 lines of code can expand to millions of lines - for example <windows.h>, a header file for the Windows API, when expanded contains declarations for all of the functions in the API, all the common macros used by Windows programmers, and all the data types used by its various functions and subsystems.
C++ compilers are slow because each C++ file, even the smallest one, usually expands to a gigantic size that must then be processed by the compiler (parsed, typechecked, etc.). This is also why the C++ community pushed for a module system to enable modular analysis - unfortunately, modules are a recent addition that most projects do not leverage yet.
Macros
A macro is a label defined in the source code whose text is replaced by the preprocessor before compilation. Most macros look like regular function calls or constants, and can be parsed as-is. In fact, many C++ features like inline functions or constants were designed to avoid the use of macros. For other use cases, we can often extend the tree-sitter C/C++ grammar to deal with common idioms.
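As a rough illustration of macros that "look like regular function calls or constants", the sketch below pairs two common macro idioms with their language-level counterparts, written in C99 so it stays in one language. All names (MAX_RETRIES, MAX_OF, max_retries, max_of, and the two pick functions) are hypothetical.

```c
#include <assert.h>

#define MAX_RETRIES 3                          /* constant-like macro */
#define MAX_OF(a, b) ((a) > (b) ? (a) : (b))   /* call-like macro */

enum { max_retries = 3 };                      /* language-level constant */

/* Language-level replacement for the call-like macro. */
static inline int max_of(int a, int b) {
    return a > b ? a : b;
}

/* Uses the macros: without expansion, MAX_OF(...) still has the shape
   of a function call and MAX_RETRIES the shape of an identifier. */
int pick_with_macros(int requested) {
    assert(requested >= 0);
    return MAX_OF(requested, MAX_RETRIES);
}

/* Same logic written with the inline function and enum constant. */
int pick_with_functions(int requested) {
    assert(requested >= 0);
    return max_of(requested, max_retries);
}
```

From a parser's point of view the two pick functions have the same structure, which is why this style of macro can be handled without running the preprocessor.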
In theory, developers can define a macro in a header file:
#define INC 1 +
and proceed to use it in another file:
void foo(int a) { INC a; }
If you follow the C/C++ grammar, INC a; looks like a declaration of a local variable with the type INC, but after macro expansion we can see that it’s actually an addition expression. In practice, such code is hard for humans to understand without knowing the definition of INC, which is why programmers are disciplined in the way they write and use macros. For more information, see our own Yoann Padioleau’s paper on parsing C/C++ without pre-processing.
Limitations of parsing as-is
A notable drawback of the direct parsing method is the difficulty of accurately parsing code that employs complex macros or conditional compilation directives (#ifdefs). Fortunately, tree-sitter is designed to handle parsing errors and aids us in overcoming this challenge.
Semgrep leverages tree-sitter's error recovery capabilities to maximize the amount of code it can successfully parse - for example, Semgrep will typically analyze both branches of a conditional compilation directive.
For an in-depth look into the rules and analysis techniques we used for C/C++ (specialized constant propagation, taint tracking, cross-file analysis) stay tuned as we have an engineering blog post on the horizon that will cover these topics - this is the real "secret sauce" of Semgrep's C/C++ support.
Final thoughts
Our team’s journey to fully support C/C++ in Semgrep Code has been challenging but immensely rewarding. Our approach finally gives C/C++ shops the freedom to use a modern, shift-left solution that loops developers in at the right time and reduces their reliance on legacy tooling.
While there are inherent limitations in parsing C/C++ code as-is, our utilization of tree-sitter grammars and focus on practical programming patterns have enabled us to overcome these obstacles and deliver coverage that is as comprehensive as traditional SAST 1.0 tools (and often more accurate), while scanning faster than a developer’s commit flow.
Huge thanks to Iago, Heejong, Brandon, Claudio, Pad, Phil, Romain, and Milan for all of the hard work and creativity they put into building out Semgrep Code's C/C++ coverage (and for sharing their insights and contributing to this post).