The journey of a language from experimental to GA in Semgrep

Introduction

Semgrep is a code analysis tool many use for application security scanning (SAST). One of the design principles that has guided Semgrep’s development so far is supporting every programming language. To that effect, we have added support for 30+ languages. Due to the wide range of supported languages, Semgrep is the go-to SAST product for many users and organizations. This post describes the process of maturing the support for a language like Kotlin to Generally Available (GA). Oh, and by the way, we also wanted to announce that the support for Kotlin is now GA!

The program analysis perspective

From the program analysis side, getting a language like Kotlin to GA, our highest level of support maturity for a language, is mostly a question of two metrics – parse rate and rule coverage.

Parse rate is our primary metric for a language’s availability: we select a corpus of open-source repositories in the language in question and measure what percentage of them Semgrep can parse. This metric matters because parsing is the first step Semgrep needs before any analysis is possible. No parsing, no findings.
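As a rough illustration of the metric, here is a minimal Kotlin sketch. It assumes the rate is counted per file, which this post does not specify, and `ParseResult` and `parseRate` are hypothetical names made up for the example rather than anything from Semgrep's tooling.

```kotlin
// Hypothetical record: one entry per source file in the corpus.
data class ParseResult(val path: String, val parsed: Boolean)

// Percentage of files that parsed successfully, or 0.0 for an empty corpus.
fun parseRate(results: List<ParseResult>): Double =
    if (results.isEmpty()) 0.0
    else 100.0 * results.count { it.parsed } / results.size

fun main() {
    // Made-up corpus of 50 files in which exactly one file fails to parse.
    val results = List(50) { i -> ParseResult("File$i.kt", parsed = i != 0) }
    println(parseRate(results)) // 98.0
}
```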

For Kotlin, we use the open-source tree-sitter-kotlin repository. The parse rate for Kotlin was already high when we started, around 98%, so improving the parser’s ability to interpret Kotlin programs did not take much work. The changes we did make mostly came from manually inspecting unparsable files to find where the tree-sitter grammar diverged from the Kotlin grammar.

This process was fairly straightforward. What about rule coverage? Before our star security research team can write rules with adequate coverage, they need adequate rule-writing tools. This was the next step of the process.

This involved adding Semgrep features like typed metavariables, literal metavariables, and other rule-writing aids, which our Kotlin support previously lacked. Thanks to our setup in ocaml-tree-sitter-semgrep, this is generally a very easy process that just involves augmenting the original tree-sitter grammar with some Semgrep-specific extensions. It is not always that straightforward, though, and this was one of those cases.

In Kotlin, identifiers are interpolated into strings via dollar sign notation, such as writing "The ident $id is interpolated in this string", where $id is to be interpolated. Unfortunately, this means that in Kotlin the string "$X" is interpreted as the interpolation of the identifier X into a string. This is a problem, because "$X" in a Semgrep pattern is supposed to be a literal metavariable matching the contents of a literal string. This is an ambiguity!
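For readers less familiar with Kotlin, a small sketch of the language behavior in question. The function name `describe` is made up for the example.

```kotlin
// "$id" splices the value of the identifier id into the string.
fun describe(id: String): String =
    "The ident $id is interpolated in this string"

fun main() {
    println(describe("userName"))
    // prints: The ident userName is interpolated in this string
}
```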

To resolve this conflict, we had to make sure there were ways of writing rules for both interpretations of the Semgrep pattern "$X", namely

as a Kotlin string with an interpolated identifier $X in it
as a Semgrep literal metavariable matching the contents of a literal string

Thankfully, Kotlin offers a second syntax for interpolation: putting ${...} in a string, where the ellipsis denotes any expression. This means that the first interpretation can be equivalently expressed via the pattern "${X}".

So, we made the pattern "$X" parse as the second interpretation (a literal metavariable) and require the first to be written as "${X}" instead. Now there is no more ambiguity, and rule writers can still match either case using Semgrep.
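On the Kotlin side, the two interpolation forms are interchangeable for a plain identifier, which is what makes this resolution workable. The sketch below shows only the language behavior. The comments about Semgrep patterns restate the paragraphs above, and the function names are hypothetical.

```kotlin
// Simple form: an identifier interpolated with a bare dollar sign.
fun simpleForm(x: String): String = "$x"

// Brace form: evaluates to the same string for a plain identifier.
// Per the text above, "${X}" is the Semgrep pattern for this interpretation,
// while the pattern "$X" is read as a literal metavariable.
fun braceForm(x: String): String = "${x}"

// Braces also admit arbitrary expressions, not just identifiers.
fun expressionForm(x: String): String = "${x.length + 1}"

fun main() {
    println(simpleForm("abc") == braceForm("abc")) // true
    println(expressionForm("abc"))                 // 4
}
```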

Writing high-confidence Kotlin rules

For rule writing, getting a language to GA means writing rules for common high-impact vulnerabilities for its most popular frameworks and testing them across a wide range of open source repositories on GitHub.

For Kotlin, this meant writing rules for XSS, SQL Injection, NoSQL Injection, Command Injection, Code Injection, XXE, CSRF, and Active Debug Code for the Ktor and Spring Boot frameworks. Kotlin is also very popular for Android development, but we decided to focus on web frameworks for now and will revisit mobile frameworks in the near future.

Generally, most of the rules in our high-signal Pro rules projects use taint tracking/dataflow analysis (via Semgrep’s Taint Mode) because most of the vulnerabilities that we need to find are injection-based. The taint source is always “specific attacker-controlled data in an HTTP request”, and the sinks are always similar but depend on the vulnerability class and framework that we’re writing the rule for. For example, we always have SQL/Command exec sinks, however they may look. To that end, writing rules for Kotlin wasn’t too different from any other language.

However, there was one thing that I found interesting:

Encoding preconditions for taint rules using taint labels

This isn’t necessarily Kotlin-specific behavior, but when writing XSS rules for Ktor, I found that some of the ways Ktor lets you access data from an HTTP request return URL-encoded data. This data has to be URL-decoded before it can actually trigger an XSS payload when put into a raw HTML sink.
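To see why the decoding step matters, here is a minimal sketch using the JDK's `URLEncoder` and `URLDecoder` as stand-ins. It does not call any Ktor API, and the helpers `urlEncoded` and `urlDecoded` are hypothetical names for the example.

```kotlin
import java.net.URLDecoder
import java.net.URLEncoder

// Stand-in for a request accessor that hands back URL-encoded data (hypothetical helper).
fun urlEncoded(raw: String): String = URLEncoder.encode(raw, "UTF-8")

// The decoding step that has to happen before the payload is live again.
fun urlDecoded(encoded: String): String = URLDecoder.decode(encoded, "UTF-8")

fun main() {
    val payload = "<script>alert(1)</script>"
    val encoded = urlEncoded(payload)
    println(encoded)             // angle brackets are percent-encoded, so no HTML tag is formed
    println(urlDecoded(encoded)) // decoding restores the original markup
}
```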

So, I used one of my favorite Semgrep features, Taint Labels, to enforce that precondition.

In the resulting rule, the sink requires taint from a source labeled either no_processing_needed or url_decode (but not url_encoded), meaning that the only taint source that can’t reach the sink directly is the url_encoded one. If data from the url_encoded source reaches the sink, it has to pass through the pattern in url_decode to trigger a finding.
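The precondition logic can be pictured with a toy model in Kotlin. This is only an illustration of the label requirement described above, not Semgrep's engine or its rule syntax. The enum mirrors the three label names from the text, the function names are hypothetical, and treating the decode step as adding a label (rather than replacing one) is an assumption of the sketch.

```kotlin
// Toy model only: labels mirror the names used in the text.
enum class Label { NO_PROCESSING_NEEDED, URL_ENCODED, URL_DECODE }

// Passing through the URL-decode pattern adds URL_DECODE,
// but only to data that already carries URL_ENCODED taint.
fun passThroughUrlDecode(labels: Set<Label>): Set<Label> =
    if (Label.URL_ENCODED in labels) labels + Label.URL_DECODE else labels

// The sink requires NO_PROCESSING_NEEDED or URL_DECODE; URL_ENCODED alone is not enough.
fun sinkFires(labels: Set<Label>): Boolean =
    Label.NO_PROCESSING_NEEDED in labels || Label.URL_DECODE in labels

fun main() {
    val encoded = setOf(Label.URL_ENCODED)
    println(sinkFires(encoded))                           // false
    println(sinkFires(passThroughUrlDecode(encoded)))     // true
    println(sinkFires(setOf(Label.NO_PROCESSING_NEEDED))) // true
}
```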

Works like a charm!

Conclusion

So, those are the most interesting bits about our journey in getting Kotlin to GA maturity for Semgrep! We went over resolving the ambiguity with interpolated strings and using Taint Labels to encode preconditions in XSS rules. Brandon (@onefiftyman) worked tirelessly to deliver engine improvements when we needed them (often with < 1-day turnaround!). Vasilii (@ermil0v), Lewis (@LewisArdern), and I (Enno (@enncoded)) spent a lot of time writing these Kotlin Pro rules. If you are interested in our Pro rules, be sure to also check out Semgrep Code and our docs for more information.
