Sense and (path) sensitivity: My experience adding a new feature as a Semgrep intern

I spent 10 weeks at Semgrep as a Software Engineering (SWE) Intern on the Semgrep Analysis Foundations Team, the team that owns and maintains the core static analysis functionality of the Semgrep tool. In this blog post, I am going to share my experience of adding path sensitivity to Semgrep, as well as what I learned and loved along the way!

Katrina Liu
July 26th, 2024

“You have fresh eyes!”

Interning at Semgrep was a new experience for me in many ways: It was my first time living in SF, working an actual SWE job, and writing OCaml code for a real industry product. Being new to all this, I had a lot to learn. Now that my 10-week internship at Semgrep is coming to an end, I want to share a bit about my personal journey and the exciting new feature—path sensitivity—that I added to the Semgrep code analysis engine!

(If you are only interested in the technical portion, jump to the section “Let's ship this #$@!”.)

About Me

My name is Katrina Liu, and I am a rising senior (class of 2025) majoring in Computer Science at the University of Pennsylvania. I am interested in programming languages research, and I am a TA for a course that teaches OCaml. Before this internship, my experiences ranged from academic research and working at a student-led startup to editing computer science textbooks. As I searched for summer opportunities, I was eager to find a role where I could apply my theoretical knowledge to real-world problems, particularly in areas related to my interests. I was also excited to experience firsthand how engineering works in industry, and Semgrep seemed like the perfect place to do all that!

“Say what you will do, do what you said, say what you did, compare.”

During my internship, I was treated like everyone else on the team. I attended weekly stand-ups and monthly retros, scoped out my projects, and went through many rounds of code reviews. Going through this process was incredibly valuable for me. Not only did I learn to write better code, but I also became better at breaking down complex, user-facing problems into manageable engineering tasks. This is an invaluable skill that's often hard to learn in academic settings. I also became more confident communicating my views and ideas to both technical and non-technical team members.

The most important lesson I learned was how to reflect on the work we have done as a team, and how to go about making the process and the results better next time. This focus on relentless improvement is at the heart of Semgrep’s culture. I especially appreciated the monthly retros, where the team talked about what worked and what didn’t; they really pushed me to think back and reflect. I found this continuous loop of planning, doing, and reflecting to be crucial for my growth as an engineer.

“Let's ship this #$@!”

Undeniably, I learned a lot through navigating the engineering process, but what excited me the most was the project that I worked on.

My project at Semgrep involved adding path sensitivity, a feature that reduces false positives in code detection and makes rule writing easier, to the code analysis engine. For those unfamiliar, Semgrep Code is a static analysis tool for finding bugs and enforcing coding standards. It allows users to write rules that describe code patterns, which are then used for identifying potential issues and security vulnerabilities in code.

So... what problem does this new feature solve?

Below is a code snippet where fclose is mistakenly called on NULL. For context, when we call fopen, it returns either a file pointer or NULL to indicate failure. When we are done with the file, we should call fclose on the file pointer, but fclose crashes if we call it on NULL.

#include <stdio.h>

int f() {
  FILE * fp = fopen("foo.txt", "w");
  if (fp == NULL) {
    // bad: fclose(NULL) crashes  
    fclose(fp);
    return 1;
  } else {
    // ok
    fclose(fp);
    return 0;
  }
}

As always, when we see bad code, we write a Semgrep rule to catch it! In this case, we want to match fclose(fp) when fp is NULL. To write the rule, we could use features like pattern-inside and focus-metavariable, which would allow us to capture patterns syntactically in the then branch. However, that is not the same as capturing patterns semantically when the condition is true.

Currently, there is no straightforward way to extract and use information based on the program's control flow, that is, the different paths code can take through constructs like if-else statements. This ability to analyze code differently depending on which path it follows is essentially what path sensitivity is.
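To make the gap between "syntactically inside the then branch" and "semantically, the condition holds" more concrete, here is a small sketch of my own. The function g and its path parameter are hypothetical and not part of the original example. The NULL case is handled by falling through after an early return, so there is no enclosing if (fp == NULL) block for a syntactic pattern to anchor on, even though fp is NULL on every path that reaches the end of the function. The buggy call is left as a comment so the snippet stays safe to run.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical variant of f(): same bug shape, different syntax. */
int g(const char *path) {
  assert(path != NULL);
  FILE *fp = fopen(path, "r");
  if (fp != NULL) {
    /* ok: fp is known to be non-NULL on this path */
    fclose(fp);
    return 0;
  }
  /* On every path reaching this point, fp == NULL, but we are not
     syntactically inside any `if (fp == NULL) { ... }` block.
     Uncommenting the next line would repeat the bug from f(): */
  /* fclose(fp); */
  return 1;
}
```

A rule that only looks for fclose($FP) inside an if ($FP == NULL) { ... } block would miss the commented-out call, while an analysis that tracks which conditions hold along each path could, in principle, still see that fp is NULL there.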

How does it work?

Going back to the example, we know the first fclose is called on NULL due to the if condition, and thus we consider it bad. Conversely, we know that the second fclose is correct because it is in the else branch, which only executes when fp is not NULL.

The condition fp == NULL creates two distinct execution paths: one where the condition is true and one where it is false. To introduce path sensitivity into the engine, we need to distinguish between these paths and track the conditions. And we can do it all using control flow graphs!

A control flow graph is a representation of all the paths a program might take during execution. It's like a roadmap of the code, showing how different statements and conditions connect and influence the program's flow.

The example above translates roughly into the following control flow graph. By traversing this graph, we can then extract and store path information and use it to match patterns in the code.

[Figure: control flow graph for the example above]

Since the Semgrep codebase already generates control flow graphs for other data-flow analyses, my work primarily focused on leveraging these graphs to implement path sensitivity in our engine.
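As a rough illustration of the idea (and not how the Semgrep engine actually implements it), the sketch below hand-builds a tiny control flow graph for f() and walks it while carrying along what each branch tells us about fp. All names here (Fact, Node, example_cfg, walk, find_fclose_on_null) are made up for this post, and the walk assumes an acyclic graph; real code with loops would need a proper fixpoint computation.

```c
#include <assert.h>
#include <stddef.h>

/* What the current path tells us about fp. */
typedef enum { FP_UNKNOWN, FP_IS_NULL, FP_NOT_NULL } Fact;

typedef enum { NODE_ASSIGN, NODE_COND_FP_EQ_NULL, NODE_FCLOSE, NODE_RETURN } Kind;

typedef struct {
  Kind kind;
  int on_true;   /* successor if the condition holds, or the only successor; -1 if none */
  int on_false;  /* successor if the condition fails; -1 if unused */
} Node;

/* Hypothetical CFG for f() from the example above. */
static const Node example_cfg[] = {
  /* 0 */ { NODE_ASSIGN,          1, -1 },  /* fp = fopen(...)   */
  /* 1 */ { NODE_COND_FP_EQ_NULL, 2,  4 },  /* fp == NULL ?      */
  /* 2 */ { NODE_FCLOSE,          3, -1 },  /* then: fclose(fp)  */
  /* 3 */ { NODE_RETURN,         -1, -1 },  /* return 1          */
  /* 4 */ { NODE_FCLOSE,          5, -1 },  /* else: fclose(fp)  */
  /* 5 */ { NODE_RETURN,         -1, -1 },  /* return 0          */
};

/* Walk every path starting at `node`, carrying the fact learned so far.
   Marks flagged[i] = 1 for each fclose node reached while fp is known to be NULL. */
static void walk(const Node *cfg, int node, Fact fact, int *flagged) {
  if (node < 0) return;
  const Node *n = &cfg[node];
  switch (n->kind) {
  case NODE_ASSIGN:
    /* fp gets a new value, so anything we knew about it is discarded */
    walk(cfg, n->on_true, FP_UNKNOWN, flagged);
    break;
  case NODE_COND_FP_EQ_NULL:
    /* the condition splits the walk into two paths with different facts */
    walk(cfg, n->on_true, FP_IS_NULL, flagged);
    walk(cfg, n->on_false, FP_NOT_NULL, flagged);
    break;
  case NODE_FCLOSE:
    if (fact == FP_IS_NULL) flagged[node] = 1;
    walk(cfg, n->on_true, fact, flagged);
    break;
  case NODE_RETURN:
    break;
  }
}

/* Returns how many fclose nodes were flagged; `flagged` must hold n_nodes entries. */
int find_fclose_on_null(const Node *cfg, int n_nodes, int *flagged) {
  assert(cfg != NULL && flagged != NULL && n_nodes > 0);
  for (int i = 0; i < n_nodes; i++) flagged[i] = 0;
  walk(cfg, 0, FP_UNKNOWN, flagged);
  int count = 0;
  for (int i = 0; i < n_nodes; i++) count += flagged[i];
  return count;
}
```

Running find_fclose_on_null over example_cfg flags only node 2, the fclose in the then branch, because that is the only fclose reached along the path where fp == NULL holds.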

What’s the result?

At the end of the internship, I managed to ship this feature under a feature flag! I had a working implementation quite early on, but there were many diverging opinions that needed to be debated and reconciled.

Some of the discussions we had were about performance in terms of time and memory, where to store the extracted information, and whether my code affected other parts of the codebase. Navigating these ideas taught me a lot, and I realized that debates, arguments, and complaints are not inherently bad and often lead to new and better ideas. This experience highlighted one of Semgrep's core values: Embrace Debate.

Now, for the grand finale: you can write rules like the one below that match fclose(fp) whenever fp equals NULL in a given control-flow path!

rules:
- id: fclose-null-ptr
  severity: WARNING
  languages:
    - cpp
  message: |
    Should only call fclose on fp if fp is not null
  match:
    pattern: fclose($FP);
    where:
      - comparison: $FP == None # python syntax

It also works with taint mode, as shown in the following example where we catch instances of forgetting to call fclose after calling fopen, but only when fopen successfully opens a file.

rules:
  - id: fclose-return-condition-taint
    severity: WARNING
    languages:
      - c
    message: |
      Should call fclose after fopen
    mode: taint
    pattern-sources:
      - by-side-effect: true
        patterns:
          - pattern: FILE *$FP = fopen(...);
          - focus-metavariable: $FP
    pattern-sanitizers:
      - by-side-effect: true
        patterns:
          - pattern: fclose($FP);
          - focus-metavariable: $FP
    pattern-sinks:
      - at-exit: true
        pattern-either:
          - patterns:
            - pattern: return $X;
            - metavariable-comparison:
                comparison: fp != None # gives false positive without the comparison

int test() {
    FILE *fp = fopen("test.txt", "r");
    if (fp == NULL) {
        // ok: fclose-return-condition-taint
        return 1; // false positive
    } else {
        // ruleid: fclose-return-condition-taint
        return 0;
    }
}

int test2() {
    FILE *fp = fopen("test.txt", "r");
    if (fp == NULL) {
        // ok: fclose-return-condition-taint
        return 1; // false positive
    } else {
        fclose(fp);
        // ok: fclose-return-condition-taint
        return 0;
    }
}

Note that my project is still under a feature flag and does not work in the playground at the time of writing this blog.

And that’s it for my project (for now). What I loved most about this project was the opportunity to apply programming language research concepts, such as control flow graphs and constant propagation, in a practical setting.

On a similar note, I found it fascinating that despite Semgrep's growth in scale and customer base, the company remains deeply focused on engineering and research. During my time here, I participated in Hack Week, a Semgrep tradition where we minimized meetings and spent a week focused solely on hacking on challenging projects. I had the freedom, and was even encouraged, to work on something different from my main project, one involving both frontend and backend work. I was amazed by how much everyone managed to finish in a week!

Additionally, I loved how excited my peers were about new ideas and research. They shared and discussed cool papers, like this one on distributed data-flow analysis, that could potentially help improve the product.

“What if we push a Waymo onto the Golden Gate Bridge?”

It is true that we spent a lot of time talking about programming languages, research, and functional programming. While these conversations were super interesting (and I admittedly didn’t always understand everything), some of the most memorable moments came from the random conversations we had during lunch.

We debated whether Flat Stanley could fit in an envelope, which temperature scale—Celsius or Fahrenheit—was superior, and, of course, what would happen if we pushed a Waymo onto the Golden Gate Bridge. If I had a nickel for every time we talked about pushing a Waymo onto the Golden Gate Bridge, I'd have two nickels. Which isn't a lot, but it's weird that it happened twice.

Besides the fun lunch conversations, we also spent time together outside of work. We did a lot of fun activities like going to escape rooms, getting boba, and having team lunches at Chinese restaurants. It was funny how we always ended up at Chinese restaurants, with the Chinese speakers translating the menu items for everyone else.

[Photo: We escaped!]

[Photo: PA (Program Analysis) guild lunch]

“Clap out on Semgrep!”¹

Through this experience, I have learned and grown a lot both as a person and as an engineer, and I really enjoyed working at Semgrep. Everyone at Semgrep genuinely cares about both the product and each other, and I am so glad that this is my first ever SWE internship.

Although I can still think of many things to write about, I assume you don’t have more time to spare reading about some stranger’s internship experience, so I will stop here. However, if you are interested in more perspectives on interning at Semgrep, you should definitely read the blogs written by Charissa and Vivek. For those who are more interested in the development of the Semgrep engine, there are many other blogs right here that you might like!

Finally, thanks to everyone who helped along the way, especially to those whose quotes added a touch of personality and wisdom to this blog.


¹ At Semgrep, we have a tradition of clapping out on something, like a new hire’s name, at the end of meetings. This practice has many different origin stories, but its main purpose is to avoid the awkwardness of ending a large meeting. Plus, it's a fun way to wrap things up!
