The following is a guest post from Hannes Mehnert, a programmer at Robur, who has been helping translate the Semgrep OSS from Python to OCaml.
Semgrep was initially developed partly in Python (network and system code, HTTP interaction with the Semgrep Cloud Platform) and partly in OCaml (static analysis). This split creates a maintenance burden, and optimization opportunities are missed because data has to be serialized to be passed between the two languages.
We at robur started to work on migrating the Python codebase to OCaml in April 2023 since we have a lot of experience working on systems code and OCaml, including MirageOS - a single-purpose operating system developed entirely in OCaml. The pure OCaml version of semgrep can be used by passing the --experimental flag to semgrep since version 1.29.
'HTTP request failed' issue in OCaml
While migrating the codebase and dogfooding it in CI, we encountered the following error:
Failed to download config from https://semgrep.dev/c/p/ocaml: HTTP request failed: connection failed: timeout
Locally, running the same hook worked fine, i.e., this issue could not be reproduced.
Taking a step back, the CI uses pre-commit in a GitHub action and executes the just-compiled Semgrep on the Semgrep codebase itself. This is real dogfooding: use the just-compiled version of Semgrep and check whether it works (and doesn’t report any findings) on the entire Semgrep codebase.
Taking another step back: technically, the CI uses pre-commit, which runs as a GitHub action, pulls the returntocorp/semgrep:develop Docker image, and runs it. So this is a pretty complex setup: inside a GitHub action (a Docker container), the pre-commit program (a Python program) is launched, which then executes another Docker container.
Some weeks into the project, we wanted to migrate the CI system to use the pure OCaml Semgrep, replacing the Python version.
Getting to the root of the issue
When there is a timeout, let’s try to increase the number of seconds. This was attempted with https://github.com/returntocorp/semgrep/pull/7984 (bumping from 1 second for DNS resolution to 2 seconds). But this didn’t solve the issue. Next, we increased the log verbosity and found the following output:
[00.11][DEBUG]: trying to download from https://semgrep.dev/c/p/ocaml
[00.12][DEBUG]: connect: id 1 host semgrep.dev
[00.12][DEBUG]: timer
[00.12][DEBUG]: timer 0 actions
[00.12][DEBUG]: connect_ip id 2 dsts 168.63.129.16, 168.63.129.16
[00.12][DEBUG]: timer
[00.12][DEBUG]: timer 0 actions
[00.13][DEBUG]: timer
...
[06.14][DEBUG]: timer
[06.14][DEBUG]: timer 0 actions
[06.15][DEBUG]: timer
[06.15][DEBUG]: timer 1 actions
[06.15][DEBUG]: HTTP request failed: connection failed: timeout
Fatal error: exception Error.Semgrep_error("Failed to download config from https://semgrep.dev/c/p/ocaml: HTTP request failed: connection failed: timeout", 0)
We dug a bit deeper into the packages involved. The OCaml version of Semgrep utilizes http-lwt-client for downloading its configuration. It is an HTTP client written in OCaml, utilizing httpaf (for HTTP/1.1) and h2 (for HTTP/2). It supports HTTPS using our tls stack (see the USENIX Security 2015 paper). But before using HTTPS, it needs to establish a TCP connection to the remote server.
Establishing a TCP connection to a remote server used to be straightforward in the old days, but nowadays, with multiple Internet protocol versions (IPv4 and IPv6) around, and multi-homed servers (with multiple IP addresses), it is no longer easy. The Internet Engineering Task Force (IETF) wrote up a Request For Comments (RFC 8305) about that, titled “Happy Eyeballs Version 2”. The abstract is worth reading: “Many communication protocols operating over the modern Internet use hostnames. These often resolve to multiple IP addresses, each of which may have different performance and connectivity characteristics. Since specific addresses or address families (IPv4 or IPv6) may be blocked, broken, or sub-optimal on a network, clients that attempt multiple connections in parallel have a chance of establishing a connection more quickly. This document specifies requirements for algorithms that reduce this user-visible delay and provides an example algorithm, referred to as ‘Happy Eyeballs’.”
The RFC nicely describes what should happen, but there’s no formal specification (i.e., no state machines, no statement of which operations need which timeouts, no test suite). We had previously implemented happy-eyeballs in OCaml to easily “connect to a remote host”, independent of the network setup (IPv4 only, IPv6 only, broken IPv6 setup with working IPv4).
Happy eyeballs basically resolves the hostname to IPv6 and IPv4 addresses and then attempts to connect to them, with a slight preference (50 ms) for IPv6. The principle is to establish connections quickly, potentially wasting some network data (starting to open multiple connections instead of one-by-one) to ensure a timely established connection (or failure).
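The staggering described above can be sketched as a pure function that gives every candidate address a start offset. This is only an illustration of the idea, not code from the happy-eyeballs library: the `family` type and the functions `prefer_v6` and `schedule` are hypothetical names, addresses are plain strings, and using the 50 ms from the text as the gap between all attempts is an assumption.

```ocaml
(* Hypothetical, simplified model: an address is a family tag plus a string. *)
type family = V6 | V4

(* Put IPv6 candidates first. List.partition keeps the resolver's order
   within each family. *)
let prefer_v6 candidates =
  let v6, v4 = List.partition (fun (fam, _) -> fam = V6) candidates in
  v6 @ v4

(* Give each candidate a start offset in milliseconds: the first attempt
   starts immediately, every later one [delay_ms] after its predecessor.
   The default of 50 ms is an assumption taken from the text. *)
let schedule ?(delay_ms = 50) candidates =
  List.mapi (fun i (_, addr) -> (i * delay_ms, addr)) (prefer_v6 candidates)

let () =
  schedule [ (V4, "192.0.2.1"); (V6, "2001:db8::1") ]
  |> List.iter (fun (t, addr) -> Printf.printf "t=%dms connect %s\n" t addr)
```

The point of the schedule is that a later attempt never waits for an earlier one to fail; it only waits for its own start offset.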
The first step, “resolving the hostname”, is pretty tricky. Conventionally on a Unix system, there is a /etc/resolv.conf file with multiple resolvers listed (by IP address to avoid circularity). Resolving names uses the DNS protocol, which used to use only UDP and TCP on port 53, but recently DNS-over-TLS was standardized for privacy (on TCP port 853), so that a network observer can’t track which hostnames you attempt to connect to. We also have a DNS implementation purely in OCaml, which supports DNS-over-TLS, and this is used in our happy-eyeballs implementation.
So, we parsed the resolv.conf file and attempted to connect to the nameservers listed there, first on port 853 (privacy by default) and, if that fails, on port 53. For this, we used the lower part of the happy-eyeballs implementation, the “attempt to connect”, since it is the same mechanism: as input, we have a list of IP addresses and port numbers, with a slight preference for one port; as output, we expect an established connection or a failure.
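As a rough sketch of that input list, the following reads the nameserver entries out of the text of a resolv.conf file and pairs each address with port 853 first and port 53 second. It uses only the OCaml standard library and is not the parser used by dns-client; the function names `nameservers` and `destinations` are hypothetical, and only plain `nameserver <ip>` lines are handled.

```ocaml
(* Extract nameserver addresses from the contents of a resolv.conf file.
   Comments and other directives (search, options, ...) are ignored. *)
let nameservers contents =
  String.split_on_char '\n' contents
  |> List.filter_map (fun line ->
         let words =
           (* treat tabs like spaces, then split into non-empty words *)
           String.map (fun c -> if c = '\t' then ' ' else c) line
           |> String.trim
           |> String.split_on_char ' '
           |> List.filter (fun w -> w <> "")
         in
         match words with
         | "nameserver" :: ip :: _ -> Some ip
         | _ -> None)

(* For each nameserver, prefer DNS-over-TLS (853), then plain DNS (53). *)
let destinations ips =
  List.concat_map (fun ip -> [ (ip, 853); (ip, 53) ]) ips

let () =
  "# example\nnameserver 168.63.129.16\nsearch example.org\n"
  |> nameservers |> destinations
  |> List.iter (fun (ip, port) -> Printf.printf "%s:%d\n" ip port)
```

For a single nameserver this yields the same address twice, once per port, which matches the two identical destinations seen in the debug log.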
Then, we looked at the above-mentioned log output:
[00.12][DEBUG]: connect: id 1 host semgrep.dev
[00.12][DEBUG]: connect_ip id 2 dsts 168.63.129.16, 168.63.129.16
In the first line, happy-eyeballs is asked to establish a connection to semgrep.dev. The second line is the DNS client establishing a connection to the resolver (internal, and only reachable from within Azure, where GitHub Actions are executed): once on port 853, once on port 53 (the ports are omitted from the log, but that log output was improved with the new happy-eyeballs release).
Establishing a connection to the resolver never succeeded; it ran into a timeout. How strange is that? And isn’t happy-eyeballs there to avoid such an issue and quickly fall back to the other IP/port?
The issue was rather isolated - somewhere in happy-eyeballs during DNS resolution - but how can we reproduce it with a small setup, and why did it not occur previously?
Reproducing the issue
A first test was to use GitHub actions and download a website from the http-lwt-client repository. This was a success. Adding verbosity showed that nameserver 127.0.0.53 was used (systemd-resolved). But wait - the semgrep executed by pre-commit has a slightly different setup and a different resolver. Retrying with a GitHub action that uses dig (from the BIND project) to resolve a host at 168.63.129.16 via (a) UDP (b) TCP (c) DNS-over-TLS showed that (a) and (b) work, but (c) times out. Nice!
A step forward: we were able to adapt http-lwt-client to accept a nameserver IP address and run that as a GitHub action, which then resulted in a failure for 168.63.129.16.
A brief excursion to TCP, which is a complex protocol: a connection attempt sends a single IP packet to the remote host and then waits for a reply. A host can reply with a negative answer (“this port is closed”) or not reply at all (dropping the packet). The latter should lead to a timeout. And now, we’re at the core of the issue: happy-eyeballs expected only the former failure case and didn’t handle the latter one properly. In addition, the connection attempts were done in sequence instead of concurrently, with one special case: if it was an IPv6 address, the attempt was canceled after 200 ms (instead of the default connect timeout of 10 seconds). Another issue was that the timeouts weren’t propagated clearly from happy-eyeballs to DNS.
Our OCaml happy-eyeballs implementation attempted to connect via DNS-over-TLS to the resolver and waited for the usual connect timeout, since no negative reply was received. Also, it did not attempt to connect to the same DNS server via TCP concurrently. Thus, the HTTP connection attempt resulted in the timeout from connecting to the DNS resolver.
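The difference between the two strategies can be made concrete with a small timing model. This is a simulation under stated assumptions, not the library’s code: the `outcome` type and the functions `finish`, `sequential`, and `staggered` are hypothetical, all times are in milliseconds, the 10 second connect timeout comes from the text, and the 50 ms gap and 20 ms handshake time are made-up values.

```ocaml
(* How one connection attempt ends, measured from its own start. *)
type outcome =
  | Connects_after of int (* handshake completes after this many ms *)
  | Refused_after of int  (* "port closed" reply arrives after this many ms *)
  | Dropped               (* no reply at all: only a timeout ends the attempt *)

(* When an attempt started at [start] ends, and whether it succeeded. *)
let finish ~connect_timeout start = function
  | Connects_after d when d < connect_timeout -> (start + d, true)
  | Refused_after d when d < connect_timeout -> (start + d, false)
  | _ -> (start + connect_timeout, false)

(* One attempt at a time: the next starts only once the previous has ended. *)
let sequential ~connect_timeout outcomes =
  let rec go now = function
    | [] -> None
    | o :: rest ->
        (match finish ~connect_timeout now o with
         | t, true -> Some t
         | t, false -> go t rest)
  in
  go 0 outcomes

(* Attempt i starts at i * delay regardless of earlier attempts;
   the earliest success wins. *)
let staggered ~connect_timeout ~delay outcomes =
  List.mapi (fun i o -> finish ~connect_timeout (i * delay) o) outcomes
  |> List.filter_map (fun (t, ok) -> if ok then Some t else None)
  |> List.fold_left
       (fun best t -> match best with Some b when b <= t -> best | _ -> Some t)
       None

let show = function Some t -> string_of_int t ^ " ms" | None -> "never"

let () =
  (* Port 853 silently drops packets, port 53 answers after 20 ms. *)
  let resolver = [ Dropped; Connects_after 20 ] in
  print_endline ("sequential: " ^ show (sequential ~connect_timeout:10_000 resolver));
  print_endline ("staggered:  " ^ show (staggered ~connect_timeout:10_000 ~delay:50 resolver))
```

Under these assumed numbers, the sequential strategy reaches the resolver after 10020 ms, while the staggered one is connected after 70 ms. Any caller with a deadline shorter than the connect timeout only ever sees the timeout in the sequential case.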
Fixing happy-eyeballs
Looking at the fix in more detail: first, the interface of happy-eyeballs barely changed. Note that happy-eyeballs uses “action” for something the effectful layer (e.g. lwt) needs to do (i.e. an “action” is only ever returned), and “event” for something the effectful layer observed and informs happy-eyeballs about (i.e. an “event” is always input). The action Connect_cancelled is removed, and the action Connect gets an integer value: the attempt count (to distinguish two connection attempts for the same host). The v6_connect_timeout was no longer useful and is gone; instead, a connect_delay is introduced: how much of a head start a connection attempt gets before the next one is started.
In the same diff, the log messages have been improved - so actually, the port numbers aren’t dropped anymore :).
Canceling connection attempts (including closing the file descriptors) now needs to be done by the effectful layer (happy-eyeballs-lwt) once a connection is successful or failed. The rest of the changes in that commit deal with the internals - now multiple connection attempts can be concurrently running.
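To illustrate the split between events (input) and actions (output), here is a deliberately tiny pure state machine in that style. It is not the happy-eyeballs API: the types `action`, `event`, and `state` and the functions `init` and `step` are hypothetical simplifications, DNS resolution is left out, and the effectful layer is assumed to send a `Timer` event whenever the connect delay has elapsed.

```ocaml
type dst = string * int (* IP address and port *)

(* Returned to the effectful layer, which performs them. *)
type action =
  | Connect of int * dst (* attempt count, destination *)
  | Connect_failed       (* every destination was tried and failed *)

(* Observed by the effectful layer and fed back in. *)
type event =
  | Timer                    (* the connect delay has elapsed *)
  | Connected of int         (* the attempt with this count succeeded *)
  | Connection_failed of int (* the attempt with this count failed *)

type state = {
  todo : dst list;      (* destinations not yet tried *)
  next_attempt : int;
  in_flight : int list; (* attempts currently running *)
  finished : bool;
}

let init dsts = { todo = dsts; next_attempt = 0; in_flight = []; finished = false }

(* Start the next attempt, if any destination is left. *)
let start_next st =
  match st.todo with
  | [] -> (st, [])
  | d :: rest ->
      let n = st.next_attempt in
      ( { st with todo = rest; next_attempt = n + 1; in_flight = n :: st.in_flight },
        [ Connect (n, d) ] )

(* Pure transition: no I/O happens here. Closing the sockets of the other
   attempts after a success is left to the effectful layer. *)
let step st ev =
  if st.finished then (st, [])
  else
    match ev with
    | Timer -> start_next st
    | Connected _ -> ({ st with finished = true; todo = []; in_flight = [] }, [])
    | Connection_failed n ->
        let st = { st with in_flight = List.filter (fun m -> m <> n) st.in_flight } in
        if st.todo <> [] then start_next st
        else if st.in_flight = [] then ({ st with finished = true }, [ Connect_failed ])
        else (st, [])
```

Because `step` is a pure function, several attempts can be in flight at once without the state machine touching a socket, and the same core can be driven by different effectful layers.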
Similarly, the dns-client diff deals with the resource canceling and the revised API of happy-eyeballs (this is in line with the changes to happy-eyeballs-lwt and happy-eyeballs-mirage).
Fortunately, the interfaces of happy-eyeballs-lwt and dns-client-lwt did not change at all, so any user thereof (including http-lwt-client) does not need to be changed – just make sure the latest happy-eyeballs release is used (e.g. https://github.com/returntocorp/semgrep/pull/8068).
Excursion: Why OCaml
As mentioned above, the entire stack (from DNS over happy-eyeballs and HTTP to TLS) is developed in OCaml. The reason behind this is that we at robur are working on MirageOS unikernels – custom-tailored applications that are executed as a virtual machine without a Linux kernel but only with an OCaml runtime. The advantages are: improved security (fewer attack vectors, drastically decreased attack surface), lowered complexity, and reduced carbon footprint, with the path paved to be able to formally verify the correctness of such a unikernel.
As a byproduct of MirageOS, we develop various system libraries in OCaml that can be used outside of MirageOS, such as http-lwt-client. We also develop reproducible build infrastructure for opam packages, where we supply binary packages for common Linux distributions and FreeBSD.
Of course, an alternative for Semgrep, for example, would be to just shell out to curl. But then DNS-over-TLS wouldn’t be used, and for user-reported issues, the curl version being used would need to be inspected (or semgrep would need to ship a curl itself, which adds maintenance burden again).
Conclusion
This article is a journey from “HTTP request failed: connection failed: timeout” to the bugfix in about 2000 words, covering the internals of how to establish HTTP client connections in the modern internet. We hope it sheds some light on the issue at hand, on how to debug systems code, and on the value of being in charge of the entire stack. If you have feedback or questions, please join the Semgrep Community Slack!