How to upgrade the grammar for a language
Like for adding a language, most of these instructions happen in
Let's call our language "X".
- Update submodule tree-sitter-X.
lang/, ask an r2c developer to run
- In the semgrep repo, update submodule semgrep-X.
- In the semgrep repo, update the OCaml code that maps the CST to the generic AST.
In the end, make sure the generated code used by the main branch of semgrep can be regenerated from the main branch of ocaml-tree-sitter:
- Merge your semgrep branch.
- Merge your ocaml-tree-sitter branch.
Here are the main components:
- the OCaml code generator: generates OCaml parsing code from tree-sitter grammars extended with ... and such. Publishes code into the git repos of the form
- the original tree-sitter grammar tree-sitter-X, e.g., tree-sitter-ruby: the original tree-sitter grammar for the language. This is the git submodule lang/semgrep-grammars/src/tree-sitter-X in ocaml-tree-sitter. It is installed at the project's root in
- syntax extensions to support semgrep patterns, such as ellipses (...) and metavariables ($FOO). This is lang/semgrep-grammars/src/semgrep-X. It can be tested from that folder with make && make test.
- an automatically-modified grammar for language X in lang/X. It is modified to accommodate various requirements of the ocaml-tree-sitter code generator. lang/X/ocaml-src contains the C/C++/OCaml code that will be published into semgrep-X (e.g. semgrep-ruby) and used by semgrep.
- semgrep-X: provides generated OCaml/C parsers as a dune project. It is a submodule of semgrep.
- semgrep: uses the parsers provided by semgrep-X, which produce a CST. The program's CST or pattern's CST is further transformed into an AST suitable for pattern matching.
Make sure the above is clear in your mind before proceeding further. If you have questions, the best way to get answers is to reach out on the Semgrep Community Slack channel.
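To keep these locations straight, the paths named in the list above can be printed for a given language. This is only a minimal sketch: the function name component_paths is hypothetical and not part of ocaml-tree-sitter, and the paths are the ones listed above, relative to the ocaml-tree-sitter root.

```shell
#!/bin/sh
# component_paths is a hypothetical helper, not part of ocaml-tree-sitter.
# It prints where each component for a language lives, relative to the
# ocaml-tree-sitter root, using only the paths listed in this document.
component_paths() {
  lang="$1"
  if [ -z "$lang" ]; then
    echo "usage: component_paths LANG" >&2
    return 1
  fi
  printf 'original grammar:    lang/semgrep-grammars/src/tree-sitter-%s\n' "$lang"
  printf 'semgrep extensions:  lang/semgrep-grammars/src/semgrep-%s\n' "$lang"
  printf 'modified grammar:    lang/%s\n' "$lang"
  printf 'generated code:      lang/%s/ocaml-src\n' "$lang"
}
```

For example, component_paths ruby lists the four locations for Ruby, which can help when deciding which repository or folder a given change belongs to.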
Make sure the grammar.js file or equivalent source files defining the grammar are included in the fyi.list file in
Why: It is important for tracking and understanding the changes made at the source.
How: See How to add support for a new language.
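A quick way to check this is to search the list for the grammar source. The sketch below rests on two assumptions that are not confirmed here: that fyi.list contains one path per line, and that you pass its location yourself as the first argument. The name listed_in_fyi is hypothetical.

```shell
#!/bin/sh
# listed_in_fyi is a hypothetical helper. ASSUMPTION: the list file
# holds one path per line. Returns 0 if some line is exactly NAME or
# ends with /NAME, 1 if not found, 2 if the list cannot be read.
listed_in_fyi() {
  list_file="$1"
  wanted="$2"
  [ -r "$list_file" ] || { echo "cannot read $list_file" >&2; return 2; }
  # The "|| [ -n "$line" ]" part handles a last line without a newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      "$wanted" | */"$wanted") return 0 ;;
    esac
  done < "$list_file"
  return 1
}
```

A call such as listed_in_fyi path/to/fyi.list grammar.js, with the real location substituted for the placeholder path, succeeds only if the grammar source is listed.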
Upgrade the tree-sitter-X submodule
Say you want to upgrade (or downgrade) tree-sitter-X from some old commit to commit 602f12b. This follows the standard git submodule workflow, with nothing unusual. The commands might look something like this:
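git submodule update --init --recursive --depth 1
git checkout -b upgrade-X
# Move into the submodule so the next commands act on tree-sitter-X
cd lang/semgrep-grammars/src/tree-sitter-X
git fetch origin --unshallow
git checkout 602f12b
# Return to the previous directory
cd -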
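After the checkout, it can be worth confirming that the submodule really sits at the intended commit. The helper below is a sketch with a hypothetical name, commit_matches; it simply compares a full commit hash against an abbreviated one such as 602f12b.

```shell
#!/bin/sh
# commit_matches is a hypothetical helper: succeeds if FULL_HASH
# starts with PREFIX (an abbreviated commit id), fails otherwise.
commit_matches() {
  full_hash="$1"
  prefix="$2"
  [ -n "$prefix" ] || return 1
  case "$full_hash" in
    "$prefix"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Possible usage from the ocaml-tree-sitter root (not run here):
#   head="$(git -C lang/semgrep-grammars/src/tree-sitter-X rev-parse HEAD)"
#   commit_matches "$head" 602f12b && echo "submodule is at 602f12b"
```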
First, build and install ocaml-tree-sitter normally, based on the instructions found in the main README.
Then, build support for your language in lang/. The following commands will build and test the language:
Check the generated code for the presence of Blank nodes. Those correspond to missing tokens.
grep Blank lang/X/ocaml-src/lib/CST.ml
If anything comes up, you must modify the grammar to create a named rule for each node of the Blank kind. Eventually, the generated CST.ml should no longer contain Blank nodes but token types instead. If a Blank node exists, we won't be able to get a token or its location at parsing time.
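The check above can be wrapped so that it fails loudly, for example in a script run before committing. This is a sketch: check_no_blank is a hypothetical name, and it relies only on the grep shown above.

```shell
#!/bin/sh
# check_no_blank is a hypothetical wrapper around the grep shown above.
# Returns 0 if the file has no "Blank", 1 if it does (printing the
# matching lines with their line numbers), 2 if the file is unreadable.
check_no_blank() {
  cst_file="$1"
  [ -r "$cst_file" ] || { echo "cannot read $cst_file" >&2; return 2; }
  if grep -n 'Blank' "$cst_file"; then
    echo "Blank nodes found in $cst_file: add named rules to the grammar" >&2
    return 1
  fi
  echo "no Blank nodes in $cst_file"
}
```

It would typically be called as check_no_blank lang/X/ocaml-src/lib/CST.ml, with X replaced by the language name.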
If this works, we're all set. Commit the updated tree-sitter-X submodule reference:
git commit lang/semgrep-grammars/src/tree-sitter-X
git push origin upgrade-X
Then make a pull request to merge this into ocaml-tree-sitter's main branch. It's ok to merge at this point, even if the generated code hasn't been exported (Publishing section below) or if you haven't done the necessary changes in semgrep (Semgrep integration below).
We can now consider publishing the code to semgrep-X.
Please ask someone at r2c to run this step.
From the lang folder of ocaml-tree-sitter, we'll perform the release. This step redoes some of the work that was done earlier and checks that everything is clean before committing and pushing the changes to semgrep-X.
./release --dry-run X # dry-run release
... # 'git status' will show changes for language X
./release X # commits and pushes to semgrep-X
This step is safe. Semgrep at this point is unaffected by those
changes. There is now a new commit at
contains original files from which the code was generated.
shows the last change for each file, allowing you to check that you got the correct version of grammar.js or some other source file.
From the semgrep repository, point the submodule for semgrep-X to the latest commit from the "Publishing" step. Then rebuild semgrep-core, which will normally fail if the grammar changed. If the source grammar.js was included in the fyi folder for semgrep-X (as it
git diff HEAD^ should help figure out the changes since the
The main difficulty is understanding how the different git projects interact and avoiding mistakes when dealing with git submodules, which takes a bit of practice.