Tl;dr: In this blog post, you can read just how much of a mess Java XML security is. If by the end of this post, you feel confident parsing XML in Java without any tools or cheat sheets, I have done a terrible job conveying our findings.
The XML standard and the Java programming language have been around for a long time. It is well-known that processing XML can expose applications to a number of vulnerabilities, most famously XML External Entity (XXE) attacks.
As a result, the Java XML APIs have been improved over the years with various security features. But the large variety of Java XML APIs, combined with the wide array of security features that are now available, makes it difficult to know how each parser should be secured.
My coworker, Vasilii Ermilov, and I thoroughly tested 10 different classes from three XML processing interfaces (DOM, SAX, StAX). Because we did not want to rely on documentation claims that are clearly incomplete and could even be wrong, we attempted to exploit each parsing method in combination with 16 different security features. To do this, we crafted 10 different attack payloads, each exploiting a different attack vector. You might be surprised (or saddened) by what we found:
Security-related flags and options are inconsistently available across various XML parsing methods.
The same security-related settings that work on some classes… don’t work on others!
Parsing XML in Java
The first Java API for XML Processing (JAXP) 1.0 was released as part of the Java Standard Edition (SE) 1.2 in 1998. Today, in 2022, there are a number of interfaces available to process XML content: the Document Object Model (DOM) interface, the Simple API for XML (SAX) interface, the Streaming API for XML (StAX) interface, and the XML Stylesheet Language for Transformations (XSLT) interface.
The full list of classes that we researched can be found in our research project on GitHub. Each of these APIs has been improved with a number of features that can help prevent XML-related attacks.
To understand what each feature does, and which one to use, we first must understand a few of the different types of XML-related attacks. Our two main concerns are exponential entity expansion and external entity injection. Both are related to the way XML documents allow embedding of external content. There are a lot of ways external resources can be referenced in XML documents, XML Schemas, and XSLT stylesheets.
External content can be loaded through an External Document Type Definition (DTD), an External Entity Reference to external data, a General Entity reference, or an External Parameter Entity reference. Additional XML code can also be included through `XInclude`, or references to XML Schema components using the `schemaLocation` attribute of `include` elements. In stylesheets, multiple sheets can be combined using the `xsl:include` element, the `?xml-stylesheet` processing instruction, or the `document()` function. We crafted a payload for each of these attack vectors; you can find them on GitHub.
Exponential entity expansion
Exponential entity expansion happens when there are several layers of nested entities that each refer to a number of other entities. This type of attack is also known as an XML bomb or billion laughs attack. As an example, here is the XML bomb payload we used in our research project on GitHub.
```xml
<?xml version="1.0"?>
<!DOCTYPE lolz [
 <!ENTITY lol "lol">
 <!ELEMENT lolz (#PCDATA)>
 <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
 <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;">
 <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
 <!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;">
 <!ENTITY lol5 "&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;">
 <!ENTITY lol6 "&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;">
 <!ENTITY lol7 "&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;">
 <!ENTITY lol8 "&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;">
 <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;">
]>
<lolz>&lol9;</lolz>
```
When such a document is parsed, the parser will expand each of the nested references. The expanded content grows exponentially with the nesting depth, which can lead to the parser consuming most or all of the resources of the server and so to a denial of service (DoS). We found that in Java, such excessive use of resources throws an exception. This exception is thrown after expanding 64000 entities, a limit set by the JDK. When this exception is carefully caught and handled, it will prevent an exponential entity expansion attack, so no explicit security measures need to be taken to secure your application from this type of attack. We tested this and found consistent behavior for all recent versions of Java (17, 18, 19) and all older versions with long-term support (8, 11).
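To make the "carefully caught and handled" part concrete, here is a minimal sketch of parsing such a payload with an unconfigured `DocumentBuilderFactory` and treating the parser's exception as a rejected document. The class and method names (`XmlBombDemo`, `bomb`, `isRejected`) are hypothetical and only serve the illustration; what a real application does in the `catch` block is up to you.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;

final class XmlBombDemo {

    /** Builds a nested-entity payload with the same shape as the one shown above. */
    static String bomb() {
        StringBuilder sb = new StringBuilder("<?xml version=\"1.0\"?>\n<!DOCTYPE lolz [\n");
        sb.append(" <!ENTITY lol \"lol\">\n <!ELEMENT lolz (#PCDATA)>\n");
        for (int level = 1; level <= 9; level++) {
            String previous = (level == 1) ? "lol" : "lol" + (level - 1);
            sb.append(" <!ENTITY lol").append(level).append(" \"");
            for (int i = 0; i < 10; i++) {
                sb.append('&').append(previous).append(';');
            }
            sb.append("\">\n");
        }
        return sb.append("]>\n<lolz>&lol9;</lolz>").toString();
    }

    /** Returns true if the parser aborted with an exception instead of finishing the parse. */
    static boolean isRejected(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return false;
        } catch (SAXException e) {
            // Handle the failure here (log it, reject the request) instead of letting it
            // take down the calling thread.
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("bomb rejected: " + isRejected(bomb()));
    }
}
```

The sketch catches the general `SAXException` so that any parse failure, not only the entity limit, ends up on the same rejection path.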
XML External Entity Injection
XML External Entity Injection happens when the system identifier for some of the external content contains data controlled by an attacker. The parser will then dereference this identifier containing attacker-controlled XML code. This can lead to the disclosure of confidential data if the identifier supplied by the attacker is something like `file:///etc/passwd`. In other cases, XXE payloads can be used to upload code files that can later be triggered in remote code execution attacks, like in this CVE where the identifier referenced a Java code archive (`jar:http://10.0.220.200:9090/xxe-upload-test.jar!/myfile.txt`). In PHP, the right identifier by itself can even cause arbitrary code execution when the `expect` module is loaded. In that case, a pseudo-URI like `expect://cmd` will execute `cmd` and return the output of the command. With some workarounds to avoid characters that are not valid in URLs, remote code execution can be achieved.
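As an illustration of the file disclosure case, the sketch below parses a document whose entity points at a local file, using a `DocumentBuilderFactory` with no security settings at all. To keep it harmless, it reads a temporary file that it creates itself rather than `/etc/passwd`. The class name `XxeDemo`, its helper methods, and the `s3cret` file content are all made up for this example.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XxeDemo {

    /** Builds a document whose entity "xxe" points at the given system identifier. */
    static String payload(String systemId) {
        return "<?xml version=\"1.0\"?>\n"
             + "<!DOCTYPE data [ <!ENTITY xxe SYSTEM \"" + systemId + "\"> ]>\n"
             + "<data>&xxe;</data>";
    }

    /** Parses with an unconfigured factory and returns the text of the root element. */
    static String parseWithDefaults(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    /** Writes a stand-in secret to a temp file and tries to read it back through the entity. */
    static String leakTempFile() throws Exception {
        Path secret = Files.createTempFile("xxe-demo", ".txt");
        try {
            Files.write(secret, "s3cret".getBytes(StandardCharsets.UTF_8));
            return parseWithDefaults(payload(secret.toUri().toString()));
        } finally {
            Files.deleteIfExists(secret);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("entity resolved to: " + leakTempFile());
    }
}
```

If the parser dereferences the identifier, the file content ends up in the parsed document, which is exactly what an attacker needs to read it back out.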
Researching security features
To protect your application, it is best to disable all ways that external content can be included if you do not need them. But there are quite a few methods to disable them.
Vasilii and I created attack payloads for each of the 10 ways to include external content into XML documents. We tested 10 classes and there are 16 security features to test, resulting in 160 parser configurations. To test these configurations, we tried to parse each of the 10 payloads and verified whether or not external requests were made. Thus, we ran 1600 tests! The full table of results can be found in our Java XXE Cheatsheet. But some noteworthy results are shown in the figure and discussed here.
Feature for Secure Processing
Feature for Secure Processing (FSP) is considered the central mechanism for secure XML processing. It is defined as `javax.xml.XMLConstants.FEATURE_SECURE_PROCESSING`. According to the docs, this feature is turned on by default for the DOM and SAX parsers and XML Schema validators but turned off for transformers and XSLT. We found that it is also turned off by default for `SchemaFactory`, which we did not expect based on the documentation. This feature is available for all of the 10 classes we tested and can be turned on with `setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true)`.
For a number of classes, this feature is able to protect the parser from all available payloads. Disappointingly, however, this is not the case for all of the classes. In fact, for the other classes, this feature has no effect on security at all.
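As a rough sketch of what turning on FSP looks like in code, the example below sets the feature explicitly on four JDK factories that expose a `setFeature` method. The class name `SecureProcessingDemo` and its method names are hypothetical, and the selection of factories is ours for illustration rather than the full list of classes from the research. Keep in mind the finding above: for some classes this call changes nothing about which payloads succeed.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.validation.SchemaFactory;

final class SecureProcessingDemo {

    static DocumentBuilderFactory dom() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory;
    }

    static SAXParserFactory sax() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory;
    }

    static TransformerFactory transformer() throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory;
    }

    static SchemaFactory schema() throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory;
    }

    public static void main(String[] args) throws Exception {
        // Print the flag as each factory reports it after the explicit call.
        String fsp = XMLConstants.FEATURE_SECURE_PROCESSING;
        System.out.println("DOM: " + dom().getFeature(fsp));
        System.out.println("SAX: " + sax().getFeature(fsp));
        System.out.println("Transformer: " + transformer().getFeature(fsp));
        System.out.println("Schema: " + schema().getFeature(fsp));
    }
}
```

Setting the feature explicitly, even where the documentation says it is on by default, avoids depending on defaults that may not match the docs.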
Disabling DTD processing
To disable DTD processing, SAX and DOM parsers use a different mechanism compared to the StAX parsers. Disable DTD processing for SAX and DOM parsers with the `setFeature` method. The argument for this method should be the following URL: `http://apache.org/xml/features/disallow-doctype-decl`. For the StAX parsers, use `setProperty` to set `XMLInputFactory.SUPPORT_DTD` to false. Or use `setAttribute` and set `XMLConstants.ACCESS_EXTERNAL_DTD` to an empty string (`""`). For some parsers, several of these methods are available.
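The three mechanisms from this paragraph can be sketched side by side as follows. The class name `DisableDtdDemo` and its helper methods are hypothetical. The `rejectsDoctype` helper is only there to show the observable difference: with `disallow-doctype-decl` enabled, any document containing a DOCTYPE is refused, while `ACCESS_EXTERNAL_DTD` only restricts fetching external DTDs and still accepts an internal one.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DisableDtdDemo {

    static final String DISALLOW_DOCTYPE = "http://apache.org/xml/features/disallow-doctype-decl";

    /** DOM (and, analogously, SAX) factories: refuse any DOCTYPE declaration. */
    static DocumentBuilderFactory domWithoutDoctype() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(DISALLOW_DOCTYPE, true);
        return factory;
    }

    /** Alternative: allow DOCTYPEs but forbid every protocol for external DTD access. */
    static DocumentBuilderFactory domWithoutExternalDtd() {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        return factory;
    }

    /** StAX factories use a property instead of a feature. */
    static XMLInputFactory staxWithoutDtd() {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        return factory;
    }

    /** Returns true if the factory's parser refuses a document with an internal DOCTYPE. */
    static boolean rejectsDoctype(DocumentBuilderFactory factory) throws Exception {
        String xml = "<!DOCTYPE a [ <!ENTITY e \"x\"> ]><a>&e;</a>";
        try {
            factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("disallow-doctype-decl rejects DOCTYPE: " + rejectsDoctype(domWithoutDoctype()));
        System.out.println("ACCESS_EXTERNAL_DTD rejects DOCTYPE: " + rejectsDoctype(domWithoutExternalDtd()));
    }
}
```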
For most parsers, this setting adds additional security for processing XML documents, but it never has any effect on processing schemas or stylesheets. However, there are classes such as `SchemaFactory` where these setter methods (`setFeature` among them) are available, and setting the above-mentioned features throws no exceptions, but they don’t have the same effect on the security of the parser.
`setIncludeExternalDTDDeclarations` is available for `SAXReader`, but we measured no effect. On closer inspection, this was not surprising, given the fact that the documentation for this method is simply `DOCUMENT ME!`. Similarly, we also saw incomplete implementations of security features in other classes.
Finally, there are also two features, one of them `http://apache.org/xml/features/nonvalidating/load-dtd-grammar`, that look related, but setting either of these to false has no effect on the security of your parser.
Disable external general entities and external parameter entities
If the `setFeature` method is available, it can be used to set the external general entities feature and `http://xml.org/sax/features/external-parameter-entities` to false. For the classes where these features are available, the combination of these two protects the parser against all of the tested payloads! … Except for the `Validator` class, where they have no effect at all.
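A minimal sketch of this combination on a `DocumentBuilderFactory` is shown below. The parameter entities URL is the one given above; for general entities we assume the standard SAX feature identifier `http://xml.org/sax/features/external-general-entities`, which is our reading of the feature the text refers to. The class name `ExternalEntitiesDemo`, its methods, and the temporary file content are hypothetical.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

final class ExternalEntitiesDemo {

    // Assumption: standard SAX feature ID for external general entities.
    static final String EXTERNAL_GENERAL = "http://xml.org/sax/features/external-general-entities";
    static final String EXTERNAL_PARAMETER = "http://xml.org/sax/features/external-parameter-entities";

    /** Returns a factory, optionally with both external entity features switched off. */
    static DocumentBuilderFactory newFactory(boolean harden) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        if (harden) {
            factory.setFeature(EXTERNAL_GENERAL, false);
            factory.setFeature(EXTERNAL_PARAMETER, false);
        }
        return factory;
    }

    /** Tries to read a temp file through an external entity and returns the parsed text. */
    static String readThroughEntity(boolean harden) throws Exception {
        Path secret = Files.createTempFile("xxe-demo", ".txt");
        try {
            Files.write(secret, "s3cret".getBytes(StandardCharsets.UTF_8));
            String xml = "<!DOCTYPE data [ <!ENTITY xxe SYSTEM \"" + secret.toUri() + "\"> ]>"
                       + "<data>&xxe;</data>";
            return newFactory(harden).newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } finally {
            Files.deleteIfExists(secret);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("default factory read: [" + readThroughEntity(false) + "]");
        System.out.println("hardened factory read: [" + readThroughEntity(true) + "]");
    }
}
```

Comparing the two runs is a cheap way to check on your own JDK whether the features have the effect you expect for the class you actually use.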
Disabling external schema/stylesheet processing
If the `setAttribute` method is available for your parser, you can try the external access properties such as `XMLConstants.ACCESS_EXTERNAL_STYLESHEET` and use an empty string to disable these features. Whether or not this will succeed depends on whether your parser can process schemas and stylesheets. However, even if it succeeds, the effect on security is limited, with no observed effect for the majority of the parsers.
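For illustration, here is a sketch of setting these access properties to an empty string on two JDK factories. `XMLConstants.ACCESS_EXTERNAL_STYLESHEET` comes from the text; `XMLConstants.ACCESS_EXTERNAL_SCHEMA` is its schema counterpart in the JDK and is our assumption about what the heading refers to. Note that `SchemaFactory` exposes `setProperty` rather than `setAttribute`. The class name `ExternalAccessDemo` and its method names are hypothetical.

```java
import javax.xml.XMLConstants;
import javax.xml.transform.TransformerFactory;
import javax.xml.validation.SchemaFactory;

final class ExternalAccessDemo {

    /** Transformer factory that may not fetch external DTDs or stylesheets over any protocol. */
    static TransformerFactory restrictedTransformerFactory() {
        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");
        return factory;
    }

    /** Schema factory that may not fetch external DTDs or schemas over any protocol. */
    static SchemaFactory restrictedSchemaFactory() throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        factory.setProperty(XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
        return factory;
    }

    public static void main(String[] args) throws Exception {
        // Both calls throw if the underlying implementation does not recognize a property.
        restrictedTransformerFactory();
        restrictedSchemaFactory();
        System.out.println("external access properties accepted");
    }
}
```

An empty string means that no protocol is allowed; a value such as `"file"` would be a narrower allow-list if your stylesheets legitimately include local files.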
There are still other ways to make mistakes with XML processing. One of the constructors for `SAXReader`, for example, takes an `XMLReader` as input. But no matter how well this `XMLReader` is configured, it has no effect on the security of the XML parsing that is done with this `SAXReader`.
If your JDK is not updated, you can run into problems as well! For some of the parsing tests, Vasilii and I had different outcomes on our local machines. After some digging, we found out that this was because Vasilii’s JDK was not up to date, and we had just confirmed a known JDK bug where DOM parsers do not honor `setExpandEntityReferences(false)` for certain JDK versions.
Are people parsing XML securely in practice?
We tested the Team tier rules created from this research on the top 1000 open source Java repositories on GitHub and had a total of 690 findings! The classes with the most findings include `SAXTransformerFactory`. For the classes other than `XMLReader`, this could be explained by the fact that it might not be so obvious that `DocumentBuilder`s and `SAXTransformers` are actually parsing XML files (as opposed to building and transforming documents). But rest assured, `DocumentBuilder` has a `parse` method, and the `transform` method will parse an XML source into a `javax.xml.transform.Result`. For example, it can apply an XML stylesheet to an XML file. In this process, both the stylesheet and the file are parsed, and entity expansion can take place.
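To see that a transformation really does parse its input, consider this small sketch. It runs an identity transform (no stylesheet) over a document that declares an internal entity and returns the serialized result. The class name `TransformParsesDemo` and the sample document are made up for the illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class TransformParsesDemo {

    /** Copies the source to a string with an identity transformer, parsing it on the way. */
    static String identityTransform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE r [ <!ENTITY e \"expanded\"> ]><r>&e;</r>";
        // The entity reference is replaced during parsing, before anything is written out.
        System.out.println(identityTransform(xml));
    }
}
```

Because the entity is already expanded in the output, the same code path is open to the entity-based attacks described earlier unless the `TransformerFactory` is configured securely.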
However, 690 findings is a lot of findings to triage, and they are probably not all true positives. We can always use your help to improve the rules. If you see any false positives or false negatives, please report them, use Semgrep shouldafound, or file an issue in our rules repository.
One possible source of false positives could be when the parser is only used for certain documents that cannot contain user input. However, we believe there are only advantages to limiting the use of external entities from the start. (1) For performance reasons it is good practice to reduce dependencies on external resources. (2) It is difficult to guarantee that even a trusted XML file has not been tampered with, on your server or during communication, by a malicious third party. (3) By configuring your parser securely from the start, you do not have to worry about risks for XXE vulnerabilities in future, when the parser could be reused to process untrusted XML files.
The Team tier rules have already generated findings that have been triaged as true positives by our customers. They could also have prevented known CVEs.
In this remote code execution vulnerability in ManageEngine ADAudit Plus, the attackers were able to upload the Java payload through an XXE exploit caused by incorrect security configurations in a `DocumentBuilderFactory`. This would have been prevented with our rules.
We were able to replicate finding this CVE that reports unrestricted XML External Entity references in three different components of Apache NiFi by cloning the vulnerable version of the repository and running our set of rules. We found two insecure instances of `XMLReader` and one instance of `SchemaFactory` in the files mentioned in the CVE report.
Detect these and other CVEs in your dependencies by using Semgrep Supply Chain. And prevent similar mistakes in your own code by scanning your code with Semgrep’s Team tier rules.
Holy cow, Java XML security is a mess! The large number of classes and security features is confusing enough by itself, and setting some of these security features requires you to write an entire URL! The inconsistency in availability and effect of these security features across different classes makes it nearly impossible to securely configure your parser without a cheat sheet or a tool.
The simplest solution to try and remember is to use `setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true)` for classes such as `Validator`, and to use `setFeature("http://apache.org/xml/features/disallow-doctype-decl", true)` for the remaining classes that support it.
We have summarized the research results in a clean table, available in our Java XXE Cheatsheet. Or even better, you can use the Semgrep rules available in Team tier that Vasilii and I created, to continuously scan your code to ensure you’re parsing XML securely.
Since we have a nice set of payloads to test, it’s likely that we will repeat this type of research for other languages! If you have a language or framework you think would be interesting to research, let me know! You can find me on our community Slack or on Twitter.
Semgrep is a fast, open-source, code scanning tool for finding bugs, detecting dependency vulnerabilities, and enforcing code standards.