by Alex Drozdov, Wallarm Research
XXE or XML External Entities is a new issue in the 2017 [OWASP Top 10 vulnerability list](). This is the only new issue of the set that was introduced based on direct data evidence from the security issues database. XML is commonly used for metadata of everything from movies to Docker containers and is a basis of API protocols such as REST, WSDL, SOAP, WEB-RPC, and others, Moreover, a single application may contain several linked XML interpreters processing the data from different application tiers. This potential ability to inject an external entity at various points in the application stack via an XML interpreter is what makes XXE so dangerous.
Many [Web Application Firewalls]() are capable of protecting web-servers from XXE attacks.
In his [article in Security Boulevard](), Wallarm CEO, Ivan Novikov says:
_Actually, XXE is not a bug, but a well-documented feature of any XML parser. Yes, its true, an XML data format allows you to include the content of any external text file inside an XML document._
An example of an XML document containing the attack code:
![](https://cdn-images-1.medium.com/max/1024/1*IfdvuHbloMb0ftlnHFl2fg.png)
Here the text $attack; refers to the link to the entity registered earlier. The contents of the file specified in the link replace it in the document body.
The document above is divided into 3 important parts:
1. Optional header <?xml?> to define the basic document characteristics, such as version and encoding.
2. Optional declaration of the XML document schema <!DOCTYPE>. This declaration may be used to set external links.
3. Document body. It has a hierarchical structure, at the root of which is the tag specified in the <! DOCTYPE>
A correctly configured XML interpreter will either not accept a document with XML links for processing or will validate the links and their sources, If the validation is missing, an arbitrary file can be loaded via the link and integrated into the document body as in the example above.
In this article, we look at two types of WAF based on how they handle XML validation:
1. Diligent. WAFs that pre-process the XML document with its own parser.
2. Regex-based. WAFs that only search for certain substrings or regular expressions in the data.
Unfortunately, bypasses exist for the WAFs of both categories.
Below we show several methods the bad guys can use to fool a WAF and get XXE through.
### Method 1: Extra spaces in the document
Since [XXE]() are typically in the beginning of the XML document, a lazy WAF can avoid processing the entire document, and only parse its beginning. However, the XML format allows using an arbitrary number of spaces when formatting the tag attributes, so an attacker can insert extra spaces in <?xml?> or <!DOCTYPE> to bypass such WAFs.
![](https://cdn-images-1.medium.com/max/1024/1*0QxZQKolMHZcDetrJ6rI0g.png)
### Method 2: Invalid format
To bypass diligent WAFs, an attacker may send specially formatted XML documents so that a WAF would consider them invalid.
### Links to unknown entities
The settings of a diligent WAF usually prevent it from reading the contents of the linked files. This is strategy generally makes sense since otherwise, the WAF itself may also become a target of an attack. The problem is that the links to external sources can exist not only in the third part of the document (the body) but also in the declaration <! DOCTYPE>. This means that a WAF, which has not read the contents of the file, will not read the declarations of the entity present in the document. The links to unknown entities, in turn, will stop the XML parser causing an error.
Example:
![](https://cdn-images-1.medium.com/max/1024/1*bcb-7vLXdyo_NWmwoK6gcA.png)
Fortunately, guarding against such a bypass is quite simple just order the XML parser within WAF not to shut down after meeting unknown entities.
### Method 3: Exotic encodings
In addition to the three parts of an XML document mentioned earlier, there is a fourth part located above them, which also controls the encoding of the document (like <?xml?>) the first bytes of the document with an optional BOM (byte order mark).
More information:[ https://www.w3.org/TR/xml/#sec-guessing]()
An XML document can be encoded not only in UTF-8, but also in UTF-16 (two variants BE and LE), in UTF-32 (four variants BE, LE, 2143, 3412), and in EBCDIC.
With the help of such encodings it is easy to bypass a WAF using regular expressions since, in this type of WAFs, regular expressions are often configured only for a one-character set.
Exotic encodings may also be used to bypass diligent WAFs as they are not always able to process all the encodings listed above. For instance, the libxml2 parser only supports one type of UTF-32 UTF-32BE, specifically without BOM.
### Method 4: Two types of encoding in one document
In the previous section, we demonstrated that the encoding of the document is typically specified by its first bytes. But what happens when there is a
<?xml?> tag containing the encoding attribute referring to a different character set at the beginning of the document? In this case, some parsers change the encoding so that the beginning of the file has one set of characters, and the rest of it is in another encoding. That said, different parsers may switch the encoding at different times. A Java parser (javax.xml.parsers) changes the character set strictly after the <?xml?> ends, whereas the libxml2 parser may switch the encoding right after the value of the encoding attribute is executed or later before or after the <?xml?> has been processed.
A diligent WAF can protect against the attacks in such documents reliably only if it never processes them at all. We must also bear in mind that there are many synonymical encodings, for example, UTF-32BE and UCS-4BE. Moreover, some encodings may be different but compatible from the point of view of coding the initial part of the document <?xml?>. For instance, a seemingly UTF-8 document may contain the string <?xml version=1.0 encoding=windows-1251?>.
Here are some examples of this. For the sake of brevity, we wont place XXE in the documents.
The libxml2 parser treats the document as valid, however, the Java engine from javax.xml.parsers set considers it invalid:
![](https://cdn-images-1.medium.com/max/1024/1*e8XYEG2V2UUvVB_KSlMflQ.png)
Vice versa, the document is valid in terms of the javax.xml.parser, but not valid in terms of the libxml2 parser:
![](https://cdn-images-1.medium.com/max/1024/1*BFthsAHL64GljSX3hiAp2w.png)
Document for libxml2, encoding change from utf-16le to utf-16be in the middle of the tag:
![](https://cdn-images-1.medium.com/max/1024/1*ymPBUlNOZFLoVQ-EEuMr7Q.png)
Document for libxml2, encoding change from utf-8 to ebcdic-us:
![](https://cdn-images-1.medium.com/max/1024/1*qJRiVwmnNN1ywM3t1smBOA.png)
As you can see, there are many bypasses with no good protection for them. The best way to prevent XXE is to configure our application itself initialize your XML parser in a secure way. Mainly, there are two different options that should be disabled:
* External entities
* External DTD schema
We will continue our research on XXE WAF bypasses and expect to publish more soon. Stay tuned.
![](https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=98f679452ce0)
* * *
[XXE that can Bypass WAF Protection]() was originally published in [Wallarm]() on Medium, where people are continuing the conversation by highlighting and responding to this story.Read More