A colleague asked how he might sanitize client-submitted HTML to remove potential Cross-Site Scripting vulnerabilities. His use case was an ORDS REST service accepting HTML content, so he wanted to sanitize it using PL/SQL. After consulting the functions in the APEX_ESCAPE package, I convinced myself we needed a different strategy. The functionality could also come in handy for an APEX application, so I set out to implement a proof of concept solution.
Using XSL Transformations
Earlier in my career, I represented Oracle on the XSL Transformations working group at W3C. XSLT is an elegant, declarative standard to describe a transformation of an XML document into an alternative XML, HTML, or text result. The Oracle database supports XSLT 1.0 natively via the XMLTRANSFORM() SQL function and the DBMS_XSLPROCESSOR package. So, if I could convert the incoming HTML document into well-formed XML, I knew I could craft an appropriate XSLT stylesheet to sanitize the HTML document by selectively copying only a handful of allowed elements needed for basic content formatting and by suppressing undesirable attributes.
Using HTML Tidy
HTML Tidy is a utility Dave Raggett at the W3C created and released in 1998 to clean up HTML. You can read more about its history here, but the executive summary is that it has evolved over the years into an open source suite of command line tools for cleaning up HTML input. One kind of cleanup it supports is turning HTML into well-formed XHTML, which is perfect for XSLT to transform.
The JTidy project is an open-source Java port of HTML Tidy, so I decided to leverage the Oracle database’s support for Java stored procedures to incorporate JTidy into my HTML sanitizing solution.
I used the community edition of IntelliJ to create the HTMLToXHTML class you can find on my github with a public static toXML() method accepting a CLOB and returning a CLOB. I configured my project to use Maven to build a single JAR that contained my HTMLToXHTML class and the JTidy supporting classes. An abbreviated version of the Java source code appears below:
package com.stevemuench.utils;
import org.w3c.tidy.Tidy;
// ⋮
public class HTMLToXHTML {
static String toXHTML(String html) {
Tidy tidy = new Tidy();
tidy.setXHTML(true);
// ⋮ Set Tidy options and setup in/out streams
tidy.parseDOM(in, out);
// ⋮ Extract and return XHTML result
xhtml = out.toString("UTF-8");
// ⋮
return xhtml;
}
public static CLOB toXML(CLOB inputClob) throws Exception {
// Read CLOB into String
Reader reader = inputClob.getCharacterStream();
StringBuilder sb = new StringBuilder();
// ⋮ Convert CLOB HTML to XHTML
String cleanedXHTML = toXHTML(sb.toString());
// ⋮
// Create a new temporary CLOB for result
CLOB outputClob = CLOB.createTemporary(conn,
false,
CLOB.DURATION_SESSION);
outputClob.setString(1, cleanedXHTML);
return outputClob;
}
}
Loading Sanitizer JAR Into the Database
Using the Oracle loadjava utility, I took the htmltoxml-1.0-SNAPSHOT.jar that IntelliJ’s Maven package target produced and loaded the Java classes it contains into the Oracle database:
loadjava -thin -user un/pw@localhost:1521/freepdb1 \
htmltoxml-1.0-SNAPSHOT.jar
Exposing HTML to XML Utility to PL/SQL
Next, I created the EBA_DEMO_SANITIZE package the sample app below installs, including the following private function in the package body that lets PL/SQL invoke the Java HTMLToXHTML.toXML() method:
-- Private helper function in EBA_DEMO_SANITIZE package body
function to_xml(
input_html clob)
return clob
is language java name
'com.stevemuench.utils.HTMLToXHTML.toXML(oracle.sql.CLOB)
return oracle.sql.CLOB';
Combining the Ingredients
The package spec for EBA_DEMO_SANITIZE has just a single public function:
create or replace package eba_demo_sanitize is
function html(
p_html in clob)
return clob;
end eba_demo_sanitize;
Inside the body of the EBA_DEMO_SANITIZE this function is implemented as follows:
-- Public function code in EBA_DEMO_SANITIZE package body
function html(
p_html in clob)
return clob
is
l_xhtml_clob clob;
l_xhtml xmltype;
l_xslt xmltype;
l_output clob;
l_params apex_application.vc_assoc_arr;
begin
-- Simplify XSLT processing by removing
-- the default XHTML namespace that JTidy adds
l_xhtml_clob := replace(to_xml(p_html),
' xmlns="http://www.w3.org/1999/xhtml"');
l_xhtml := xmltype(l_xhtml_clob);
l_xslt := xmltype(get_application_static_file(
'sanitize-html.xsl'),c_utf8);
l_output := transform_xml(l_xhtml,l_xslt,l_params);
return l_output;
end html;
The pseudocode for what it’s doing is:
- Call
to_xml()to get JTidy to turn HTML to sanitize into XHTML - Remove the XML default namespace for XHTML to simplify XSLT stylsheet processing
- Create an
XMLTYPEobject using the JTidy-produced, default namespace-stripped XHTML CLOB - Create an
XMLTYPEobject using thesanitize-html.xslXSLT stylesheet in Static App Files - Transform the XHTML using the stylesheet to produce the sanitized HTML
- Return the sanitized result
Studying the XSLT Stylesheet
The sanitize-html.xsl stylesheet looks like this in Static App Files. It includes templates that match only a handful of allowed HTML tags – leaving everything else out, for example the <script> tag among many others. It includes a template that matches any attribute, and a higher-priority template that matches attributes whose name starts with on. This latter, more specific template prunes out any of these attributes that can contain JavaScript. This stylesheet shows one approach to selectively copy elements and suppress attributes. In other words, it shows the basic patterns. You would of course need to adjust this proof of concept to obtain a more robust sanitization.
The elements I chose to allow in my proof of concept were:
html,body,ul,ol,li,p,em,strong,b,i,img,a
The attributes I chose to suppress were:
- All attributes whose names start with
on(e.g.onclick,onblur, etc.)
NOTA BENE: This is only an example and does not represent a tested solution for sanitizing all forms of HTML JavaScript vulnerabilities! You’ll need to do more research and extend this proof of concept further to achieve that.
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="@*">
<xsl:copy/>
</xsl:template>
<xsl:template match="text()">
<xsl:copy/>
</xsl:template>
<!-- Only allow these HTML elements -->
<xsl:template match="html|body|ul|ol|li|p
|em|strong|b|i|img|a">
<xsl:copy>
<xsl:apply-templates select="@*|text()|body|ul|ol|li|p
|em|strong|b|i|img|a"/>
</xsl:copy>
</xsl:template>
<!-- Suppress attributes whose names start with "on" -->
<xsl:template match="@*[starts-with(name(),'on')]"
priority="10"/>
</xsl:stylesheet>
Example of Resulting Sanitization
The following shows the sample app illustrating an example of the server-side HTML sanitization this combination achieves.

You can download the sample app from here, and find the IntelliJ project including the example code for the Java class in this htmltoxml repo on my GitHub.
