HTML sanitization
HTML sanitization is the process of examining an HTML document and producing a new HTML document that preserves only the tags designated "safe". It can be used to protect against cross-site scripting (XSS) attacks by sanitizing any HTML code submitted by a user.
Basic tags for changing fonts are often allowed, such as <b>, <i>, <u>, <em>, and <strong>, while more advanced tags such as <script>, <object>, <embed>, and <link> are removed by the sanitization process.
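The filtering described above can be illustrated with a minimal sketch in Python, using only the standard library's html.parser module. The names ALLOWED_TAGS, DROP_CONTENT, AllowlistSanitizer, and sanitize are hypothetical and chosen for illustration; the choice to discard all attributes and to drop the text inside <script> and <style> is an assumption of this sketch, not a rule stated in the article. It is not a substitute for a maintained sanitization library.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist built from the formatting tags named in the text.
ALLOWED_TAGS = {"b", "i", "u", "em", "strong"}
# Assumption of this sketch: for these elements the enclosed text is dropped too.
DROP_CONTENT = {"script", "style"}


class AllowlistSanitizer(HTMLParser):
    """Re-emits only allowlisted tags (without attributes) and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and self.skip_depth == 0:
            # Attributes such as onclick are discarded entirely.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in ALLOWED_TAGS and self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Re-escape text so stray "<" or "&" cannot form new markup.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitize('<b>Hi</b><script>alert(1)</script> <i onclick="x()">there</i>'))
# prints: <b>Hi</b> <i>there</i>
```

Tags that are neither allowed nor in DROP_CONTENT, such as <object>, are removed while their text content is kept; a real sanitizer may choose differently.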
Sanitization is typically performed by using either a whitelist or a blacklist approach. If a safe item is left off a whitelist, the sanitized output merely lacks that element, which is a loss of functionality rather than a security flaw. If an unsafe item is left off a blacklist, a vulnerability will be present in the sanitized HTML output. New unsafe HTML features introduced after a blacklist has been defined cause the blacklist to become out of date.
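The difference between the two failure modes can be sketched in Python. The lists below are hypothetical and deliberately incomplete: the whitelist omits the harmless <mark> tag and the blacklist omits <iframe>. The helper names tags_in, kept_by_whitelist, and kept_by_blacklist are likewise assumptions made for illustration, and the sketch only reports which tag names each policy would let through rather than rewriting the document.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete lists.
WHITELIST = {"b", "i", "em", "strong"}             # harmless "mark" was forgotten
BLACKLIST = {"script", "object", "embed", "link"}  # risky "iframe" was forgotten


class _TagNames(HTMLParser):
    """Collects the name of every start tag in the input."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def tags_in(html_text):
    parser = _TagNames()
    parser.feed(html_text)
    parser.close()
    return parser.names


def kept_by_whitelist(html_text):
    # Only tags explicitly listed survive; anything unknown is removed.
    return [t for t in tags_in(html_text) if t in WHITELIST]


def kept_by_blacklist(html_text):
    # Everything survives unless it is explicitly listed.
    return [t for t in tags_in(html_text) if t not in BLACKLIST]


sample = '<b>bold</b> <mark>note</mark> <iframe src="https://example.com"></iframe>'
print(kept_by_whitelist(sample))  # ['b']
print(kept_by_blacklist(sample))  # ['b', 'mark', 'iframe']
```

The whitelist loses the harmless <mark> element, which costs some formatting, while the blacklist lets the unlisted <iframe> through, which is the kind of omission that leads to a vulnerability.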
In PHP, HTML sanitization can be performed using the strip_tags() function, at the risk of removing all textual content following an unclosed less-than symbol or angle bracket.[1] The HTML Purifier library is another popular option for PHP applications.[2]
In Java (and .NET), sanitization can be achieved by using the OWASP Java HTML Sanitizer Project.[3]
In .NET, a number of sanitizers use the Html Agility Pack, an HTML parser.[4][5]
See also
- Data sanitization