Prevent XSS through HTML sanitization with HTML Purifier

Sanitizing user-supplied HTML is one of the most important steps in securing an application against XSS attacks. In PHP you can use a simple strip_tags() call with a list of allowed tags, but that filters only the tags, not their attributes. More sophisticated solutions exist, such as the following.
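To see what a tag-only whitelist lets through, here is a minimal sketch using PHP's built-in strip_tags(). The input string is a made-up example, not taken from a real application.

```php
<?php

declare(strict_types=1);

// Hypothetical user input: an inline event handler plus a script element
$input = '<p onclick="alert(1)">Hi</p><script>alert(2)</script>';

// Allow only <p>; every other tag is stripped, but its text content stays
$result = strip_tags($input, '<p>');

echo $result, PHP_EOL;
// <p onclick="alert(1)">Hi</p>alert(2)
```

The script tags are gone, but the onclick handler on the allowed tag survives, which is why filtering attributes matters.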

With the HTML Purifier library you can define not only which tags to allow, but also which attributes are allowed within each tag.

<?php

declare(strict_types=1);

namespace App\Service\Sanitization;

use HTMLPurifier;
use HTMLPurifier_Config;

final class SanitizationService
{
    /** @var HTMLPurifier */
    private $htmlPurifier;

    public function __construct()
    {
        $config = HTMLPurifier_Config::createDefault();
        
        // Disables the definition cache so changes to the allowed set take
        // effect immediately. Remove this once your final set is configured.
        $config->set('Cache.DefinitionImpl', null);
        
        $config->set('Core.Encoding', 'UTF-8');

        $allowedElements = [
            'p[style]',
            'br',
            'b',
            'strong',
            'i',
            'em',
            's',
            'u',
            'ul',
            'ol',
            'li',
            'span[class|data-custom-id|contenteditable]',
            'table[border|cellpadding|cellspacing]',
            'tbody',
            'tr',
            'td[valign]',
        ];

        $config->set('HTML.Allowed', implode(',', $allowedElements));

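        // Register attributes that the default definition does not know;
        // listing them in HTML.Allowed alone is not enough.
        $def = $config->getHTMLDefinition(true);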
        $def->addAttribute('span', 'data-custom-id', 'Text');
        $def->addAttribute('span', 'contenteditable', 'Text');

        $this->htmlPurifier = new HTMLPurifier($config);
    }

    public function sanitizeHtml(string $content): string
    {
        return $this->htmlPurifier->purify($content);
    }
}

The downside is that you have to list every allowed element and attribute explicitly, and you also have to set up custom attributes (such as data-custom-id) if your HTML content contains them. This should be an edge case, but with a more complex frontend editor you might run into it. The library's default definition also does not include every standard global attribute, such as contenteditable, so you have to add those as well.
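Once the allowed set is final and the definition cache is switched back on, custom attributes are usually registered through HTML Purifier's maybeGetRawHTMLDefinition(), which returns null when a cached definition already exists. The following is a minimal sketch under a few assumptions: the library is installed via Composer, the system temp directory is writable, and the function name, definition ID, and sample input are hypothetical.

```php
<?php

declare(strict_types=1);

// Assumption: HTML Purifier was installed via Composer (ezyang/htmlpurifier)
require __DIR__ . '/vendor/autoload.php';

// Hypothetical helper, not part of the library
function createCachedPurifier(): HTMLPurifier
{
    $config = HTMLPurifier_Config::createDefault();
    $config->set('HTML.Allowed', 'p,span[class|data-custom-id]');

    // Hypothetical ID; bump the revision whenever the definition changes
    $config->set('HTML.DefinitionID', 'app-sanitization');
    $config->set('HTML.DefinitionRev', 1);

    // Assumption: this directory exists and is writable
    $config->set('Cache.SerializerPath', sys_get_temp_dir());

    // Null means a cached definition was found and no setup is needed
    $def = $config->maybeGetRawHTMLDefinition();
    if ($def !== null) {
        $def->addAttribute('span', 'data-custom-id', 'Text');
    }

    return new HTMLPurifier($config);
}

$purifier = createCachedPurifier();
$clean = $purifier->purify(
    '<span data-custom-id="42" onclick="alert(1)">Hi</span><script>alert(2)</script>'
);
echo $clean, PHP_EOL;
```

With this setup the custom attribute should be kept, while the onclick handler and the script element are expected to be dropped, since neither is in the allowed set.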