5.4. Filtering Control Characters
After we've ensured that our incoming data is valid, we need to work on ensuring it's "good." An example is ASCII character 0x0B, the vertical tab. There is almost no situation in which you'd want to allow a vertical tab character in a username. Typically, you never want to accept ASCII characters below 0x20 (space).
Of course, things are never as simple as they appear. With Unicode, you don't just want to disallow vertical tabulation (U+000B), but also the invisible function application character (U+2061). Luckily, we can use the Unicode character classes we listed earlier to determine what we want to filter out. The category Cc ("Other, Control") contains all of the control characters we don't want. We might also want to filter out formatting characters (the category that includes U+2061), surrogates, private use, and noncharacters in order to be thorough: that is, everything in the Cx categories.
If you're using PHP 4.4 or greater, then you can use the regular expression replace function in UTF-8 mode by specifying the "u" pattern modifier. You can then use the character class matcher in your expressions (the UTF-8 mode was available in earlier PHP versions, but character class matching was not). To remove every character in the Cx categories, we simply need to call preg_replace() once:
$data = preg_replace('!\p{C}!u', '', $data);
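As a minimal sketch of how that call might be wrapped, the helper below strips every character in the Cx categories and checks for failure. The function name strip_control_chars is hypothetical, and the type declarations and "\u{...}" escape assume PHP 7 or later. One detail worth knowing: with the "u" modifier, preg_replace() returns null when the subject is not valid UTF-8, so the result should be checked rather than stored blindly.

```php
<?php
// Hypothetical helper (not from the text): remove every code point in the
// Unicode "Other" categories (Cc, Cf, Cs, Co, Cn) from a UTF-8 string.
function strip_control_chars(string $data): string
{
    $clean = preg_replace('!\p{C}!u', '', $data);

    // In UTF-8 mode, preg_replace() returns null on malformed input.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

// A vertical tab (U+000B, category Cc) and a function application
// character (U+2061, category Cf) are both removed.
echo strip_control_chars("foo\x0Bbar\u{2061}baz"), PHP_EOL; // foobarbaz
```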
There is an exception to this rule, however. For some data fields, you may want to accept carriage returns (and possibly tabs). If you've ever transferred files between a Windows PC and a Unix box, you know what a pain carriage returns can be. The problem is that nobody could agree on how to mark the end of a line. Windows/DOS uses the double character sequence \r\n (0x0d 0x0a), Unix uses \n (0x0a), and Mac OS classic uses \r (0x0d). To correctly filter carriage returns, you need to first normalize them, and then exclude them from your filtering. Normalizing is easy using a small regular expression:
$input = preg_replace('!\r\n?!', "\n", $input);
This code converts all three carriage return styles into the Unix style. This style is useful for a couple of reasons: it displays properly on all three platforms and uses ever so slightly less storage space than the Windows/DOS version.
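The two steps described above (normalize first, then filter everything except the line breaks you want to keep) could be combined along the following lines. This is a sketch under assumptions: the function name filter_multiline_text is hypothetical, tabs are kept alongside newlines as one possible policy, and the syntax assumes PHP 7 or later. The replacement string is double-quoted so that PHP turns "\n" into a real newline. The character class [^\P{C}\n\t] matches any character that is in a Cx category and is neither a newline nor a tab.

```php
<?php
// Hypothetical helper (not from the text) for fields that may contain
// multiparagraph text.
function filter_multiline_text(string $input): string
{
    // Step 1: convert Windows (\r\n) and classic Mac (\r) line endings
    // to the Unix style (\n).
    $input = preg_replace('!\r\n?!', "\n", $input);

    // Step 2: remove Cx-category characters, but leave \n and \t alone.
    $clean = preg_replace('![^\P{C}\n\t]!u', '', $input);

    // In UTF-8 mode, preg_replace() returns null on malformed input.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}
```

Normalizing before filtering matters here: if the filter ran first, each \r would simply be deleted, and text using the classic Mac style would lose its line breaks entirely.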
Carriage returns are only useful in stored data in a limited set of circumstances. When you allow a user to enter multiparagraph text, carriage returns are useful, but when you're letting a user choose a username, they can be dangerous.
There's also an issue of easy spoofing and impersonation here. When you output a string as content in a block of HTML, the username foo bar appears to be exactly the same as foo\nbar, because browsers collapse a line break into a single space when rendering. But a more important problem occurs when outputting XML containing the data. Consider this XML snippet:
<user username="foo bar" />
This is fine until our fictional hacker comes along and inserts a carriage return into his name. The XML snippet is then:
<user username="foo
bar" />
While this is technically well-formed XML, some XML parsers will die when they encounter an attribute containing a carriage return, which is obviously undesirable. Although the XML recommendation suggests that such programs are not actually XML parsers, we have to live with the tools we have. This brings us back to our initial data integrity policy: if we filter at the edges, we can be sure of the content we store. If we're sure that the content we store contains what we want, then we don't have to filter it whenever we output it.
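To illustrate filtering at the edge for the username case, here is a small sketch. The function name filter_single_line is hypothetical and the syntax assumes PHP 7 or later. The call to htmlspecialchars() is ordinary output escaping for the attribute value, added here as an assumption about how the XML would be built; it is separate from the input filtering this section describes.

```php
<?php
// Hypothetical edge filter (not from the text) for single-line fields
// such as usernames: no Cx-category characters at all, line breaks included.
function filter_single_line(string $value): string
{
    $clean = preg_replace('!\p{C}!u', '', $value);

    // In UTF-8 mode, preg_replace() returns null on malformed input.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

// Filter once, when the data arrives...
$username = filter_single_line("foo\nbar");

// ...so the stored value can no longer split the attribute across lines.
$xml = sprintf(
    '<user username="%s" />',
    htmlspecialchars($username, ENT_QUOTES, 'UTF-8')
);
```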