Hack 81 Strip Formatting from Messages
Parsing the content of a message is difficult if it contains strange formatting characters. Make it easier for IRC bots to parse these messages by removing those characters.

When you are writing a bot that needs to parse the input from other users, it is all too easy to write code that can break when users apply formatting to their messages. In some cases, this formatting is automatically added by a user's IRC client, possibly without the user being aware of it.

13.5.1 TimeBot Scenario

Let's take a look at a simple scenario. Charlotte writes a simple Perl bot called TimeBot. Charlotte programs the bot to say what the time is whenever somebody says "time." This soon starts to annoy the users in the channel, because they sometimes find themselves saying "time" in response to messages from other users. Although each utterance of "time" is not necessarily a request to know what the time is, TimeBot nonetheless assumes the user is asking for the time and responds.

Charlotte decides to fix this problem by requiring the bot to be addressed directly. All this means is that the line must start with the bot's name. This prevents people accidentally interacting with TimeBot. There are many different styles of direct addressing:

    <Charlotte> TimeBot time
    <Charlotte> TimeBot, time
    <Charlotte> TimeBot: time
    <Charlotte> timebot time?
    ...etc.

TimeBot should respond to all of the preceding styles of addressing, along with all permutations. Charlotte whips up a simple Perl regular expression to deal with these new styles:

    $input = ...
    if ($input =~ /^timebot[:,]?\s+time\??$/i) {
        tell_time();
    }

Charlotte tests this new regular expression in her bot and finds that it works. A few minutes later, she notices that several users are being ignored by the bot. A little investigation reveals that these users are running IRC clients that automatically add formatting characters to the end of autocompleted nicknames.
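As a rough illustration of the matching rule above, here is a minimal Python sketch of the same pattern. The function name `is_time_request` is hypothetical and not part of TimeBot, and the last message is only one assumed example of client-added formatting: a bold control character (`\x02`) before the colon and a reset character (`\x0f`) after it.

```python
import re

# Python port of the Perl pattern from the text; IGNORECASE mirrors the /i flag.
ADDRESSED_TIME = re.compile(r"^timebot[:,]?\s+time\??$", re.IGNORECASE)

def is_time_request(message):
    """Hypothetical helper: True if the line addresses TimeBot and asks for the time."""
    return ADDRESSED_TIME.match(message) is not None

# The four addressing styles listed above.
for line in ["TimeBot time", "TimeBot, time", "TimeBot: time", "timebot time?"]:
    print(repr(line), is_time_request(line))

# Assumed example of a message with formatting characters around the colon.
print(repr("TimeBot\x02:\x0f time"), is_time_request("TimeBot\x02:\x0f time"))
```

The plain messages are accepted, while the formatted one is rejected, because the control characters sit where the pattern expects either punctuation or whitespace.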
Some IRC clients, and even some add-on scripts, are designed to do this sort of thing on purpose, presumably to try to make the message stand out better.

    <Charlotte> TimeBot: time
    <TimeBot> The time is 13:40
    <Paul> TimeBot: time
    <Paul> hello?

In Charlotte's case, she was observing the "bold colon" effect. The IRC client was adding a colon character to the end of the autocompleted nickname. This should not pose a problem, as the regular expression can cope with this. However, the colon was prefixed with a bold control character and followed by a "normal" control character to remove the boldness from the rest of the message. These extra characters are not accounted for in the regular expression, so the test in the if statement will never return true.

As a temporary measure, Charlotte fixes the problem by adding the extra characters to the regular expression:

    $input = ...
    if ($input =~ /^timebot(\x02:\x0f|[:,])?\s+time\??$/i) {
        tell_time();
    }

This is clearly not the best way to fix the problem. What might happen if you encounter another type of IRC client that applies formatting somewhere else, for instance, making the entire nickname bold? The best solution is to remove all formatting and then parse the message as before.

13.5.2 Removing All Formatting

Removing formatting characters is remarkably easy if your programming language is suited to the task. Each style of formatting requires only one control character, so you can simply hunt them down and remove them from the message. The message will then contain exactly the same text as before, but without any formatting, so it can be easily parsed for commands.

13.5.2.1 Perl solution

A simple regular expression replacement is all that is needed to remove the control characters used for formatting. You can then go ahead and use the string as normal without having to worry about taking special account of any formatting.
This single line of Perl will remove all formatting characters from a string:

    $input =~ s/[\x02\x1f\x16\x0f]//g;

13.5.2.2 Python solution

Importing the regular expression module allows you to do a similar replacement in Python. Python strings are immutable, so sub returns a new string rather than changing the original; assign the result back to keep it:

    import re
    input = re.compile("[\x02\x1f\x16\x0f]").sub("", input)

13.5.2.3 Java solution

In Java 1.4 and later, the method replaceAll in the String class can be used to do the same thing. This method accepts two String arguments, a regular expression to match and the replacement, and returns the resulting string:

    input = input.replaceAll("[\u0002\u001f\u0016\u000f]", "");

13.5.2.4 Java Applet solution

Removing formatting characters in a Java Applet [Hack #90] is a little trickier. The String class allows us to use the same approach as the Perl solution, but there are problems using this from Applets. The replaceAll method makes use of the java.util.regex package, and neither of these exists in any version of Java prior to 1.4. The obvious problem here is that Java Applets run in web browsers that commonly have only a 1.1-compatible Virtual Machine installed.

If you want to remove formatting characters from within a Java Applet, you would therefore be wise to make use only of classes present in the 1.1 releases. Here is an efficient method that removes all formatting characters within a Java Applet:

    public static String removeFormatting(String message) {
        int length = message.length();
        StringBuffer buffer = new StringBuffer();
        for (int i = 0; i < length; i++) {
            char ch = message.charAt(i);
            if (ch == '\u000f' || ch == '\u0002' || ch == '\u001f' || ch == '\u0016') {
                // Don't add this character.
            }
            else {
                buffer.append(ch);
            }
        }
        return buffer.toString();
    }
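Putting the pieces together, the "strip first, then parse" approach for TimeBot might look like the following minimal Python sketch. The names `strip_formatting` and `is_time_request` are hypothetical helpers introduced for illustration, not part of TimeBot, and the sample messages are assumed examples of what a formatting-happy client could send.

```python
import re

# The four control characters removed by the solutions above:
# \x02 (bold), \x1f (underline), \x16 (reverse) and \x0f ("normal" reset).
FORMATTING = re.compile("[\x02\x1f\x16\x0f]")

# The original, simple addressing pattern; IGNORECASE mirrors Perl's /i flag.
ADDRESSED_TIME = re.compile(r"^timebot[:,]?\s+time\??$", re.IGNORECASE)

def strip_formatting(message):
    """Hypothetical helper: return a copy of message without formatting characters."""
    return FORMATTING.sub("", message)

def is_time_request(message):
    """Hypothetical helper: strip formatting first, then apply the simple pattern."""
    return ADDRESSED_TIME.match(strip_formatting(message)) is not None

print(is_time_request("TimeBot\x02:\x0f time"))   # assumed "bold colon" message
print(is_time_request("\x02TimeBot\x0f: time?"))  # assumed bold nickname
print(is_time_request("what time is it"))         # not addressed to the bot
```

Because the formatting is removed before matching, the regular expression does not need a special case for each client's habits.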