|   How Google Works

Now that you have a better idea of what Google is and what it does, let's take a look at how it does what it does: in particular, how a Google search works. There's a lot of sophisticated technology behind even the most simple search.

How a Typical Search Works

The typical Google search takes less than half a second to complete. That's because all the searching takes place on Google's own web servers. That's right; you may think you're searching the Web, but in effect you're searching a huge index of websites stored on Google's servers. That index was created previously, over a period of time; because you're only searching a server, not the entire Web, your searches can be completed in the blink of an eye.

Note: Google's servers are actually midpriced personal computers, just like the kind you have on your desktop. Google uses approximately 10,000 of these PCs, all of which run the Linux operating system. Google uses three types of servers: web servers (which host Google's public website), index servers (which hold the searchable index to the bigger document database), and document servers (which house copies of all the individual web pages in Google's database).
So what happens when you enter a query into the Google search box? It's a process that looks something like this:

1. When you click the Google Search button, your query is transmitted over the Internet to Google's web server.
2. Google's web server sends your query to the company's array of index servers. These computers hold a searchable index to Google's database of web pages.
3. Your query is matched to listings in the Google index; that is, the index servers determine which actual web pages contain words that match your query.
4. Google now passes your query to the document servers, which store all the assembled web listings (documents) in the Google database.
5. The document servers assemble the results page for your query by pasting together snippets of the appropriate stored documents.
6. The document servers send the assembled results page back to the main web server.
7. Google's web server sends the results page across the Internet to your web browser, where you view it.
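The hand-off between the three server types can be sketched in a few lines of Python. This is only a toy model of the steps above: the class names (`IndexServer`, `DocumentServer`, `WebServer`), the in-memory dictionaries, the example URLs, and the 40-character snippets are all assumptions made for illustration, not a description of Google's actual software.

```python
class IndexServer:
    """Holds a word -> set of page URLs index (hypothetical structure)."""

    def __init__(self, index):
        self.index = index

    def match(self, query):
        # Return the URLs of pages that contain every word in the query.
        words = query.lower().split()
        if not words:
            return set()
        hits = [self.index.get(word, set()) for word in words]
        return set.intersection(*hits)


class DocumentServer:
    """Holds the full text of each stored page (hypothetical structure)."""

    def __init__(self, documents):
        self.documents = documents

    def build_results(self, urls, snippet_len=40):
        # Assemble a results page from snippets of the stored documents.
        return [(url, self.documents[url][:snippet_len]) for url in sorted(urls)]


class WebServer:
    """Front end: receives the query and returns the results page."""

    def __init__(self, index_server, document_server):
        self.index_server = index_server
        self.document_server = document_server

    def search(self, query):
        # The index servers find the matching pages...
        urls = self.index_server.match(query)
        # ...and the document servers assemble the results page.
        return self.document_server.build_results(urls)


# Made-up data standing in for the stored pages and their index.
documents = {
    "a.example/linux": "Linux runs on inexpensive personal computers.",
    "b.example/search": "A search index maps words to web pages.",
}
index = {
    "linux": {"a.example/linux"},
    "computers": {"a.example/linux"},
    "search": {"b.example/search"},
    "index": {"b.example/search"},
}
google = WebServer(IndexServer(index), DocumentServer(documents))
results = google.search("search index")
```

Here `results` is a list of `(url, snippet)` pairs for the pages that contain both query words. The browser only ever talks to the `WebServer` object, which mirrors the point that the server-to-server traffic is invisible to the person searching.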
Of course, you're unaware of all this behind-the-scenes activity. You simply type your query into the search box on Google's main web page, click the Google Search button, and then view the search results page when it appears. All the shuffling of data from server to server is invisible to you.

Note: Google's document servers store the full text of each web page in the Google database. Snippets of each page are extracted to create the page listings on Google's search results pages. In addition, these stored documents provide the cached pages that are linked to from the search results page.
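As a rough illustration of how a snippet might be cut from a stored page, the sketch below returns a short window of text around the first query word it finds. The function name `make_snippet`, the `width` parameter, and the ellipsis markers are assumptions for this example; Google's real snippet logic is not described in this chapter.

```python
def make_snippet(page_text, query, width=30):
    """Return a short excerpt of page_text around the first query word found."""
    lowered = page_text.lower()
    for word in query.lower().split():
        pos = lowered.find(word)
        if pos != -1:
            start = max(0, pos - width)
            end = min(len(page_text), pos + len(word) + width)
            # Mark the excerpt as truncated where text was cut off.
            prefix = "..." if start > 0 else ""
            suffix = "..." if end < len(page_text) else ""
            return prefix + page_text[start:end] + suffix
    # No query word appears on the page: fall back to the start of the text.
    return page_text[:2 * width]


stored_page = (
    "Google's crawler, known as GoogleBot, follows links "
    "to discover new and updated web pages."
)
snippet = make_snippet(stored_page, "googlebot", width=10)
```

The match is case-insensitive, but the excerpt is taken from the original text, so the capitalization on the stored page is preserved in the snippet.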
How Google Builds Its Database and Assembles Its Index

At the heart of Google's search system is the database of web pages stored on Google's document servers. These servers hold literally billions of individual web pages (not the entire Web, but a good portion of it).

How does Google determine which web pages to index and store on its servers? It's a complex process with several components.

First and foremost, most of the pages in the Google database are found by Google's special spider software. This is software that automatically crawls the Web, looking for new and updated web pages. Google's crawler, known as GoogleBot, not only searches for new web pages (by exploring links to other pages on the pages it already knows about), it also re-crawls pages already in the database, checking for changes and updates. A complete re-crawling of the web pages in the Google database takes place every few weeks, so no individual page is more than a few weeks out of date.

The GoogleBot crawler reads each page it encounters, much like a web browser does. It follows every link on every page until all the links have been followed. This is how new pages are added to the Google database, by following those links GoogleBot hasn't seen before.

Note: GoogleBot is smart about how it updates the Google database. Web pages that are known to be frequently updated are crawled more frequently than other pages. For example, pages on a news site might be crawled hourly.
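The link-following behavior described above can be sketched as a breadth-first traversal. This is a minimal model, not GoogleBot itself: the `crawl` function, the `fetch_links` callback, and the `toy_web` dictionary are hypothetical names, and a dictionary of made-up pages stands in for downloading and parsing real HTML.

```python
from collections import deque


def crawl(seed_urls, fetch_links):
    """Breadth-first crawl: follow every link until no unseen links remain.

    fetch_links is a hypothetical callable that returns the list of URLs
    linked from a given page; a real crawler would download and parse HTML.
    """
    seen = set(seed_urls)
    queue = deque(seed_urls)
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:  # a link the crawler hasn't seen before
                seen.add(link)
                queue.append(link)
    return order


# A tiny made-up web: each page maps to the pages it links to.
toy_web = {
    "home": ["news", "about"],
    "news": ["story1", "home"],
    "about": [],
    "story1": ["home"],
    "orphan": ["home"],  # nothing links here, so the crawler never reaches it
}
found = crawl(["home"], lambda url: toy_web.get(url, []))
```

Starting from `home`, the crawl reaches `news`, `about`, and `story1`, but never `orphan`, because no page the crawler knows about links to it. A crawler that works purely by following links can only discover pages that are reachable that way.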
The pages discovered by GoogleBot are copied verbatim onto Google's document servers, and copied over each time they're updated. These web pages are used to compile the page summaries that appear on search results pages; they can also be viewed in their entirety when you click the Cached link in the search results. (These cached pages are a good way to view older versions of pages that have recently changed or been deleted.)

| As big as Google's database is, there are still lots of web pages that don't make it into the database. In particular, Google doesn't do a good job of searching the "deep web," those web pages generated on the fly from big database-driven websites. Google also doesn't always find pages served by the big news sites, pages housed on web forums and discussion groups, blog pages, and the like. These are all web pages with "dynamic" content that change frequently and don't always have a fixed URL; the URL (and the page itself) is generated on the fly, typically as a result of a search within the site itself.

This lack of a permanent URL makes these pages difficult, if not impossible, for GoogleBot to find. That's because GoogleBot, unlike a human being, can't enter a query into a site's search box and click the Search button. It can only take the pages it finds by following links, typically starting from the site's fixed home page. The dynamically generated pages slip through the cracks, so to speak. This is why it's possible to search for a page that you know exists (you've seen it yourself!) and not find it listed in Google's search results.

It's not a trivial problem; more and more of the Web is moving to dynamically generated content, leaving at least half the Internet beyond the capability of Google's crawler. Google has technicians working on this challenge, but it's a big enough challenge that you shouldn't expect big improvements anytime soon. |
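Stored pages become searchable only once they are indexed by word. As a minimal sketch of that idea (not Google's actual data structures), the following builds a word-to-pages index over a few made-up documents and looks up queries against it. The function names `build_index` and `search`, the simple word-splitting rule, and the sample pages are all assumptions for illustration.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page URLs on which it appears."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs of pages containing every word in the query."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


pages = {
    "page1": "Google crawls the Web with spider software.",
    "page2": "The index lists every important word on every page.",
    "page3": "Spider software follows links from page to page.",
}
index = build_index(pages)
matches = search(index, "spider software")
```

The expensive work (reading every page) happens once, when the index is built. Answering a query is then a matter of looking up each word and intersecting the resulting sets, which is why searching an index is so much faster than scanning the pages themselves.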
 
In order to search the Google database, Google creates an index to all the stored web pages. This search engine index is much like the index found in the back of this book; it contains a list of all the important words used on every stored web page in the database. Once the index has been compiled, it's easy enough to search for a particular word, and have returned a list of all the web pages on which that word appears.

And that's exactly how the Google index and database work to serve your search queries. You enter one or more words in your query, Google searches its index for those words, and then those web pages that contain those words are returned as search results. Fairly simple in concept, but much more complex in execution, especially since Google is indexing all the words on several billion web pages.

How Google Ranks Its Results

Searching the Google index for all occurrences of a given word isn't all that difficult, especially with the computing power of 10,000 PCs driving things. What is difficult is returning the results in a format that is usable by and relevant to the person doing the searching. You can't just list the matching web pages in random order, nor is alphabetical or chronological order all that useful. No, Google has to return its search results with the most important or relevant pages listed first; it has to rank the results for its users.

How does Google determine which web pages are the best match to a given query? I wish I could give you all the details behind the scheme, but Google keeps this core methodology under lock and key; this methodology is what makes Google the most effective search engine on the Web today.

Even with all this secrecy, Google does provide some hints as to how its ranking system works. There are three components to the ranking:

1. Text analysis. Google looks not only for matching words on a web page, but also for how those words are used. That means examining font size, usage, proximity, and more than a hundred other factors to help determine relevance. Google also analyzes the content of neighboring pages on the same website to ensure that the selected page is the best match.
2. Links and link text. Google then looks at the links (and the text for those links) on the web page, making sure that they link to pages that are relevant to the searcher's query.
3. PageRank. Finally, Google relies on its own proprietary PageRank technology to give an objective measurement of web page importance and popularity. PageRank determines a page's importance by counting the number of other pages that link to that page. The more pages that link to a page, the higher that page's PageRank, and the higher it will appear in the search results. The PageRank is a numerical ranking from 0 to 10, expressed as PR0, PR1, PR2, and so forth; the higher the better.
Although the other factors are important, PageRank is the secret sauce behind Google's page rankings. The theory is that the more popular a page is, the higher that page's ultimate value. While this sounds a little like a popularity contest (and it is), it's surprising how often this approach delivers high-quality results.

The actual formula used by PageRank (called the PageRank Algorithm) is super-duper top-secret classified, but by all accounts it's calculated using a combination of quantity and quality of the links pointing to a particular web page. In essence, the PageRank Algorithm considers the importance of each page that initiates a link, figuring (rightly so) that some pages have greater value than others. The higher the PageRank of the pages pointing to a given page, the higher the PageRank will be of the linked-to page. It's entirely possible that a page with fewer, higher-ranked pages linking to it will have a higher PageRank than a similar page with more (but lower-ranked) pages linking to it.

The PageRank factor on the linking page is also affected by the number of total outbound links on that page. That is, a page with a lot of outbound links will contribute a lower PageRank to each of its linked-to pages than will a page with just a few outbound links. As an example, a page with PageRank of PR8 that has 100 outbound links will boost a linked-to page's PageRank less than a similar PR8 page with just 10 outbound links.

It's important to note that Google's determination of a page's rank is completely automated. There is no human subjectivity involved, and no person or company can pay to increase the ranking of their listings. It's all about the math.

Note: PageRank is page specific, not site specific. This means that the PageRank of the individual pages on a website can (and probably will) vary from page to page.
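The two effects described in this section (links from higher-ranked pages count for more, and a page's contribution is divided among its outbound links) can be illustrated with a simplified iteration in the style of the version of PageRank that Google's founders described publicly in their early academic work. This is a sketch only: the production formula is not public, the damping value of 0.85 and the iteration count are assumptions, the three-page graph is made up, and the resulting scores are fractions that sum to 1 rather than the PR0 to PR10 scale.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively estimate a score for each page in a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its score evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / len(pages)
        rank = new_rank
    return rank


# Made-up graph: a and b both link to c, and c links back to a.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(toy_links)
```

In this toy graph, `c` ends up with the highest score because two pages link to it, `a` comes next because its single inbound link comes from the highly scored `c`, and `b` comes last because nothing links to it. Dividing by `len(targets)` is what makes a link from a page with many outbound links worth less than a link from an equally scored page with only a few.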
  |