Google, the Almighty

As I finish up my hack (largely based on searching and natural language processing), I wanted to step back and praise Google for their amazing search. Much of what they do is taken for granted, but over the course of the last two weeks I’ve really dived into the inner workings of Google’s search, and I have a renewed appreciation for it. Consider the fact that there are approximately 25 billion web pages on the Internet. Then consider that Google is able to not only come up with the most relevant articles but also do it in less than half a second. 25 billion web pages! Even the naive approach, taking the search query and just looking for matches, means a whole lot of data to go through.

But let’s take it a little further. Before you have even typed your second letter, Google is already guessing what you are searching for using, as best I can tell, some sort of Markov chain. Alright, so what? They are smart dudes after all, and this should be expected. Now take a look under one of the resulting links. There is a nice little description made up of the sentences most relevant to your query. As I looked to add this to my website, I began to wonder how they did it. First, they must store all that text from the articles somewhere (which was a deal breaker for me since I’m using one server that’s sitting under my bed). Then they have to sift through the text to find what’s relevant to the query using some natural language processing. It’s pretty neat when you think about it.

Further, when I was building my search function I ran into the issue of stemmed words. If you stem a query (which is usually a good idea), you must also stem the text you’re searching in. So Google probably has two copies of every article, stemmed and unstemmed. Then what? Do you search for the stemmed query in the stemmed text and then somehow relate the match back to the unstemmed text that needs to be displayed as a description? By this point, my mind is blown. But on top of all that, Google has to handle 400 million queries a day. That’s a lot of queries.
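To make the stemming question concrete, here is a minimal sketch of one way to relate a stemmed match back to the unstemmed text: keep each word's character offsets next to its stem, match on the stems, and slice the original text for the description. This is only an illustration, not how Google does it. The names `crude_stem`, `index_text`, and `snippet` are hypothetical, and the suffix-stripping rule is a toy stand-in for a real stemmer.

```python
import re

# Toy suffix list, an assumption for illustration only (not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "s")

def crude_stem(word):
    """Lowercase a word and strip at most one common suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like "was" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_text(text):
    """Return (stem, start, end) for every word so a stem match maps back to the original text."""
    return [(crude_stem(m.group()), m.start(), m.end())
            for m in re.finditer(r"[A-Za-z']+", text)]

def snippet(text, query, radius=30):
    """Return the unstemmed text around the first word whose stem matches a stemmed query term."""
    query_stems = {crude_stem(w) for w in re.findall(r"[A-Za-z']+", query)}
    for stem, start, end in index_text(text):
        if stem in query_stems:
            lo = max(0, start - radius)
            hi = min(len(text), end + radius)
            return text[lo:hi]
    return None  # no query term appears in the text

article = "The crawler was indexing pages all night. Indexed pages are stored on disk."
print(snippet(article, "indexed"))
```

The query "indexed" never appears in that form before the word "indexing", yet the two share a stem, so the description is cut from the original, unstemmed sentence. Storing offsets alongside the stems avoids keeping a second full copy of the text.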

I could go on forever talking about all the stuff that needs to be done in that half of a second (advanced searches, etc), but I thought this was a good list to start with. Just to recap, here’s what goes on in just a day of searches:

for (int i = 0; i < 400 million; i++) {

  1. Process query and predict what the user is going to query for
  2. Handle funky searches and advanced searches
  3. Find relevant articles out of the 25 billion pages indexed
  4. Search articles for relevant description
  5. Output unstemmed description
  6. Finish in under half a second

}

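Step 1 above, predicting the query, can be sketched with a first-order Markov chain over words: count which word tends to follow which in past queries, then suggest the most frequent followers. The Markov chain is only the guess made earlier in this post, not a confirmed description of Google's system. The function names `train` and `predict_next` and the tiny query log are made up for illustration.

```python
from collections import Counter, defaultdict

def train(queries):
    """Count, for each word, which words followed it in past queries."""
    transitions = defaultdict(Counter)
    for query in queries:
        words = query.lower().split()
        # Pair every word with the word that comes right after it.
        for current, following in zip(words, words[1:]):
            transitions[current][following] += 1
    return transitions

def predict_next(transitions, word, k=2):
    """Return up to k of the most frequent next words seen after `word`."""
    followers = transitions.get(word.lower(), Counter())
    return [w for w, _ in followers.most_common(k)]

# Hypothetical query log, invented for this example.
past_queries = [
    "python list sort",
    "python list comprehension",
    "python list sort",
    "python dict",
]

model = train(past_queries)
print(predict_next(model, "python"))
print(predict_next(model, "list", k=1))
```

A real system would need far more than this (prefix matching on partial words, personalization, spelling correction), but the core idea of ranking continuations by how often they were seen is the same.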
I hope that next time you type a search into Google you take a second to think about the crazy technology that must be behind it. For that matter, take a second to think about any site you go to. Consider how painstakingly long it took the developer just to churn out one simple feature, how many bugs there were, and how many times he or she had to hit the refresh button in the browser to get that site working like a well-oiled machine. Thank you to all those brilliant developers out there creating amazing things every day.
