Archives for 2006

Google Doesn’t Do Synonyms

Rarely do we want to rank for only one search phrase. Generally there are many different words and phrases that humans might use to get to your pages, phrases that I refer to here very roughly as "synonyms".

But the notion of synonymy is a "semantic" concept — one that involves meaning — rather than a "syntactic" concept — one that relates only to the form of the words. Search engines don’t do semantics, only syntax.

For example, I know that a force transducer is the same thing as a load cell, but these semantically equivalent terms are syntactically disjoint, so ranking for one will not get you ranking for the other in today’s search engines.

This is especially problematic where your company name, which often gets used as link text, does not contain your preferred search phrase. Here’s a concrete example I found to illustrate the point, but you can find many of your own.

A company named Transducer Techniques sells load cells. Totally obvious to me, a total mystery to Google. Many of the links to this firm are from industry directories that use the company name, so they are top ranked for the word "transducer". But according to WordTracker, the real traffic is for "load cell", where this company ranks #4.

Now compare Transducer Techniques to Load Cell Central, the top ranked page for load cell, which is not even in the first 100 results for transducer.

This is entirely a result of link text. If links like Transducer Techniques were replaced with links like The Load Cell Experts, the rankings would likely be very different, given the difference in PageRank between these two sites.

A Brief Outline of the Google Architecture

The very front end of Google is the spider function, which creates a queue of pages to visit along with the content of each. This feeds the indexer, which matches these pages to existing pages in the index and either creates a new index entry or updates an existing one with the new content. Originally at least, and I suspect still today, the indexing stage is also where link text is propagated from linking pages to destination pages, where it is stored as an augmentation to the target page. This is one of — IMHO — the great ideas inside Google, as it essentially reduces a key off-page analysis to one that can instead be conducted entirely on-page.
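The link-text propagation step can be sketched in a few lines. This is a hypothetical illustration, not Google's actual data structures: the idea is simply that when a page is indexed, the anchor text of each outgoing link is stored with the *target* page, so an off-page signal becomes searchable as if it were on-page.

```python
# index maps url -> list of text fragments treated as that page's content
index = {}

def index_page(url, body_text, links):
    """links is a list of (target_url, anchor_text) pairs found on the page."""
    # Store the page's own content.
    index.setdefault(url, []).append(body_text)
    # Propagate each link's anchor text to its destination page,
    # augmenting the target even before that page is spidered.
    for target, anchor in links:
        index.setdefault(target, []).append(anchor)

index_page("example-directory.com", "Industry directory of sensor firms.",
           [("example-vendor.com", "The Load Cell Experts")])

# The anchor text is now attached to the target page's index entry.
assert "The Load Cell Experts" in index["example-vendor.com"]
```

This is why the link text a site receives, and not just its own page copy, determines which phrases it can rank for.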

As an entirely separate process that uses the existing index as input, the PageRank algorithm is run to create a "side file" of PR values indexed by the same unique identifier used to access pages. The architecture makes no demands as to when PR is updated — it can be on any schedule they like.

What normal humans, not to be confused with SEOs :-), call "the search engine" is really the "query engine". This takes the index of content and the PageRank values and performs a computation to rank results as each user query is received and processed. Clearly, there can be no pre-processing of queries — there is just raw data to feed into this engine.
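A toy query engine makes the division of labor concrete. The multiplicative combination of an on-page relevance score with the precomputed PageRank value is purely illustrative; Google's actual ranking formula is unknown, but the shape of the computation — stored index plus side file in, ranked results out, per query — is what matters here:

```python
def rank_results(query, index, pagerank):
    """index: url -> page text (including propagated link text).
    pagerank: url -> precomputed PR value (the side file)."""
    terms = query.lower().split()
    scored = []
    for url, text in index.items():
        words = text.lower().split()
        relevance = sum(words.count(t) for t in terms)  # naive term counting
        if relevance:
            # Illustrative only: weight on-page relevance by PageRank.
            scored.append((relevance * pagerank.get(url, 0.0), url))
    return [url for score, url in sorted(scored, reverse=True)]

index = {"a.example": "load cell load cell", "b.example": "load cell transducer"}
results = rank_results("load cell", index, {"a.example": 0.2, "b.example": 0.9})
# b.example outranks a.example despite fewer matches, on PageRank strength.
```

Note that the query string arrives only at this final stage, untouched by everything upstream.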

So what happens to a brand new page?

A new page doesn’t actually "exist" as far as Google is concerned until it is spidered and indexed. Once it is indexed, meaning you can find it at Google using a site: query on your domain, it is my best understanding that all Link Reputation effects are baked into the index and fully affect search queries. Since spidering and indexing is the fastest part of Google, this explains the similarly fast positioning changes that can be accomplished through link text alone.

PageRank is another matter and can take any amount of time. Nothing in the software architecture requires any particular schedule. In fact, my research suggests that the way PR is computed, and the schedule used to compute it, are drastically different today than they were three years ago.

While a new page awaits PageRank values, it can be very difficult for that page to rank for any significant searches. Likewise, making significant linking topology changes in an existing website, using my Dynamic Linking approach for example, will generally take quite a while to have a ranking impact because of how long it takes for PR values to be recomputed across your site.