Relative vs. Absolute links

This latest SEO myth, rumored to have come from an advanced SEO seminar, stems from a complete lack of understanding of the underlying technology. As the story is told, absolute links pass PageRank but relative links do not. Horse-hockey!

There are actually three varieties of links. An absolute link is one that includes the scheme, the complete domain name and the path information, like http://www.windrosesoftware.com/index.html. A domain relative link is one lacking a domain name but including absolute path information, such as /site/index.php. The final form is the path relative link, which lacks both the domain name and the leading ‘/’, for example site/index.php.

A search engine spider, just like your desktop browser, is simply an HTTP/HTML client program. It makes an HTTP request to a web server and processes the HTML text that is returned as a result. This is the entirety of the interaction with the server. All that is left is to process the HTML locally.
To resolve the links in the document, the spider/browser has to take two steps.

First, the "base URL" for the document must be determined. By default, this will be the absolute URL of the document itself. However, the <base> tag can be used in a document to override the base URL used for relative links found within the document. All browsers and spiders must look for this tag and modify the base URL for the document appropriately before doing any link processing.
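As a rough illustration of this first step, here is a minimal Python sketch using only the standard library. The names BaseFinder and find_base_url are made up for this example, and the example.com URLs are placeholders; real spiders are certainly more elaborate.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseFinder(HTMLParser):
    """Hypothetical helper: records the href of the first <base> tag seen."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def find_base_url(document_url, html_text):
    """Return the base URL a client would use to resolve links in html_text."""
    finder = BaseFinder()
    finder.feed(html_text)
    if finder.base_href is None:
        # No override: the document's own absolute URL is the base.
        return document_url
    # The href of a <base> tag may itself be relative, so resolve it
    # against the document URL.
    return urljoin(document_url, finder.base_href)


page = '<html><head><base href="http://www.example.com/other/"></head></html>'
print(find_base_url("http://www.example.com/a/b.html", page))
```

With no override tag in the document, the function simply hands back the document's own URL.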

Second, with the base URL now in hand, each link must be "canonicalized". What that twenty dollar word means is "to put into standard form", which in the case of URLs is the same as saying "make all URLs absolute".

  • Absolute URLs obviously don’t change at all, since they are already canonical;
  • Domain relative URLs get the scheme and domain of the base URL added; and
  • Path relative URLs get the base URL, up to and including its final ‘/’, added as a prefix.
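The three cases above can be tried out with urljoin from Python's standard library, which performs this kind of resolution. This is only a sketch: the canonicalize wrapper and the page name about.html are assumptions made for the example, not anything a particular search engine is known to use.

```python
from urllib.parse import urljoin

# Hypothetical page on which the three links are found.
base_url = "http://www.windrosesoftware.com/about.html"

links = [
    "http://www.windrosesoftware.com/site/index.php",  # absolute
    "/site/index.php",                                 # domain relative
    "site/index.php",                                  # path relative
]


def canonicalize(base, link):
    """Resolve a link of any variety against the base URL."""
    return urljoin(base, link)


for link in links:
    print(link, "->", canonicalize(base_url, link))
```

On this particular page all three links resolve to the same absolute URL. Had the page lived in a subdirectory, the path relative link would have resolved relative to that directory instead.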

So why does it have to be this way? Because spiders deal in "pages", not "sites", there is no way to process non-canonicalized URLs. You can either process absolute URLs or carry around the base URL separately; a relative URL is not meaningful in isolation from the document where it is found. This is so fundamental to the task of parsing HTML that the only sensible place for the search engines to canonicalize URLs is in the software that does the spidering of pages. Once that is done, links of any variety that point to the same page are represented by the identical absolute URL.

Moreover, even absolute URLs have problems, owing to what I personally consider a bug in the HTTP specification, so they are not the basis of indexing within search engines either. Google uses what the founders called a "docID" to uniquely identify the pages added to the Google index.

Somewhere early in the Google machine, all links are transformed from references via (absolute) URL to references involving the docID. For good technical reasons, the other engines are likely to be similarly organized, so that the (original) form of a URL will cease to be known to the algorithms downstream of the spidering application.
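To make the point concrete, here is a toy Python sketch of the idea. It is not how Google actually assigns docIDs; the DocIndex class, the link_edges function and the example.com URLs are all invented for illustration, and for simplicity the page's own URL is used as the base URL.

```python
from urllib.parse import urljoin


class DocIndex:
    """Toy index: hands out one integer ID per distinct absolute URL."""

    def __init__(self):
        self._ids = {}

    def doc_id(self, absolute_url):
        # A URL seen before gets its old ID; a new one gets the next integer.
        return self._ids.setdefault(absolute_url, len(self._ids))


def link_edges(index, page_url, hrefs):
    """Turn the links found on a page into (source ID, target ID) pairs."""
    source = index.doc_id(page_url)
    return [(source, index.doc_id(urljoin(page_url, href))) for href in hrefs]


index = DocIndex()
edges = link_edges(
    index,
    "http://www.example.com/index.html",
    ["http://www.example.com/site/index.php", "/site/index.php", "site/index.php"],
)
print(edges)
```

All three links end up as the same pair of IDs, so anything working from these pairs has no way to tell which links were written as absolute and which as relative.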
