A web 2.0 search engine that doesn't understand HTML


By Gerd Riesselmann - Posted on 01 November 2005

Over the last few days I checked my log files more frequently than before, to see whether any files got lost in the switch to Drupal. I recently got quite a lot of 404s, mostly caused by spam bots trying to pingback, trackback or post a comment. But some looked serious, until I figured out they were caused by a search spider unable to cope with the HTML base tag.

Drupal adds a base statement to every page, effectively pointing to the site's root directory. Having a base means that all relative links should be considered relative to the base URL, not the document URL. This is very handy if you do not know in advance what the URL of a piece of text will be. This text, for example, may appear on the front page, on some archive and category pages, and under its own unique URL. Without a base, I would have to code all relative links as absolute, that is, prepend "http://www.gerd-riesselmann.net/" to each of them. This is both annoying and error-prone.
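To make the difference concrete, here is a minimal sketch using Python's standard `urllib.parse.urljoin`. The helper name `resolve_link` is hypothetical, and the relative href `categories/blogging` (written without a leading slash) is an assumption about how the link appears in the page source.

```python
from urllib.parse import urljoin


def resolve_link(document_url, href, base_href=None):
    """Resolve href the way a base-aware client would (hypothetical helper)."""
    # A base href may itself be relative, so resolve it against the document URL first.
    effective_base = urljoin(document_url, base_href) if base_href else document_url
    return urljoin(effective_base, href)


page = "http://www.gerd-riesselmann.net/archives/2004/11/"
site_root = "http://www.gerd-riesselmann.net/"
href = "categories/blogging"  # assumed form of the relative link

# With the base pointing at the site root:
print(resolve_link(page, href, base_href=site_root))
# http://www.gerd-riesselmann.net/categories/blogging

# Without any base, the document URL is used instead:
print(resolve_link(page, href))
# http://www.gerd-riesselmann.net/archives/2004/11/categories/blogging
```

The same relative link yields two different targets, depending only on whether the base is taken into account.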

The base tag has been part of HTML, if not forever, then at least since HTML 4.01.

In steps OmniExplorer, a search spider that already got some attention this summer for being too hungry and ignoring robots.txt. It tries to index pages like this one:

/archives/2004/11/categories/blogging

It has obviously found the link to "/categories/blogging" on the page /archives/2004/11/, taken the latter as the base for relative links and appended the former. In other words, it is completely ignoring the base tag.
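The following sketch reproduces that behaviour with Python's standard `html.parser`. It is not OmniExplorer's code, which is not public as far as this post knows; `LinkCollector`, `extract_links` and the sample markup are hypothetical, and the href is again assumed to be written without a leading slash.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the first base href and all anchor hrefs of a page."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "base" and self.base_href is None and attributes.get("href"):
            self.base_href = attributes["href"]
        elif tag == "a" and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def extract_links(document_url, html, honor_base=True):
    """Return absolute link targets, optionally ignoring the base tag."""
    collector = LinkCollector()
    collector.feed(html)
    base = document_url
    if honor_base and collector.base_href:
        base = urljoin(document_url, collector.base_href)
    return [urljoin(base, href) for href in collector.hrefs]


# Hypothetical, stripped-down version of the archive page.
PAGE_URL = "http://www.gerd-riesselmann.net/archives/2004/11/"
SAMPLE_PAGE = """<html><head>
<base href="http://www.gerd-riesselmann.net/">
</head><body>
<a href="categories/blogging">Blogging</a>
</body></html>"""

print(extract_links(PAGE_URL, SAMPLE_PAGE))
# ['http://www.gerd-riesselmann.net/categories/blogging']
print(extract_links(PAGE_URL, SAMPLE_PAGE, honor_base=False))
# ['http://www.gerd-riesselmann.net/archives/2004/11/categories/blogging']
```

With `honor_base=False` the sketch produces exactly the kind of path that shows up in the 404 log.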

I wonder, however, how OmniExplorer managed to build this URL:

/archives/2004/11/archives/2004/11/categories/blogging

There is no link to the November 2004 archive anywhere on the November 2004 archive page...

The next thing that happens when the base directive is ignored is a kind of recursive URL, like this one:

/node/node/node/node

And on and on forever...
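A rough simulation of how such a chain might build up. It rests on two assumptions that the logs do not confirm: every page carries a relative link `node`, and the spider treats each document URL as a directory and simply appends the href. The function names are hypothetical.

```python
def naive_resolve(document_url, href):
    """Append href to the document URL as if it were a directory (assumed spider behaviour)."""
    return document_url.rstrip("/") + "/" + href


def simulate_crawl(start_url, href, depth):
    """Follow the same relative link `depth` times, ignoring any base tag."""
    urls = [start_url]
    for _ in range(depth):
        urls.append(naive_resolve(urls[-1], href))
    return urls


for url in simulate_crawl("/node", "node", 3):
    print(url)
# /node
# /node/node
# /node/node/node
# /node/node/node/node
```

Under these assumptions every fetched page hands the spider one more level to try, so the chain only stops when the spider gives up.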

It seems OmniExplorer has raised around 5 million dollars from venture capitalists who think this will be "a Web 2.0 search deal". Duh!
