A web 2.0 search engine that doesn't understand HTML
Over the last few days I have checked my log files more often than before, to see whether any files got lost in the switch to Drupal. I recently got quite a lot of 404s, mostly caused by spam bots trying to pingback, trackback or post a comment. But some looked serious, until I figured out they were caused by a search spider unable to cope with the HTML base tag.
Drupal adds a base statement to every page, effectively pointing to the site's root directory. Having a base means that all relative links should be considered relative to the base URL, not the document URL. This is very handy if you do not know in advance what the URL of a piece of text will be. This text, for example, may appear on the front page, on some archive and category pages, and under its own unique URL. Without a base, I would have to code all relative links as absolute ones, that is, prepend "http://www.gerd-riesselmann.net/" to each of them. This is both annoying and error-prone.
The base tag has been part of HTML, if not forever, then at least since HTML 4.01.
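To make the rule concrete, here is a minimal sketch in Python of how a well-behaved client could resolve links. The names LinkResolver and resolve_links are made up for this illustration, and the href "categories/blogging" is an assumed example of a relative link; only html.parser and urllib.parse from the standard library are used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkResolver(HTMLParser):
    """Collects <a href> values and remembers the first <base href>, if any."""

    def __init__(self, document_url):
        super().__init__()
        self.base_url = document_url  # fallback when the page has no <base>
        self.base_seen = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href") and not self.base_seen:
            # Only the first <base> counts; its href may itself be relative.
            self.base_url = urljoin(self.base_url, attrs["href"])
            self.base_seen = True
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def resolve_links(html, document_url):
    """Return all link targets of the page as absolute URLs."""
    parser = LinkResolver(document_url)
    parser.feed(html)
    parser.close()
    return [urljoin(parser.base_url, href) for href in parser.hrefs]


# The same snippet, served under two different document URLs.
snippet = (
    '<head><base href="http://www.gerd-riesselmann.net/"></head>'
    '<body><a href="categories/blogging">Blogging</a></body>'
)
for url in ("http://www.gerd-riesselmann.net/",
            "http://www.gerd-riesselmann.net/archives/2004/11/"):
    print(resolve_links(snippet, url))
```

With the base in place, both calls should yield the same absolute link, no matter where the text is shown.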
In steps OmniExplorer, a search spider that already got some attention this summer for being too hungry and ignoring robots.txt. It tries to index pages like this one:
/archives/2004/11/categories/blogging
It has obviously found the link to "/categories/blogging" on the page /archives/2004/11/, taken the latter as the base for relative links, and appended the former. In other words, it is completely ignoring the base tag.
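The difference can be reproduced with Python's urllib.parse.urljoin. This is only a sketch of what presumably happens inside the spider, assuming the link in the markup is written as the relative href "categories/blogging":

```python
from urllib.parse import urljoin

SITE = "http://www.gerd-riesselmann.net/"            # what the base tag points to
page_url = urljoin(SITE, "archives/2004/11/")        # the page being crawled
href = "categories/blogging"                         # assumed relative link

with_base = urljoin(SITE, href)          # resolved against the base URL
without_base = urljoin(page_url, href)   # resolved against the document URL

print(with_base)
print(without_base)
```

The second result is the kind of URL that shows up in the log as a 404.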
I wonder, however, how OmniExplorer managed to build this URL:
/archives/2004/11/archives/2004/11/categories/blogging
Nowhere on the November 2004 archive page is there a link to the November 2004 archives...
Ignoring the base tag also leads to recursive URLs like this one:
/node/node/node/node
And on and on forever...
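How such a chain can build up is easy to simulate. The sketch below rests on two assumptions that the logs do not confirm: that the spider treats every document URL as a directory, and that each URL it requests returns a page carrying the same relative link "node". The function name naive_crawl is made up for this illustration.

```python
from urllib.parse import urljoin


def naive_crawl(start_url, href, depth):
    """Follow the same relative link `depth` times while ignoring any base tag."""
    urls = []
    current = start_url
    for _ in range(depth):
        # Assumption: the document URL is treated as a directory.
        directory = current if current.endswith("/") else current + "/"
        current = urljoin(directory, href)
        urls.append(current)
    return urls


for url in naive_crawl("http://www.gerd-riesselmann.net/node", "node", 3):
    print(url)
```

Each round appends one more "node" segment, so the spider never runs out of new URLs to request.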
It seems OmniExplorer has raised around 5 million dollars from venture capitalists who think this will be "a Web 2.0 search deal". Duh!
