Beware of Apache's Multiviews


By Gerd Riesselmann - Posted on 15 April 2005

Releasing a web project for the first time brings up something new every day. The lesson I learned yesterday was to never rely on Apache's Multiviews feature, since it doesn't work with the most important browser on earth: Googlebot.

As you may know, Google does deep indexing of new content only once a month, which happened to be yesterday. Imagine my horror when I discovered that each of my content pages was delivering a 406 error to Googlebot. Oh me, oh my! What had happened here?

It turned out that this is a well-known problem with Apache's Multiviews feature. Multiviews allows substitution of file extensions, so you can call a URL like www.somehost.org/mypage.php using www.somehost.org/mypage. Apache will figure out that there is only one file (mypage.php) matching the request and will serve it instead of returning a 404 (Not Found) error.

So far, so good. And handy, indeed. But...

While doing this, Apache will check if the MIME-type of the file used for substitution matches the MIME types accepted by the client that does the request. This ain't a problem for most browsers, since they usually accept any MIME type. However, Google decided to accept only (or mainly) text/html.

Well, mypage.php creates text/html, doesn't it? So where's the problem?

The problem is that while it may create text/html, mypage.php itself is of MIME type application/x-httpd-php. This type is not accepted by Googlebot, hence the error 406 - Not Acceptable.
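For illustration, the situation can be reproduced with a setup along these lines. This is an assumed configuration, not the one from the post: it uses the classic mod_php mapping via AddType, which is what gives mypage.php the type application/x-httpd-php during negotiation.

```apache
# Assumed setup (not taken from the post): classic mod_php mapping.
# Negotiation looks at this type, not at what the script outputs.
AddType application/x-httpd-php .php

# With MultiViews on, a request for /mypage is negotiated to mypage.php
Options +MultiViews

# A client that sends only "Accept: text/html" finds no acceptable
# variant for /mypage and receives 406 Not Acceptable.
```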

Dooh!

Unfortunately, there doesn't seem to be a general solution for Apache 1.3.x (while there is one for Apache 2). And since I wasn't feeling lucky enough to figure it out by myself, I decided to rely on Apache's rewriting capabilities. Which, to be honest, I had recently tried to avoid, since regular expressions usually give me headaches.

My first hotfix was to simply hardcode every page:

Options -Multiviews
RewriteEngine On
RewriteBase /
RewriteRule ^mypage/(.*) /mypage.php/$1
RewriteRule ^mypage$ /mypage.php

This simply will rewrite /mypage/some/parameters to /mypage.php/some/parameters (the first rule) or /mypage to /mypage.php (the second one). Note that parameters like ?arg1=some&arg2=parameter are handled by Apache automatically. So something like /mypage?arg1=some&arg2=parameter will be handled by the second rule.

This hotfix made Googlebot happy, and therefore me, too. However, I came up with a more general solution that covers all cases:

Options -Multiviews
RewriteEngine On
RewriteBase /
 
# If there is no . before / rewrite to $1.php/$2
RewriteCond $1 !images|downloads
RewriteRule ^([^\./]+)/(.*) /$1.php/$2 [L]
 
# Find single files like mypage
RewriteCond $1 !images|downloads
RewriteRule ^([^\.]+)$ /$1.php [L]

First, none of the rules should apply to real directories, such as images or downloads. That's why I excluded them using a rewrite condition. This is far from good, and I hope to find a general (and performant) solution for this, too. But since regular expressions give me headaches, I won't do too much of them at a time.

The first rule now checks for a request that doesn't contain a dot before the first slash, which means there is no file extension provided. It must have at least one char before the slash, though, else there will be a problem with the root directory "/". The rule will simply append ".php" and exit.

The second rule does nearly the same for requests like "/mypage", where there is no trailing slash.

I'm quite sure both cases can be put into one single rule, but as you may know, regular expressions give me headaches. So this is left up for another time.
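A possible single-rule version, offered as an untested sketch rather than a verified fix: making the slash and everything after it optional lets one pattern cover both /mypage and /mypage/some/parameters. The anchored exclusion condition is a small tightening that is not in the rules above, so that only the exact directory names are skipped.

```apache
Options -MultiViews
RewriteEngine On
RewriteBase /

# Skip the real directories; anchored so that only the exact
# names "images" and "downloads" are excluded
RewriteCond $1 !^(images|downloads)$
# $1 = first path segment without a dot, $2 = optional "/rest"
RewriteRule ^([^./]+)(/.*)?$ /$1.php$2 [L]
```

When the request has no slash, the second group does not take part in the match and $2 expands to nothing, so /mypage becomes /mypage.php.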

Hi Gerd,

My own experience from about one or two years earlier was mixed. I was also using Multiviews with Apache 1.3, and we got 406s from some Googlebots but regular 200s from others. My conclusion was that the bots they have running are slightly different software or different versions. All I know is that our content was accessible in the Google index, but whether it took longer or had other side effects, I didn't investigate or compare back then. But it's probably best to avoid it altogether.

Good to know that there is a solution for Apache 2. Thanks.

As for a general mod_rewrite solution: Textpattern uses the following, which I find pretty nice:

RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.+) - [PT,L]
RewriteRule ^(.*) index.php

REQUEST_FILENAME is the full local filesystem path to the file or script matching the request.

-f checks whether it is a file.

-d checks whether it is a directory.

If either is true, do nothing (-) and skip the remaining rules (L). The last rule rewrites everything to index.php but can easily be replaced with your rule above:

RewriteRule ^([^\.]+)$ /$1.php [L]

(In case the comment form ate up the code, here is a link: http://svn.textpattern.com/current/.htaccess )

Thanks Sencer, I'll give it a try.

I previously was experimenting with the -d feature like this:

RewriteCond $1 !-d
RewriteRule ^([^\./]+)/(.*) /$1.php/$2 [L]

But for some reason it didn't work. %{REQUEST_FILENAME} instead seems to be the right thing to look at.
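One possible reason the $1 !-d attempt fails, offered as a guess rather than a diagnosis: $1 holds only the URL fragment captured by the rule pattern (for example "images"), while -d tests its argument as a filesystem path, so a bare relative name is unlikely to point at the real directory. A minimal sketch of the two rules from the post guarded by %{REQUEST_FILENAME} instead, untested against Apache 1.3 and assuming a .htaccess file in the document root:

```apache
Options -MultiViews
RewriteEngine On
RewriteBase /

# Leave requests for existing files and directories untouched
RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.+)$ - [L]

# /mypage/some/parameters -> /mypage.php/some/parameters
RewriteRule ^([^./]+)/(.*) /$1.php/$2 [L]

# /mypage -> /mypage.php
RewriteRule ^([^.]+)$ /$1.php [L]
```

With the file and directory check in front, the hardcoded images|downloads exclusion should no longer be needed, at the cost of a filesystem lookup per request.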