Releasing a web project for the first time brings up something new every day. The lesson I learned yesterday was to never rely on Apache’s MultiViews feature, since it doesn’t work with the most important browser on earth: Googlebot.
As you may know, Google does deep indexing of new content only once a month, which happened to be yesterday. But imagine my horror when I discovered each of my content pages delivering a 406 error to Googlebot. Oh me, oh my! What has happened here?
It turned out that this is a well-known problem of Apache’s MultiViews feature. MultiViews allows file extensions to be left out, so you can call a URL like www.somehost.org/mypage.php using www.somehost.org/mypage. Apache will figure out there is only one file (mypage.php) matching the request and will serve it instead of returning an error 404 - Not Found.
So far, so good. And handy, indeed. But…
While doing this, Apache will check whether the MIME type of the file used for substitution matches the MIME types accepted by the client making the request. This ain’t a problem for most browsers, since they usually accept any MIME type. However, Google decided that Googlebot should accept only (or mainly) text/html.
Well, mypage.php creates text/html, doesn’t it? So where’s the problem?
The problem is that while it may create text/html, mypage.php itself is of MIME type application/x-httpd-php. This type is not accepted by Googlebot, hence the error 406 - Not Acceptable.
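To make the mismatch concrete, here is a minimal sketch of a setup that behaves this way. The original configuration isn't shown in this post, so this is an assumption: it presumes PHP is registered through `AddType`, which was the usual mod_php install instruction, and that MultiViews is switched on for the site.

```apache
# Assumed setup, not the original configuration.
# MultiViews lets /mypage resolve to mypage.php through content negotiation.
Options +MultiViews

# AddType assigns application/x-httpd-php as the MIME type of every .php file.
# Negotiation compares this type, not the text/html the script later outputs,
# against the client's Accept header.
AddType application/x-httpd-php .php
```

With a configuration along these lines, a client whose Accept header lists only text/html has no acceptable variant for /mypage, which matches the 406 described above.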
Unfortunately, there doesn’t seem to be a general solution for Apache 1.3.x (while there is one for Apache 2). And since I wasn’t feeling lucky enough to figure one out by myself, I decided to rely on Apache’s rewriting capabilities. Which, to be honest, I had tried to avoid until now, since regular expressions usually give me headaches.
My first hotfix was to simply hardcode a pair of rewrite rules for every page.
For a given page, the first rule rewrites /mypage/some/parameters to /mypage.php/some/parameters, and the second rewrites /mypage to /mypage.php. Note that query strings like ?arg1=some&arg2=parameter are passed along by Apache automatically, so something like /mypage?arg1=some&arg2=parameter is handled by the second rule.
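The exact rules aren't reproduced in this post, but a pair matching that description could look roughly like the sketch below. It is a reconstruction under assumptions: the rules are taken to live in the main server or virtual host configuration (in a .htaccess file the leading slash would have to be dropped from the patterns), and `mypage` stands in for each real page name.

```apache
# Hypothetical reconstruction for a single page called "mypage".
# Server/virtual-host context assumed, so patterns start with a slash.
RewriteEngine On

# Rule 1: /mypage/some/parameters -> /mypage.php/some/parameters
RewriteRule ^/mypage/(.*)$ /mypage.php/$1 [L]

# Rule 2: /mypage -> /mypage.php
# The query string is not part of the matched path and is carried over as-is.
RewriteRule ^/mypage$ /mypage.php [L]
```

Every additional page needs its own copy of both rules, which is what makes this a hotfix rather than a solution.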
This hotfix made Googlebot happy, and therefore me, too. However, I then came up with a more general solution that covers all cases.
First, none of the rules should apply to real directories, like for example images or downloads. That’s why I excluded them using a rewrite condition. This is far from good, and I hope to find a general (and performant) solution for this, too. But since regular expressions give me headaches, I won’t do too much of them at a time.
The first rule now checks for a request that doesn’t contain a dot before the first slash, which means there is no file extension provided. It must have at least one character before the slash, though, else there will be a problem with the root directory “/”. The rule will simply append “.php” to that first part and stop.
The second rule does nearly the same for requests like “/mypage”, where there is no trailing slash.
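Since the general rules themselves aren't shown here, the following is only a sketch of what the description above amounts to. The assumptions: server or virtual host context again, `images` and `downloads` as the only real directories to exclude, and the character class `[^./]+` as one way to say "at least one character, no dot, up to the first slash".

```apache
# Sketch under assumptions, not the original rules.
RewriteEngine On

# Rule 1: "/name/anything", with no dot in "name", -> "/name.php/anything".
# [^./]+ needs at least one character, so the root "/" is never matched.
RewriteCond %{REQUEST_URI} !^/(images|downloads)(/|$)
RewriteRule ^/([^./]+)/(.*)$ /$1.php/$2 [L]

# Rule 2: "/name" without a trailing slash -> "/name.php".
# A RewriteCond only guards the rule directly after it, so it is repeated.
RewriteCond %{REQUEST_URI} !^/(images|downloads)(/|$)
RewriteRule ^/([^./]+)$ /$1.php [L]
```

A request like /mypage.php/some/parameters is left alone by both rules, because the dot appears before the first slash after the name.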
I’m quite sure both cases can be put into one single rule, but as you may know, regular expressions give me headaches. So this is left for another time.