Multisite Robots and Sitemap

One of the issues of moving to a Multisite WordPress setup was managing the robots.txt and sitemap.xml files. robots.txt is a text file that lives at the root of a website (so, in my case, at http://paulsaunders.org.uk/robots.txt); search engines are supposed to read it as the first stage of crawling a site, and the file tells them what they can and can’t index. The sitemap.xml sits alongside the robots.txt file and assists search engines by telling them which pages exist and giving information about them.
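
For illustration, a minimal robots.txt might look like the following (the disallowed path and the sitemap URL here are placeholders, not my actual configuration):

User-agent: *
Disallow: /wp-admin/

Sitemap: http://example.org/sitemap.xml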

Now, on a typical webserver, you’d write these files, save them in /var/www/ or /srv/http/ or whatever (basically, alongside your index.html) and everything would be fine. But with a Multisite WordPress, you’re serving several hosts from the same filesystem. Because the content is generated on-the-fly (by the WordPress PHP engine), the server can return different content depending on the hostname requested.
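
As a rough sketch of what that means on the nginx side (the hostnames and document root below are made up, and a real Multisite setup needs more than this), several sites share one server block and one root, with WordPress deciding what to serve from the Host header:

server {
    listen 80;
    # Several sites, one document root: WordPress picks the
    # content to serve based on the hostname requested.
    server_name example.org blog.example.org photos.example.org;
    root /srv/http/wordpress;

    location / {
        try_files $uri $uri/ /index.php?$args;
    }
}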

So there are plugins for WordPress that allow us to serve per-host robots.txt and sitemap.xml files and manage them through the WordPress admin interface. I have chosen to use Multisite Robots.txt Manager and Google (XML) Sitemaps Generator for WordPress. These are nice and easy to use: install them like any other plugin. MS Robots.txt Manager needs to be Network Activated; the Sitemap Generator can be activated on a per-site basis.

The one issue I had with them was actually serving the generated files. As I’ve mentioned before, I use nginx as my webserver, which /doesn’t/ support the .htaccess file which Apache uses for redirects, access control and so on. So, without extra configuration, there’s no connection between a request for, say, robots.txt and the plugin.

The solution was to add the following lines to my nginx configuration.

# Auto-generated Metas
rewrite ^/robots\.txt$ /index.php?robots=1 last;
rewrite ^/sitemap\.xml$ /index.php?xml_sitemap=index last;

These lines need to be added to the server{} block for the WordPress site. I added them to the global/wordpress-mu.conf file that I include.
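
In other words, the relevant part of the site’s configuration ends up looking something like this (the include path matches my layout, and the document root is a placeholder):

server {
    server_name paulsaunders.org.uk;
    root /srv/http/wordpress;

    # Shared Multisite rules, including the robots.txt
    # and sitemap.xml rewrites above
    include global/wordpress-mu.conf;
}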

A reload of the nginx service and, hey presto, Google was able to start crawling my site.
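
For reference, on a systemd-based system that reload is something like the following (checking the config first, so a typo doesn’t take the server down):

nginx -t && systemctl reload nginx

# Sanity check: this should now return the plugin-generated file
curl http://paulsaunders.org.uk/robots.txt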
