Googlebot not following sitemap URLs faithfully

Here’s a little background first.

We have implemented a URL validation step when we process a response
to make sure that when people request a page they use the correct URL.
If they use an incorrect URL, they are sent a 301 redirect to
the correct URL.
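
To make that concrete, here is a minimal sketch of such a validation step. It is not our actual code: the function name validate_url and the CANONICAL_PATH constant are made up for illustration, and it assumes the only "correct" path is /index.html.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_PATH = "/index.html"  # assumption: the one path we treat as correct


def validate_url(requested_url):
    """Return (status, location) for a requested URL.

    200 with no location if the path is already canonical, otherwise
    301 with the corrected URL (same scheme, host and query string).
    """
    parts = urlsplit(requested_url)
    if parts.path == CANONICAL_PATH:
        return 200, None
    corrected = urlunsplit(
        (parts.scheme, parts.netloc, CANONICAL_PATH, parts.query, parts.fragment)
    )
    return 301, corrected
```

Under those assumptions, a request for the URL without index.html gets a 301 pointing at the same URL with index.html restored, and the query string is carried over untouched.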

The URL in our sitemap is in the format:
http://www.domain.com/index.html?whatever=value

We’ve now had errors showing up in Webmaster Tools, saying that Googlebot is coming across too many redirects in our sitemap URLs. The problem is that even though we put the correct URL in the sitemap, Googlebot doesn’t use that URL to make the request. It omits the index.html bit, contracting it down to:
http://www.domain.com/?whatever=value

So our server sees this ‘incorrect’ URL, issues a 301 with the
‘correct’ URL (that has the index.html bit in it), but then Googlebot
doesn’t follow that URL faithfully and again tries to request the URL
without index.html in the path.  So our server again issues a 301
redirect, with the correct URL and here we go off on our infinite
loop.
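
The loop can be reproduced with a small simulation. This is only a sketch under stated assumptions: server_response stands in for our validation step, strip_index mimics the behaviour we are seeing from Googlebot, and MAX_REDIRECTS is an arbitrary number, since we don't know the crawler's real limit. All of these names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

MAX_REDIRECTS = 5  # assumption: the crawler's real limit is not known to us


def server_response(url):
    """Stand-in for our server: 200 for /index.html, otherwise a 301 to it."""
    parts = urlsplit(url)
    if parts.path == "/index.html":
        return 200, None
    fixed = urlunsplit((parts.scheme, parts.netloc, "/index.html", parts.query, ""))
    return 301, fixed


def strip_index(url):
    """Mimic the observed behaviour: drop a trailing index.html from the path."""
    parts = urlsplit(url)
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def crawl(url, normalise):
    """Follow redirects, applying `normalise` to every URL before requesting it."""
    for hops in range(MAX_REDIRECTS + 1):
        status, location = server_response(normalise(url))
        if status == 200:
            return "ok", hops
        url = location  # follow the Location header of the 301
    return "too many redirects", MAX_REDIRECTS
```

Calling crawl with strip_index never reaches a 200, because every redirect target is contracted again before it is requested. Calling it with a function that leaves the URL alone (for example lambda u: u) succeeds straight away on the sitemap URL.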

So no wonder we get the error message:
URLs not followed….

contained too many redirects.

I think this is a bug, as the 301 redirect clearly sends the redirect
URL. If Googlebot followed this redirect URL faithfully, then we
wouldn’t see this issue.

Here is the sitemap error in more detail (with a pretend domain substituted for our actual one).

HTTP Error:
Found: 301 (Moved permanently)

http://www.domain.com/?param=whatever1
http://www.domain.com/?param=whatever2
http://www.domain.com/?param=whatever3
http://www.domain.com/?param=whatever4
http://www.domain.com/?param=whatever5
Jul 20, 2008

I have double-checked the sitemap file, and these URLs are in the right format there, complete with index.html.
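
For anyone who wants to run the same check on their own sitemap, a rough sketch follows. It assumes a standard sitemaps.org XML sitemap already loaded as a string; the function name urls_missing_index is made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_missing_index(sitemap_xml):
    """Return the <loc> URLs whose path does not end in /index.html."""
    root = ET.fromstring(sitemap_xml)
    missing = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if not urlsplit(url).path.endswith("/index.html"):
            missing.append(url)
    return missing
```

An empty list means every URL in the sitemap includes index.html, which is what we found for ours.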

Why does Googlebot strip out index.html?


4 Responses to Googlebot not following sitemap URLs faithfully

  1. Seth says:

    Yo Ed... I have a client who is having a similar problem, but they do not have a parameter string involved. I asked Google about this, but have not heard back. Here is what the issue I am dealing with looks like: http://www.sethnickerson.com/trouble-with-urls-in-google-sitemap/

  2. Ed says:

    Thanks Seth – it’s pretty poor isn’t it – the Googler John Mueller says that it is ‘usually fine’ to omit index.html from the request… ‘usually fine’. Hmm – if that were a general guideline that Google uses when implementing anything, then they would be stuffed.

  3. sandrar says:

    Hi! I was surfing and found your blog post… nice! I love your blog. 🙂 Cheers! Sandra. R.
