Date with Googlebot, Part II: HTTP status codes and If-Modified-Since
Thursday, November 27, 2008
-----------
Dearest Googlebot,
Recently, I did some spring cleaning on my site and deleted a couple of old, orphaned pages. They now return the 404 "Page not found" code. Is this ok, or have I confused you?
Frankie O'Fore
Dear Frankie,
404s are the standard way of telling me that a page no longer exists. I won't be upset—it's normal that old pages are pruned from websites, or updated to fresher content. Most websites will show a handful of 404s in the Crawl Diagnostics over at Webmaster Tools. It's really not a big deal. As long as you have good site architecture with links to all your indexable content, I'll be happy, because it means I can find everything I need.
But don't forget, it's not just me who comes to your website—there may be humans seeing these pages too. If you've only got a very simple '404 page not found' message, visitors who aren't as savvy could be baffled. There are lots of ways to make your 404 page more friendly; a quick one is our 404 widget over at Webmaster Tools, which will help direct people to content which does exist. For more information, you can read the blog post. Most web hosting companies, big and small, will let you customise your 404 page (and other return codes too).
Love and kisses,
Googlebot
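P.S. For the hands-on types, here is a minimal Python sketch of a friendlier 404. The `PAGES` table and the `handle` function are made up for illustration; your server or framework will have its own way of customising error pages. The point it tries to show is that the status stays 404 while the body gives humans somewhere to go.

```python
from html import escape

# Hypothetical site content; a real site would look pages up on disk or in a database.
PAGES = {
    "/": "<h1>Home</h1>",
    "/articles/": "<h1>Articles</h1>",
}


def handle(path):
    """Return (status_line, html_body) for a requested path."""
    if path in PAGES:
        return "200 OK", PAGES[path]
    # Still a 404 status for crawlers, but with a friendlier body for humans.
    links = "".join(
        f'<li><a href="{escape(p)}">{escape(p)}</a></li>' for p in sorted(PAGES)
    )
    body = (
        "<h1>Sorry, that page is gone</h1>"
        f"<p>Nothing lives at {escape(path)} any more. Try one of these instead:</p>"
        f"<ul>{links}</ul>"
    )
    return "404 Not Found", body
```

Calling `handle("/old-page.html")` in this sketch returns the 404 status together with a short list of links to pages that do exist.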
Hey Googlebot,
I was just reading your reply to Frankie above, and it raised a couple of questions.
What if I have someone linking to a page that no longer exists? How can I make sure my visitors still find what they're after? Also, what if I just move some pages around? I'd like to better organise my site, but I'm worried you'll get confused. How can I help you?
Yours hopefully,
Little Jimmy
Hello Jimmy,
Let's pretend there are no anachronisms in your letter, and get to the meat of the matter. Firstly, let's look at links coming from other sites. Obviously, these can be a great source of traffic, and you don't want visitors presented with an unfriendly 'Page not found' message. So, you can harness the power of the mighty redirect.
There are two types of redirect: 301 and 302. Actually, there are lots more, but these are the two we'll concern ourselves with now. Just like 404, 301 and 302 are different types of response codes you can send to users and search engine crawlers. They're both redirects, but a 301 is permanent and a 302 is temporary. A 301 redirect tells me that whatever this page used to be, now it lives somewhere else. This is perfect for when you're re-organising your site, and also helps with links from offsite. Whenever I see a 301, I'll update all references to that old page with the new one you've told me about. Isn't that easy?
If you don't know where to begin with redirects, let me get you started. It depends on your webserver, but here are some searches that may be helpful:
Apache: http://www.google.com/search?q=301+redirect+apache
IIS: http://www.google.com/search?q=301+redirect+iis
You can also check your manual, or the README files that came with your server.
As an alternative to a redirect, you can email the webmaster of the site linking to you and ask them to update their link. Not sure what sites are linking to you? Don't despair - my human co-workers have made that easy to figure out. In the "Links" portion of Webmaster Tools, you can enter a specific URL on your site to determine who's linking to it.
My human co-workers also just released a tool which shows URLs linking to non-existent pages on your site. You can read more about that here.
Yours informationally,
Googlebot
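P.S. If you would like to see the idea before diving into your server manual, here is a rough Python sketch of a permanent redirect. The paths in `MOVED_PERMANENTLY` and the `respond` function are invented for this example; Apache and IIS each have their own configuration syntax for the same thing.

```python
from http import HTTPStatus

# Hypothetical mapping of retired paths to their new homes.
MOVED_PERMANENTLY = {
    "/old-about.html": "/about/",
    "/products/widgets.html": "/shop/widgets/",
}


def respond(path):
    """Return (status_line, headers) for a request path."""
    target = MOVED_PERMANENTLY.get(path)
    if target is not None:
        # 301: the page lives somewhere else now, for good.
        status = HTTPStatus.MOVED_PERMANENTLY
        return f"{status.value} {status.phrase}", [("Location", target)]
    status = HTTPStatus.OK
    return f"{status.value} {status.phrase}", [("Content-Type", "text/html; charset=utf-8")]
```

A visitor following an old link to `/old-about.html` gets a 301 with a `Location` header pointing at `/about/`, so both browsers and crawlers learn the new address.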
Darling Googlebot,
I have a problem—I live in a very dynamic part of the web, and I keep changing my mind about things. When you ask me questions, I never respond the same way twice—my top threads change every hour, and I get new content all the time! You seem like a straightforward guy who wants straightforward answers. How can I tell you when things change without confusing you?
Temp O'Rary
Dear Temp,
I just told little Jimmy that 301s are the best way to tell a Googlebot about your new address, but what you're looking for is a 302.
Once you're indexed, it's the polite way to tell your visitors that your address is still the right one, but that the content can temporarily be found elsewhere. In these situations, a 302 (or the rarer '307 Temporary Redirect') would be better. For example, orkut redirects from http://orkut.com to http://google.com/accounts/login?service=orkut, which isn't a page that humans would find particularly useful when searching for Orkut***.
It's on a different domain, for starters. So, a 302 has been used to tell me that all the content and linking properties of the URL shouldn't be updated to the target - it's just a temporary page.
That's why when you search for orkut, you see orkut.com and not that longer URL.
Remember: simple communication is the key to any relationship.
Your friend,
Googlebot
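P.S. In code, the difference between the two redirects is only the status code you send. The sketch below is a hypothetical Python helper (the name `redirect` and its `permanent` flag are assumptions, not part of any particular server) that defaults to the temporary kind.

```python
from http import HTTPStatus


def redirect(location, permanent=False):
    """Build a redirect response as (status_line, headers).

    permanent=False sends a 302, meaning "keep using the old address";
    permanent=True sends a 301, meaning "update your records".
    """
    status = HTTPStatus.MOVED_PERMANENTLY if permanent else HTTPStatus.FOUND
    return f"{status.value} {status.phrase}", [("Location", location)]
```

For a login detour like the orkut example, `redirect("/accounts/login")` would be the fitting call in this sketch, since the original URL is the one that should stay in search results.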
*** Please note, I simplified the URL to make it easier to read. It's actually much more complex than that.
Captain Googlebot,
I am the kind of site who likes to reinvent herself. I noticed that the links to me on my friends' sites are all to URLs I got rid of several redesigns ago! I had set up 301s to my new URLs for those pages, but after that I 301'ed the newer URLs to my next version. Now I'm afraid that if you follow their directions when you come to crawl, you'll end up following a string of 301s so long that by the end you won't come calling any more.
Ethel Binky
Dear Ethel,
It sounds like you have set up some URLs that redirect to more redirects to... well, goodness! In small amounts, these "repeat redirects" are understandable, but it may be worth considering why you're using them in the first place. If you remove the 301s in the middle and send me straight to the final destination on all of them, you'll save the both of us a bunch of time and HTTP requests. But don't just think of us. Other people get tired of seeing that same old 'contacting.... loading ... contacting...' game in their status bar.
Put yourself in their shoes—if your string of redirects starts to look rather long, users might fear that you have set them off into an infinite loop! Bots and humans alike can get scared by that kind of "eternal commitment." Instead, try to get rid of those chained redirects, or at least keep 'em short. Think of the humans!
Yours thoughtfully,
Googlebot
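P.S. If your redirects live in a simple old-URL-to-new-URL table, shortening the chains can be automated. Here is a small Python sketch under that assumption; `flatten_redirects` and the example paths are hypothetical, and it also refuses to continue if it spots a loop.

```python
def flatten_redirects(redirects):
    """Point every old URL straight at its final destination.

    `redirects` maps an old URL to the URL it currently redirects to.
    Raises ValueError if following a chain ever comes back on itself.
    """
    flattened = {}
    for start in redirects:
        seen = {start}
        target = redirects[start]
        # Keep hopping while the target is itself a redirected URL.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target!r}")
            seen.add(target)
            target = redirects[target]
        flattened[start] = target
    return flattened


# Example: three redesigns' worth of 301s collapsed into single hops.
chain = {"/v1/home": "/v2/home", "/v2/home": "/v3/home", "/v3/home": "/home"}
direct = flatten_redirects(chain)
```

After this runs, every entry in `direct` points at `/home`, so a visitor arriving via the oldest link needs one redirect instead of three.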
Dear Googlebot,
I know you must like me—you even ask me for unmodified files, like my college thesis that hasn't changed in 10 years. It's starting to be a real hassle! Is there anything I can do to prevent your taking up my lovely bandwidth?
Janet Crinklenose
Janet, Janet, Janet,
It sounds like you might want to learn a new phrase—'304 Not Modified'. If I've seen a URL before, I insert an 'If-Modified-Since' in my request's header. This line also includes an HTTP-formatted date string. If you don't want to send me yet another copy of that file, stand up for yourself and send back a normal HTTP header with the status '304 Not Modified'! I like information, and this qualifies too. When you do that, there's no need to send me a copy of the file—which means you don't waste your bandwidth, and I don't feel like you're palming me off with the same old stuff.
You'll probably notice that a lot of browsers and proxies will say 'If-Modified-Since' in their headers, too. You can be well on your way to curbing that pesky bandwidth bill.
Now go out there and save some bandwidth!
Good ol' Googlebot
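P.S. Most web servers already handle this for static files, but if you generate responses yourself, the check looks roughly like the Python sketch below. The function name `conditional_response` and its return shape are assumptions made for this example; the date parsing comes from Python's standard `email.utils` module.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(if_modified_since, last_modified):
    """Decide between 200 and 304.

    if_modified_since: the raw request header value, or None if absent.
    last_modified: timezone-aware datetime of the file's last change.
    Returns (status_line, headers, send_body).
    """
    # HTTP dates have one-second resolution and are expressed in GMT.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    headers = [("Last-Modified", format_datetime(last_modified, usegmt=True))]
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # Unparseable date: fall back to a full response.
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return "304 Not Modified", headers, False
    return "200 OK", headers, True


# Example: a thesis untouched since 1998, requested again in 2008.
thesis_changed = datetime(1998, 5, 1, 12, 0, tzinfo=timezone.utc)
status, headers, send_body = conditional_response(
    "Thu, 27 Nov 2008 00:00:00 GMT", thesis_changed
)
```

In the example, the file has not changed since the date in the header, so the sketch answers with a 304 and no body, which is where the bandwidth saving comes from.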
-----------
Googlebot has been so helpful! Now we know how best to respond to users and search engines. The next time we get together, though, it's time to sit down for a good long heart-to-heart with the guy (Date with Googlebot: Part III is coming soon!).
UPDATE: Added a missing link. Thanks to Boris for pointing that out.