The $10,000 Robots.txt File...Ouch!
I recently changed one of my robots.txt files, pruning duplicate content pages to help more of the site's internal PageRank flow to the higher quality, better earning pages. In the process, I forgot that one of the most well-linked-to pages on the site had a URL similar to the noisy pages. About a week ago the site's search traffic halved (right after Google became unable to crawl and index that powerful URL). I fixed the error pretty quickly, but the site now has hundreds of pages stuck in Google's supplemental index, and I am out about $10,000 in profit for that one line of code!

Both Google and Yahoo! support wildcards in robots.txt, but you really have to be careful when changing the file, because a line like this
Disallow: /*page
also blocks a file like this from being indexed in Google
beauty-pageants.php
Unless you are thinking of that in advance, it is easy to make a mistake.
If you are trying to prune duplicate content for Google and are fine with it ranking in other search engines, you may want to make those directives specific to Googlebot. If you create a directive block for a specific robot, that bot will ignore your general robots directives in favor of the more specific directives you created for it.
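For example, a minimal sketch of a robots.txt that keeps Googlebot away from the noisy URLs while leaving other crawlers alone might look like this (the /cgi-bin/ rule is just a hypothetical general directive):

User-agent: *
Disallow: /cgi-bin/

# Googlebot reads only this group, so any general rules it should still follow must be repeated here
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /*page

Because Googlebot skips the general group entirely once it finds its own, leaving the /cgi-bin/ line out of the Googlebot group would effectively unblock that path for Google.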
Google's webmaster guidelines and Yahoo!'s Search Blog both offer tips on how to format your robots.txt file.
Google also offers a free robots.txt test tool, which allows you to see how robots will respond to your robots.txt file, notifying you of any files that are blocked.
You can use Xenu Link Sleuth to generate a list of URLs from your site. Upload that URL list to the Google robots.txt test tool (currently in 5,000 character chunks...an arbitrary limit I am sure they will eventually lift).
Inside the webmaster console Google will also show you which pages are currently blocked by your robots.txt file, and let you see when Google tried to crawl a page and noticed it was blocked. Google also shows you which pages return 404 errors, which can be a good way to find internal broken links or external links pointing at pages that no longer exist.
Comments
That totally sucks! I'm going to be careful when making changes to that file. I always have to triple check when I'm doing anything there or in .htaccess.
Glad you caught that before it was more like a $50k mistake, cause that would reeeeeaallly make you mad.
On the other hand, I wouldn't mind having a single site that made $10k in such a short period of time.
You also need to be careful because Google's and Yahoo!'s support for wildcards is not identical. For example, different bots can handle ? in different ways.
You also need to take into account that Google, at least, has a robots.txt length limitation (around 5,000 bytes).
Shit happens, but the most important thing is that now you've got one more lesson! This is priceless!
Regards,
William
Wow, sorry to hear that.
Thanks for putting out such a good description of what you can (and should) do to filter out the less valuable pages of a website.
Hi,
Maybe I'm wrong, but I think the 5,000 character limitation applies only to the robots.txt validation tool. The White House's robots.txt, for instance, has more characters than that.
and thanks for your blog!
I did the same on my blog and didn't have any problem... Of course, there's no such page giving me that $10k...
I know this is tricky. That's why we always have to check Google Webmaster Tools to see if the crawler has been restricted from any pages that we wanted indexed. Sometimes if you put a restriction like
Disallow: /search
just to tell the bot not to crawl your search pages, it will also restrict any page that has "search" at the beginning of its URL, such as http://www.yoursite.com/search-domain-name.html.
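One way to avoid that kind of collision (a sketch, assuming the search results actually live under a /search/ directory) is to make the pattern more specific:

User-agent: *
# Blocks /search/anything but not /search-domain-name.html
Disallow: /search/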
I wrote a similar article a few days ago, in case you guys are interested.
http://blogsandbucks.com/use-robotstxt-file-correctly-for-your-blog/
If you are disallowing a directory, wouldn't it be smarter to do this:
Disallow: /*directory/
That wouldn't affect pages named directory-something.html, would it?
Really good!! Good source of information.
Yeah, another reason why I keep a very simple robots.txt and never change it!
Well, I can see the funny side of that. Have you noticed that Shoemoney has ditched your "improvements" on his robots.txt?
or chalk it up to costing you $10k for a good topic to blog about. :)
Once you have a robots.txt directive to block certain pages - in your experience - how long does it take for that to be noticed and effective?
I've had pages disallowed for months that are still in the SERPS and I'm wondering what that means.
Wow! Great info to know.
All I can say is ouch! One page earning $10k...
Two questions:
How much do you earn from this website alone?
And which page is the one that earns you $10k?
Funny you mention that. I recently did something very similar and was kicking myself for being such an idiot. That will teach me to edit code at 3AM :-)
Is the robots.txt file's definition and purpose covered in your ebook, Aaron? If so, I must be blind x_x
Hi Yi Lu
I briefly mention robots.txt in my ebook, but I don't go too deep into using it aggressively because it is so easy to mess up (as I accidentally did above).
I can't disclose the specific earnings of that site. Keep in mind that I never said the one page made $10K...just that it had lots of link equity. That link equity helped power the crawl depth of the site and helped other pages on the same site rank better.
It depends on the crawl priority of that site and that page in question, as well as where they are in their crawl cycle when you do it.
I think I made this error about 3 weeks ago and Google started reacting to it about a week ago.
As far as how long it will take to correct goes, that depends on the same factors mentioned above, plus how long it takes Google to discover and trust the link equity pointing at the rest of the site, and reassign those pages to the primary index rather than the supplemental index.
This is really "Ouch"! Thank you for the great tips btw.
I just put up a file named robots.txt without putting any code in it... This is a very helpful tip. Thanks a lot.
Aaron:
You should consider a reinclusion request for that page via webmaster central. I heard Matt Cutts speak the other day and he mentioned that Google recrawls the robots.txt file only every couple hundred visits to a site.
Jonah
I don't understand why the robots.txt error described in this post would cause a lot of the pages on the site to go supplemental.
Are you saying that the page was so important that when you accidentally blocked it from being crawled, the internal links from that page caused other pages to go supplemental? Is that the rationale?
Ouch Ouch $$...
Aaron learned that at a cost of $10,000, and he is gracious enough to let us know: watch it.....
Thanks Aaron
Vijay
Hi Philip
Yes...much of the site's link equity went into that one page. And when it went away, so did much of the site's link equity, thus many of the pages went supplemental. It is quite a large site too, so that link equity was important.
I forgot to take out a noindex header tag. It took about two months to go back to normal traffic levels. I filed a reinclusion request too, as a precaution (I haven't gotten a response).
Google, Yahoo!, and MSN all support the end-of-string character ($), so for those engines you could use:
Disallow: /*page$
And that would only match example.com/my-page and not example.com/my-pagerank or example.com/my-page/
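Applied to the wildcard from the post above, a sketch (assuming the unwanted URLs really do end in "page") might be:

User-agent: Googlebot
# Matches a hypothetical /results-page, but not /beauty-pageants.php or /my-page/
Disallow: /*page$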
There are extensive Drupal-specific robots.txt examples on Drupalzilla.com...
Still as important today as it was a few years ago?
If you use it incorrectly or have an error in it, then yes, it can still be super harmful :)
Unless you have a friend at the Big "G", everything seems to react slowly when it comes to subtle changes. It can take five minutes or five months to get a link indexed, and then it can be gone in just a second. If you depend on organic traffic for your livelihood, these can be some severe learning curves.
This is a great post. I just wish I would have read it 6 months ago.