Ignore SEO, GoogleBot Will Sort it All Out for You
How to Handle Duplicate Content
Here is a fun webmaster help video from March 10th of 2010, answering the following question:
"If Google crawls 1,000 pages/day, Googlebot crawling many dupe content pages may slow down indexing of a large site. In that scenario, do you recommend blocking dupes using robots.txt or is using META ROBOTS NOINDEX,NOFOLLOW a better alternative?"
The answer kinda jumps around a bit, but here is a quote:
I believe if you were to talk to our crawl and index team, they would normally say "look, let us crawl all the content, we'll figure out what parts of the site are dupes (so which sub-trees are dupes) and we'll combine that together."
Whereas if you block something with robots.txt we can't ever crawl it, so we can't ever see that it's a dupe. And then you can have the full page coming up, and then sometimes you'll see these uncrawled URLs where we saw the URL but we weren't able to crawl them and see that it's a dupe.
...
I would really try to let Google crawl the pages & see if we can figure out the dupes on our own.
Trust in GoogleBot
The key point here is that before you consider discarding any of your waste you should give GoogleBot a chance to see if they can just figure it out on their end. Then, without updating said advice, Google rolled out the Panda update & torched tens of thousands of webmasters for following what was, up to then, a Google best practice. Only after months of significant pain did Google formally suggest on their blog that you should block them from indexing such low-value pages.
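To make the tradeoff in Matt's answer concrete, here is roughly what the two blocking mechanisms from the original question look like. The paths are hypothetical examples, not recommendations for any particular site. A robots.txt disallow prevents crawling entirely, which is exactly why Google may still show the URL as an uncrawled reference if other pages link to it:

```
# robots.txt at the site root -- blocks crawling, but a blocked URL
# can still appear in the index if other pages link to it
User-agent: *
Disallow: /search-results/
Disallow: /print/
```

The meta robots approach requires the page to be crawled, but once Google sees the tag the page is dropped from the index (the question used noindex,nofollow; noindex,follow is a common variant that still lets link equity flow through the page):

```html
<!-- placed in the <head> of each low-value page -->
<meta name="robots" content="noindex,follow">
```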
Matt's video also suggested some of the other workaround options webmasters could use (like re-architecting their site or using parameter handling in Webmaster Tools), but made it sound as though Google getting it right by default was the norm rather than the exception. What such advice didn't take into account was the future.
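Alongside those options, the canonical link element (which Google had announced in 2009) was the commonly cited middle ground: rather than blocking duplicates, you tell Google which version to consolidate signals onto. The URL below is a hypothetical example:

```html
<!-- in the <head> of each duplicate or parameterized version,
     pointing at the one preferred URL -->
<link rel="canonical" href="http://www.example.com/widgets/blue-widget/">
```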
What Does a Search Engineer Do?
The problem with Google is that no matter what they trust, it gets abused. Which is why they keep trying to fold more signals into search & why they are willing to make drastic changes that often seem both arbitrary & unjust.
Search engineers are highly skilled at public relations. A big part of what search engineers do is manage the market through FUD. If you can get someone else to do your work for you for free, that is far more profitable than trying to sort everything out on your end.
Search engineers are great at writing code. A lot of what search engineers do is reactionary: some abuses get so out of control and so obvious that FUD won't work, so they need to stomp on them with new algorithms. Most search engine signals are created by tracking people, so the algorithms usually follow people. Even when it seems like the engineers are trying to change the game drastically, much of the underlying data still comes from following people.
What to Do as an SEO?
The ignorant SEO waits until they are told by Google to do something & starts following "best practices" after most of the potential profits have been commoditized, both by algorithmic changes & a market that has become less receptive to a marketing approach which has since lost its novelty.
The *really* ignorant SEO only listens to official Google advice & trusts some of the older advice even after it has become both stale & inaccurate. As recently as 2 years ago I saw a published author in the SEO space handing out a tip on Twitter to use the Google toolbar as your primary backlink checking tool. Sad!
The search guidelines are very much a living, breathing document. If search engines are to remain relevant they must change with the web. Those blazing new paths & changing the landscape of internet marketing often operate in ways that are not yet commonplace & thus not yet covered by guidelines based on last year's ecosystem. Individual campaigns often fail because they try something new or different; for any single marketing campaign the expected outcome is failure. However, such marketers generally win the war, while those who follow behind remain in their footprints (unless they operate in less competitive markets).
The savvy SEO is a trail blazer who is pushing & probing to test some of the boundaries. They are equally a person who watches the evolution of the web through the lens of history, attempting to predict where search may lead. If you can predict where search is going you are not as likely to get caught with your pants down as the person who waits around for Google telling them what to do next. It may still happen in some cases, but it is less common & you are more likely to be able to adjust quickly if you are looking at the web through Google's perspective (rather than through the perspective they suggest you use).
Google's Noble Respect for Copyright
Google has a history of challenging the law & building a business through wildcatting in a gray hat/black hat manner.
- They repeatedly broke the law with their ebook scanning project. Their ebook store is already open in spite of a judge requiring them to rework their agreements.
- They bought Youtube, a den of video piracy & then spent $100 million on legal bills after the fact. When they were competing with Youtube they suggested that they could force copyright holders to pay Google for lost ad revenues if they didn't give Google access to the premium content. :D
- They sold ads against trademarks where it was generally viewed as illegal and awaited the court's decisions after the fact.
- They tried doing an illegal search tie-up with Yahoo & only withdrew after they were warned that it would be challenged. They later slid through a similar deal with Yahoo Japan that was approved.
- They "accidentally" collected personally identifiable information while getting router information & scanning streets (and we later learn via internal emails in court documents how important some of this "accidental" data collection was to them).
- They pushed Buzz onto Gmail users and paid the fine.
- Google torched UK finance comparison sites for buying links. Then Google bought one of the few they didn't torch (in spite of its spammy links). After getting flamed on an SEO blog they penalized that site, but then it was ranking again 2 weeks later *without* cleaning up any of the spammy links.
- When the Panda update torched one of your sites Google AdSense was probably already paying someone else to steal it & outrank you. Google itself scrapes user reviews & then replaces the original source with Google Places pages. The only way to opt out of that Google scrape is to opt out of Google search traffic.
- Google promotes open in others, but then with their own products it is all or nothing bundling: "we are using compatibility as a club to make them do things we want." - Google's Dan Morrill
- For years Google recommended warez, keygens, and serials to searchers, all while building up a stable of over 50,000 advertisers peddling counterfeit goods. That only stopped when the US government applied pressure, and then Google painted themselves as the good guys for fighting piracy.
- Google is reportedly about to launch their music service, once again without permission of the copyright holders they are abusing.
Those are just some examples of how Google has interpreted "the guidelines" of modern society.
Google doesn't wait for permission.
What are you doing right now?
Are you sitting around hoping that GoogleBot sorts everything out?
If so, grab a newspaper & pull out the "help wanted" section. You're going to need it!
If you want to win in Google's ecosystem you must behave like Google does, rather than behaving how they claim to & tell you to.
Comments
Aaron, these guys -- the crawl team, the AdSense team, Amit (the mind behind search) -- should sit together and make up their minds as to what they want done.
One Google guy says you should noindex dup/low-quality pages, another guy says don't write on the same topic, but for the love of god I can write 1000s of articles on the topic of "chess positions, patterns, endgames, tablebases, strategies, tactics, psychology" and they would still be useful to my readers.
This recent Google BS is driving me insane; none of them can make up their mind or spend some time together to settle on something so everyone knows where to go.
If you noindex then the crawl team will not know if you have dup content; if you let dup content get indexed then Amit gets mad about quality; if you don't get pages indexed then the crawler can't find them and Matt's team can't build a better picture of your site regarding good/bad/shallow content.
What on earth is going on? None of them are in SYNC.
I would leave everything to Google and its bot. Also, I'd try to eliminate any dupe pages from my website -- no point in letting GoogleBot find dupe pages on its own only to learn that it ranks those pages lower.
Also, as pointed out by "domain linkz," the people in Google's departments have been gaming our minds: Google says it needs exceptional content -- people have been writing exceptional content over the years, mind you -- while on the other hand, sites that have been examples of awesome content, Askthebuilder and many more, have been hit by the Panda. And, as we all know, the Panda favors scraper sites! These Google guys are presenting nothing but shallow facts: write this and write that, give your credit information questions and deep analysis stuff. What about the blogs? Not everyone has to be an expert to write content; some people write because they are passionate about the topic.
There is no clear picture on how to tackle Panda. I'd leave it to Google to recover from its failure.
Wow, that's rich.
Letting Google determine who owns content and what's duplicate content is fundamentally flawed because Google couldn't care less.
When the AdWords quality score went in a few years back, one of my sites/keywords got slapped badly. I pointed out that the same site was currently ranking #1 for the same keyword (and it still is to this day). I was told that AdWords does not use organic results to determine quality.
So now look at this guy: http://www.google.com/support/forum/p/AdWords/thread?tid=0bb4bb671eff705...
He is banned from advertising because the Google drones are claiming that his site is low-quality duplicate content because scrapers have stolen his stuff. Oh, I see -- so now they're using organic results to determine that?
Of course, it's no problem if you're willing to make a report or submit a DMCA to Google...they'll "look into it" in about 2-3 months if you're lucky.
The standard is WHATEVER Google says it is at that precise moment in time. Cognitive dissonance is probably one of the personality traits encouraged on Google's employment apps.
that deserves a follow up! :D
You should never let Google decide for you. At least that's been my mantra for a long time.
Cheers Aaron for writing this article. I have spent the last few weeks looking into this to put together clear advice for clients and for dealing with dup content on large websites. I am at the point of just blocking Google, with a variety of methods, from anything I think they won't like. Better safe than sorry.
Very well put! I've been preaching about duplicate content for years now (and even developed software to find it). At the same time I've had to listen to one guru after another telling me that Google will and can handle DC. OK - they handle it.... By throwing away sites ;-)
If you're not skating to where the puck is going to be, you will end up where it has already been.
And that ultimately is why lateral thinking, creativity, and strategy are so under-valued...people want to be told to "go here" and "do this" because of measurement x, but a lot of that stuff from tools is based on consensus, and thus ends up being backwards-looking.
We blog about Google on a daily basis at googlemonopoly.eu
On Monday, May 16, 2011, we are publishing a piece about duplicate content that just might break Google once and for all. It's going to be earth-shattering like some of our other longer feature pieces. It's all factual, and we are working on a proof of concept to show a formula in the public view that Google just might have to do something about.
I have a lot of respect for Aaron Wall and others strong enough to speak out against Google. Google has a corporate belief system that involves utter disregard of patents, copyrights and individual property rights.
As a publisher, I've recently been subjected to the broken DMCA takedown process. It's a time sink and nearly impossible financially to deal with the 1000's of these one might have to file for a real site.
In typical Google fashion, DMCAs are handled via a broken form with little to no usability. No admin panel to track them, no followup, nothing. I wouldn't be surprised if they are just piping the requests to /dev/null. I expect more from a company that employs 26k+ people and allegedly the best and brightest.