Google Duplicate Content Filter
Captain Caveman posts on Google's duplicate content filters.
Interesting tactic by Google. If too many pages on the same site trip a duplicate content filter, Google does not just filter through to find the best result; sometimes they filter out ALL the pages from that site.
This creates an added opportunity cost to creating keyword driftnets & deep databases of near-identical, useless information. One page left in the results = no big deal. Zero pages = big deal.
Not only would this type of filter whack junk empty directories, thematic screen-scraper sites, and cookie-cutter affiliate sites, but it could also hit regular merchant sites that have little unique information on each page.
On commercial searches many merchants will be left in the cold & the SERPs will be heavily biased toward unique content & information-dense websites.
If your site gets filtered there is always AdWords. And if there are few commercial sites in the organic results, then the AdWords CTR goes up. Everyone is happy, except the commercial webmaster sitting in the cold.
Yet another example of Google trying to nullify SEO techniques that work amazingly well in its competitors' results. I wonder what percent of SEOs are making different sites targeted at different engines' algorithms.
I have to be somewhat careful watching some of these types of duplicate content filters, because I have a mini sales letter on many pages of this site, and this site could get whacked by one of these algorithms. If it does, changes will occur. Perhaps using PHP to render the text as an image, or some other similar technique.
Comments
>Perhaps using PHP to render text as an image
Scott at web-professor.net has a nice little script for this:
http://web-professor.net/wp/2005/09/17/theres-no-duplicate-content-here/
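For context, scripts like that generally draw the repeated text onto an image with PHP's GD extension, so the words never appear in the page's indexable HTML. A minimal sketch of the idea (not Scott's actual script; the text and dimensions are placeholders):

<?php
// minimal sketch: render a repeated text snippet as a PNG so it is not
// part of the indexable HTML (placeholder text and sizing)
$text = 'Your mini sales letter text here';
$img  = imagecreatetruecolor(400, 30);
$bg   = imagecolorallocate($img, 255, 255, 255); // white background
$fg   = imagecolorallocate($img, 0, 0, 0);       // black text
imagefilledrectangle($img, 0, 0, 399, 29, $bg);
imagestring($img, 4, 5, 8, $text, $fg);          // built-in GD font size 4

header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);
?>

Each page would then reference that script with a normal img tag instead of repeating the text itself.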
Dupe content is definitely going to be a big hot-button issue. This certainly is one way to deal with it, but I imagine there will be a somewhat substantial outcry over the collateral damage involved. I guess G doesn't have to listen, and this is a potentially viable strategy, but I somehow doubt it will be the end of dupe content problems just yet.
Now that Yahoo! has started fighting back, I think more and more people are going to start returning to them, because people are starting to think that Google are getting too big for their boots.
Maybe some people are just jealous of their success, but when a company starts manipulating results like this, people get quite annoyed.
I know something has to be done about scraper sites and artificially boosting page rank with link directories, but it is getting ridiculous.
I guess Google get a lot of criticism, because they are so well-known, but it is a thankless task.
What else can they do?
Though, I would like to know how this is going to affect template-based sites, which obviously duplicate content.
The question is HOW Google is determining what is duplicate. Is it actually making a hash of the OVERALL page and comparing it to all the other pages on the site?
Or is it breaking the page down into sectors, then comparing the hash of individual sectors of the page with other pages? Sort of like the way good anti-spam filters get around the randomization spammers use within spam emails.
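Purely as an illustration of the difference between the two ideas (not how Google actually does it), in PHP:

<?php
// one fingerprint for the whole page...
$pageHtml = file_get_contents('page.html'); // any page's HTML (hypothetical file)
$pageHash = md5($pageHtml);

// ...versus a fingerprint per "sector" (here naively split on paragraph tags)
$sectorHashes = array();
foreach (preg_split('/<\/p>/i', $pageHtml) as $sector) {
    $sector = trim(strip_tags($sector));
    if ($sector != '') {
        $sectorHashes[] = md5($sector);
    }
}
// two pages sharing many sector hashes look like near-duplicates
// even when their whole-page hashes do not match
?>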
I think that's a bad idea from Google, because there ARE articles that make sense and are published on many different sites... why not?
Google will filter many "good" pages also. My 2 cents: they will remove this "technique" or control it with human review. At least I hope so, and I'm not using this technique at all. In fact I'm not using any SEO technique at the moment :D
Hey Aaron
I'm sure this is not news to you, but for your own concerns on this site, you can just use an iFrame for your repeated content:
I tried to write the code snippet here but it won't show up.
Thanks,
Rich
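For reference, the kind of snippet being described is roughly this (a generic sketch, not Rich's actual code; the file name and sizing are made up):

<?php
// the repeated mini sales letter lives in its own file and is pulled into
// each page with an iframe, so the boilerplate text is not part of each
// page's own indexable HTML
?>
<iframe src="/mini-salesletter.html" width="400" height="300" frameborder="0"></iframe>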
What about URL parameters? For example, sometimes for tracking purposes, I'll put a referer variable in the URL - thus index.php and index.php?track=1 are technically duplicate pages, even though it's really just one page.
Will G count that as a duplicate?
Does anyone know how alike the pages have to be to be considered duplicate by Google?
Hi Jack
It is not an exact number really, and it is something they frequently work on.
Dr Garcia posted that duplicate content filters can use sliding windows (or shingles) to look at how similar certain chunks of a page are. They may then look for wildcard-replace matches across those shingles (or word sets) as they slide them across the page.
Plus some search engines may strip boilerplate site formatting out of the page.
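To make the shingle idea concrete, here is a rough sketch in PHP (illustrative only, not what any engine actually runs):

<?php
// slide a fixed-size word window across the page text and hash each window
function shingles($html, $size = 5) {
    $words = preg_split('/\s+/', strtolower(strip_tags($html)), -1, PREG_SPLIT_NO_EMPTY);
    $out = array();
    for ($i = 0; $i + $size <= count($words); $i++) {
        $out[md5(implode(' ', array_slice($words, $i, $size)))] = true;
    }
    return $out;
}

// compare two pages by how many shingles they share (Jaccard-style overlap)
$a = shingles(file_get_contents('page-one.html'));  // hypothetical files
$b = shingles(file_get_contents('page-two.html'));
$shared     = count(array_intersect_key($a, $b));
$union      = count($a + $b);
$similarity = $union ? $shared / $union : 0;  // closer to 1 = nearer duplicate
?>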
You can count on duplicate content filters getting more advanced, because computer cycles are getting cheaper and many spam generators use Markov chains to mix page text. Some content generators (not openly available on the market) are so sophisticated that most people couldn't tell the difference between real and fake content.
Any site that is primarily an empty product database would be shielding its long-term profitability if it also added some original useful content or interactive elements that inspire consumer generated media.
You probably do not want the same content getting indexed at multiple URLs like that.
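One common way to handle it, sketched in PHP (using the track parameter from the example above; the log call and domain are placeholders), is to record the value once and then 301 to the parameter-free URL:

<?php
// record the referral value, then redirect so only one URL version
// of the page can get indexed
if (isset($_GET['track'])) {
    error_log('referral source: ' . $_GET['track']); // or write to your own stats table
    header('Location: http://www.example.com/index.php', true, 301);
    exit;
}
?>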
Actually, that might be safe, because they may not include the latter, according to G:
"Don't use "&id=" as a parameter in your URLs, as we don't include these pages in our index."
http://www.google.com/intl/en/webmasters/guidelines.html
Technically you're not using the '&' operator or calling it 'id', but it's similar enough.
Hi Rob
I have seen many pages like the latter that were pure duplicate content and indexed.
I think that the likelihood of Google penalising you for duplicating content on your own site is pretty minimal. It will just pick whatever it feels is the most relevant page (or the first to be indexed in a tie-break), only list that, and discount the other pages. (Horse's mouth: http://googlewebmastercentral.blogspot.com/2006/12/deftly-dealing-with-d...)
What is more worrying is sites that appear to reap the credit for content that originated elsewhere. http://www.livingroom.org.au/photolog/ is an example of a site that seemingly reposts others' content, and Google loves it. Explain that!
>Will G count that as a duplicate?
Sometimes they may.
Does duplicate content on various domains negatively affect rankings? Such as an article that you have written and have posted on numerous different domains?
Newb question :)