Is the Huffington Post Google's Favorite Content Farm?
I was looking for information about the nuclear reactor issue in Japan and am glad it did not turn out as bad as it first looked!
But in the process of searching for information I kept stumbling into hollow, garbage websites. I was careful not to click on the malware results, but of the mainstream sites covering the issue, one of the most flagrant efforts was from the Huffington Post.
AOL recently announced that they were firing 15% to 20% of their staff. No need for original stories or even staff writers when you can literally grab a third-party tweet, wrap it in your site design, and rank it in Google. In line with that spirit, I took a screenshot. Rather than calling it the Huffington Post, I decided a more fitting title would be plundering host. :D
We were told that the content farm update was meant to get rid of low quality web pages & yet that information-less page was ranking at the top of Google's search results, when it was nothing but a 3rd party tweet wrapped in branding and ads.
How does Huffington Post get away with that?
You can imagine in a hyperspace a bunch of points, some points are red, some points are green, and in others there's some mixture. Your job is to find a plane which says that most things on this side of the plane are red, and most of the things on that side of the plane are the opposite of red. - Google's Amit Singhal
If you make it past Google's arbitrary line in the sand there is no limit to how much spamming and jamming you can do.
we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side. - Matt Cutts
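For readers curious what that kind of classifier actually looks like, here is a minimal sketch in the spirit of those two quotes: a toy perceptron that finds a separating plane between "quality" and "low quality" pages. The two features (original-content ratio, ads per word) and the synthetic data are purely hypothetical illustrations, not anything Google has disclosed.

```python
# Toy sketch of a linear "quality classifier" in the spirit of the
# Singhal/Cutts quotes above. Feature names are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)

# Synthetic training pages: columns = [original_content_ratio, ads_per_word]
quality = rng.normal([0.9, 0.02], 0.05, size=(50, 2))
farms = rng.normal([0.2, 0.15], 0.05, size=(50, 2))

X = np.vstack([quality, farms])
y = np.array([1] * 50 + [-1] * 50)  # +1 = quality, -1 = content farm

# Perceptron: nudge the plane (w, b) toward each misclassified point.
w, b = np.zeros(2), 0.0
for _ in range(100):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:  # wrong side of the plane: update
            w += yi * xi
            b += yi

def side_of_plane(features):
    """Everything on one side of the plane is 'quality'; the other is not."""
    return "quality" if w @ features + b > 0 else "low quality"

print(side_of_plane(np.array([0.85, 0.03])))  # expect: quality
print(side_of_plane(np.array([0.25, 0.20])))  # expect: low quality
```

Wherever that plane lands, the sites that end up on the right side of it are free to behave however they like, which is exactly the problem.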
(G)arbitrage never really goes away, it just becomes more corporate.
The problem with Google arbitrarily picking winners and losers is the winners will mass produce doorway pages. With much of the competition (including many of the original content creators) removed from the search results, this sort of activity is simply printing money.
As bad as that sounds, it is actually even worse than that. Today Google Alerts showed our brand being mentioned on a group-piracy website built around a subscription model of selling 3rd party content without permission! As annoying as that feels, of course there are going to be some dirtbags along the way that you have to deal with from time to time. But now that the content farm update has gone through, some of the original content producers are no longer ranking for their own titles, whereas the piracy sites that stole their content are now the canonical top-ranked sources!
Google never used to put piracy sites on the first page of results for my books; this is a new feature on their part, and I think it goes a long way to show that their problem is cultural rather than technical. Google seems to have reached the conclusion that since many of their users are looking for pirated eBooks, quality search results means providing them with the best directory of copyright infringements available. And since Google streamlined their DMCA process with online forms, I couldn't discover a method of telling them to remove a result like this from their search results, though I tried anyway.
... I feel like the guy who was walking across the street when Google dropped a 1000 pound bomb to take out a cockroach - Morris Rosenthal
Way to go Google! +1 +1
Too clever by half.
Comments
I love it - "(G)arbitrage never really goes away, it just becomes more corporate." Stated perfectly. (and the fight continues for equality amongst us all in the web world....)
I ran into this same problem searching for an article titled "White House veto threat on home refinance bill." When you Google the title, the original article doesn't even come up on the first page; in fact it comes up at number 15. And the original article clearly states that "This material may not be published, broadcast, rewritten or redistributed," yet every news source is redistributing it and ranking higher than the original.
Aside from ironically raising the results of some very spammy sites, it seems like Google has found a way to arbitrarily grant "authority" to big media sites. The return of old media elitism is coming to the web.
How long before every result on the first page is either a "whitelisted authority" that is basically free to do what they please or a Google-owned property?
I'd say not very long at all...and then your only option is AdWords. Oh, but they'll suspend your account without remorse if you're not spending 7 figures, which might enable you to speak to a real person.
Google doesn't want to own the web- Google wants to BE the web.
So let me get this straight... Content is no longer king. Other people's content on someone else's site IS king - but only if you have Google's blessing.
So where does that leave the rest of us?
But one has to be more and more creative with both cost & strategy in order to remain profitable.
I don't want to do it, because MS is anathema, but if I have to...
Hey guys! Google has already taken enough space on the Internet. Why would we want it to be in charge of legislating copyright infringement? Duplication is natural. It is often useful. For example, a company might distribute brochures with different contact info in different locations. Copyright infringement is bad, but this has nothing to do with search engines. The duplication occurred before they crawled the web. Let us deal with that separately. I discuss a related issue here: civm.ca/?store=queen .
Search engines often rank infringing content. Sometimes they also host it!
Further, it is the ad networks of search companies which make many of the infringing "business" models profitable.
To claim search engines are completely innocent (even while Google for years recommended keywords with cracks, serials, & keygens in their search suggest AND Google admitted publicly that they had 50,000 advertisers dealing in counterfeit goods) is at best laughable.
Google had to pay a half-billion dollar fine recently for illegal drug ads. That fine wasn't incurred because they were a friend/fan of intellectual property; they are not.
However, I am only considering the organic search results here. So, perhaps I should have changed my title and replaced "Google" with "Google's algo", the part that determines the organic search results. I maintain my point that this algo should not have to deal with copyright infringement. It should be a different algo, not under the control of Google - an algo that would use Google's database as only one of many ways to search the web. I also think that, as long as Google's algo (the part for which we do not directly pay) does its best to present the web as it is, we should not blame Google for this algo. Why? Because doing so would force Google to have an unnatural algo that does not focus on its job.
1.) That which is rewarded becomes more popular.
2.) They can't design algorithms while ignoring how people may react to them. If they tried to legitimize that approach/angle, then how could they also justify all the fear-mongering they have done about paid links?
3.) They *do* hard-code some of the "organic" search results to promote some of their own properties, which they monetize / are paid for.
You are right that if we reward the violators, it will encourage more violations. However, it is not only Google that rewards them. When we close our eyes and even participate by downloading copyrighted material, we are rewarding them even more than Google does. So, yes, Google, like anyone else, has a social responsibility, but I do not think that the ultimate solution lies in their algo. Besides, if we take the viewpoint that we cannot trust Google, that is just another reason not to rely on their algo. Paid links are another story because their purpose is to directly interfere with the algo. It's different. They had to adapt their algo.
It's beside the main point, but I had to say that paid links are not necessarily used to interfere with the algo. Now, with the "nofollow" attribute, they can play their legitimate role without interfering with the algo.
When an end user conducts a search & lands on a site like Mahalo.com (or another scraper site wrapped in Google AdSense ads) they probably don't realize that the original source of the content is being cut out of the value chain.
And when Google recommended warez, cracks, serials, etc. keywords *for years* they indeed were helping to create the problem.
Humans write their algos & tell other humans how to rate their algos based on some (often arbitrary) preconceived notions.
A site like Google Places is considered spam, except when it is Google's. A site like BeatThatQuote would get at least a 3-month penalty for link buying & be required to clean up the issues before re-appearing in the search results, except when it is Google's. Link brokers are spam, except the ones Google invests in. Paid links are spam, except when Google is buying them. etc etc etc
Many online paid links existed BEFORE Google did as a company. Indeed, even Google's main revenue engine of selling links was something they snagged from Overture (and had to pay a huge settlement over).
If "the algorithm" aimed to be a legitimate observer of the web then it wouldn't require engineers to spread FUD, it wouldn't look the other way when it was Google doing the spamming, and Google AdSense wouldn't have a "get rich quick" ad category.
Unfortunately anyone who tries to make a clean separation between Google the search engine & Google the ad network is either ignorant or needs their head examined. Even the Google founders spoke out against ad based search engines in their early research. My guess is either you work for Google and/or you haven't spent the years of reading research & understanding the ecosystem the way I have.
It is fine to have your own opinion, but it is nuts to claim Google deserves a free pass everywhere when they take a "1 strike and you're out" approach with a lot of their paying customers.
Right from the beginning, I said that I want to focus on the algo, not on Google. My view is that we must first crawl the web, then index it, then select what will be presented for each query. A natural question is where we manage copyright infringements in that picture. I maintain that we should add a separate algo that uses Google's index, but not only Google's index. This separate algo might also benefit from doing its own crawling. I also maintain, especially if we do not trust Google, that Google should not be responsible for that algo.
We would get the benefit that Google would have to focus on its main job, which is my main motivation. I don't want Google to have excuses to fight content duplication. Content duplication over the web, and even within a site, is good. It allows web sites to adapt content to many possible user requests. What is bad is content duplication in the search results. So, the simple approach is to make good use of the canonical tag and hope that Google will be flexible about the differences allowed among the different URLs. You see, I am just concerned that people confuse copyright infringements with content duplication. So, I am saying let us completely separate the way we manage copyright infringements.
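To make concrete what I mean by "good use of the canonical tag", here is a minimal sketch (standard-library Python, with hypothetical URLs) of reading the rel=canonical declaration that variant pages use to point search engines at a single preferred URL:

```python
# Minimal sketch: extract the rel=canonical target from a page's HTML,
# so you can verify that all variants of a page declare the same one.
# URLs below are hypothetical examples.
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

def canonical_of(html):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

# A location-specific variant of a brochure page pointing search
# engines at the one canonical URL.
variant = '<html><head><link rel="canonical" href="https://example.com/brochure"></head></html>'
print(canonical_of(variant))  # https://example.com/brochure
```

With every adapted variant declaring the same canonical, the duplicates can be collapsed under one URL without anyone's page having to be discarded.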
As far as whether Google is ethical or not, that is another story. My point would hold even if the search engines were controlled by governments - it has nothing to do with Google. Now, since you brought me into this territory, I had to think about it. It's unrelated to my original point, but I still think that copyright infringement is a social problem. I don't get your argument that the poor guy did not know what he was doing. We are told on every single DVD that we watch that it is against the law, and we nevertheless do it like crazy.
Their main job is monetization. And they are very good at it.
Google's Panda algorithm *did* torch many sites that had the sort of infrastructure you suggest.
The DMCA is one layer of separation. However it is expensive to have to use it every single time Google pays someone to steal your work. There are even YouTube video downloader sites monetized by...you guessed it...Google ads.
You are still missing the point.
Site A spends $2,000 to create a feature piece of content.
Site B steals the content from site A without attribution.
Google ranks site B above site A.
As an end user, how are you to know that Google is ranking the stolen version of the content above the original? Do you have any way of even knowing the original existed when there is no attribution & Google filters out the original as duplicate content?
Don't think that is a problem? Well then why did Google recently ask for help with this then?
What's worse is that most of the above problem is created directly by Google ads. An independent study a few years ago pegged Google ads as monetizing over 2/3 of the stolen content. And I am not just talking about movie torrents, but also textual articles and such.
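To be clear about the mechanics, here is a toy sketch of one common way near-duplicates get grouped (word shingling plus Jaccard similarity). This is an illustration of the general technique, not Google's actual filter; note that the score is symmetric, so nothing in it says which copy came first.

```python
# Toy sketch of shingle-based near-duplicate detection. The similarity
# score is symmetric: it can group a copy with its original, but it
# cannot tell you which one is the original.
def shingles(text, k=4):
    """Set of k-word shingles from the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap between two shingle sets: 1.0 means identical."""
    return len(a & b) / len(a | b)

original = "site a spends two thousand dollars to create a feature piece of content"
scraped = "site a spends two thousand dollars to create a feature piece of content today"

sim = jaccard(shingles(original), shingles(scraped))
print(f"similarity: {sim:.2f}")  # high score, so one page gets filtered
```

Without an attribution signal, a filter like this can only pick one copy to keep; whether it keeps the original is anyone's guess.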
Perhaps to see where I am coming from, you should read this web page about duplication of content created by local contact information vs. cloaking (civm.ca/?store=queen). In fact, I would not be surprised if you know the answer to the question that this web page raises.
...you are focused on how a company might legitimately duplicate their own content across their properties.
My focus (with the above article) was how Google was/is funding wholesale content theft.
If Google was ranking some version of ABC for ABC's content that is fine. However in the above example there is ABC content posted through Twitter, syndicated to HuffingtonPost. ABC is funding the content, Twitter is the platform, and Huffington Post is getting paid to suck it down whole without adding any value at all. That is my point.
Yes, it was a difference of focus, but not because of the scale. The example of useful duplicate content is for one business, but my point holds also among a group of companies and even across the entire Internet. Duplicate content is useful in all these cases. Your focus was on the cases of copyright infringement and similar cases. You want to solve the problem at its source, which you argue is Google. My focus is that, even if Google is the source of the problem, which I think is exaggerated because there are many other factors, I still want Google's algo to accept duplicate content by default, i.e., it should accept duplicate content unless there is strong evidence of copyright infringement or similar issues. I am very concerned that this war against copyright infringement and its look-alikes becomes a war against duplicate content, because that would be very detrimental to the Internet. It would be a big mistake, a misunderstanding. Duplicate content is natural and required to adapt content to visitors. BTW, why do you remove the link? It is naturally connected to the discussion here.
...the Panda update & existing duplicate content filters have already waged some portions of that war. And in many cases smaller webmasters were casualties.
Google isn't waging a war against copyright infringement (rather they actively invest in undermining copyright). The war they wage is against the independent webmaster.
Why do I keep removing the link? Because I think your point of the off-topic out-of-context comment was to get a link. Why wouldn't I remove that sort of link? :D
We said "different focus", but while I was explaining our different "focus", it became clearer that actually my view, which stands by itself in a very fundamental manner is simply challenging your view. You won't have me agree with you that Google is such and such because it's a very complex situation to analyse. You presents the facts that you see through your tinted glasses. It is not enough for me to take a position and, obviously, I am not alone. It is OK that you have taken your position, but I am disappointed that you expect others to follow you. People won't follow you so quickly.
My view, on the other hand, is more fundamental, less dependent on facts, more dependent on common sense and logic. In one sentence, this view is that duplicate content at every scale is necessary to adapt content to visitors; it is very natural, and we must not transform a war against copyright infringement, or against Google, into a war against duplicate content. It's clearly not off-topic. It is a very fundamental part of the topic. My best bet is that you simply did not take the time to sit and think about the importance of duplicate content in a world where there are so many differences, sometimes subtle differences, in the expectations or needs of visitors. There is no way to provide an optimal experience to visitors without many variations, sometimes small variations, on a single piece of content. Sometimes merely presenting the same piece of content in a different context is useful to visitors. It is obvious that there is no way this can be restricted within individual companies. That would be against natural competition - not everything is under copyright and prevented from being improved upon, adapted, or even just presented in a different context. I am totally for protection against copyright infringement, but duplicate content should be accepted by default.
Just to be clear, by "duplicate content" I do not mean the basic notion where the exact same page in the same context has multiple URLs. That is obviously bad. However, the notion of duplicate content has been extended by search engines to include variations on a given piece of content. I would call it "adapted content", not "duplicate content". Adapted content is natural and necessary. So, you conduct your war against Google, and I know that you would like others to join you, but I am not interested. I am only asking whether you would make sure that this loss of freedom around adapted content does not become a casualty of that war. I am interested in a war against copyright infringement in general, not specifically against Google, but I think it is important that before we conduct that war, we sit and think about the importance of adapted content. The most natural and simple approach is to accept adapted content by default while we conduct that war separately, with full access to the material gathered by search engines.
...we don't have a say in how Google chooses to reflect their current business interests in their relevancy algorithms. We can highlight what we believe in and that is all we can do.
What you're missing is that language is used and abused to fit business needs.
Content x is great user experience on one site & the same content is duplicate spam or a content farm on another. We as end users don't really get to write the algorithms or make that decision. We can only highlight when it goes astray, as I did above.
OK, I stepped back a little and looked more carefully at your examples. I understand now that you are essentially against suppression of content. The confusion came up when you expressed your view in the context of copyright infringements. When you say that Google encourages copyright infringement, you mean that the content that was copied is often the version suppressed, and this makes things worse. You are not at all saying that Google should control and suppress duplicate (adapted) content even more.
Controlling adapted (duplicate) content is an attack on freedom of expression on the Internet. We cannot tell in advance whether a piece of content is an adaptation or totally original. Therefore, if we do not accept adapted content by default, we are controlling the Internet. This is not acceptable. We must control adapted content, but only after the content appears in search engines, not before, just as in any society that protects freedom of speech. Most of the cases that you presented are examples of suppression of content, so they actually support this point. In its war against duplicate or adapted content, Google discards URLs or omits them in search results. The discarded pages are not even indexed, whereas the omitted pages are indexed but omitted from the search results of a given query: you know the "In order to show you the most relevant results, we have omitted some entries very similar..." message.
There are also merged pages (i.e., pages collapsed under a single URL), but these are perfectly fine - actually they are great - as long as it does not occur across domains without the canonical tag, and this is what Google already does.
So, finally, our positions might not be so opposed. We should encourage Google never to discard duplicate (adapted) content and to always accept the canonical tag when there are no obvious flaws. This does not take care of the omitted results, but if you think that Google abuses them, then you do not want to make things worse by asking Google to be "more vigilant." Just like me, you should expect that it will be managed separately, which is my original point.
I expect Google's approach to be self-serving and maximally profitable to Google...based on their past & recent behavior.
I meant "expect" to mean "desire" or "want", not "suspect" or "foresee". LOL.
You say a lot about what you expect (foresee) of Google, which is nothing good, but not much about what you want from them - meaning, how would you do it yourself if you had the chance? Anyway, you made me laugh. I learned a lot in this overall thread because it made me think about how I would try to do it better. It is not obvious. Every tool that you provide to fight that corruption is likely to be used to create even more imbalance. The solution is not to balance Google's monopoly with another big search engine. It will not work. We need to intelligently discuss the algo, i.e., how we would do it. Only then will there be a chance that it gets implemented. It does not matter by whom.
In my concern that Google will suppress even more duplicate (adapted or similar) content, I forgot about those entries that are already omitted by the current algo. Just like you said, Google is already making a choice, which I believe should be made in a more open manner, so that copyright infringements could be dealt with. I was only concerned that any complaint could be misinterpreted as a suggestion that Google should be more vigilant when accepting duplicate (adapted or similar) content. We want the opposite, and also that any selection, if it has to be done, should be done in a more open manner. This is totally justified because Google should not have the authority to deal with copyright infringements, which is my basic point. Copyright infringements should be managed by neutral entities that have been given the authority to do so.