Google a Web Bully? Hot Nacho Speaks Out Against Their Spam Double Standards

A while ago Chad Jones, from Hot Nacho, the site involved in the WordPress content spam fiasco, spoke out about what went wrong.

WordPress hosted about 4,000 content articles on expensive keyword topics. Matt Mullenweg hosted the content on WordPress.org and placed hidden links on the home page pointing at the articles.

WordPress, the popular blog software whose site used the hidden links, was back in the Google index quickly. Google is still punishing the owner of HotNacho to this day, as Chad states:

They seem to have taken punitive measures by looking up my other sites via WHOIS and punitively banning a bunch of my sites -- including my hobby freeware sites.

Sites I own (all of which Google has banned):
hotnacho.com
acme-web-hosting.info
avatarsoft.com
notepad.com
free-backup-software.net

Thoughts on his article:

  • I don't like his comparisons of his content vs. real spam, but his point that it is hard for human-compiled content to be profitable against automated systems is accurate on many fronts.

  • His claim that Google controls over 90% of web traffic, made right after he complains about others not doing any fact-finding, undermines his credibility.
  • He has some good ideas about content rating and the importance of user feedback, and he is right that using strong quality guidelines from the start is important.
  • I know many other friends who run the exact same business model, but do it profitably, successfully, and in Google's good graces because of how the content is formatted: wrap it in a blog and post a few articles a day to each channel.
  • While he was talking about how his keyword placement software could increase the ability of content to rank, I think it is a mistake to look at it purely from an algorithmic angle. The social structure of content matters.
  • It is far easier to build links into topical channels (such as blogs) than article banks.
  • He talks about creating a bunch of freeware and offering free support. With the mob justice of the web, doing good on one front does not offset your actions on others.
  • I think it is pretty shitty of Google to have banned all of his sites. I mean who does this help? Where is the relevancy?
  • And yet Google funds much of the garbage they purportedly hate. Google not only acts reactively, but is blatantly over-reactive when certain issues become public. I suppose they were trying to send a message to Chad Jones, but it was not one honestly focused on search relevancy. I wish I had seen this article sooner.
  • The fact that few people have mentioned the Hot Nacho article shows how biased blogs are toward grabbing the front end of a story and then prowling for the next one before adding any depth or further research. Sorta reminds me of the Nirvana cover of Plateau, although I admit I am just as guilty of it as the next blogger.

Risk vs Reward In Hiring a Cheap Link Monkey

Not only are the engines getting better at discriminating link quality, but when you outsource your link building to save money you often get automated junk which is sent WAY off target.

That presents three main problems:

  • potential bad publicity (few things suck as bad as Danny Sullivan highlighting one of your own link exchange requests as being bad, as you know that probably gets read by MANY search engineers)

  • frequently exchanging way-off-topic links makes your site less likely to be linkable from the quality resources on your topic (and, to a lesser extent, may cost you some of the quality links you already have)
  • If sites are willing to trade way off topic, odds are pretty good that much of their link popularity is bottom-of-the-barrel link spam. Thus, as you trade more and more off-topic links, a larger and larger percentage of your direct and indirect link popularity comes from link spam that is easy to detect algorithmically.

The net result is that a somewhat well-trusted and normal-looking link profile starts to look more and more abnormal. Eventually the bad publicity or the low-quality links may catch up with the site, and it risks either getting banned or being filtered out of the search results.

If you have a long-term website and are using techniques that increase your risk profile, and those techniques are easily accessible to and reproducible by your competitors at dirt cheap rates, it might be time to look for other techniques.

Some sites that practice industrial strength off-topic link spam might be ranking well in spite of (and not because of) some of the techniques they use.

[Update: just got this gem

Hi

My name is Ben, and I'm working with Search Engine Optimisation [URL removed].

I have found your site and believe it would be mutually beneficial for us to exchange links as our sites share the same subject matter. As you may already know, trading links with one another helps boost both of our search engine rankings.

As a result, I am sending this email to inform you about our site and to propose submitting our link to your web page located at; www.search-marketing.info

We would appreciate if you could add a link to our web site on this/your web page, using the following information:

Title Link: Search Engine Marketing
Description: Tailored Search engine marketing campaigns for your business. Leverage our online marketing & pay per click management experience & achive fast ROI.
URL: http://www.[site].com.au/search-engine-marketing.html

NOTE: We will upload your link on our site, when you have notified us our link is live and we can see it online.

Thank you for your time and your consideration.

Sincerely, Ben

Linkmaster
ben@[site].com.au

Can you imagine how shitty their SEO services are for their clients if they send shit like that out for their own site?

They know I am an SEO, and they:

  • are too lazy to grab my name from my site, even though it is on every page (not hard to automate that)

  • say I may know something about how links work (get a clue)
  • call my home page a links page (really stupid)
  • want me to deep link to a useless service page on their site
  • call their search engine marketing services tailored, when it is pretty obvious that they are not using sophisticated or useful techniques for their own site.]

Greg Boser: Blogger

Greg says Oh my God, I’ve Become a Blogger. A great thing for webmasters and search in general, IMHO.

Greg asks:

But now comes the hard part. How do you go about creating a blog about search marketing that is truly unique?

Anyone ever notice that the black hat SEO blogs typically have both higher content quality and more original content than the typical white hat SEO blogs? Apparently, Gordon Hotchkiss has yet to get the memo.

via Oilman

Regulating Search Conference @ Yale

The Information Society Project at Yale Law School is hosting "Regulating Search?: A Symposium on Search Engines, Law, and Public Policy," the first academic conference devoted to search engines and the law. "Regulating Search?" will take place on December 3, 2005 at Yale Law School in New Haven, CT.

Topics covered:

  • Panel 1: The Search Space
    This panel will review the wide range of what search engines do and their importance in the information ecosystem.

  • Panel 2: Search Engines and Public Regulation
    This panel will discuss the possibility of direct government regulation of search functionality.

  • Panel 3: Search Engines and Intellectual Property
    This panel will review past and present litigation involving search engines and claims framed in the legal doctrines of copyright, trademark, patent, and right of publicity.

  • Panel 4: Search Engines and Individual Rights
    This panel will look at the role of search engines in reshaping our experience of basic rights, and at the pressures the desire to protect those rights places on search.

Early bird registration fees (early registration ends on Nov. 15):

  • $35 for students

  • $75 for academic and nonprofit participants
  • $165 for corporate and law firm participants

Free Open Source Keyword Phrase List Generator

Probably the least exciting of the SEO / SEM tools I have offered so far, but my friend Mike recently created a keyword phrase list generator for me.

I made it open source, so if you like it feel free to link to it, mirror it, or improve it.
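
I have no idea exactly how Mike built it, but a phrase list generator is conceptually simple: take lists of prefixes, seed terms, and suffixes and emit every combination. A minimal sketch of the idea (all the word lists below are made-up examples, not from the actual tool):

    from itertools import product

    # Hypothetical word lists -- swap in your own modifiers and seed terms.
    prefixes = ["", "cheap ", "best "]
    seeds = ["web hosting", "seo services"]
    suffixes = ["", " reviews", " company"]

    def generate_phrases(prefixes, seeds, suffixes):
        """Yield every prefix + seed + suffix combination, trimmed of stray spaces."""
        for pre, seed, suf in product(prefixes, seeds, suffixes):
            yield f"{pre}{seed}{suf}".strip()

    for phrase in generate_phrases(prefixes, seeds, suffixes):
        print(phrase)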

Google May Sell Ads for Chicago Newspaper Company

UPDATE: Google Weighing Test Of Print Ads In Newspapers

Google Inc. (GOOG) is considering testing print advertisements in Chicago newspapers, in a sign that the Internet giant, to date seen primarily as a threat to traditional media, could also become an ally.

If Google could take the inefficiencies out of offline media they probably could end up making the papers more revenue in the short run. The long run is anyone's guess.

Google Sandbox Tip

What is the difference between how real news stories spread and how link spam or artificial link manipulation spreads?

Link Spam Detection Based on Mass Estimation

Not sure how many people believe TrustRank is in effect in the current Google algorithm, but I would be willing to bet it is. Recently another link quality research paper came out by the name of Link Spam Detection Based on Mass Estimation [PDF].

It was authored by Zoltan Gyongyi, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen.

The proposed method for determining spam mass works to detect spam, so it complements TrustRank nicely (TrustRank is primarily aimed at detecting quality pages and demoting spam).

The paper starts off by defining what spam mass is.

Spam Mass - an estimate of how much PageRank a page accumulates by being linked to from spam pages.
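
Paraphrasing the paper's definitions from memory, so treat the exact notation as my reconstruction: if p(x) is the ordinary PageRank of page x and p'(x) is its PageRank when the random jump is restricted to the known good core, then

    \tilde{M}(x) = p(x) - p'(x)                 % estimated absolute spam mass
    \tilde{m}(x) = \frac{p(x) - p'(x)}{p(x)}    % relative spam mass

The relative version is the one that matters in practice, as the review below gets into.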

I covered a bunch of the how-it-works-in-theory stuff in the extended area of this post, but the general take-home tips from the article are:

  • .edu and .gov love is the real deal, and then some

  • Don't be scared of getting a few spammy links (everyone has some).
  • TrustRank may down-weight the effects of some spammy links. Since most spammy links have a low authority score, they do not comprise a high percentage of your PageRank-weighted link popularity if you have some good quality links. A few bad inbound links are not going to push your site over the edge to where it is algorithmically tagged as spam unless you were already near the limit before picking them up.
  • If you can get a few well known trusted links you can get away with having a large number of spammy links.
  • These types of algorithms work on a relative basis. If you can get more traditional media coverage than the competition you can get away with having a bunch more junk links as well.
  • Following up on that last point, some sites may be doing well in spite of some of the things they are doing. If you aim to replicate the linkage profile of a competitor make sure you spend some time building up some serious quality linkage data before going after too many spammy or semi spammy links.
  • Human review is here to stay in search algorithms, and humans are only going to get more important. Inside workers, remote quality raters, and user feedback and tagging give search engines another layer to build upon beyond link analysis.
  • Only a few quality links are needed to rank in Google in many fields.
  • If you can get the right resources interested in linking your way (directly or indirectly), a quality on-topic high-PageRank .edu link can be worth some serious cash.
  • Sometimes the cheapest way to get those kinds of links will be creating causes or linkbait, which may be external to your main site.

On to the review...

  • To determine the effect of spam mass they compute PageRank twice: once normally, and then again with more weight on known trusted sites that would be deemed to have a low spam mass (see the sketch after this list).

  • Spammers either use a large number of low PageRank links, a few hard to get high PageRank links, or some combination of the two.
  • While the quality authoritative links to spam sites are more rare, they are often obtained through the following
    • blog / comment / forum / guestbook spam

    • honey pots (creating something useful to gather link popularity to send to spam)
    • buying recently expired domain names
  • if the majority of inlinks are from spam nodes it is assumed that the host is spam; otherwise it is labeled good. Rather than looking at the raw link count, this can further be biased by looking at the percentage of total PageRank which comes from spam nodes
  • to further determine the percentage of PageRank due to spam nodes you can also look at the link structure of indirect nodes and how they pass PageRank toward the end node
  • presuming to know in advance whether something is good or bad is not feasible, so it must be estimated from a subset of the index
  • for this to be practical search engines must have white lists and / or black lists to compare other nodes to; these can be automatically or manually compiled
  • it is easier to assemble a good core since it is fairly reliable and does not change as often as spam techniques and spam sites (Aaron speculation: perhaps this is part of the reason some uber spammy older sites are getting away with murder...having many links from the good core from back when links were easier to obtain)
  • since the small reviewed core will be a much smaller sample than the number of good pages on the web, you must also review a small uniform random sample of the web to determine the approximate percentage of the web that is spam, in order to normalize the estimated spam mass
  • due to sampling methods some nodes may have a negative spam mass, and are likely to be nodes that were either assumed to be good in advance or nodes which are linked closely and heavily to other good nodes
  • it was too hard to manually create a large human reviewed set, so
    • they placed all sites listed in a small directory they considered to be virtually void of spam in the good core (they chose not to disclose the URL...anyone want to guess which one it was?). This group consisted of 16,776 hosts.

    • .gov and .edu hosts (and a few international organizations) also got placed in the good core
    • those sources gave them 504,150 unique trusted hosts
  • of the 73.3 million hosts in their test set 91.1% have a PageRank less than 2 (less than double the minimum PageRank value)
  • only about 64,000 hosts had a PageRank 100 times the minimum or more
  • they selected an arbitrary minimum PageRank limit for reviewing the final results (since you are only concerned with the higher PageRank results that would appear atop search results)
    Of this group of 883,328 sites, they hand reviewed 892 hosts:

    • 564 (63.2%) were quality

    • 229 (25.7%) were spam
    • 54 (6.1%) uncertain (like beauty, spam is in the eye of the beholder)
    • 45 (5%) hosts down
  • ALL high spam mass anomalies on good sites were categorized into the following three groups
    • some Alibaba sites (Chinese sites were far from the core group),
    • Blogger.com.br (relatively isolated from core group),
    • .pl URLs (there were only 12 Polish educational institutions in the core group)
  • Calculating relative mass works better than absolute mass (which is only logical if you want the system to scale, so I don't know why they put absolute mass in the paper). Example of why absolute spam mass does not work:
    • Adobe had lowest absolute spam mass (Aaron speculation: those taking the time to create a PDF are probably more concerned with content quality than the average website)

    • Macromedia had third highest absolute spam mass (Aaron speculation: lots of adult and casino type sites have links to Flash)
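
To make the two-pass idea concrete, here is a minimal sketch (my own toy code, not from the paper): compute PageRank once with a uniform random jump and once with the jump restricted to a known good core, then treat the share of a page's PageRank that the trusted pass cannot explain as its relative spam mass. The graph, damping value, and trusted-node choice are all made-up illustration values.

    import numpy as np

    def pagerank(adj, teleport, damping=0.85, iters=50):
        """Power-iteration PageRank with a configurable teleportation vector."""
        n = adj.shape[0]
        out_deg = adj.sum(axis=1)
        p = np.full(n, 1.0 / n)
        for _ in range(iters):
            flow = np.zeros(n)
            for i in range(n):
                if out_deg[i] > 0:
                    flow += p[i] * adj[i] / out_deg[i]  # spread rank over outlinks
                else:
                    flow += p[i] * teleport  # dangling nodes leak to the jump vector
            p = damping * flow + (1 - damping) * teleport
        return p

    # Toy 4-node web; node 3 stands in for the trusted good core.
    adj = np.array([[0, 1, 1, 0],
                    [1, 0, 0, 1],
                    [0, 0, 0, 1],
                    [1, 1, 0, 0]], dtype=float)

    uniform = np.full(4, 0.25)           # pass 1: ordinary PageRank
    trusted = np.array([0, 0, 0, 1.0])   # pass 2: jump only to the good core

    p = pagerank(adj, uniform)
    p_core = pagerank(adj, trusted)

    # Relative spam mass: the fraction of PageRank not explained by trust flow.
    # Nodes close to the core can come out negative, as the paper notes.
    print((p - p_core) / p)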

[update: Orion also mentioned something useful about the paper on SEW forums.

"A number of recent publications propose link spam detection methods. For instance, Fetterly et al. [Fetterly et al., 2004] analyze the indegree and outdegree distributions of web pages. Most web pages have in- and outdegrees that follow a power-law distribution. Occasionally, however, 17 search engines encounter substantially more pages with the exact same in- or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages. Similarly, Benczúr et al. [Benczúr et al., 2005] verify for each page x whether the distribution of PageRank scores of pages pointing to x conforms a power law. They claim that a major deviation in PageRank distribution is an indicator of link spamming that benefits x. These methods are powerful at detecting large, automatically generated link spam structures with "unnatural" link patterns. However, they fail to recognize more sophisticated forms of spam, when spammers mimic reputable web content. "

So if you are using an off-the-shelf spam generator script you bought from a hyped-up sales letter, and a few thousand other people are using it too, that might set some flags off, as search engines look at the various systematic footprints most spam generators leave to remove the bulk of them from the index. A toy illustration of that outlier check follows below.]
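
Here is roughly what the Fetterly-style check in the quote above might look like (my own sketch on synthetic data; the spam degree of 37 and the flagging thresholds are arbitrary): fit a power law to the indegree histogram and flag degrees where far more pages pile up than the fit predicts, the kind of footprint a mass-produced spam generator leaves.

    import numpy as np
    from collections import Counter

    # Synthetic indegree data: a power-law-ish web plus a spam farm whose
    # pages all share the exact same indegree (37).
    indegrees = np.random.zipf(2.0, 100_000)
    indegrees = np.append(indegrees, [37] * 5_000)

    counts = Counter(indegrees)
    degrees = np.array(sorted(counts))
    observed = np.array([counts[d] for d in degrees], dtype=float)

    # Fit log(count) ~ a + b * log(degree); big positive residuals are outliers.
    coeffs = np.polyfit(np.log(degrees), np.log(observed), 1)
    predicted = np.exp(np.polyval(coeffs, np.log(degrees)))

    for d, obs, pred in zip(degrees, observed, predicted):
        if obs >= 100 and obs / pred > 10:  # well-populated degree, way over the fit
            print(f"indegree {d}: {obs:.0f} pages vs ~{pred:.0f} expected")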

Link from Gary

Google Accounts Being Pushed to Google AdWords Users

When you log into AdWords they have a notice that you should switch over to the new Google Accounts by January 15th, 2006.

Once you switch over, a new user access sub-tab appears, which allows you to share your AdWords account with co-workers without needing to share your personal Google account.

Google has more information about sharing an account and how to send invitations.

Not too long ago Google was giving out Google Account passwords.

Quality Content Without Links Is Not Quality Content...

There is a thread on WMW about the right price to sell an article for. The general consensus is that the author should probably wait it out until their site ranks and just keep their content.

While that is nice in theory, there is no guarantee that a site will eventually rank well just because it has decent content. Of course I am taking stuff out of context here, but you can read the thread to get the gist.

Comment:

As one site is willing to pay you, it doesn't make sense to give your articles away to the other site just to get a link.

Reply:

A friend of mine recently published an article on A List Apart. I think it would be hard to sell most any article for the value he is getting out of the authority of the link from that site, let alone the boost in credibility.

plus good primary links to your site may lead not only to direct exposure and link popularity, but also secondary exposure and more link love.

since your site is new you likely have lots of content and not so many links.

comment:

Whatever you decide, don't make the mistake of granting anyone exclusive rights to publish your work in perpetuity for peanuts.

reply:

for books I totally agree, but if you are obscure / new and / or are operating in a not so well known field and are good at writing articles sometimes giving them away is a great form of marketing.

rule #1: Obscurity is a far greater threat to authors and creative artists than piracy.

comment:

Regarding your site, you will never leave the sandbox unless you keep your content 100% to yourself.

reply:

I think sitting chill with minimal link popularity is far worse than trading some of what you got a lot of for something you don't got a lot of (ie: content for links)

The web has taught me a lot about not dwelling on what things could or should be worth: unless you actively work to make them worth it, inferior products that are marketed more aggressively will often win big.

if you have around a hundred articles I don't think it hurts you to share a few of them.

Some of the links you get by giving stuff away are links you never could have bought. Those are the ones that are usually worth a bunch too.

Friends don't let friends go unlinked. ;)
