Link Spam Detection Based on Mass Estimation
Not sure how many people believe TrustRank is in effect in the current Google algorithm, but I would be willing to bet it is. Recently another link quality research paper came out by the name of Link Spam Detection Based on Mass Estimation [PDF].
It was authored by Zoltan Gyongyi, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen.
The proposed method for determining Spam Mass works to detect spam, so it complements TrustRank nicely (TrustRank is primarily aimed at detecting quality pages and demoting spam).
The paper starts off by defining what spam mass is.
Spam Mass - an estimate of how much PageRank a page accumulates by being linked to from spam pages.
I covered a bunch of the how-it-works theory in the extended area of this post, but the general take-home tips from the article are
- .edu and .gov love is the real deal, and then some
- Don't be scared of getting a few spammy links (everyone has some).
- TrustRank may down-weight the effects of some spammy links. Since most spammy links have a low authority score, they do not make up a high percentage of your PageRank-weighted link popularity if you have some good quality links. A few bad inbound links are not going to put your site over the edge to where it is algorithmically tagged as spam unless you were already near the limit prior to picking them up.
- If you can get a few well known trusted links you can get away with having a large number of spammy links.
- These types of algorithms work on a relative basis. If you can get more traditional media coverage than the competition you can get away with having a bunch more junk links as well.
- Following up on that last point, some sites may be doing well in spite of some of the things they are doing. If you aim to replicate the link profile of a competitor, make sure you spend some time building up some serious quality links before going after too many spammy or semi-spammy links.
- Human review is here to stay in search algorithms. Humans are only going to get more important. Inside workers, remote quality raters, and user feedback and tagging give search engines another layer to build upon beyond link analysis.
- Only a few quality links are needed to rank in Google in many fields.
- If you can get the right resources to be interested in linking your way (directly or indirectly) a quality on topic high PageRank .edu link can be worth some serious cash.
- Sometimes the cheapest way to get those kinds of links will be creating causes or linkbait, which may be external to your main site.
On to the review...
- To determine the effect of spam mass they compute PageRank twice: once normally, and then again with more weight on known trusted sites that would be deemed to have a low spam mass.
- Spammers either use a large number of low PageRank links, a few hard to get high PageRank links, or some combination of the two.
- While quality authoritative links to spam sites are rarer, they are often obtained through the following
- blog / comment / forum / guestbook spam
- honey pots (creating something useful to gather link popularity to send to spam)
- buying recently expired domain names
- if the majority of inlinks are from spam nodes it is assumed that the host is spam, otherwise it is labeled good. Rather than looking at the raw link count this can further be biased by looking at percent of total PageRank which comes from spam nodes
- to further determine the percent of PageRank due to spam nodes you can also look at the link structure of indirect nodes and how they pass PageRank toward the end node
- knowing in advance whether every node is good or bad is not feasible, so it must be estimated from a subset of the index
- for this to be practical search engines must have white lists and / or black lists to compare other nodes to. these can be compiled automatically or manually
- it is easier to assemble a good core since it is fairly reliable and does not change as often as spam techniques and spam sites (Aaron speculation: perhaps this is part of the reason some uber spammy older sites are getting away with murder...having many links from the good core from back when links were easier to obtain)
- since the hand-reviewed core will be a much smaller sample than the number of good pages on the web, you must also review a small uniform random sample of the web to determine the approximate percentage of the web that is spam, which is used to normalize the estimated spam mass
- due to sampling methods some nodes may have a negative spam mass, and are likely to be nodes that were either assumed to be good in advance or nodes which are linked closely and heavily to other good nodes
- it was too hard to manually create a large human reviewed set, so
- they placed all sites listed in a small directory they considered to be virtually void of spam in the good core (they chose not to disclose the URL...anyone want to guess which one it was?). this group consisted of 16,776 hosts.
- .gov and .edu hosts (and a few international organizations) also got placed in the good core
- those sources gave them 504,150 unique trusted hosts
- of the 73.3 million hosts in their test set 91.1% have a PageRank less than 2 (less than double the minimum PageRank value)
- only about 64,000 hosts had a PageRank 100 times the minimum or more
- they selected an arbitrary minimum PageRank limit for reviewing the final results (since you are only concerned about the higher PageRank results that would appear atop search results). this left a group of 883,328 sites, of which they hand reviewed 892 hosts
- 564 (63.2%) were quality
- 229 (25.7%) were spam
- 54 (6.1%) uncertain (like beauty, spam is in the eye of the beholder)
- 45 (5%) hosts down
- ALL high spam mass anomalies on good sites were categorized into the following three groups
- some Alibaba sites (Chinese sites were far from the core group),
- Blogger.com.br (relatively isolated from core group),
- .pl URLs (there were only 12 Polish educational institutions in the core group)
- Calculating relative mass is better than absolute mass (which is only logical if you wanted the system to scale, so I don't know why they put it in the paper). Example of why absolute spam mass does not work:
- Adobe had lowest absolute spam mass (Aaron speculation: those taking the time to create a PDF are probably more concerned with content quality than the average website)
- Macromedia had third highest absolute spam mass (Aaron speculation: lots of adult and casino type sites have links to Flash)
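To make the two-PageRank idea above concrete, here is a rough sketch in Python. The six-node graph, the trusted set, and the damping value are all invented for illustration; the paper's actual computation runs at web scale with additional scaling and normalization steps.

```python
import numpy as np

def pagerank(adj, teleport, damping=0.85, iters=100):
    """Power-iteration PageRank; `teleport` is the random-jump distribution."""
    out_deg = adj.sum(axis=1, keepdims=True)
    out_deg[out_deg == 0] = 1            # dangling nodes: avoid divide by zero
    M = (adj / out_deg).T                # column-stochastic transition matrix
    p = np.full(adj.shape[0], 1.0 / adj.shape[0])
    for _ in range(iters):
        p = damping * (M @ p) + (1 - damping) * teleport
    return p / p.sum()                   # renormalize (dangling nodes leak mass)

# toy 6-node web: hosts 0-1 are a trusted core, 4-5 are a spam farm,
# host 2 is linked from the core, host 3 only from the spam farm
adj = np.zeros((6, 6))
for src, dst in [(0, 1), (0, 2), (1, 0), (1, 2), (4, 3), (4, 5), (5, 3), (5, 4)]:
    adj[src, dst] = 1

n = adj.shape[0]
p_normal  = pagerank(adj, np.full(n, 1.0 / n))                # ordinary PageRank
p_trusted = pagerank(adj, np.array([0.5, 0.5, 0, 0, 0, 0]))   # jump only to the core

# relative spam mass: the share of a host's PageRank NOT explained by
# its connection to the trusted core (can go negative near the core)
rel_mass = (p_normal - p_trusted) / p_normal
```

In this toy run host 3 ends up with a relative spam mass near 1 (all of its PageRank arrives through the spam farm), while host 2, fed directly by the trusted core, comes out negative, which matches the paper's observation that nodes linked closely and heavily to good nodes can have negative spam mass.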
[update: Orion also mentioned something useful about the paper on SEW forums.
"A number of recent publications propose link spam detection methods. For instance, Fetterly et al. [Fetterly et al., 2004] analyze the indegree and outdegree distributions of web pages. Most web pages have in- and outdegrees that follow a power-law distribution. Occasionally, however, search engines encounter substantially more pages with the exact same in- or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages. Similarly, Benczúr et al. [Benczúr et al., 2005] verify for each page x whether the distribution of PageRank scores of pages pointing to x conforms to a power law. They claim that a major deviation in PageRank distribution is an indicator of link spamming that benefits x. These methods are powerful at detecting large, automatically generated link spam structures with "unnatural" link patterns. However, they fail to recognize more sophisticated forms of spam, when spammers mimic reputable web content. "
So if you are using an off the shelf spam generator script you bought from a hyped up sales letter and a few thousand other people are using it that might set some flags off, as search engines look at the various systematic footprints most spam generators leave to remove the bulk of them from the index.]
Link from Gary
Comments
From the above: "Spammers either use a large number of low PageRank links, a few hard to get high PageRank links, or some combination of the two."
Seems pretty broad and could almost cover the incoming links for, er, possibly 80% of sites on the net?
Surely it is possible for new people with new ideas or new sites to break into social networks. It happens all the time. It is just that as time passes the social relationships become more self-reinforcing and the quantity of content on the web increases, so you have to do more or better stuff to be seen as remarkable or interesting.
>Seems pretty broad and could almost cover the incoming links for, er, possibly 80% of sites on the net?
Well I think if you look at the power laws bit mentioned by Orion that you can think of it a bit more as mathematical patterns.
Most automated spam systems have patterns associated with them.
Algorithms and systems like these are not about catching all spam, but are about reducing the volume of spam to a manageable level.
They are looking for spikes in the high PageRank Web sites (and their scale for PageRank does not match the classic PageRank definition -- this is a modified PageRank algorithm).
According to their test, about 1 in 4 "high PageRank" sites were found to be spam. Given how few of the sites were tested (less than 1 million out of a database of 70+ million), these sites would probably correspond to Toolbar PR 8 and above sites, maybe only PR 9 and above.
A lot of layering is implied in this model. That is, they suggest that the spammers create(d) low-quality sites that were used to boost a few other sites, which were used to boost ultimate target sites to high PR. My feeling is that the practice is a bit more sophisticated than that.
The paper also identifies "cliques" on the Web, isolated regions where good content pretty much links in a closed group. The MegaSpam sites apparently don't fall into the clique categories. They have stepped past the boundaries of self-inclusion and actually participate in natural Web community linking.
But the paper argues that, despite this apparent blending with the natural Web community, bad sites with high PR still stick out as bad sites.
Since PR doesn't affect search rankings very much, the paper in itself is not significant, but it may signal future trends in spam detection that could feed new filters.
SEO book is an impressive work of art that sits quietly in cyberspace. Your writing is high quality, yet quite understandable. Your style is unassuming and your approach "matter of factly". I hope I can one day communicate like you do. I have been teaching medics for 8 years and I decided to do what I really like with my life and that is why I found your blog. If I write anything inappropriate, please edit it and forgive me, this is my first!
Your blog and your site in general has helped me very much and the advice and guidance reassures me. I have only just started this journey of doing business on the internet. I wish I could say "watch this space" or maybe I can. I am determined and quite motivated.
I have seen some ezine articles and believe it or not spamming has been carried to that level with the advent of the resource box. Loads of articles are written that lack substance and you never see a comment. The authors are not reading, so what are they writing about? Does the possibility of repetition of maybe what someone else wrote a few days earlier bother them? More repulsive are the typos.......even one mistake on a website de-ranks (for want of a better word) the site and the zeal to continue reading is gone in 60 milliseconds. The articles in themselves are just promotional reviews of what the author does. Many of them also sell hyped ebooks that were written by them, reviewed by them and are now being marketed by them! The funny thing is that many of their sites are single page extra long sites. You need a magnifying glass to see the scrollbar, and the page, though static html, seems like the Las Vegas skyline, with abominable use of highlighting and a cacophony of design imbalance. I hope these spammers will fall by the wayside and allow us to have a decent place on earth.
Thank you for your contribution to the internet and especially to my net education. While I continue to build my business on the web, I will visit your blog as often as possible and feed my brain with your knowledge and the resource here. I hope that in less than a year I can say the success of my site, www.goodhosthunting.com was because of what I learned from Seobook.
Michael
Hi Michael
Thanks for the kind comment.
I will warn you though that if you look closely enough at some of my writing there are lots of typos in it as well.
I know, I was just talking I guess. Even after typing I saw there were lots of typos in mine too. But what I was trying to get at was that when we are trying to market ourselves or our website through ezines, one should take fine details seriously. But I am sure I will learn more as I go through your endless site! I am feasting.
Have a nice weekend.
Michael
If such "cliques" exist amongst the elite sites and they don't generally link out to, let's say, "the less than desirables", then is it even possible for a new site that can't get the attention of the elite to still climb the ladder to super stardom and become a recognized top tier site as well?
What I find really annoying about this is that a lot of good forums are much more reliable than an organisation. Organisations have an agenda - one of the things I do is work within marketing projects within a university and I have news for Google - EDUCATIONAL INSTITUTIONS THESE DAYS ARE BUSINESSES. How can they not understand this?
Forums on the other hand are quite public and some (like this one I presume) are highly moderated by experts in the field. They make a highly valid contribution. Why is it then that basically any edu link automatically scores highly but a link on a forum is bogus (presumably if I put a spam link here that had nothing to do with the forum, Aaron would moderate it). It goes without saying that all black hats should enroll in university courses just to get university hosting project space / pay some student to link to their site on their personal student page.
I think one reason is that most forums eventually suffer rot and are not well maintained. Plus Google is not trying to stop nepotism, marketing, and market manipulation, they are simply trying to make it time-consuming and expensive such that
Of course there is a cap on AdWords because not everyone can win in an auction. As far as the other part, I think it is a bit up in the air as to whether it is succeeding or not. I actually think publishers are getting more aggressive at creating information pollution, but maybe Google doesn't care about that so long as subverting AdWords is expensive.