New Directory, URL, & Keyword Phrase Based Google Filters & Penalties
WebmasterWorld has been running a series of threads about various penalties and filters aligned with specific URLs, keyword phrases, and in some cases maybe even entire directories.
Some Threads:
There is a lot of noise in those threads, but you can put some pieces together from them. One of the best comments is from Joe Sinkwitz:
1. Phrase-based penalties & URL-based penalties; I'm seeing both.
2. On phrase-based penalties, I can look at the allinanchor: for that KW phrase, find several *.blogspot.com sites, run a Copyscape check on the site with the phrase-based penalty, and will see these same *.blogspot.com sites listed...scraping my and some of my competitors' content.
3. On URL-based penalties allinanchor: is useless because it seems to practically dump the entire site down to the dregs of the SERPs. Copyscape will still show a large amount of *.blogspot.com scraping though.
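For illustration only, here is a minimal Python sketch of the overlap check Joe describes: the URL lists are hypothetical stand-ins for what you would gather by hand from the allinanchor: results and from a duplicate-content check such as Copyscape.

```python
# A minimal sketch of the diagnostic described above: collect the domains you
# see in the allinanchor: results for the penalized phrase, collect the domains
# a duplicate-content check reports as copying your page, and look at the
# overlap. Both input lists are hypothetical and gathered by hand.

from urllib.parse import urlparse

def domains(urls):
    """Reduce a list of URLs to a set of hostnames."""
    return {urlparse(u).netloc.lower() for u in urls}

allinanchor_urls = [            # hypothetical results for the penalized phrase
    "http://example-loans.blogspot.com/2007/01/post.html",
    "http://www.competitor-site.com/home-loans.html",
]
duplicate_content_urls = [      # hypothetical duplicate-content matches
    "http://example-loans.blogspot.com/2007/01/post.html",
    "http://another-scraper.blogspot.com/feed-mashup.html",
]

suspects = domains(allinanchor_urls) & domains(duplicate_content_urls)
print(sorted(suspects))         # domains that both rank in allinanchor and scrape you
```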
Joe has a similar post on his blog, and I covered a similar situation on September 1st of last year in Rotating Page Titles for Anchor Text Variation.
You see a lot more of the auto-gen spam in competitive verticals, and having a few sites that compete for those types of queries helps you see the new penalties, filters, and re-ranked results as they are rolled out.
Google Patents:
Google filed a patent application for Agent Rank, which is aimed at allowing them to associate portions of page content, site content, and cross-site content with individuals of varying degrees of trust. I doubt they have used this much yet, but the fact that they are even considering such a thing should indicate that many other types of penalties, filters, and re-ranking algorithms are already at play.
Some Google patents related to phrases, as pointed out by thegypsy here:
- Phrase-based searching in an information retrieval system
- Multiple index based information retrieval system
- Phrase-based generation of document descriptions
- Phrase identification in an information retrieval system
- Detecting spam documents in a phrase based information retrieval system
Bill Slawski has a great overview post touching on these patent applications.
Phrase Based Penalties:
Many types of automated and other low quality content creation produce pages that are barely semantically related to the local language of their topic, while other types of spam generation produce pages that are too heavily aligned with that language. Real content tends to fall within a range of semantic coverage.
Cheap or automated content typically tends to look unnatural, especially when you move beyond comparing words to looking at related phrases.
If a document is too far off in either direction (not enough OR too many related phrases) it could be deemed not relevant enough to rank, or flagged as a potential spam page. Once a document is flagged for one term it could also be flagged for other related terms. If enough pages from a site are flagged, a section of the site or the whole site can be flagged for manual review.
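To make that range idea concrete, here is a toy Python sketch. It is not Google's actual algorithm; the related-phrase list and thresholds are invented for illustration.

```python
# A rough sketch of the idea behind the phrase-based spam patents: count how
# many phrases related to the target phrase appear in a document and flag it
# when the count falls outside an expected band.

def related_phrase_count(text, related_phrases):
    text = text.lower()
    return sum(1 for p in related_phrases if p in text)

def classify(text, related_phrases, low=2, high=25):
    """Return 'thin', 'natural', or 'stuffed' based on related-phrase coverage."""
    count = related_phrase_count(text, related_phrases)
    if count < low:
        return "thin"      # barely related to the local language of the topic
    if count > high:
        return "stuffed"   # too heavily aligned; looks auto-generated
    return "natural"

related = ["interest rate", "mortgage broker", "down payment", "credit score", "refinance"]
print(classify("cheap loans cheap loans cheap loans", related))  # -> 'thin'
```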
URL and Directory Based Penalties:
Would it make sense to prevent a spam page on a good domain from ranking for anything? Would it make sense for some penalties to be directory wide? Absolutely. Many types of cross site scripting exploits and authority domain abuses (think rented advertisement folder or other ways to gain access to a trusted site) occur at a directory or subdomain level, and have a common URL footprint. Cheaply produced content also tends to have section wide footprints where only a few words are changed in the page titles across an entire section of a site.
I recently saw an exploit on the W3C site. Many other types of automated templated spam leave directory wide footprints, and as Google places more weight on authoritative domains they need to get better at filtering out abuse of that authority. Google would love to be able to penalize things in a specific subdomain or folder without having to nuke the entire domain, so in some cases they probably do, and these filters or penalties probably affect both new domains and more established authoritative domains.
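As a rough illustration of that kind of footprint, the following Python sketch scores how templated the page titles in a folder look. The URL-to-title mapping is hypothetical; a real crawl of your own site would supply it.

```python
# A minimal sketch of one directory-wide footprint: page titles within a folder
# that differ by only a word or two. High average pairwise similarity suggests
# a templated section. Data below is hypothetical.

from difflib import SequenceMatcher
from itertools import combinations

titles_by_url = {
    "/widgets/red-widgets.html":   "Buy Red Widgets Online - Best Prices",
    "/widgets/blue-widgets.html":  "Buy Blue Widgets Online - Best Prices",
    "/widgets/green-widgets.html": "Buy Green Widgets Online - Best Prices",
}

def folder_title_similarity(titles):
    """Average pairwise similarity of titles; near 1.0 looks templated."""
    pairs = list(combinations(titles, 2))
    if not pairs:
        return 0.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

print(round(folder_title_similarity(list(titles_by_url.values())), 2))  # ~0.9+
```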
How Do You Know When You Are Hit?
If you had a page which typically ranked well for a competitive keyword phrase and you saw that page drop like a rock, you might have a problem. Another indication of a problem is if inferior pages are ranking where your more authoritative page ranked in the past. For example, let's say you have a single mother home loan page ranking for a query where your home loan page ranked, but no longer does.
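Assuming you keep your own rank-tracking logs, a minimal sketch like the one below can surface that symptom. The phrase, URLs, and positions are hypothetical.

```python
# A minimal sketch of spotting the symptom described above, assuming you keep
# your own rank-tracking logs: for each tracked phrase, flag queries where the
# URL that used to rank has been replaced by a different (usually weaker) page
# from the same site, or has dropped sharply. Data below is hypothetical.

last_month = {"home loans": ("/home-loans.html", 4)}
this_month = {"home loans": ("/single-mother-home-loans.html", 38)}

for phrase, (old_url, old_pos) in last_month.items():
    new_url, new_pos = this_month.get(phrase, (None, None))
    if new_url is None:
        print(f"'{phrase}': {old_url} no longer tracked/ranking")
    elif new_url != old_url or new_pos > old_pos + 20:
        print(f"'{phrase}': {old_url} (#{old_pos}) replaced by {new_url} (#{new_pos}) - possible filter")
```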
Textual Community:
Just like link profiles create communities, so do the type and variety of text on a page.
Search results tend to sample from a variety of interests. With any search query there are assumed common ideas that may be answered by a Google OneBox, related phrase suggestions, or answered based on the mixture of the types of sites shown in the organic search results. For example:
- how do I _____
- where do I buy a ____
- what is like a _____
- what is the history of ______
- consumer warnings about ____
- ______ reviews
- ______ news
- can I build a ___
- etc etc etc
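As a toy illustration of those intent buckets, the sketch below maps a query to a rough intent using simple keyword cues. The cue lists are invented for illustration, not anything Google publishes.

```python
# A toy sketch of bucketing queries into the kinds of intents listed above,
# using simple keyword cues. The cue lists are illustrative only.

INTENT_CUES = {
    "how-to":     ["how do i", "how to", "can i build"],
    "buy":        ["where do i buy", "buy", "price"],
    "comparison": ["what is like a", "alternative"],
    "history":    ["history of"],
    "reviews":    ["review", "reviews", "consumer warnings"],
    "news":       ["news", "latest"],
}

def guess_intent(query):
    q = query.lower()
    for intent, cues in INTENT_CUES.items():
        if any(cue in q for cue in cues):
            return intent
    return "unknown"

print(guess_intent("where do I buy a blue widget"))   # -> 'buy'
print(guess_intent("blue widget reviews"))            # -> 'reviews'
```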
TheWhippinpost had a brilliant comment in a WMW thread:
- The proximity, i.e. the "distance", between each of those technical words is most likely to be far closer on the merchant's page too (think product specification lists etc...).
- Tutorial pages will have a higher incidence of "how" and "why" types of words and phrases.
- Reviews will have more qualitative and experiential types of words ('... I found this to be robust and durable and was pleasantly surprised...').
- Sales pages similarly have their own (obvious) characteristics.
- Mass-generated spammy pages that rely on scraping and mashing-up content to avoid dupe filters whilst seeding in the all-important link-text (with "buy" words) etc... should, in theory, stand-out amongst the above, since the spam will likely draw from a mixture of all the above, in the wrong proportions.
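A hedged sketch of that "proportions" idea: count how much of a page's text falls into a few hand-made word buckets and compare the mix across page types. The buckets and example snippets below are purely illustrative.

```python
# Count the share of a page's words that fall into a few hand-made buckets.
# Real pages tend to lean heavily toward one bucket; scraped mash-ups tend to
# show a bit of everything, in the wrong proportions. Buckets are illustrative.

import re

BUCKETS = {
    "spec":       {"dimensions", "weight", "voltage", "model", "warranty"},
    "how_why":    {"how", "why", "step", "install", "configure"},
    "experience": {"found", "surprised", "durable", "robust", "disappointed"},
    "sales":      {"buy", "cheap", "discount", "order", "shipping"},
}

def bucket_mix(text):
    words = re.findall(r"[a-z']+", text.lower())
    total = max(len(words), 1)
    return {name: sum(w in vocab for w in words) / total
            for name, vocab in BUCKETS.items()}

review = "I found this widget to be robust and durable and was pleasantly surprised."
mashup = "buy cheap widget discount order how install voltage warranty robust buy buy"

print(bucket_mix(review))  # experience-heavy mix
print(bucket_mix(mashup))  # a bit of everything, in the wrong proportions
```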
Don't forget that Google Base recently changed to require certain fields so they can help further standardize that commercial language the same way they standardized search ads to have 95 characters. Google is also scanning millions of books to learn more about how we use language in different fields.
Comments
I agree with Mike. You put all this time and effort into a website and then have to deal with making sure you aren't penalized. I mean come on, having to change your titles and such all the time? The majority of the people who link to my site use the site's name. So am I getting penalized because all the link text is the same? Realistically most people who link to a site use the site's name, and only people who know about SEO will use keywords and link within a paragraph.
Hi Tim
I have many legitimate links to my homepage that say learn SEO, Aaron Wall, SEO Book, Aaron's SEO Book, etc etc etc. There is a decent amount of variation in there, though at one point (about 1.5 years ago) there was probably not enough due to the artificial nature of my link profile AND Google getting too aggressive with a filter that nuked my site for its own name and also hit Paypal's official site for their name.
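For what it is worth, here is a small sketch of one way to put a number on that variation: the share of backlinks using the single most common anchor text. The anchor list is hypothetical; a real backlink export would supply it.

```python
# A small sketch of measuring anchor text variation: what fraction of links
# use the single most common anchor? The list below is hypothetical.

from collections import Counter

anchors = ["SEO Book", "SEO Book", "Aaron Wall", "learn SEO",
           "Aaron's SEO Book", "SEO Book", "seobook.com"]

counts = Counter(a.lower() for a in anchors)
top_anchor, top_count = counts.most_common(1)[0]
print(f"top anchor '{top_anchor}' accounts for {top_count / len(anchors):.0%} of links")
```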
Come on..
The natural way is for search engines to follow website trends, not for websites to follow search engines' trends and rules.
I'm tired of worrying about search engines. Forget them; I will focus on getting visitors and customers through links from other websites, and if they also arrive through search engines I'll take it as a bonus.
I'm tired of depending on search engines to make or break my business.
Google is obsolete, because ranking pages on popularity instead of quality is WRONG.
I don't want to find popular pages as results, instead I want to find quality pages as results, popular or not.
I think very soon a NEW search engine will spring up from someplace, a search engine that will rank pages on their TRUE quality instead of how popular they are.
Personally, when I'm searching I don't want popular junk and garbage, and I don't want old pages being favored at the expense of new pages; I want the highest quality pages to appear first (old or brand new), not the most popular.
I think someone should patent a Quality Ranking system, but I have no idea how such an idea could be put into practice.
Quality sites on a topic first, lower quality sites last, regardless of their link popularity, keyword density, or age.
Maybe the next search engine will be human driven? I mean, humans voting on results as they search, casting lower quality pages out of the top while pushing the highest quality sites to the top?
With some good anti-abuse systems, this could work great
Hey Gang - here's some related Googies in the Ultimate Phrase Based Indexing and Retrieval Guide - http://www.huomah.com/search-engines/algorithm-matters/phrase-based-opti...
>>The natural way is for search engines to follow website trends, not for websites to follow search engines' trends and rules.
Spot on! Search engines are following their own contorted number logic and getting sucked into a pure numbers game.
There goes my "encyclopedic" approach, eh, thanks for the heads up, Aaron, great post again.
Ouch, more penalties?
And only some of them show up in Webmaster Tools? So odd - they argue that it helps prevent spammers from learning how to get around the penalties, but it makes it difficult for legitimate webmasters to sleep at night...