WebmasterWorld has been running a series of threads about various penalties and filters tied to specific URLs, keyword phrases, and in some cases maybe even entire directories.
Some Threads:
There is a lot of noise in those threads, but you can put some pieces together from them. One of the best comments is from Joe Sinkwitz:
1. Phrase-based penalties & URL-based penalties; I'm seeing both.
2. On phrase-based penalties, I can look at the allinanchor: for that KW phrase, find several *.blogspot.com sites, run a copyscape on the site with the phrase-based penalty, and will see these same *.blogspot.com sites listed...scraping my and some of my competitors' content.
3. On URL-based penalties allinanchor: is useless because it seems to practically dump the entire site down to the dregs of the SERPs. Copyscape will still show a large amount of *.blogspot.com scraping though.
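The cross-check Joe describes boils down to intersecting two lists: domains that show up in the allinanchor: results for the penalized phrase, and domains that a duplicate-content check says are copying you. A minimal sketch of that intersection is below; the domain lists are placeholders you would fill in by hand from your own allinanchor: and Copyscape lookups.

# Hypothetical sketch: cross-reference domains seen in allinanchor: results
# with domains flagged by a duplicate-content check, to surface likely scrapers.
# Both input sets are invented placeholders.

allinanchor_domains = {
    "example-scraper.blogspot.com",
    "competitor-site.com",
    "another-scraper.blogspot.com",
}

duplicate_content_domains = {
    "example-scraper.blogspot.com",
    "another-scraper.blogspot.com",
    "unrelated-mirror.org",
}

# Domains that both rank on your anchor text and duplicate your copy
# are the prime suspects Joe describes.
suspects = sorted(allinanchor_domains & duplicate_content_domains)
for domain in suspects:
    print(domain)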
Joe has a similar post on his blog, and I covered a similar situation on September 1st of last year in Rotating Page Titles for Anchor Text Variation.
You see a lot more of the auto-gen spam in competitive verticals, and having a few sites that compete for those types of queries helps you see the new penalties, filters, and re-ranked results as they are rolled out.
Google Patents:
Google filed a patent application for Agent Rank, which is aimed at allowing them to associate portions of page content, site content, and cross-site content with individuals of varying degrees of trust. I doubt they have used this much yet, but the fact that they are even considering such a thing should indicate that many other types of penalties, filters, and re-ranking algorithms are already at play.
Some Google patents related to phrases, as pointed out by thegypsy here:
Bill Slawski has a great overview post touching on these patent applications.
Phrase Based Penalties:
Many types of automated and other low quality content creation produce pages that are barely semantically related to the language of their topic, while other types of spam generation produce pages that are too heavily aligned with it. Real content tends to fall within a range of semantic coverage.
Cheap or automated content typically tends to look unnatural, especially when you move beyond comparing words to looking at related phrases.
If a document is too far off in either direction (not enough OR too many related phrases), it could be deemed not relevant enough to rank, or flagged as a potential spam page. Once a document is flagged for one term it could also be flagged for other related terms. If enough pages from a site are flagged, a section of the site or the whole site can be flagged for manual review.
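To make the "range of semantic coverage" idea concrete, here is a rough sketch, with entirely made-up related phrases and thresholds (Google's real phrase lists and bands are unknown and query-dependent): count how many related phrases for a query appear on a page and flag pages that fall outside an expected band, whether too few or too many.

# A minimal sketch of the phrase-based idea, with invented numbers:
# count related phrases on a page and flag pages outside an expected band.

RELATED_PHRASES = {         # hypothetical related phrases for "home loan"
    "interest rate", "mortgage broker", "down payment",
    "credit score", "refinance", "closing costs",
}

EXPECTED_RANGE = (2, 5)     # assumed band for a natural page; the real
                            # thresholds are unknown and query-dependent

def phrase_score(page_text: str) -> int:
    text = page_text.lower()
    return sum(1 for phrase in RELATED_PHRASES if phrase in text)

def looks_unnatural(page_text: str) -> bool:
    low, high = EXPECTED_RANGE
    score = phrase_score(page_text)
    return score < low or score > high

# A stuffed page that mentions every related phrase trips the upper bound,
# while a thin scraped page that mentions none trips the lower bound.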
URL and Directory Based Penalties:
Would it make sense to prevent a spam page on a good domain from ranking for anything? Would it make sense for some penalties to be directory wide? Absolutely. Many types of cross site scripting errors and authority domain abuses (think rented advertisement folder or other ways to gain access to a trusted site) occur at the directory or subdomain level and share a common URL footprint. Cheaply produced content also tends to have section-wide footprints, where only a few words change in the page titles across an entire section of a site.
I recently saw an exploit on the W3C. Many other types of automated templated spam leave directory-wide footprints, and as Google places more weight on authoritative domains they need to get better at filtering out abuse of that authority. Google would love to be able to penalize things in a specific subdomain or folder without having to nuke the entire domain, so in some cases they probably do, and these filters or penalties probably affect both new domains and more established authoritative domains.
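One way to picture a directory-level footprint check is below. This is only a sketch under assumed data (the URLs, titles, and 0.7 threshold are all invented): group page titles by their top-level folder and flag sections where the titles are nearly identical except for a few swapped words.

# Hedged sketch of a directory-level footprint check.

from collections import defaultdict
from urllib.parse import urlparse

pages = {
    "https://example.com/loans/home-loan.html": "Cheap Home Loan Quotes Online",
    "https://example.com/loans/car-loan.html": "Cheap Car Loan Quotes Online",
    "https://example.com/loans/boat-loan.html": "Cheap Boat Loan Quotes Online",
    "https://example.com/blog/why-rates-rose.html": "Why Mortgage Rates Rose This Quarter",
}

sections = defaultdict(list)
for url, title in pages.items():
    folder = urlparse(url).path.split("/")[1]   # first path segment
    sections[folder].append(set(title.lower().split()))

for folder, titles in sections.items():
    if len(titles) < 2:
        continue
    shared = set.intersection(*titles)
    avg_len = sum(len(t) for t in titles) / len(titles)
    # If most words repeat across every title in the folder, the section
    # has the kind of templated footprint described above.
    if len(shared) / avg_len > 0.7:
        print(f"/{folder}/ looks templated: {len(shared)} shared words per title")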
How do You Know When You are Hit?
If you had a page which typically ranked well for a competitive keyword phrase, and you saw that page drop like a rock, you might have a problem. Other indications of problems are inferior pages ranking where your more authoritative page ranked in the past. For example, let's say you have a single mother home loan page ranking for a query where your home loan page ranked, but no longer does.
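If you keep rank tracking data, that "page swap" pattern is easy to check for mechanically. Here is a small sketch with hypothetical data: compare last month's rankings to this month's and flag keywords where your usual page vanished or a weaker page took its place.

# Sketch of the "how do you know you are hit" check; data is hypothetical.

previous = {"home loan": ("/home-loan.html", 4)}
current  = {"home loan": ("/single-mother-home-loan.html", 38)}

for keyword, (old_url, old_pos) in previous.items():
    new_url, new_pos = current.get(keyword, (None, None))
    if new_url is None:
        print(f"'{keyword}': no longer ranking at all")
    elif new_url != old_url:
        print(f"'{keyword}': {old_url} (was #{old_pos}) replaced by "
              f"{new_url} at #{new_pos} -- possible page-level penalty")
    elif new_pos - old_pos > 20:
        print(f"'{keyword}': {old_url} dropped from #{old_pos} to #{new_pos}")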
Textual Community:
Just like link profiles create communities, so does the type and variety of text on a page.
Search results tend to sample from a variety of interests. With any search query there are assumed common ideas that may be answered by a Google OneBox, related phrase suggestions, or answered based on the mixture of the types of sites shown in the organic search results. For example:
- how do I _____
- where do I buy a ____
- what is like a _____
- what is the history of ______
- consumer warnings about ____
- ______ reviews
- ______ news
- can I build a ___
- etc etc etc
TheWhippinpost had a brilliant comment in a WMW thread:
- The proximity, i.e... the "distance", between each of those technical words is most likely to be far closer on the merchant's page too (think product specification lists etc...).
- Tutorial pages will have a higher incidence of "how" and "why" types of words and phrases.
- Reviews will have more qualitative and experiential types of words ('... I found this to be robust and durable and was pleasantly surprised...').
- Sales pages similarly have their own (obvious) characteristics.
- Mass-generated spammy pages that rely on scraping and mashing-up content to avoid dupe filters whilst seeding in the all-important link-text (with "buy" words) etc... should, in theory, stand-out amongst the above, since the spam will likely draw from a mixture of all the above, in the wrong proportions.
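You could caricature that comment as a word-class profile: measure what proportion of a page is "how/why" language, experiential language, and "buy" language, and see whether the mixture matches the page type. The sketch below uses invented word lists and toy example strings purely for illustration.

# Rough sketch of the "textual community" idea; word lists are invented.

HOW_WHY_WORDS = {"how", "why", "step", "guide", "tutorial", "install"}
EXPERIENCE_WORDS = {"found", "surprised", "durable", "robust", "loved", "disappointed"}
BUY_WORDS = {"buy", "cheap", "discount", "order", "price", "shipping"}

def word_profile(text: str) -> dict:
    words = text.lower().split()
    total = max(len(words), 1)
    return {
        "how_why": sum(w in HOW_WHY_WORDS for w in words) / total,
        "experience": sum(w in EXPERIENCE_WORDS for w in words) / total,
        "buy": sum(w in BUY_WORDS for w in words) / total,
    }

tutorial = "how to install the widget step by step guide"
mashup = "buy cheap widget how guide durable discount order robust price buy"

print(word_profile(tutorial))   # dominated by how/why words
print(word_profile(mashup))     # a bit of everything, heavy on buy words, in the wrong proportions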
Don't forget that Google Base recently changed to require certain fields so they can help further standardize that commercial language the same way they standardized search ads to have 95 characters. Google is also scanning millions of books to learn more about how we use language in different fields.