Greg Boser: Blogger

Greg says "Oh my God, I've Become a Blogger." A great thing for webmasters and search in general, IMHO.

Greg asks:

But now comes the hard part. How do you go about creating a blog about search marketing that is truly unique?

Anyone ever notice that the black hat SEO blogs typically have both higher content quality and more original content than the typical white hat SEO blogs? Apparently, Gordon Hotchkiss has yet to get the memo.

via Oilman

Regulating Search Conference @ Yale

The Information Society Project at Yale Law School is hosting "Regulating Search?: A Symposium on Search Engines, Law, and Public Policy," the first academic conference devoted to search engines and the law. "Regulating Search?" will take place on December 3, 2005 at Yale Law School in New Haven, CT.

Topics covered:

  • Panel 1: The Search Space
    This panel will review the wide range of what search engines do and their importance in the information ecosystem.

  • Panel 2: Search Engines and Public Regulation
    This panel will discuss the possibility of direct government regulation of search functionality.

  • Panel 3: Search Engines and Intellectual Property
    This panel will review past and present litigation involving search engines and claims framed in the legal doctrines of copyright, trademark, patent, and right of publicity.

  • Panel 4: Search Engines and Individual Rights
    This panel will look at the role of search engines in reshaping our experience of basic rights, and at the pressures that the desire to protect those rights places on search.

Early bird registration fees (early registration ends on Nov. 15):

  • $35 for students

  • $75 for academic and nonprofit participants
  • $165 for corporate and law firm participants

Free Open Source Keyword Phrase List Generator

Probably the least exciting of the SEO / SEM tools featured here so far, but recently my friend Mike created a keyword phrase list generator.

I made it open source, so if you like it feel free to link to it, mirror it, or improve it.

Google May Sell Ads for Chicago Newspaper Company

UPDATE: Google Weighing Test Of Print Ads In Newspapers

Google Inc. (GOOG) is considering testing print advertisements in Chicago newspapers, in a sign that the Internet giant, to date seen primarily as a threat to traditional media, could also become an ally.

If Google could take the inefficiencies out of offline media they probably could end up making the papers more revenue in the short run. The long run is anyone's guess.

Google Sandbox Tip

What is the difference between how real news stories spread and how link spam or artificial link manipulation spreads?

Link Spam Detection Based on Mass Estimation

Not sure how many people believe TrustRank is in effect in the current Google algorithm, but I would be willing to bet it is. Recently another link quality research paper came out by the name of Link Spam Detection Based on Mass Estimation [PDF].

It was authored by Zoltan Gyongyi, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen.

The proposed method for determining Spam Mass works to detect spam, so it complements TrustRank nicely (TrustRank is primarily aimed at detecting quality pages and demoting spam).

The paper starts off by defining what spam mass is.

Spam Mass - an estimate of how much PageRank a page accumulates by being linked to from spam pages.

I covered a bunch of the how-it-works-in-theory stuff in the extended area of this post, but the general take-home tips from the article are:

  • .edu and .gov love is the real deal, and then some

  • Don't be scared of getting a few spammy links (everyone has some).
  • TrustRank may deweight the effects of some spammy links. Since most spammy links have a low authority score they do not comprise a high percentage of your PageRank weighted link popularity if you have some good quality links. A few bad inbound links are not going to put your site over the edge to where it is algorithmically tagged as spam unless you were already near the limit prior to picking them up.
  • If you can get a few well known trusted links you can get away with having a large number of spammy links.
  • These types of algorithms work on a relative basis. If you can get more traditional media coverage than the competition you can get away with having a bunch more junk links as well.
  • Following up on that last point, some sites may be doing well in spite of some of the things they are doing. If you aim to replicate the linkage profile of a competitor make sure you spend some time building up some serious quality linkage data before going after too many spammy or semi spammy links.
  • Human review is here to stay in search algorithms. Humans are only going to get more important. Inside workers, remote quality raters, and user feedback and tagging give search engines another layer to build upon beyond link analysis.
  • Only a few quality links are needed to rank in Google in many fields.
  • If you can get the right resources to be interested in linking your way (directly or indirectly) a quality on topic high PageRank .edu link can be worth some serious cash.
  • Sometimes the cheapest way to get those kinds of links will be creating causes or linkbait, which may be external to your main site.

On to the review...

  • To determine the effect of spam mass they compute PageRank twice: once normally, and then again with more weight on known trusted sites that would be deemed to have a low spam mass.

  • Spammers either use a large number of low PageRank links, a few hard to get high PageRank links, or some combination of the two.
  • While the quality authoritative links to spam sites are more rare, they are often obtained through the following
    • blog / comment / forum / guestbook spam

    • honey pots (creating something useful to gather link popularity to send to spam)
    • buying recently expired domain names
  • if the majority of inlinks are from spam nodes it is assumed that the host is spam, otherwise it is labeled good. Rather than looking at the raw link count, this can further be biased by looking at the percent of total PageRank which comes from spam nodes
  • to further determine the percent of PageRank due to spam nodes you can also look at the link structure of indirect nodes and how they pass PageRank toward the end node
  • the presumption of knowing whether something is good or bad is not feasible, so it must be estimated from a subset of the index
  • for this to be practical search engines must have white lists and / or black lists to compare other nodes to. this can be compiled automatically or manually
  • it is easier to assemble a good core since it is fairly reliable and does not change as often as spam techniques and spam sites (Aaron speculation: perhaps this is part of the reason some uber spammy older sites are getting away with murder...having many links from the good core from back when links were easier to obtain)
  • since the small reviewed core will be a much smaller sample than the full set of good pages on the web, you must also review a small uniform random sample of the web to determine the approximate percent of the web that is spam, to normalize the estimated spam mass
  • due to sampling methods some nodes may have a negative spam mass; these are likely to be nodes that were either assumed to be good in advance or nodes which are linked closely and heavily to other good nodes
  • it was too hard to manually create a large human reviewed set, so
    • they placed all sites listed in a small directory they considered to be virtually void of spam in the good core (they chose not to disclose the URL...anyone want to guess which one it was?). this group consisted of 16,776 hosts.

    • .gov and .edu hosts (and a few international organizations) also got placed in the good core
    • those sources gave them 504,150 unique trusted hosts
  • of the 73.3 million hosts in their test set 91.1% have a PageRank less than 2 (less than double the minimum PageRank value)
  • only about 64,000 hosts had a PageRank 100 times the minimum or more
  • they selected an arbitrary minimum PageRank limit for reviewing the final results (since you are only concerned with the higher PageRank results that would appear atop search results); of this group of 883,328 hosts they hand reviewed 892

    • 564 (63.2%) were quality

    • 229 (25.7%) were spam
    • 54 (6.1%) uncertain (like beauty, spam is in the eye of the beholder)
    • 45 (5%) hosts down
  • ALL high spam mass anomalies on good sites were categorized into the following three groups
    • some Alibaba sites (Chinese sites were far from the core group),
    • Blogger.com.br (relatively isolated from the core group),
    • .pl URLs (there were only 12 Polish educational institutions in the core group)
  • Calculating relative mass is better than absolute mass (which is only logical if you wanted the system to scale, so I don't know why they put it in the paper). Example of why absolute spam mass does not work:
    • Adobe had the lowest absolute spam mass (Aaron speculation: those taking the time to create a PDF are probably more concerned with content quality than the average website)

    • Macromedia had the third highest absolute spam mass (Aaron speculation: lots of adult and casino type sites have links to Flash)
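The two-pass computation in the review above can be sketched in a few lines. This is strictly my own toy illustration, not code from the paper: the five-node graph, damping factor, and one-node trusted core are all made-up assumptions.

```javascript
// Toy sketch of spam mass estimation: run PageRank twice -- once with
// uniform teleportation, once teleporting only to a trusted core -- and
// compare the two scores for each node.
function pageRank(links, teleport, damping = 0.85, iters = 100) {
  const n = links.length;
  let p = new Array(n).fill(1 / n);
  for (let it = 0; it < iters; it++) {
    const next = teleport.map(t => (1 - damping) * t);
    for (let u = 0; u < n; u++) {
      if (links[u].length === 0) {
        // dangling node: hand its rank back out along the teleport vector
        for (let v = 0; v < n; v++) next[v] += damping * p[u] * teleport[v];
      } else {
        for (const v of links[u]) next[v] += damping * p[u] / links[u].length;
      }
    }
    p = next;
  }
  return p;
}

// node 0 = trusted hub, 1 = ordinary good page, 2 = target page, 3-4 = spam farm
const links = [[1, 2], [2], [], [2], [2]];
const uniform = links.map(() => 1 / links.length);
const trustedOnly = [1, 0, 0, 0, 0]; // all teleportation goes to the trusted core

const p = pageRank(links, uniform);
const pCore = pageRank(links, trustedOnly);

// relative spam mass: the share of a node's PageRank NOT backed by the core
const spamMass = p.map((pi, i) => (pi - pCore[i]) / pi);
```

In this toy graph the target page fed by the spam farm ends up with a clearly positive spam mass, while the page linked only from the trusted hub actually comes out negative, much like the negative spam mass nodes the paper mentions.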

[update: Orion also mentioned something useful about the paper on SEW forums.

"A number of recent publications propose link spam detection methods. For instance, Fetterly et al. [Fetterly et al., 2004] analyze the indegree and outdegree distributions of web pages. Most web pages have in- and outdegrees that follow a power-law distribution. Occasionally, however, search engines encounter substantially more pages with the exact same in- or outdegrees than what is predicted by the distribution formula. The authors find that the vast majority of such outliers are spam pages. Similarly, Benczúr et al. [Benczúr et al., 2005] verify for each page x whether the distribution of PageRank scores of pages pointing to x conforms a power law. They claim that a major deviation in PageRank distribution is an indicator of link spamming that benefits x. These methods are powerful at detecting large, automatically generated link spam structures with "unnatural" link patterns. However, they fail to recognize more sophisticated forms of spam, when spammers mimic reputable web content. "

So if you are using an off the shelf spam generator script you bought from a hyped up sales letter and a few thousand other people are using it that might set some flags off, as search engines look at the various systematic footprints most spam generators leave to remove the bulk of them from the index.]

Link from Gary

Google Accounts Being Pushed to Google AdWords Users

When you log into AdWords they have a notice that you should switch over to the new Google Accounts by January 15th, 2006.

Once you switch over, a new user access sub tab appears, which allows you to share your AdWords account with co-workers without needing to share your personal Google account.

Google has more information about sharing an account and how to send invitations.

Not too long ago Google was giving out Google Account passwords.

Quality Content Without Links Is Not Quality Content...

There is a thread on WMW about the right price to sell an article for. The general consensus is that the author should probably wait it out until their site ranks and just keep their content.

While that is nice in theory, there is no guarantee that a site will eventually rank well just because it has decent content. Of course I am taking stuff out of context here, but you can read the thread to get the gist.

Comment:

As one site is willing to pay you, it doesn't make sense to give your articles away to the other site just to get a link.

Reply:

A friend of mine recently published an article on A List Apart. I think it would be hard to sell most any article for the value he is getting out of the authority of the link from that site, let alone the boost in credibility.

Plus, good primary links to your site may lead not only to direct exposure and link popularity, but also secondary exposure and more link love.

Since your site is new you likely have lots of content and not so many links.

Comment:

Whatever you decide, don't make the mistake of granting anyone exclusive rights to publish your work in perpetuity for peanuts.

Reply:

For books I totally agree, but if you are obscure / new and / or are operating in a not so well known field and are good at writing articles, sometimes giving them away is a great form of marketing.

Rule #1: Obscurity is a far greater threat to authors and creative artists than piracy.

Comment:

Regarding your site, you will never leave the sandbox unless you keep your content 100% to yourself.

Reply:

I think sitting chill with minimal link popularity is far worse than trading some of what you got a lot of for something you don't got a lot of (i.e., content for links).

The web has taught me a lot about not assuming what things could or should be worth: unless you actively work to make them worth it, inferior products which are marketed more aggressively will often win big.

If you have around a hundred articles I don't think it hurts you to share a few of them.

Some of the links you get by giving stuff away are links you never could have bought. Those are the ones that are usually worth a bunch too.

Friends don't let friends go unlinked. ;)

Playing on the Web...2.0 ;)

Blummy - a Firefox bookmarklet management tool that is loved by the Web 2.0 geek. It allows you to put many bookmarklets into an expandable box that opens up when you click on it.

For example, the Link Harvester blummlet looks like:

javascript:Blummy.href('http://www.linkhounds.com/link-harvester/backlinks.php?query='+location.href)

and would run Link Harvester on whatever page you are viewing.

A regular bookmarklet for it would look like:

javascript:location = 'http://www.linkhounds.com/link-harvester/backlinks.php?query=' + escape(location);
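The same pattern works for any tool that accepts the current URL as a query parameter; here is a throwaway generator (the helper name is my own invention, not part of Blummy or Link Harvester):

```javascript
// Tiny helper (my own naming) that builds a location-passing bookmarklet
// for any tool endpoint that takes a URL as its query string.
function makeBookmarklet(toolUrl) {
  return "javascript:location = '" + toolUrl + "' + escape(location);";
}

// generate a bookmarklet for the Link Harvester endpoint
const bm = makeBookmarklet(
  "http://www.linkhounds.com/link-harvester/backlinks.php?query=");
```

Calling it with the Link Harvester endpoint reproduces the regular bookmarklet above character for character.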

Here is a list of a wide variety of Mozilla bookmarklets, including character count and word frequency bookmarklets.

I was reading some Dive Into Greasemonkey today...good stuff. I just wish I knew a bit more about XPath and JavaScript Firefox strategies.

It will probably take me at least a few days before I could make anything cool. I may try though, and if not I could always bug Mike, and maybe Platypus is more my mode :)

A Greasemonkey Hacks book was recently released. Greasemonkey is cool stuff, not just because DaveN says so, but also because you can do things like number search results and import del.icio.us data right into Google search results.

Here is a cool free video maker. I made one today, though it takes forever to upload and sounds like I am eating the mic. I will probably upload it tomorrowish.

I have been far too textual, and think I need to start looking more at trying to learn programming languages, audio, and visual stuff :)

I got to chat for a while with one of the guys from Validome, and they sure do some cool stuff over there.

For those wondering how this post is in any way relevant to search, you can tell a good bit about how competitive a field may be by seeing how many of the top ranked results are annotated.

GoDaddy References Google's Patent

You know you have good reach as a search engine when registrars use your patent numbers to sell domains. GoDaddy says:

Google recently filed United States Patent Application 20050071741. As part of that patent application, Google made apparent its efforts to wipe out search engine spam, stating:

"Valuable (legitimate) domains are often paid for several years in advance, while doorway (illegitimate) domains rarely are used for more than a year. Therefore, the date when a domain expires in the future can be used as a factor in predicting the legitimacy of a domain and, thus, the documents associated therewith."

Domains registered for longer periods give the indication, true or not, that their owner is legitimate. Google uses a domain's length of registration when indexing and ranking a Web site for inclusion in their organic search results.

So to prove to everyone that your site is the real deal, register for more than one year and increase your chances of boosting your search ranking on Google.

I know registrars always sell bogus "submit your site to the search engines" garbage, but I don't think I have ever seen one recommend registering for extended periods of time because of a Google patent before.

Smart marketing by them, and smart marketing by Google for putting endless amounts of FUD in that patent.
