URL Canonicalization: The Missing Manual
Canonicalization can be a confusing area for webmasters, so let's take a look at what it is, and ways to avoid it causing problems.
What Is Canonicalization?
Canonicalization is the process by which URLs are standardized. For example, www.acme.com and www.acme.com/ are treated as the same page, even though the syntax of the URL is different.
Why Is Canonicalization An Issue For SEO?
Problems can occur when the search engine doesn't normalize URLs properly.
For example, a search engine might see http://www.acme.com and http://acme.com as different pages. In this instance, the search engine has the host names confused.
Why Is This a Problem?
If a search engine sees a page as being published at many separate URLs, it may rank your pages lower than it would otherwise, or not rank them at all.
Canonicalization issues can split link juice between pages if people link to variants of the URL. Not only does this affect rank (less PageRank = lower rank), but it can also affect crawl depth (if PageRank is spent on duplicate content it is not being spent getting other unique content indexed).
To appreciate what a dramatic effect canonicalization issues can have on search traffic, look at the following example. Here, proper canonicalization increased traffic for that keyword by 300%.
URL | Link Equity | Google Ranking Position | % of Search Traffic | Daily Traffic Volume | Traffic Increase |
split 1 | 60% | 8 | 3% | 50 | - |
split 2 | 40% | - | 0% | 0 | - |
canonical | 100% | 2 | 12% | 200 | 300% |
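The arithmetic behind the table can be checked with a few lines of Python. This is only a sketch using the figures from the table above; the function name `percent_increase` is made up for illustration.

```python
# Daily visits for the keyword, copied from the table above.
split_daily_visits = 50 + 0       # split 1 + split 2
canonical_daily_visits = 200      # single canonical URL


def percent_increase(before, after):
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100


increase = percent_increase(split_daily_visits, canonical_daily_visits)

# Share of the achievable traffic that is lost while link equity is split.
share_missed = (1 - split_daily_visits / canonical_daily_visits) * 100

print(f"Traffic increase after canonicalization: {increase:.0f}%")
print(f"Potential traffic missed while split: {share_missed:.0f}%")
```

Put another way, the split page in this example was collecting only a quarter of the traffic the canonical page receives.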
What Conditions Can Cause This Problem?
There are various conditions, but the following are amongst the most common:
- Different host names, e.g. www.acme.com vs acme.com
- Redirects pointing to different URLs, e.g. a 302 used inappropriately
- Forwarding multiple URLs to the same content, and/or publishing the same content on multiple domains
- Improperly configured dynamic URLs, e.g. any URL rewriting based on changing conditions
- Two index pages appearing in the same location, e.g. Index.htm vs Index.html
- Different protocols, e.g. https://www vs http://www
- Multiple slashes in the file path, e.g. www.acme.com/ vs www.acme.com//
- Scripts that generate alternate URLs for the same content, e.g. some blogging and forum software, or ecommerce software that adds tracking URLs
- Port numbers in the URL, e.g. acme.com:4430, which can sometimes be seen in virtual hosting environments
- Capitalization, e.g. www.acme.com/Index.html vs www.acme.com/index.html
- URLs "built" from the path you take to reach a page, e.g. tracking software may incorporate the click path in the URL for statistical purposes
- Trailing question marks, with or without parameters, e.g. www.acme.com/? or www.acme.com/?source=cnn (a common tagging strategy amongst ad buys)
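To make the list above concrete, here is a minimal Python sketch that collapses several of these variants into one URL. It is an illustration, not a description of how any search engine normalizes URLs. The preferred scheme and host, the list of index file names, and the `source` tracking parameter are all assumptions chosen to match the acme.com examples.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# All four settings below are assumptions for the acme.com example.
CANONICAL_SCHEME = "http"
CANONICAL_HOST = "www.acme.com"
INDEX_PAGES = {"index.htm", "index.html"}
TRACKING_PARAMS = {"source"}


def canonicalize(url):
    """Collapse common variants of an acme.com URL into one canonical form."""
    parts = urlsplit(url)

    # .hostname is lowercased and excludes any port number.
    host = (parts.hostname or "").lower()
    if host in ("acme.com", "www.acme.com"):
        host = CANONICAL_HOST

    # Collapse repeated slashes; an empty path becomes "/".
    path = re.sub(r"/{2,}", "/", parts.path) or "/"

    # Drop a trailing index page, matching the file name case-insensitively.
    head, _, tail = path.rpartition("/")
    if tail.lower() in INDEX_PAGES:
        path = head + "/"

    # Remove tracking parameters; a bare trailing "?" disappears as well.
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in TRACKING_PARAMS]

    return urlunsplit((CANONICAL_SCHEME, host, path, urlencode(kept), ""))


for variant in ("http://acme.com",
                "https://www.acme.com//Index.html?source=cnn",
                "http://acme.com:4430/?"):
    print(variant, "->", canonicalize(variant))
```

The rest of the path is deliberately left in its original case, since many servers treat paths as case-sensitive.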
How Can I Tell If Canonicalization Issues Are Affecting My Site?
Besides manually working through the checklist above, you can also use Google's cache date.
Previously, you would have been able to use Google's supplemental index marker, although Google have recently done away with this feature.
The supplemental index is a secondary index, separate from Google's main index. It is a graveyard, of sorts, containing outdated pages, pages with low trust scores, duplicate content, and other erroneous pages. As duplicate pages often reside in the supplemental index, appearing there can be an indicator that you have canonicalization issues, all else being equal.
Before Google removed the supplemental index label, many SEOs noticed that supplemental pages had an old cache date and that cache date is a good proxy for trust. If your page is not indexed frequently, and you think it should be, chances are the page is residing in the supplemental index.
Michael Gray at Wolf-Howl outlines a method to easily check for this data. In summary, you add a date and a unique field to each page, wait a couple of months, then search on this term.
How Can I Avoid Canonicalization Issues?
Good Site Planning
Using good site planning and architecture, from the start, can save you a lot of problems later on. Pick a convention for linking, and stick with it.
Maintain Consistent Linking Conventions
It's an important point, so I'll repeat it ;) Always link to www.acme.com, rather than sometimes linking to acme.com/index.htm, and sometimes linking to www.acme.com.
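One way to check your own pages against the convention is to scan their links. The sketch below uses only the Python standard library and flags links that point at your site through a non-preferred host name or an explicit index file. The host names and the index file names are assumptions for the acme.com example.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Assumptions for the acme.com example.
CANONICAL_HOST = "www.acme.com"
HOST_VARIANTS = {"acme.com", "www.acme.com"}
INDEX_PAGES = {"index.htm", "index.html"}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def inconsistent_links(html):
    """Return links to our own site that break the linking convention."""
    collector = LinkCollector()
    collector.feed(html)
    flagged = []
    for href in collector.hrefs:
        parts = urlsplit(href)
        host = (parts.hostname or "").lower()
        if host not in HOST_VARIANTS:
            continue  # external link, or a relative link with no host
        filename = parts.path.rsplit("/", 1)[-1].lower()
        if host != CANONICAL_HOST or filename in INDEX_PAGES:
            flagged.append(href)
    return flagged


page = ('<a href="http://www.acme.com/">Home</a> '
        '<a href="http://acme.com/index.htm">Home</a> '
        '<a href="http://example.org/">Elsewhere</a>')
print(inconsistent_links(page))
```

Relative links are skipped here because they carry no host name of their own.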
301 Redirect Non-www To www, Or Vice Versa
You can force resolution to one URL only. To do this, you create a 301 redirect.
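Here's a typical 301 redirect script for an Apache .htaccess file:
# Requires Apache mod_rewrite.
RewriteEngine On
# Match requests whose host name starts with the non-www name (case-insensitive).
RewriteCond %{HTTP_HOST} ^seobook\.com [NC]
# Permanently (301) redirect to the same path on the www host.
RewriteRule ^(.*)$ http://www.seobook.com/$1 [L,R=301]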
For a more detailed analysis on how to use redirects, see .htaccess, 301 Redirects & SEO.
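If it helps to see the logic of that redirect outside of Apache, here is a rough Python equivalent. It is a sketch of the decision the rule makes, not something to deploy; the function name `redirect_for` is made up for illustration, and query strings are left out for brevity.

```python
import re


def redirect_for(host, path):
    """Return (status, location) if the request should be redirected, else None.

    Mirrors the intent of the rewrite rule above: requests whose host name
    starts with seobook.com (any letter case) are sent to the www host.
    """
    if re.match(r"seobook\.com", host, re.IGNORECASE):
        # In .htaccess context the captured path has no leading slash.
        return 301, "http://www.seobook.com/" + path.lstrip("/")
    return None


print(redirect_for("seobook.com", "/blog/"))
print(redirect_for("www.seobook.com", "/blog/"))
```

A request that already uses the www host matches nothing and is served as normal.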
Use The Website Health Check Tool
This tool, and accompanying video, shows you how to spot a number of site architecture problems, including canonicalization issues.
Download the tool, check the www vs non-www option box, and hit the Analyze button.
If you have a large site you may not be able to surface all the canonicalization issues using the default tool settings. You may need to use the date-based filter options to get a deep view of recently indexed pages. Many canonicalization issues occur sitewide, so looking deeply at new pages should help you detect problems.
Another free, but far more time-consuming, option is to use the date-based filters on Google's advanced search page.
Workaround For Https://
Sometimes Google will index both the http:// and the https:// versions of a site.
One way around this is to tell the bots not to index the https:// version.
Tony Spencer outlines two ways to do this in .htaccess, 301 Redirects & SEO. One is to cloak the robots.txt file; the other is to create a conditional PHP script.
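The idea behind the conditional script can be sketched in a few lines. This is not Tony Spencer's code, just an illustration of the approach in Python: serve a robots.txt that blocks everything when the request arrives over https, and an open one otherwise. How you detect the scheme depends on your server setup, so it is passed in as a plain argument here.

```python
# robots.txt body that lets crawlers fetch everything.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# robots.txt body that asks crawlers to stay away from every URL.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def robots_txt_for(scheme):
    """Pick the robots.txt body to serve for the given request scheme."""
    return BLOCK_ALL if scheme.lower() == "https" else ALLOW_ALL


print(robots_txt_for("http"))
print(robots_txt_for("https"))
```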
Use Absolute, As Opposed To Relative Links
An absolute link specifies the exact location of a file on a webserver. For example, http://www.acme.com/filename.html
A relative link is, as the name suggests, relative to a page's location on the server.
A relative link looks like this:
"/directory/filename.htm"
There are various issues to consider, not related to canonicalization, when deciding which format to use. These include page download speed, server access times, and design conventions. The point to remember is to remain consistent. Absolute links tend to make doing so easier, as there is only ever one URL format for a file, regardless of context.
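A short Python illustration of why this matters: a relative link inherits the protocol and host name of whatever page it appears on, so if a page is reachable at a non-preferred URL, its relative links carry that variant forward. An absolute link always resolves to the same place. The acme.com page URLs below are just examples.

```python
from urllib.parse import urljoin

relative_link = "/directory/filename.htm"
absolute_link = "http://www.acme.com/directory/filename.htm"

# The same page, reached through three different URL variants.
page_urls = (
    "http://www.acme.com/about.htm",
    "http://acme.com/about.htm",
    "https://www.acme.com/about.htm",
)

for page_url in page_urls:
    print(page_url)
    print("  relative ->", urljoin(page_url, relative_link))
    print("  absolute ->", urljoin(page_url, absolute_link))
```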
Don't Link To Multiple Versions Of The Page
In some cases, you may intend to have duplicate content on your site.
For example, some software, such as blog and forum software, aggregates posts into archives. Always link to the original version of the post, as opposed to the archive or any other location, e.g. www.acme.com/todays-post.htm, not www.acme.com/archive/december/todays-post.htm.
If your software program links to a duplicate version of the content (like an individual post from a forum thread) consider adding rel=nofollow to those links.
Use 301s, Not 302s, On Internal Affiliate Redirects
A 301 redirect is a permanent redirect, which indicates a page has been moved permanently. 301s typically pass PageRank, and do not cause canonicalization issues.
A 302 redirect is a temporary redirect. If you use 302s the wrong page may rank. Google's Matt Cutts claims they are trying to fix the problem:
we’ve changed our heuristics to make showing the source url for 302 redirects much more rare. We are moving to a framework for handling redirects in which we will almost always show the destination url. Yahoo handles 302 redirects by usually showing the destination url, and we are in the middle of transitioning to a similar set of heuristics. Note that Yahoo reserves the right to have exceptions on redirect handling, and Google does too. Based on our analysis, we will show the source url for a 302 redirect less than half a percent of the time (basically, when we have strong reason to think the source url is correct)
But if you use 302s on affiliate links, the affiliate page may rank in the search results, as happened in a search for SnapNames. This, in turn, would credit the affiliate with a commission any time someone buys through that link in the search results, effectively cutting the margins of the end merchant.
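If you keep a record of how your internal redirects respond, a few lines of Python can pick out the temporary ones. This is a sketch only: the URLs and the `observed` data are made up, and in practice you would fill that dictionary from your own crawl or server configuration.

```python
# Status codes that signal a temporary rather than a permanent move.
TEMPORARY_REDIRECTS = {302, 303, 307}


def temporary_redirects(observed):
    """Return the URLs whose recorded redirect status is temporary.

    `observed` maps each redirecting URL to a (status, location) pair.
    """
    return sorted(url for url, (status, _location) in observed.items()
                  if status in TEMPORARY_REDIRECTS)


# Hypothetical sample data for illustration.
observed = {
    "http://www.acme.com/go/partner-a": (302, "http://partner-a.example/?aff=1"),
    "http://www.acme.com/old-page.htm": (301, "http://www.acme.com/new-page.htm"),
}

print(temporary_redirects(observed))
```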
Specify Preferred URLs In Google Webmaster Tools
Google Webmaster Tools provides an area where you can specify which version of the URL, e.g. http://www.acme.com or http://acme.com, Google should use.
Note: It is important not to use the remove URL tool to try and fix these domain issues. Doing so may result in your entire domain, as opposed to one page, being removed from the index.
Further Reading
- Matt Cutts SEO Advice On URL Canonicalization Issues
- SEO as International Minutia Dealer - A look at trailing slashes, alternate home pages, and how Google changes how they handle canonicalization over time.
- SEO Quiz - Do Search Engines Consider The Trailing Slash?
- Website Health Check Tool - Includes a video on how to use the SEO Health Check Tool to spot canonical URL issues.
- Google Date Based Search Filters - Refer to Michael Gray's technique for monitoring the Google cache.
Comments
Ah, thanks for the post, Aaron. Now I can use this to show my bosses that they need to fix the canonicalization problem they've been having for years now. :)
Am I missing something? I'm logged in and can't seem to see the download link. Help!
Are you a paying subscriber comacow?
Aaron, I have a few questions.
Can 301 redirect be also used when we are moving entire site to a new domain?
1. Will the new site get the same rank in SERPs?
2. Will the new site get the same PageRank (PR)?
The answer is not black and white, but the answer should usually be yes...or something close to it. The new site should inherit most (if not all) of the trust and rank of the old site.
There are lots of caveats and details though, and I don't want to write a 6 page blog comment. ;)
Maybe I am completely ignorant here, but I can't see a reason that this would be a problem. I understand that you said that the SEs are ranking the pages lower, I am assuming that this is because internal PR is spread thinner across a domain, but does that really influence the SERPs enough to be concerned?
The reason I bring this up is because if you look at the example for snapnames you see that the site ranks three times in the same SERP. Now of course I understand that with this example we are using a unique brand term and anything after the first spot is kinda pointless. But lets say you are ranking twice on the same page for the term "LCD Monitor" it seems that you are getting twice as much exposure. Why try to stop that?
I am not trying to argue or disagree just asking for mine and others benefit.
For a brand related search, it is easy for the official site to rank. Navigational search is *very* easy to do.
But even within that realm of navigational search, the issue with the Snapnames rankings was that if I clicked the affiliate link someone *other than Snapnames* was getting a commission for that sale. Who wants to create a brand then arbitrarily and unnecessarily pay affiliates for every click on their branded site?
Where ranking gets harder is for generic, non-brand-specific phrases, and for other people's brands.
And yes splitting link equity can cause what would have been a #2 ranking to be a #8 ranking, and cause that page to miss out on 75% of its potential traffic for some competitive keywords.
Aaron, great post. Between you and g1smd, there should never be any issues ever again, but alas, the web is riddled with them.
typo error in your post:
For example, a search engine might see http://www.acme.com and http://aceme.com as different pages.
One would think that the search engines would hopefully see those at 2 different domains, not just pages.
:)
Thanks Jake. We fixed it :)
Trailing questions marks... how to fix these? htaccess/301s?
I think .htaccess is probably the best way to do it, but I don't have the code to do it.