Duplicate content test

What is it?

Identifies pages that serve identical content from different URLs.

Why it matters

Consider a website which returns an identical page for each of these addresses:

  • http://example.com/
  • http://www.example.com/
  • http://example.com/index.html

This is a problem because – to a computer – these are completely different URLs, and any ‘value’ given to these URLs is divided between these copies.

It would be as if a company had 3 phone numbers that performed an identical function, and tried to market all 3 for people to remember; their efforts would be diluted. With webpages, search engines like Google treat duplicate content as a dilution of a page's value. A core tenet of SEO is the removal or proper markup of duplicate content.

How it works

Insites compares the visible content of every page, discounting some ‘noise’ which could cause identical pages to appear differently.

Where duplicate pages are found, Insites considers the one with the shortest URL to be the original version of the page. If the URL lengths are identical, the first page it encountered takes precedence.
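Insites' exact comparison rules aren't documented here, but the approach above can be sketched as follows. This is a hypothetical illustration, not the actual implementation: the normalisation step (collapsing whitespace and lowercasing) stands in for whatever 'noise' filtering Insites applies.

```python
import hashlib
import re

def normalize(visible_text):
    """Discount 'noise' that could make identical pages appear different.
    (Hypothetical rule: collapse whitespace and lowercase the text.)"""
    return re.sub(r"\s+", " ", visible_text).strip().lower()

def find_duplicates(pages):
    """pages: list of (url, visible_text) tuples in crawl order.
    Returns {original_url: [duplicate_urls]}. The original is the page
    with the shortest URL; on a tie, the first page crawled wins."""
    by_hash = {}
    for url, text in pages:
        digest = hashlib.sha256(normalize(text).encode()).hexdigest()
        by_hash.setdefault(digest, []).append(url)

    duplicates = {}
    for urls in by_hash.values():
        if len(urls) > 1:
            # min() is stable, so among equal-length URLs the
            # first-crawled one is kept as the original
            original = min(urls, key=len)
            duplicates[original] = [u for u in urls if u != original]
    return duplicates
```

For example, `http://example.com/` and `http://example.com/index.html` serving the same text would be grouped together, with the shorter URL reported as the original.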

If a page defines a canonical URL, that URL is treated as the page's true address; if the page at that canonical URL has already been downloaded, the duplicate copy is ignored. This means you can 'fix' duplicate content by adding canonical tags, as recommended by Google.
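A canonical tag is a single line in the page's `<head>`. For example, if `example.com/?session=1234` and `example.com/?session=2345` serve the same page, both could declare the parameter-free URL as canonical (the URL below is illustrative):

```html
<head>
  <!-- Declares example.com/ as the preferred URL for this content -->
  <link rel="canonical" href="https://example.com/">
</head>
```

Search engines then consolidate the value of all copies onto the canonical URL.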

Duplicate pages are purposely excluded from most tests, and do not count towards the maximum number of pages that Insites will test. This means, for example, that duplicate pages are not spell checked; if they were, they would needlessly report duplicate results.

How to use it

The first tab of this test lists duplicate pages. For each page, a Copies count shows how many instances of that page exist. You may click on this count to list all of the copies of that specific page.

How to fix it

There are several ways to address duplicate content, depending on the type of duplication:

  • For content duplicated across www and non-www prefixed URLs, e.g. example.com and www.example.com – these should be addressed by a site-wide redirection, e.g. any visitor going to www.example.com/foo should be taken to example.com/foo automatically. This is normally a relatively simple server configuration change and will halve your duplicated content.
  • For content duplicated across http and https URLs, e.g. http://example.com/ and https://example.com/ – these should also be addressed by a site-wide redirection. The HTTPS (secure) version of the page should be the preferred version.
  • For content duplicated by a trailing slash, e.g. example.com/foo and example.com/foo/ – these should also be addressed by a site-wide redirection. You can choose either form, but pick one and redirect the other to it consistently.
  • For content duplicated by an optional filename, e.g. example.com/ and example.com/index.html – you can either configure a site-wide redirection as above, or update all of your existing links to point to the shorter URL form.
  • For content duplicated by query parameter or a dynamic script, e.g. example.com/?session=1234 and example.com/?session=2345 – you should add a canonical tag to the affected pages.
  • For any other duplicated content you should consider either adding a 301 redirection or setting a canonical tag.
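The site-wide redirections above can often be handled in a few lines of server configuration. As a rough sketch (assuming nginx and the hypothetical domain example.com; Apache, IIS and others have equivalents):

```nginx
# Redirect the www host (http or https) to the non-www https host.
server {
    listen 80;
    listen 443 ssl;
    server_name www.example.com;
    return 301 https://example.com$request_uri;
}

# Redirect plain http on the preferred host to https.
server {
    listen 80;
    server_name example.com;
    return 301 https://example.com$request_uri;
}
```

A 301 (permanent) redirect tells search engines to transfer the value of the old URL to the new one, which is why it is preferred over a temporary (302) redirect for fixing duplicate content.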