What is duplicate content and how to avoid it?

Have you ever read an article on the internet and had that eerie feeling you’ve seen it somewhere before? 

If so, you’re not alone. The internet is chock-full of duplicate and boilerplate content: material closely resembling what appears on other pages.

The internet is also full of so-called “accidental” duplicate content. It’s common for websites to replicate the content of other sites through a variety of technical mechanisms. Unless you know what counts as duplication, you could be hurting your website’s search engine ranking.

Google cares about duplicate content. It’s not interested in filtering customers through to a bunch of copycat pages. It wants to send people to the original source of information. Only the pages that offer unique value will do well in search results. 

So with that said, what kinds of duplicate content are out there? And how do you avoid them? 

Deliberate duplication or “thin” content

Google knows that there is a lot of accidental duplicate content out there. It is common for many web pages to contain content duplicated from other pages on the internet without any malicious intent whatsoever. 

Google, therefore, cares a lot about whether you intended to copy or not. It wants to skip past pages with accidental duplication and focus on sites that try to game the system. Deliberate copying, for Google, is the real crime.

It’s crucial, therefore, for any organisation or entity with a website to avoid copying. Google doesn’t just dislike copying; it will actively attempt to filter your page of copied content from its search results. It is better not to produce any content at all than to create identical content, or a version that is very similar to pages already out there. Intent matters a lot.

Copied content

Google can never know for sure whether you deliberately copied or not. But it has systems in place that figure out, with surprising accuracy, whether you’ve done the dirty.

When it comes to copied content, there are three varieties you need to avoid. The first is outright copying. You see some text you like on another domain, and you cut and paste it onto your site. It doesn’t matter if you cut and paste snippets from different pages around the web; the effect is the same. Google will find out and penalise you. 

Some copiers take a more sophisticated approach. Instead of copying outright, they change the odd word here and there, hoping that Google won’t notice. Replacing words in the text with synonyms doesn’t work either. The search giant’s bots can sniff this out pretty quickly. 

The final technique is to copy “dynamic content” or content that changes regularly, like newsfeeds and search results. This is common when syndicating content. Google might not find an exact match, but it can weed out content with similar patterns. Again, if it thinks that the material was a cut and paste job, then it will penalise. 
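Google’s actual detection systems are not public, but the general idea behind spotting near-duplicates can be illustrated with a toy example. The sketch below (the sample texts and threshold are purely illustrative, not Google’s method) compares two texts by their overlapping word “shingles”: swapping in the odd synonym still leaves many multi-word sequences intact, while genuinely different text shares almost none.

```python
# Toy near-duplicate detector using word shingles and Jaccard similarity.
# Illustration only -- real search-engine systems are far more sophisticated.

def shingles(text, w=3):
    """Return the set of w-word shingles (consecutive word runs) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def similarity(a, b, w=3):
    """Jaccard similarity between the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

original = "duplicate content can hurt your search engine ranking badly"
swapped = "duplicate content can damage your search engine ranking badly"
unrelated = "a short guide to growing tomatoes in a small garden"

# One synonym swap still leaves shared shingles; unrelated text scores near zero.
print(similarity(original, swapped))
print(similarity(original, unrelated))
```

A single changed word only breaks the shingles that contain it, which is why synonym-swapping leaves a strong statistical fingerprint.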

Scraped content

Google defines scraped content in much the same way as it does copied content. It’s content you take from another, more reputable site. 

Webmasters like to do this, according to the search giant, because they believe it will help increase traffic. The more they display the high-quality content of another website, the higher their user interaction. 

Google wants to prevent this from happening. Not only does scraped content offer users little additional value, but it may also constitute copyright infringement. 

The same rules as above apply. Website content should be original. You should not copy it from third party sites. You should not use synonyms in place of the original content. And you should avoid reproducing the dynamic content of other websites without added benefits for users. 

Google doesn’t even like it if you embed too much content on your pages (such as videos or images) without adding your spin. It wants to see that you’re providing value whenever you use the assets of another site. 

Protect your content against being copied and scraped

Whilst you can’t stop someone copying or scraping your content, you can protect yourself from the harm it may cause.

Add both the canonical tag and a link within the content that points back to your original article (the canonical page). The canonical tag goes in the head section of the HTML:

<link rel="canonical" href="http://example.com/page/" />

If your content does get copied, the chances are that no one will remove the in-content link, and Google will soon work out which version is the original.
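As a rough illustration of wiring this into a page template, a small helper like the one below (a hypothetical function, not part of any particular CMS) could render the tag for whatever URL you treat as the original:

```python
# Minimal helper to render a canonical link tag for a page template.
# Pass in whatever URL you treat as the page's original address.
import html

def canonical_tag(url):
    """Return a <link rel="canonical"> element for the given URL."""
    return '<link rel="canonical" href="%s" />' % html.escape(url, quote=True)

print(canonical_tag("http://example.com/page/"))
# -> <link rel="canonical" href="http://example.com/page/" />
```

Escaping the URL keeps the attribute well-formed even if the address contains characters like ampersands.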

Boilerplate content

Boilerplate content is different from both scraped and copied content (which are mostly the same thing). Many websites, especially those offering services, need similar content on their pages to inform users. A dental practice, for instance, requires a bunch of pages on the different procedures that it offers. 

Boilerplate content also includes things like legal disclaimers repeated over and over, or footer content. It’s stuff that you’ll find repeated across pages for legal reasons or the convenience of users. (It would be annoying to have to go to the homepage every time you wanted to consume footer content). 

There are also other helpful ways to use boilerplate content. To make life easy for users, many service- or product-oriented websites use template content. They create a framework and then populate it with specific details. It’s a time-saving device which does little to detract from the user experience. 

Google says that it recognises the value of boilerplate content. It can understand how some varieties of duplicate content might be beneficial to its users. However, what the company says and what it actually does can differ markedly. 

The problem with boilerplate content is that it creates “semantic noise.” All that additional verbiage confuses search engines and can cause them to skip over some pages. Even Google’s own webmaster guidelines state that you should minimise boilerplate content where possible.

So how do you avoid it? Sometimes, you can’t. Footer content is there for a reason, and the vast majority of successful websites have it. 

But if you do have boilerplate content of more than a few sentences, then it’s usually better to link. Instead of including your terms and conditions at the end of each page, provide a “Terms & Conditions” link. This way, you avoid repetition while also providing customers value. 

A note on thin content

The term “thin” content might seem a little strange. But once you understand the concept, you’ll soon get the metaphor. 

Thin content is content which does little to inform the user or provide them with the knowledge they want. 

When you clicked this article, you came intending to find out what duplicate content is and how to avoid it. If this were thin content, it wouldn’t give you satisfactory answers. Sure, it might mention the keywords, but it would be hard to read, and you would come away from it not having learnt much. 

It would be as if the person writing it had done little research and wasn’t interested in informing you. All they care about is hitting a word count target or mentioning a specific keyword over and over. Your needs don’t matter one bit.

Again, Google wants to avoid funnelling users through to these types of pages. If your page mentions all the keywords but fails to add value to users, the search giant will punish you. 

Accidental duplicate content

Google wants you to believe that it can distinguish between deliberately and accidentally duplicated content. And, for the most part, the web giant does well. The problem, however, is that it can become confused in some situations, making avoiding duplication a priority. 

WWW versus non-WWW

Your website might have two URLs: http://www.example.com and http://example.com without the “www” prefix. The sites are precisely the same but at different addresses. Google says that that’s okay and that it has systems in place to prevent you from being penalised. But, in general, you’ll still want to avoid this kind of duplication. Google could make a mistake. 

The same concept applies to “https” and “http” websites. You could have identical copies of your site at these different addresses, interfering with one another.

So what can you do? 

One solution is to use a 301 redirect. This allows you to redirect content to your primary or “original” page. If you are using a content management system such as a self-hosted WordPress install, you can set this in the admin under Settings, General.

[Screenshot: WordPress site-wide 301 redirect setting]

The benefits of this are twofold. Not only do you prevent identical pages from competing with each other for views, but you also help to generate higher relevancy. People are likely to spend more time on your preferred URLs overall, boosting their rank. 

Alternatively, you can use the .htaccess file to redirect either non-www to www or vice versa.

To redirect to www, the code to insert is:

<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{HTTP_HOST} ^domain\.com [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [L,R=301]
</IfModule>

Or, to redirect to non-www:

<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{HTTP_HOST} ^www\.domain\.com [NC]
RewriteRule ^(.*)$ http://domain.com/$1 [L,R=301]
</IfModule> 
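If you want to sanity-check what these rules do before deploying them, you can model them outside Apache. The sketch below (with domain.com standing in for your own domain) mimics the behaviour of the two rewrite rules in Python:

```python
# Model of the two .htaccess rule sets above: given a Host header and a
# request path, compute where the redirect (if any) would send the visitor.
# "domain.com" is a placeholder -- substitute your own domain.
import re

def redirect_to_www(host, path):
    """Mimic: RewriteCond ^domain\\.com [NC] -> http://www.domain.com/$1"""
    if re.match(r"^domain\.com", host, re.IGNORECASE):
        return "http://www.domain.com/" + path.lstrip("/")
    return None  # no redirect; request served as-is

def redirect_to_non_www(host, path):
    """Mimic: RewriteCond ^www\\.domain\\.com [NC] -> http://domain.com/$1"""
    if re.match(r"^www\.domain\.com", host, re.IGNORECASE):
        return "http://domain.com/" + path.lstrip("/")
    return None

print(redirect_to_www("domain.com", "/about/"))      # http://www.domain.com/about/
print(redirect_to_www("www.domain.com", "/about/"))  # None (already canonical)
```

Whichever direction you choose, use only one of the two rule sets, or requests will bounce between the two hostnames in a redirect loop.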

You can also prevent content from being listed on multiple pages in Google by using the canonical URL, as previously mentioned. Using the rel="canonical" attribute, you tell Google that the ranking metrics of a page should apply to the original, not the copy.

All of the factors that could boost the rank of the copies of your core pages will apply to the original. You can, therefore, keep separate domains and use traffic to promote the core page. 

The easy way to do this is to use the canonical code on every page and blog post that you produce. When you add the relevant code to the page, you help boost the ranking power of the original, unique page in a similar way to a 301 redirect. 

<link rel="canonical" href="http://example.com/page/" />

Index.html and domain root

Let’s say you have two pages: www.mysite.com/ and www.mysite.com/index.html. Both pages are the same, but unless you tell Google otherwise, it’ll track them separately. 

Again, Google will consider that you have the same content on multiple URLs.

The solution is to use the .htaccess file again, this time with the following code:

RewriteEngine On
RewriteCond %{THE_REQUEST} /index\.html [NC]
RewriteRule ^(.*/)?index\.html$ /$1 [R=301,L]

This will result in a 301 redirect from index.html to the site root. Don’t forget to ensure that your internal links reflect how you want the home page to be displayed.
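To see exactly which paths a rule like this catches, here is a small Python model of the pattern ^(.*/)?index\.html$ (an approximation of Apache’s per-directory matching, for illustration only):

```python
# Model of the index.html rule: the regex ^(.*/)?index\.html$ captures any
# directory prefix and redirects to it, dropping the index.html filename.
import re

def strip_index(path):
    """Return the redirect target for a path, or None if the rule doesn't fire."""
    m = re.match(r"^(.*/)?index\.html$", path)
    if m:
        return "/" + (m.group(1) or "")
    return None

print(strip_index("index.html"))       # "/"
print(strip_index("blog/index.html"))  # "/blog/"
print(strip_index("blog/post.html"))   # None -- ordinary pages are untouched
```

The optional `(.*/)?` group is what makes the rule work at any depth of the site, not just the root.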

The problem of session IDs and dynamic URLs

When an ecommerce website wants to track the activities of a specific user, it often creates a session ID. It does this for all kinds of reasons with the main being that it can track customers as they move around the site. Online retailers want to know which products interest customers and how they make their journey to a sale. 

The problem with session IDs is that they rely on the creation of unique user URLs for each page visit. 

So, for instance, you might have a session ID for user one which is

www.mysite.com/products/37485j4383j

And another session ID on the same “products” page listed as 

www.mysite.com/products/34947302fn38

The problem with this is that Google’s bots see these two pages as different, even though the content is identical. When Google tries to index and rank the pages, it runs into a problem: your site appears to have millions of duplicate pages. 

Amazon, for example, makes only small changes to its session IDs.

The solution, again, is to implement canonical URLs. That way, no matter what session ID is added to the URL, Google will usually index the canonical page.
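As a sketch of how that canonicalization might be computed server-side, the function below maps the hypothetical session URLs above back to a single products-page URL. The parsing is specific to this made-up URL shape and would need adapting to a real site’s structure:

```python
# Sketch: map session-specific product URLs back to one canonical page URL,
# so every variant can emit the same rel=canonical tag. The URL shape here
# follows the hypothetical examples above; adapt the parsing for your site.
from urllib.parse import urlsplit

def canonical_product_url(url):
    """Drop any session-ID segments after /products/ plus any query string."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if "products" in segments:
        # Keep everything up to and including "products"; the rest is session noise.
        segments = segments[: segments.index("products") + 1]
    return "http://%s/%s/" % (parts.netloc, "/".join(segments))

print(canonical_product_url("http://www.mysite.com/products/37485j4383j"))
print(canonical_product_url("http://www.mysite.com/products/34947302fn38"))
# Both print: http://www.mysite.com/products/
```

Because both session URLs resolve to the same canonical address, Google can consolidate their ranking signals onto one page instead of seeing millions of duplicates.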

Conclusion

Google is trying to get better at determining what duplicate content is, where it is appropriate, and where it is not. Sometimes the content isn’t genuinely duplicated at all, such as in the case of dynamic URLs – it is purely a limitation of current search technology. At other times, the duplication is necessary – such as copyright information in a page footer – but confuses Google. 

Most of the time, you can resolve issues by using either a 301 redirect or the rel=canonical command. Both essentially do the same thing but in different ways. They inform Google of the page you intend to be the original. 

Get A Free SEO Video Audit

Get a FREE, no-obligation SEO audit video direct to your inbox and discover how to improve your website’s position in Google!
