What is duplicate content? • Yoast

You’ve probably come across the term duplicate content quite a lot, but what is it? Duplicate content is content that lives in several locations, i.e., at several URLs. Duplicate content can harm your rankings, and many people say that copious amounts of it can even lead to a penalty by Google. That’s not true, though. There is no duplicate content penalty, but having loads of duplicate or copied content can still hurt your rankings.

What does duplicate content mean?

Duplicate content is all content that is available in multiple locations on or off your site. It often lives on a different URL and sometimes even on a different domain. Most duplicate content happens accidentally or is the result of a sub-par technical implementation. For instance, your site could be available on both www and non-www, or on both HTTP and HTTPS, or on all of them at the same time (the horror!). Or maybe your CMS uses excessive dynamic URL parameters that confuse search engines. Even your AMP pages could count as duplicate content if not linked properly. Duplicate content is everywhere.
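To make the www/non-www, HTTP/HTTPS and URL parameter problem concrete, here is a minimal Python sketch that collapses such variants into one preferred URL. The host www.example.com, the preference for HTTPS and www, and the list of tracking parameters are all assumptions made up for this example; your own site may need different rules.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumptions for this sketch: these parameters only track visits and
# never change what the page shows.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def _strip_www(host: str) -> str:
    """Return the host without a leading 'www.'."""
    return host[4:] if host.startswith("www.") else host


def normalize_url(url: str, preferred_host: str = "www.example.com") -> str:
    """Collapse common URL variants into one preferred form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Map both the www and the non-www host onto the preferred host.
    if _strip_www(host) == _strip_www(preferred_host):
        host = preferred_host
    # Drop tracking parameters and sort the rest so their order is irrelevant.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    )
    path = parts.path or "/"
    # Always emit HTTPS and discard any fragment.
    return urlunsplit(("https", host, path, urlencode(query), ""))


variants = [
    "http://example.com/page?utm_source=newsletter",
    "https://www.example.com/page",
    "http://www.example.com/page?sessionid=1",
    "https://example.com/page",
]
# All four variants collapse into a single URL.
print({normalize_url(u) for u in variants})
```

Four URLs that a search engine could treat as four pages end up as one. That one URL is the one you want your redirects and canonical tags to point to.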

Google’s definition of duplicate content is as follows:

“Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.”

That last part is important. If you scrape, copy and spin existing content (Google calls this copied content) with the intention of deceiving the search engine to get a higher ranking, you will be on dangerous ground.

Google says this type of malicious intent might trigger an action:

“Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results”

Michiel has some great tips for discovering duplicate content on your site: DIY Duplicate content check. Google’s documentation is also a goldmine for working with duplicate content.

Duplicate content vs. copied content vs. thin content

The topic of duplicate content confuses a lot of people. For Google, most duplicate content has a technical origin (“I have two URLs for the same article, which one should I choose?”), but it will also look at the content itself. Most regular people, on the other hand, will probably think of pieces of similar content that appear elsewhere on a site: “I have used this piece of text in several other places, is that bad?” This is all duplicate content, but for determining rankings, search engines make a distinction between duplicate content, copied content and thin content.

Your duplicate content might be classified as copied content if you take an existing text and quickly rehash it for reuse on your site. It doesn’t matter if you give it a little spin or put in a few keywords; this behavior is not acceptable. Throw in a couple of thin content pages (pages that have little to no quality content) and you’re in dangerous territory. Site quality is an issue and these tactics can bring serious harm to your site. Remember Panda?

Don’t block duplicate content on your site

Google is pretty adept at discovering and handling duplicate content. The search engine is smart enough to figure out what to do with most of the duplicate content it finds. If it finds multiple versions of a page, it will fold these into the version it finds best; in most cases, this will be the original article/page. What it does need, though, is complete access to these URLs. If you block Googlebot in your robots.txt from crawling these URLs, it cannot figure these things out by itself and you run the risk of Google treating these pages as separate instances. Here are a couple of things you should do:

Allow robots to crawl these URLs
Mark the content as duplicate by using rel=canonical (read more about this below)
Use Google’s URL Parameter Handling tool to determine how parameters should be handled
Use 301 redirects to send users and crawlers to the canonical URL
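To check the first point on this list, you could test your duplicate URLs against your robots.txt rules. The sketch below uses Python's standard urllib.robotparser; the robots.txt rules and the URLs are invented for the example. The standard-library parser may not match Googlebot's handling in every detail (wildcard patterns, for instance), so treat the result as a first indication only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that blocks the print versions of pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def blocked_urls(robots_txt: str, urls: list[str], user_agent: str = "Googlebot") -> list[str]:
    """Return the URLs that the given robots.txt rules keep the crawler away from."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


duplicates = [
    "https://www.example.com/article",
    "https://www.example.com/print/article",
]
# Any URL listed here cannot be crawled, so a crawler never sees its canonical tag.
print(blocked_urls(ROBOTS_TXT, duplicates))
```

If a duplicate URL shows up in the result, consider removing the Disallow rule and pointing that URL to the original with rel=canonical or a 301 redirect instead.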

There’s more you can do to fight duplicate content on your site as Joost describes in his article on duplicate content: causes and solutions.

Use rel=canonical!

One of the essential tools in your duplicate content fighting toolkit is rel="canonical". You can use this piece of code to indicate what the original URL of a piece of content is, something we call the canonical URL. We have an excellent ultimate guide to rel="canonical" that shows you everything there is to know about it.
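In practice, the canonical URL is declared with a link element in the head of a page. As a rough illustration, the Python sketch below reads that element from a piece of HTML so you can verify which URL a page claims as its original. The sample markup and the example.com URLs are made up for this example.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values, so split before checking.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html: str):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical print version of an article that points back to the original.
page = """
<html>
  <head>
    <title>Article (print version)</title>
    <link rel="canonical" href="https://www.example.com/article">
  </head>
  <body>...</body>
</html>
"""
print(find_canonical(page))
```

Running a check like this over your duplicate URLs shows quickly whether each one points to the version you want to rank.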

Focus on original, fresh and authoritative content

Another tool in your arsenal to fight duplicate, copied and unoriginal content is your writing skill. Google is focused on quality. It is always on the lookout for the best possible piece of content that fits the user’s intent. Your goal should not be to make a quick buck but to leave a lasting impression. Watch out for thin content and make sure your content is original and of high quality.

The same goes for similar content on your site. We’ve talked about keyword cannibalization before and this is an extension of that. Folding several comparable posts into one can achieve much better results, both in terms of rankings and in fighting duplicate content.

Here’s Google’s take on similar content:

“Minimize similar content: If you have many pages that are similar, consider expanding each page or consolidating the pages into one. For instance, if you have a travel site with separate pages for two cities, but the same information on both pages, you could either merge the pages into one page about both cities or you could expand each page to contain unique content about each city.”
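If you want a rough idea of which of your own pages are similar enough to merge or expand, you could compare their text. The Python sketch below uses word shingles and Jaccard similarity, a common near-duplicate heuristic. It is not how Google measures similarity, and the shingle size and sample texts are arbitrary choices for this example.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Split text into overlapping word groups ('shingles') of the given size."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of two texts: 0.0 means no overlap, 1.0 means identical shingles."""
    a, b = shingles(text_a), shingles(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical city pages that share most of their copy.
amsterdam = "Book your hotel early and enjoy the best museums and canals in Amsterdam"
utrecht = "Book your hotel early and enjoy the best museums and canals in Utrecht"
print(round(similarity(amsterdam, utrecht), 2))
```

A score close to 1.0 suggests two pages say nearly the same thing, which makes them candidates for consolidating or for adding unique content to each.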

Duplicate content is everywhere — know what to do about it

Ex-Googler Matt Cutts once famously said that 20% to 30% of the web consists of duplicate content. While I’m not sure these numbers are still accurate, duplicate content continues to pop up on every site. This doesn’t have to be bad news. Fix what you can, and don’t try to turn duplicate content and its siblings, copied content and thin content, into a viable SEO strategy.
