How To Check for Duplicate Content Penalties on Your Website for SEO

by Dave Crader

Don't be a copy cat

Ever had your content stolen? It recently happened to us and we certainly did not appreciate it. We spend hours writing well-researched original articles because that’s what our readers deserve. You won’t find any rewriting here. The downside to this, as we recently discovered, is becoming a target for copyright infringement. Luckily, we were able to get the copied content removed in this case, but I doubt it will always be so easy. In this post we’ll go over a few options you have if this ever happens to you, and we’ll help you identify duplicate content on your own site that you may not even know existed.

There are a lot of misconceptions about Google’s duplicate content penalty, so let me explain the basics before we dive in.

Duplicate Content Off-Site

A lot of duplicate content is created by scraper websites known as ‘autoblogs.’ Autoblogs are set up by low-life scumbags who don’t have the creative skill to write their own material. These autoblogs are configured to scrape your website and steal your content almost instantly after you’ve posted it. This confuses search engine spiders because they don’t know which site is the source of the original content. If the spiders crawl the autoblog before crawling your website, they may think the autoblog is the original source and you are the copier. If the spiders crawl your content before crawling the autoblog, you should be safe from penalties. It’s not as clear-cut as that, but that’s the general idea. Either way, it’s worth your time to proactively pursue content thieves just to be safe.

1. Contact the Site Owner and Ask for Removal

I was doing some research for our new mobile app service page and stumbled across an article that looked strikingly similar to David’s stellar mobile website blog post from back in June. I ran David’s article through copyscape.com, a free online plagiarism checker, and was surprised to find not just one infringer, but two. I immediately sent a friendly yet direct e-mail to the website owners in hopes of getting the copied content removed.

Original Email

This is one of the e-mails I received back:

Reply Email

I won’t expose this individual since he complied, but it’s interesting that he didn’t know about basic copyright infringement laws. Under U.S. copyright law, an original work is protected from the moment it is fixed in a tangible form, such as a written article or a recording. Copyright protects the way something is expressed, not the underlying idea. It really does not matter if ‘Copyright ©’ appears next to the work or not. ‘Copyright ©’ is just a way to warn people that the work is protected. Like any law, there are various odd exceptions that come into play, but I’ll leave those to the U.S. Copyright Office to explain. The other infringing website owner did not respond to my e-mail, but the page was also removed promptly.

2. Contact the Site’s Host and Ask for Removal

I’ve never had to go straight to the hosting provider, but Google recommends it on its duplicate content help page. I didn’t know hosting providers were required to accommodate such requests, but if they are, it seems like this would be a very effective method.

3. File a Lawsuit

You can always have your lawyers write up an official cease and desist letter if you’d like, but this is usually pretty expensive. I’d go with option one or two before heading down this path.

If you can’t get hold of anyone, you can always file a request for Google to remove the infringing page from its search results. The copied content will remain online, but no one will be able to find it through Google. This should also remove the risk of any duplicate content penalties Google may have assigned to your website.

Duplicate Content On-Site

On-site duplicate content is very common. It’s arguably more dangerous than off-site duplication because precious link juice is being spread thin in multiple directions. Off-site duplication doesn’t split link juice; it just tells search engines not to give any juice to the copied version.

Canonicalization Issues

For example, let’s take a look at Gojo.com. Gojo’s homepage is splitting link juice in six directions, causing canonicalization issues. We know this because each of the following URLs displays the exact same homepage content.

•    http://gojo.com/
•    http://www.gojo.com/default.asp
•    http://www.gojo.com/default.aspx
•    http://gojo.com/default.aspx
•    http://gojo.com/default.asp
•    http://www.gojo.com/

Google gets confused by this because it doesn’t know which URL is the primary version that Gojo would like to rank in search. If Google doesn’t know which version is the primary, it makes a guess based on backlinks and other factors.
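
To make the check above concrete, here is a minimal Python sketch of one way to confirm that several URLs serve the exact same content: hash each page body and group the URLs that share a hash. The function name `group_by_content`, the sample HTML, and the `contact.aspx` URL are all made up for illustration, and the page bodies are hard-coded rather than downloaded. A real check would also need to tolerate small differences such as timestamps or session IDs in the markup.

```python
import hashlib
from collections import defaultdict


def group_by_content(pages):
    """Group URLs whose page bodies are byte-for-byte identical.

    `pages` maps each URL to the HTML body that was fetched for it.
    Returns a list of URL groups; each group has two or more URLs.
    """
    groups = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Only groups with more than one URL indicate duplicate content.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical bodies standing in for real HTTP responses.
homepage = "<html><title>Home</title><body>Welcome</body></html>"
pages = {
    "http://gojo.com/": homepage,
    "http://www.gojo.com/": homepage,
    "http://www.gojo.com/default.aspx": homepage,
    "http://www.gojo.com/contact.aspx": "<html><title>Contact</title></html>",  # made-up page
}

for urls in group_by_content(pages):
    print(urls)
```

Any group this prints is a set of URLs competing with each other for the same content.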

The split has also caused a link equity problem. For example:

•    http://gojo.com/ has 20 backlinks.
•    http://www.gojo.com/ has 891 backlinks.

Google assumes www.gojo.com is the primary version because of the backlinks, but that doesn’t necessarily fix the problem. The company is still missing out on some precious link juice from the 20 backlinks that point to http://gojo.com instead of www.gojo.com. All of these problems can be easily fixed with a 301 redirect or a rel="canonical" attribute specifying one primary version of the URL.

•    A rel="canonical" attribute only tells search engine spiders which URL to treat as the primary version. It doesn’t redirect anyone, so users can still reach every version.
•    A 301 redirect sends both search engine spiders and users to the primary URL.
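
As a rough illustration of both fixes, the Python sketch below maps every homepage variant to one primary URL, then uses that mapping either to decide on a 301 redirect or to build a canonical link tag. It assumes www.gojo.com is the preferred host and that the two default documents are the only homepage aliases. The function names are hypothetical, and in practice the redirect itself is usually configured in the web server or CMS rather than written by hand.

```python
import html
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.gojo.com"  # assumption: the www version is preferred
DEFAULT_DOCS = {"/default.asp", "/default.aspx"}  # paths that serve the homepage


def canonical_url(url):
    """Return the single preferred URL for any known variant."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == "gojo.com":
        host = CANONICAL_HOST
    path = parts.path or "/"
    if path.lower() in DEFAULT_DOCS:
        path = "/"
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


def redirect_for(url):
    """Return (301, target) if the URL should redirect, else None."""
    target = canonical_url(url)
    return None if target == url else (301, target)


def canonical_link_tag(url):
    """Build the <link> element that belongs in the page's <head>."""
    href = html.escape(canonical_url(url), quote=True)
    return '<link rel="canonical" href="%s">' % href


print(redirect_for("http://gojo.com/default.asp"))
print(redirect_for("http://www.gojo.com/"))
print(canonical_link_tag("http://gojo.com/default.aspx"))
```

With the redirect in place, all six variants resolve to a single URL. With the canonical tag, the variants stay reachable but all point search engines at that same URL.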

Duplicate Title Tags

Looking at the title tags of a website is an easy way to detect duplicate content issues. You can find duplicate title tags in the Diagnostic section of Google Webmaster Tools. If you don’t have Google Webmaster Tools set up, you can use Xenu Link Sleuth instead (Mac users will need to use Screaming Frog).

Two pages that have the same title tag often have the same content as well. To fix this, you should apply a 301 redirect or a rel="canonical" attribute to specify a primary version. If one page has accrued more backlinks than the other, I’d recommend choosing the page with more backlinks as the primary.
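
If you’d rather script the title check yourself, the sketch below shows the basic idea in Python: pull the title out of each page and report the titles shared by more than one URL. It uses only the standard library. The `TitleParser` class, the function names, and the example.com pages are all invented for illustration, and the HTML is hard-coded instead of crawled.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside a page's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html_text):
    parser = TitleParser()
    parser.feed(html_text)
    # Collapse runs of whitespace so formatting differences don't matter.
    return " ".join(parser.title.split())


def duplicate_titles(pages):
    """Map each lowercased title used by 2+ URLs to the URLs that share it."""
    by_title = defaultdict(list)
    for url, html_text in pages.items():
        title = extract_title(html_text)
        if title:
            by_title[title.lower()].append(url)
    return {t: sorted(urls) for t, urls in by_title.items() if len(urls) > 1}


# Made-up pages for illustration only.
pages = {
    "http://example.com/widgets": "<html><head><title>Widgets | Example</title></head></html>",
    "http://example.com/widgets?print=1": "<html><head><title>Widgets |  Example</title></head></html>",
    "http://example.com/about": "<html><head><title>About | Example</title></head></html>",
}

for title, urls in duplicate_titles(pages).items():
    print(title, urls)
```

Each title it reports is a candidate for a 301 redirect or a rel="canonical" attribute, once you’ve confirmed the pages really do share the same content.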

Duplicate "Print Friendly View" Pages

A lot of webmasters offer visitors a ‘print friendly view’ of their website’s pages. This is great for users, but bad for search engines because two pages with the exact same content will exist on the website. The rel="canonical" attribute can be used in this situation because it will eliminate the duplicate content issue while still allowing users to access the print friendly page. If a 301 redirect were used, the user would be sent straight back to the page he/she is already on and would never see the print friendly version. This is exactly the kind of situation the rel="canonical" attribute is meant for.

Here at Evolve, we avoid this issue completely by using print friendly style sheets. A print style sheet applies a different set of CSS properties when a browser prints a page. This creates a print friendly version of the page without needing a separate URL. Smashing Magazine has a great tutorial for accomplishing this.

Search engines hate duplicate content because they don’t know which version to show in search results. They’re already dealing with over a trillion web pages - why make their job even harder? If you think you may have some duplicate content issues, just give us a call at 330-331-0211 for some help.

Comments (9) -

seo pune
7/4/2012 1:48:47 AM #

Thanks for sharing this great post. It’s very enlightening. I absolutely love to read informative stuff. Looking forward to find out more and acquire further knowledge from here! Cheers!

Melissa
7/25/2012 12:42:18 PM #

David,

Great stuff! I have encountered a similar situation and plan to reference your article.

Thanks for the guidance :)
Melissa

Christian
9/25/2012 1:03:12 PM #

Thanks for your article. I found that I had an issue with canonical links (just by using a "/" at the end). I was able to eliminate it directly from the configuration panel of my WordPress theme.

John
12/18/2012 5:00:58 AM #

Penalties for duplicate content are very hard to detect, but your article on catching them is up to the mark. Thanks for sharing such a nice article with us.

Kevin
1/27/2013 8:52:43 AM #

It is necessary to avoid practicing the publication of duplicate content on the web space, and it can be done easily by using a plagiarism checker tool that will help you detect whether the content is already published elsewhere or it is absolutely genuine and freshly written. Thanks

hire mobile app developer
4/4/2013 5:43:55 AM #

Different tools are available on the internet; you can use the Screaming Frog tool to check for duplicate content. Google Webmaster Tools also helps in checking for duplicate content.

Alfred
4/22/2013 9:07:49 PM #

Did your content ever recover its Google SERP rankings after you got the copyright infringer to remove their content?  

An infringer's site was larger and had more PR than ours. They put their scraped copy of our page up on 2/5/2013 (according to archive.org). By 3/15, the traffic to our key page was zip from Google. We got them to respond just before sending out a DMCA. While we were looking, we found several other partial and complete scrapes on low PR sites. DMCA notices have had all of those removed except for two partial scrapes in Italy.

Dave Crader
4/23/2013 8:31:34 AM #

Hi Alfred,

Thanks for your comment. In this case, we did not experience a drop in rankings. We have a fairly active blog, so I'm assuming Google crawled our content piece before crawling the copier.

I'm sorry to hear about your situation. Syndication on such a large scale isn't necessarily a bad thing though. On what date did you publish the content on your website?  

Alfred
4/23/2013 10:39:41 AM #

The site has been around since maybe 2004, 2005.  The page in question originated in that time frame, and was moved to a different URL within a year or two.  Panda / Penguin had impacted traffic on the site overall.  The page I'm talking about is a PR4 page (home page is also PR4) that had been on page one of results for the keyword phrase we wanted, but had slowly drifted down.

After reading about underscores vs. dashes, we renamed the page around 10/12/2012 to use dashes, and used a 301 redirect.  That had relatively little effect on the search engines and Google PR translated nicely to the new page.

On 2/5/2013, a PR5 scraped our page in its entirety.  Within 6 weeks, our page was "end of results" / -950.  In diagnosing the problem, we saw other complete page scrapes and partial page scrapes from low PR sites.  We DMCA'd all complete scrapes and they've been down for two weeks.  

Two partial scrapes are in Italy, beyond the legal reach of a DMCA.  Their webmaster didn't respond to our request, but Google did remove AdSense ads from one page we reported to them as a violation of AdSense terms.  We need to report the other.

There is one other partial scrape out there that is heavily rewritten and does send us traffic.  We have not DMCA'd them, although we may still do that.  Archive.org was invaluable in proving our DMCA issues.  Copyscape.com was very helpful, but simply running multiword phrases from our page through Google helped the most.

Google asserts that there is no manual penalty after a reinclusion request. We've deoptimized, disavowed incoming links from a bad neighborhood, resubmitted via Webmaster Tools, and modified our main menu (yesterday) to get our text/code ratio higher. Nothing has budged the page in 6 weeks. Everything there at "end of results" is highly scraped or widely distributed throughout the internet.

Panda #25 is supposed to be the last until Panda is integrated into their core search.  I'm wondering how long that will take, because I suspect our page won't budge until they get that done.  It continues to be #1 on the term on both Bing and Yahoo...





About Evolve Creative Group

Located in Akron, OH, Evolve Creative Group offers a full suite of professional web design and online marketing services.

At Evolve, we take pride in delivering innovative and effective web design & online marketing solutions that ensure an evolutionary change in your business. We also believe that continuing education is an important key element to any business. For this reason, we've created Evolve University. We hope you enjoy our blog!

Award Winning Blog

2013 Communicator Award Winner