
Want to catch people stealing your website content? Use an RSS honeypot

By Robert Niles
Published: February 14, 2013 at 12:19 PM (MST)
One of the annoyances of publishing original content online is when Google returns a copy of your work from another website ahead of your original in its search results pages. Google's traffic is money for website publishers: more readers can mean more ad revenue. So it's frustrating to see Google deliver that potentially valuable traffic to another website that's stolen your content.

I'm not talking about sites that reference your work, summarizing it and providing a link back to the original. That's fair use, and I welcome the traffic from those sources, too. Heck, if people want to quote an entire article from me, then supplement it with their own original reporting and/or commentary, I probably wouldn't mind. It's the automated copying and reposting of content that really bothers me. So a while back, I decided to do something: I programmed the script that generates the RSS feed of my articles on ThemeParkInsider.com to append a line to the end of each RSS feed article entry:

This article originally appeared at [article URL]. All rights reserved. If you are not reading this on a personal RSS reader (such as Feedburner) or on [website domain], you are reading a scraper website that has illegally copied and stolen [website domain]'s content. Please visit [article URL] for the original version, along with all its comments.
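For the curious, the feed change is essentially a one-line concatenation. Here's a minimal sketch of the idea in Python -- not my actual feed script, just an illustration, with assumed names (SITE_DOMAIN, rss_item) and plain-string XML building standing in for whatever your feed generator actually does:

```python
from xml.sax.saxutils import escape

# Assumed constant -- substitute your own domain.
SITE_DOMAIN = "themeparkinsider.com"

# The honeypot disclaimer, templated on the article URL and domain.
HONEYPOT = (
    "This article originally appeared at {url}. All rights reserved. "
    "If you are not reading this on a personal RSS reader or on "
    "{domain}, you are reading a scraper website that has illegally "
    "copied and stolen {domain}'s content. Please visit {url} for "
    "the original version, along with all its comments."
)

def rss_item(title, url, description):
    """Render one RSS <item>, appending the honeypot line to the body."""
    body = description + " " + HONEYPOT.format(url=url, domain=SITE_DOMAIN)
    return (
        "<item>"
        f"<title>{escape(title)}</title>"
        f"<link>{escape(url)}</link>"
        f"<description>{escape(body)}</description>"
        "</item>"
    )

if __name__ == "__main__":
    print(rss_item(
        "Sample post",
        "http://www.themeparkinsider.com/sample",
        "Original article text here.",
    ))
```

The point is that human readers in a feed reader see a harmless attribution line, while a scraper that republishes the feed verbatim republishes the tracer along with it.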

This way, I could do a Google search for that text string (well, for a part of it -- it is rather long), and discover which websites were automatically copying my site's content. This method wouldn't catch people who visited my site and manually copied the posts -- but I didn't care so much about them. I wanted to see who was simply creating scripts to automatically crawl RSS feeds and pull the content onto their own sites. That's the laziest form of scraping on the Web. (I figured that any human being copying the RSS feed would be smart enough to at least delete the disclaimer line.)
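If you'd rather script the check than type it into a search box, here's a small sketch of building the query. The quoted phrase and Google's -site: operator do the real work; the particular phrase chosen from the disclaimer is just an example:

```python
from urllib.parse import quote_plus

# Quoting the phrase asks Google for a verbatim match; the -site:
# operator filters out hits from the original domain, leaving only
# the copies. The phrase is one distinctive chunk of the disclaimer.
phrase = '"you are reading a scraper website that has illegally copied"'
query = phrase + " -site:themeparkinsider.com"
print("https://www.google.com/search?q=" + quote_plus(query))
```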

And here's what I found. Based on my most recent search, here are the top 10 domains automatically scraping ThemeParkInsider.com's front-page content (in alphabetical order):

alltop.com
feed7.com
feedreader.com
mippin.com
nhsai.org
ofelio.com
orangehedgehog.com
regator.com
srsounds.com
tripsense.com

My questions for the people running Google and other search engines would be: Why are you indexing content from these domains? Why not just ban the scrapers from your index and instead return links to the original content? Forget for a moment complex questions about derivative content. This is just lazy, automated copying and pasting. Dumping these domains from your index ought to be an easy call, especially when publishers create honeypots like mine to help you identify the worst offenders.

Robert Niles can also be found at http://www.themeparkinsider.com

