Blog scraping

From Wikipedia, the free encyclopedia

Blog scraping is the process by which automated software scans hundreds of thousands of blogs per day, searching for and copying content. The process is sometimes referred to by the name given to the software or individuals responsible for it: “blog scrapers.”

"Scraping" essentially means copying or, in the case of copyrighted material, stealing content from a blog that is not owned by the individual initiating the scraping process. The scraped content is often used on spam blogs, or splogs.

Dangers

If blog scrapers gather copyrighted material, that is a violation of law. Even ignoring the legal side for a moment, blog scraping causes a number of more practical problems for the person or business whose blog is being scraped. The problem is particularly worrisome for business owners and business bloggers.

Sometimes a blog scraper will copy an entire post from an independent or business blog. That duplicate content will include the author's tag and a link back to the author's site (if that link appears in the author's tag).

Often, though, blog scrapers copy only the portion of the content that is keyword-relevant to their splog topic.

The more 'advanced' blog scrapers do this for two reasons. First, by copying only the content that is relevant to their splog topic, they can increase the keyword relevancy of their site(s). Second, by not scraping the entire post, they eliminate any outbound links, which would reduce their search engine ranking.

Additionally, scraped content can appear on any type of splog or RSS-fed spam site. That means unsuspecting authors could find their creative or even copyrighted material showing up on a site promoting pornography or another type of content that would be offensive to them or their audience. This can damage the original author's reputation.

Defense

Blog scraping software is becoming more "intelligent" over time. The "smarter" programs can bypass even the most valiant efforts, but that should not stop you from taking some very simple steps to discourage the majority, which are "average" or "dumb" scrapers.

#1. Include a strongly worded copyright tag

The bottom of each of your original posts or articles should include a brief copyright tag. This is your first line of defense. It will deter all but the most unscrupulous people from stealing your content. Against the blog scraping software that spam marketers use, it won’t do much on the prevention side, but a clear notice documents your claim and helps when you go after the offending party.

#2. Use a summary feed for your business blog

Instead of sending the full content of your posts via RSS, change the setting of your blog software to use "summary" or "truncated" feeds. This may not stop the few "smart" scrapers out there, but it will help reduce the incidence of blog scraping (until blog scraping makes its next evolutionary jump). Truncate with caution, however: summary feeds are repellent to many readers, who dislike being forced to load an additional webpage instead of reading the story in their news aggregator.
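Most blog software exposes summary feeds as a simple setting, but the idea can be sketched in a few lines of Python. The function name `summarize` and the 50-word default are assumptions made for illustration, not part of any particular blogging package.

```python
import html
import re


def summarize(post_html: str, max_words: int = 50) -> str:
    """Return a plain-text excerpt of at most max_words words.

    Hypothetical helper: real feed generators usually offer this
    as a built-in "summary" or "excerpt" option.
    """
    # Crudely strip tags; a production feed would use a real HTML parser.
    text = re.sub(r"<[^>]+>", " ", post_html)
    # Decode entities such as &amp; so the excerpt reads as plain text.
    text = html.unescape(text)
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    # Mark the cut so readers know the post continues on the site.
    return " ".join(words[:max_words]) + " [...]"
```

A scraper that reads only the feed then receives the excerpt rather than the whole post, which is the trade-off against reader convenience described above.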

#3. If you ‘must’ use a full-text feed, copyright protect it!

You can add a copyright footer to your RSS feed. This can be done easily with most major blogging software, but you probably will not be able to do it with free blogging software such as Blogger.

WordPress users can install the "Feed Copywriter Plugin" that makes this process quick, easy, and painless.
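For readers who generate or post-process their own feed, the following is a minimal sketch of what such a footer amounts to. It assumes a standard RSS 2.0 layout with `item`, `description`, and `link` elements; the function name `add_copyright_footer` and the notice wording are hypothetical, and this is not how the plugin itself is implemented.

```python
import xml.etree.ElementTree as ET


def add_copyright_footer(rss_xml: str, notice: str) -> str:
    """Append a copyright notice to every item description in an RSS feed.

    Assumes RSS 2.0 element names; illustrative only.
    """
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        desc = item.find("description")
        if desc is None:
            desc = ET.SubElement(item, "description")
        # Point back to the original post when the item carries a link.
        link = item.findtext("link", default="")
        footer = notice + (f" Original: {link}" if link else "")
        desc.text = (desc.text or "") + "\n\n" + footer
    return ET.tostring(root, encoding="unicode")
```

Because the notice travels inside each item, a scraper that republishes the full feed text also republishes the notice and the link back to the source.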

#4. Place a hidden image in each entry

This is another interesting strategy. Embed a hidden image in each entry; because scrapers are unlikely to notice and remove it, the image is requested from your server whenever a scraped copy is viewed, and you can use your referrer logs to track offenders.
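As a rough illustration of the log check, the sketch below counts requests for the hidden image whose referrer is a site other than your own. It assumes the web server writes the common "combined" access log format; the image path, host names, and the function name `foreign_referrers` are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches the request, status, size, and referrer fields of a
# combined-format access log line (an assumption about your server).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"'
)


def foreign_referrers(log_lines, image_path, own_hosts):
    """Count hits on image_path per referring host outside own_hosts."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or match.group("path") != image_path:
            continue
        # A referrer of "-" (none sent) has no hostname and is skipped.
        host = urlparse(match.group("referrer")).hostname
        if host and host not in own_hosts:
            hits[host] += 1
    return hits
```

Hosts that show up repeatedly in the result are candidates for a closer look; a single hit may just be a reader's web-based aggregator rather than a splog.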

Helpful Links

WordPress Feed Copywriter Plugin

Six Steps to Prevent Content Theft and Combat Copyright Infringement on Your Business Blog

Behind Splogging: Why Sploggers Splog