OCRopus

OCRopus
Developer(s)	Thomas Breuel, DFKI
Initial release	9 April 2007^[1]
Stable release	0.7 / 6 April 2013
Written in	C++ and Python
Operating system	FreeBSD, Linux, Mac OS X
Type	Optical character recognition
License	Apache License v2.0
Website	github.com/tmbdev/ocropy

OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License, Version 2.0. It has a highly modular, plugin-based design that allows components to be swapped out easily.

Development of OCRopus is currently led by Thomas Breuel of the German Research Centre for Artificial Intelligence (DFKI) in Kaiserslautern, Germany, and is sponsored by Google.

OCRopus is developed for Linux. Users have also reported success running it on Mac OS X, and an application called TakOCR^[2] installs OCRopus on Mac OS X and provides a simple droplet interface.

How it works

OCRopus is an OCR system that combines pluggable layout analysis, pluggable character recognition, and pluggable language modeling. It is aimed primarily at high-volume document conversion, namely for Google Book Search, but also at desktop and office use and at use by visually impaired people.

OCRopus originally used Tesseract as its only character recognition plugin, but as of the 0.4 release it uses its own engine.^[3] This is especially useful for extending recognition to additional languages and writing systems. OCRopus also contains disabled code for a handwriting recognition engine, which may be repaired in the future.

OCRopus's layout analysis plugin does image preprocessing and layout analysis: it chops up the scanned document and passes the sections to a character recognition plugin for line-by-line or character-by-character recognition.
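
The three pluggable stages described above can be illustrated with a minimal sketch. It is not OCRopus code: the names TextLine, Pipeline, split_on_newlines, echo_recognizer and fix_common_confusions are hypothetical, and a plain string stands in for a scanned page image. The sketch only shows how a layout stage, a recognition stage and a language modeling stage might be swapped independently.

```python
from dataclasses import dataclass
from typing import Callable, List

# Every name in this block is hypothetical and exists only for illustration;
# none of it is part of the OCRopus code base.

@dataclass
class TextLine:
    index: int   # reading-order position of the line on the page
    pixels: str  # stand-in for the cropped line image

# Each stage is just a callable, so any stage can be replaced on its own.
LayoutAnalyzer = Callable[[str], List[TextLine]]
LineRecognizer = Callable[[TextLine], str]
LanguageModel = Callable[[str], str]

@dataclass
class Pipeline:
    layout: LayoutAnalyzer
    recognizer: LineRecognizer
    language_model: LanguageModel

    def run(self, page: str) -> List[str]:
        lines = self.layout(page)                            # chop the page into lines
        raw = [self.recognizer(line) for line in lines]      # recognize line by line
        return [self.language_model(text) for text in raw]   # post-correct each line

def split_on_newlines(page: str) -> List[TextLine]:
    """Toy layout analysis: every non-blank row counts as one text line."""
    rows = [row for row in page.splitlines() if row.strip()]
    return [TextLine(i, row) for i, row in enumerate(rows)]

def echo_recognizer(line: TextLine) -> str:
    """Toy recognizer: the 'image' already is text, so just trim it."""
    return line.pixels.strip()

def fix_common_confusions(text: str) -> str:
    """Toy language model: repair one typical zero/letter-O confusion."""
    return text.replace("0CR", "OCR")

pipeline = Pipeline(split_on_newlines, echo_recognizer, fix_common_confusions)
result = pipeline.run("  0CRopus is modular \n\n second line ")

# Swapping a single stage leaves the other two untouched.
shouting = Pipeline(split_on_newlines,
                    lambda line: line.pixels.strip().upper(),
                    fix_common_confusions)
shouted = shouting.run("second line")
```

In this sketch each stage is a plain function, so replacing the recognizer does not require any change to the layout or language modeling stages.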

As of the alpha release, OCRopus uses language modeling code from OpenFST,^[4] another Google-supported project. OpenFST is optional as of version pre-0.4.

History

Release history:^[5]

Initial announcement – 9 April 2007^[1]
0.1.0 (alpha) – 22 October 2007
0.1.1 (alpha) – 14 December 2007 (improved build system)
0.2 (alpha 2) – 31 May 2008
0.3 (alpha 3) – 16 October 2008^[5]
pre-0.4 (alpha 4) – available for download May 2009^[6]
0.4.3 – July 2009
0.4.4 – March 2010
0.5 – June 2012
0.6 – 23 August 2012
0.7 – 6 April 2013

Usage

OCRopus can be used from the command line or from within gscan2pdf. Once installed, it is invoked by specifying the input images, and it writes hOCR (HTML-based) output to standard output. If more precise control is needed, command-line options can be given to perform specific operations (e.g. recognizing a single line).
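
Because hOCR is ordinary HTML, the output can be post-processed with a standard HTML parser. The following is a minimal sketch using only the Python standard library. It assumes the usual hOCR conventions, in which a text line is an element with class ocr_line whose title attribute carries a bbox property. The class HocrLineParser, the helper parse_bbox and the SAMPLE document are hypothetical and are not real OCRopus output.

```python
from html.parser import HTMLParser

def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):          # properties are separated by ';'
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None

class HocrLineParser(HTMLParser):
    """Hypothetical helper: collects (bbox, text) for each ocr_line element."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._tag = None     # tag name of the line element being read, if any
        self._depth = 0      # nesting depth of elements with that tag name
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self._tag is None and "ocr_line" in classes:
            self._tag, self._depth = tag, 1
            self._bbox = parse_bbox(attr.get("title") or "")
            self._chunks = []
        elif tag == self._tag:
            self._depth += 1               # e.g. a nested word-level span

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:           # the line element itself closed
                text = " ".join("".join(self._chunks).split())
                self.lines.append((self._bbox, text))
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)

# Hand-written sample for illustration only, not actual OCRopus output.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 800 600'>
  <span class='ocr_line' title='bbox 10 20 300 50'>OCRopus emits hOCR</span>
  <span class='ocr_line' title='bbox 10 60 280 90; baseline 0 -3'><span class='ocrx_word'>one</span> <span class='ocrx_word'>line</span></span>
</div>
"""

parser = HocrLineParser()
parser.feed(SAMPLE)
hocr_lines = parser.lines
for bbox, text in hocr_lines:
    print(bbox, text)
```

In practice the sample string would be replaced by the text read from the standard output of the OCR run.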

References

[1] Announcing the OCRopus Open Source OCR System (Thomas Breuel, OCRopus Project Leader)
[2] TakOCR website
[3] OCRopus doesn't even link with Tesseract by default
[4] Official OpenFST website
[5] Release notes
[6] Announcements - new repositories available

External links

OCRopus page on GitHub
IUPR Publication Server (papers behind many of the algorithms used in OCRopus)
