OCRopus

OCRopus
Developed by	Thomas Breuel, DFKI
Latest release	0.2 / 31 May 2008
Written in	C++ and Lua
OS	Linux
Genre	Optical character recognition
License	Apache License v2.0
Website	http://code.google.com/p/ocropus/

OCRopus is a free document analysis and OCR system released under the Apache License, Version 2.0. It has a highly modular, plugin-based design that allows individual components to be swapped out easily. Development is led by Thomas Breuel of the German Research Centre for Artificial Intelligence (DFKI) in Kaiserslautern, Germany, and is sponsored by Google. OCRopus is currently available only for Linux.

How it works

OCRopus is an OCR system that combines pluggable layout analysis, pluggable character recognition, and pluggable language modeling. It is aimed primarily at high-volume document conversion, namely for Google Book Search, but also at desktop and office use and at use by visually impaired people.
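
The three pluggable stages can be pictured as interchangeable functions wired into one pipeline. The following Python sketch is purely illustrative: the names `Pipeline`, `segment`, `recognize` and `correct`, and the toy stand-in components, are hypothetical and are not part of the OCRopus code base, which is written in C++ and Lua.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types and names for illustration only; not OCRopus APIs.
Image = List[List[int]]  # a grayscale image as rows of pixel values


@dataclass
class Pipeline:
    segment: Callable[[Image], List[Image]]  # layout analysis: page -> line images
    recognize: Callable[[Image], str]        # character recognition: line image -> raw text
    correct: Callable[[str], str]            # language modeling: raw text -> corrected text

    def run(self, page: Image) -> List[str]:
        # Each stage only sees the output of the previous one,
        # so any stage can be replaced without touching the others.
        return [self.correct(self.recognize(line)) for line in self.segment(page)]


# Toy stand-ins so the sketch runs end to end.
def one_line_per_row(page: Image) -> List[Image]:
    return [[row] for row in page]


def dummy_recognizer(line: Image) -> str:
    return "l1ne" if sum(line[0]) > 0 else ""


def dummy_language_model(text: str) -> str:
    return text.replace("1", "i")


pipeline = Pipeline(one_line_per_row, dummy_recognizer, dummy_language_model)
result = pipeline.run([[0, 1, 0], [0, 0, 0]])
```

Swapping a component means passing a different function for one field, for example a second recognizer for another writing system, while the rest of the pipeline stays unchanged.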

Currently, OCRopus uses Tesseract as its only character recognition plugin, but others are expected to be added in the future. The plugin architecture is especially useful for extending the system to additional languages and writing systems. OCRopus also contains disabled code for a handwriting recognition engine, which may be repaired in the future.

OCRopus itself does image preprocessing and layout analysis; it chops up the scanned document before passing it to Tesseract for line-by-line or character-by-character recognition.
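
As a rough illustration of chopping a page into lines before recognition, the sketch below splits a binarized page wherever a fully blank row of pixels occurs (a horizontal projection profile). This is a deliberately simple textbook technique, not the layout analysis that OCRopus actually implements, and the function names and the `recognize_line` callback are hypothetical.

```python
from typing import Callable, List, Tuple

BinaryImage = List[List[int]]  # 1 = ink, 0 = background


def split_into_lines(page: BinaryImage) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines; bottom is exclusive."""
    lines = []
    start = None
    for y, row in enumerate(page):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                 # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y))  # blank row closes the current line
            start = None
    if start is not None:
        lines.append((start, len(page)))  # line runs to the bottom edge
    return lines


def recognize_page(page: BinaryImage,
                   recognize_line: Callable[[BinaryImage], str]) -> List[str]:
    # Hand each cropped line image to the recognizer, one line at a time.
    return [recognize_line(page[top:bottom]) for top, bottom in split_into_lines(page)]
```

A real layout analysis stage also has to deal with skew, multiple columns, and images, which a plain row projection does not handle.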

As of the alpha release, OCRopus uses the language modeling code from another Google-supported project, OpenFST.^[1]

History

Release history:^[2]

Initial announcement - 9 April 2007^[3]
0.1.0 - Alpha - 22 Oct 2007
0.1.1 - 14 Dec 2007 - Improved build system
0.2 - Alpha 2 - 31 May 2008^[4]
Beta - Scheduled for August 2008 - Commercial-quality accuracy for books and journal articles^[5]
1.0 - Scheduled for Q3 2008 - Packaging for additional operating systems, GUI

Usage

Currently, OCRopus can only be used from the command line. Once installed, it is invoked by specifying the input images, and it writes hOCR HTML code to standard output. If more precise control is needed, options can be specified on the command line to perform specific operations (e.g. recognizing a single line).
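
Because hOCR is ordinary HTML, the output can be post-processed with standard tools. The Python sketch below extracts the text and bounding box of each line from a saved hOCR document. It assumes that lines are marked with the class `ocr_line` and carry a `bbox x0 y0 x1 y1` property in their `title` attribute, as the hOCR format describes; the helper names `parse_bbox`, `HocrLineParser` and `extract_lines` are hypothetical, and the exact markup produced by a given OCRopus release may differ.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr"}  # elements that never have a closing tag


def parse_bbox(title):
    """Find 'bbox x0 y0 x1 y1' among the ';'-separated title properties."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLineParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lines = []   # collected (bbox, text) pairs
        self._depth = 0   # nesting depth inside the current ocr_line element
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse whitespace so nested markup does not leave stray gaps.
            text = " ".join("".join(self._text).split())
            self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def extract_lines(hocr):
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines
```

The hOCR text itself would come from redirecting the program's standard output to a file and reading that file back in.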