OCRopus
This article or section contains information about computer software currently in development. The content may change as the software development progresses.
OCRopus | |
---|---|
Developed by | Thomas Breuel, DFKI |
Latest release | 0.1.1 / December, 2007 |
Written in | C++ and Lua |
OS | Linux |
Genre | Optical character recognition |
License | Apache License v2.0 |
Website | http://code.google.com/p/ocropus/ |
OCRopus is a free document analysis and optical character recognition (OCR) system released under the Apache License, Version 2.0. It has a highly modular, plugin-based design that allows individual components to be swapped out easily. Development is led by Thomas Breuel of the German Research Centre for Artificial Intelligence (DFKI) in Kaiserslautern, Germany, and is sponsored by Google. OCRopus is currently available only for Linux.
How it works
OCRopus is an OCR system that combines pluggable layout analysis, pluggable character recognition, and pluggable language modeling. It is aimed primarily at high-volume document conversion, in particular for Google Book Search, but is also intended for desktop and office use and for use by visually impaired people.
Currently, OCRopus uses Tesseract as its only character recognition plugin, but others are expected to be added in the future. The plugin design is especially useful for extending OCRopus to additional languages and writing systems. OCRopus also contains code for a handwriting recognition engine, which is currently disabled and may be repaired in the future.
OCRopus itself does image preprocessing and layout analysis; it chops up the scanned document before passing it to Tesseract for line-by-line or character-by-character recognition.
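The division of labour described above (preprocessing, then layout analysis, then a swappable recognizer) can be illustrated with a minimal Python sketch. This is not OCRopus code, which is written in C++ and Lua: the names `binarize`, `split_lines`, `recognize_page`, and the `Recognizer` type are hypothetical, and the layout step is a toy that simply cuts the page at rows containing no ink.

```python
from typing import Callable, List

# Hypothetical types for illustration only; these are not OCRopus interfaces.
Image = List[List[int]]               # grayscale page as rows of pixel values
Recognizer = Callable[[Image], str]   # pluggable recognizer: line image -> text


def binarize(page: Image, threshold: int = 128) -> Image:
    """Preprocessing: map each pixel to 0 (ink) or 255 (background)."""
    return [[0 if px < threshold else 255 for px in row] for row in page]


def split_lines(page: Image) -> List[Image]:
    """Toy layout analysis: cut a binarized page at rows that contain no ink."""
    lines: List[Image] = []
    current: Image = []
    for row in page:
        if any(px == 0 for px in row):
            current.append(row)
        elif current:
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines


def recognize_page(page: Image, recognizer: Recognizer) -> List[str]:
    """Run preprocessing and layout analysis, then hand each line to the plugin."""
    return [recognizer(line) for line in split_lines(binarize(page))]
```

Because the recognizer is passed in as an argument, a different engine can be substituted without touching the preprocessing or layout steps, which is the property the plugin design is meant to provide.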
As of the alpha release, OCRopus uses the language modeling code from another Google-supported project, OpenFST.[1]
History
Release history:[2]
- Initial announcement - 9 April 2007[3]
- 0.1.0 - Alpha - 22 Oct 2007
- 0.1.1 - 14 Dec 2007 - Improved build system
- 0.2 - Alpha 2 - 31 May 2008[4]
- Beta - Scheduled for August 2008 - Commercial-quality accuracy for books and journal articles[5]
- 1.0 - Scheduled for Q3 2008 - Packaging for additional operating systems, GUI
Usage
Currently, OCRopus can only be used from the command line. Once installed, it is invoked by specifying the input images, and it writes hOCR HTML to standard output. If more precise control is needed, options can be specified on the command line to perform specific operations (e.g. recognizing a single line).
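Since the output is hOCR, it can be post-processed with any HTML parser. The following Python sketch uses only the standard library and assumes the output follows the general hOCR convention of marking text lines with `class="ocr_line"` and a `title` attribute containing `bbox x0 y0 x1 y1`. The sample markup is hand-written for illustration and is not actual OCRopus output; `parse_bbox`, `HocrLineParser`, and `extract_lines` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]


def parse_bbox(title: str) -> Optional[BBox]:
    """Extract "bbox x0 y0 x1 y1" from an hOCR title attribute, if present."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            x0, y0, x1, y1 = (int(p) for p in parts[1:])
            return (x0, y0, x1, y1)
    return None


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) for every element whose class includes ocr_line."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: List[Tuple[Optional[BBox], str]] = []
        self._tag: Optional[str] = None   # tag name of the open line element
        self._depth = 0                   # nesting depth of that tag name
        self._bbox: Optional[BBox] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self._tag is None:
            if "ocr_line" in (attr.get("class") or "").split():
                self._tag, self._depth = tag, 1
                self._bbox = parse_bbox(attr.get("title") or "")
                self._text = []
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace before storing the line text.
                text = " ".join("".join(self._text).split())
                self.lines.append((self._bbox, text))
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)


def extract_lines(hocr: str) -> List[Tuple[Optional[BBox], str]]:
    parser = HocrLineParser()
    parser.feed(hocr)
    parser.close()
    return parser.lines


# Hand-written sample in hOCR style (an assumption, not real OCRopus output).
SAMPLE_HOCR = """
<div class='ocr_page'>
  <span class='ocr_line' title='bbox 10 20 300 45'>Hello world</span>
  <span class='ocr_line' title='bbox 10 50 280 75'>second line</span>
</div>
"""

if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE_HOCR):
        print(bbox, text)
```

In practice the string passed to `extract_lines` would be the HTML captured from standard output rather than the inline sample.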
References
- ^ Official OpenFST website
- ^ Roadmap - ocropus - Google Code
- ^ Announcing the OCRopus Open Source OCR System (Thomas Breuel, OCRopus Project Leader), http://google-code-updates.blogspot.com/2007/04/announcing-ocropus-open-source-ocr.html
- ^ alpha2 release available
- ^ Updated Roadmap
External links
- OCRopus (project page on Google Code)
- IUPR Publication Server (Papers behind many of the algorithms used in OCRopus)