Tesseract (software)
Tesseract | |
---|---|
Design by | Ray Smith, Hewlett-Packard |
Developed by | Google |
Latest release | 2.03 / April 22, 2008 |
Written in | C and C++ |
OS | Linux, Windows and (unofficially) Mac OS X |
Genre | Optical character recognition |
License | Apache License v2.0 |
Website | http://code.google.com/p/tesseract-ocr/ |
Tesseract is a free optical character recognition (OCR) engine. It was originally developed at Hewlett-Packard from 1985 until 1995. After ten years without further development, Hewlett-Packard and UNLV released it as open source in 2005. Tesseract is currently developed by Google and released under the Apache License, Version 2.0. The current version of Tesseract is 2.03, released April 22, 2008.
About the Tesseract OCR Engine
Tesseract is a raw OCR engine. It has no document layout analysis, no output formatting, and no graphical user interface. It only processes a TIFF or BMP image of a single column of text and creates text from it. Compressed TIFF files are not supported unless libtiff is installed. It can distinguish fixed-pitch from proportional text. The engine ranked among the top three for character accuracy in the 1995 UNLV accuracy test. It compiles and runs on Linux, Windows and Mac OS X; however, due to limited resources, only Windows and Ubuntu Linux are rigorously tested by the developers.
Tesseract can process English, French, Italian, German, Spanish and Dutch. It can be trained to work in other languages as well.
Tesseract is suitable for use as a backend, and can be used for more complicated OCR tasks including layout analysis by using a frontend such as OCRopus. Further integration with programs such as OCRopus, to better support complicated layouts, is planned. Likewise, frontends such as FreeOCR can add a GUI to make the software easier to use for manual tasks.
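The division of labour between a layout-analysis frontend and a single-column OCR backend can be sketched in Python. This is only an illustration of the idea, not the API of OCRopus or Tesseract: the `Column` type, the `ocr_page` function and the left-to-right reading order are all assumptions made for the example, and `recognize` stands in for whatever call actually runs the engine on one image.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Column:
    """A single-column region found by a (hypothetical) layout analysis step."""
    image_path: str  # TIFF or BMP file holding just this column
    left: int        # x coordinate of the column on the page
    top: int         # y coordinate of the column on the page


def ocr_page(columns: Iterable[Column], recognize: Callable[[str], str]) -> str:
    """Run the engine on each column in reading order and join the results."""
    # Assumed reading order: left to right, then top to bottom.
    ordered = sorted(columns, key=lambda c: (c.left, c.top))
    return "\n".join(recognize(c.image_path).strip() for c in ordered)
```

The frontend decides where the columns are and in which order they are read; the backend only ever sees one single-column image at a time, which matches the restriction described above.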
History
The Tesseract engine was developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co. in Greeley, Colorado, between 1985 and 1994, with further changes made in 1996 to port it to Windows and a partial conversion to C++ in 1998. Much of the code was originally written in C, and later parts were written in C++. Since then, all of the code has been converted so that it at least compiles with a C++ compiler.
Currently Tesseract builds under Linux with GCC 2.95 or later and under Windows with Visual C++ 6. The C++ code makes heavy use of a macro-based list system. This system predates the C++ Standard Template Library (STL) and may be more efficient than STL lists, but it is reportedly harder to debug when a segmentation fault occurs. Another side effect of the C/C++ split is that C++ data structures must be converted to C data structures before the low-level C code is called. This conversion is clumsy; converting the C code to C++ is a step towards eliminating it, but that work is not yet complete.
Usage
Tesseract is only an OCR engine, not a standalone application. It runs from the command line, and usage is the same on Windows and Linux. It is invoked using the following format:
tesseract image.tif output [-l langid]
The image file must have the extension .tif (or .bmp for BMP images) for its type to be recognized correctly. If a file exists with the same name but with the .tif extension replaced by .uzn, it is interpreted as a UNLV-style zone file. (See the Information Science Research Institute at the University of Nevada, Las Vegas (ISRI@UNLV) for details of zone files.)
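The invocation format and file-naming rules above can be sketched in Python as follows. This is a minimal illustration and not part of Tesseract: the helper names `build_command`, `zone_file_for` and `run_ocr` are hypothetical, the exact-match extension check is an assumption of the sketch, and `run_ocr` only works where a `tesseract` executable is available on the PATH.

```python
import subprocess
from pathlib import Path

# Extensions the text above says are recognized (exact match is an assumption).
SUPPORTED_EXTENSIONS = {".tif", ".bmp"}


def build_command(image, output_base, lang=None):
    """Return the argument list for: tesseract image output [-l langid]."""
    image = Path(image)
    if image.suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported image extension: {image.suffix!r}")
    command = ["tesseract", str(image), str(output_base)]
    if lang is not None:
        # The language id is optional, as the square brackets indicate.
        command += ["-l", lang]
    return command


def zone_file_for(image):
    """Return the path where a UNLV-style zone file would be looked for."""
    return Path(image).with_suffix(".uzn")


def run_ocr(image, output_base, lang=None):
    """Run the engine; raises CalledProcessError if tesseract exits with an error."""
    subprocess.run(build_command(image, output_base, lang), check=True)
```

For example, `build_command("page.tif", "page", "deu")` produces the argument list for `tesseract page.tif page -l deu`, where `deu` is assumed here to be the language id for German, and `zone_file_for("page.tif")` gives `page.uzn`.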
References
- Announcing Tesseract OCR (Luc Vincent, Google Code Blog, August 2006)
External links
- Tesseract OCR (project page on Google Code)
- Information Science Research Institute at the University of Nevada, Las Vegas (ISRI@UNLV)
- http://www.ocropus.org/ - OCRopus, an OCR system that combines a high-performance handwriting recognizer, developed in the mid-1990s and deployed by the US Census Bureau, with a novel high-performance layout analysis framework. It currently uses Tesseract as its OCR plugin.
- http://tesseract-ocr.repairfaq.org/ - C/C++ structure of Tesseract extracted from Doxyfied source code (based on Tesseract V1.03)
- Archivista Box - A complete GPL document management system based on Tesseract and Linux.
- [1] - some patches for training on a 64-bit machine.