Tesseract (software)

From Wikipedia, the free encyclopedia

In computer software, Tesseract is an optical character recognition engine. It was originally developed at Hewlett Packard from 1985 until 1995. After ten years with no development, Hewlett Packard and UNLV released it in 2005. Tesseract is currently developed by Google and released under the Apache License, Version 2.0. In order to use Tesseract, it is recommended to compile Tesseract from source on Linux using the GNU Compiler Collection, or GCC, or Microsoft Windows using Microsoft Visual C++ or GCC via X11. However, there are other alternatives available on the internet to avoid compiling from source, some of which are listed in the external links section below.

1 About the Tesseract OCR Engine
2 History
3 Usage
4 References
5 External links

[edit] About the Tesseract OCR Engine

Tesseract's code is a raw OCR engine. It has no page layout analysis, no output formatting, and no graphical user interface. It can only process an image of a single column and create text from it. It can detect fixed pitch vs proportional text. The engine was in 1995 in the top 3 in terms of character accuracy, and it compiles and runs on both Linux and Windows. Another current limitation is that it only recognizes English and its character set is only US-ASCII. Training code is included in the open source release however, and will be included in a future release.

[edit] History

The Tesseract engine was developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port to Windows, and some C++izing in 1998. A lot of the code was written in C, and then some more was written in C++. Since then all the code has been converted to at least compile with a C++ compiler. Currently it builds under Linux with gcc2.95 and under Windows with VC++6. The C++ code makes heavy use of a list system using macros. This predates stl, was portable before stl, and is more efficient than stl lists, but has the big negative that if you do get a segmentation violation, it is hard to debug. Another "feature" of the C/C++ split is that the C++ data structures get converted to C data structures to call the low-level C code. This is ugly, and the C++izing of the C code is a step towards eliminating the conversion, but it has not happened yet.

[edit] Usage

Tesseract is only the OCR engine, not a standalone application. Tesseract runs from the command line and the usage of both Windows and Linux versions is the same. Tesseract may be called from command line using the following format:

tesseract <image.tif> <output> batch

The image file requires an .tif extension for its type to be recognized correctly. If a file exists with the .tif extension replaced by .uzn, then it will be interpreted as a UNLV-style zone file. (See ISRI@UNLV (Information Science Research Institute at the University of Nevada, Las Vegas) for details of the zone files.)