Talk:Optical character recognition

This article is within the scope of WikiProject Computer science, which aims to create a comprehensive computer science reference for Wikipedia. Visit the project page for more information and to join in on related discussions.
This article has not yet received a rating on the assessment scale.
This article has not yet received an importance rating on the assessment scale.

1 Missing an overview of where OCR fits into a document processing solution
2 Zip codes
3 Open source programs
- 3.1 Section about software
4 CJK Support?
5 MICR
6 Merge
7 Section "Optical Character Recognition in Unicode"
8 OCR for mathematical documents
9 Wrong word?
10 MAP
11 Tesseract
12 Citations?
13 unknown characters
14 Character 0x244B
15 Strongly suggest a 'software - last release date' column in table
16 Missing
17 This article doesn't even mention Cyrillic OCR!!!

Missing an overview of where OCR fits into a document processing solution

The key to a good OCR rate is the quality of the input images and how they are pre-processed. This needs to be added to the article. For example, thresholding low-resolution images of text is critical for good OCR results. This leads into topics such as background removal, background normalization, Otsu thresholding, median filtering, demosaicing, etc. A simple chart of OCR recognition rates for various scan DPI settings would help. Commercial products like ABBYY FineReader suggest that characters should be at least 20 pixels high to be OCR'd with good results.
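To make the thresholding point concrete, here is a minimal sketch of Otsu's method in plain Python. It assumes the scan is already available as a flat list of 8-bit grayscale values; the function names `otsu_threshold` and `binarize` are illustrative and do not come from any OCR package.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold for an iterable of 8-bit grayscale values (0-255)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_low = 0    # pixels with value <= candidate threshold
    sum_low = 0  # sum of their intensities
    for t in range(256):
        n_low += hist[t]
        sum_low += t * hist[t]
        n_high = total - n_low
        if n_low == 0:
            continue  # no pixels in the dark class yet
        if n_high == 0:
            break     # every pixel is already in the dark class
        mean_low = sum_low / n_low
        mean_high = (sum_all - sum_low) / n_high
        # Between-class variance (up to a constant factor); Otsu maximizes this.
        var_between = n_low * n_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (dark, likely ink) or 1 (light, likely paper)."""
    return [1 if p > threshold else 0 for p in pixels]
```

A global threshold like this only works well when the background is fairly uniform, which is why background normalization is usually mentioned alongside it.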

A chart giving the resulting character size in pixels for a given character point size and scan DPI would also help. E.g., 75 dpi scans of 10-point text produce horrible results, whilst 300 dpi scans of 10-point text produce excellent results. —Preceding unsigned comment added by 98.197.217.81 (talk) 21:00, August 25, 2007 (UTC)
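The chart requested above follows from simple arithmetic, sketched here in Python. One typographic point is 1/72 inch, so the nominal (em) height in pixels is point size × DPI / 72. The 0.7 capital-height ratio is an assumption for illustration only, since the real ratio depends on the typeface, and the function names are hypothetical.

```python
POINTS_PER_INCH = 72  # 1 typographic (PostScript) point = 1/72 inch


def em_height_px(point_size, dpi):
    """Nominal (em) height of the type in pixels at the given scan resolution."""
    return point_size * dpi / POINTS_PER_INCH


def approx_cap_height_px(point_size, dpi, cap_ratio=0.7):
    """Rough capital-letter height; cap_ratio is an assumed, font-dependent value."""
    return em_height_px(point_size, dpi) * cap_ratio


if __name__ == "__main__":
    for dpi in (75, 150, 200, 300, 600):
        em = em_height_px(10, dpi)
        cap = approx_cap_height_px(10, dpi)
        print(f"10 pt at {dpi:>3} dpi: em {em:5.1f} px, caps approx. {cap:5.1f} px")
```

Under these assumptions, 10-point text at 75 dpi falls well below the 20-pixel guideline mentioned above, while the same text at 300 dpi clears it.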

Zip codes

The web page pete's history gives 1965 as the year the United States Postal Service first used OCR to read zip codes.--Rethunk

Open source programs

Are there any open source OCR programs available?

Yes. THE KING 12:53, 5 May 2005 (UTC)

I see that http://simpleocr.com/ is free for "personal use"; is it really open source?

Section about software

Kooka - default scanning application in KDE. It uses GOCR for OCR.
Tesseract - an open source OCR engine, initially developed by HP and released under the Apache License, Version 2.0. It can be compiled using MSVC 6.0 or GCC (~120000 LOC)
Clara - [1], [2] (~50000 LOC)
GOCR - (~20000 LOC + Unpaper + Socrates) - GOCR included in Debian and other distributions (not for Windows)
Ocrad - [3] - (~9900 LOC) - "is an OCR [...] program based on a feature extraction method".
Simple OCR - freeware application available, as well as royalty free SDK and source code.
ISRI Software - some experimental OCR tools
OCRchie - dormant since 1996
OOCR - an OCR program still in development, under the GPL.
phpOCR - a base implementation for an OCR tool in PHP
Kognition - [4]

CJK Support?

This article doesn't mention anything about OCR support for Chinese, Japanese, and Korean, though that information would be very valuable, especially if there is free software with CJK support. Theshibboleth 00:11, 10 May 2006 (UTC)

Seconded. I'm disappointed in you all! Astarica 09:37, 6 September 2007 (UTC)

MICR

The reference to MICR seems strangely disjointed, as though it is written in the context of human reading rather than machine reading. I am minded to amend it. Would anyone object? Tom 00:00, 5 June 2006 (UTC)

Merge

I am proposing the merge. Neither article is unduly long and it would be much more convenient to the reader to have all the relevant information in one place. BlueValour 17:22, 5 November 2006 (UTC)

Agreed, it should be in this article under a subsection, makes it easier to find. —Preceding unsigned comment added by 207.81.148.242 (talk • contribs)

Section "Optical Character Recognition in Unicode"

It's not at all clear from the article what those characters are used for. 83.79.33.140 19:47, 18 March 2007 (UTC)

OCR for mathematical documents

Searching a bit on the web for a taste of OCR for maths led me to this page: http://www.inftyproject.org Although it's labelled 'free software', going by the license it's obviously just freeware. Anyone know of free/open source alternatives? I'm surprised that there isn't any major software project for this, with (cheap) tablet PCs around the corner and Google's plans to digitise the planet being applied to books.

Most mathematical formulas have been set using TeX, so it shouldn't be that difficult to scan it back in again correctly, right? Merctio 23:01, 11 April 2007 (UTC)

For the InftyReader/open source alternatives: I believe there are no alternatives yet. The OSS world is still struggling with straight Latin. For the InftyEditor: actually quite common, for example OpenOffice Math. For the rest: mixing audio with math text, I've never heard of the idea, and I couldn't envision it by myself, except possibly as one of my really mad ideas like a combo-TV-garden-rake. I thought math was simply unspeakable! Said: Rursus ☺ ★ 09:56, 19 July 2007 (UTC)

Wrong word?

Should this say "handwritten" instead of "hand-printed"?

"These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem."

Matthias Röder 10:56, 6 July 2007 (UTC)

MAP

Ossware map, feel free to modify:

Inwiki: GOCR, Ocrad - cmdline?, OCRopus - new, merger?
Exwiki: Tesseract on g8gle - used by OCRopus, OCRopus on g8gle - nice link to transsurf, Leptonica - unknown whatis,
Known: ClaraOCR - almost no info there, inactive since 2003.

Said: Rursus ☺ ★ 07:01, 19 July 2007 (UTC)

Is this related to a project of some sort? It's not really appropriate for this talk page... Chris Cunningham 11:57, 19 July 2007 (UTC)

Tesseract

It's nice that Tesseract is free, etc., but trying to use it seems rather technically challenging at this point. Does anyone offer it as a free online conversion tool to try? FreeOCR may be a more user-friendly version, but the Windows versions may all require 2K/XP, so older OSes are out of luck. The only free online OCR I can find is scanR, but using it seems quite awkward (must email JPEGs, get activation codes, etc.) -69.87.203.15 12:48, 2 October 2007 (UTC)

Citations?

This article does not have any citations. —Preceding unsigned comment added by 70.126.48.91 (talk) 00:37, 2 December 2007 (UTC)

unknown characters

Where do these OCR characters come from:

? They don't seem to be defined in the relevant standards. --Abdull (talk) 23:14, 15 February 2008 (UTC)

Character 0x244B

Why is 0x244B declared as "classified"? —Preceding unsigned comment added by AzaToth (talk • contribs) 03:40, 9 March 2008 (UTC)

Strongly suggest a 'software - last release date' column in table

The software list is misleading, given that many of the open source OCR packages have not had a release in many years and that some of them are in pre-alpha status (Tesseract). —Preceding unsigned comment added by 98.197.209.187 (talk • contribs)

Missing

- Optical mark recognition link
- Glyph recognition with user interaction (e.g., training an OCR package to learn to OCR Latin texts)
- Document preprocessing before OCR (deskew, threshold, etc.)
- OCR test results to give a basic understanding of scan quality, character size and OCR effectiveness
- Mention output formats for OCR documents (plain text, PDF text on top of the original image, etc.)
- Voting techniques for character recognition (i.e., comparing all instances of the letter 'e' on a page to help classify unknown glyphs as the letter 'e')
—Preceding unsigned comment added by 98.197.209.187 (talk • contribs)
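The voting idea in the last item could look roughly like the Python sketch below. It assumes glyphs have already been segmented into equal-sized binary bitmaps and that a first-pass classifier has labeled the confident ones, leaving `None` for the rest. The names `hamming` and `vote_labels` and the `max_distance` parameter are hypothetical, and this is not how any particular OCR package is known to do it.

```python
from collections import Counter


def hamming(a, b):
    """Number of differing pixels between two equal-sized binary glyph bitmaps."""
    return sum(x != y for x, y in zip(a, b))


def vote_labels(glyphs, labels, max_distance):
    """Fill in None labels by majority vote among similar, already-labeled glyphs.

    glyphs: list of equal-length tuples of 0/1 pixels
    labels: list of str or None (None means the classifier was unsure)
    """
    result = list(labels)
    for i, (glyph, label) in enumerate(zip(glyphs, labels)):
        if label is not None:
            continue  # keep confident labels untouched
        votes = Counter(
            other_label
            for other, other_label in zip(glyphs, labels)
            if other_label is not None and hamming(glyph, other) <= max_distance
        )
        if votes:
            result[i] = votes.most_common(1)[0][0]
    return result
```

A smudged glyph that sits within `max_distance` pixels of several confidently recognized 'e' shapes on the same page would then be labeled 'e', while a glyph with no close neighbors stays unlabeled.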

This article doesn't even mention Cyrillic OCR!!!

The HP scanner I bought for about $50 five years ago came bundled with software that can OCR Cyrillic text about as well as Roman. Apparently Russians have been making use of these capabilities to put huge amounts of writing from the tsarist and Soviet periods online, in honor of "samizdat" traditions!

Apparently the newest versions of HP's bundled software also OCR Greek, Chinese (simplified or traditional), Arabic, Hebrew and Korean. The only really big omission in contemporary terms seems to be Indic scripts (including variants used outside the subcontinent for Tibetan, Burmese, Thai, Laotian and Cambodian).

This article really seems behind the times in not going beyond OCR of the Roman alphabet and its variations. LADave (talk) 02:20, 25 May 2008 (UTC)

Wow!!! Find some reliable sources and add it to the article. Of course, Cyrillic really is a variation of the Roman alphabet (well, the Latin-Greek-Cyrillic superalphabet), especially from the perspective of OCR.--Prosfilaes (talk) 13:26, 25 May 2008 (UTC)