[Bug-ocrad] Request for help in book based OCR

From:	Brewster Kahle
Subject:	[Bug-ocrad] Request for help in book based OCR
Date:	Sun, 02 Apr 2006 18:22:38 -0700
User-agent:	Thunderbird 1.5 (Macintosh/20051201)


Not to seem disrespectful, but we have tested the OCR performance of this
package, and it is not up to the standard we get from ABBYY and other
commercial packages.

But I would like to ask for help with a new form of OCR that I believe
will work, though I do not know for sure.

We would like to do OCR at the book level, and migrate to language-independent
OCR.

The Internet Archive is in the process of scanning a large number of books and
making them publicly available
(http://www.archive.org/details/texts). The books we scan ourselves (for
instance http://www.archive.org/details/americana)
are in a very consistent form, and we have a lot of control over how they are
imaged and processed.

What we want is OCR output in an XML file format that keeps the pixel
location of each word along with the UTF-8 text of the characters in that word.
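
To make the idea concrete, here is a rough sketch of what writing such a
file might look like. The element and attribute names (page, word, left,
top, right, bottom) are only assumptions for illustration; no schema has
been decided.

```python
import xml.etree.ElementTree as ET


def words_to_xml(page_number, words):
    """Serialize one page of recognized words to UTF-8 encoded XML.

    `words` is a list of (text, left, top, right, bottom) tuples, with the
    box given in pixel coordinates of the page image. The element and
    attribute names are hypothetical, not an agreed format.
    """
    page = ET.Element("page", number=str(page_number))
    for text, left, top, right, bottom in words:
        word = ET.SubElement(
            page,
            "word",
            left=str(left),
            top=str(top),
            right=str(right),
            bottom=str(bottom),
        )
        word.text = text
    # Returns bytes; non-ASCII characters are written as UTF-8.
    return ET.tostring(page, encoding="utf-8")
```

Each word element carries its bounding box as attributes and its text as
content, so a viewer could highlight a search hit directly on the page image.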

We believe we can create a large training set for a large number of languages
to train a word-based OCR engine. This could lead to language-independent OCR
based on relatively simple pattern matching.
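
As a very rough illustration of "simple pattern matching" at the word level,
the sketch below scales each binarized word image to a fixed grid and returns
the text of the closest stored example. The class name, the grid size, and
the pixel-mismatch distance are all assumptions made for the example; a real
engine would need something more robust.

```python
def resize(bitmap, width, height):
    """Nearest-neighbour resample of a 0/1 bitmap (list of rows)."""
    src_h, src_w = len(bitmap), len(bitmap[0])
    return [
        [bitmap[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def distance(a, b):
    """Count of pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


class WordMatcher:
    """Hypothetical word recognizer: nearest stored template wins."""

    def __init__(self, width=32, height=8):
        self.width = width
        self.height = height
        self.templates = []  # list of (normalized bitmap, text)

    def train(self, bitmap, text):
        """Store one labelled word image from the training set."""
        self.templates.append((resize(bitmap, self.width, self.height), text))

    def recognize(self, bitmap):
        """Return the text of the closest template, or None if untrained."""
        if not self.templates:
            return None
        query = resize(bitmap, self.width, self.height)
        best = min(self.templates, key=lambda t: distance(query, t[0]))
        return best[1]
```

Nothing here depends on the script or language of the words; the labels are
just whatever UTF-8 strings the training set supplies.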

See http://www.archive.org/details/document-word-segmenter for an overview of 
the idea.


What we believe we need:
   A word segmenter, and
   a trainable system for word recognition, and then
   large training sets.
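
For the first item, one simple way a word segmenter could work on a single
binarized text line is to look for runs of blank pixel columns and split
wherever the gap is wide enough. This is only a sketch under that assumption;
the function name and the min_gap threshold are made up for illustration, and
it is not necessarily how the segmenter linked above works.

```python
def segment_words(line, min_gap=3):
    """Split a binarized text line into word column ranges.

    `line` is a list of pixel rows, each a list of 0/1 values (1 = ink).
    Returns a list of (start, end) column ranges, end exclusive. Blank
    gaps narrower than `min_gap` columns are treated as spacing between
    letters and do not split a word.
    """
    width = len(line[0])
    # A column has ink if any row has a dark pixel in it.
    ink = [any(row[x] for row in line) for x in range(width)]

    words = []
    start = None      # first ink column of the current word
    last_ink = None   # most recent ink column seen
    for x, has_ink in enumerate(ink):
        if not has_ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The blank run was wide enough to count as a word break.
            words.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words
```

The column ranges, combined with the line's top and bottom rows, would give
the pixel boxes to feed into a word recognizer.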

The large training sets we can get from the output of other programs, so we
already have them for several Romance languages.

Is anyone interested in helping with this? We can pay something, but most of the
"compensation" will be in doing something publicly good at a huge scale.

-brewster
Digital Librarian
Internet Archive
