[Bug-ocrad] Request for help in book based OCR
From: Brewster Kahle
Subject: [Bug-ocrad] Request for help in book based OCR
Date: Sun, 02 Apr 2006 18:22:38 -0700
User-agent: Thunderbird 1.5 (Macintosh/20051201)
Not to seem disrespectful, but we have tested the OCR performance of this
package, and it is not up to the standard we get from ABBYY and other
commercial packages.
Still, I would like to ask for help with a new form of OCR that I believe
will work, though I do not know for sure.
We would like to do OCR at the book level, and migrate to language-independent
OCR.
The Internet Archive is in the process of scanning a large number of books and
making them publicly available (http://www.archive.org/details/texts). The
books we scan ourselves (for instance http://www.archive.org/details/americana)
are in a very consistent form, and we have a lot of control over how they are
imaged and processed.
What we want to produce is OCR output in an XML file format that records the
pixel location of each word along with the UTF-8 text of the characters in
that word.
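
As a rough illustration of that kind of output, here is a minimal Python sketch.
The element and attribute names (page, word, left, top, right, bottom) and the
function name are made up for this example and do not describe an existing
format.

```python
import xml.etree.ElementTree as ET


def words_to_xml(page_number, words):
    """Serialize recognized words to UTF-8 XML bytes.

    `words` is a list of (text, left, top, right, bottom) tuples, with the
    box given in pixel coordinates. The schema is hypothetical.
    """
    page = ET.Element("page", number=str(page_number))
    for text, left, top, right, bottom in words:
        word = ET.SubElement(
            page,
            "word",
            left=str(left),
            top=str(top),
            right=str(right),
            bottom=str(bottom),
        )
        word.text = text
    # encoding="utf-8" makes tostring() return bytes with non-ASCII
    # characters written as UTF-8 rather than as character references.
    return ET.tostring(page, encoding="utf-8")


if __name__ == "__main__":
    sample = [("café", 10, 20, 50, 35), ("noir", 58, 20, 96, 35)]
    print(words_to_xml(1, sample).decode("utf-8"))
```

Each word carries its own bounding box, so the text can later be overlaid on
the page image or searched by position.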
We believe we can create a large training set for a large number of languages
to train a word-based OCR engine. This could lead to language-independent OCR
based on relatively simple pattern matching.
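
One very simple form such pattern matching could take is nearest-template
lookup: compare a word image against labelled word images and return the label
of the closest one. The sketch below is only an assumption about what "simple
pattern matching" might mean here. It assumes all bitmaps have already been
scaled to the same size, and the function names are hypothetical.

```python
def hamming(a, b):
    """Count differing pixels between two equally sized binary bitmaps."""
    return sum(
        pixel_a != pixel_b
        for row_a, row_b in zip(a, b)
        for pixel_a, pixel_b in zip(row_a, row_b)
    )


def recognize(word_image, training_set):
    """Return the label of the training bitmap closest to `word_image`.

    `training_set` is a list of (bitmap, label) pairs. Bitmaps are lists of
    rows of 0/1 values, assumed to share one size (an assumption of this
    sketch, not something stated in the proposal).
    """
    _, best_label = min(
        training_set, key=lambda pair: hamming(word_image, pair[0])
    )
    return best_label


if __name__ == "__main__":
    training = [
        ([[1, 0], [1, 0]], "l"),
        ([[1, 1], [1, 1]], "m"),
    ]
    print(recognize([[1, 1], [0, 1]], training))
```

Nothing in this lookup depends on the script or language of the labels, which
is the property the word-based approach is after.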
See http://www.archive.org/details/document-word-segmenter for an overview of
the idea.
What we believe we need:
A word segmenter, and
a trainable system for word recognition, and then
large training sets.
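
To make the first item concrete, a word segmenter for clean, consistent scans
could start from something as simple as looking for wide runs of empty columns
in a text line. This is a minimal sketch under that assumption, not the
segmenter described at the link above. The function name and the min_gap
threshold are illustrative.

```python
def segment_words(line, min_gap=3):
    """Return (start, end) column spans of words in a binary text line.

    `line` is a list of rows, each a list of 0 (background) or 1 (ink).
    A run of at least `min_gap` empty columns separates two words.
    `end` is exclusive.
    """
    width = len(line[0])
    # A column counts as inked if any row has ink in it.
    inked = [any(row[x] for row in line) for x in range(width)]

    spans = []
    start = None      # first inked column of the current word
    last_ink = None   # most recent inked column seen
    for x, has_ink in enumerate(inked):
        if not has_ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # The empty gap is wide enough to close the current word.
            spans.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        spans.append((start, last_ink + 1))
    return spans


if __name__ == "__main__":
    text_line = [[1, 1, 0, 1, 0, 0, 0, 1, 1, 0]]
    print(segment_words(text_line, min_gap=3))
```

Narrow gaps between letters stay inside one span, while gaps of min_gap or
more columns split the line into separate words.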
The large training sets we can get from the output of other programs, so we
already have them for several Romance languages.
Is anyone interested in helping with this? We can pay something, but most of the
"compensation" will be in doing something for the public good at a huge scale.
-brewster
Digital Librarian
Internet Archive