Re: [Bug-ocrad] The function ignore_wide_blobs() doth ignore too much, m
From: Tilman Hausherr
Subject: Re: [Bug-ocrad] The function ignore_wide_blobs() doth ignore too much, methinks
Date: Mon, 30 Aug 2010 19:50:17 +0200

On Sat, 28 Aug 2010 17:17:31 +0200, Antonio Diaz Diaz wrote:
>Hello Tilman,
>
>Tilman Hausherr wrote:
>> I researched the issue why, for some images with tables and grey (noisy)
>> areas, OCRAD returns no text at all, although some of the texts are in
>> clean white areas. I was able to focus on a part in ignore_wide_blobs(),
>> which apparently decides about whether a wide blob is an "image" (I
>> assume you mean a photograph) or a frame. In my case, the function makes
>> a "wrong" decision and then completely deletes blobp_vector.
>
>Did you try the "--layout" or "--cut" options?
--layout made no difference. --cut (with individual settings) can't be
used because I can't set individual parameters for each image.
>> Commenting out the "if" line does solve the problem with the test image,
>> obviously - but what are the risks? Getting a lot of useless output? Or
>> losing on speed?
>
>Getting a lot of useless output. A photograph can produce thousands of
>noise blobs.
hmm...
I first tested my "change" with my test set of about 70 pages.
With the change, one page has more dictionary hits than previously;
three pages that got no dictionary hits at all without the change now
have some.
Then I tested with production. I generally got better results (many
pages now do have useful output that didn't before), with one exception.
One of the images was a huge b/w photograph, thus, from an OCR point of
view, a huge amount of noise. The OCR needed several minutes (!). Thus,
although I can't look into your mind, I guess that the "if" statement
was probably meant as a safety against exactly that. However, that
safety measure also prevents the OCR of printed Excel tables with grey
background cells. So instead of making an assumption about individual
areas, OCRAD ignores the whole file.
Test with photograph:
file type is P4
file size is 4667w x 6222h
TH: blobs = 232481, b[0,0,4666,6221].size() = 29038074, b.size() / 400U = 72595
TH: factor: 124
Test with printed Excel table with text and a few grey cells:
file type is P4
file size is 1653w x 2338h
TH: blobs = 26566, b[0,0,1652,2337].size() = 3864714, b.size() / 400U = 9661
TH: factor: 145
TH: blobs = 26557, b[182,191,1468,2114].size() = 2476188, b.size() / 400U = 6190
TH: factor: 93
Both files would get dumped because the "factor" (b.size() divided by
the number of blobs, i.e. pixels per blob) is < 400. However, the
second file does have text.
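
To make the arithmetic behind the debug output easy to check, here is a
minimal C++ sketch of how the numbers relate. It is not OCRAD's actual
code: the Box struct, pixels_per_blob() and too_many_blobs() are
hypothetical names, the exact form of the comparison is an assumption
inferred from the "b.size() / 400U" value printed above, and box
coordinates are assumed to be inclusive (which matches the sizes in the
log).

```cpp
#include <cstdint>

// Hypothetical stand-in for a rectangle; all four coordinates are inclusive.
struct Box {
    std::int64_t left, top, right, bottom;
    std::int64_t width() const { return right - left + 1; }
    std::int64_t height() const { return bottom - top + 1; }
    std::int64_t size() const { return width() * height(); }  // area in pixels
};

// The "factor" from the debug output: area of the box divided by the
// number of blobs found inside it (integer division, as in the log).
std::int64_t pixels_per_blob(const Box& b, std::int64_t blobs) {
    return blobs > 0 ? b.size() / blobs : b.size();
}

// Assumed form of the test: the box is treated as an image (and its blobs
// dropped) when it holds more blobs than size() / 400, which is roughly the
// same as saying the factor is below 400.
bool too_many_blobs(const Box& b, std::int64_t blobs,
                    std::int64_t min_pixels_per_blob = 400) {
    return blobs > b.size() / min_pixels_per_blob;
}
```

With the values from the log, pixels_per_blob() gives 124 for the
photograph and 145 and 93 for the two boxes of the Excel page, and
too_many_blobs() is true in all three cases. A single fixed threshold
therefore cannot tell these two pages apart.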
Tilman