bug-ocrad


From: Tilman Hausherr
Subject: Re: [Bug-ocrad] The function ignore_wide_blobs() doth ignore too much, methinks
Date: Mon, 30 Aug 2010 19:50:17 +0200

On Sat, 28 Aug 2010 17:17:31 +0200, Antonio Diaz Diaz wrote:

>Hello Tilman,
>
>Tilman Hausherr wrote:
>> I researched the issue why, for some images with tables and grey (noisy)
>> areas, OCRAD returns no text at all, although some of the texts are in
>> clean white areas. I was able to focus on a part in ignore_wide_blobs(),
>> which apparently decides about whether a wide blob is an "image" (I
>> assume you mean a photograph) or a frame. In my case, the function makes
>> a "wrong" decision and then completely deletes blobp_vector.
>
>Did you try the "--layout" or "--cut" options?

--layout made no difference. --cut (with individual settings)
can't be used because I can't set individual parameters for each image.

>> Commenting out the "if" line does solve the problem with the test image,
>> obviously - but what are the risks? Getting a lot of useless output? Or
>> losing on speed?
>
>Getting a lot of useless output. A photograph can produce thousands of 
>noise blobs.

hmm...

I first tested my "change" with my test set of about 70 pages. 

With the change, one page has more dictionary hits than previously; 
three pages that got no dictionary hits at all without the change now
have some.

Then I tested in production. I generally got better results (many
pages now have useful output that didn't before), with one exception.
One of the images was a huge b/w photograph and thus, from an OCR point
of view, a huge amount of noise. The OCR needed several minutes (!). So,
although I can't read your mind, I guess that the "if" statement
was meant as a safeguard against exactly that. However, that
safeguard also prevents the OCR of printed Excel tables with grey
background cells. So instead of making an assumption about individual
areas, OCRAD ignores the whole file.

Test with photograph:

file type is P4
file size is 4667w x 6222h
TH: blobs = 232481, b[0,0,4666,6221].size() = 29038074, b.size() / 400U = 72595
TH: factor: 124


Test with printed Excel table with text and a few grey cells:

file type is P4
file size is 1653w x 2338h
TH: blobs = 26566, b[0,0,1652,2337].size() = 3864714, b.size() / 400U = 9661
TH: factor: 145
TH: blobs = 26557, b[182,191,1468,2114].size() = 2476188, b.size() / 400U = 6190
TH: factor: 93


Both files would get dumped because the "factor" (b.size() divided by
the number of blobs) is < 400. However, the second file does have text.
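
To make the arithmetic behind the debug lines above easy to check, here is a
small Python sketch (not OCRAD code). It assumes that the test in
ignore_wide_blobs() amounts to comparing the number of blobs with
b.size() / 400U; the actual condition is not quoted in this message, and the
function names below are made up for illustration.

```python
# Sketch only: reproduces the numbers from the "TH:" debug lines.
# ASSUMPTION: the check compares the blob count with area / 400
# (integer division), which is not confirmed by the OCRAD source here.

def area_per_blob(area, blobs):
    """The "factor" from the debug output: pixels of area per blob."""
    return area // blobs

def would_be_ignored(area, blobs, threshold=400):
    """True if there are more blobs than area / threshold allows."""
    return blobs > area // threshold

# (label, b.size(), blobs) taken from the debug output
samples = [
    ("photograph, whole page", 29038074, 232481),
    ("Excel table, whole page", 3864714, 26566),
    ("Excel table, inner box", 2476188, 26557),
]

for label, area, blobs in samples:
    print(label,
          "limit =", area // 400,
          "factor =", area_per_blob(area, blobs),
          "ignored =", would_be_ignored(area, blobs))
```

Under that assumption all three rectangles are rejected, and the factor of
the table (145 and 93) lies on either side of the factor of the photograph
(124), so lowering the threshold alone would not separate the two cases.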

Tilman



