Subject: [Swftools-common] pdf2swf textSnapshots in OCR'ed PDF files
Date: Mon, 3 Aug 2009 16:47:32 -0700

I'm writing a viewer for pdf2swf-produced SWF files that allows
searching for text within the SWF. I used the example provided here -
- and was able to successfully search for and highlight matching text
in SWFs created from digital PDF documents (i.e. PDFs created from
other formats such as MS Word).

I wanted to do the same for PDF documents created with a scanner. So I
took the scanned document, dumped the JPEG pages using xpdf, and ran
Cuneiform OCR with the hocr2pdf script to create a searchable PDF. The
searchable PDF works as expected when opened in Adobe Reader: search
highlighting and text select/copy both work.
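
A minimal sketch of the kind of per-page OCR pipeline described above,
for anyone trying to reproduce the input. The command-line flags shown
are assumptions about typical usage of pdfimages (xpdf), cuneiform and
hocr2pdf, not necessarily the exact commands that were run, and all
file names and helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def dump_images_cmd(pdf, prefix):
    # Assumption: xpdf's pdfimages, where -j writes embedded JPEGs as .jpg files.
    return ["pdfimages", "-j", str(pdf), str(prefix)]


def ocr_cmd(image, hocr):
    # Assumption: cuneiform writes hOCR output when given "-f hocr".
    return ["cuneiform", "-f", "hocr", "-o", str(hocr), str(image)]


def hocr2pdf_cmd(image, out_pdf):
    # Assumption: hocr2pdf reads the hOCR text on stdin and takes the
    # page image via -i and the output PDF via -o.
    return ["hocr2pdf", "-i", str(image), "-o", str(out_pdf)]


def page_commands(image):
    # Derive the hypothetical per-page file names from the image name.
    image = Path(image)
    hocr = image.with_suffix(".hocr")
    pdf = image.with_suffix(".pdf")
    return ocr_cmd(image, hocr), hocr2pdf_cmd(image, pdf), hocr


def run_page(image):
    # Run OCR, then build a one-page searchable PDF (image + hidden text).
    ocr, to_pdf, hocr = page_commands(image)
    subprocess.run(ocr, check=True)
    with open(hocr, "rb") as hocr_file:
        subprocess.run(to_pdf, stdin=hocr_file, check=True)
```

Calling run_page() once per dumped JPEG would give one searchable PDF
per page; merging those pages and running pdf2swf on the result is left
out of the sketch.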

However, when I load it in my viewer, the text highlights are not
shown, even though the TextSnapshot for each frame in the MovieClip
contains the OCR'ed text, including the correct font metrics/bounding
boxes. When I set the alpha of the movie clip to <1, the highlight
shows up correctly (albeit with spacing between characters, probably
due to inaccurate font metrics).

Is there anything different in the way pdf2swf creates an SWF when it
is run on a searchable scanned PDF (JPEGs + embedded text)? How can I
make the search highlighting work without having to reduce the alpha
value of the displayed document?


