Re: lynx-dev Converting PDF to text (was Re: Is Lynx still practical?)

From: Martin McCormick
Subject: Re: lynx-dev Converting PDF to text (was Re: Is Lynx still practical?)
Date: Tue, 13 Jun 2000 14:38:44 -0500

        It should be possible to retrieve the text from a PDF
document if there is ASCII text in it to begin with.  The PDF
format lets text be stored in compressed content streams that
use PostScript-like operators, but there is no guarantee that
this is what happened when the document was originally encoded.
Some PDFs are just scanned images of pages, which look fine to
the eye but contain no actual words or letters.  Those images
must be extracted and then run through an OCR program.
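
        As a rough sketch of the unpacking step, the Python below
looks for stream objects, tries to inflate each one with zlib, and
pulls out the literal strings it finds inside text objects.  It
assumes the streams use Flate compression and that the fonts use
a plain ASCII-compatible encoding; LZW streams, hex strings,
escape sequences, and custom font encodings are not handled.  The
function names are my own, not part of any PDF tool.

```python
import re
import sys
import zlib

# Body of a stream object: everything between "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Literal strings "(...)"; backslash escapes are skipped over, not decoded.
STRING_RE = re.compile(rb"\((?:\\.|[^\\()])*\)")


def unpack_streams(pdf_bytes):
    """Yield the inflated body of every stream that zlib can decompress."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj ignores the end-of-line bytes before "endstream".
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data: an image, LZW, encrypted, and so on


def extract_text(pdf_bytes):
    """Return the literal strings found in streams that contain text objects."""
    chunks = []
    for body in unpack_streams(pdf_bytes):
        if b"BT" not in body:  # BT opens a text object; without it, no text
            continue
        for literal in STRING_RE.findall(body):
            chunks.append(literal[1:-1].decode("latin-1"))
    return chunks


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as handle:
        found = extract_text(handle.read())
    if found:
        print("\n".join(found))
    else:
        print("no text found; the pages may be images that need OCR",
              file=sys.stderr)
```

Run against a file name given on the command line, it writes
whatever strings it finds to standard output, and an empty result
is a hint that the document is image-only.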

        The other thing I have noticed about decoding the ASCII
out of PDF documents is that there is an ambiguity you had
better be ready for regarding tables.  It is hard to decide
mechanically how to read a table because the human reader may
want to read across or down, depending upon what he or she
wants at the moment.  Tables tend to become unwound and read as
one very long column instead of three or four short ones.  A
body of text laid out in several columns, like a newspaper or
magazine article, would be nonsense if read across, so there
will always be interpretation problems.  Beyond that, all that
is needed to get ASCII text is an engine that unpacks the
compressed data and sends it to standard output, if there is
any to send.
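
        The table ambiguity described above can be shown with a
small Python sketch.  The table contents and the function names
are made up for illustration; the point is only that the same
cells come out in a different order depending on whether the
reader goes across or down.

```python
def read_across(rows):
    """Flatten a table row by row, the way a table is usually read."""
    return [cell for row in rows for cell in row]


def read_down(rows):
    """Flatten a table column by column, the way newspaper columns are read."""
    return [cell for column in zip(*rows) for cell in column]


# Hypothetical three-row, two-column table.
table = [
    ["Part", "Qty"],
    ["bolt", "4"],
    ["nut", "8"],
]

print(" ".join(read_across(table)))  # Part Qty bolt 4 nut 8
print(" ".join(read_down(table)))    # Part bolt nut Qty 4 8
```

An extractor has to pick one of these orders without knowing
which one the reader wants, which is why tables tend to come out
unwound.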

Martin McCormick WB5AGZ  Stillwater, OK 
OSU Center for Computing and Information Services Data Communications Group
