Re: lynx-dev Converting PDF to text (was Re: Is Lynx still practical?)

From: David Woolley
Subject: Re: lynx-dev Converting PDF to text (was Re: Is Lynx still practical?)
Date: Wed, 14 Jun 2000 08:27:51 +0100 (BST)

> will always be interpretation problems.  To get ASCII text, all
> that is needed is an engine to unpack the compressed data and
> send them to standard output if there are any to send.

It's not quite as easy as this, as some word processors (MS Write)
go overboard with micro-spacing and, instead of outputting a line at
a time, output a character at a time, each individually positioned.
This produces bloated PostScript and therefore bloated PDF.  It also
means that the text extraction process has to guess where the word
breaks are.
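
To illustrate the guessing involved, here is a minimal sketch in Python.
It assumes the extractor has already recovered, for one line of text, each
character with its horizontal position and advance width.  The Glyph
tuple, the join_glyphs name and the 0.3 gap ratio are all invented for
this example and are not taken from any real tool.

```python
from typing import List, Tuple

# Hypothetical representation: (character, x position, advance width),
# all in the same units and all on one line of text.
Glyph = Tuple[str, float, float]


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.3) -> str:
    """Rebuild a line of text from individually positioned characters.

    A space is guessed wherever the gap between one glyph's right edge
    and the next glyph's left edge exceeds gap_ratio times the mean
    glyph width.  The threshold is an assumption, not a standard value.
    """
    if not glyphs:
        return ""
    # Characters are not guaranteed to arrive in reading order.
    ordered = sorted(glyphs, key=lambda g: g[1])
    mean_width = sum(width for _, _, width in ordered) / len(ordered)
    out = [ordered[0][0]]
    for (_, prev_x, prev_width), (char, x, _) in zip(ordered, ordered[1:]):
        gap = x - (prev_x + prev_width)
        if gap > gap_ratio * mean_width:
            out.append(" ")  # gap is wide enough to look like a word break
        out.append(char)
    return "".join(out)


line = [("t", 0.0, 5.0), ("o", 5.2, 5.0), ("b", 13.0, 5.0), ("e", 18.1, 5.0)]
print(join_glyphs(line))
```

Any fixed threshold like this will misjudge some documents, since heavy
micro-spacing can make gaps within a word as wide as gaps between words
elsewhere; that is why it can only ever be a guess.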

PDF itself was designed with the possibility of extracting plain
text in mind, and documents which are photographs can also include an
OCRed version.  It can deal with the micro-spacing problem by specifying
stretch factors outside of the text.  Ghostscript can't automatically
convert individually positioned text into that form, even if Distiller
can, but any tool that does has the same problem as one extracting text
from a PDF file with individually positioned characters.

The latest PDF version actually has extensions for showing logical
document structure.  Generally, in my view, PDF now can do a better
job of compromising between the demands of the budding graphic artists
who seem to design web sites, and accessibility, if used properly.
Unfortunately, they lost the marketing battle long ago, and one needs
to appreciate the technicalities and the intended true nature of HTML
to realise that HTML is not the best match for many intended uses, and
to use other tools properly (most people just convert PostScript to PDF
without thinking).  Very few HTML authors know what HTML was really about.

(I think PDF has always been a more appropriate tool for commercial web
pages (unless one really believes that commercial web authors care about
accessibility).  The weird thing, though, is that I find the useful bits
of web sites are in PDF, because the people writing the white papers etc.
use conventional tools and convert to PDF, even though those are the
parts that often best fit the aims of HTML, whereas the no-content
marketing stuff that one has to fight through to get there tends to
be written in HTML, where the authors have to abuse it in many ways in
order to achieve the precise GUI rendering that they crave.)
