Re: Problems with .PDFPIC caused by pdfinfo


From: Deri
Subject: Re: Problems with .PDFPIC caused by pdfinfo
Date: Tue, 12 Oct 2021 16:12:56 +0100

On Tuesday, 12 October 2021 11:49:23 BST Keith Marshall wrote:
> Ref: https://savannah.gnu.org/bugs/index.php?55107
> 
> On 01/10/2021 01:10, Deri wrote:
> > I did try to help Keith with this previously, but I was mildly "told
> > off" (on list) for sending my help off list. I've learned my lesson.
> 
> Thanks, Deri.
> 
> IIRC, the reason for the "mild telling off" was that, by replying off
> list, you denied us the potential benefit from other list members who
> may have been willing to review the issue, and so contribute to the
> debugging effort.  I am pleased that, on this occasion, you have kept
> this on-list; even if the majority of list members aren't sufficiently
> interested to assist, there may be some who will, and any assistance
> will be gratefully accepted, and very much appreciated.
> 

Hi Keith,

I just assumed the best person for debugging faults in the code would probably 
be you rather than the rest of us. You may receive other "problem pdfs" from 
other members, but the debugging effort is likely to be yours alone.

What I did find useful while debugging the PDF parser in pdfbb/gropdf was the 
Ghent PDF Output Suite (which has some very esoteric examples; sorry, it is 
144 MB!), see:

http://gwg.org/gos5/

> > I attach a couple of pdfs with which the current code has problems.
> > 
> > Picture.pdf
> > 
> > [derij@pip groff-psbb]$ ./psbb ../../Picture.pdf
> > ../../Picture.pdf: bounding box = (0,0)..(0,0)
> 
> This is caused by the nested /Group dictionary, within the /Page object;
> the current groff-psbb lexer is confused by it, and ends up in the wrong
> state, when it eventually encounters the /MediaBox key.  Adding one more
> rule (for "<<") to the PDF dictionary state scanning model gets us to:
> 
>    $ ./psbb Picture.pdf
>    Picture.pdf: bounding box = (0,0)..(592,842)
> 
> > [derij@pip groff-psbb]$ pdfbb ../../Picture.pdf
> > Processing '../../Picture.pdf'
> > ../../Picture.pdf: CropBox: 162.085,623.346,340.825,716.546  (178.74,93.2)
> 
> The psbb lexer doesn't handle the /CropBox key.  Should it?  Should
> /CropBox override any extant /MediaBox?

If you view Picture.pdf with a PDF viewer you will see a dumbbell shape. This 
is in fact the area of the A4 page described by the CropBox, not the complete 
A4 page described by the MediaBox. If the MediaBox dimensions were given to 
PDFPIC, the included picture would be the wrong shape. Current gropdf honours 
the various "boxes" in this order:

ArtBox TrimBox BleedBox CropBox MediaBox

(No idea if this is "correct", but the viewers I have tested definitely 
prioritise CropBox over MediaBox; you will have to experiment.)

You would also have to be careful: a MediaBox at the group level could be 
overridden by a CropBox at the page level, I assume.
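
To make the lookup concrete, here is a minimal sketch of that priority order 
in Python. It is not gropdf's code: the plain dicts standing in for parsed 
page objects, the "Parent" link and the function names are all hypothetical. 
It also assumes, from my reading of the PDF specification, that only CropBox 
and MediaBox are inherited from a parent node.

```python
# Priority order as listed above; whether it is "correct" is untested.
BOX_PRIORITY = ("ArtBox", "TrimBox", "BleedBox", "CropBox", "MediaBox")

# Assumption: only these two boxes are inheritable from a parent node.
INHERITABLE = {"CropBox", "MediaBox"}


def effective_box(page):
    """Return (name, [llx, lly, urx, ury]) for the first box found, or None.

    `page` is a plain dict standing in for a parsed page dictionary; an
    optional "Parent" entry points at the enclosing node (hypothetical
    representation, for illustration only).
    """
    for name in BOX_PRIORITY:
        node = page
        while node is not None:
            if name in node:
                return name, node[name]
            if name not in INHERITABLE:
                break  # non-inheritable boxes are looked up on the page only
            node = node.get("Parent")
    return None


def box_size(box):
    """Return (width, height) of a [llx, lly, urx, ury] box."""
    llx, lly, urx, ury = box
    return urx - llx, ury - lly
```

With a MediaBox of 0,0,595,842 on the parent and the CropBox from Picture.pdf 
on the page, this picks the CropBox, which is what the viewers appear to do.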

> > croptest.pdf
> > 
> > [derij@pip groff-psbb]$ ./psbb ../../croptest.pdf
> > psbb:t-psbb (t-psbb.cpp):193: PDF file '../../croptest.pdf' is
> > malformed; no trailer found
> 
> Since croptest.pdf lacks both a trailer dictionary, and a free-standing
> cross reference table, (both are hidden away within a /XRefStm object,
> with a compressed cross reference table), croptest.pdf is _incompatible_
> with applications which do not support this feature of PDF-1.5 (and
> later).  The groff-psbb prototype implementation (currently) does not
> offer this level of PDF-1.5 support; thus, this behaviour is expected.

Gropdf/pdfbb now supports import of these later PDF versions (as does pdfinfo, 
which PDFPIC currently uses), so it is important that whatever method is used 
to report the image dimensions back to PDFPIC is consistent with what a user 
would see when viewing the PDF in a viewer.
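
As a rough illustration of the difference between the two file layouts, the 
sketch below classifies a PDF by what sits at its final startxref offset: the 
"xref" keyword of a classic table, or an indirect object holding a 
cross-reference stream (PDF 1.5 and later). The function name is hypothetical 
and this is not how psbb or pdfbb do it; it only shows the check a parser 
needs before it goes looking for a trailer.

```python
import re


def xref_kind(data):
    """Classify how a PDF's cross-reference data is stored.

    `data` is the whole file as bytes. Returns "table" for a classic
    xref table, "stream" for a cross-reference stream object, or
    "unknown" when no usable startxref offset is found.
    """
    matches = re.findall(rb"startxref\s+(\d+)", data)
    if not matches:
        return "unknown"
    offset = int(matches[-1])  # the last startxref is the one in effect
    tail = data[offset:offset + 32].lstrip()
    if tail.startswith(b"xref"):
        return "table"
    if re.match(rb"\d+\s+\d+\s+obj", tail):
        return "stream"
    return "unknown"
```

A file that reports "stream" has no free-standing trailer keyword, so a 
"no trailer found" diagnostic from a pre-1.5 parser is what one would expect.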

> > [derij@pip groff-psbb]$ pdfbb ../../croptest.pdf
> > Processing '../../croptest.pdf'
> > ../../croptest.pdf: MediaBox: 0,0,595,842  (595,842)
> 
> Well, this agrees with the result I've shown above, for Picture.pdf,

Croptest.pdf is an A4 page written as a PDF 1.7 file, but the included image 
(three times) is the CropBox from Picture.pdf. So the dimensions reported by 
pdfbb are correct, it's an A4 page, but not because Picture.pdf is wrongly 
reported as A4 by psbb.

I have attached a new version called croptest-2.pdf, which psbb successfully 
reports as A4 (because this time it is written in PDF 1.4). It shows that 
groff can embed a PDF 1.7 image (croptest.pdf) which itself contains three PDF 
1.5 images (Picture.pdf). I also enclose the troff files which created the two 
PDFs; they show that you don't need to use PDFPIC if you are concerned about 
using unsafe mode in groff. The only thing PDFPIC does is calculate the 
vertical movement needed after the call to \X'pdf: pdfpic' to continue output 
after the image, which is fairly easy to do manually given the information 
from pdfinfo.
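
The arithmetic for that vertical movement is just aspect-ratio scaling. A 
minimal sketch, assuming the box is given in points as llx, lly, urx, ury and 
the image is scaled to a chosen width (the function names are hypothetical, 
and this is not the PDFPIC macro itself):

```python
def scaled_height(box, target_width):
    """Height of the image once scaled to target_width, keeping its shape.

    `box` is [llx, lly, urx, ury]; the result is in the same unit as
    target_width.
    """
    llx, lly, urx, ury = box
    width = urx - llx
    height = ury - lly
    if width <= 0 or height <= 0:
        raise ValueError("degenerate box")
    return target_width * height / width


def spacing_request(box, target_width_pt):
    """Return a troff .sp request (in points) that skips past the image."""
    return ".sp %.3fp" % scaled_height(box, target_width_pt)
```

For an A4 MediaBox of 0,0,595,842 placed at half width (297.5 points), this 
gives a height of 421 points, so the document would need ".sp 421.000p" after 
the image.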

> with groff-psbb modified to properly handle nested dictionaries; some
> further (non-trivial) development effort will be required, to support
> concealment of trailer dictionaries and cross reference tables within
> /XRefStm objects.

There are several options which would address this problem, i.e. the 
non-portability of grep and the desirability of avoiding groff unsafe mode.

A) Replace grep with sed/awk (still requires unsafe mode).

B) Use psbb (requires "non-trivial development").

C) Use pdfbb (requires hook in input.cpp to call pdfbb and return results).

D) Convert pdfbb to be a pre-gropdf (i.e. a preprocessor like pre-grohtml) 
which would look for .PDFPIC and replace it with the appropriate calls to 
\X'pdf: pdfpic' and add vertical space with .sp.

(A) is obviously the easiest and quickest; (C) and (D) are not too much work, 
since the parser required is already in use.

Cheers 

Deri

Attachment: croptest-2.pdf
Description: Adobe PDF document

Attachment: croptest-2.trf
Description: Text document

Attachment: croptest.trf
Description: Text document
