groff
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Help needed with pdfroff and suppression nodes


From: G. Branden Robinson
Subject: Re: Help needed with pdfroff and suppression nodes
Date: Fri, 30 Jul 2021 10:20:50 +1000
User-agent: NeoMutt/20180716

Hi, Keith!

Thank you for the fast response.  I hope I haven't alarmed you!  :-O

At 2021-07-29T21:57:49+0100, Keith Marshall wrote:
> Hi Branden,
> 
> On 29/07/2021 10:45, G. Branden Robinson wrote:
> > As noted below, I caused a regression a couple of days ago, but one that
> > was hard to see since it didn't break the build.  The table of contents
> > was not getting relocated in pdfmark.pdf.  The only evidence of this (if
> > you didn't so something crazy like view the file) was a stray line in
> > the build:
> > 
> > pdfroff-option:set toc_relocation=enabled
> > 
> > This is emitted by spdf.tmac, and normally, apparently, something eats
> > it.
> 
> Yes, it is emitted (on stderr) when the document source invokes the TC
> macro, to flush out the TOC diversion to the stdout stream, at the end
> of input, (as is conventional in ms document processing).  pdfroff has
> captured that stderr stream, during its pdfmark reference collection
> phase, and subsequently interprets it, by evaluating:
> 
>    eval `$SED -n '/^ *pdfroff-option:set */s///p' $WRKFILE`
> 
> which becomes equivalent to the shell variable assignment:
> 
>    toc_relocation=enabled
> 
> This then prevents pdfroff from automatically activating the effect of
> its --no-toc-relocation option.

Yup, this all makes sense.  I'm familiar with the triple-pass
convention.

> > So if you see it, it indicates the TOC relocation failed--at least
> > it did in this case.
> 
> I think your conclusion is incorrect here — at least to the extent
> that the appearance of this stderr output isn't the *cause* of the
> problem, but is rather a symptom of a more serious underlying problem,
> which you have introduced; its appearance suggests that pdfmark is no
> longer being passed the reference data, which should have been
> captured during the aforementioned collection phase, so never sees the
> condition it expects for suppression of reference data output to
> stderr.

I think I'm following you.

> > Now that the error condition is caught and troff doesn't exit
> > unexpectedly, the regression is fixed.
> 
> Are you sure about that?  Why should troff be exiting unexpectedly
> anyway?

In this case it's because of a bug I introduced.  But (GNU) troff was
not bug-free before.  No matter how much confidence we have in a program
or other software module, I think we should code in preparation for
failure of the components we depend upon.

>  As Tadziu has already noted, pdfmark uses \O escapes to mark
> anchor points; specifically, it brackets each pdfhref link with:
> 
>    \O1\Z'|'\O2\c
> 
> at both the start and the end of the link region; these result in
> grohtml-info:page references being written to stderr, (specifically
> emitted by the \O2 escape), from whence pdfroff converts them to:
> 
>    .pdfhref Z ...
> 
> references, which are then inserted into the input stream, for all
> subsequent processing cycles.  If those references aren't produced, in
> an *exactly* matching number to the number of \O2 markers, the pdfroff
> never knows when to stop emitting reference data on stderr, and its
> behaviour *will* become undefined.

Okay, the foregoing is extremely helpful.  Our documentation of the \O
escapes is seriously deficient IMO.  groff(7) covers only \O0 and \O1,
and discourages further investigation even of those with the claim
"[m]ainly for internal use".

groff_diff(7) is a little bit better, and now that you've provided me
with a use case, I find its description of \O2 more intelligible.

The internal design and comments of \O on the troff side also frustrate
me.  A seemingly Boolean variable, `is_on`, can take three values: 0
(off), 1 (on), ... and "2", which is not explained in any .h or .cpp
files but seems downright magical.

Now I have a better idea what spell is being cast.  I have a notion that
a separate flag should be used to determine whether bounding box
information is to be written.  Maybe that already exists as
`emit_limits`.  (There's also the frustrating `is_special` member
variable, which has the same name as one in the font class but,
probably, a completely different purpose.)

But first I'll try to clean up the mess I made.

> > (Should pdfroff be checking groff's exit status?)
> 
> I don't know what it would do with it — apart from possibly just
> aborting early.

That's what I would do, and as a user what I would like it to do.  (I
admit that not everyone shares my "fail fast, fail hard" systems design
preferences.)

> There is no image_filename involved; nor is there any need for one.
> The \O1 and \O2 escapes *must* work (as before), *regardless* of
> whether an image_filename is available, or not; (in the pdfmark
> context, it is unlikely that there will ever be one).

Yup, that's clear to me now.

> I guess you need to rework (or revert) your change, to ensure that \O
> escapes continue to work, with or without an image_filename.

Yes.

> Why do you suspect that the lack of an image_filename would cause a
> SIGSEGV abort?

Because it was declared as a null pointer,

        static const char *last_image_filename = 0;

...every path through the code needs to ensure it's assigned a valid
value before being dereferenced.

The earlier code _did_ do this, by assigning it the address of a string
terminator literal[2] if the pointer was null.  But I didn't understand
its purpose; it was unmotivated by a comment and, more importantly, the
very notion of bounding box information being useful for any purpose
other than cropping an embedded image file was unknown to me.

Apparently, it turns out that you can infer it from this test:

        if (is_on == 2) {

> I have never seen any evidence of that occurring, with pdfroff.

No, I think now that it wouldn't have.  But there was a potential
problem with image file name handling.

        sprintf(name, last_image_filename, last_image_id);

Use of a user-controlled format string is a known security problem[3].

While pre-grohtml takes steps to prevent image file name stems or
directories from containing conversion specifiers supplied as input to
command-line arguments[4], nothing sanitizes or rejects conversion
specifiers supplied as escape arguments in groff language input (as far
as I can see), so one could do something \O[5l%nefarious.png].  I filed
this as a Savannah ticket[5].

I apologize for the disruption and will set things right soon.  Thanks
for the feedback; I now have an idea of another problem to look for
(missing or messed-up link targets in pdfroff output even if TOC
relocation is successful).

Regards,
Branden

[1] 
https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/node.cpp?id=9b7cb47f760e48d474353eb2362e4b2c64a08859#n4056
[2] 
https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/node.cpp?id=9b7cb47f760e48d474353eb2362e4b2c64a08859#n4102
[3] https://owasp.org/www-community/attacks/Format_string_attack
[4] 
https://git.savannah.gnu.org/cgit/groff.git/tree/src/preproc/html/pre-html.cpp#n548
[5] https://savannah.gnu.org/bugs/?60977

Attachment: signature.asc
Description: PGP signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]