Re: 1.23 prints some strange error


From: G. Branden Robinson
Subject: Re: 1.23 prints some strange error
Date: Thu, 26 Oct 2023 09:35:07 -0500

At 2023-10-25T16:20:27+0200, Walter Alejandro Iglesias wrote:
> What you did above is not the step by step way I posted to reproduce
> the bug.  Of course it won't be helpful if you overlook it.

You've already gotten what I would have thought to be a sufficient
explanation of the diagnostic messages you saw.

1.  GNU troff (the formatter program) doesn't accept UTF-8 input;
2.  Your list of hyphenation exceptions (`hw` requests) is formatted in
    UTF-8;
3.  Your document is using `mso` rather than `so` requests to load the
    list of hyphenation exceptions;
4.  The soelim(1) program does not operate on `mso` requests (nor should
    it, in my opinion); therefore,
5.  Your input confuses the formatter, producing diagnostics.

Why these exact diagnostics?

Well, let's have a look at the first.

$ nroff -M. ./doc.tr
troff:./list.tr:1: error: expected ordinary or special character, got an escaped '%'
$ head -n 1 list.tr
.hw a-hí
$ hd list.tr
00000000  2e 68 77 20 61 2d 68 c3  ad 0a 2e 68 77 20 61 2d  |.hw a-h....hw a-|
00000010  c3 b1 6f 0a 2e 68 77 20  c3 a1 72 2d 62 6f 6c 0a  |..o..hw ..r-bol.|
00000020  2e 68 77 20 63 75 2d 62  72 c3 ad 2d 61 0a 2e 68  |.hw cu-br..-a..h|
00000030  77 20 65 2d 74 c3 a9 2d  72 65 2d 6f 0a 2e 68 77  |w e-t..-re-o..hw|
00000040  20 63 61 2d 6d 69 c3 b3  6e 0a 2e 68 77 20 c3 ba  | ca-mi..n..hw ..|
00000050  2d 74 65 2d 72 6f 0a 2e  68 77 20 70 69 6e 2d 67  |-te-ro..hw pin-g|
00000060  c3 bc 69 2d 6e 6f 0a                              |..i-no.|
00000067

GNU troff reads line 1 of list.tr, interpreting it as ISO Latin-1.  The
bytes of interest are therefore 0xc3 and 0xad.

C3 is "LATIN CAPITAL LETTER A WITH TILDE".

AD is "SOFT HYPHEN".
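For anyone who wants to verify the decoding, here is a minimal Python sketch (not part of groff) that reads the two bytes at offsets 7 and 8 of the hex dump first as UTF-8 and then as ISO Latin-1. The helper `names` is just a name chosen for this illustration.

```python
import unicodedata

# The two bytes at offsets 7 and 8 of list.tr, per the hex dump.
raw = b"\xc3\xad"

# Read as UTF-8 (what the file's author intended): one character.
as_utf8 = raw.decode("utf-8")

# Read as ISO Latin-1: two characters, one per byte.
as_latin1 = raw.decode("latin-1")


def names(text):
    """Return the Unicode name of each character in text."""
    return [unicodedata.name(ch) for ch in text]


print(names(as_utf8))    # ['LATIN SMALL LETTER I WITH ACUTE']
print(names(as_latin1))  # ['LATIN CAPITAL LETTER A WITH TILDE', 'SOFT HYPHEN']
```

The same two bytes are one accented letter under UTF-8 and two unrelated characters under Latin-1.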

groff_char(7) explains what the formatter does with the latter.

   Eight‐bit encodings and Latin‐1 supplement
       ISO 646 is a seven‐bit code encoding 128 code points; eight‐bit
       codes are twice the size.  ISO 8859‐1 and code page 1047
       allocated the additional space to what Unicode calls “C1
       controls” (control characters) and the “Latin‐1 supplement”.  The
       C1 controls are neither printable nor usable as groff input.

       Two Latin‐1 supplement characters are handled specially on input.
       troff never produces them as output.

       NBSP   encodes a no‐break space; it is mapped to \~, the
              adjustable non‐breaking space escape sequence.

       SHY    encodes a soft hyphen; it is mapped to \%, the hyphenation
              control escape sequence.

The formatter does not expect to see a hyphenation control escape
sequence inside the definition of a hyphenation exception, and it
complains if it gets one.

That is why you got the error message you did.
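To make the chain of events concrete, here is a rough Python sketch of the input mapping that groff_char(7) describes. It is a hypothetical illustration only, not GNU troff's actual code; `SPECIAL_INPUT` and `as_formatter_sees` are names invented for the example.

```python
# Hypothetical illustration of the input mapping described in
# groff_char(7); this is not how GNU troff is implemented.
SPECIAL_INPUT = {
    0xA0: r"\~",  # NBSP -> adjustable non-breaking space escape
    0xAD: r"\%",  # SHY  -> hyphenation control escape
}


def as_formatter_sees(line: bytes) -> str:
    """Render each byte as a Latin-1 character, except NBSP and SHY."""
    return "".join(SPECIAL_INPUT.get(b, chr(b)) for b in line)


# Line 1 of list.tr as stored on disk (UTF-8 bytes for ".hw a-h" + i-acute).
first_line = b".hw a-h\xc3\xad"

print(as_formatter_sees(first_line))
```

The result ends in `\%`, which corresponds to the "escaped '%'" that the diagnostic complains about.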

Hence my advice: either maintain the files you `mso` in Latin-1 (or
ASCII), or go ahead and maintain them in UTF-8, but as ".in" files that
your Makefile converts, using preconv, to input GNU troff will accept.

list.tr: list.tr.in
	preconv -e utf-8 $< > $@
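As a rough picture of what that conversion step buys you, the Python sketch below approximates the character rewriting: ASCII passes through and everything else becomes a `\[uXXXX]` escape that the formatter can read. The function `to_groff_input` is a hypothetical stand-in, and the real preconv does more (encoding detection, among other things), so treat the exact output format as an assumption to check against preconv(1).

```python
def to_groff_input(text: str) -> str:
    """Hypothetical stand-in for the character conversion preconv performs."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)                     # ASCII passes through unchanged
        else:
            out.append(f"\\[u{ord(ch):04X}]")  # non-ASCII -> \[uXXXX] escape
    return "".join(out)


# Line 1 of list.tr.in, stored as UTF-8.
utf8_line = b".hw a-h\xc3\xad"

print(to_groff_input(utf8_line.decode("utf-8")))
```

The point is that the converted file contains only ASCII bytes, so nothing in it can be misread as a Latin-1 soft hyphen.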

GNU troff does not reject code points A0-FF as invalid because they
aren't invalid; every single one might be found in a valid Latin-1
document.  The formatter _does_ reject code points 80-9F as input.  That
might not come up when inadvertently giving the formatter (valid) UTF-8
input, however; I haven't done the arithmetic, but it seems possible to
me that some or all of these would be treated as "overlong encodings" of
Basic Latin code points.

See, e.g., "Canonicalization of Non-Shortest Form UTF-8".

https://websec.github.io/unicode-security-guide/character-transformations/
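The arithmetic is quick to check. In UTF-8, every byte after the first in a multi-byte sequence is a continuation byte in the range 0x80-0xBF, so bytes 0x80-0x9F can turn up in perfectly valid UTF-8 without any overlong encoding being involved. Whether a given file contains them depends on which characters it uses. The Python sketch below uses a helper named `bytes_in_c1_range` (a name invented here) to test the accented letters from list.tr and, for contrast, U+00C0.

```python
def bytes_in_c1_range(text: str) -> list:
    """Return the UTF-8 bytes of text that fall in 0x80-0x9F."""
    return [b for b in text.encode("utf-8") if 0x80 <= b <= 0x9F]


# The accented letters in list.tr: i-acute, n-tilde, a-acute, e-acute,
# o-acute, u-acute, u-diaeresis. All lie in U+00E1..U+00FC, so their
# second UTF-8 byte lands in 0xA0-0xBF.
letters_in_list = "\u00ed\u00f1\u00e1\u00e9\u00f3\u00fa\u00fc"
print(bytes_in_c1_range(letters_in_list))               # []

# U+00C0 (capital A with grave) encodes as C3 80.
print([hex(b) for b in bytes_in_c1_range("\u00c0")])    # ['0x80']
```

So this particular list.tr never presents the formatter with a byte in 80-9F, but a UTF-8 file containing, say, a capital A with grave would.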

As it happens, GNU troff uses the C1 Control block (U+0080..U+009F) for
internal purposes.  That is one of the reasons it's non-trivial to
convert it to understand UTF-8 natively, an outcome pretty much everyone
desires.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h?h=1.23.0

Regards,
Branden
