Re: [Groff] ASCII Minus Sign in man Pages


From: Ingo Schwarze
Subject: Re: [Groff] ASCII Minus Sign in man Pages
Date: Thu, 4 May 2017 03:31:04 +0200
User-agent: Mutt/1.6.2 (2016-07-01)

Hi Ralph,

Ralph Corderoy wrote on Wed, May 03, 2017 at 03:51:24PM +0100:

>     -       A hyphen for text, e.g. beer-flavoured ice-cream.
>     \-      A minus sign in the current font.
>     \(mi    A minus sign in the special font.
>     \(hy    Another name for plain `-', so a hyphen for text.
>     \N'45'  Glyph 45 in the current font.

The trouble with \N'45' is that it does not have a fixed meaning
and that the resulting glyph varies wildly.

Even if you only look at groff and only at git master HEAD,
\N'45' means:

 -Tascii:   U+002D HYPHEN-MINUS
 -Tlatin1:  U+002D HYPHEN-MINUS
 -Tutf8:    U+002D HYPHEN-MINUS
 -Thtml:    U+002D HYPHEN-MINUS
 -Tcp1047:  nothing useful, a control character (ENQ, enquiry character)
 -Tps:      hyphen; that's the same character as \(hy
 -Tpdf:     hyphen; that's the same character as \(hy
 -Tdvi:     hyphen; that's the same character as \(hy
 -Tlbp:     hyphen; that's the same character as \(hy
 -Tlj4:     undefined, no character at all
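
As a rough illustration only, the list above can be written down as a
small Python table.  The names are made up for this sketch, and the
strings are merely the descriptions given above, not anything read from
groff's font description files:

```python
# Meaning of \N'45' per groff output device, transcribed from the list above.
# None stands for "undefined, no character at all".
N45_BY_DEVICE = {
    "ascii": "U+002D HYPHEN-MINUS",
    "latin1": "U+002D HYPHEN-MINUS",
    "utf8": "U+002D HYPHEN-MINUS",
    "html": "U+002D HYPHEN-MINUS",
    "cp1047": "control character (ENQ)",
    "ps": "hyphen",
    "pdf": "hyphen",
    "dvi": "hyphen",
    "lbp": "hyphen",
    "lj4": None,
}


def devices_with_meaning(meaning, table=N45_BY_DEVICE):
    """Return the sorted device names for which \\N'45' has the given meaning."""
    return sorted(dev for dev, what in table.items() if what == meaning)


if __name__ == "__main__":
    print(devices_with_meaning("U+002D HYPHEN-MINUS"))
    print(devices_with_meaning("hyphen"))
```

Only four of the ten devices give U+002D, which is the whole problem
in one line.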

While this is clearly what you want for -Tascii, -Tlatin1, -Tutf8,
and -Thtml, and in particular for -Tutf8 where it is a reasonably
wide glyph looking like a minus sign and also the right character
for copy and paste, it is dubious whether \N'45' is what you want
for -Tps, -Tpdf, -Tdvi, and -Tlbp.

First, the fact that the character number agrees with ASCII doesn't
mean much.  While the arrangement of the glyphs for the TR font of
the -Tps device is loosely based on ASCII, the codepoints for several
characters mismatch.  For example, U+0027 APOSTROPHE, which you get
with \(aq, is codepoint 8 in -Tps TR, *not* codepoint 39 as you
would expect, which is instead \(cq = U+2019 RIGHT SINGLE QUOTATION
MARK.  U+0060 GRAVE ACCENT is codepoint 146, not 96, which is instead
\(oq = U+2018 LEFT SINGLE QUOTATION MARK.  U+005E CIRCUMFLEX ACCENT
has two associated codepoints, the expected 94 (produced with ^)
and the unexpected 0 (\(ha).  So has U+007E TILDE, the expected 126
(~) and the unexpected 1 (\(ti).  It is also amusing that in the
appendix below, the character \(en transforms to seven different
glyph numbers in nine different output devices, and only one of
them is related to ASCII, which emphasizes the weak relationship
between glyph numbers (even for -Tps) and ASCII character numbers.
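
To make the mismatch easier to see, here is a small Python sketch that
compares the -Tps TR glyph codes quoted in this paragraph with the
corresponding Unicode numbers.  The table contains only the examples
mentioned above, and the function names are invented for illustration:

```python
# Each entry: (input or character name as used above,
#              Unicode code point, -Tps TR glyph code).
PS_TR_EXAMPLES = [
    (r"\(aq", 0x0027, 8),
    (r"\(cq", 0x2019, 39),
    ("GRAVE ACCENT", 0x0060, 146),
    (r"\(oq", 0x2018, 96),
    ("^", 0x005E, 94),
    (r"\(ha", 0x005E, 0),
    ("~", 0x007E, 126),
    (r"\(ti", 0x007E, 1),
]


def agrees_with_ascii(entries=PS_TR_EXAMPLES):
    """Names whose glyph code equals their Unicode (hence ASCII) number."""
    return [name for name, ucs, code in entries if ucs == code]


def ascii_slots_holding_something_else(entries=PS_TR_EXAMPLES):
    """Glyph codes below 128 that do not hold the ASCII character of that number."""
    return {code: f"U+{ucs:04X}"
            for name, ucs, code in entries
            if code < 128 and ucs != code}


if __name__ == "__main__":
    print(agrees_with_ascii())
    print(ascii_slots_holding_something_else())
```

Of the eight examples, only the plain ^ and ~ sit where ASCII would
put them.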

Then, the glyph you get for \N'45' in -Tps TR is *not* similar to
the wide glyph that you would expect for U+002D HYPHEN-MINUS.
Instead, it is a typical, short hyphen.

So, which glyph in -Tps TR represents U+002D HYPHEN-MINUS?
Asking the question that way, the full dilemma becomes obvious:
Just like in classical typography, there is *none*.

So not only does groff provide no way to request an output glyph
for "ASCII -", worse, the fonts for the important -Tps and -Tpdf
devices do not even contain such a glyph!

Consider a program that wants to copy text out of a groff-generated
PostScript or PDF document and paste it into a Unicode terminal
window.  Such a program could reasonably be expected to translate
-Tps TR codepoint 45 into U+2010 HYPHEN because that's what you get
from the unambiguous \(hy input character, and it could reasonably
be expected to translate -Tps TR codepoint 173 into U+2212 MINUS
SIGN because, as Doug kindly reminded us, \- is the minus sign in
the current font and \(mi is not in the TR font at all.  But now
we have already exhausted the glyphs in -Tps TR and there is no
glyph left that could be converted into U+002D HYPHEN-MINUS.
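
The argument can be sketched in Python as follows.  The reverse table
is purely hypothetical: it is what a viewer could reasonably do, not
what any particular PostScript or PDF viewer is known to do, and it
contains only the two glyph codes discussed above:

```python
# Hypothetical reverse mapping from -Tps TR glyph codes to Unicode,
# as a copy-and-paste program might apply it.
PS_TR_TO_UNICODE = {
    45: "\u2010",   # hyphen, what \(hy produces: U+2010 HYPHEN
    173: "\u2212",  # minus, what \- produces:   U+2212 MINUS SIGN
}


def paste(glyph_codes, table=PS_TR_TO_UNICODE):
    """Translate glyph codes to text; unknown codes become U+FFFD."""
    return "".join(table.get(code, "\ufffd") for code in glyph_codes)


def can_yield(char, table=PS_TR_TO_UNICODE):
    """True if some glyph code in the table translates to the given character."""
    return char in table.values()


if __name__ == "__main__":
    print(can_yield("\u2010"), can_yield("\u2212"), can_yield("-"))
```

With both glyph codes already taken, can_yield("-") is false: nothing
in the table can come out as U+002D.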

Many people have said during this discussion that they wouldn't
expect copy and paste from a PDF viewer to a UTF-8 terminal to work.
The above may well be part of the precise reason why indeed it
cannot fully work.

On the other hand, i stumbled because nowadays, we have got used
to the feeling that Unicode might be a superset of everything.  So
i considered \- and \(mi redundant because both map to U+2212, and
so i hoped one of them might be up for grabs.  Yet Doug kindly
reminded us what the distinction is, and that distinction indeed
cannot be represented in terms of Unicode codepoints.

So i have to reluctantly conclude that your original problem, Ralph,
of requesting "a glyph representing U+002D HYPHEN-MINUS" is unsolvable,
at least without adding yet another glyph to the -Tps TR font.
But even that would hardly help, for two reasons: all PostScript
and PDF viewer software would first have to catch up, correctly
recognize the glyph, and correctly translate it to U+002D HYPHEN-MINUS
- but the ecosystem such software lives in evolves incredibly slowly
nowadays.  And then you would have to define a new character escape
sequence to access the new glyph, since all the existing escapes
already mean something else.  But that's effectively a reductio ad
absurdum: telling all manual page authors to, henceforth, write "wc
\(hml" is not going to fly.  Very few would understand and follow
that, and even i would tend to resist it as excessive complication.


So i fear we are left with the traditional workaround:
Use \- if you mean U+002D HYPHEN-MINUS and live with the
three unsolvable problematic consequences:

 1. The fact that -Tps TR has no glyph for U+002D HYPHEN-MINUS.
 2. The fact that for Unicode output outside manual pages, you get
    U+2212 instead of the U+002D you wanted.  If that is unacceptable
    in some specific situation *and* you are only targeting some
    subset of output devices, there may be a case-by-case workaround,
    but there is no general solution.
 3. For Unicode output of manual pages, the
      .char \- \N'45'
    in the two macro sets needs to be kept for good, even though
    that implies that in manual pages, the occasional \- that *is*
    intended as a mathematical minus sign will also (mis)render as
    U+002D.
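
For point 2, one conceivable case-by-case workaround is to post-process
Unicode output and fold U+2212 back to U+002D.  The following Python
filter is only a sketch under that assumption; it is not part of groff,
the names are invented, and it has the same drawback as point 3, namely
that a genuine mathematical minus sign is folded as well:

```python
import sys

# Translation table: U+2212 MINUS SIGN -> U+002D HYPHEN-MINUS.
# U+2010 HYPHEN and the dashes are deliberately left alone.
MINUS_TO_HYPHEN_MINUS = str.maketrans({"\u2212": "-"})


def asciify_minus(text):
    """Replace every U+2212 in text with U+002D."""
    return text.translate(MINUS_TO_HYPHEN_MINUS)


if __name__ == "__main__":
    # Use as a filter on already formatted UTF-8 text.
    for line in sys.stdin:
        sys.stdout.write(asciify_minus(line))
```

This only makes sense when the output is plain Unicode text; it does
nothing for -Tps or -Tpdf, where the glyph simply does not exist.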

Unfortunately, the patch that i submitted to the bugtracker appears
to be wrong.  As no other solution appears to be possible, I should
probably mark it as invalid and close the ticket.

If you agree with this analysis, i'm planning to look through the
documentation and try to make all this clearer wherever needed,
such that people are less likely to waste time again redoing this
analysis in a few years.

Yours,
  Ingo


Appendix:
Rendering of hyphens, minuses, and dashes in all of groff's output devices.

Format of the following list:

Block headers:
column 1: input character
column 2: description in documentation
column 3 (in parentheses): whether provided in a special font

Block bodies:
column 1 (prefix -T, suffix :): device name
column 2: font file name (if any, otherwise src/libs/libgroff/glyphuni.cpp)
column 3: glyph code number (number format is device dependent)
column 4 (after =): output character (provided only if ASCII)
column 5 (in parentheses): exceptions for manual page macro sets

-     hyphen (never special)
\(hy  hyphen (never special)
      -Tascii:  R  45 = -
      -Tutf8:   -  U+2010 HYPHEN (currently U+002D in manuals, to be removed)
      -Thtml:   R  U+002D HYPHEN-MINUS = -
      -Tcp1047: R  0140
      -Tps:     TR 45 hyphen = -
      -Tdvi:    TR 0055
      -Tlbp:    TR 0x2d hyphen
      -Tlj4:    TR 161069 -- 19U 45

\-    minus in current font (special only in -Tdvi)
      -Tascii:  R  45 = -
      -Tutf8:   -  U+2212 MINUS SIGN (U+002D in manuals, for good)
      -Thtml:   -  U+2212 MINUS SIGN = −
      -Tcp1047: R  0140
      -Tps:     TR 173 minus = \255
      -Tdvi:    S  0000
      -Tlbp:    TR 0x2d hyphen
      -Tlj4:    TR 60096 -- 7J 192

+     plus in current font (never special)
      -Tascii:  R  43 = +
      -Tutf8:   -  U+002B PLUS SIGN
      -Thtml:   -  U+002B PLUS SIGN = +
      -Tcp1047: R  0116
      -Tps:     TR 43 plus = +
      -Tdvi:    TR 0053
      -Tlbp:    TR 0x2b plus
      -Tlj4:    TR 161067 -- 19U 43

\(mi  minus in special font (always special)
      -Tascii:  R  45 = -
      -Tutf8:   -  U+2212 MINUS SIGN
      -Thtml:   -  U+2212 MINUS SIGN = −
      -Tcp1047: R  0140
      -Tps:     S  45 minus
      -Tdvi:    S  0000
      -Tlbp:    not provided
      -Tlj4:    S  68909 -- 8M 45

\(pl  plus in special font (sometimes special)
      -Tascii:  R  43 = +
      -Tutf8:   -  U+002B PLUS SIGN
      -Thtml:   -  U+002B PLUS SIGN = +
      -Tcp1047: R  0116
      -Tps:     S  43 plus
      -Tdvi:    TR 0053
      -Tlbp:    TR 0x32b plusmath
      -Tlj4:    S  68907 -- 8M 43

\(en  en-dash (never special)
      -Tascii:  R  45 = -
      -Tutf8:   -  U+2013 EN DASH
      -Thtml:   -  U+2013 EN DASH = –
      -Tcp1047: R  0140
      -Tps:     TR 137 endash = \211
      -Tdvi:    TR 0173
      -Tlbp:    TR 0x132 endash
      -Tlj4:    TR 161174 -- 19U 150

\(em  em-dash (never special)
      -Tascii:  -  .fchar \[em] --  \" tty.tmac
      -Tutf8:   -  U+2014 EM DASH
      -Thtml:   -  U+2014 EM DASH = —
      -Tcp1047: -  .fchar \[em] --  \" tty.tmac
      -Tps:     TR 138 emdash = \212
      -Tdvi:    TR 0174
      -Tlbp:    TR 0x123 emdash
      -Tlj4:    TR 161175 -- 19U 151
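
As a quick cross-check of the \(en block above, the glyph codes can be
counted with a few lines of Python.  The codes are kept as strings
exactly as listed, since the number format is device dependent; the
variable names are invented for this sketch, and only the eight devices
that appear in the block are included:

```python
# Glyph code for \(en per device, copied from the \(en block above.
EN_DASH_CODES = {
    "ascii": "45",
    "utf8": "U+2013",
    "html": "U+2013",
    "cp1047": "0140",
    "ps": "137",
    "dvi": "0173",
    "lbp": "0x132",
    "lj4": "161174",
}

# Distinct glyph codes across the listed devices.
DISTINCT_EN_DASH_CODES = set(EN_DASH_CODES.values())

# Devices whose code for \(en is the ASCII number of "-".
ASCII_RELATED = [dev for dev, code in EN_DASH_CODES.items()
                 if code == str(ord("-"))]

if __name__ == "__main__":
    print(len(DISTINCT_EN_DASH_CODES), ASCII_RELATED)
```

That gives seven distinct codes, of which only the -Tascii one is the
ASCII number 45.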

Devices using special fonts (* provides at least one of the above glyphs):
  -Tps   S* SS ZD ZDR
  -Tpdf  S* ZD
  -Tdvi  MI S* EX CW
  -Tlj4  S*

Devices not using any special fonts:
  -Tascii
  -Tlatin1
  -Tutf8
  -Thtml
  -Tcp1047
  -Tlbp


