Re: Plan 9 man added a new macro for man page references


From: G. Branden Robinson
Subject: Re: Plan 9 man added a new macro for man page references
Date: Wed, 4 Aug 2021 14:06:09 +1000
User-agent: NeoMutt/20180716

Hi Alex & Ingo,

I owe Doug McIlroy an apology for, some months ago on this list,
significantly understating his diligence as editor of Volume 1 of the
Version 7 Unix manual (1979).  A meticulously numerical accounting of
just one aspect of that effort follows in this (lengthy) email.

At 2021-08-01T14:06:38+0200, Alejandro Colomar (man-pages) wrote:
> >     For more on what can go wrong when you screw up concurrency,
> >     see
> >     .MR membarrier "2 Errors" .
> 
> Interesting.  You could make it search for SH and SS coincidences.

They don't have to be coincidences; it's possible to extend SH and SS
themselves to plant "anchor targets" which can be used by MR.  Such an
extension involves *roff "device escapes" (probably using groff's
devtags.tmac) and would not be visible to, or require anything of, the
man page writer at all.

> I think a 3rd (or maybe a 4th if the 3rd corresponds to the
> punctuation) argument would be better, as it doesn't have very much
> relation to the 2nd one.  It will be simpler to understand in separate
> arguments.

That's true; I have a competing bias to keep more closely related things
topologically closer.

> > > I support this plan ;-)

I've noted you've cooled off on this.  Ingo's a more seasoned campaigner
than I am.  :-O

> > You can see why I'm not in sales.
> I have alx.manpages@ for normal usage, and then alx.mailinglists@ for
> subscribing to mailing lists.  That way I avoid noise in the other one
> (I'm subscribed to libc-alpha@, linux-man@, linux-api@, which are
> very-high traffic).  And groff@ mails may get lost between many of
> those linux-api@ and libc-alpha@ mails.  That's also why I prefer that
> people CC me in patches instead of sending them only to linux-man@
> (there they may get lost) (Michael too BTW, for similar reasons).

Understood, and honored above.

At 2021-08-01T15:49:25+0200, Ingo Schwarze wrote:
> Hi Branden,
> 
> note that mdoc(7) has most of what you are talking about - not just
> as a freshly invented concept yet to be tested, but actively used
> and proven adequate in practice.  In particular the .Xr macro has
> seen consistent use in all mdoc(7) manual pages for more than 30
> years, and it has been in use for hyperlinking on the web for about
> ten years, or even much longer if you count the original FreeBSD
> man.cgi implementation.  It has exactly the syntax you propose for
> .MR:
> 
>   .Xr page_name section_number [punctuation_suffix_args]

I'm aware.  And parallelism between the macro interfaces on this point
seems like a feature rather than a bug.

I will however note that groff mdoc does not regard a single-argument
call to Xr as an error condition as mandoc does.  This mode of calling
it has a use case that we exercise in the groff_mdoc(7) page; I asked
you about it in December[1] and haven't heard back.

> I first discussed that idea during EuroBSDCon 2015 in Stockholm:
> 
>   https://www.openbsd.org/papers/eurobsdcon2015-mandoc.pdf
>   see pages 15 to 18
> 
> It turns out the concept of remote deep linking in manual pages is
> rarely needed, for several reasons.
> 
> Well-designed programs tend to be simple, doing one thing well.

Even granting this arguendo--I'm not sure it holds for programs that
aren't _designed_ to participate in the Unix pipeline/filter
model--there are plenty of counterexamples in the field with a long and
successful history, like ffmpeg and ImageMagick/GraphicsMagick.

I know you have an even stronger prescriptivist bent than I do, and so
your response might be to leave such ill-conceived programs
undocumented, thus hastening their deaths in the hopes that
philosophically pure replacements will come along and which will
incidentally fit neatly into the mdoc(7) schema without requiring deep
links.

That might be the OpenBSD way, but it's not how many GNU/Linux
distributions operate; they're eclectic, and for better or worse
assimilate software projects that adhere weakly or not at all to that
principle.  Some of the people who use them are going to want to see
them documented.  A few of those, perceiving a lack of documentation,
are going to want to supply it.

> Consequently, well-written manual pages for well-designed programs
> tend to be short.  When linking to a short document, deep linking
> matters little.  Besides, deep linking is not necessarily beneficial.
> The reader being referred to that other page needs to grasp some
> context regarding what that other page is about, and that is easiest
> to get from the page title, the Synopsis section, and the first
> sentence of the Description section - i.e. from the beginning of the
> page.  Being plunged right into the middle of a document is not always
> helpful, *especially* when the document is large or complex.

Some readers, you can trust to recognize when they're missing context
and to "zoom out" with respect to the scope that the reference takes
them.

[...]
> I'm not saying deep linking is completely irrelevant or i would not
> have been considering and discussing it for the last six years.
> But i do insist that it must not dominate the discussion of linking
> as a whole.  It is *much* more important that simple use cases of
> linking work in a way as simple as possible for authors and readers
> than that deep linking is available.  In other words, designing deep
> linking must not spoil the overall design of linking.  There is a
> very substantial danger of overengineering here.

I'm willing to postpone implementation of that aspect of MR pending an
expression of demand; Alejandro is well-placed to assert whether he sees
a need based on his Linux man-pages experience.  Given the interface
proposals mooted already, I think the feature can (and, if done at all,
should) be grafted on without invalidating the "shallow" syntax.

> Note that the concept of trailing punctuation arguments is standard
> for mdoc(7) macros but feels somewhat alien to the man(7) macros.

Not if you support groff man(7)'s .MT/.ME and .UR/.UE.  ;-)

> > That is not fatal to my evil plans; the internal anchor reference
> > could be appended to the section somehow,
> 
> I strongly advise against that.  Combining arguments of different
> purpose into a single function argument is terrible practice in the
> first place.  It doubles complexity because without it, you have one
> level of parsing: identify arguments and use them.  Now, you suddenly
> have two levels of parsing: after identifying this kind of argument,
> you have to start a whole new parsing algorithm to parse that argument
> and then handle its components.  It also insults the eye by
> non-uniformity of syntax.  Before, you had the space character as an
> argument separator.  Now, you suddenly have two different separators
> for no good reason.

That's a fair point; mdoc's own practice of macros calling their own
parameters as macros is one of the things about its syntax I find
discouraging.  What I suggested is a simpler case, but it, too, is an
instance of two levels of parsing as you point out.

> Besides, combining the section number and the deep link target name
> makes no sense at all because both are completely unrelated to each
> other.  The arguments describing the target form a natural hierarchy:
> 
>  1. target section
>  2. target manual page name (within the target section)
>  3. deep linking target (within the target page)

It looks like Plan 9 missed an opportunity to order the arguments
"correctly", then, even omitting item 3.  (I think they made the right
choice; people are so used to $MAN($SECTION) that they would resist
taking up a macro with the opposite order.)

> In spite of this natural ordering, starting with the page name is
> good because that's what authors and readers should think about
> first and also because we certainly mustn't abandon the name(sec)
> output convention, so having the same argument order in the .Xr
> and .MR macros on the input side really helps sporadic authors
> to remember the input syntax.

Agreed.

> Well, it must of course not come after the punctuation argument,
> the obvious syntax would be
> 
>   .Xr/.MR page sec [deep_target] [punctuation_suffix_args]
> 
> And in the extremely unusual case that some punctuation_suffix_arg
> would not look like punctuation, you would have to write
> 
>   .MR page sec "" [punctuation_suffix_args]
> 
> In mdoc(7), that cannot ever happen because mdoc(7) very specifically
> defines what closing punctuation is, and none of that can possibly
> occur as a deep_target:
> 
>   https://man.openbsd.org/mdoc.7#Delimiters

I am not a fan of mdoc's practice of inferring a data type from a macro
argument by parsing it.  Without a mechanism for named parameters (which
mdoc also implements along the lines of Unix option syntax in macros
like 'Bl', and which some groff macros like PSPIC also employ), position
in the argument list is the thing the user, or package, can know.  In
man(7)'s case, I think that's a benefit.  I want to keep the language
that simple.

> An alternative would be using the .Tg macro that already exists in
> the mdoc(7) language for a related purpose, as follows:
> 
>   .Tg deep_target
>   .Xr page sec [punctuation_suffix_args]
> 
> The purpose of .Tg is to mark the next token as a link target.
> Since .Xr can never be a useful link target, letting the deep_target
> name refer to the *target* page rather than the *source* page when .Tg
> precedes .Xr feels kind of natural.
> 
> This .Tg / .Xr design provides the side benefit of not changing the
> syntax of .Xr that has been established for three decades, so it has
> better backward compatibility properties than the three-argument idea.
> Again, i'm not claiming just yet this is the best idea.

Yes, I think the above idea has some merit.  For man(7) we can infer
link targets automatically for SH, SS, and TP.

> > > > * Added support for another string, perhaps 'MB' ("manref
> > > > base"?), supplying a base URL which can be set at
> > > > page-generation time.  Embedding a full URL in man pages sources
> > > > to an inherently relocatable page hierarchy is a bad idea.
> 
> That feels like a feature for the formatter, *not* a feature for the
> markup language.
> 
>   https://man.openbsd.org/mandoc.1#man~2
> 
> Note that the mdoc(7) documentation is not encumbered by this -O
> man=... feature of the mandoc(1) formatter at all.

You're confusing me.  troff _is_ the formatter (well, in conjunction
with the output device driver/postprocessor, which is what matters
here).  A link will not be resolvable until the page is formatted, and
when it is, it will be generated in a context.  That context has to be
communicated to troff (and then the output driver) somehow.  That
context _cannot come from the man(7) document source_.  But it _can_
come from the man(7) macro _package_, _at the time the document is
processed_.  A roff string seems like the obvious way to implement this.
It would still be a man(7) feature because it's going to be up to
man(7)'s MR implementation to inject the contents of this string into
the device escape that contains the URI.

As with the PT and BT hooks of groff_man(7), such an MB string is not
something that a man _page_ should ever touch or worry about.

> > "30 years after Sir Tim Berners-Lee brought you HTML, groff is hot
> > on his heels!"
> 
> And about 33 years after Cynthia Livingston invented .Xr on behalf of
> USENIX, and 12 years after Kristaps implemented .Xr / <A HREF>
> support in mandoc -T html:
[...]

I know.  The thrust of my joke is that I am not implying a pioneering
effort.  It's a catch-up measure for an old technology.

> > I should go ahead and mention that I'm resolved to implement a
> > string called (probably) MF, to make the font used for setting man
> > page names configurable at rendering time.
> 
> Don't.  When designing a hammer, don't add bells and whistles as
> features to it.  User-configurable fonts in manual pages provide
> no benefit whatsoever, just like bells and whistles on a hammer
> wouldn't, so the hammer is better without them.  But making this
> user-configurable has a clear downside: it reduces the uniformity
> of rendered manual pages, to the detriment of users who would have
> a harder time of getting used to how manual pages look like and
> what the fonts used in them mean.

This is why italics are the correct choice.  A piece of software
referred to by name is a work title, like Beethoven's _Fidelio_ or
Harper Lee's _To Kill a Mockingbird_.  Italics are how you refer to such
works in formal English writing, across all disciplines.  One can argue
that the convention used for articles or shorter works, simple
quotation, would have been more appropriate for the small programs that
could fit in core on a PDP-11, but that's not the choice the folks at
Bell Labs made.  (A good thing, too, because the double quotes on the
Teletype Model 37 did double duty as dieresis accents and were, bluntly,
uglier than hell.)

> Almost no user would configure this themselves, but package
> maintainers in operating systems would be likely to fiddle with it.
> So you would actively encourage incompatibility across operating
> systems.

I find this overstated.  It's like saying that having a different $PS1
default across operating systems is an incompatibility.  The font style
used for man page titles is a stylistic preference, and one people fight
like dogs over.  I'm not willing to impose my preference or yours on
people as the price of adoption for the MR feature.

> Making this user-configurable would feel like design by committee:
> The committee couldn't agree on which of the equivalent colours to
> use for the bikeshed, so they required the construction of multiple
> bikesheds in various colours, and while they were about it, neglected
> to consider the features of the bikesheds that actually matter to
> their users.

I fear that supporting this configurability is the price of buy-in for
the feature, since people have this ahistorical hatred of italics for
titles of man pages.  If we're going to get this modest reform of the
man(7) language deployed in the field, we will indeed have to do some of
the work that a committee has to do.

> Regarding which colour is best, at the risk of repeating myself:
> 
> UNIX-7 is inconsistent in this respect, in part I(R), in part R(R).

You're basing this claim on some ad hoc research I did last August[2].
I have since seen that I need to improve the precision of my reasoning.
Since you are deploying it against my position, you must indulge me a
data-driven excursion on the point at issue.

Problem
=======
Quantify the inconsistency of (Volume 1 of) the Version 7 Unix manual's
styling of man page cross references.

Method
======
(1) Retrieve the V7 Unix archive from a reputable source.
(2) Unpack it.
(3) cd into usr/man.
(4) Count things that look roughly like man page cross references.

$ for MP in $(find -name "*.[1-8]*"); do \
 sed '/[A-Za-z.]\+ \?([0-9][a-z]\?)/!d' "$MP"; done

I include (1) capital letters in the pattern matching page names because
on rare occasions, man page names in text were subjected to English
sentence capitalization rules; and (2) the literal dot due to the lone
page a.out(5).

I call these "total-reflikes".  There are 725.
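
To turn the matched lines into a number, the output can be piped
through wc(1).  A minimal sketch, assuming GNU sed (for \+ and \?); the
helper name count_reflikes is hypothetical, and the pattern allows an
optional space before the parenthesized section so that both "ed (1)"
and "troff(1)" match:

```shell
# count_reflikes DIR: count lines under DIR that look roughly like man
# page cross references.  Hypothetical helper; assumes GNU sed.
count_reflikes() {
    find "$1" -type f -name "*.[1-8]*" -exec \
        sed '/[A-Za-z.]\+ \?([0-9][a-z]\?)/!d' {} + |
        wc -l | tr -d ' '     # strip padding some wc implementations add
}
```

Invoked as "count_reflikes usr/man" from the top of the unpacked
archive, this should correspond to the total-reflikes count, though the
exact figure depends on the archive and the sed in use.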

(5) Since the "SEE ALSO" section consistently uses roman-on-roman,
perhaps for an editorial reason Doug McIlroy can shed some light on,
weed it out.

$ for MP in $(find -name "*.[1-8]*"); do \
 sed '/^\..*SEE ALSO/,/^\.SH/d;/[A-Za-z.]\+ \?([0-9][a-z]\?)/!d' "$MP";\
 done

The above tells us how many lines containing xref-looking strings there
are _outside_ the "SEE ALSO" section.  I call them "possible-refs".
There are 425.

(6) See how many italicized page name cross references there are in the
stereotypical presentation form of ".IR manpage (x)", where 'manpage' is
any lowercase alphabetic string and 'x' is a single decimal digit.

$ for MP in $(find -name "*.[1-8]*"); do \
 sed '/^\..*SEE ALSO/,/^\.SH/d;/^\.IR [A-Za-z.]\+ \?([0-9][a-z]\?)/!d' \
"$MP"; done

I call these "stereotypical-refs"; there are 384.

(7) Inspect the exceptions.

$ diff -u stereotypical-refs possible-refs | grep '^+'

I'll rearrange (but only that) the output to clarify things.  First
let's get some false positives out of the way.

+++ possible-refs       2021-08-04 11:01:03.376431376 +1000
+return a null pointer (0) if there is no available memory
+Returns NULL (0) if name not found.
+       monitor(0);
+is false (0).
+returns a null (0) pointer if packet protocol
+indicate errors with a null (0)
+is reliably returned by `sbrk(0)',
+returns a null pointer (0) if
+.B long time(0)
+.B wait(0)

The above refer to literals in parentheses or are example function calls
with single, small-integer arguments.  (Because the Linux man-pages
styling practice is to use bold for both function call literals and
man page names, similar confusing collisions have arisen in practice
there.)

+tables for a simple automaton which executes an LR(1) parsing

The foregoing is a notation familiar to those who have studied parser
theory.

+.IP (1)
+.IP (2)
+.IP (3)

The above is an enumerated list.

+.RI ( stdio (3)),
+.RI ( passwd (5))
+.RI ( date (8))

These are perfectly idiomatic man page references, but because they are
nested in parentheses, they use a font style alternation macro of
complementary ordering.
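
Both orderings can be recognized with one extended regular expression.
A sketch, assuming a grep that supports -E and -q; the helper name
is_italic_ref is hypothetical and was not part of the analysis:

```shell
# is_italic_ref LINE: succeed if LINE is an italicized page reference in
# either the ".IR name (n)" form or the parenthesized ".RI ( name (n))"
# form.  Hypothetical helper for illustration.
is_italic_ref() {
    printf '%s\n' "$1" |
        grep -Eq '^\.(IR |RI \( )[A-Za-z.]+ ?\([0-9][a-z]?\)'
}
```

The required space after the opening parenthesis in the RI branch is
what keeps sed(1)-style lines such as ".RI (2)\|b" from matching.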

+devices (4)
+system calls (2)
+other functions in (3)
+Section (6) for computer games.

These are references to manual sections, not page references per se.

+troff(1)

This is a false positive from cat(4): a "SEE ALSO" section was the last
section in the page (an unusual occurrence in the V7 Unix manual).  I
did not take the time to craft a more sophisticated sed script to handle
this case, which would have required use of its hold space, a technique
that comes to me only haltingly.
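
One way to avoid the hold space is to track the current section
explicitly.  The following is a sketch in awk rather than sed, so it is
a different technique from the one used for the counts in this email.
Assumptions: the helper name refs_outside_see_also is hypothetical,
section headings are lines beginning with ".SH", and "SEE ALSO" may be
written with an escaped space.

```shell
# refs_outside_see_also FILE...: print ref-like lines that are not
# inside a "SEE ALSO" section.  Hypothetical helper, not the script
# used for the counts in this email.
refs_outside_see_also() {
    awk '
        # First line of each file: forget the previous file section.
        FNR == 1 { skip = 0 }
        # Heading: note whether it opens SEE ALSO (plain or "\ " form).
        /^\.SH/  { skip = ($0 ~ /SEE\\? ALSO/); next }
        # Any other line: print it if it looks like a reference.
        !skip && /[A-Za-z.]+ ?\([0-9][a-z]?\)/
    ' "$@"
}
```

Because the state is reset per file and cleared only by the next
heading, a "SEE ALSO" section is skipped wherever it falls in the page.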

+uux(1)
+pwd(1)
+dc(1)  desk calculator proper
+nm(1), sed(1), sort(1), join(1)

These look like errors in mail(1), at(1), bc(1), and lorder(1),
respectively; the above lines are at the end of "FILES" sections when
they should have occurred one or two lines later in the "SEE ALSO"
sections that immediately follow in each case.  When one considers that
ed(1), rather than a full-screen editor in a video terminal, was likely
used to update these files, the plausibility of such an error seems
greater.

+deroff(1), sort(1), tee(1), sed(1)

This case is similar to the above, except that the "SEE ALSO" section
heading is missing entirely.  Perhaps ed(1)'s "c" command was mistakenly
used instead of "i" or "a".

+.RI (2)\|b " label"
+.RI (2)\|r " rfile"
+.RI (2)\|s /regular\ expression/replacement/flags
+.RI (2)\|t " label"
+.RI (2)\|w " wfile"
+.RI (2)\|x
+.RI (2)\|y /string1/string2/
+.RI (2)! " function"
+.RI (0)\|: " label"

The foregoing (from sed(1)) are all obviously not man page references.

+ptrace(2),
+a.out(5),
+core(5)

These three are false positives because I failed to handle an
idiosyncratic section heading in adb(1), shown indented below.

        .SH SEE\ ALSO

+(see umask(2))
+If you have a hierarchy to restore you can use dumpdir(1)

These are full-on styling solecisms (in chmod(1) and restor(1m),
respectively) under the editorial standard in evidence--they constitute
the veritable nose of the mdoc Xr camel!

From the above evaluation of cases, I produced a final artifact,
"scrubbed-refs", leaving only the man page references that were truly man
page references, excepting those in or intended for a "SEE ALSO"
section.

I'm attaching all of the files produced by grep and/or sed for the
benefit of future scholars.

Findings
========

Version 7 Unix's inconsistency rate in using italics for man page names
in cross-references is more like 0.51% (2/389).  While the italics were
omitted in "SEE ALSO" sections (and only there, as a strong rule), this
was clearly the result of intention, not error or inconsistency.
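
The arithmetic can be checked from the counts given above.  In this
sketch the per-category exception counts are read off the diff output
shown earlier, and the decomposition of the denominator (384
stereotypical references, 3 parenthesized ".RI" references, 2 unstyled
ones) is an assumption that happens to reproduce 389:

```shell
# Exceptions by category, in the order discussed: literals and calls,
# LR(1), .IP items, ".RI ( name (n))" refs, section refs, troff(1) from
# cat(4), misplaced FILES lines, the deroff line, sed(1) commands,
# adb(1) SEE ALSO entries, unstyled refs.
exceptions=$((10 + 1 + 3 + 3 + 4 + 1 + 4 + 1 + 9 + 3 + 2))
echo "exceptions: $exceptions (possible-refs minus stereotypical-refs: $((425 - 384)))"

stereotypical=384   # ".IR name (n)" references
parenthesized=3     # ".RI ( name (n))" references
unstyled=2          # umask(2) in chmod(1), dumpdir(1) in restor(1m)
total=$((stereotypical + parenthesized + unstyled))
rate=$(awk -v bad="$unstyled" -v all="$total" \
    'BEGIN { printf "%.2f", 100 * bad / all }')
echo "$unstyled/$total = $rate%"
```

Both sides of the first line come to 41, and the second line prints
"2/389 = 0.51%".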

Doug McIlroy thus did even more of a solid job as Volume 1 editor than I
thought he did, which doesn't surprise me.

Popping the stack from V7 Unix manual practice...

> Linux is inconsistent, in part I(R), in part B(R).

Definitely.  As I've said before, probably on this list, I speculate
that this was due to nonexistent support for underlining in the Linux PC
console driver for VGA hardware.  I vaguely recall MDA and/or Hercules
graphics cards supporting underlining (if not real italics), but no
proud Linux hacker in the early '90s was going to saddle himself with
such an encumbrance when VGA was available--not when you could play
color-enhanced NetHack or scrape USENET for interesting image files.

> BSD has been completely consistent for 30 years: R(R).

mdoc(7), considered across implementations, has not been.  groff mdoc(7)
sets the first argument of Xr in the Courier family if it is available
(and has done so for at least 20 years), so on typesetter devices the
man page name doesn't blend in with the surrounding text as your
notation R(R) implies.

> I claim I(R) is outright misleading because manual pages mostly
> reserve italic for placeholders, for words the user needs to replace
> with their own content - plus relatively few unrelated, general
> typesetting features like stress emphasis.

As observed in groff_man_style(7)[3] and above, ordinary prose in
English professional writing has its own uses for italics, and that
field brings its own expectations and practices to bear.  Certain
readers will expect them even if the man page _writer_ is the sort who
disdains liberal arts majors in general and English majors in
particular.  (I recall much institutionalized snobbery along these lines
uttered by my fellow engineering majors at Purdue.)

> B(R) is clearly better because manual pages mostly use bold face for
> keywords and other fixed strings the user has to type verbatim,
> and page names, just like command names, are fixed strings, not
> placeholders.

For decades before Thompson coded Unix, typographers had noticed that
excessive use of boldface in type was objectionable, and as subjective
as you may find that esthetic preference, it is more entrenched and
enjoys much wider currency than Unix does, let alone any of our
internecine debates.

> But manual page markup tends to be heavy on the eye anyway, with lots
> of unavoidable bold face and italics.  Where bold face and italics add
> no benefit, they should consequently be avoided, for better aesthetic
> effect and for reducing distraction of the eye.

You recognize the principle but are underplaying it in this discussion,
I think for tactical purposes--perhaps to seduce Alejandro into
championing mdoc(7) over at the Linux man-pages project.  ;-)

> For name(section) manual page references, bold or italic is just
> not needed.  The name(section) syntax is very iconic and readily
> recognizable on its own, so using R(R) is clearly best.

I am not persuaded, with one exception: insofar as function call
literals with single small integer arguments are confusable with man
page cross references when the same face is used for both, B(R) is the
worst of the three worlds being argued.

Regards,
Branden

[1] https://lists.gnu.org/archive/html/groff/2020-12/msg00116.html
[2] https://lists.gnu.org/archive/html/groff/2020-08/msg00068.html
[3] https://man7.org/linux/man-pages/man7/groff_man_style.7.html

Attachment: total-reflikes
Description: Text document

Attachment: possible-refs
Description: Text document

Attachment: stereotypical-refs
Description: Text document

Attachment: scrubbed-refs
Description: Text document

Attachment: signature.asc
Description: PGP signature

