Re: Warn on mid-input line sentence endings

groff

From:	G. Branden Robinson
Subject:	Re: Warn on mid-input line sentence endings
Date:	Sat, 29 Apr 2023 19:05:12 -0500

I should clarify a couple of points here since I was feeling grumpy when
I wrote the following, and that made me forget things.

At 2023-04-27T09:45:40-0500, G. Branden Robinson wrote:
> We're re-covering some familiar ground here.
> 
> I have a few points I'd like to make.
> 
> 1.  "Semantic newlines" is a terrible term.

I should have said "_Warn on_ semantic newlines" is a terrible
instruction/summary.

Semantic newlines are exactly what we _don't_ want to warn about.

If man-pages(7) or other people continue to call the practice of
breaking *roff input lines after sentence-ending punctuation "semantic
newlines", I have no complaint.  It could also be called "Kernighan
breaking", in honor of an early popularizer of the practice.

> 2.  Bjarni's comment '"groff" is not the right tool for such things,
>     but "grep" is.' is thoroughly wrong-headed and Ingo was right to
>     reject it with great force.  Here are a few reasons why.  I don't
>     think any of B through D are relevant to mandoc(1) since it
>     doesn't support the features in question (as far as I know).
> 
>     A.  The formatter decides where sentence boundaries are based on
>     its input.
> 
>     B.  Use of the `cflags` request can change the characters that
>     have sentence-ending semantics.  grep(1) cannot know this.
> 
>     C.  Sentence-ending characters are subject to character
>     translation (the `tr` request).  grep(1) cannot know this.
> 
>     D.  The user/document could define a special character that is a
>     sentence-ending character (with `char` and `cflags`).  grep(1)
>     cannot know this.

      E.  Because '.', '?', and '!' are valid characters in *roff
      identifiers, grep(1) can be fooled by special character, register,
      or string interpolations in the input if their identifiers use
      those characters.

Example:

I can't believe \*(I.  ate the whole thing.

It is only valid to detect the end of a sentence here if the (recursive)
_expansion_ of the `I.` string ends with a sentence-ending punctuation
character.
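
To make that concrete, here is a rough Python sketch.  Everything in it
is an assumption for illustration: the STRINGS table stands in for
earlier `ds` requests that a line-oriented tool would never have seen,
the expander handles only the two-character `\*(xx` form, and the
NAIVE pattern is just one plausible grep-style check.

```python
import re

# Hypothetical string table standing in for earlier `.ds` requests.
STRINGS = {
    "I.": r"\*(Me",  # `I.` interpolates another string...
    "Me": "I",       # ...which finally yields plain text
}

# Matches only the two-character form \*(xx; real *roff has more forms.
INTERPOLATION = re.compile(r"\\\*\((..)")

# Naive grep-style check: sentence-ending punctuation followed by
# spaces and more text on the same input line.
NAIVE = re.compile(r"[.?!] +\S")


def expand(text, depth=0):
    """Recursively expand \\*(xx interpolations using STRINGS."""
    if depth > 50:  # crude guard against self-referential strings
        raise RecursionError("string interpolation nested too deeply")
    return INTERPOLATION.sub(
        lambda m: expand(STRINGS.get(m.group(1), ""), depth + 1), text
    )


def naive_hit(line):
    """What a pattern match on the raw input line reports."""
    return NAIVE.search(line) is not None


def expanded_hit(line):
    """The same check, applied after string expansion."""
    return NAIVE.search(expand(line)) is not None


line = r"I can't believe \*(I.  ate the whole thing."
print(naive_hit(line))     # True: the '.' in the identifier fools it
print(expanded_hit(line))  # False: the expansion ends with "I"
```

The second check gets the right answer only because it was handed the
string definitions, which is the formatter's knowledge, not grep's.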

Further, since string interpolations can result in further string
interpolations, a finite-state automaton will not suffice to analyze
this input.  You need at least a stack machine.  (A stack machine, or
pushdown automaton, recognizes context-free languages; "recursively
enumerable" languages are what a Turing machine recognizes.)

This is categorically not what regular expressions can cope with,
formally.  My vague understanding of modern regex implementations is
that they are not finite state automata; the drive for extra features
has caused them to add limited support for non-regular languages.
(If memory and comprehension serve, "backreferences" in matches, like
"grep 'foo\(bar\)baz\1qux'", were the camel's nose admitting
unbounded memory usage to the regex interpreters of the land.
Perl added many more.[2])

But even knowing that modern regex engines aren't (more precisely: don't
construct) strict finite state machines doesn't save you; they still
understand only their own grammar, not *roff's, so they have no way of
knowing how a *roff string will ultimately expand.

And, to put a bow on that observation, by the time a grep(1) is looking
at the line above, it has already discarded all of the input that set up
the string definitions it would need to know.

So that's yet another reason why, if mid-input line sentence endings are
to be warned about, they must be detected in the formatter, or an
interpreter for so much of the formatter's grammar that one might as
well write a formatter.

I think this is one reason all of the deroff(1) projects in the world
have died.  Eventually they will all fail given a sufficiently complex
input.  I don't have a theorem/proof to back this up, but my hunch is
that since *roff is a Turing-complete language, deciding what a
*roff formatter will output with "all of the formatting stripped away"
is equivalent to solving the halting problem.

It occurs to me that the right way to attack the problem of extracting
the text from a *roff document is to scrape it out of the device-
independent output format.  Only a handful of commands in that language
produce text glyphs, and they are easy to parse.  This _still_ isn't a
100% solution; access to the current font's glyphs by their index values
can still conceal text.[3][4]  But it strikes me as a far more reliable
approach to several nines of efficacy in this task than any other I've
seen.

But as far as I know no one has ever done this.  I admit that I'm
baffled why not.
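
A minimal sketch of what such a scraper might look like, in Python.
It assumes the command letters documented in groff_out(5) (`t`, `u`,
`c`, `C`, `N`, `w`, `n`) and groff's habit of writing roughly one
command per line.  The SAMPLE input is hand-written for illustration,
not captured output, and extract_text is a made-up name.

```python
# Hand-written sample in the style of groff's intermediate output.
SAMPLE = """\
x T ascii
x res 240 24 40
x init
p1
f1
s10
V40
H0
tHello,
wh24
tworld.
n40 0
x trailer
x stop
"""


def extract_text(ditroff):
    """Collect glyph text from device-independent output (a sketch)."""
    out = []
    for raw in ditroff.splitlines():
        line = raw.strip()
        # `w` marks a word boundary and may share a line with the
        # motion command that follows it, as in "wh24".
        if line.startswith("w"):
            out.append(" ")
            line = line[1:]
        if not line:
            continue
        cmd, arg = line[0], line[1:]
        if cmd == "t":    # tTEXT: a run of ordinary glyphs
            words = arg.split()
            if words:
                out.append(words[0])
        elif cmd == "u":  # uN TEXT: like t, with track kerning N
            parts = arg.split(None, 1)
            if len(parts) == 2:
                out.append(parts[1])
        elif cmd == "c":  # cX: one ordinary glyph
            out.append(arg[:1])
        elif cmd == "C":  # Cname: special character, kept by name
            out.append("\\[" + arg.strip() + "]")
        elif cmd == "N":  # Nn: glyph index, unresolved without the font
            out.append("\\N'" + arg.strip() + "'")
        elif cmd == "n":  # n B A: end of an output line
            out.append("\n")
        # Everything else (x, p, f, s, m, D, H, h, V, v) sets state or
        # moves the drawing position and carries no text.
    return "".join(out)


print(extract_text(SAMPLE), end="")  # Hello, world.
```

Special characters and indexed glyphs are passed through as *roff
escape sequences rather than guessed at; resolving them needs the font
description.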

Regards,
Branden

[1] I get the impression that Jeffrey Friedl quit updating his O'Reilly
    book on regular expressions because he kept getting punked on the
    Internet by (pseudo?)academics over the distinction between
    "regexes" (Unixy stuff that supports backreferences and all kinds of
    other un-Kleene extensions) and regular expressions "proper".  While
    the distinction is useful--especially if you're a programmer and
    have decided to bite off the task of writing a regex matcher for
    yourself--the choice of terminology is poor because it's not
    distinct _enough_.  It's extremely predictable that anyone not
    trained in automata theory is going to infer that "regex" is an
    abbreviation for "regular expression".  What Unix people should do
    is simply be frank that software practitioners apply the term
    "regular expression" more broadly than computation theorists do.
    It's like how that neighbor of yours who is convinced of the healing
    power of crystals is concerned about the "chemicals" in our food...

[2] And now I know why the camel was chosen as Perl's O'Reilly mascot.

[3] Demonstration:

$ printf '\\N@72@\\N@69@\\N@76@\\N@76@\\N@79@\\N@44@ \\N@87@\\N@79@\\N@82@\\N@76@\\N@68@\n' | troff -Tascii
x T ascii
x res 240 24 40
x init
p1
x font 1 R
f1
s10
V40
H0
md
DFd
N72
H24
N69
h24
N76
h24
N76
h24
N79
h24
N44
wh48
N87
h24
N79
h24
N82
h24
N76
h24
N68
h24
n40 0
x trailer
V2640
x stop

[4] And if you know how the font is encoded, you are still not defeated.
    Historically, device-independent troffs do not report this
    information, but it would be straightforward to extend groff to do
    so.
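
    As a sketch of "knowing how the font is encoded", the following
    Python decodes the `N` commands from the demonstration in [3].  It
    assumes, for the ascii device only, that a glyph's index equals its
    ASCII code; the `decode` parameter is a hypothetical hook where a
    real font's mapping would go.

```python
import re

# Glyph, motion, and line-end commands copied from the demonstration
# output in [3]; the device-control preamble is omitted.
DEMO = """\
N72
H24
N69
h24
N76
h24
N76
h24
N79
h24
N44
wh48
N87
h24
N79
h24
N82
h24
N76
h24
N68
h24
n40 0
"""


def decode_indexed(ditroff, decode=chr):
    """Turn `N<index>` commands back into text via `decode`."""
    out = []
    for line in ditroff.splitlines():
        if line.startswith("w"):  # word boundary, as in "wh48"
            out.append(" ")
        m = re.fullmatch(r"N(-?\d+)", line.strip())
        if m:
            out.append(decode(int(m.group(1))))
    return "".join(out)


print(decode_indexed(DEMO))  # HELLO, WORLD
```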
