Re: z/OS porting issues, UTF-8 support, and the groff man(1) page

From: Mike Fulton
Subject: Re: z/OS porting issues, UTF-8 support, and the groff man(1) page
Date: Fri, 31 Mar 2023 13:05:09 -0700

On Fri, Mar 31, 2023 at 8:57 AM G. Branden Robinson <> wrote:

> [let me know if you're subscribed to the list or if you'd prefer not to
> be CCed]
> [also, if you want to break any of the several subjects arising in this
> message into a separate thread, please feel free]
> Hi Mike,
> At 2023-03-31T07:29:16-0700, Mike Fulton wrote:
> > Over the last year, we have been working hard in the z/OS Open Tools
> > community ( to not only port
> > the fundamental tools to z/OS, but also to do it completely in the
> > open.
> This is good news!  Knowing that you're a software developer might also
> make communications easier.  :)
> > We create one 'port' repo for each Open Source package and the repo
> > contains information on compiler options, dependencies, and so forth
> > so that anyone can (relatively easily) build the software.
> > We also have a special repo (meta) that has a rudimentary package
> > manager and build tool that we use (e.g. _zopen install_ to install
> > binaries, _zopen build_ to build from source, etc.).
> Much as with GNU/Linux distributions; this is a pleasure to hear.
> As a groff developer, I'm interested in minimizing the number of patches
> you have to carry "downstream" to support groff.
Definitely - I have not yet been able to build with the 'git' dev build but
have been building from the tarball. I was planning to work on upstreaming
once I had the 'git' build working (we are getting there now that we have
more tools in place - it's a circuitous process!)

> I assume the change here:
> is due to a limitation of the system's sed(1)?
Yes - that is the change. No - it's not because of sed. We have ported sed
and could rely on it as a dependency. The issue we hit is a bit ugly.
Because z/OS is a 'multi-tenant' operating system, we want people to be
able to install into a particular location of their choice (either as a
developer _or_ as a consumer of the binary).
To make that work, we run a post-process on the files when someone
downloads them to change the install 'root' location from where we built
the code to the target location they want to install into.
It's ugly and we end up doing a find across files to do this trick. If that
'sed' change is in there, we end up 'missing' some particular updates
because the string gets changed on us for the 'root', and so I took out
that sed update (a complete hack that I need to do
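
A minimal sketch of why a literal search-and-replace of the install root can miss strings that the groff sed edit has already touched. The paths, file names, and the rewrite loop are assumptions for illustration only; the actual z/OS Open Tools post-processing step is not shown in this thread.

```shell
#!/bin/sh
# Hypothetical build and install roots; neither path comes from the thread.
build_root=/build/zopen/groff
target_root=/home/user/tools/groff

work=$(mktemp -d)

# plain.txt holds the build root verbatim. breakpoints.txt holds the same
# line after a sed edit of the kind discussed here has appended \: to each
# run of slashes.
printf 'datadir=%s/share\n' "$build_root" > "$work/plain.txt"
printf 'datadir=%s/share\n' "$build_root" |
    sed 's|[^ ]//*|&\\:|g' > "$work/breakpoints.txt"

# Relocation pass: rewrite every file that contains the literal build root.
find "$work" -type f -exec grep -l -- "$build_root" {} + |
while IFS= read -r f; do
    sed "s|$build_root|$target_root|g" "$f" > "$f.new" && mv "$f.new" "$f"
done

cat "$work/plain.txt"        # now points at the target root
cat "$work/breakpoints.txt"  # unchanged: the literal root never matched
```

Only the first file is relocated, because the second no longer contains the build root as one contiguous string.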

> If the problem is the '\+' part of the pattern, I see that POSIX says
> that the interpretation of that is "implementation-defined", though the
> latest draft of Issue 8 (just out in the past 24 hours or so) says that
> "a future version of this standard may require "\?", "\+", and "\|" to
> behave as described for the ERE special characters '?', '+', and '|',
> respectively." (IEEE P1003.1™-202x/D3, March 2023, p. 181).
> A workaround would be:
> -s|[^ ]/\+|&\\\\:|g
> +s|[^ ]//*|&\\\\:|g
> If you also want to steal a slight improvement from groff 1.23, you can
> do this instead:
> -s|[^ ]/\+|&\\\\:|g
> +s|[^ ]//*|&\\\\:\\\\%|g
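
The portable form of the pattern can be tried in isolation. This is a sketch with a made-up sample line. It assumes the expression is passed to sed in single quotes, so two backslashes are enough where the patch above (which presumably passes through another layer of quoting) uses four.

```shell
#!/bin/sh
# Sample text is invented for illustration; it is not from the groff sources.
sample='see /usr//share/doc for details'

# Portable BRE: a non-space character, one slash, then zero or more slashes.
# The replacement appends the roff break-point escape \: to each match.
printf '%s\n' "$sample" | sed 's|[^ ]//*|&\\:|g'

# The groff 1.23 variant from the message: also append \% after \:.
printf '%s\n' "$sample" | sed 's|[^ ]//*|&\\:\\%|g'
```

The first command prints `see /usr//\:share/\:doc for details`. The slash before `usr` is left alone because it follows a space.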
> > We have indeed moved to a 'UTF-8 first' model, which for the most part
> > is a 'ISO8859-1 first' model
> Interestingly, this meshes closely with groff's assumptions.  Due to its
> chronological origins ca. 1990, it does not accept UTF-8 input, but it
> is aware of UTF-8 and can produce it as output.  The formatter, troff(1),
> accepts ISO Latin-1 input, except on systems where the C preprocessor
> macro "IS_EBCDIC_HOST" evaluates true; it then assumes that its input is
> encoded using code page 1047.
From my perspective, we can drop support for 1047 altogether. However,
I don't know if someone else has done their own 'separate' port. I haven't
seen it if there is one.
Correct. I don't set that symbol.

> I reckon you've already dealt with this if necessary, and ensured that
> your groff 1.22.4 build does not define that symbol.
> Is code page 1047 deprecated or obsolescent on z/OS?  If groff dropped
> support for it, do you suspect any z/OS users would be inconvenienced?
I would say neither. An application can choose whether it wants to work in
UTF-8/ASCII or whether it wants to work in EBCDIC (or both if it's careful).
I wrote a blog on this awhile back:

> > and we have a special OS library that takes care of edge case
> > conversions to EBCDIC (and provides a couple functions that are
> > missing).  This is also Open Source (zoslib).
> This is really good stuff to hear about; thanks for bringing this
> initiative to my attention.
> > We have about 80 packages we are porting / have ported. Some are very
> > far along like gnu make and Perl with many fixes upstreamed. Some are
> > just barely building - htop is probably a good example of one we have
> > just started on.
> I'm glad groff is a member of the first 100!  :D

> > I am also not sure if we want to work in UTF-8 or in ISO-8859-1. My
> > goal would be UTF-8 across the board, but I expect there are things we
> > still need to fix to get there. Our vim port seems to work well with
> > UTF-8 but I'll be honest that the testing of that is sparse still.
> My suggestion would be to back the UTF-8 horse.  groff already has
> machinery in place for accommodating input in UTF-8 via the preconv(1)
> preprocessor.
> If there is no longer an audience for code page 1047, several aspects of
> groff could be simplified, and it might make the transition of GNU
> troff's internal type to int32_t easier.  (I started down this road once
> before.)
This makes sense to me. I know for Perl, we made sure to keep EBCDIC
there, but the z/OS Open Tools community doesn't build with EBCDIC.

> > With all that background, I'm wondering if 'both' is the right answer?
> I don't feel qualified to answer this question in general; for groff,
> it's a pickle because the original implementer (James Clark) used many
> C0 and C1 control code points for internal purposes, to encode "node
> types" that could be encountered internally by the formatter when
> processing diversions (a Unix nroff/troff feature that usually only
> authors of macro packages mess with).
> You can see these assignments in the "input.h" header file.
> Use of these codes for internal purposes isn't necessarily incompatible
> with UTF-8 input; GNU troff already rejects them upon input, and almost
> none of them are meaningful for a "plain text" document that is going to
> achieve format control mostly via roff language features rather than
> control characters.  Input processing could be made more sophisticated
> (and more stateful when reading the input byte stream to keep track of
> UTF-8 sequences).
> > Would others also find it valuable to be able to have the mathematical
> > angle brackets in UTF-8 be transliterated to angle brackets in
> > ISO8859-1?
> Unless you mean degradation to basic Latin less than and greater than
> signs, U+003C and U+003E, then I don't think there are any valid
> transliteration targets in ISO Latin-1.  The "left-" and "right-pointing
> double angle quotation mark"s (U+00AB and U+00BB) are indeed visually
> similar but semantically pretty distinct.  I don't think I'd want to
> impose such a fallback in general.  (There are multiple ways groff users
> could provide fallbacks for themselves.)
Fair enough!
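
For anyone who does want the degradation to U+003C and U+003E on their own system, one user-side fallback is a byte-level filter applied to the formatted output. This is only a sketch: the function name `degrade_angles` is hypothetical, and it is not something groff or libiconv provides.

```shell
#!/bin/sh
# UTF-8 byte sequences for U+27E8 and U+27E9 (mathematical angle brackets),
# written as octal escapes so the script itself stays plain ASCII.
la=$(printf '\342\237\250')
ra=$(printf '\342\237\251')

# Hypothetical filter: replace the brackets with basic Latin < and >.
# LC_ALL=C makes sed treat the input as raw bytes.
degrade_angles() {
    LC_ALL=C sed -e "s/$la/</g" -e "s/$ra/>/g"
}

printf 'GNU %sroff%s\n' "$la" "$ra" | degrade_angles
```

This prints `GNU <roff>`. Formatted output could be piped through the same filter before paging.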

> > If so, perhaps a 'starter fix' would be if I worked with the libiconv
> > folks to see if that can be added (I opened a similar question in the
> > libiconv channel since honestly I'm not sure the best way to fix
> > this).
> You can pursue both lines of attack independently, especially if the
> iconv developers have a good reason for not performing this fallback
> already.
> I'm not sure groff has a good reason for not performing this fallback.
> At this point I think I will tap Dave Kemper, another groff developer
> who has a fairly strong interest in the fallback issue.
Thank you.

> > In parallel, I think I need to understand how I could change the way I
> > build man so that it operates in UTF-8 mode.
> I think that is a good idea.  It looks like your man is man-db, which is
> really good news because that's developed by Colin Watson who has also
> been groff's package maintainer for Debian for a long time.
> Probably the first thing to do is make sure we know what groff is
> producing in your environment.
> Here is how to (mostly) bypass man(1) and render the groff(1) man page
> much as man(1) itself would do.
> $ zcat $(man -w groff) | groff -man -Tutf8 | less -R
> (If less(1) is not available, try "more", "more -b", or this:
> $ zcat $(man -w groff) | groff -man -Tutf8 -P -c | ul | more
> FYI: The version of "more" on my Debian system breaks lines at incorrect
> places when given the above.)
> Here, we are using man(1) only as a librarian, to tell us where the
> groff(1) man page is.  We are directing formatting ourselves.
> If this looks fine and you get the angle brackets you're expecting, then
> something is running in the pipeline man-db man(1) constructs, _after_
> grotty(1) produces the output, and doing violence to the angle brackets;
> that would be where the bug lies.
> To cut out yet another source of trouble, if your terminal emulator has
> more than 765 lines of scrollback buffer, you can omit paging the
> groff(1) document entirely.
I did this and it _does_ look good! When I ran it through less -R I did hit
problems with the angled brackets - that may be an issue with less.

> But if it _doesn't_ look fine, then we need to find out why.
> I would next inspect groff's device-independent output (which I call
> "grout" for short) to see what's being handed to groff's terminal output
> driver (grotty(1)).
> $ zcat $(man -w groff) | groff -man -Tutf8 -Z | less
> Around line 459 you should see a sequence of lines like this.
> tGNU
> wh24
> Cla
> h24
> t
> Cra
> h24
> t.
> Those "Cla" and "Cra" lines are key.  If they are present, then you
> have almost certainly found a bug in groff.
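
Rather than paging to the right spot by hand, the check for those two lines can be scripted. A small sketch follows. The helper name `count_angle_glyphs` is hypothetical, and it is fed the sample lines quoted above instead of real groff output.

```shell
#!/bin/sh
# Hypothetical helper: count input lines that are exactly "Cla" or "Cra".
count_angle_glyphs() {
    grep -c -x -e 'Cla' -e 'Cra'
}

# The sample sequence quoted above, standing in for real output.
printf '%s\n' tGNU wh24 Cla h24 t Cra h24 t. | count_angle_glyphs
```

On the sample this prints 2. Against a real page, the input would come from something like `zcat $(man -w groff) | groff -man -Tutf8 -Z`, where `-Z` asks groff to stop after producing its intermediate output.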
> Another thing I would do is to view the groff_char(7) man page.
> $ man groff_char
> I don't get warnings here, but the Output and Input columns under:
8-bit Character Codes 160 to 255
are all
�        �

> On my system, code point coverage is complete except for three
> characters.
> troff: <standard input>:1051: warning: can't find special character 'bs'
> troff: <standard input>:1192: warning: can't find special character 'radicalex'
> troff: <standard input>:1195: warning: can't find special character 'sqrtex'
> These problems are expected everywhere[1] for historical and technical
> reasons I won't get into unless asked.
> Let me know what you find and we'll see if we can narrow this down.
> Regards,
> Branden
> [1] the first everywhere, the last two on all terminal devices
