From: Oliver Corff
Subject: Re: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian
Date: Tue, 25 Apr 2023 20:02:00 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.10.1

Hi Branden,

thank you very much for the detailed answer.

Yes, KOI8-R has the Cyrillic uppercase in 0xE0..0xFF, lowercase in
0xC0..0xDF; in the control code area, there are no letters in the human
sense of the word. I had a look at the current groff documentation
referenced by your footnote, and I imagine that KOI8-R-encoded Cyrillic
text will be processed seamlessly (that was the basic assumption behind
my recent and only temporary suggestion to process Greek in ISO
encoding), yet my input is \[u04xx]-style Unicode Cyrillic.
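To make the code positions concrete, here is a minimal Python sketch that prints where a few Cyrillic letters sit in KOI8-R and in Unicode. It relies only on Python's built-in koi8_r codec; the helper name koi8r_byte is made up for illustration and has nothing to do with groff itself.

```python
def koi8r_byte(ch: str) -> int:
    """Return the single KOI8-R byte value for one character."""
    return ch.encode("koi8_r")[0]


# Compare the 8-bit KOI8-R position with the Unicode code point.
for ch in "аяАЯ":
    print(f"{ch}  KOI8-R 0x{koi8r_byte(ch):02X}  Unicode U+{ord(ch):04X}")
```

The KOI8-R values fall into 0xC0..0xFF, while the Unicode values are all in the U+04xx block.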

Somehow Cyrillic input in utf8, made readable by preconv(1), should
match the letter code positions in KOI8-R, otherwise pattern matching
for hyphenation would fail. How is Unicode Cyrillic text in groff
internally represented? When dumping gtroff output to the console, I see
u04xx codepoints. In my naive understanding I assume it would be the
same internally.
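The mismatch I suspect can be sketched in a few lines of Python. The function to_groff_escapes is a hypothetical stand-in that only approximates the \[uXXXX] form preconv(1) emits; it is not preconv's actual code, and it says nothing about how groff stores characters internally.

```python
def to_groff_escapes(text: str) -> str:
    """Rewrite every non-ASCII character as a \\[uXXXX] escape."""
    return "".join(
        ch if ord(ch) < 0x80 else f"\\[u{ord(ch):04X}]"
        for ch in text
    )


word = "да"
print(to_groff_escapes(word))                    # the form I feed to groff
print([hex(b) for b in word.encode("koi8_r")])   # values a KOI8-R pattern file holds
print([hex(ord(ch)) for ch in word])             # Unicode code points of the same word
```

If the patterns are keyed on the 8-bit values and the text arrives as U+04xx code points, the two lists never coincide, which would explain why no pattern matches.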

I tried the KOI8-R-encoded hyphenation file in my little russ.ms
document, but no hyphenation was introduced. I set the .hy register
etc., but nothing happened: no hyphenation. That's also why I put these
monster words with 30-odd characters into the file and forced everything
to be in two-column mode, in order to make the line-breaking as
challenging as possible.

There is another strong argument against any KOI8-R hack. It does not
have the full Cyrillic alphabet. Even Russian typesetting is defective
(modern Russian has 33 letters; if you include pre-modern Russian, the
character set grows even more), let alone other languages written in
Cyrillic (like Ukrainian, Mongolian and Kazakh). These languages have a
larger vowel set than Russian and in the case of Mongolian and Kazakh
use vowel symbols which are best matched by umlauts in the Latin
alphabet: compare уг and үг, толь and төлөө. So, a Mongolian word like
төлөвлөгөө or төлөөлөгчдийн would never be writable, let alone be
hyphenatable in KOI8-R. The Kazakh and Bashkir alphabets, for instance,
comprise about 42 letters.
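This is easy to check with Python's built-in koi8_r codec. The sketch below simply tries to encode a word and reports whether every letter has a KOI8-R position; the helper name encodable_in_koi8r is invented for this example.

```python
def encodable_in_koi8r(word: str) -> bool:
    """Return True if every character of word exists in KOI8-R."""
    try:
        word.encode("koi8_r")
        return True
    except UnicodeEncodeError:
        return False


# Words taken from the paragraph above.
for word in ["уг", "үг", "толь", "төлөө", "төлөвлөгөө"]:
    print(word, encodable_in_koi8r(word))
```

The words containing ү (U+04AF) or ө (U+04E9) cannot be encoded at all, so no 8-bit pattern file in KOI8-R could ever mention them.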

So, for me there are sound reasons not to try to make KOI8-R work
*somehow*, as it would not solve the fundamental problems just mentioned.

The hyphenation file parser you referred to looks innocent enough to the
untrained eye. Do you think expanding the current ^^xx notation to
^^^^xxxx notation would derail the input processor? A hyphenation file
would then not be human-readable, but this is a minor problem;
hyphenation patterns look highly unintelligible anyway.
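To show what such a file would look like, here is a rough Python sketch that rewrites a pattern into caret notation. It rests on assumptions: the ^^^^xxxx form is only my proposal and, as far as this thread goes, nothing in groff reads it today; lowercase hex digits follow TeX usage; caret_escape is a hypothetical helper.

```python
def caret_escape(pattern: str) -> str:
    """Rewrite non-ASCII characters of a hyphenation pattern.

    Code points below 0x100 use the existing ^^xx form; other BMP
    code points use the proposed (hypothetical) ^^^^xxxx form.
    """
    out = []
    for ch in pattern:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                # ASCII, including pattern digits
        elif cp < 0x100:
            out.append(f"^^{cp:02x}")
        elif cp < 0x10000:
            out.append(f"^^^^{cp:04x}")
        else:
            raise ValueError(f"U+{cp:04X} is outside the BMP")
    return "".join(out)


print(caret_escape("т1ө"))
```

The result stays pure ASCII, so the file could be stored in any encoding, at the price of readability.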

Best regards,

Oliver.


On 25/04/2023 18:51, G. Branden Robinson wrote:
Hi Oliver,

At 2023-04-25T16:25:49+0200, Oliver Corff wrote:
In the meantime, I had a look at that Russian hyphenation file, and to
my relief, the structure of the groff hyphenation pattern files is
that of TeX hyphenation pattern files, which I have worked on before.

Yup.  They were born that way.

But... the hyphenation file hyphen.ru in the aforementioned source is
not usable in the current set-up because the Russian syllable
fragments are encoded in KOI-8, an 8 bit encoding based on a GOST
Standard of the USSR.

So, it does not match the internal code representation of Unicode code
points.

No, it doesn't.  But some of the other hyphenation pattern files don't,
either; if you look you will see that they're encoded variously in ISO
646, ISO 8859-1, ISO 8859-2, and ISO 8859-15.

This is because groff's hyphenation pattern file parser doesn't
understand UTF-8.

That would be a nice thing to have.

hyphen.ru does a very sneaky thing that I did not think was possible
before Nikita Ivanov dropped it on our doorstep and I took a closer look
at the KOI8-R encoding.

You might know that code points in the "C1 Controls" block of Unicode
(U+0080..U+009F) are invalid input characters to groff.  groff uses them
for internal, bespoke purposes.[1]  This is a barrier to making groff
support UTF-8 input directly, as noted in our documentation.[2][3]

But an interesting property of KOI8-R is that none of the glyphs it
heaps up in the C1 region are alphabetic.

Therefore they don't require hyphenation.

Therefore the Russian hyphenation patterns, using KOI8-R, can masquerade
effectively as an ISO 8859 encoding.
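The claim about the C1 region can be checked with a short Python sketch that uses the built-in koi8_r codec and str.isalpha() as the test for "alphabetic". The function name koi8r_c1_letters is made up for this example, and the check models only the encoding, not anything groff does.

```python
def koi8r_c1_letters() -> list:
    """Return KOI8-R byte values in 0x80..0x9F that decode to letters."""
    return [
        b for b in range(0x80, 0xA0)
        if bytes([b]).decode("koi8_r").isalpha()
    ]


print(koi8r_c1_letters())

# For contrast, the 64 byte values 0xC0..0xFF all decode to letters.
print(all(bytes([b]).decode("koi8_r").isalpha() for b in range(0xC0, 0x100)))
```

An empty list from the first call means that no hyphenation pattern ever needs a byte that groff reserves for its own use.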

This is the same deal that lets us support ISO 8859-{2,15} in our
hyphenation patterns.  GNU troff doesn't actually care what these code
points "are", it only needs to know their values to make hyphenation
decisions.  The intelligibility of the hyphenation patterns to a human
reader is determined by the character encoding, but within the range
U+00A0..U+00FF (actually more than that: U+0021..U+007F as well), groff
has no dog in the semantic interpretation fight.

Since groff internally seems to work with Unicode code positions, the
question is: in which format should the hyphenation patterns be
presented to groff? As-is, that is as utf8 text, or in \[u04xx] form?
That does not seem to work either, according to my last experiment.

For now, neither; the KOI8-R cheat seems to work fine, as far as I can
tell or understand.  Admittedly, I'm not a Russian speaker.  But I
believe the contributor is.

Eventually, we will need a way for our hyphenation pattern file reader
function[6] to interpret UTF-8 input.  The cleanest thing to do would be
to have it use the same facility as regular GNU troff input stream
reading support for UTF-8.  But that has to be written first.

Regards,
Branden

[1] https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h

[2] https://www.dropbox.com/sh/17ftu3z31couf07/AAC_9kq0ZA-Ra2ZhmZFWlLuva?dl=0

     Open groff.YYYY-MM-DD.pdf, where the date changes from time to time;
     see pages 73 and 84 (as of this writing).

[3] It can be done; it's just harder than migrating from ASCII to UTF-8.
     My idea is to relocate all these bespoke groff symbols to the
     Unicode Private Use Area.  But for that we need to change groff's
     string class[4] to build upon either (1) wide characters or (2)
     multibyte characters.  My preference is to go straight to char32_t.

[4] groff, having been first written in about 1989, does not use the
     Standard C++ library string class.  This has proven unproblematic;
     it's implemented well and I'm not aware of any defect _ever_ being
     exposed in it.  (This illustrates that James Clark was a better C++
     programmer than most.)  But if I change it, someone's going to ask
     me why I don't just migrate to Standard C++ library facilities for
     it and I need a good answer.  I'm working on that.  When defending
     my engineering decisions, I prefer to be equipped with stone tablets
     strong enough to smash over the head of my interlocutor.  I'm not quite
     there yet with groff strings: The Next Generation.

     While I'm pontificating I'll opine that I'm not a huge fan of C++ as
     a language, but I have found with groff that, given discipline, and
     by maintaining a clear view of its roots in C (_also_ not my
     favorite language--but one alienating, enemy-making rant at a time),
     and not picking up every f***ing new feature that gets shoved into
     the language as soon as (or before) it's standardized, it _can_ be
     managed.  But I also think that the C++ templating facility was, in
     implementation, one of the worst features ever developed for any
     programming language.

     I've decided to try to keep groff's C++ codebase ISO C++98
     compatible for the foreseeable future, even though there are _some_
     aspects of later C++ standards that I like quite a bit.  (Simple
     things, like proper damn data types and constants for null
     pointers.)  Clark wrote groff before name spaces, templates, and
     exceptions were added to the language, so you don't see them in its
     sources--it's pretty much in "Annotated Reference Manual C++", but
     if you look carefully you _will_ find some use of vec<>, added by
     later contributors.  And I have seen the pre-template,
     preprocessor-based implementation of "ITABLES" and "PTABLES", and
     no, I don't think it's prettier than templates.  The interesting
     thing is, 30+ years after adding these generic programming
     facilities, nothing in groff _ever_ specialized them beyond the
     base types they were initially used with.  I find that suggestive.

     If you want to see generics done right, look at Ada.[5]  <mic drop>

[5] Yes, the background of C++ templates' authorship is a tragedy.

[6] https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/env.cpp#n3790

--
Dr. Oliver Corff
mailto:oliver.corff@email.de