From: G. Branden Robinson
Subject: Re: Questions concerning hyphenation patterns for non-Latin languages, e.g. Russian
Date: Tue, 25 Apr 2023 23:42:22 -0500

Hi Oliver,

At 2023-04-25T20:02:00+0200, Oliver Corff wrote:
> Yes, KOI8-R has the Cyrillic uppercase in 0xE0..0xFF, lowercase in
> 0xC0..0xDF; in the control code area, there are no letters in the
> human sense of the word. I had a look at the current groff
> documentation referenced by your footnote, and I imagine that
> KOI8-R-encoded Cyrillic text will be processed seamlessly (that was
> the basic assumption behind my recent and only temporary suggestion to
> process Greek in ISO encoding), yet my input is \[u04xx]-style Unicode
> Cyrillic.

Right.  I don't think we can support that at present.

> Somehow Cyrillic input in utf8, made readable by preconv(1), should
> match the letter code positions in KOI8-R, otherwise pattern matching
> for hyphenation would fail.

For Unicode-encoded Cyrillic input, I think you're going to need to
convert the input to KOI8-R first with iconv.
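
A minimal sketch of that conversion step follows.  The function name and
the file names in the usage note are made up for illustration; only
iconv(1) and its -f and -t options are assumed to exist.  The last line
pushes one letter through the pipe so the resulting KOI8-R byte can be
inspected.

```shell
# Hypothetical helper: recode UTF-8 input (files or standard input) to KOI8-R.
to_koi8r() {
    iconv -f UTF-8 -t KOI8-R "$@"
}

# Demonstration: U+0430 CYRILLIC SMALL LETTER A, given as its two UTF-8
# bytes, comes out as a single byte in the KOI8-R lowercase range.
printf '\320\260' | to_koi8r | od -An -tx1
```

In practice this would look something like
"to_koi8r russ-utf8.ms > russ-koi8r.ms", with the result then fed to
groff.  Expect iconv to stop with an error on letters KOI8-R cannot
represent, such as the Mongolian vowels you mention.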

> How is Unicode Cyrillic text in groff internally represented? When
> dumping gtroff output to the console, I see u04xx codepoints. In my
> naive understanding I assume it would be the same internally.

At 2023-04-25T16:25:49+0200, Oliver Corff wrote:
> Since groff internally seems to work with Unicode code positions, the
> question is: in which format should the hyphenation patterns be
> presented to groff? As-is, that is as utf8 text, or in \[u04xx] form?
> That does not seem to work either, according to my last experiment.

I didn't squarely address this question of yours earlier, which might
have helped.  Sorry about that.

There are a couple of answers to that depending on what stage of
processing we're talking about, but the earlier one is of more interest.

groff internally represents characters as bytes.  8-bit bytes.  That's
all we have.

We support Unicode code points the same way we represent everything else
that isn't ASCII--with "special characters".  \(hy, \[coproduct],
\[u0400] and so on.
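
To make the byte-width point concrete, here is a small sketch that
counts the bytes one Cyrillic letter occupies in each encoding.  The
helper name is invented for this example; it relies only on printf,
wc, tr, and iconv.

```shell
# Count the bytes on standard input (tr strips the padding some wc's add).
byte_count() {
    wc -c | tr -d ' '
}

# U+0430 CYRILLIC SMALL LETTER A, written out as its two UTF-8 bytes.
utf8_a='\320\260'

printf "$utf8_a" | byte_count                              # UTF-8 form
printf "$utf8_a" | iconv -f UTF-8 -t KOI8-R | byte_count   # KOI8-R form
```

The UTF-8 form does not fit in one 8-bit character, which is why such
input has to reach the formatter as a special character escape, while
the KOI8-R form fits in a single byte.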

> I tried the KOI8-R-encoded hyphenation file in my little russ.ms
> document, but no hyphenation was introduced. I set the .hy register
> etc., but nothing happened: no hyphenation. That's also why I put
> these monster words with 30-odd characters into the file and forced
> everything to be in two-column mode, in order to make the
> line-breaking as challenging as possible.

Hmm.  Did you load the Russian localization file, as suggested by the
documentation?

Here's an exhibit I've prepared.

$ file ATTIC/udhr-ru-koi8r.ms
ATTIC/udhr-ru-koi8r.ms: troff or preprocessor input, ISO-8859 text
$ iconv -f koi8-r -t utf8 ATTIC/udhr-ru-koi8r.ms
.nr LL 28n
.LP
Все люди рождаются свободными и равными в своем достоинстве и правах.
Они наделены разумом и совестью и должны поступать в отношении друг
друга в духе братства.
.LP
Каждый человек должен обладать всеми правами и всеми свободами,
провозглашенными настоящей Декларацией, без какого бы то ни было
различия, как-то в отношении расы, цвета кожи, пола, языка, религии,
политических или иных убеждений, национального или социального
происхождения, имущественного, сословного или иного положения.
.LP
Кроме того, не должно проводиться никакого различия на основе
политического, правового или международного статуса страны или
территории, к которой человек принадлежит, независимо от того, является
ли эта территория независимой, подопечной, несамоуправляющейся или
как-либо иначе ограниченной в своем суверенитете.
.LP
Каждый человек имеет право на жизнь, на свободу и на личную
неприкосновенность.
.LP
Никто не должен содержаться в рабстве или в подневольном состоянии;
рабство и работорговля запрещаются во всех их видах.
.LP
Никто не должен подвергаться пыткам или жестоким, бесчеловечным или
унижающим его достоинство обращению и наказанию.
.LP
Каждый человек, где бы он ни находился, имеет право на признание его
правосубъектности.
$ ./build/test-groff -ms -mru -Tutf8 ATTIC/udhr-ru-koi8r.ms




Все люди рождаются свободны‐
ми  и равными в своем досто‐
инстве и правах. Они наделе‐
ны  разумом  и  совестью   и
должны поступать в отношении
друг друга в духе братства.

Каждый  человек должен обла‐
дать всеми правами  и  всеми
свободами,  провозглашенными
настоящей  Декларацией,  без
какого  бы то ни было разли‐
чия,  как‐то   в   отношении
расы,   цвета   кожи,  пола,
языка, религии, политических
или иных  убеждений,  нацио‐
нального   или   социального
происхождения, имущественно‐
го,  сословного  или   иного
положения.

Кроме того, не должно прово‐
диться  никакого различия на
основе политического, право‐
вого или международного ста‐
туса страны или  территории,
к  которой человек принадле‐
жит,  независимо  от   того,
является  ли  эта территория
независимой,     подопечной,
несамоуправляющейся      или
как‐либо иначе  ограниченной
в своем суверенитете.

Каждый  человек  имеет право
на жизнь, на  свободу  и  на
личную неприкосновенность.

Никто  не должен содержаться
в рабстве или в подневольном
состоянии; рабство  и  рабо‐
торговля запрещаются во всех
их видах.

Никто не должен подвергаться
пыткам  или жестоким, бесче‐
ловечным или  унижающим  его
достоинство    обращению   и
наказанию.

Каждый человек, где бы он ни
находился,  имеет  право  на
признание     его     право‐
субъектности.






That's what I get, 6 blank lines of vertical margin at the top and
bottom and everything.

> There is another strong argument against any KOI8-R hack. It does not
> have the full Cyrillic alphabet. Even Russian typesetting is defective
> (modern Russian has 33 letters, if you include pre-modern Russian, the
> character set grows even more), let alone other languages written in
> Cyrillic (like Ukrainian, Mongolian and Kazakh). These languages have
> a larger vowel set than Russian and in the case of Mongolian and
> Kazakh use vowel symbols which are best matched by umlauts in the
> Latin alphabet: compare уг and үг, толь and төлөө. So, a Mongolian
> word like төлөвлөгөө or төлөөлөгчдийн would never be writable, let
> alone be hyphenatable in KOI8-R. Kazakh and Bashkir alphabets, for
> instance, comprise about 42 letters.

I was aware of some of these issues (particularly the imperfect coverage
of Ukrainian in KOI8-R, a question with ramifications beyond typesetting
these days).  A big advantage to Nikita's approach is that it works with
what we have.

> So, for me there are sound reasons not to try to make KOI8-R work
> *somehow*, as it would not solve the fundamental problems just
> mentioned.

We're not having to put hacks into any part of groff to accommodate
Nikita's contribution.  Under those conditions, and as long as we
acknowledge its limitations (only "Great" Russian in KOI8-R encoding is
supported) it seems hard to say no.

With a little help, we can support KOI8-U; the alphabetic characters it
adds remain in the Latin-1 extension code block, replacing box-drawing
symbols that we don't predefine special characters for anyway.  (If you
want those, a groff document in any encoding can access them by loading
the rfc1345.tmac package new to groff 1.23.0.[1])  All we need is for
someone to contribute support just as Nikita has.

> The hyphenation file parser you referred to looks innocent enough to
> the untrained eye. Do you think expanding the current ^^xx notation to
> ^^^^xxxx notation would derail the input processor?

No, because groff's hyphenation codes correspond to character code
points, and those are only one byte wide in groff anyway.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/charinfo.h#n28
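
As a rough illustration of how one-byte code points line up with the
existing ^^xx notation, the sketch below spells out UTF-8 Cyrillic text
as the ^^xx form of its KOI8-R bytes.  The function name is invented
here, and this is not a description of how any shipped pattern file was
produced.

```shell
# Hypothetical helper: print standard input (UTF-8) as ^^xx escapes of
# its KOI8-R bytes, one escape per byte.
caret_notation() {
    iconv -f UTF-8 -t KOI8-R |
        od -An -v -tx1 |                  # hex dump, one token per byte
        tr -s ' \n' '\n' |                # put each token on its own line
        sed -n 's/^\(..\)$/^^\1/p' |      # prefix each two-digit token
        tr -d '\n'                        # join the escapes back together
}

# Demonstration with U+0430 and U+0431, given as UTF-8 bytes.
printf '\320\260\320\261' | caret_notation
```

Every letter KOI8-R can encode yields exactly one two-digit escape, so
nothing in this scheme would ever produce a four-digit code.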

> A hyphenation file would then not be human-readable, but this is a
> minor problem; hyphenation patterns look highly unintelligible anyway.

I think it would be a win if we could consume TeX hyphenation files
exactly as they ship them.  groff's mailing list is not, as far as I can
tell, thick with hyphenation specialists.  For that matter, the TeX
community may not be, either.

Regards,
Branden

[1] https://git.savannah.gnu.org/cgit/groff.git/tree/contrib/rfc1345
