Re: minor hyphenation issue
From: Werner LEMBERG
Subject: Re: minor hyphenation issue
Date: Wed, 12 Apr 2017 08:08:05 +0200 (CEST)
>> the basic ("knuthian") tex hyphenation algorithm does not handle
>> any words with diacritics, and that is what the us list is based
>> on.
In general, this is not a restriction since up to 256 characters are
allowed in `patgen', which is the ultimate program to generate
hyphenation patterns. Non-English hyphenation patterns simply use
precomposed characters with diacritics; for example, the German
patterns now use the latin-9 character set. The English patterns
could do exactly the same to allow stuff like `chef d'œuvre' (assuming
that this word could be hyphenated, which is probably not true :-).
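To make the encoding point concrete, here is a small Python sketch.
It is not part of `patgen' or groff, and the helper name
`fits_in_8bit' is made up for illustration. It only checks whether
every character of a word is available as a single byte in a given
8-bit character set, which is the precondition for using the word with
patterns limited to 256 characters.

```python
def fits_in_8bit(word: str, encoding: str) -> bool:
    """Return True if every character of `word` maps to one byte in `encoding`.

    Hypothetical helper for illustration only; patgen itself does not
    work this way, it simply operates on 8-bit input.
    """
    try:
        # In a single-byte encoding, one character becomes exactly one byte.
        return len(word.encode(encoding)) == len(word)
    except UnicodeEncodeError:
        # At least one character has no slot in this character set.
        return False


word = "chef d'œuvre"
print(fits_in_8bit(word, "iso8859_15"))  # latin-9 has a precomposed œ
print(fits_in_8bit(word, "latin-1"))     # latin-1 has no œ
```

With latin-9 the check succeeds, with latin-1 it fails because of the
`œ'.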
The real issue is rather that *users* are not accustomed to selecting
an input and/or font encoding while typesetting US English texts. The
only chance to improve that IMHO is to use TeX systems that natively
use UTF-8. So groff has a slight advantage here over plain TeX since
it is set up by default to use latin-1.
Note, however, that no one takes care of the US patterns. The most
recent version used in the `tex-hyphen' project at
https://github.com/hyphenation/tex-hyphen
is from 1990! In other words, the only `standardized' corrective is
Barbara's list...
> I see. Werner (or anyone else familiar with the groff side of
> things), is this limitation also present in groff? Or could groff's
> version of tmac/hyphenex.us be put into Latin-9 encoding to
> accommodate these words?
It could. However, for the sake of maintainability, I strongly
suggest that `hyphenex.us' stays in sync with the original one edited
by Barbara. You can always add new entries with the `.hw' request
(provided your setup correctly understands the corresponding encoding;
have a look at how German is handled, for example).
>> i'm surprised that the encoding is (still?) listed as latin-* --
>> there has been an effort to support utf8, so i (perhaps rashly)
>> assumed that would be the base encoding.
groff cannot digest UTF-8 natively. However, there are means to
automatically map UTF-8 to its internal representation, which usually
is latin-1, together with constructs like \[uXXXX] to access Unicode
encoded characters outside the selected encoding.
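As a rough illustration of that mapping, here is a Python sketch. It
is not how groff's own preprocessing is implemented, and the function
name `to_groff_input' is invented for this example. It assumes latin-1
as the internal encoding, keeps every character that fits into it, and
writes everything else as a \[uXXXX] escape. Real input would need
more care (control characters, backslashes already present in the
text).

```python
def to_groff_input(text: str) -> str:
    """Map a Unicode string to latin-1 characters plus \\[uXXXX] escapes.

    Illustrative sketch only: assumes latin-1 as the internal encoding
    and does no special handling of control characters or backslashes.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code <= 0xFF:
            # Representable in latin-1: pass through unchanged.
            out.append(ch)
        else:
            # Outside latin-1: use the Unicode escape with at least
            # four uppercase hexadecimal digits.
            out.append("\\[u%04X]" % code)
    return "".join(out)


print(to_groff_input("chef d'œuvre"))  # prints: chef d'\[u0153]uvre
```

Characters such as `ï' in `naïve' stay as they are, since they are
part of latin-1.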
> http://git.savannah.gnu.org/gitweb/?p=groff.git;a=history;f=tmac/hyphenex.det;h=c74eebabff8e35353fdfb176a5c98df56c3e4ea0;hb=HEAD
`hyphenex.det' is no longer maintained – and now deleted from the
repository: I took the opportunity to completely update the German
hyphenation patterns, and this file is no longer needed.
> Their encodings on the TeX side may have been updated, and the
> changes never pulled to groff.
Today, almost all hyphenation patterns in the `tex-hyphen' repository
(and thus in the distribution from CTAN) are in UTF-8 encoding.
> In contrast (and probably because of this thread), groff's
> tmac/hyphenex.us was updated from TeX four days ago:
Exactly.
> This file does not specify any encoding, but its entire contents
> fall into 7-bit ASCII.
Well, the list simply doesn't contain any non-ASCII words...
Werner