Re: The hyphenation algorithm produces wrong results
From: Werner LEMBERG
Subject: Re: The hyphenation algorithm produces wrong results
Date: Sun, 04 Mar 2018 08:39:44 +0100 (CET)
> .ll 1n
> .hy 48
You *must not* use such values if the patterns don't allow it! From
groff.texi:
For historical reasons the default value of the @code{hy} request
doesn't fit the American English hyphenation patterns that are used
by groff as the default. These patterns expect that neither the
first character nor the last two characters are to be hyphenated;
this corresponds to value@tie{}4.  Consequently, @code{hy}'s default
value@tie{}1 or even setting values 16 or@tie{}32 might lead to
(additional) incorrect hyphenation points.
Anyway, I've now replaced this with
The number of characters at the beginning of a word after which the
first hyphenation point should be inserted is determined by the
patterns themselves; it can't be reduced further without introducing
additional, invalid hyphenation points (unfortunately, this
information is not part of a pattern file; you have to know it in
advance).  The same is true for the number of characters at the end
of a word before which the last hyphenation point should be inserted.
For example, the code
@Example
.ll 1
.hy 48
@endExample
returns
@Example
s-
plit-
t-
in-
g
@endExample
instead of the correct `split-ting'. US-English patterns as
distributed with groff need two characters at the beginning and
three characters at the end; this means that value@tie{}4 of
@code{hy} is mandatory.  Value@tie{}8 is possible as an additional
restriction, but values@tie{}1 (the default!), 16, and@tie{}32
should be avoided.
to clarify the issue even more.
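The effect of those minimums can be illustrated with a small Python sketch.  It is not groff's code; the function names are made up for illustration, and the raw break points for `splitting' (after `s' and after `split') are simply the ones the patterns yield when no minimums are enforced.

```python
def apply_minimums(word, points, left_min, right_min):
    """Keep only break points that leave at least left_min characters
    before the hyphen and at least right_min characters after it."""
    return [i for i in points if left_min <= i <= len(word) - right_min]


def show(word, points):
    """Render a word with hyphens inserted at the given break points."""
    pieces = [word[a:b] for a, b in zip([0] + points, points + [len(word)])]
    return "-".join(pieces)


# Hypothetical input: break points the patterns give for `splitting'
# when one character on each side is accepted.
RAW_POINTS = [1, 5]

print(show("splitting", apply_minimums("splitting", RAW_POINTS, 1, 1)))  # s-plit-ting
print(show("splitting", apply_minimums("splitting", RAW_POINTS, 2, 3)))  # split-ting
```

With two characters required at the beginning and three at the end, only the break after `split' survives, which is the result the US-English patterns are designed for.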
> The algorithm
>
> 1) uses pattern in the wrong places, at the beginning of a word
> although no period is in the pattern
Your view of how patterns work is too simplistic...
> 2) splits off one letter at the end although I found no corresponding
> pattern in the "hyphen.us" file.
>
> splitting s-plit-t-in-g
OK, let's look at the word `splitting', using the `patternize.lua'
demo program from the padrinoma project
(https://github.com/sh2d/padrinoma).
> texlua patternize.lua -p hyphen.us -l 1 -t 1 -m 1 -v
pattern file: hyphen.us (4555 patterns read)
spot mins, special characters: 1 1 '-=.'
splitting
. s p l i t t i n g .
   1p2l2
      l1i t
         4t3t2
       4i t t
           2t1i n
.0s1p2l4i4t3t2i0n0g0.
s-plit-ting
As can be seen, the patterns themselves contain a breakpoint after the
leading `s' character!
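How the trace above arises can be sketched in a few lines of Python.  This is only an illustration of the usual pattern scheme (slide every pattern over the dotted word, keep the highest digit in each gap, break where the digit is odd), not the code of groff or patternize.lua.  The pattern list holds just the five patterns shown in the trace, and the function names are invented for this example.

```python
import re

# Only the five patterns reported in the trace for `splitting';
# the real hyphen.us file holds 4555 patterns.
PATTERNS = ["1p2l2", "l1it", "4t3t2", "4itt", "2t1in"]


def parse(pattern):
    """Split a pattern into its letters and the digit found in each gap."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            values[pos] = int(ch)
        else:
            pos += 1
    return letters, values


def break_points(word, patterns):
    """Return every index i where word[:i] + '-' + word[i:] is permitted,
    accepting a single character on either side (like `-l 1 -t 1')."""
    dotted = "." + word + "."
    levels = [0] * (len(dotted) + 1)   # levels[k]: value before dotted[k]
    for pattern in patterns:
        letters, values = parse(pattern)
        start = dotted.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                levels[start + offset] = max(levels[start + offset], value)
            start = dotted.find(letters, start + 1)
    # The gap before dotted[i + 1] separates word[:i] from word[i:].
    return [i for i in range(1, len(word)) if levels[i + 1] % 2 == 1]


def hyphenate(word, patterns):
    points = break_points(word, patterns)
    pieces = [word[a:b] for a, b in zip([0] + points, points + [len(word)])]
    return "-".join(pieces)


print(hyphenate("splitting", PATTERNS))  # s-plit-ting
```

The odd value 1 in front of `p' comes from the pattern `1p2l2' alone; no pattern with a leading period is needed to produce the break after `s'.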
However, your extreme line length settings make groff emit `can't
break line' warnings.  If groff does that, it apparently starts
searching anew for hyphenation points in the remaining substring.
plitting
. p l i t t i n g .
 1p2l2
    l1i t
       4t3t2
     4i t t
         2t1i n
.0p2l4i4t3t2i0n0g0.
plit-ting

ting
. t i n g .
. t i2
 2t1i n
. t i n g4
.0t1i2n0g0.
t-ing

ing
. i n g .
. i n1
.0i0n1g0.
in-g
I'm not sure whether I should classify groff's behaviour of restarting
the hyphenation process as a feature or a bug (I tend to the latter).
However, I don't have time to work on that.
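The restart behaviour described above can be modelled with a short Python sketch.  It is a hypothetical model of what groff appears to do, not its actual implementation: take the first break point, emit that piece, then hyphenate the remainder as if it were a new word.  The pattern list contains only the patterns that show up in the four traces, and all names are invented for illustration.

```python
import re

# The patterns appearing in the traces for `splitting', `plitting',
# `ting', and `ing'; this is not the full hyphen.us file.
PATTERNS = ["1p2l2", "l1it", "4t3t2", "4itt", "2t1in",
            ".ti2", ".ting4", ".in1"]


def break_points(word, patterns):
    """Return the permitted break indices, one character minimum per side."""
    dotted = "." + word + "."
    levels = [0] * (len(dotted) + 1)   # levels[k]: value before dotted[k]
    for pattern in patterns:
        letters = re.sub(r"\d", "", pattern)
        values = [0] * (len(letters) + 1)
        pos = 0
        for ch in pattern:
            if ch.isdigit():
                values[pos] = int(ch)
            else:
                pos += 1
        start = dotted.find(letters)
        while start != -1:
            for offset, value in enumerate(values):
                levels[start + offset] = max(levels[start + offset], value)
            start = dotted.find(letters, start + 1)
    return [i for i in range(1, len(word)) if levels[i + 1] % 2 == 1]


def break_with_restart(word, patterns):
    """Assumed model: break at the first point, then treat the rest of
    the word as a fresh word and search for break points again."""
    lines = []
    rest = word
    while True:
        points = break_points(rest, patterns)
        if not points:
            lines.append(rest)
            return lines
        lines.append(rest[:points[0]] + "-")
        rest = rest[points[0]:]


print(break_with_restart("splitting", PATTERNS))
# ['s-', 'plit-', 't-', 'in-', 'g']
```

Because each remainder is matched against the word-start patterns again (`.ti2', `.ting4', `.in1'), break points appear that the patterns never yield for the complete word.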
> The cases '16' and '32' (for .hy) may not add hyphenation points,
> just allow already found ones, if otherwise forbidden.
Nice idea, but impossible to implement without meta-knowledge. As
mentioned above, the hyphenation patterns are constructed with certain
\lefthyphenmin and \righthyphenmin values. However, those values are
*not* present in the hyphenation patterns – you have to know them (I
consider this a design bug in TeX).  In other words, only the user
knows whether values 16 or 32 are valid for a given language's
hyphenation patterns.
> [...] So the algorithm has to be fixed ...
Definitely not.
> ... and tested with ".hy 1" (the current stable version) and with
> ".hy 48" (development) to see if it works correctly according to the
> used hyphenation pattern file.
The algorithm works as expected, there is nothing to fix. Barring
still hidden bugs, the problem *is* fixed. It probably doesn't meet
your expectations, though :-)
Werner