Re: The hyphenation algorithm produces wrong results
From: Werner LEMBERG
Subject: Re: The hyphenation algorithm produces wrong results
Date: Sun, 18 Mar 2018 21:12:01 +0100 (CET)
>> > .ll 1n
>> > .hy 48
>>
>> You *must not* use such values if the patterns don't allow it!
>> From groff.texi:
>>
> I may. That is what testing is about. And I must, otherwise the
> testing is insufficient.
Well, yes, for testing. And you have to expect incorrect results.
>> instead of the correct `split-ting'. US-English patterns as
>> distributed with groff need two characters at the beginning and
>> three characters at the end; this means that value 4 of
>> @code{hy} is mandatory. Value 8 is possible as an
>> additional restriction, but values 1 (the default!), 16,
>> and 32 should be avoided.
>>
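For readers who want to see what a value like `.hy 48' selects, here is a
small Python sketch (not groff code) that splits a value into its additive
flags. The flag meanings are an assumption: they are listed as I recall them
from groff.texi for groff 1.22.4, so check them against the manual of your
installed version. `HY_FLAGS' and `decode_hy' are hypothetical names.

```python
# Additive .hy flags; meanings are an ASSUMPTION based on groff.texi
# (groff 1.22.4) and should be checked against the installed manual.
HY_FLAGS = {
    1: "hyphenate without restrictions (the default)",
    2: "do not hyphenate the last word on a page or column",
    4: "do not hyphenate the last two characters of a word",
    8: "do not hyphenate the first two characters of a word",
    16: "allow hyphenation before the last character of a word",
    32: "allow hyphenation after the first character of a word",
}


def decode_hy(value):
    """Return the flag bits that are set in a .hy value, in ascending order."""
    return [bit for bit in sorted(HY_FLAGS) if value & bit]


if __name__ == "__main__":
    for bit in decode_hy(48):
        print(bit, HY_FLAGS[bit])
```

Under these assumed meanings, 48 = 16 + 32 turns on exactly the two flags
that loosen the limits at the word boundaries.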
>
> The pattern file has patterns of type '[135]xy.'. The stated
> number (three at the end) could thus include the 'period' (.) at the
> end of the pattern. But it looks to me that the pattern file was
> created with the rightmarginlimit = 2.
I have never created English patterns, only German ones, so I have to
rely on the information given elsewhere (`plain.tex', the US English
module of the Babel package), whatever it `looks to you'.
> There are many hyphenation points in dictionaries that split two
> letters at the beginning; also at the end.
>
> If one wants to use these patterns, hy=4 and hy=8 are not to be
> used!
Are we still talking about US English hyphenation patterns? Simply
use German patterns if you want something to play with
lefthyphenmin=righthyphenmin=2!
If you want something with lefthyphenmin=righthyphenmin=1 you should
use Greek patterns...
For US English patterns, however, since those patterns have been taken
from TeX, we also have to use the values given there, namely
lefthyphenmin=2, righthyphenmin=3.
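To illustrate what the two limits do, here is a minimal Python sketch (not
how groff or TeX implement it). A break position is the number of letters
that end up before the hyphen; `filter_breaks' is a hypothetical helper, and
the candidate positions for `splitting' are made up for the example.

```python
def filter_breaks(word, candidates, lefthyphenmin=2, righthyphenmin=3):
    """Keep only the break positions that leave at least `lefthyphenmin`
    letters before the hyphen and `righthyphenmin` letters after it."""
    return [
        pos
        for pos in sorted(candidates)
        if pos >= lefthyphenmin and len(word) - pos >= righthyphenmin
    ]


if __name__ == "__main__":
    word = "splitting"
    candidates = [5, 7]  # hypothetical: split-ting and splitti-ng
    print(filter_breaks(word, candidates))        # limits 2 and 3
    print(filter_breaks(word, candidates, 1, 1))  # limits 1 and 1
```

With limits 2 and 3 only `split-ting' survives; with 1 and 1 the break that
leaves just `ng' on the next line is accepted as well.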
> The wrong case of hyphenation can easily be corrected by creating
> a file which bans such cases:
>
> a file with lines (or one file for each type) which match each of
> the following regular expressions:
>
> 1) .[a-z]4
>
> 2) 4[a-z].
I don't understand what you want to demonstrate.
> Donald E. Knuth hard-coded the "search limits" to 2 and 3 in his
> TeX software 30 years ago.
No, he did not. You can change those values at will. Please repeat
with me: For US English, lefthyphenmin=2 and righthyphenmin=3 BECAUSE
THE PATTERNS HAVE BEEN CREATED FOR THOSE VALUES.
> The used "hyphen.us" file in "groff" is simply too old.
Not at all. It's the current version of the US English patterns AFAIK.
> It should contain these "pattern matching limits" so that the
> algorithm knows where to begin and where to end.
Yes, lefthyphenmin and righthyphenmin are not explicitly stated in the
pattern file. The TeX hyphenation project is going to change this by
introducing YAML headers to all pattern files.
> And there should be more than just one hyphenation file for some
> languages, like one without restrictions and one or two with
> different restrictions, if that makes sense. People's preferences are
> different and writers of software should provide choices for their
> users.
Indeed. Being the maintainer of the German set of hyphenation
patterns (new orthography, old orthography, old Swiss orthography) I
know this too well.
>> The algorithm works as expected, there is nothing to fix. Barring
>> still hidden bugs, the problem *is* fixed.
>
> Works as you expected, but not as I expected.
Then go forth and create a bjarff program.
> The algorithm goes too far, as it does not know what kind of string
> it is dealing with,
>
> a) one word
>
> b) part of a word, and thus tries to find a pattern for that part,
> which is already an "invariable string", which is not to be
> hyphenated, but output unchanged to the line.
I was too quick to classify this behaviour of groff as a bug. Adding
more invalid hyphenation points only happens if you use an incorrect
(i.e., too small) value for lefthyphenmin and/or righthyphenmin.
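
A rough Python sketch of the Liang-style pattern mechanism shows why this
happens. It is not groff's implementation, and the two patterns are invented
for the example; they are NOT taken from hyphen.us. Digits in a pattern score
the gaps between letters, the highest score per gap wins, and odd scores
permit a break. A pattern set built for limits of 2 and 3 has no need to
inhibit breaks outside that range, so such breaks show up once the limits are
lowered.

```python
def raw_break_points(word, patterns):
    """Apply Liang-style patterns to `word` and return all permitted break
    positions (number of letters before the hyphen), ignoring any limits."""
    w = "." + word + "."            # dots mark the word boundaries
    scores = [0] * (len(w) + 1)     # scores[j] rates the gap before w[j]
    for pat in patterns:
        letters = "".join(ch for ch in pat if not ch.isdigit())
        digits = [0] * (len(letters) + 1)
        i = 0
        for ch in pat:
            if ch.isdigit():
                digits[i] = int(ch)
            else:
                i += 1
        start = w.find(letters)
        while start != -1:
            for k, d in enumerate(digits):
                scores[start + k] = max(scores[start + k], d)
            start = w.find(letters, start + 1)
    # Gaps strictly inside the word; an odd score permits a break.
    return [j - 1 for j in range(2, len(w) - 1) if scores[j] % 2 == 1]


if __name__ == "__main__":
    patterns = ["t1t", "i1n"]       # hypothetical patterns, for illustration
    word = "splitting"
    raw = raw_break_points(word, patterns)
    limited = [p for p in raw if p >= 2 and len(word) - p >= 3]
    print(raw)      # all points the patterns permit
    print(limited)  # points left with lefthyphenmin=2, righthyphenmin=3
```

With these made-up patterns the raw result contains both `split-ting' and
`splitti-ng'; the second one disappears as soon as the limits 2 and 3 are
applied. An even-valued pattern such as `4ng.' would inhibit it too, but
patterns generated for a right limit of 3 have no reason to contain it.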
Werner