Re: lynx-dev hyphenation (was tech. question: translating strings)
From: Vlad Harchev
Subject: Re: lynx-dev hyphenation (was tech. question: translating strings)
Date: Tue, 7 Sep 1999 16:18:20 +0500 (SAMST)
On Mon, 6 Sep 1999, Klaus Weide wrote:
> [ last part of a series of replies ]
> On Sun, 5 Sep 1999, Vlad Harchev wrote:
>
> > I plan to add better support for hyphenation to lynx than it currently has
> > :).
>
> It's an open question whether incomplete support for hyphenation, that
> relies on a specific display character setting or it will hyphenate
> wrong, is better than no hyphenation at all.
>
> Hyphenation by lynx isn't exactly a feature that many people are
> missing. That's my impression so far, based on interest expressed on
> the list in response to your ideas (iirc, basically none or negative.)
> I certainly don't miss it.
>
Yes, there was no response from others.
> By the way (not that you said anything else) Lynx _does_ have "support
> for hyphenation" already. Support for author-provided hyphenation that
> is, in the form of &shy; or equivalent.
Of course, but it seems incomparable to mine in visual results and
flexibility :)
>[...]
> The problem is that fixing it later will be more difficult than
> implementing it in a general way (more general than your immediate
> needs) from the start. Every patch added to lynx for some extra
> feature that makes some assumptions binds the hands of other lynx
> developers more, for changing the way things work later (especially
> if the patches look like what you did to SGML.c).
Sorry for SGML.c, but it seems there is no better way to do this (shortening
macro names and writing more generic macros is IMO all that can be done).
> For example, and specifically relevant to the topic of hyphenation,
> HText_append* currently gets its input fed in the current_char_set
> (i.e. already translated to the d.c.s.). That need not remain so, in
> fact it would be better IMO, for several reasons, to eventually feed
> characters to the HText object in a 'standard' form (probably UTF-8).
Wow, you plan to make it even slower :) But what for? And please finish
what you started (or at least post it here).
> Translation to the d.c.s. would then occur in GridText.c. The four
> UCStages kept in the HTParentAnchor object are already designed to
> account for this variation of procedure. Now if you add hyphenation
> at the HText_append* level making the assumption that (charset of the
> character stream)==d.c.s., and start writing code around this
> assumption (including configuration, messages, documentation),
> changing the assumption cannot be done without breaking your stuff.
>
> Actually I feel I should start making those changes _now_, before
> patches from you get added that make unwise assumptions.
It's up to you, but what for, again? IMO lynx is still missing a lot of
user-level features, and you plan to do some internal redesign; any user
will notice that lynx has become slower (or won't notice if he/she has a good
CPU). Also, say, Russian users will notice that lynx uses much more memory,
since Russian text will be stored as UTF-8.
And IMO the d.c.s. is a rather permanent setting (it isn't changed very
often), so why translate some c.s. to UTF-8 and then to another c.s. when
displaying? I don't see the goal of keeping HText in UTF-8. What is it,
again?
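To put a number on the memory point, here is a small sketch comparing the same Russian string in a single-byte charset and in UTF-8. The sample string and the choice of KOI8-R are assumptions for illustration only; this says nothing about how much memory lynx itself would actually use.

```python
# Illustrative only: byte counts for the same Russian text in a
# single-byte charset (KOI8-R, chosen as an example) and in UTF-8.
text = "Привет, мир"  # arbitrary sample: 9 Cyrillic letters, a comma, a space

koi8_len = len(text.encode("koi8_r"))  # one byte per character
utf8_len = len(text.encode("utf-8"))   # two bytes per Cyrillic letter

print(koi8_len, utf8_len)
```

Cyrillic letters take two bytes each in UTF-8, so mostly-Cyrillic text roughly doubles in size, while ASCII punctuation and spaces stay at one byte.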
>[...]
> That argument is completely based on the assumption that hyphenation
> gets applied _after_ translation to the d.c.s. Which is exactly what
> I am trying to tell you not to assume. Of course the hyphenation should
> be applied to the "real characters" (which would be Cyrillic characters
> in this case), not to their ASCII replacement representation!
I don't see any advantages in this (except for the problems with the 2 words
you called "useful").
> And that is one good reason why translation to the d.c.s. should be
> deferred to a later stage, i.e. it should be done as late as possible
> (GridText.c instead of SGML.c) so that various pieces of code that look
> at the data stream can assume it is in a standard encoding.
Better to have it in wide characters rather than in UTF-8 then. But I don't
really see any use for it (it would be useful for a generalized 'isalpha()',
'tolower()', etc., but IMO these are used only when searching for strings).
>[...]
> > > You're using linux. Give --enable-font-switch a try!
> >
> > I found it unstable (or that version of the kernel console driver was
> > unreliable), and I don't know any languages except English and Russian,
> > which can be displayed at the same time without changing the d.c.s.
>
> It depends on what kbd font files you have installed. It works only
> for some fonts, and just doesn't do anything if you switch to a d.c.s.
> it doesn't know about. So in that case you have to do the font loading
> or other manipulation still externally - or if you can't, you shouldn't
> have selected that d.c.s in the first place. Still it works well enough
> for me in various situations (I know the limitations). If it does not
> work right for you in a situation where you think it should, report a
> bug. (I have some changes to UCAuto.c that should help.)
>
I had the following problems:
When exiting from lynx, something went wrong with the console driver: each
letter was doubled in height (i.e. each letter occupied 2 rows). When I invoked
'reset', the height of each letter returned to 1 row, but only the upper half
of the display was used, while the lower half kept changing with some strange
stuff. I had to reboot linux to fix this (I didn't try to set the console
dimensions to match the real ones). And I have no reason to change fonts:
Russian, pseudographics and ASCII symbols fit in one font.
> > I plan to detect d.c.s changes to recalculate lookup tables, so no
> > translation will be necessary. Will you use hyphenation?
>
> No, as far as I know now. If you make it easy enough to apply the
> necessary extra files, I will probably test it out of curiosity.
> But I don't need it, don't really want it. Why should I, lynx's text
> display generally looks fine (or at least if it doesn't it's not the
> fault of missing hyphenation, but mostly the fault of HTML (ab)use by
> authors that has nothing to do with hyphenation).
IMO it's better used with hyphenation: then lynx's output is visually very
attractive. And IMO it will help to produce better rendering of tables (to be
implemented).
> > If not, I recommend
> > to compile it in - with and and justification, lynx becomes a very good
> > html->txt translator (we have stylesheets implementation pending for more
> > flexibility), --with-backspaces complements this. At least I'll inform Linux
> > Documentation Project coordinator about the lynx capabilities (they are
> > using some stupid programs to translate sgml -> txt with backspaces).
>
> Thanks, but I already have man and groff and various other text
> processing tools (most of them unused). Yeah, those LDP people are
> probably stupid enough to use SGML tools for an SGML job, instead of
> a text HTML browser, how could they?
I meant that that program doesn't do justification (or hyphenation, of
course). It's probably a perl script, I don't remember. So the produced files
look very ugly.
> > Lynx takes as much memory as NS does. (After 5 hours of browsing, single
> > instance takes 35 Mb of virtual memory - due to terrific memory
> > fragmentation).
>
> Time for you to compile with --enable-find-leaks then. You should do
> that anyway after making significant changes, or any changes that
> use malloc etc. unless you are very sure you have not introduced memory
> leaks.
When loading a 900Kb file as the main page, with and without source_cache, the
VSS is 29Mb. (An lss-disabled lynx's VSS is 4Mb on this file.)
Yes, this is probably due to leaks (I tried an lss-disabled lynx on that file
for the first time). Style changes can't take that much IMO.
>[...]
> So you have to incorporate it from somewhere. You might as well use
> the universal source then, instead of requiring each hyphenation file
> provider to redo the work.
IMO it's easier for a provider to type
Aa Bb Cc Dd Ee <etc>, rather than to find out the Unicode values for each of
the characters and to write special awk and perl scripts to translate the
TeX hyrules file.
And anyway, the upper->lower and 'isalpha' mappings would have to be provided
somehow in the Unicode case.
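A minimal sketch of how such an "Aa Bb Cc" line could yield both mappings at once. The function names and the exact line format (whitespace-separated upper/lower pairs) are assumptions for illustration, not the format of any actual lynx or libhnj file.

```python
def parse_case_pairs(line):
    """Parse 'Aa Bb Cc ...' into (upper->lower map, set of alphabetic chars).

    Hypothetical format: whitespace-separated two-character pairs,
    uppercase letter first, lowercase letter second.
    """
    to_lower = {}
    alpha = set()
    for pair in line.split():
        if len(pair) != 2:
            raise ValueError(f"expected an upper/lower pair, got {pair!r}")
        upper, lower = pair
        to_lower[upper] = lower
        alpha.update((upper, lower))
    return to_lower, alpha


def lower_word(word, to_lower):
    """Lowercase a word using only the provided mapping."""
    return "".join(to_lower.get(ch, ch) for ch in word)


to_lower, alpha = parse_case_pairs("Aa Bb Cc Dd Ee")
print(lower_word("BAD", to_lower))  # uses only the pairs given above
```

The set of all characters mentioned in the pairs doubles as the 'isalpha' test, so the provider supplies one line instead of two tables.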
> > The
> > thing that will be left to do is to write utf8 character gathering (in case
> > of utf8 d.c.s), converting it to lowercase and then to hyrules charset.
>
> I don't understand the details of what you're saying here. Just
> the notion of having a "hyrules charset" seems wrong (unless that's
> a character encoding scheme that provides for all possible characters,
> you know what I mean...)
"gathering" means calculating the unicode character code (ie 32 bit value
from a multibyte UTF-8-encoded character).
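As a sketch of what this "gathering" step involves, the function below decodes one UTF-8 sequence into its code point by hand. The name gather_utf8 is made up for this example, and the sketch deliberately skips checks a real decoder needs (overlong forms, surrogates, values above U+10FFFF).

```python
def gather_utf8(data, pos=0):
    """Decode one UTF-8 sequence starting at data[pos].

    Returns (code_point, next_pos). Simplified: no overlong-form,
    surrogate, or range checks.
    """
    first = data[pos]
    if first < 0x80:                 # 0xxxxxxx: plain ASCII
        return first, pos + 1
    if first >> 5 == 0b110:          # 110xxxxx: 2-byte sequence
        length, code = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # 1110xxxx: 3-byte sequence
        length, code = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # 11110xxx: 4-byte sequence
        length, code = 4, first & 0x07
    else:
        raise ValueError("invalid UTF-8 lead byte")
    if pos + length > len(data):
        raise ValueError("truncated UTF-8 sequence")
    for byte in data[pos + 1:pos + length]:
        if byte >> 6 != 0b10:        # continuation bytes are 10xxxxxx
            raise ValueError("invalid UTF-8 continuation byte")
        code = (code << 6) | (byte & 0x3F)
    return code, pos + length


code, next_pos = gather_utf8("Ё".encode("utf-8"))
print(hex(code), next_pos)
```

Once the code point is known, lowercasing and mapping into the hyrules charset can proceed per character rather than per byte.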
>
> > I don't have time to implement complete thing (hacking libhnj will be
> > necessary, shipping unicode tables will be required ...)
> > Anyway, I'll try to help people to solve their problems with hyphenation.
> > English-speaking-or-reading-only people won't have any problems.
>
> I never believe claims that such-and-such people will not have any
> problems.
But it seems my statement is correct.
> > Though people
> > that use documents with several (say) latin-1 encoded languages will be
> > unable to use hyphenation at all (since hydict for only one of those
> > languages can be loaded due to the fact that chsets are not disjoint), so
> > they'll get incorrect hyphenation for words in other languages. To solve
> > this problem, <span lang=x> must be used (it's hard to convince german
> > writer to surround "debian" with <span lang=en></span>, though such words
> > can be added to the hyphenation exceptions. My experience can tell that
> > collisions will be unlikely, since hyphenation patterns are built by
> > scanning a bunch of native-language documents, so probably "debian" and
> > other english words won't be hyphenated at all with german hyrules).
>
> You haven't looked at really multilingual texts, with more than a few
> single words from a different than the "main" language. Such texts
> are rare. But lynx should support them, at least not mess them up,
> when they do occur. Authors of such pages will use LANG attributes if
> they care about correct handling, since that is the HTML way of doing
> it. If they don't care, there isn't much lynx can do about it, except
> allowing the user to switch between several assumptions. For documents
> where the author did care: even if hyphenation can be done only for one
> language "at a time" (where "at a time" could mean for one document),
> the hyphenation algorithm should at least be turned off in <SPAN
> LANG=fr> text portions where the specified language differs from that
> of the hyphenation rules (like this one)</SPAN>.
I plan to support the "lang" attribute.
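A rough sketch of the behaviour asked for above: track the effective language while parsing and mark text as hyphenatable only where it matches the language of the loaded rules. The class name, the exact-match comparison (so "de-DE" would not match "de"), and the assumption that the document's default language equals the rules' language are all simplifications made up for this example; it uses Python's html.parser rather than anything from lynx.

```python
from html.parser import HTMLParser


class LangTracker(HTMLParser):
    """Collect (text, hyphenate?) chunks based on inherited lang attributes."""

    def __init__(self, rules_lang):
        super().__init__()
        self.rules_lang = rules_lang.lower()
        # Stack of (tag, effective language). Assumption: text outside any
        # lang attribute is in the same language as the hyphenation rules.
        self.stack = [(None, self.rules_lang)]
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Inherit the enclosing language unless this tag sets its own.
        lang = dict(attrs).get("lang") or self.stack[-1][1]
        self.stack.append((tag, lang.lower()))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.chunks.append((data, self.stack[-1][1] == self.rules_lang))


tracker = LangTracker("de")
tracker.feed('Hallo <span lang="en">debian</span> Welt')
tracker.close()
print(tracker.chunks)
```

With German rules loaded, only the text outside the English span is flagged for hyphenation; the span's content is passed through untouched.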
> > And IMO, as long as UTF8 is not widely used _in_documents_ (not on
> > terminals), the problem with documents mixing several, say, latin-1
> > encoded languages will remain.
>
> What does UTF-8 in documents have to do with mixing several languages
> that use the same repertoire in one document? Nothing as far as I
> can tell. UTF-8 is just a transmission format. And its slow rate
> of adoption in the outside world has not kept lynx from using it
> internally.
I'm glad that you understand that UTF-8 (and UCS*) doesn't have anything to do
with "mixing several languages that use the same repertoire in one document"
(I thought you thought that this was a solution). The 'lang=' attribute is for
solving this. Why do you push "unicode" everywhere?
> Be ready for the future. Lynx has been for years, in some respects.
> Maybe the world will catch up sometime.
>
> > > And in practice German is rarely written in Cyrillic letters, so it
> > > doesn't make sense to include e.g. Cyrillic letter patterns in the set
> > > for German.
> >
> > As I said, the hyrules for these particular languages can be concatenated
> > to get hyrules for Cyrillic and German - they have disjoint sets of
> > character codes.
>
> Merely an accident (as said elsewhere), and does it really work in your
> approach unless you have a display character set with both LATIN
> CAPITAL LETTER A WITH DIAERESIS and CYRILLIC CAPITAL LETTER IO?
I assume you mean these letters have equal character codes in the d.c.s.
If I were encountering such documents, I'd compose or choose another font,
so that these 2 chars would have different character codes in that
d.c.s. Or (another, loser's solution) use hyrules for only one of the
languages. Or (good hacker's solution) fix the code that deals with
hyphenation (i.e. my code). Or (TeX user's solution) don't do anything, since
Russian and German hyrules won't collide: in order for them to collide, at
least 2 characters of either of the languages must be present (frankly
speaking, this is incorrect, since one Russian hydictionary uses patterns
with 1 letter, but there are other dictionaries).
So, I rely on the 'good hackers'.
As you saw, with the lynx.cfg setting I plan to introduce, with domain name
matching and file matching, the problem can be solved for www-based docs if
one language dominates over the others.
But let's talk about how complete and flexible your TRST is :)
In general, neither you nor I have time to cover all cases.
> > >[...]
> >
> > So, I'll add support for any d.c.s other than utf8 and like, provided
> > chset of hyrules is not utf8 too.
>
> I don't exactly understand the meat of this promise, too many "other
> than" and "like" and "provide".
"and like" means CJK texts (hyphenation doesn't make sense for J, but for C
and K I don't know). As for utf8-encoded hyrules: the hyphenation simply
won't work, or the dictionary won't be loaded by libhnj. In other words, each
single byte in hyrules denotes a single "human letter", and each single byte
in the d.c.s. denotes a single "human letter" (and not part of a letter), to
make direct table-driven translation possible.
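A minimal sketch of such a table-driven translation between two single-byte charsets. The function name, the "?" fallback byte, and the KOI8-R/CP1251 pair are assumptions picked for illustration; the table would be rebuilt whenever the d.c.s. changes.

```python
def build_table(src_charset, dst_charset, fallback=b"?"):
    """Build a 256-entry byte-to-byte table from src_charset to dst_charset.

    Bytes with no single-byte equivalent in dst_charset map to `fallback`
    (an arbitrary choice for this sketch).
    """
    table = bytearray(256)
    for byte in range(256):
        try:
            out = bytes([byte]).decode(src_charset).encode(dst_charset)
        except UnicodeError:
            out = fallback
        table[byte] = out[0] if len(out) == 1 else fallback[0]
    return bytes(table)


# Example: d.c.s. is KOI8-R, the hyphenation rules are in CP1251.
table = build_table("koi8_r", "cp1251")
word_koi8 = "слово".encode("koi8_r")
word_cp1251 = word_koi8.translate(table)  # one table lookup per byte
print(word_cp1251 == "слово".encode("cp1251"))
```

This only works because one byte is one letter on both sides; with a multibyte encoding on either end, a flat 256-entry table is no longer enough.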
> > As I remember, you have to post some patch to lynx too :)
>
> Yea yea. You're just keeping me from it.:)
Yes, our chats have become too long. Please, more conclusions next time.
And I ask you to finish TRST: add support for lss (at least write the code,
even without extensive testing).
> Klaus
>
To lynx-dev: sorry for the huge message.
Best regards,
-Vlad