lynx-dev

Re: LYNX-DEV charset issues (was: hodge-podge, updates)


From: Drazen Kacar
Subject: Re: LYNX-DEV charset issues (was: hodge-podge, updates)
Date: Tue, 12 Nov 1996 11:02:12 +0100 (MET)

Klaus Weide wrote:
> On Tue, 12 Nov 1996, Drazen Kacar wrote:
> 
> > Klaus Weide wrote:
> > 
> > > We don't have to accept that for a fact just yet.
> > > (Except that Lynx will always be limited by what characters terminals
> > > and emulators provide.)
> > 
> > Yes, but users don't usually have a need for 100 (8-bit) code pages. Only
> > two ISO 8859-x pages can represent a very wide range of languages.
> 
> True.  As long as all those languages are based on the Latin alphabet...
> Heh, we could probably fit all the displayable characters from 
> Latin 1+2 in one combined code page, and distribute that with Lynx!

Without © and &tm;? :)
More seriously, ISO 8859-5 is Cyrillic. There are languages which are natively
written in Cyrillic, but have a 1:1 mapping to the Latin alphabet. 1:1
means you'll have the same number of letters, not necessarily characters.

> But that wouldn't help in the general case where we cannot assume that
> the user can do anything about his/her code page.  (At least I think
> that is the general case...)  It would work for Linux, as long as we
> assume the user is using the console in 80x25 mode.

Depends on hardware and sysop. There is no way to install LC_CTYPE files
if you're not root. In countries where ASCII is not enough, terminals
usually have a way to represent national characters. Hoping that there
is a standard way is too much, though...

> > Unix
> > lacks a few things here. For example, if the terminal can switch code
> > pages, there is no termcap/terminfo capability to indicate this. There
> > are no standardized LC_CTYPE names, no mapping between IANA registered
> > charset names and LC_CTYPE files, no curses (or ncurses) functions for
> > approximation of one code page with another...
> 
> I assume the most common situation will remain, for a while, where Lynx
> cannot make any assumption about extended capabilities of the terminal/
> emulator, but has to be able to give a reasonable representation under
> limited conditions.  

Termcap registration should not be a problem. I intend to do it if I can
ensure funding for the internationalization project. And once the codes
are registered, any program can make use of them. Waiting for suitable
routines in curses packages might take a very long time, so those
things should be in the Lynx source.

> > Not good enough. Lynx can currently approximate Latin 1 characters with any
> > local terminal definition, but the reverse is impossible. Unicode support
> > should be able to transpose to any of the local code pages.
> 
> (I suppose you mean "to and from", between any two encodings it knows.)

Yup.

> Yes, it should.  But how?  The question (I have) is actually not how
> to map characters from one encoding to another (I have code for that
> in my prototype) but rather where and when.  There has to be a model
> for which charset Lynx expects at which stage of the processing,
> and how to specify it.  (At least that's what I am thinking).

There is a problem with a charset specified in a META tag. You'll be working
with Latin 1, and then, suddenly, it's something else. If that META is the
first one in HEAD, everything's fine, but if TITLE or LINKs come before...
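
One way around the ordering problem is to look for the charset before
interpreting any of the HEAD. The following is a minimal sketch, assuming
the first part of the document is already in a buffer. The function name
sniff_meta_charset is hypothetical and is not taken from the Lynx sources;
it only does a case-insensitive search for "charset=" and does not parse
HTML properly.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a Lynx function: look for "charset=" in the
 * first `len` bytes of `buf` and copy the name that follows into `out`.
 * Returns 1 if a charset name was found, 0 otherwise. */
int sniff_meta_charset(const char *buf, size_t len, char *out, size_t outsz)
{
    static const char key[] = "charset=";
    const size_t klen = sizeof key - 1;
    size_t i, j, n;

    assert(buf != NULL && out != NULL);
    if (outsz == 0)
        return 0;
    for (i = 0; i + klen <= len; i++) {
        /* Case-insensitive comparison against the key at position i. */
        for (j = 0; j < klen; j++)
            if (tolower((unsigned char)buf[i + j]) != key[j])
                break;
        if (j < klen)
            continue;
        /* Key found: skip an optional quote, then copy the name token. */
        i += klen;
        if (i < len && (buf[i] == '"' || buf[i] == '\''))
            i++;
        n = 0;
        while (i < len && n + 1 < outsz &&
               (isalnum((unsigned char)buf[i]) || buf[i] == '-' ||
                buf[i] == '_' || buf[i] == '.' || buf[i] == ':'))
            out[n++] = buf[i++];
        out[n] = '\0';
        return n > 0;
    }
    return 0;
}
```

If the scan finds a name, TITLE and LINK text seen earlier in HEAD can be
decoded with the right charset from the start instead of being
reinterpreted later.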

> In a way, it would be easiest to follow the model from the HTML i18n
> draft and have just *one* charset during the SGML/HTML processing,
> and convert everything to/from it before and after that.  UTF-8 (RFC 2044)
> could be used internally.

It would, except for the performance issues...
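
To make the cost concrete, here is a minimal sketch of the input side of
such a model for the simplest case, ISO 8859-1 to UTF-8. The function name
latin1_to_utf8 is an assumption made for illustration; it is not an
existing Lynx routine. Every byte is touched once, and each non-ASCII byte
grows to two bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the Lynx sources): convert `inlen` bytes
 * of ISO 8859-1 text to UTF-8. Returns the number of bytes written to
 * `out`, or 0 if `outsz` is too small (the worst case is 2 * inlen). */
size_t latin1_to_utf8(const unsigned char *in, size_t inlen,
                      unsigned char *out, size_t outsz)
{
    size_t i, n = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < inlen; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            /* ASCII is unchanged in UTF-8. */
            if (n + 1 > outsz)
                return 0;
            out[n++] = c;
        } else {
            /* U+0080..U+00FF take two bytes: 110xxxxx 10xxxxxx. */
            if (n + 2 > outsz)
                return 0;
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return n;
}
```

Other 8-bit charsets would need a table from byte to Unicode value first,
and the output side needs the reverse mapping, so this shows only the
cheapest part of the conversion.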

> That would be a drastic change that I cannot
> and do not wish to make alone..  How would this interfere with the CJK
> charset processing (which probably has to be left as it is)?  

I don't know what exactly CJK processing does. I can look at the sources,
but it would be nice if someone could tell me what those routines are
supposed to do.

> Should any new methods for translating characters supersede the current
> tables in LYCharSets.c, or somehow incorporate them?

Supersede and incorporate. :)

> > > Is it a reasonable overhead to call a translation function for each
> > > (maybe just non-ASCII) character?
> > 
> > Can it be a table look-up? A function call will be incredibly slow.
> 
> That was my first (maybe just instinctive:)) reaction.  But the code
> already (for HTML text) is going through several character-by-character
> function calls, so there is not that much _additional_ overhead..

Not much additional, I agree. But did you ever run Lynx through a profiler?
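
For 8-bit to 8-bit translation the look-up alternative is small. The sketch
below is an illustration only: the names xlat, xlat_init and xlat_buf are
hypothetical, and the handful of Latin 1 to ASCII approximations stand in
for a full table such as the ones in LYCharSets.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 256-entry table: index is the input byte, value is the
 * byte to emit on the terminal. */
static unsigned char xlat[256];

/* Fill the table once: ASCII maps to itself, everything else to '?',
 * then a few sample Latin 1 letters get a 7-bit approximation. */
void xlat_init(void)
{
    int i;

    for (i = 0; i < 256; i++)
        xlat[i] = (unsigned char)(i < 0x80 ? i : '?');
    xlat[0xC9] = 'E';   /* E acute */
    xlat[0xE4] = 'a';   /* a diaeresis */
    xlat[0xE7] = 'c';   /* c cedilla */
    xlat[0xE9] = 'e';   /* e acute */
    xlat[0xF1] = 'n';   /* n tilde */
    xlat[0xF6] = 'o';   /* o diaeresis */
    xlat[0xFC] = 'u';   /* u diaeresis */
}

/* Translate a buffer in place: one array index per character and no
 * function call inside the loop. */
void xlat_buf(unsigned char *s, size_t n)
{
    assert(s != NULL || n == 0);
    while (n--) {
        *s = xlat[*s];
        s++;
    }
}
```

Whether this is measurably faster than a per-character function call in
Lynx is exactly what a profiler run would have to show.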

> A straightforward table lookup is not possible, since I don't think
> we want to keep several 64kbyte tables around in memory.

Who says that the table should be in memory all the time? There is the mmap
call, there are shared libraries, and MMU chips can do wonders. Tables can be
made smaller depending on what the user wants to have. Reducing tables is an
interesting problem, and the solution will probably rule out internal
use of only one charset.
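
One common way to shrink a 64 kbyte table is to split it into 256 pages of
256 entries and let all pages without any mapping share a single empty
page. The sketch below is one possible reading of "reducing tables", not a
design from this thread. All names are hypothetical, and only a few
Unicode to ISO 8859-2 entries are filled in as samples.

```c
#include <assert.h>
#include <stddef.h>

#define UNMAPPED 0  /* returned when the target charset has no equivalent */

static unsigned char page00[256];           /* U+0000..U+00FF */
static unsigned char page01[256];           /* U+0100..U+01FF */
static const unsigned char empty_page[256]; /* all zero, shared */
static const unsigned char *pages[256];     /* indexed by high byte */

/* Build a partial Unicode -> ISO 8859-2 mapping (sample entries only). */
void uni2l2_init(void)
{
    int i;

    for (i = 0; i < 256; i++)
        pages[i] = empty_page;
    for (i = 1; i < 0x80; i++)
        page00[i] = (unsigned char)i;   /* ASCII maps to itself */
    page00[0xE9] = 0xE9;                /* U+00E9 e acute */
    page01[0x07] = 0xE6;                /* U+0107 c acute */
    page01[0x0C] = 0xC8;                /* U+010C C caron */
    page01[0x0D] = 0xE8;                /* U+010D c caron */
    page01[0x11] = 0xF0;                /* U+0111 d stroke */
    page01[0x60] = 0xA9;                /* U+0160 S caron */
    page01[0x61] = 0xB9;                /* U+0161 s caron */
    page01[0x7D] = 0xAE;                /* U+017D Z caron */
    page01[0x7E] = 0xBE;                /* U+017E z caron */
    pages[0x00] = page00;
    pages[0x01] = page01;
}

/* Two array indexes per character; UNMAPPED if there is no equivalent. */
unsigned char uni2l2(unsigned int ucs)
{
    assert(pages[0] != NULL);   /* uni2l2_init() must have run */
    if (ucs > 0xFFFF)
        return UNMAPPED;
    return pages[ucs >> 8][ucs & 0xFF];
}
```

Here three 256-byte pages plus 256 pointers replace one 64 kbyte array, and
only the pages a user's languages need have to be loaded or mapped.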

I was told that the newest release of Digital UNIX comes (or will come)
with full Unicode support. We should probably take a look at the
implementation.

-- 
Life is a sexually transmitted disease.
