lynx-dev
Re: lynx-dev reading sjis docs [was Re: lynxcgi problem]


From: Klaus Weide
Subject: Re: lynx-dev reading sjis docs [was Re: lynxcgi problem]
Date: Tue, 4 Jan 2000 12:30:49 -0600 (CST)

On Fri, 31 Dec 1999, Hataguchi Takeshi wrote:
> On Thu, 30 Dec 1999, Klaus Weide wrote:
> > On Thu, 30 Dec 1999, Hataguchi Takeshi wrote:
> > > On Tue, 28 Dec 1999, Henry Nelson wrote:
> >     Hataguchi Takeshi wrote:
> > > > > > By the way, I'm wondering whether ASSUME_CHARSET doesn't work for
> > > > > > Japanese as expected now, as you once wrote.
> > > > > Do you know the relationship between ASSUME_CHARSET and
> > > > > "kanji code", which can be changed by ^L with SH_EX?
> > > >
> > > > ASSUME_CHARSET is turned off for CJK, as far as I know. [...]
> > 
> > > Thank you very much. Now I see ASSUME_CHARSET is off for CJK.
> > > But I've not understood why it's off. I'll continue to check archives.
> > 
> > Can you please describe in detail what you mean with "is off".
> > What did you try, what did you expect, and what did actually happen?
> 
> I might be confusing. I'm sorry that no one wrote "is off" in the 
> thread. 

Yes, I don't think that the thread that Henry referred to can serve to
explain "why ASSUME_CHARSET is turned off" (which is still an
assertion in question anyway).  I didn't reread all of it, so I may be
wrong.  It looks relevant for other reasons, though.

> I found this description in lynx.cfg and thought 
> if CJK mode is on, then ASSUME_CHARSET has no meaning.
> 
> | # Raw (CJK) mode
> | #
> | # Lynx normally translates characters from a document's charset to display
> | # charset, using ASSUME_CHARSET value (see below) if the document's charset
> | # is not specified explicitly.  Raw (CJK) mode is OFF for this case.
> 
> I hadn't tried anything when I wrote the last mail.
> 
> Now I tried some files attached to this mail.
> 
>     metaEUC.html
>         Documents in euc-jp with META tag
>     metaSJIS.html
>         Documents in shift_jis with META tag
>     metaSJIS2.html
>         Documents in shift_jis with wrong META tag (x-sjis)
>     nometaEUC.html
>         Documents in euc-jp without META tag
>     nometaSJIS.html
>         Documents in shift_jis without META tag
> 
> I got the result from first two files as expected.

Is this with CJK_EX?  without CJK_EX?

> I knew the charset specified by META tag is valid as you wrote.
> 
> I got bad result from the third file metaSJIS2.html, 
> which declares charset as x-sjis. 
> I know x-sjis isn't in IANA's character sets.
> But there are some pages declaring charset as x-sjis, 
> because Netscape had added x-sjis and x-euc-jp to the charset
> and allowed to use them in the META tag independently
> before Shift_JIS and EUC-JP were added in IANA charset.
> So I feel happy if Lynx allows x-sjis and x-euc-jp.
>
> # I referred to this page, but unfortunately it's in Japanese.
> # http://www.bekkoame.or.jp/~poetlabo/WWW/charset.html

If there is anything of particular importance, and that isn't general
knowledge, and that you haven't said already, please translate or
paraphrase. :)

> I tried nometaEUC.html by setting ASSUME_CHARSET as euc-jp and
> DISPLAY_CHARSET as Japanese (EUC-JP), but I got bad result, 
> which is as same as the result by setting ASSUME_CHARSET as iso-8859-1.
> I wanted the same result as one from metaEUC.html.

(Let's call this [A] for reference)

You don't say how you set ASSUME_CHARSET, i.e., whether from lynx.cfg,
from the command line, or from the Options Menu.

If you were using the O.M., please try again using one of the other
methods.  Also try starting lynx with and without the -raw toggle
(possibly in addition to an -assume_...).  Not sure this makes a
difference, but let's just eliminate potential complications that
might be introduced by the Options menu code.

> I got also bad result from nometaSJIS.html by setting 
> ASSUME_CHARSET as shift_jis and DISPLAY_CHARSET as Japanese (EUC-JP).

Can you characterize this bad result in simple words?  (Like you did
with [A] "same as the result by setting ASSUME_CHARSET as iso-8859-1").

> It seems ASSUME_CHARSET has no effect in these experiments.

> > I am not aware of ASSUME_CHARSET being explicitly turned off for CJK.
> > It's just that ASSUME_CHARSET, basically, has the equivalent effect of
> > a META tag with a charset (only with a lower priority); or possibly has
> > less effect (no call to HText_setKcode - see below).  If an explicit
> > charset in a META tag has no effect for CJK, then it is no surprise if
> > ASSUME_CHARSET has no effect, either.
> 
> META tag has effect but ASSUME_CHARSET doesn't as I wrote above.
> 
> > 
> > Well - I expect that ASSUME_CHARSET does have an effect if
> >  (a) Display Character Set is a CJK character set, and ASSUME_CHARSET points
> >      to a non CJK charset (possibly only with raw/CJK toggle state being
> >      off?)
> >      or
> >  (b) Display Character Set is a non-CJK character set, and ASSUME_CHARSET
> >      points to a CJK charset.
> 
> I tried. But I can't read many Japanese documents.
> What kind of situations do I have to use these settings?

As for (a) - 
This should be useful. Say your local environment is Japanese, with Japanese
fonts (say, EUC encoding), that's why you should have "Display Character
Set is a CJK character set".  Now you want to browse Russian Web sites that
are written in Cyrillic. (All just for example, of course.  Replace with
charset for any other non-CJK encoding recognized by Lynx if you like.)
You encounter http://xxx.yyy.ru which is in windows-1251 encoding without
being labelled as such (or as anything else).  So you should set
ASSUME_CHARSET to "windows-1251".  This *should* have the result that you
will see the Cyrillic characters transliterated (or is that transcoded or
transcripted) into Latin ASCII characters.  At least, when the Raw/CJK
toggle state is OFF.

As for (b) -
This case shouldn't be useful for anyone who does have a Japanese
environment.  It would be useful for someone who doesn't and wants to read
(charset-unlabelled) pages in Japanese.  But it cannot be in fact useful
unless and until Lynx learns to translate *from* Japanese encoding(s) *to*
other encodings.

> > > I think ASSUME_CHARSET is a something which should play this role.
> > > Anyway I'll try to find the reason ASSUME_CHARSET is off for CJK.

#######################################################################
A History Of Raw Mode And All That Stuff.
(maybe more logical than factual.  probably not even that.)

First lynx could translate entities.  It could even translate
characters, *from* iso-8859-1 *to* some code pages.  The latter were
called just character_set (no need to qualify, since that was the only
occurrence of anything characterset-like).  We now call it Display
c.s. (but its historical primordiality still shows in the lynx.cfg
option name CHARACTER_SET).

What's more, Lynx could even *not*-translate input characters.  Thus
two modes were possible, for each possible D.c.s. (except perhaps they
were more-or-less the same for D.c.s.==iso-8859-1), so "Raw Mode" toggling
('@', -raw) was invented.

[Yes, the order of -raw vs. -jpn is actually wrong, if you really want
to know the historically correct series of events study the CHANGES*... ]

There was no explicit notion of an "assumed charset".  There was no need
for it, since - whenever translation of characters did take place -
iso-8859-1 was implicitly assumed.

Then lynx learned to do "Japanese character translations" (this is the
term used for the first recorded occurrence in the annals), which is
some kind of magic usually not understood by Europeans.  Of course there
had to be a way to turn this OFF.  I.e., a LYK_JPN_TOGGLE / -jpn switch.

Then there came C and K to make J into "CJK".  Then the functionality of
-jpn and -raw was unified into -raw, the "generic -raw switch", also
known as "raw or CJK mode" or "raw 8-bit or CJK Mode".

For "CJK languages", i.e. in its function as "CJK Mode", i.e. when the
D.c.s. was a CJK one, setting this ON would mean that that magical
"Japanese character translation" would take place.  For other D.c.s.s,
setting this ON would mean that NO table-based character translation
('t.b.c.t.', see below) was done.  Setting it OFF would mean that the
usual iso-8859-1 -> D.c.s. t.b.c.t. took place for raw input
characters.

"Translation" (i.e. expansion) of HTML character entities (and NCRs)
to the D.c.s. was always done.

The 't.b.c.t.' at that point was based on tables in LYCharSets.c, in
conjunction with some entity tables in SGML.c.  Those were mappings
from entity names and (effectively, though indirectly) from
iso-8859-1 to the various known D.c.s.s.  For many of the latter
(including all CJK ones), they would just map to 7-bit replacement
strings.

Once again, to summarize the meaning/effect (then and btw. still current)

  Display  | '@'/-raw is more   |                  setting
   c. s.   | specifically called|        ON           |     OFF
==========================================================================
   non-CJK | "Raw Mode" toggle  | Don't transl. 8-bit | t.b.c.t. for 8-bit
--------------------------------------------------------------------------
     CJK   | "CJK mode" toggle  | Do CJK (Jpn) magic  | t.b.c.t. for 8-bit
--------------------------------------------------------------------------
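
The table can also be written down as a tiny decision function.  This is
only a sketch of the table itself, not code from Lynx: the enum and
function names are made up for illustration, and "CJK conversion" just
stands for whatever the existing kanji-code handling does.

```c
/* Hypothetical labels for the three behaviours listed in the table. */
typedef enum {
    ACT_PASS_8BIT_UNTRANSLATED, /* non-CJK display, toggle ON      */
    ACT_CJK_CONVERSION,         /* CJK display, toggle ON          */
    ACT_TABLE_TRANSLATE         /* any display, toggle OFF         */
} RawAction;

/* Both arguments are treated as booleans (0 or non-zero). */
static RawAction raw_toggle_action(int display_is_cjk, int raw_on)
{
    if (!raw_on)
        return ACT_TABLE_TRANSLATE;   /* t.b.c.t. for 8-bit input */
    return display_is_cjk ? ACT_CJK_CONVERSION
                          : ACT_PASS_8BIT_UNTRANSLATED;
}
```

Note that the OFF column is the same for both rows; only the ON
behaviour depends on whether the D.c.s. is a CJK one.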

Then - "chartrans" code came into existence.  Which stands just for
translation (more properly probably, transcoding) between character
sets (or encodings), but with a heavy emphasis on non-CJK, since the
author of most of the initial code (me) didn't understand a thing
about that magical CJK stuff - or just enough to recognize that it
"does something" and that it should better continue doing that
something whatever changes are made for other parts.

The basic new idea of chartrans was to not only translate *to* N
different (display) character sets, but to also be able to translate
*from* M different input charsets.  So instead of just N (or 2*N, if
one considers the '@' state as a separate thing), there would be M*N
different from/to combinations.  In principle (at least), the code
should be able to char-translate for all those pairs.

A fateful decision was made at this point.  To simplify coding (one
kind of 'LYhndl' index could be used to identify both input and
output), the possible M input 'charsets' and the possible N output
D.c.s.s were identified as being the same thing, just under two
different kinds of names.  I.e. "Western (ISO-8859-1)" (to use the
current name) is a different kind of name for "iso-8859-1", etc.  This
makes sense for most or all of the 8-bit (mostly "European") character
sets which were the focus for the chartrans code, and even for
"UNICODE (UTF-8)" aka "utf-8".  It doesn't really capture reality for
the CJK charsets though, let's take Japanese: there are only two
D.c.s.s, but there are more (significantly different) input 'charset's
that somehow correspond to those two.  Well.  The method of Procrustes
the ancient programmer was applied.

But - you see - the CJK "magic" stuff (IOW the stuff I didn't try to
disturb too much) already had its own variables - things like HTCJK
and kanji_code - and thus seemed perfectly willing, as well as able,
to look after itself.  As long as we continue to feed it the same
input, it should continue to do the same thing and continue to work
as before, at least.

Now, (back to the general case) as for selecting *from* what to
translate - well, there are the various ways a server (or page) can
tell us what charset a page is in (HTTP header, META).  Try to
recognize and use that.  (Well, to be more correct, it already *was*
being parsed and used before.)  (Also can recognize a charset as part
of SUFFIX rule - but let's ignore that here, or regard it as
equivalent to one of the above).  If no charset label present - we
have to assume something.  Just assuming "iso-8859-1" in all cases
wouldn't really cut it.  So "ASSUME_CHARSET" / -assume_charset (and
variations) came into existence.  A simple charset thingie that gets
set at startup and then never changes, at first (UCLYhndl_for_unspec).
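
The fallback order described in this paragraph can be sketched as
follows.  This is a simplified model, not the Lynx implementation: the
function name is hypothetical, an "explicit label" stands for whatever
came from an HTTP header, a META tag or a SUFFIX rule, and the sketch
deliberately says nothing about which of those wins over the others.

```c
#include <stddef.h>

/*
 * explicit_label: charset taken from the document's own labelling,
 *                 or NULL / "" if the document is unlabelled.
 * assumed:        the user's ASSUME_CHARSET, or NULL / "" if unset.
 * Returns the charset name to translate *from*.
 */
static const char *effective_input_charset(const char *explicit_label,
                                           const char *assumed)
{
    if (explicit_label != NULL && explicit_label[0] != '\0')
        return explicit_label;        /* a label always has priority */
    if (assumed != NULL && assumed[0] != '\0')
        return assumed;               /* otherwise use the assumption */
    return "iso-8859-1";              /* the historical Web default  */
}
```

For example, an unlabelled page fetched with ASSUME_CHARSET set to
"windows-1251" would be treated as windows-1251 in this model, while a
page labelled "euc-jp" would be treated as euc-jp regardless of the
assumption.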

Then [did I mention this isn't chronologically correct?] it was
realized that there were still all those legacy variables, including
the one(s) representing Raw mode, which had become ill-defined in the
context of the new (N*N instead of 2*N) world.  Somehow the existence
of Raw mode had to be reconciled with the ASSUME_CHARSET thing.  Raw
did suppress table based translations somehow, but those translations
weren't really the same any more, and in some cases raw as "really
completely untranslated bytes" made less and less sense.

Looking at all that Raw Mode actually did - *for not-CJK D.c.s.s at
least* -, I realized that its function (in the old world, but
expressed in the language of the new) could be understood as just
toggling between two different from/to pairs.  Namely between
"iso-8859-1"/D.c.s. and "[D.c.s.-equivalent MIMEname]"/D.c.s., holding
the D.c.s.  constant.  In other words, a toggling of the actual
(effective) ASSUME_CHARSET between two values.

So, since an ASSUME_CHARSET was already there, '@'/-raw could just be
reimplemented/redefined, more or less, as far as non-CJK went at
least, in terms of doing just that - toggle the value of
UCLYhndl_for_unspec between two possible ones (one being the
current D.c.s.).

As an added generalization, the other of those two possible values
wasn't left fixed as "iso-8859-1" (as would be required for strict
modeling of the pre-chartrans behavior) - no, it became the new
meaning of the ASSUME_CHARSET configuration option to specify that
other value.  IOW, ASSUME_CHARSET (likewise -assume_charset) specified
(and still does) a "default input charset", but that isn't always the
*effective* "default input charset".

So -raw, as a toggle, is synonymous with selecting "the other" from/to
combination.  As well as, actually, the other way around: the current
from/to combination can be said to determine whether we are in Raw mode
or not.  (If from==to then we are; otherwise we aren't.)
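
A minimal sketch of that view, for non-CJK display character sets only.
The struct and function names are invented for illustration, and the
integer "handles" merely stand in for the kind of index that identifies
a charset.

```c
/* A from/to combination; both members are hypothetical charset handles. */
typedef struct {
    int from;   /* effective assumed input charset */
    int to;     /* display character set           */
} CharsetPair;

/* Raw mode is not stored separately here: it is derived from the pair. */
static int pair_is_raw(CharsetPair p)
{
    return p.from == p.to;
}

/*
 * Toggle: keep the display charset fixed and switch the input side
 * between the display charset and the configured "other" value.
 */
static CharsetPair toggle_raw(CharsetPair p, int configured_assume)
{
    CharsetPair next = p;
    next.from = pair_is_raw(p) ? configured_assume : p.to;
    return next;
}
```

One gap in this sketch: if the configured value equals the display
charset, toggling out of raw mode changes nothing.  The 1997-05-14
CHANGES entry quoted further down covers that case by reverting to
iso-8859-1.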

Still later, the equivalent of changing ASSUME_CHARSET also became
available on the Options Menu.  Well.  Does that option actually
change the currently effective "default input charset", or the
backup "default input charset" that isn't necessarily actually
effective, or both?  I think both, but I'm not so sure that's
always the case.  In any case, changing this from the O.M. may
not always become effective immediately.

Now, where did all this leave CJK?  More specifically, what function
does ASSUME_CHARSET have, and what function does '@'/-raw have, if the
D.c.s. is a CJK one?  Well - I'm not sure, now.

As far as the basic functionality addition of chartrans goes -
being able to table-translate from M charsets instead of just one -
that should work fine.  As long as that "from" charset is actually
one that is subject to chartrans table-translation, i.e. not one
that is another CJK charset.

As for other combinations, especially when ASSUME_CHARSET points to a
CJK charset - that situation probably still awaits a sensible
definition of its meaning.  Or something like that.  So far nobody
concerned about CJK matters seems to have cared too much, so it has
remained a mushy area.  Maybe you can change that.

Can the equivalence between setting -raw on the one hand, and setting
-assume_charset together with a specific D.c.s. on the other hand,
for selecting a specific from/to pair, be maintained (or created
for the first time...) for CJK D.c.s.s?  I dunno.

Finally two pieces of CHANGE* entries as some kind of snapshot
evidence.  The first one may or may not be equivalent to the text
that's now in lynx.cfg, or somewhere else. - I didn't specifically
compare.

1997-05-14
* (chartrans) Changes in LYCharSets.c to HTMLSetCharacterHandling() and
  HTMLSetUseDefaultRawMode() to support (hopefully) more consistent
  and user-friendly handling of raw mode and its default.
  Note that the following description does not apply if the display
  character set is one of the CJK settings.  In that case, -raw and
  the corresponding Options setting is used as a CJK toggle as before.
  Note that the -raw flag is a toggle.  It changes the "raw mode"
  setting from the default.  The current setting of "raw mode" can be
  seen on the Options screen, and is also shown in a statusline message
  when the RAW_TOGGLE key (normally '@') is used.
  The default depends on the display character set (as previously)
  but now also on the ASSUME_CHARSET setting (as determined by a setting
  in lynx.cfg, possibly overridden by -assume_charset on the command
  line, or the default iso-8859-1).  When the display character set
  corresponds to the ASSUME_CHARSET, the default for "raw mode" is ON,
  otherwise it is OFF.
  The effect of "raw mode" on the interpretation of documents which have
  no explicit charset label (from HTTP headers, a META tag, or otherwise)
  is as follows.  There is an internal "assume charset" which may differ
  from the user-specified ASSUME_CHARSET value.  When "raw mode" is set
  ON, the internal variable is set to correspond to the display character
  set.  When "raw mode" is set OFF, the internal variable changes to
  the user-specified ASSUME_CHARSET or, if that also corresponds to the
  display character set (so that otherwise there would be no change),
  reverts back to the iso-8859-1 Web default.
  Raw mode doesn't imply total rawness.  HTML character entities may
  be expanded and translated with either setting, 8-bit characters which
  are inappropriate for the display character set will not be sent to
  the terminal.  For a "more raw" setting, try the "Transparent" pseudo
  display character set. - KW
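
The rules in this entry (which, as it says, describe the non-CJK case)
can be paraphrased in a few lines.  This is a sketch of the prose only:
the handle constants and function names are hypothetical, and nothing
here is copied from LYCharSets.c.

```c
/* Hypothetical charset handles, for illustration only. */
enum { CS_LATIN1 = 0, CS_KOI8_R = 1, CS_WIN1251 = 2 };

/* Default for "raw mode": ON when the display charset corresponds
 * to the user-specified ASSUME_CHARSET, otherwise OFF. */
static int default_raw_mode(int display, int assume_cfg)
{
    return display == assume_cfg;
}

/* The internal "assume charset" used for unlabelled documents. */
static int internal_assume(int raw_on, int display, int assume_cfg)
{
    if (raw_on)
        return display;        /* raw ON: assume the display charset */
    if (assume_cfg != display)
        return assume_cfg;     /* raw OFF: use the user's setting    */
    return CS_LATIN1;          /* ...unless that would be no change  */
}
```

So with a koi8-r display and ASSUME_CHARSET windows-1251, the two states
of the toggle assume koi8-r and windows-1251; with both set to koi8-r,
they assume koi8-r and iso-8859-1.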

Finally, this seems to represent the latest significant re(de)finement
in this area.  Does lynx.cfg somehow not adequately reflect it?  I haven't
checked.

1997-10-04
* Changed effect of -raw / '@' for CJK display character sets: it now toggles
  the effective charset assumption between that specified with ASSUME_CHARSET
  or -assume_charset (or iso-8859-1 if none given) and the charset that
  corresponds to the selected display character set, as for non-CJK.  An
  exception is made if both charsets are CJK charsets, so that the toggle
  will still have the function of toggling CJK mode on and off.  Explicitly
  specifying a CJK charset as assumed is currently not very useful, since we
  cannot translate from that to other character sets. - KW

That "exception is made" sounds fishy... maybe that's where you should
start changing.

Btw nowadays Leonid may understand more about those iffy functions in
LYCharSets.c than I.  (And don't read my description above as if I did
everything.  I was just trying to give you my braindump on the topic,
can't dump someone else's mind...)


####################################################################

> > The first question should be why the CJK magic doesn't listen to any
> > sorts of charset at all.  Whether the best way for toggling is via
> > the ASSUME_CHARSET mechanism or some other mechanism can then be decided
> > later.

Has this - the question whether Lynx honors explicit charset labels for 
Japanese encodings - been answered positively now? Or is it still "maybe,
sometimes"?

> 
> I thought it's ASSUME_CHARSET. But now I can't understand how 
> Lynx does/should process ASSUME_CHARSET at all.
> 
> > > Thanks. It seems there are no differences between output of them.
> > > It seems <META ... CONTENT="text/html;charset=hogehoge"> has no effect
> > > for Japanese documents.
> > 
> > See especially
> >   <http://www.flora.org/lynx-dev/html/month1097/msg00110.html>
> > and
> >   <http://www.flora.org/lynx-dev/html/month1097/msg00151.html>
> > from the thread that Henry pointed out.
> 
> Thank you. I see the META tag happened to have no effect in Henry's 
> examples.
> 
> # Oops! Henry uses only x-sjis and x-euc-jp as charset.

And he should have known better...

> # I tried again by replacing x-sjis to Shift_JIS and x-euc-jp to EUC-JP
> # and got the same result.
> 
> > The code fragment quoted in the first message is still present in the
> > most recent Lynx code.  Just search for "if (ch == ' ') {".  What it
> > means, according to my understanding when I wrote that message (I have
> > not re-examined this with the current code, but I assume the effect is
> > still the same): First, we go to some trouble to set text->kcode in
> > HText_setKcode() (GridText.c), based on the charset in a META.  But
> > then HText_appendCharacter() goes and almost immediately cancels the
> > effect.  All it takes is a space (' ') character.
> 
> This strategy is useful for documents which has more than two charsets
> like Henry's. But I think they are quite rare.
> Especially in case of charset is declared explicitly, I think it's not 
> useful. 

So can you confirm that ' ' still does have the effect I described?
Also, is that with/without CJK_EX?


> > (Possibly HText_setKcode() should be called from more places, not just
> > LYHandleMETA in LYCharUtils.c, but also from HTMIME.c and HTFile.c, at
> > least; but given that it has no real effect, it's no surprise that
> > those calls have never been added.)
> 
> I hope it's also called from such places.

After looking more closely at the code (but not testing), it seems I was
wrong - the HText_setKcode() is in fact being done, but started from
HText_new().  Calling it explicitly from outside of GridText.c should be
necessary only if what-we-know-about-charset changes *after* the new
HTMainText has been started.  IOW, for META - yes, for HTMIME.c, HTFile.c -
no.

HOWEVER -
    if (anchor->charset)
        HText_setKcode(self, anchor->charset,
                       HTAnchor_getUCInfoStage(anchor, UCT_STAGE_HTEXT));
that only gets done when anchor->charset != NULL, i.e. when a charset
was specified *explicitly* (not from ASSUME_CHARSET).

That may need changing.  I think you should play around with it and
see what happens if you relax that condition.

Another place to possibly change is HTMLSetCharacterHandling() in
LYCharSets.c.  Especially this part:

    if (LYRawMode) {
        UCLYhndl_for_unspec = i;  /* UCAssume_MIMEcharset not changed! */
    } else {
        if (chndl != i &&
            (LYCharSet_UC[i].enc != UCT_ENC_CJK ||
             LYCharSet_UC[chndl].enc != UCT_ENC_CJK)) {
            UCLYhndl_for_unspec = chndl;  /* fall to UCAssume_MIMEcharset */
        } else {
            UCLYhndl_for_unspec = LATIN1;  /* UCAssume_MIMEcharset not changed! */
        }
    }

The UCT_ENC_CJK logic there may not make much sense.  In fact, it
*probably* doesn't.  But what logic *would* make more sense instead
depends on how you *want* ASSUME_CHARSET to behave, and how you
want '@'/-raw toggle to behave, and how you want them to interact
(how you want their combination to behave).  It's not clear to me how
these things should behave.
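
To experiment with alternatives, it may help to have the conditional
above as a standalone function whose outcomes can be tabulated.  The
following is a model under assumptions, not the real code: it takes `i`
to be the display charset handle and `chndl` the handle for the
configured ASSUME_CHARSET, and the types, table and handle names are
stand-ins invented for the example.

```c
typedef enum { ENC_8BIT, ENC_CJK } EncKind;  /* stand-in for the .enc field */

typedef struct {
    const char *name;
    EncKind     enc;
} Charset;

/* Hypothetical handles and a tiny charset table. */
enum { H_LATIN1 = 0, H_WIN1251, H_EUC_JP, H_SJIS, H_COUNT };

static const Charset charsets[H_COUNT] = {
    { "iso-8859-1",   ENC_8BIT },
    { "windows-1251", ENC_8BIT },
    { "euc-jp",       ENC_CJK  },
    { "shift_jis",    ENC_CJK  },
};

/* Mirrors the shape of the quoted if/else; returns the handle that
 * would be used for documents without a charset label. */
static int unspec_handle(int raw_mode, int display, int assumed)
{
    if (raw_mode)
        return display;
    if (assumed != display &&
        (charsets[display].enc != ENC_CJK ||
         charsets[assumed].enc != ENC_CJK))
        return assumed;
    return H_LATIN1;
}
```

In this model, with raw mode off, a euc-jp display combined with an
assumed charset of either euc-jp or shift_jis yields iso-8859-1.  If
that reading of the variables is right, it would be consistent with
result [A], but that needs checking against the actual code and the
actual toggle state during the test.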

So I come back to especially the middle part of my earlier question:

> > What did you try, what did you expect, and what did actually happen?

(If that whole HTMLSetCharacterHandling() thing is too confusing, or
you don't understand how ASSUME_CHARSET normally acts for non-CJK
charsets, please ask.)

   Klaus




