Re: lynx-dev reading sjis docs [was Re: lynxcgi problem]


From: Klaus Weide
Subject: Re: lynx-dev reading sjis docs [was Re: lynxcgi problem]
Date: Thu, 30 Dec 1999 20:57:26 -0600 (CST)

On Thu, 30 Dec 1999, Hataguchi Takeshi wrote:
> On Tue, 28 Dec 1999, Henry Nelson wrote:
> 
> > Hataguchi Takeshi wrote:
> > > By the way, I'm wondering ASSUME_CHARSET doesn't work for Japanese
> > > as expected now as you've ever wrote.
> > > Do you know the relationship between ASSUME_CHARSET and
> > > "kanji code", which can be changed by ^L with SH_EX?
> > 
> > ASSUME_CHARSET is turned off for CJK, as far as I know.  Our LAN service
> > is very unstable right now, so I cannot try to search the archives for you,
> > but look in the "http://www.flora.org/lynx-dev/html/month1097" archives,
> > and grep for "did something happen to."

> Thank you very much. Now I see ASSUME_CHARSET is off for CJK.
> But I've not understood why it's off. I'll continue to check archives.

Can you please describe in detail what you mean by "is off"?
What did you try, what did you expect, and what actually happened?

I am not aware of ASSUME_CHARSET being explicitly turned off for CJK.
It's just that ASSUME_CHARSET, basically, has the equivalent effect of
a META tag with a charset (only with a lower priority); or possibly has
less effect (no call to HText_setKcode - see below).  If an explicit
charset in a META tag has no effect for CJK, then it is no surprise if
ASSUME_CHARSET has no effect, either.
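
The priority relation described above can be sketched in C. This is a
minimal illustration, not Lynx code: the function name pick_charset and
the idea of passing both values as plain strings are assumptions made
only for this example. It shows just the stated rule, namely that an
explicit META charset wins and the assumed charset is a fallback.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of Lynx: choose which charset name
 * applies to a document.  An explicit charset from a META tag takes
 * priority; the ASSUME_CHARSET value is used only as a fallback.
 * Returns NULL when neither value is available.
 */
const char *pick_charset(const char *meta_charset,
                         const char *assumed_charset)
{
    if (meta_charset != NULL && meta_charset[0] != '\0')
        return meta_charset;
    if (assumed_charset != NULL && assumed_charset[0] != '\0')
        return assumed_charset;
    return NULL;
}
```

With this model, if whatever consumes the chosen name ignores it for
CJK input, both sources of the name are ignored alike.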

Well - I expect that ASSUME_CHARSET does have an effect if
 (a) Display Character Set is a CJK character set, and ASSUME_CHARSET points
     to a non-CJK charset (possibly only with raw/CJK toggle state being off?)
     or
 (b) Display Character Set is a non-CJK character set, and ASSUME_CHARSET
     points to a CJK charset.

But I haven't tested this now - maybe you could.

> > My *hunch* is that ASSUME_CHARSET would not offer much to help Lynx render
> > Japanese documents.  How can you assume?
> 
> My idea is almost same as Hiroyuki's manual overriding switch.
> We usually set it as "Japanese (Auto Detect)" and sometimes
> set it as "Japanese (Shift_JIS)" or "Japanese (EUC)"
> when Lynx fails to detect document character set.
> 
> I think ASSUME_CHARSET is a something which should play this role.
> Anyway I'll try to find the reason ASSUME_CHARSET is off for CJK.

The first question should be why the CJK magic doesn't listen to any
sort of charset at all.  Whether the best way for toggling is via
the ASSUME_CHARSET mechanism or some other mechanism can then be decided
later.

> Thanks. It seems there are no differences between output of them.
> It seems <META ... CONTENT="text/html;charset=hogehoge"> has no effect
> for Japanese documents.

See especially
  <http://www.flora.org/lynx-dev/html/month1097/msg00110.html>
and
  <http://www.flora.org/lynx-dev/html/month1097/msg00151.html>
from the thread that Henry pointed out.

The code fragment quoted in the first message is still present in the
most recent Lynx code.  Just search for "if (ch == ' ') {".  What it
means, according to my understanding when I wrote that message (I have
not re-examined this with the current code, but I assume the effect is
still the same): First, we go to some trouble to set text->kcode in
HText_setKcode() (GridText.c), based on the charset in a META.  But
then HText_appendCharacter() goes and almost immediately cancels the
effect.  All it takes is a space (' ') character.
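
The interaction described above can be modelled with a toy C program.
Everything here is hypothetical: the Toy* types and toy_* functions are
stand-ins invented for illustration and are not the code in GridText.c.
The sketch also assumes that "cancels the effect" means the remembered
kanji code is reset to an undetermined value when a space arrives.

```c
#include <stddef.h>

/* Toy stand-ins; these are NOT the types or values used in GridText.c. */
typedef enum { TOY_KCODE_NONE, TOY_KCODE_EUC, TOY_KCODE_SJIS } ToyKcode;

typedef struct {
    ToyKcode kcode;   /* kanji code currently remembered for the text */
} ToyText;

/* Models the META step: remember the declared kanji code. */
void toy_set_kcode(ToyText *text, ToyKcode declared)
{
    text->kcode = declared;
}

/* Models the append step: a space throws the remembered code away. */
void toy_append_character(ToyText *text, char ch)
{
    if (ch == ' ')
        text->kcode = TOY_KCODE_NONE;
    /* a real implementation would also store or render ch here */
}

/* Feed a whole string and report the kanji code left at the end. */
ToyKcode toy_kcode_after(ToyKcode declared, const char *input)
{
    ToyText text;
    size_t i;

    toy_set_kcode(&text, declared);
    for (i = 0; input[i] != '\0'; i++)
        toy_append_character(&text, input[i]);
    return text.kcode;
}
```

In this model toy_kcode_after(TOY_KCODE_SJIS, "abc") keeps the declared
code, while any input containing a space ends with TOY_KCODE_NONE, which
is the "persists only until the first space" behaviour.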

(Possibly HText_setKcode() should be called from more places, not just
LYHandleMETA in LYCharUtils.c, but also from MTMIME.c and HTFile.c, at
least; but given that it has no real effect, it's no surprise that
those calls have never been added.)

Of course, there may be other places (e.g., earlier in processing than
GridText.c) where reacting to the input charset may need to be implemented.
But it probably doesn't make sense thinking about that, as long as even the
one place where it's kinda implemented doesn't honor it.

If text->kcode has any effect in GridText.c, then the CJK charset from
a META tag (if recognized) *should* have an effect.  The effect should
"persist" (if you can call it that) only until the first space
character gets fed to HText_appendCharacter().  You should be able to
make a test document, with a valid META charset, to investigate this
effect of the first ' '.
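
One way to produce such a test document is sketched below in C. The
function names and the file layout are assumptions made for this
example. The bytes 93 FA 96 7B 8C EA are the Shift_JIS encoding of the
three kanji of "nihongo"; the same word is placed before and after the
only space in the body text, so the two renderings can be compared.
Whether other whitespace (for example newlines between tags) also
reaches HText_appendCharacter() is not known here, so the document is
kept on a single line with no space in the title.

```c
#include <stdio.h>

/* "nihongo" (three kanji) encoded as Shift_JIS bytes. */
#define SJIS_SAMPLE "\x93\xFA\x96\x7B\x8C\xEA"

/* Return the test document as a NUL-terminated byte string. */
const char *test_document(void)
{
    return
        "<html><head>"
        "<meta http-equiv=\"Content-Type\""
        " content=\"text/html; charset=shift_jis\">"
        "<title>kcode-test</title>"
        "</head><body><p>"
        SJIS_SAMPLE " " SJIS_SAMPLE   /* one space between the samples */
        "</p></body></html>\n";
}

/* Write the test document to an open stream; 0 on success, -1 on error. */
int write_test_document(FILE *out)
{
    return fputs(test_document(), out) == EOF ? -1 : 0;
}
```

Calling write_test_document() on a file opened with fopen(name, "wb")
gives a document to load in Lynx; changing the charset in the META tag
to "euc-jp" gives a deliberately wrong declaration for comparison.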

Note that the canonical MIME name for Shift-JIS is "shift_jis", and
"x-sjis" is NOT a recognized synonym for it.  Someone has added it
to HText_setKcode() in GridText.c, but it is *not* in UCGetLYhndl_byMIME()
in UCdomap.c.  All tests using "x-sjis" are flawed.  Use "shift_jis".
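
Why an unlisted synonym fails can be shown with a toy name lookup in C.
This is a sketch under stated assumptions, not the implementation of
UCGetLYhndl_byMIME(): the table contents and the toy_* function names
are invented for illustration, and the real list of names Lynx knows is
different. The only point is that a lookup by name rejects any spelling
that is not in its table.

```c
#include <ctype.h>
#include <stddef.h>

/* Illustrative table only; NOT the list of names Lynx actually knows. */
static const char *const toy_known_charsets[] = {
    "shift_jis", "euc-jp", "iso-8859-1"
};

/* Case-insensitive comparison of two ASCII charset names. */
int toy_name_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == '\0' && *b == '\0';
}

/* Return the table index for a name, or -1 if it is not recognized. */
int toy_lookup_charset(const char *name)
{
    size_t n = sizeof toy_known_charsets / sizeof toy_known_charsets[0];
    size_t i;

    for (i = 0; i < n; i++)
        if (toy_name_equal(name, toy_known_charsets[i]))
            return (int)i;
    return -1;
}
```

Here toy_lookup_charset("Shift_JIS") finds the entry, while
toy_lookup_charset("x-sjis") returns -1, so a test page declaring
"x-sjis" would never get as far as the CJK handling in this model.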

> # There are some Japanese documents which declare WRONG character set.
> # If Lynx processes the META tag strictly, we can't get proper output
> # from such wrong pages. I'm wondering this is one reason that Hiroyuki 
> # added manual overriding function.
> # In the case of NN and IE, it seems they don't process the META tag
> # strictly. I think that's the reason why there exists wrong documents 
> # in Japan. :-<
> 
> > > If this is right, I think ASSUME_CHARSET should work properly.
> > > # "Japanese (Auto Detect)" should be added in the list, if needed.
> > > Don't you agree with me, Henry?

That is a user interface question that should be deferred until later.
But it would make more sense to have
    
   x-autodetect_jp   # or similar
in addition to
   shift_jis
   euc-jp

in the Assumed document character set list than having "Japanese (Auto
Detect)" in the Display character set list.  One is for input, the other
for output, and it is the character encoding of the input that would be
"detected", not the state of the terminal display.  (I guess this is
basically what you mean when you think about ASSUME_CHARSET?)

But this only makes sense when there actually are 3 different ways to
interpret input.

   Klaus

