lynx-dev

Re: lynx-dev reading sjis docs [was Re: lynxcgi problem]


From: Hataguchi Takeshi
Subject: Re: lynx-dev reading sjis docs [was Re: lynxcgi problem]
Date: Fri, 31 Dec 1999 22:12:13 +0900 (JST)

On Thu, 30 Dec 1999, Klaus Weide wrote:

> On Thu, 30 Dec 1999, Hataguchi Takeshi wrote:
> > On Tue, 28 Dec 1999, Henry Nelson wrote:
> >
> > > Hataguchi Takeshi wrote:
> > > > By the way, I suspect ASSUME_CHARSET currently doesn't work for
> > > > Japanese as expected, as you once wrote.
> > > > Do you know the relationship between ASSUME_CHARSET and
> > > > "kanji code", which can be changed by ^L with SH_EX?
> > >
> > > ASSUME_CHARSET is turned off for CJK, as far as I know.  Our LAN service
> > > is very unstable right now, so I cannot try to search the archives for you,
> > > but look in the "http://www.flora.org/lynx-dev/html/month1097" archives,
> > > and grep for "did something happen to."
> 
> > Thank you very much. Now I see ASSUME_CHARSET is off for CJK.
> > But I don't yet understand why it's off. I'll keep checking the archives.
> 
> Can you please describe in detail what you mean with "is off".
> What did you try, what did you expect, and what did actually happen?

I may have been confusing, sorry. No one in the thread actually wrote
"is off". I found the following description in lynx.cfg and took it to
mean that ASSUME_CHARSET has no effect when CJK mode is on.

| # Raw (CJK) mode
| #
| # Lynx normally translates characters from a document's charset to display
| # charset, using ASSUME_CHARSET value (see below) if the document's charset
| # is not specified explicitly.  Raw (CJK) mode is OFF for this case.

I hadn't tried anything when I wrote my last mail.

Now I have tried the files attached to this mail:

    metaEUC.html
        Documents in euc-jp with META tag
    metaSJIS.html
        Documents in shift_jis with META tag
    metaSJIS2.html
        Documents in shift_jis with wrong META tag (x-sjis)
    nometaEUC.html
        Documents in euc-jp without META tag
    nometaSJIS.html
        Documents in shift_jis without META tag
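
For readers without the attachment, files of this kind could be regenerated
roughly as follows. This is only a sketch: the body text, the exact META
markup and the helper names (build_page, write_samples) are assumptions,
not the contents of samples.tar.gz.

```python
from pathlib import Path

BODY = "日本語のテスト"  # assumed sample text; any Japanese text would do

# (file name, Python codec for the bytes, charset named in META or None)
SAMPLES = [
    ("metaEUC.html", "euc_jp", "euc-jp"),
    ("metaSJIS.html", "shift_jis", "shift_jis"),
    ("metaSJIS2.html", "shift_jis", "x-sjis"),   # non-IANA label
    ("nometaEUC.html", "euc_jp", None),
    ("nometaSJIS.html", "shift_jis", None),
]


def build_page(meta_charset):
    """Return the HTML source as text, with or without a META charset."""
    meta = ""
    if meta_charset is not None:
        meta = ('<META HTTP-EQUIV="Content-Type" '
                f'CONTENT="text/html; charset={meta_charset}">\n')
    return (f"<HTML><HEAD>\n{meta}<TITLE>test</TITLE></HEAD>\n"
            f"<BODY><P>{BODY}</P></BODY></HTML>\n")


def write_samples(directory):
    """Write the five test files into directory and return their names."""
    directory = Path(directory)
    for name, codec, meta_charset in SAMPLES:
        page = build_page(meta_charset)
        (directory / name).write_bytes(page.encode(codec))
    return sorted(p.name for p in directory.iterdir())
```

Calling write_samples() on an empty directory produces the five files
listed above, which can then be opened in Lynx one by one.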

The first two files gave the expected result.
This confirms that a charset specified by a META tag takes effect, as you wrote.

The third file, metaSJIS2.html, which declares its charset as x-sjis,
gave a bad result.
I know x-sjis isn't in IANA's list of character sets.
But some pages declare their charset as x-sjis,
because Netscape introduced x-sjis and x-euc-jp on its own
and allowed them in the META tag
before Shift_JIS and EUC-JP were registered with IANA.
So I would be happy if Lynx accepted x-sjis and x-euc-jp.
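
Accepting the two legacy labels could amount to a small alias lookup before
the charset name is interpreted. The following is a minimal sketch of that
idea, not Lynx code; the table and the function name canonical_charset are
hypothetical.

```python
# Hypothetical alias table: legacy labels on the left, IANA names on the right.
CHARSET_ALIASES = {
    "x-sjis": "shift_jis",
    "x-euc-jp": "euc-jp",
}


def canonical_charset(label):
    """Map a charset label from a META tag to a lowercase canonical name."""
    key = label.strip().lower()
    # Unknown labels pass through unchanged (only lowercased and trimmed).
    return CHARSET_ALIASES.get(key, key)
```

With such a mapping, metaSJIS2.html would be handled the same way as
metaSJIS.html.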

# I referred to this page, but unfortunately it's in Japanese.
# http://www.bekkoame.or.jp/~poetlabo/WWW/charset.html

I tried nometaEUC.html with ASSUME_CHARSET set to euc-jp and
DISPLAY_CHARSET set to Japanese (EUC-JP), but got a bad result,
the same as with ASSUME_CHARSET set to iso-8859-1.
I wanted the same result as from metaEUC.html.

I also got a bad result from nometaSJIS.html with
ASSUME_CHARSET set to shift_jis and DISPLAY_CHARSET set to Japanese (EUC-JP).

It seems ASSUME_CHARSET has no effect in these experiments.

> I am not aware of ASSUME_CHARSET being explicitly turned off for CJK.
> It's just that ASSUME_CHARSET, basically, has the equivalent effect of
> a META tag with a charset (only with a lower priority); or possibly has
> less effect (no call to HText_setKcode - see below).  If an explicit
> charset in a META tag has no effect for CJK, then it is no surprise if
> ASSUME_CHARSET has no effect, either.

As I wrote above, the META tag has an effect but ASSUME_CHARSET doesn't.
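
The behaviour I expected, following the description quoted above (an
explicit META charset first, ASSUME_CHARSET as a lower-priority fallback),
can be written down as a tiny sketch. The function name effective_charset
is hypothetical and this is not how Lynx is implemented.

```python
def effective_charset(meta_charset, assume_charset):
    """Pick the document charset: explicit META first, assumed value second."""
    if meta_charset:
        # A charset declared in the document wins.
        return meta_charset.lower()
    # No declaration: fall back to the configured ASSUME_CHARSET value.
    return assume_charset.lower()
```

Under this model nometaEUC.html with ASSUME_CHARSET set to euc-jp would be
treated exactly like metaEUC.html, which is not what the experiment showed.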

> 
> Well - I expect that ASSUME_CHARSET does have an effect if
>  (a) Display Character Set is a CJK character set, and ASSUME_CHARSET points
>      to a non CJK charset (possibly only with raw/CJK toggle state being off?)
>      or
>  (b) Display Character Set is a non-CJK character set, and ASSUME_CHARSET
>      points to a CJK charset.

I tried, but with these settings many Japanese documents are unreadable.
In what kind of situation are these settings meant to be used?

> > > My *hunch* is that ASSUME_CHARSET would not offer much to help Lynx render
> > > Japanese documents.  How can you assume?
> >
> > My idea is almost the same as Hiroyuki's manual override switch.
> > We usually set it to "Japanese (Auto Detect)" and sometimes
> > set it to "Japanese (Shift_JIS)" or "Japanese (EUC)"
> > when Lynx fails to detect the document character set.
> >
> > I think ASSUME_CHARSET is something that should play this role.
> > Anyway I'll try to find the reason ASSUME_CHARSET is off for CJK.
> 
> The first question should be why the CJK magic doesn't listen to any
> sorts of charset at all.  Whether the best way for toggling is via
> the ASSUME_CHARSET mechanism or some other mechanism can then be decided
> later.

I thought it was ASSUME_CHARSET. But now I don't understand at all how
Lynx does, or should, process ASSUME_CHARSET.

> > Thanks. It seems there are no differences between output of them.
> > It seems <META ... CONTENT="text/html;charset=hogehoge"> has no effect
> > for Japanese documents.
> 
> See especially
>   <http://www.flora.org/lynx-dev/html/month1097/msg00110.html>
> and
>   <http://www.flora.org/lynx-dev/html/month1097/msg00151.html>
> from the thread that Henry pointed out.

Thank you. I see the META tag happened to have no effect in Henry's 
examples.

# Oops! Henry uses only x-sjis and x-euc-jp as charset.
# I tried again, replacing x-sjis with Shift_JIS and x-euc-jp with EUC-JP,
# and got the same result.

> The code fragment quoted in the first message is still present in the
> most recent Lynx code.  Just search for "if (ch == ' ') {".  What it
> means, according to my understanding when I wrote that message (I have
> not re-examined this with the current code, but I assume the effect is
> still the same): First, we go to some trouble to set text->kcode in
> HText_setKcode() (GridText.c), based on the charset in a META.  But
> then HText_appendCharacter() goes and almost immediately cancels the
> effect.  All it takes is a space (' ') character.

This strategy is useful for documents that mix more than one charset,
like Henry's, but I think such documents are quite rare.
In particular, when the charset is declared explicitly, I don't think
it's useful.
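
To make the point concrete, here is a toy model of the behaviour described
in the quoted paragraph. It is not Lynx's GridText.c: the class ToyText,
the "auto" placeholder and the respect_declared switch are all assumptions,
and what the real code resets text->kcode to is not shown in this thread.

```python
class ToyText:
    """Toy stand-in for the text object; not Lynx's real HText structure."""

    def __init__(self, respect_declared=False):
        self.kcode = "auto"  # placeholder meaning "guess from the bytes"
        self.respect_declared = respect_declared

    def set_kcode(self, charset):
        # Models the call made when a META tag names a charset
        # (cf. HText_setKcode in the quoted message).
        self.kcode = charset

    def append_character(self, ch):
        # Models the reported effect: a single space discards the declared
        # kanji code, unless we choose to keep explicit declarations.
        if ch == " " and not self.respect_declared:
            self.kcode = "auto"
```

With respect_declared=False the declared charset is lost at the first
space, which matches what was observed; with respect_declared=True an
explicitly declared charset would survive for the whole document.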

> (Possibly HText_setKcode() should be called from more places, not just
> LYHandleMETA in LYCharUtils.c, but also from MTMIME.c and HTFile.c, at
> least; but given that it has no real effect, it's no surprise that
> those calls have never been added.)

I hope it will also be called from those places.

> > # There are some Japanese documents which declare the WRONG character set.
> > # If Lynx processes the META tag strictly, we can't get proper output
> > # from such wrong pages. I wonder whether this is one reason that Hiroyuki
> > # added the manual override function.
> > # In the case of NN and IE, it seems they don't process the META tag
> > # strictly. I think that's the reason why such wrong documents exist
> > # in Japan. :-<
> >
> > > > If this is right, I think ASSUME_CHARSET should work properly.
> > > > # "Japanese (Auto Detect)" should be added in the list, if needed.
> > > > Don't you agree with me, Henry?
> 
> That is a user interface question that should be deferred until later.
> But it would make more sense to have
> 
>    x-autodetect_jp   # or similar
> in addition to
>    shift_jis
>    euc-jp
> 
> in the Assumed document character set list than having "Japanese (Auto
> Detect)" in the Display character set list.  One is for input, the other
> for output, and it is the character encoding of the input that would be
> "detected", not the state of the terminal display.  (I guess this is
> basically what you mean when you think about ASSUME_CHARSET?)

Right! Thank you.
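
As an illustration of an input-side "x-autodetect_jp" value, detection
could be sketched as trial decoding. This is a rough heuristic under stated
assumptions, not Lynx's detection logic: the function names are
hypothetical, only EUC-JP and Shift_JIS are considered, and byte sequences
valid in both encodings are resolved in favour of EUC-JP.

```python
def guess_japanese_charset(data):
    """Guess 'us-ascii', 'euc-jp' or 'shift_jis' for bytes, else None."""
    if all(b < 0x80 for b in data):
        return "us-ascii"
    # Try EUC-JP first; fall back to Shift_JIS if strict decoding fails.
    for label, codec in (("euc-jp", "euc_jp"), ("shift_jis", "shift_jis")):
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue
        return label
    return None


def document_charset(data, assumed):
    """Resolve the assumed charset; x-autodetect_jp triggers guessing."""
    if assumed == "x-autodetect_jp":
        return guess_japanese_charset(data)
    return assumed
```

The special value lives on the input side, next to shift_jis and euc-jp,
and the display character set is left out of the decision entirely.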
--
Takeshi Hataguchi
E-mail: address@hidden

Attachment: samples.tar.gz
Description: Binary data