
Re: lynx-dev reading sjis docs [was Re: lynxcgi problem]

From: Henry Nelson
Subject: Re: lynx-dev reading sjis docs [was Re: lynxcgi problem]
Date: Mon, 3 Jan 2000 15:00:37 +0900 (JST)

>     metaSJIS.html
>         Documents in shift_jis with META tag
>     metaSJIS2.html
>         Documents in shift_jis with wrong META tag (x-sjis)

Thanks for doing this.  In my rush I forgot.

> I know x-sjis isn't in IANA's character sets.
> But there are some pages declaring their charset as x-sjis,
> because Netscape had added x-sjis and x-euc-jp to its charset list
> and allowed them to be used in the META tag
> before Shift_JIS and EUC-JP were added to the IANA registry.
> So I would be happy if Lynx allows x-sjis and x-euc-jp.

Yes, please keep the x-* form.  The vast majority of pages I view are
described as "x-sjis," sad as it may be.  (One reason is that older
Japanese versions (2.*) of Netscape will not correctly render documents
that do not use these forms.)
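Keeping the x-* forms could be as simple as normalizing them to their registered names before the codec lookup.  A standalone sketch (the alias table and function here are hypothetical illustrations, not Lynx's actual chrtrans data):

```python
# Hypothetical alias table; Lynx's real charset tables differ.
CHARSET_ALIASES = {
    'x-sjis': 'shift_jis',
    'x-euc-jp': 'euc-jp',
    'shift-jis': 'shift_jis',
}

def canonical_charset(name: str) -> str:
    """Map legacy Netscape-era aliases to their registered names."""
    name = name.strip().lower()
    return CHARSET_ALIASES.get(name, name)

print(canonical_charset('X-SJIS'))     # shift_jis
print(canonical_charset('euc-jp'))     # already canonical, passed through
```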

> >  (b) Display Character Set is a non-CJK character set, and ASSUME_CHARSET
> >      points to a CJK charset.
> I tried, but I can't read many Japanese documents.
> In what kind of situation do I have to use these settings?

There used to be something called "transparent," whatever that means, that
might be useful for experimentation.

> I thought it's ASSUME_CHARSET. But now I can't understand how 
> Lynx does/should process ASSUME_CHARSET at all.

Please keep trying.  You're probably our only hope.

> This strategy is useful for documents which have two or more charsets,
> like Henry's.  But I think they are quite rare.
> Especially in case of charset is declared explicitly, I think it's not 

It was done on purpose.  I was trying to give you the worst-case scenario.
My *hope* would be that Lynx would NOT correctly convert all of the
encodings; only the one encoding that matches the document's declared
charset should be rendered correctly.  In other words, I wish Lynx would
assume the input encoding declared in the META tag or server header, in a
manner analogous to nkf's -S|J|E.  That would mean that if the author says
the document is written in shift_jis (or a commonly used alias thereof),
Lynx should go ahead and treat the document as SJIS, regardless of the fact
that anything in EUC-JP will most likely be mangled.
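This "trust what the author declared" policy can be sketched in a few lines (standalone Python for illustration, not Lynx code; the mangled output for mismatched input is the intended behavior):

```python
def render(body: bytes, declared_codec: str) -> str:
    """Decode strictly by the declared charset, the way nkf's -S/-J/-E
    flags assume their input.  A body actually written in a different
    encoding comes out mangled, by design."""
    return body.decode(declared_codec, errors='replace')

sjis_body = '漢字'.encode('shift_jis')
print(render(sjis_body, 'shift_jis'))  # correct: 漢字
print(render(sjis_body, 'euc_jp'))     # mangled, as this policy intends
```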

Quoting from the English man pages for nkf:

     -S   Assume MS-Kanji and X0201 kana input.  It  also  accept
          JIS.   AT&T EUC is recognized as X0201 kana. Without -x
          flag, X0201 kana is converted into X0208.

     -J   Assume  JIS input. It also accepts Japanese EUC.   This
          is the default. This flag does not exclude MS-Kanji.

     -E   Assume AT&T EUC input. It also accept JIS.  Same as -J.

Lower down in the man pages (BUGS section) I'm sure you are aware of this:

     Nkf cannot handle  some  input  that  contains  mixed  kanji
     codes.   Automatic code detection becomes very weak with -x,
     -X and -S.

Personally, I see no problem with this.  It is the fault of the author or
webmaster, not Lynx, if the META and document charsets do not match.  OTOH,
if the author has attempted to describe the document character set, and Lynx
fails to render it properly, then I feel Lynx "is at fault."
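The weakness nkf's BUGS section admits to reflects a genuine ambiguity rather than a flaw: some byte sequences are legal in both Shift_JIS and EUC-JP, so no detector can resolve them without outside context.  A minimal standalone illustration:

```python
# The same two bytes are a valid character in both encodings,
# but decode to different text.
data = b'\x8e\xa6'
print(data.decode('shift_jis'))  # a kanji under Shift_JIS
print(data.decode('euc_jp'))     # halfwidth katakana 'ｦ' under EUC-JP
```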

> > > # There are some Japanese documents which declare the WRONG character set.
> > > # If Lynx processes the META tag strictly, we can't get proper output
> > > # from such wrong pages.  I'm wondering whether this is one reason that
> > > # Hiroyuki added the manual overriding function.

One reason, yes.

This is perhaps where we differ in philosophy.  In general, if someone puts
up an unintelligible page, whether due to abuse of frames, or javascript, or
META, or other HTML tags, then I begin to question the value of the content
and whether I should be wasting my time visiting the site.

So my hope for Lynx is that it would do what is "reasonable" and "efficient,"
and whenever possible leave manual overrides, whose sole purpose is
appearance or error recovery (the error being on the host side, not the
client), to outside mechanisms such as lynxcgi or a proxy.
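A lynxcgi script or proxy doing that kind of error recovery essentially just recodes the body before Lynx sees it.  A hypothetical filter in Python (a real script would more likely pipe the body through nkf):

```python
def recode(body: bytes, src: str, dst: str) -> bytes:
    """Re-encode a page body, e.g. EUC-JP -> Shift_JIS, replacing
    anything unmappable rather than failing outright."""
    return body.decode(src, errors='replace').encode(dst, errors='replace')

euc_page = '日本語のページ'.encode('euc_jp')
sjis_page = recode(euc_page, 'euc_jp', 'shift_jis')
print(sjis_page.decode('shift_jis'))   # 日本語のページ
```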

> > > # In the case of NN and IE, it seems they don't process the META tag
> > > # strictly.  I think that's the reason why wrong documents exist

Also, many if not most HTML documents are not hand-written but are generated
by Ichitaro or MS-Word, and the user sometimes has incorrect settings.
All the more reason, IMHO, that Lynx should be "hard-nosed."

> Content-Disposition: attachment; filename=samples.tar.gz

Thanks.  These are good to have for reference.

