Re: lynx-dev 0x9A bug
Wed, 6 Oct 1999 08:52:22 -0500 (CDT)
On Tue, 5 Oct 1999, Karel Kulhavy wrote:
> > On Tue, 5 Oct 1999, Karel Kulhavy wrote:
> > > I've found out that when I run lynx in -dump -raw mode, Lynx removes
> > > characters 0x9A from the original source.
> > > This bug is in version 2.7.1 as well as in version 2.8.2rel.1.
> > > I have a html file containing Czech text in cp1250 encoding. Some
> > > czech words contain char 0x9A which is small letter 's' with
> > > caron. After running lynx -dump -raw on this local html file, the
> > > 0x9A character is left out in the output, although the characters
> > > around this character forming the word are left untouched.
> Doesn't the "-raw" option include telling the Lynx that everything above 0x80
> is a normal letter?
No. Not in the current lynx, and I don't think it has ever been true
in that absolute sense (independent of character sets) in previous
versions.
Read the description in lynx.cfg in the CHARACTER_SET section.
(Take a newer version of lynx.cfg, the text I am referring to may
have been different in 2.7.1). Some excerpts:
# Raw (CJK) mode effectively changes the charset assumption about unlabeled
# documents. [...]
# Note that "raw" does not mean that every byte will be passed to the screen.
# HTML character entities may get expanded and translated, inappropriate
# control characters filtered out, etc. There is a "Transparent" pseudo
# character set for more "rawness".
> > For this use of -dump, lynx uses basically the same logic as for
> > normal interactive display. So you should see the same effect.
> > With -dump, lynx should use the D.C.S. saved from the Options Screen
> > (in .lynxrc) or set in lynx.cfg (called simply CHARACTER_SET there).
> > What OS are you using?
> Linux 2.2.12 on i386
The right choice for Lynx's display character set would depend on
your console font, or X fonts if using xterm etc., or, if you log in
remotely, on whatever the local system's actual properties are.
If that is not windows-1250 but ISO-8859-2 (rather more likely) then a
byte cannot have the value 0x9A and stand for "small letter 's' with
caron" at the same time...
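To make that concrete, here is a small Python sketch (not Lynx code) of how
the same byte reads under the two charsets. The sample word "naše" and the
drop_controls helper are my own illustration; the helper only imitates the
"inappropriate control characters filtered out" behaviour described in
lynx.cfg and is an assumption about the effect, not the actual logic.

```python
import unicodedata

# "naše" as it would be stored in a windows-1250 (cp1250) file.
RAW = b"na\x9ae"

as_cp1250 = RAW.decode("cp1250")     # 0x9A is U+0161, small s with caron
as_latin2 = RAW.decode("iso8859_2")  # 0x9A is U+009A, a C1 control code


def describe(ch):
    """Return the code point and Unicode general category of one character."""
    return f"U+{ord(ch):04X} category {unicodedata.category(ch)}"


def drop_controls(text):
    """Hypothetical filter: remove control characters except newline and tab."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )


print(describe(as_cp1250[2]))      # letter (category Ll)
print(describe(as_latin2[2]))      # control (category Cc)
print(drop_controls(as_latin2))    # the letter is gone, its neighbours remain
print("\u0161".encode("iso8859_2"))  # where ISO-8859-2 keeps small s with caron
```

If the input is taken to be ISO-8859-2, the byte is a control code and a
filter of this kind drops it while leaving the surrounding letters alone,
which matches the symptom reported.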
> > Does this happen only with 0x9A, or also with other characters in the
> > range 0x80..0x9F?
> I don't know.
So I'll assume it happens to all of those, and that there is nothing
special about 0x9A here.
But do try writing an untranslated byte 0x9A to your screen:
$ echo -e '\232'
> > So what is your effective Display Character Set? Is it actually
> > what you want to get out of lynx?
> I take no care of display character set because I believe that when I switch
> -raw on, Lynx forgets all encoding problems and only dumps the bytes.
No, it doesn't just "dump the bytes" - it has to parse and format
the HTML after all. If you were talking about -source, then it
would make sense to not touch the input bytes in any way (and that
is what lynx does).
> Doesn't the -dump include forgetting there is a display?
For the purpose of actual displaying - yes.
For the purpose of transforming text/html input into rendered text
output, the same logic is used as for interactive display (with some
differences of course, like for page counting and screen width).
That includes transcoding from an (assumed or explicit) input charset
to an output charset (where possible). That output charset just
happens to be called "display character set", which isn't really
appropriate when using -dump, but it is still the same thing.
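The transcoding step can be sketched in a few lines of Python. This is only
an illustration of the idea (assumed input charset in, output charset out,
with a substitute where no mapping exists); the transcode function and the
sample bytes are hypothetical and do not reflect how Lynx implements it.

```python
def transcode(data: bytes, assumed_input: str, output: str) -> bytes:
    """Decode bytes with the assumed input charset, re-encode for output.

    Characters the output charset cannot represent become '?'.
    """
    text = data.decode(assumed_input, errors="replace")
    return text.encode(output, errors="replace")


raw = b"na\x9ae"  # "naše" in windows-1250

# Same letter, different byte value in the output charset.
print(transcode(raw, "cp1250", "iso8859_2"))
# Output charset has no such letter, so a substitute is emitted.
print(transcode(raw, "cp1250", "ascii"))
# Multi-byte output encoding.
print(transcode(raw, "cp1250", "utf-8"))
```

The point is that the result depends on both charsets: the input assumption
decides what the byte means, and the output charset decides which byte(s),
if any, can stand for it.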
You cannot currently specify the output charset on the command line
(although you can specify the default input charset, as -assume_*),
so you have to set it either in lynx.cfg or by saving from the 'O'ptions
Screen.
For most uses this seems reasonable. Normally you want the text you get
with -dump in the same character encoding you use for display. If you
normally use lynx only with -dump - call it interactively once, visit
some pages and set it up so that characters are displayed correctly
(for correct pages!), then save options.
The -raw flag survives from pre-chartrans times; its key equivalent
'@' is somewhat more useful for convenient toggling between two states.
But mostly you shouldn't need '-raw' if you use the more flexible
-assume_* flags or have the equivalent in lynx.cfg.
(I am not sure how much of this applies to 2.7.1.)