Re: lynx-dev 0x9A bug

From: Karel Kulhavy
Subject: Re: lynx-dev 0x9A bug
Date: Tue, 5 Oct 1999 16:37:50 +0200

> On Tue, 5 Oct 1999, Karel Kulhavy wrote:
> [ Reformatted for quoting - watch your line length! ]
> > I've found out that when I run lynx in -dump -raw mode, Lynx removes
> > characters 0x9A from the original source.
> > This bug is in version 2.7.1 as well as in version 2.8.2rel.1.
> > I have a html file containing Czech text in cp1250 encoding. Some
> > Czech words contain char 0x9A which is small letter 's' with
> > caron. After running lynx -dump -raw on this local html file, the
> > 0x9A character is left out in the output, although the characters
> > around this character forming the word are left untouched.
> Depending on circumstances this may be expected, as a precaution
> against having this byte (and others in the range 0x80..0x9F)
> act as a control character.  It depends on your environment whether
> that makes sense or not; but if you want lynx to spit out such
> bytes as if they were normal displayable characters, you have to
> *tell it* that your Display Character Set is one where these characters
> are allowed.

Doesn't the "-raw" option include telling Lynx that everything above 0x80
is a normal letter?

> For this use of -dump, lynx uses basically the same logic as for
> normal interactive display.  So you should see the same effect.
> With -dump, lynx should use the D.C.S. saved from the Options Screen
> (in .lynxrc) or set in lynx.cfg (called simply CHARACTER_SET there).
> What OS are you using?

Linux 2.2.12 on i386

> Are you *sure* that it is lynx that is removing the character?
> Just echoing the file to the screen may not be enough to check -
> since the byte may actually act as a control character.

Yes. I had a bum.html on disk. Then I issued:
lynx bum.html -dump -raw >a
mc (Midnight Commander)
Then I viewed "a" with the built-in viewer, switched to hex mode, and saw
that the character is missing from its place.

Also, I am using lynx to get formatted text into my perverse web browser,
where the 0x9A is missing too, so it's not a bug in Midnight Commander.
The actual system of getting data from lynx into my program consists of pipe(),
fork(), dup2() onto the stdout of lynx, and execlp(lynx).
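
A minimal sketch of that pipe()/fork()/dup2()/exec arrangement, in case it
helps to reproduce the problem outside the browser. The helper name
capture_stdout is hypothetical, and it uses execvp() instead of execlp() only
so that the argument list can be passed as an array. The parent reads raw
bytes and does no character-set handling of its own, so any byte that is
missing here was never written by the child.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] with its stdout redirected into a pipe and copy up to
 * cap bytes of that output into buf.  Returns the number of bytes
 * stored, or -1 on error.  (Hypothetical helper, not part of lynx.) */
ssize_t capture_stdout(char *const argv[], unsigned char *buf, size_t cap)
{
    int fds[2];
    pid_t pid;
    size_t total = 0;
    int failed = 0;

    assert(argv != NULL && argv[0] != NULL && buf != NULL);

    if (pipe(fds) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: make the write end of the pipe its stdout, then exec. */
        close(fds[0]);
        if (dup2(fds[1], STDOUT_FILENO) == -1)
            _exit(127);
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }

    /* Parent: read raw bytes until EOF.  Output beyond cap is drained
     * and discarded so the child never blocks on a full pipe. */
    close(fds[1]);
    for (;;) {
        unsigned char chunk[4096];
        ssize_t n = read(fds[0], chunk, sizeof chunk);
        size_t room, take;

        if (n < 0)
            failed = 1;
        if (n <= 0)
            break;
        room = cap - total;
        take = (size_t)n < room ? (size_t)n : room;
        memcpy(buf + total, chunk, take);
        total += take;
    }
    close(fds[0]);
    waitpid(pid, NULL, 0);

    return failed ? -1 : (ssize_t)total;
}
```

With an argument list such as { "lynx", "bum.html", "-dump", "-raw", NULL },
the returned buffer can then be searched for the byte 0x9A directly.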

Then I viewed the bum.html and the character was in its place.

> Does this happen only with 0x9A, or also with other characters in the
> range 0x80..0x9F?

I don't know.
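
One way to find out would be to feed lynx a test file containing every byte
in that range and compare byte counts before and after. A rough sketch, with a
hypothetical helper name and assuming both the input file and the dump have
already been read into memory. Since -dump reformats the text, the counts
only mean something for bytes that do not otherwise appear in the output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Count how often each byte value occurs in buf[0..len). */
void byte_histogram(const unsigned char *buf, size_t len, size_t counts[256])
{
    size_t i;

    for (i = 0; i < 256; i++)
        counts[i] = 0;
    for (i = 0; i < len; i++)
        counts[buf[i]]++;
}

/* Print every byte value in 0x80..0x9F that occurs fewer times in
 * `after` than in `before`, and return how many such values exist.
 * (Hypothetical helper for this test only.) */
int count_lost_bytes_80_9f(const unsigned char *before, size_t before_len,
                           const unsigned char *after, size_t after_len)
{
    size_t in[256], out[256];
    unsigned b;
    int lost = 0;

    assert(before != NULL && after != NULL);

    byte_histogram(before, before_len, in);
    byte_histogram(after, after_len, out);

    for (b = 0x80; b <= 0x9F; b++) {
        if (out[b] < in[b]) {
            printf("0x%02X: %zu in input, %zu in output\n", b, in[b], out[b]);
            lost++;
        }
    }
    return lost;
}
```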

> So what is your effective Display Character Set?  Is it actually
> what you want to get out of lynx?

I pay no attention to the display character set because I believe that when I
switch -raw on, Lynx forgets all encoding problems and only dumps the bytes.

> lynx.cfg?  You should set e.g. the first one if lynx should assume
> that local files are all in the windows-1250 charset.  Then the -raw
> should not be needed for your local file example.  (Leave it out
> when it isn't needed - it might actually confuse things.)
> Does the file contain a META tag with charset specification?
> (In that case, ASSUME_* would not be used.)
> With which screen handling library was lynx compiled? (curses/ncurses/
> slang?)  There could be some relevant code differences.

Doesn't -dump include forgetting that there is a display?


