lynx-dev Re: msg00798.html (was: 0x2276 handling)

From: Leonid Pauzner
Subject: lynx-dev Re: msg00798.html (was: 0x2276 handling)
Date: Thu, 30 Apr 1998 22:11:04 +0400 (MSD)

> but is a dev5 build according to Wayne.  The problem is in the handling
> of attribute values via the (excessively hairy and unmaintainable :)
> functions in v2.8's LYCharUtils.c and its UCfoo.c mods, that I did
> not use (with lengthy explanations to lynx-dev of why) in the code set
> that I had released as v2.7.2.  The homologous functions in SGML.c and
> HTPlain.c handle other conversions.  They are not coordinated in the
> v2.8 code with each other and the attribute handlers in LYCharUtils.c
> (Although I had coordinated them in the v2.7.2 release, the v2.8 release
> "superseded" v2.7.2 without having dealt with these and other problems
> in the devel code set.).  You see different problems in v2.8 depending
> on the markup, and in turn whether you are using SGML.c, HTPlain.c or
> LYCharUtils.c functions to set up the chartrans conversions. To see

1) As I understand it, HTPlain.c intentionally does not convert any escapes
or named/numeric entities, only 8-bit text (and maybe something for CJK).

2) Yes, there is a great mess in LYCharUtils.c (namely LYUCFullyTranslate...).
URL hex escaping should be split out from attribute value translation
(as was done in 2.7.2). Even more: instead of coordinating
SGML.c, HTPlain.c and LYCharUtils.c, they should simply call the same function
for chartrans (with the exception of line wrapping background in attributes,
BTW, just for this letter I have prepared a variant of sgml.html with entities
moved inside alt= attributes: I got _exactly the same_ result except
-0x200D    ‍       HTMLspecial       # ZERO WIDTH JOINER
-0x200E    ‎       HTMLspecial       # LEFT-TO-RIGHT MARK
-0x200F    ‏       HTMLspecial       # RIGHT-TO-LEFT MARK
+0x200D                HTMLspecial       # ZERO WIDTH JOINER
+0x200E                HTMLspecial       # LEFT-TO-RIGHT MARK
+0x200F                HTMLspecial       # RIGHT-TO-LEFT MARK
There is no problem here.
Anyway, URL escaping should be tested/rewritten someday.

Unfortunately, 2 months ago, when I was cleaning up chartrans a little,
you were "not available" on the list. I started by moving the old
entities stuff to unicode_entities (formerly `extra_entities'),
but found a lot of places, like the reverse translation from an isolatin1
entry to an entity name, whose purpose I did not understand (I was not sure
about CJK). So the changes were minimal.

Can you explain why we use `name = HTMLGetEntityName(value);'
for some HTPassEightBitRaw/HTPassEightBitNum and
HTPassHighCtrlRaw/HTPassHighCtrlNum combinations
instead of using `LYlowest_eightbit' and `LYHaveCJKcharacterSet' directly?
The chartrans stuff cannot be rewritten without understanding lots of such
questions, IMHO.

> the problem we've been discussing, you should have used Alex Matulich's
> test page (the URL was posted by Doug), and what his script returned
> before he modified his stuff to treat ';' instead of just '&' as the
> name=value separator (as in the HTML 4.0 recommendations, which he
> obviously has now read and understood :).
> >Yes, 0x2276 is not known for def7_uni.tbl currently, we may easily add
> >U+2276:<>
> >or something like this, if necessary.
> >
> >On the other hand, there are still a few strange characters like 0x200A
> >which are _known_ by def7_uni.tbl but report error handling
> >instead of promised substitution. This is a bug.
>         It was inappropriate to have defined any SGML named character
> references to Unicode values without also setting up default chartrans
> conversions for them (looks like there are more than just "lg").
def7_uni was expanded on a "help yourself" basis.
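
To make the "promised substitution" concrete, here is a minimal sketch of a
replacement lookup with a fallback. It is not the Lynx code: the names
repl_entry, repl_table, lookup_replacement and substitute are hypothetical,
the table is a made-up excerpt (only the U+2276:<> mapping comes from this
thread), and the choice of fallback string is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One "U+XXXX:replacement" style mapping (hypothetical struct). */
struct repl_entry {
    unsigned long code;   /* Unicode value */
    const char *repl;     /* 7-bit replacement string */
};

/* Hypothetical excerpt of a 7-bit approximation table. */
static const struct repl_entry repl_table[] = {
    { 0x00A9UL, "(c)" },  /* assumed entry, for illustration only */
    { 0x2276UL, "<>" }    /* the mapping suggested in this thread */
};

/* Return the replacement for code, or NULL if the table does not know it. */
static const char *lookup_replacement(unsigned long code)
{
    size_t i;

    for (i = 0; i < sizeof repl_table / sizeof repl_table[0]; i++) {
        if (repl_table[i].code == code)
            return repl_table[i].repl;
    }
    return NULL;
}

/* Known characters get their substitution; unknown ones get the
 * caller-supplied fallback instead of falling into error handling. */
static const char *substitute(unsigned long code, const char *fallback)
{
    const char *repl = lookup_replacement(code);

    assert(fallback != NULL);
    return repl != NULL ? repl : fallback;
}
```

With this shape, a value that is reported as "known" always yields its
replacement string, and a value that is not in the table is reported as
unknown in exactly one place (the NULL return), so every caller can share
the same error recovery.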
> Depending on which of the (uncoordinated in v2.8) functions of SGML.c,
> HTPlain.c or LYCharUtils.c is invoked (based on the markup and MIME
> type), this has created a situation in v2.8 for which strings/Unicode
> values are being passed as "known" to functions which in fact don't
> know them as SGML character references, and particularly for that
> mess in the v2.8's LYCharUtils.c, have no rational error recovery
> associated with them.
Correct. I will check whether this really happens or is only possible.
>         Also, note this problem brought out for v2.8 by Alex's test
> page:  Had the "lg" in fact been handled according to SGML principles
> as a character in the URL with a value greater than decimal 127, and
> the markup actually intended that (e.g., for an i18n path), on
> submitting it to the http server the v2.8 code is still using Klaus'
> obsolete conversion function, instead of converting it to utf-8 and
> then hex escaping each byte of the resultant multibyte character, as
> is done in such cases by the code I had released as v2.7.2.  So even
> if the chartrans stuff in v2.8 is fixed up, such URLs would still fail
> to retrieve the resource for Lynx users (the server or its script would
> have no way to back translate correctly).  I had posted lengthy messages
> about this before the v2.8 release, but... (What a "pickle" this is!
> I retired just in the nick of time. :)
>                                         Fote
> --
> Foteos Macrides (address@hidden during April, '98)
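
The conversion described in the quoted text (convert the character to utf-8,
then hex escape each byte of the resulting multibyte sequence) can be
sketched roughly as follows. This is an illustration under stated
assumptions, not the v2.7.2 code: the function names utf8_encode and
escape_code_point are hypothetical, and the encoder is limited to the
modern 4-byte UTF-8 range (up to U+10FFFF).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Encode one code point as UTF-8 into buf.
 * Returns the number of bytes written (1..4), or 0 if cp is not encodable. */
static size_t utf8_encode(unsigned long cp, unsigned char buf[4])
{
    if (cp < 0x80UL) {
        buf[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800UL) {
        buf[0] = (unsigned char) (0xC0 | (cp >> 6));
        buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;               /* surrogates are not characters */
        buf[0] = (unsigned char) (0xE0 | (cp >> 12));
        buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {
        buf[0] = (unsigned char) (0xF0 | (cp >> 18));
        buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Write the %XX escapes for the UTF-8 bytes of cp into out.
 * Returns the length of the escaped string, or 0 if cp is not
 * encodable or out is too small (3 chars per byte plus the NUL). */
static size_t escape_code_point(unsigned long cp, char *out, size_t outsize)
{
    unsigned char bytes[4];
    size_t n, i;

    assert(out != NULL);
    n = utf8_encode(cp, bytes);
    if (n == 0 || outsize < 3 * n + 1)
        return 0;
    for (i = 0; i < n; i++)
        sprintf(out + 3 * i, "%%%02X", (unsigned int) bytes[i]);
    return strlen(out);
}
```

For example, calling escape_code_point(0x2276UL, buf, sizeof buf) with a
16-byte buf leaves "%E2%89%B6" in buf. Note that this sketch escapes every
byte, including ASCII; a real URL escaper would pass unreserved ASCII
characters through unchanged.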
