
[Txr-users] First cut at Unicode support.


From: Kaz Kylheku
Subject: [Txr-users] First cut at Unicode support.
Date: Wed, 11 Nov 2009 09:30:13 -0800

Hi all,

I've committed to Git the first round of changes to make txr handle
international text.

Text is internally represented using wide characters.

The lex scanner for the language recognizes the UTF-8 encoding;
it can decode all code points in the range [0, 0x10FFFF].
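
For illustration, decoding one UTF-8 sequence over that range can be sketched
roughly as below. This is not the scanner's actual code; the name utf8_decode
and its interface are made up for this example, and it assumes we want to
reject overlong forms and surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not txr's scanner code.
   Decodes one UTF-8 sequence from s (n bytes available) into *cp.
   Returns the number of bytes consumed, or 0 on invalid input. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
  size_t len, i;
  unsigned long c, min;

  assert(s != NULL && cp != NULL);

  if (n == 0)
    return 0;

  if (s[0] < 0x80) {
    *cp = s[0];                 /* plain ASCII */
    return 1;
  } else if ((s[0] & 0xE0) == 0xC0) {
    len = 2; c = s[0] & 0x1F; min = 0x80;
  } else if ((s[0] & 0xF0) == 0xE0) {
    len = 3; c = s[0] & 0x0F; min = 0x800;
  } else if ((s[0] & 0xF8) == 0xF0) {
    len = 4; c = s[0] & 0x07; min = 0x10000;
  } else {
    return 0;                   /* stray continuation or invalid lead byte */
  }

  if (n < len)
    return 0;                   /* truncated sequence */

  for (i = 1; i < len; i++) {
    if ((s[i] & 0xC0) != 0x80)
      return 0;                 /* not a continuation byte */
    c = (c << 6) | (s[i] & 0x3F);
  }

  /* Reject overlong forms, values past 0x10FFFF, and surrogates. */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;

  *cp = c;
  return len;
}
```

The result is an unsigned long rather than a wchar_t so that the full range
is representable regardless of the platform's wchar_t width.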

However, the regex engine is not yet converted to handle wide characters.
If you do a class match against a wide character, there will likely be
an out-of-bounds memory access, oops!
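
To show the kind of failure meant here: assuming a character class is stored
as a 256-entry bitmap indexed by character code (an assumption made for this
example, not a description of the actual regex code), a wide character indexes
past the end of the table unless the lookup is guarded. The names char_class,
class_add and class_match are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define CLASS_SIZE 256          /* one bit per 8-bit character code */

/* Hypothetical class representation: a 256-bit bitmap. */
typedef struct {
  unsigned char bits[CLASS_SIZE / 8];
} char_class;

/* Add an 8-bit character code to the class. */
void class_add(char_class *cc, unsigned ch)
{
  assert(cc != NULL && ch < CLASS_SIZE);
  cc->bits[ch >> 3] |= (unsigned char) (1u << (ch & 7));
}

/* Test membership. Codes outside the table are treated as non-members
   instead of being used as an index, which would read out of bounds. */
int class_match(const char_class *cc, wchar_t ch)
{
  unsigned long u = (unsigned long) ch;   /* negative values become large */

  assert(cc != NULL);

  if (u >= CLASS_SIZE)
    return 0;

  return (cc->bits[u >> 3] >> (u & 7)) & 1;
}
```

A guard like this only avoids the bad access; it does not make classes able to
contain wide characters, which needs a different representation.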

Moreover, there is still a great deal of reliance on the C library's wide
character I/O (including its conversion to and from an encoding). All I/O
needs to be converted to the internal streams library, which will do its own
conversion to and from UTF-8.
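
The output half of such a conversion might look like the sketch below. It is
only an illustration: utf8_encode is a made-up name, not a function in the
streams library, and refusing surrogates is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes code point cp as UTF-8 into out, which
   must have room for 4 bytes. Returns the number of bytes written,
   or 0 if cp is a surrogate or lies above 0x10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
  assert(out != NULL);

  if (cp < 0x80) {
    out[0] = (unsigned char) cp;
    return 1;
  }
  if (cp < 0x800) {
    out[0] = (unsigned char) (0xC0 | (cp >> 6));
    out[1] = (unsigned char) (0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    if (cp >= 0xD800 && cp <= 0xDFFF)
      return 0;                 /* surrogates are not valid characters */
    out[0] = (unsigned char) (0xE0 | (cp >> 12));
    out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {
    out[0] = (unsigned char) (0xF0 | (cp >> 18));
    out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                     /* out of Unicode range */
}
```

Doing this in the streams layer means the bytes written no longer depend on
the C library's locale settings.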

There is a dependency on the wchar_t type, which can't hold all Unicode
characters on all platforms (some compilers have a 16-bit wchar_t).
There will have to be some #ifdefs to do something sane when the input
contains characters above the range of wchar_t.
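
One possible shape for that, sketched under assumptions: test WCHAR_MAX at
preprocessing time and substitute a replacement character for anything that
does not fit. The names WCHAR_LIMIT and to_wchar are invented for this
example, and substituting U+FFFD is just one choice of "something sane".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Largest code point that can be stored in a wchar_t on this platform. */
#if WCHAR_MAX >= 0x10FFFF
#define WCHAR_LIMIT 0x10FFFFul
#else
#define WCHAR_LIMIT ((unsigned long) WCHAR_MAX)
#endif

/* Hypothetical helper: narrows a decoded code point to wchar_t.
   If cp does not fit, stores 1 in *substituted and returns U+FFFD
   (or '?' if even that does not fit); otherwise stores 0. */
wchar_t to_wchar(unsigned long cp, int *substituted)
{
  assert(substituted != NULL);

  if (cp <= WCHAR_LIMIT) {
    *substituted = 0;
    return (wchar_t) cp;
  }

  *substituted = 1;
  return (wchar_t) (WCHAR_LIMIT >= 0xFFFDul ? 0xFFFDul : (unsigned long) '?');
}
```

With a 16-bit wchar_t this turns every character above the limit into the
replacement character; with a 32-bit wchar_t only values above 0x10FFFF are
replaced.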

Cheers ...



