Re: Wide and UTF-8 international characters

From: D. Stimits
Subject: Re: Wide and UTF-8 international characters
Date: Sat, 17 May 2003 16:25:21 -0600
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2b) Gecko/20021018

I'm still trying to think ahead on my project, so I'm going to ask based on what I've read, but not tested (at least not with ncurses).


>If I am using just a console or xterm, without ncurses, I can output
>the full 8 bit characters as described in html 8-bit entities, echoed
>directly to a console (not with ncurses or any lib), such as "©",
>and get the copyright symbol that is like a 'c' inside of a circle (it
>happens that to echo this I echo an uninterpreted 169 decimal, typecast
>to char). So current terminals, whether console or X11, use the full 8

generally true.  But in BSD curses the 8th bit was stripped off the
character and used as a flag to tell that implementation whether
to use standout mode to highlight characters.
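
To make the cost of that scheme concrete, here is a minimal sketch in C of a one-byte cell that gives up its 8th bit to a standout flag. The names (`cell_pack`, `CELL_STANDOUT`, and so on) are hypothetical and chosen for illustration; this is not the BSD curses source.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only; not taken from BSD curses. */
#define CELL_STANDOUT 0x80u /* 8th bit reused as the standout flag */
#define CELL_CHARMASK 0x7fu /* only 7 bits remain for the character */

/* Pack a character and a standout flag into a single 8-bit cell.
 * The character's own 8th bit is discarded. */
static unsigned char cell_pack(unsigned char ch, bool standout)
{
    unsigned char cell = (unsigned char)(ch & CELL_CHARMASK);
    if (standout)
        cell |= CELL_STANDOUT;
    return cell;
}

/* Recover the (7-bit) character stored in a cell. */
static unsigned char cell_char(unsigned char cell)
{
    return (unsigned char)(cell & CELL_CHARMASK);
}

/* Report whether the cell is flagged for standout mode. */
static bool cell_is_standout(unsigned char cell)
{
    return (cell & CELL_STANDOUT) != 0;
}
```

With this layout, packing the value 169 and reading it back gives 41 (the ')' character), because the top bit that distinguished it has been taken over by the flag.
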

>bits to create their display. If the eighth bit is being used by curses,
>then the top 128 characters are lost to standout mode ability. On the
>other hand, if ncurses uses a separate byte (16 bits in all) to store

more than 8 bits, actually.

So it sounds like the 8th bit is no longer used as a flag...is that correct? But also that 1 or more bytes are then added with each character cell to provide attribute data...is that correct?
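
As a rough sketch of what I mean by keeping attribute data apart from the character: a cell wider than 8 bits can hold the full 8-bit character in its low byte and attribute flags in the bits above it. The type and mask names below are hypothetical, and the bit positions are an assumption for illustration; the curses `chtype` with its `A_CHARTEXT` and `A_ATTRIBUTES` masks works along similar lines, but its actual layout may differ.

```c
#include <stdint.h>

/* Hypothetical cell layout, for illustration only. */
typedef uint32_t wcell_t;

#define WCELL_CHARTEXT 0x000000ffu /* low 8 bits: the character itself */
#define WCELL_ATTRS    0xffffff00u /* remaining bits: attribute flags */
#define WCELL_STANDOUT 0x00000100u /* one example attribute bit */

/* Combine an 8-bit character with attribute flags. */
static wcell_t wcell_make(unsigned char ch, uint32_t attrs)
{
    return (wcell_t)ch | (attrs & WCELL_ATTRS);
}

/* Extract the character; all 8 bits survive. */
static unsigned char wcell_char(wcell_t cell)
{
    return (unsigned char)(cell & WCELL_CHARTEXT);
}

/* Extract the attribute flags. */
static uint32_t wcell_attrs(wcell_t cell)
{
    return cell & WCELL_ATTRS;
}
```

Here 169 stored with the standout flag set still reads back as 169, which is the behaviour I am hoping for.
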

>characteristics, while leaving the full 8 bits to display output, then
>ncurses can display the full 255 character entity set (html entity set)
>simply by sending the character straight to the terminal. I'm not
>positive, but this should include the full UTF-8 set, which is only
>single-byte. Is ncurses storing attribute in a separate byte already? Or

the problem with that is that it doesn't mix well with treating the
screen as an array of characters.  You _could_ store each row as a
multibyte string (with some pain at the right margin), but it would
require counting, or some added index, to find the character which
starts at a given column.
Instead, the common approach stores multiple characters for each array
position - some storage is wasted, but it's accessed more rapidly.
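
The trade-off described above can be sketched in C as follows. The struct, the constants, and both function names are hypothetical, chosen for illustration (the layout is similar in spirit to the `cchar_t` type of the wide-character curses interface, but it is not the ncurses definition). The scanning function assumes valid UTF-8 in which every character occupies exactly one column.

```c
#include <stddef.h>
#include <wchar.h>

#define CELL_MAX_CHARS 5 /* hypothetical: base character plus combining ones */
#define ROW_COLS 80      /* hypothetical screen width */

/* One screen position: attributes plus a fixed number of wide characters.
 * Unused slots are wasted storage. */
struct cell {
    unsigned attrs;
    wchar_t chars[CELL_MAX_CHARS];
};

struct row {
    struct cell cells[ROW_COLS];
};

/* Array approach: finding column `col` is a direct index. */
static struct cell *row_cell_at(struct row *r, size_t col)
{
    return col < ROW_COLS ? &r->cells[col] : NULL;
}

/* Multibyte-string approach: finding column `col` means scanning from
 * the start of the row and skipping UTF-8 continuation bytes (10xxxxxx).
 * Returns the byte offset of the character at that column, or the
 * string length if the row is shorter. */
static size_t utf8_offset_of_column(const char *s, size_t col)
{
    size_t i = 0;
    while (s[i] != '\0' && col > 0) {
        i++;
        while (((unsigned char)s[i] & 0xc0) == 0x80)
            i++;
        col--;
    }
    return i;
}
```

The first lookup costs the same for every column; the second grows with the column number unless a separate index is maintained.
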

I assume that the actual character then is always converted to a wide character, even if it is just common text not requiring a wide character (because it is easier to deal with uniform wide characters than varying-width multibyte representations with escape sequences to mark character set changes). How many bytes does the current ncurses use to store non-attribute character data? I would guess two 8-bit bytes internally per cell.

>is it the way of the old book description, with 7 bits for character,
>and the last bit for standout mode flagging? If a separate byte is used
>already, then it would seem that multibyte characters already have the
>"infrastructure" to be plugged into ncurses. [FYI, it would be rather
>useful to see an entity substitution ability, like "&copy;" in html]
>Pardon my curiosity, lately I've been looking at some non-7-bit ascii
>clients, but the clients support only 8 bit, not multibyte characters. I
>created a lightweight XML style data tree storage mechanism that uses
>XML/html entities to represent characters that cannot be easily entered
>via a keyboard, and it turned out to be far more flexible/useful than I
>thought at first. I remember seeing some of the development ncurses
>branch as partial or initial support for the wide characters, and I

that was up until mid-2001 - I didn't quite know where to begin rewriting,
but one of the contributors got it moving.  ncurses 5.3 was good enough to
use - the current code probably has isolated bugs, but I don't see any
that are related to wide-characters.  Not all functions are tested - so
I've been reviewing, and adding test-programs for places that are
noticeably not covered.

Currently on Linux, I could display a copyright symbol ('c' inside of a circle) by outputting 169 decimal cast as character (8 bits) to the terminal. I'm looking at the man page for echochar, and it appears that ncurses came up with its own version of something similar to html/xml character entities, but the ncurses version is not as complete as html/xml entities. If I were to use a printw function with a %c format, feeding it 169 decimal (or anything from 128 through 255), will ncurses ever represent the output appearance differently than had I fed that decimal number (cast as 8 bit character) directly to a standard linux console or xterm?
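
One thing that bears on this question, independent of ncurses, is the encoding the terminal expects. A single byte 169 is the copyright sign in an ISO-8859-1 style 8-bit encoding, but a terminal running in UTF-8 mode expects the same character (U+00A9) as two bytes. The following C sketch shows that encoding step for code points below 0x800; the function name `utf8_encode_small` is hypothetical.

```c
#include <stddef.h>

/* Encode a code point below 0x800 as UTF-8 (hypothetical helper).
 * Returns the number of bytes written (1 or 2), or 0 if the code
 * point is out of range for this sketch. */
static size_t utf8_encode_small(unsigned cp, unsigned char out[2])
{
    if (cp < 0x80) {
        /* 7-bit ASCII is unchanged in UTF-8 */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    return 0;
}
```

For 169 this produces the two bytes 0xC2 0xA9, so values from 128 through 255 are single bytes only under an 8-bit encoding.
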

D. Stimits, stimits AT attbi DOT com
