Re: UTF-8 - retrieving and displaying multibyte characters.

bug-ncurses

From:	Chris Jones
Subject:	Re: UTF-8 - retrieving and displaying multibyte characters.
Date:	Sat, 06 Jun 2009 22:04:15 -0400
User-agent:	Mutt/1.5.13 (2006-08-11)

On Sat, Jun 06, 2009 at 06:09:11PM EDT, Thomas Dickey wrote:
> On Sat, 6 Jun 2009, Chris Jones wrote:

[..]

> It's only minimal for the display (provided that the display uses
> POSIX characters ;-).
> 
> I/O for UTF-8 does involve some changes...

When I found out that UTF-8 encoding can theoretically require as many
as 6 bytes, while most ints I've run across are 32 bits and getch() and
Co. return an int, I started to think that I had had a bad case of
wishful thinking.

:-(

[..]

> >int main(void)
> >{
> > initscr();
> > raw();
> > keypad(stdscr, TRUE);
> > uc = getch();

> getch() will return (in effect) bytes; UTF-8 is (except for 0-127) a
> multibyte code.

That's where I'm stumped. The first getch() in my example with the euro
symbol appears to return 0xE2, the first byte of the 3-byte sequence..
and that means that somewhere my U+20AC has already been converted
before this byte is retrieved by the user program.  

If I test the leading bits of the first retrieved byte, this tells me
that another two bytes follow. But if I do a second getch(), the
program asks me to enter another character at the terminal instead of
retrieving the queued 0x82.

I guess I badly need to study code that does it correctly.
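
For reference, the leading-bits test described above can be sketched as follows. This is only an illustration: utf8_seq_len() is a hypothetical helper name, not part of ncurses, and it assumes the current 4-byte limit of UTF-8 (RFC 3629) rather than the original 6-byte design.

```c
#include <stddef.h>

/* Hypothetical helper (not an ncurses function): return the number of
 * bytes in the UTF-8 sequence introduced by `lead`, or 0 if `lead`
 * cannot start a sequence.  Assumes the 4-byte limit of RFC 3629. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                 /* 0xxxxxxx: plain ASCII */
    if ((lead & 0xE0) == 0xC0)
        return 2;                 /* 110xxxxx: one continuation byte */
    if ((lead & 0xF0) == 0xE0)
        return 3;                 /* 1110xxxx: two continuation bytes */
    if ((lead & 0xF8) == 0xF0)
        return 4;                 /* 11110xxx: three continuation bytes */
    return 0;                     /* 10xxxxxx continuation or invalid lead */
}
```

For the euro sign, utf8_seq_len(0xE2) yields 3, so two more bytes (0x82 and 0xAC) are expected after the lead byte.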

[..]

> For reading UTF-8 you should be using wget_wch, which makes a
> distinction between characters and KEY_xxx codes.

After hours of googling with the wrong keywords, I found this:

http://www.helsinki.fi/atk/unix/dec_manuals/DOC_40D/AQ0R4CTE/DOCU_006.HTM

Pity they don't provide sample/demo code that invokes all the functions.

[..]

> >I was naively expecting getch() to return 0xE282AC or maybe 0x20AC -
> >since I have coded the raw() function.
> >
> >What happened to the other two bytes, and is there a way to retrieve
> >them?

> wget_wch will return the 3 bytes as one character.  wgetch will read
> each byte separately.

So there's something wrong either with my setting or with my testing or
maybe both.

As mentioned above, wgetch() retrieves the first of the three bytes, and
if I issue another call, it doesn't seem to check whether any more bytes
are queued; it just prompts me again for more input at the terminal.

As for get_wch() or wget_wch(), they hand my code the raw code point for
U+20AC, i.e. 0x20AC, instead of the UTF-8-encoded 0xE282AC.
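
The relationship between the two values can be shown with a small decoder. This is a sketch under stated assumptions: utf8_decode() is a hypothetical name, it is not what ncurses does internally, and it follows the 4-byte form of UTF-8 from RFC 3629.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from buf (len bytes
 * available).  On success, store the code point in *cp and return the
 * number of bytes consumed; return 0 if the input is truncated or
 * malformed. */
size_t utf8_decode(const unsigned char *buf, size_t len, uint32_t *cp)
{
    size_t need, i;
    uint32_t value, min;

    if (len == 0)
        return 0;
    if (buf[0] < 0x80) {                    /* single-byte ASCII */
        *cp = buf[0];
        return 1;
    } else if ((buf[0] & 0xE0) == 0xC0) {
        need = 2; value = buf[0] & 0x1F; min = 0x80;
    } else if ((buf[0] & 0xF0) == 0xE0) {
        need = 3; value = buf[0] & 0x0F; min = 0x800;
    } else if ((buf[0] & 0xF8) == 0xF0) {
        need = 4; value = buf[0] & 0x07; min = 0x10000;
    } else {
        return 0;                           /* not a valid lead byte */
    }
    if (len < need)
        return 0;                           /* sequence is truncated */
    for (i = 1; i < need; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        value = (value << 6) | (buf[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values past U+10FFFF. */
    if (value < min || value > 0x10FFFF ||
        (value >= 0xD800 && value <= 0xDFFF))
        return 0;
    *cp = value;
    return need;
}
```

Feeding it the bytes 0xE2 0x82 0xAC consumes three bytes and produces 0x20AC, which is the value get_wch() reports.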

This is the '_' snippet I mentioned in my previous post:

...

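#define _XOPEN_SOURCE_EXTENDED 1  /* expose the wide-character curses API */
#include <locale.h>
#include <curses.h>               /* wide-character build; link with -lncursesw */

int ct;
wint_t uc;

int main(int argc, char *argv[])
{
  setlocale(LC_ALL, "");          /* take the (UTF-8) locale from the environment */
  initscr();
  raw();
  keypad(stdscr, TRUE);
  ct = get_wch(&uc);              /* returns OK, KEY_CODE_YES or ERR */
  /* print on the bottom row, whatever the terminal height is */
  mvprintw(LINES - 1, 0, "Entered = %X ", (unsigned int) uc);
  refresh();
  get_wch(&uc);                   /* wait for a key before leaving curses */
  endwin();
  return 0;
}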

When I respond via a Compose + E= to the get_wch() prompt, uc ends up
containing 0x20AC.

Of course, the code displays:

Entered = 20AC

.. which is what I asked for, but it would be more useful to be able to
display the <euro> symbol.

;-)

Looks like I need to encode the 0x20AC myself, and feed the outcome to a
print function that handles UTF-8.
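
A minimal sketch of that encoding step follows. The name utf8_encode() is hypothetical, and the code assumes the 4-byte form of UTF-8 from RFC 3629; it is not taken from ncurses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode code point `cp` as UTF-8 into out[].
 * Return the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 7 bits: one byte */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 11 bits: two bytes */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 16 bits: three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 21 bits: four bytes */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the Unicode range */
}
```

For 0x20AC this writes the three bytes 0xE2 0x82 0xAC. Hand-encoding may not be necessary, though: assuming the program is linked against the wide-character library, X/Open curses functions such as addwstr() accept wide characters directly.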

[..]

> >Does anyone have some basic code available that demonstrates how I would
> >go about writing a program that might provide a dialog such as this:

> The 'A' test in ncurses' test/ncurses.c exercises wget_wch:
>   A = wide-character keyboard and mouse input test

Not sure I have this on debian "etch".

I found a directory full of sample/demo code in:

  /usr/share/doc/libncurses5-dev/examples

but I don't see much stuff that appears to be UTF-8 related.

Most of the C source is in .gz files, and configure and Makefile.in
are gzipped as well.

There is an ncurses.c.gz source file in that directory that's rather
huge (over 6000 lines) and seems to have a lot of very useful stuff, but
after gunzip'ing it, I tried to compile it and got about 20 screenfuls of
syntax errors.

Again this is debian "etch" and maybe I'm not even looking at the
correct file .. and then I don't even know if this source is meant to be
compiled and executed.

Thanks,

CJ
