UTF-8 - retrieving and displaying multibyte characters.

From: Chris Jones
Subject: UTF-8 - retrieving and displaying multibyte characters.
Date: Sat, 06 Jun 2009 17:43:13 -0400
User-agent: Mutt/1.5.13 (2006-08-11)

Apologies if this is the wrong place to ask - please redirect me if it is.

This is not a bug report but rather a request for assistance.

In case it matters, this is a debian "etch" system with xterm(222) and
what looks like version 5 of libncurses and libncursesw.

I'm having a general problem understanding how to write ncurses code
that handles UTF-8 encoded characters or strings.

I thought I had read in several places that programs that were written
to handle 8-bit encodings such as the "less" utility only needed minimal
changes to run in a UTF-8 context. 

Unfortunately not much detail was given.

I wrote the following snippet to try and understand a bit more about
these aspects:

#include <locale.h>
#include <ncurses.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int uc;

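int main(void)
{
  setlocale(LC_ALL, "");                   /* pick up the locale from the environment */
  initscr();                               /* start curses mode */
  raw();                                   /* pass keystrokes through unprocessed */
  keypad(stdscr, TRUE);                    /* report arrow/function keys as KEY_* codes */
  uc = getch();                            /* read one keystroke */
  mvprintw(24, 0, "Entered = %c", uc);
  mvprintw(26, 0, "Hexadecimal = %x", uc);
  refresh();                               /* flush the output to the terminal */
  endwin();                                /* leave curses mode */
  return 0;
}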

This was compiled via:

gcc -lncurses uni01.c

I run my test on uxterm. LANG and all the LC_* environment variables
are set to en_US.UTF-8, with the exception of LC_ALL, which is empty.

If I run a.out and hit a key that corresponds to a printable character
on a US PC keyboard, such as '0' or '1' or 'a' or 'A', the expected
results are displayed:

Entered = 0

Hexadecimal = 30


If I hit a special key such as one of the arrow keys, the expected
results are also displayed.

This is what is displayed if I hit the Up Arrow key:

Entered = ^C

Hexadecimal = 103

I am just a little surprised that the value of Up Arrow is 0x103, which
is larger than 0xFF and therefore needs more than one byte, both on
uxterm and on a plain xterm with my locale variables set to en_US. This
already suggests that I am missing something, though it is probably not
related to UTF-8.

I tried to run the above code again on uxterm but this time I entered a
euro symbol U+20AC, via <Compose> + E=

Prior to running my code, I verified that if I enter the key combination
above at the bash prompt, a euro symbol is echoed back to the terminal
as expected.

When my a.out prompts me for a character, and I hit the Compose key
followed by "E" and "=" something unexpected happens: 

I no longer see a "euro symbol" echoed to the terminal, but rather an
empty box or rectangle immediately followed by a capital B.

This is what is subsequently displayed by the program:

Entered = .                                 # square box 

            Hexadecimal = e2.               # e2 and a square box

Replace the two dots ('.') above by square boxes and you will see what I
am seeing.

Also, the "Hexadecimal" string is actually indented 12 columns to the
right as above, so it looks like something is interpreted as a control
character.

I'm unsure what is happening here, but I manually encoded U+20AC to
UTF-8 and this resulted in the following three bytes: 0xE2, 0x82, 0xAC.
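
For reference, the manual encoding step can be written out as a small
function. This is only a sketch: utf8_encode is a hypothetical helper
name, not part of ncurses or the C library, and it assumes the input is
a plain Unicode code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns the number of bytes
 * written, or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                     /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes: 11110xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* out of Unicode range */
}
```

Calling utf8_encode(0x20AC, buf) fills buf with 0xE2, 0x82, 0xAC and
returns 3, which matches the bytes worked out by hand.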

In other words, it looks as if, on the one hand, the echoing mechanism
to the terminal is not working and, on the other hand, getch() returns
an integer that contains only the first byte of the encoded character.
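
If input really does arrive one byte at a time, the first byte alone
tells how long the whole UTF-8 sequence is. The sketch below shows that
rule; utf8_seq_len is a hypothetical helper name, and nothing here is
claimed to be how ncurses itself handles input.

```c
/* Hypothetical helper: return the total length in bytes of the UTF-8
 * sequence that starts with lead byte b, or 0 if b cannot start one. */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;               /* plain ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;               /* 110xxxxx (0xC0/0xC1 would be overlong) */
    if (b >= 0xE0 && b <= 0xEF)
        return 3;               /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4)
        return 4;               /* 11110xxx, capped at U+10FFFF */
    return 0;                   /* continuation byte or invalid lead byte */
}
```

For the lead byte 0xE2 this returns 3, so two more bytes (here 0x82 and
0xAC) would be expected to complete the character.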

I was naively expecting getch() to return 0xE282AC or maybe 0x20AC,
since I call raw().

What happened to the other two bytes, and is there a way to retrieve them?

Adding a second getch() just causes the program to prompt me for more
input at the terminal.

I didn't find much help reading the 3NCURSES man pages, which is to be
expected, but unfortunately I couldn't find any tutorial that explains
the basics of UTF-8 character/string input & output in the ncurses
programming environment or even some sample programs.

I did manage to write an equivalent snippet that uses get_wch() instead
and got that to "work". There is no conversion to UTF-8 encoding,
though: I retrieve the raw 0x20AC that corresponds to U+20AC.
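
The relation between the three bytes and the value 0x20AC can be checked
with a small decoder. This is a sketch only: utf8_decode is a
hypothetical helper name, and it assumes the wide-character value is the
Unicode code point itself (as is the case with glibc's wchar_t).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the first UTF-8 sequence in s, where n
 * bytes are available. Stores the code point in *cp and returns the
 * number of bytes consumed, or 0 if the input is truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t v;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {                      /* ASCII: one byte, value as-is */
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else
        return 0;                           /* not a valid lead byte */

    if (n < len)
        return 0;                           /* sequence is cut short */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* expected a 10xxxxxx byte */
        v = (v << 6) | (s[i] & 0x3F);       /* append 6 payload bits */
    }
    /* Reject overlong forms, surrogates and values past U+10FFFF. */
    if ((len == 2 && v < 0x80) || (len == 3 && v < 0x800) ||
        (len == 4 && v < 0x10000) || (v >= 0xD800 && v <= 0xDFFF) ||
        v > 0x10FFFF)
        return 0;
    *cp = v;
    return len;
}
```

Given the bytes 0xE2, 0x82, 0xAC, the function consumes 3 bytes and
stores 0x20AC, the same value that get_wch() hands back.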

Does code written for a one-byte encoding have to be completely
rewritten, where terminal input/output is concerned, to work with the
UTF-8 encoding?

Does anyone have some basic code available that demonstrates how I would
go about writing a program that might provide a dialog such as this:

Enter <euro symbol> or <yen symbol> :

<euro symbol>

You entered: <euro symbol>

That might be enough to get me started and would be greatly appreciated.


