help-gplusplus

Re: Using std::wstring


From: John Max Skaller
Subject: Re: Using std::wstring
Date: Mon, 25 Oct 2004 11:56:19 +1000
User-agent: Pan/0.13.3 (That cat's something I can't explain)

On Fri, 22 Oct 2004 16:03:17 -0700, Jamiil wrote:

> Thanks for the help Guy!
> A long time ago I wrote a class wrapper for what I found to be some of
> the most common methods in std::string. In the past few days I have
> seen the word "UNICODE" and/or "Internationalization" popping up on my
> desk, so I have decided to make my programs multicultural ready. C++
> has this "std::wstring" that might be the bridge to achieve my goal.

No. Do not use wchar_t or wstring.

Instead, if you want to support Unicode you should use a string
of 32-bit integers. C++ does not provide one, so a 1-1 mapping
from string elements to code points is not strictly possible
with any standard basic_string<?> instantiation.

You have two alternatives. For total portability of your
internal code, use basic_string<unsigned char> and assume UTF-8 encoding.
This is the encoding used by the Internet and also Linux. 
You can probably get away with basic_string<char>.

With a tiny risk, you can also use basic_string<int>. An int is
likely to be wide enough on most hosted systems, although it may
fail on embedded systems where int is 16 bits.

wchar_t is 32 bit on 32 bit Linux boxes, but it is 16 bit on
Windows (and Solaris?). On a small system (eg micro controller
or small embedded platform like a mobile phone) it might even be 8 bit.
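
The width of wchar_t can be checked on any given compiler rather
than guessed. A minimal sketch follows; the helper names
wchar_bits and wchar_holds_any_code_point are made up for this
illustration and are not part of any library. It assumes the
largest code point of interest is U+10FFFF.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Hypothetical helper: number of bits in wchar_t on the platform
// this translation unit is compiled for.
inline std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true if the largest wchar_t value is at least
// 0x10FFFF, i.e. one wchar_t can hold any single Unicode code point.
inline bool wchar_holds_any_code_point() {
    return static_cast<unsigned long>(
               std::numeric_limits<wchar_t>::max()) >= 0x10FFFFUL;
}
```

If the second function returns false, one wchar_t per character
will not cover all of Unicode on that platform.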

This situation will change if you adopt the C99 extensions to
C++, which will eventually be part of the next C++ Standard.
Then you can use int32_t (and your program will fail on systems
not implementing it).
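
To illustrate what a 32-bit representation buys you, here is a
rough sketch that expands a UTF-8 std::string into one 32-bit
value per code point. The function name decode_utf8 is invented
for this example, it uses uint32_t (the unsigned sibling of
int32_t) from the C99 header <stdint.h>, and it assumes the input
is well-formed UTF-8. It does not validate anything.

```cpp
#include <cstddef>
#include <stdint.h>
#include <string>
#include <vector>

// Hypothetical helper: decode well-formed UTF-8 into one 32-bit
// value per code point. Malformed or truncated input is NOT detected.
std::vector<uint32_t> decode_utf8(const std::string& s) {
    std::vector<uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        uint32_t cp;
        std::size_t extra;  // number of continuation bytes to read
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low six bits.
        for (; extra > 0 && i < s.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Once decoded, element n of the vector is code point n, so
indexing is constant time at the cost of four bytes per character.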

If you're *only* handling plain text, with simple formatting,
plain old string with UTF-8 is the best option .. because there
is nothing to do. You're already supporting it :)

However, if you need to index into the string, or count
characters, UTF-8 is harder to work with. Each character takes
from one to four bytes, so finding position n in the string
takes O(n) time.
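
A small sketch of why this is linear: both counting and indexing
have to walk the bytes and skip UTF-8 continuation bytes (those
of the form 10xxxxxx). The names utf8_length and utf8_offset are
invented for this example, and the code assumes the string is
already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: count code points by counting every byte
// that is not a continuation byte. No validation is done.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) ++n;
    }
    return n;
}

// Hypothetical helper: byte offset of the code point with 0-based
// index `index`, or std::string::npos if there are too few code
// points. This is a linear scan from the start of the string.
std::size_t utf8_offset(const std::string& s, std::size_t index) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!is_continuation(static_cast<unsigned char>(s[i]))) {
            if (seen == index) return i;
            ++seen;
        }
    }
    return std::string::npos;
}
```

Note that s.size() still reports bytes, not characters, so the
two numbers differ as soon as the text leaves ASCII.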

Note that in all cases you have to worry about physical I/O.
There's no way to be sure what encoding your input data has,
so you either have to mandate it, or ask the client to tell you
(e.g. with a command line option). There is a standard way to
detect UTF-8 and big/little endian UCS-2 and UCS-4: examine the
first few bytes of the file in binary mode for a byte order
mark. It doesn't seem to be used much, though.
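
For what it's worth, the byte order mark check can be sketched
like this. The enum and the function name detect_bom are made up
for this illustration; the byte patterns are the encodings of
U+FEFF. It assumes you have already read the first few bytes of
the file in binary mode into a std::string. A file without a mark
comes back as unknown, and FF FE 00 00 is reported as UCS-4 even
though it could in principle be little endian UCS-2 text that
starts with a NUL character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum Encoding {
    ENC_UNKNOWN, ENC_UTF8,
    ENC_UCS2_BE, ENC_UCS2_LE,
    ENC_UCS4_BE, ENC_UCS4_LE
};

// True if s begins with the n bytes in sig.
static bool begins_with(const std::string& s,
                        const unsigned char* sig, std::size_t n) {
    if (s.size() < n) return false;
    for (std::size_t i = 0; i < n; ++i) {
        if (static_cast<unsigned char>(s[i]) != sig[i]) return false;
    }
    return true;
}

// Hypothetical helper: guess the encoding from a leading byte order
// mark. The 4-byte marks are tested first, because the UCS-4 little
// endian mark starts with the UCS-2 little endian mark.
Encoding detect_bom(const std::string& head) {
    static const unsigned char ucs4_be[] = { 0x00, 0x00, 0xFE, 0xFF };
    static const unsigned char ucs4_le[] = { 0xFF, 0xFE, 0x00, 0x00 };
    static const unsigned char utf8[]    = { 0xEF, 0xBB, 0xBF };
    static const unsigned char ucs2_be[] = { 0xFE, 0xFF };
    static const unsigned char ucs2_le[] = { 0xFF, 0xFE };
    if (begins_with(head, ucs4_be, 4)) return ENC_UCS4_BE;
    if (begins_with(head, ucs4_le, 4)) return ENC_UCS4_LE;
    if (begins_with(head, utf8, 3))    return ENC_UTF8;
    if (begins_with(head, ucs2_be, 2)) return ENC_UCS2_BE;
    if (begins_with(head, ucs2_le, 2)) return ENC_UCS2_LE;
    return ENC_UNKNOWN;
}
```

You would still want the command line option as a fallback for
the ENC_UNKNOWN case.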



