bug-ncurses

Re: Surrogate pairs for addwstr?


From: Thomas Dickey
Subject: Re: Surrogate pairs for addwstr?
Date: Sun, 10 Oct 2021 19:45:46 -0400
User-agent: Mutt/1.10.1 (2018-07-13)

On Sun, Oct 10, 2021 at 11:38:22AM -0400, Bill Gray wrote:
> Hi Thomas,  Tim,
> 
> On 10/9/21 7:04 PM, Tim Allen wrote:
> > Surrogate pairs only combine to create a single character in UTF-16
> > encoded data, or on platforms (Windows, Java, JavaScript, macOS Cocoa)
> > that use UTF-16 as an internal representation. Code-points in the
> > surrogate pair range are not allowed to appear in un-encoded Unicode
> > data, so if they show up, at best they'll be ignored, but they might
> > show up as blanks or as U+FFFD � REPLACEMENT CHARACTER.
> > 
> > ncurses' wide mode might use the locale's encoding (UTF-8, almost
> > universally) or might just hard-code UTF-8 as the internal
> > representation, since it's generally the best choice for the kind of
> > data ncurses handles. The behaviour you describe is within the range of
> > behaviour I'd expect.
> 
>    Thank you.  I see your points;  in theory,  U+D834 and U+DD1E
> should only happen with UTF-16 data.  And in theory,  theory and
> practice are the same thing.  In practice,  they aren't.
> 
>    The other way to put this would be to ask : if you're on a
> system with 32-bit wchar_ts,  what should happen for this line?
> 
>   mvaddwstr( 0, 2, L"\xd834\xdd1e Treble clef with a surrogate pair");

I see (a string of wchar_t's).

ncurses processes the string one wchar_t at a time.  It only sees
that wcwidth returns a negative number for each surrogate, and so it
treats them as combining characters.

If it were to use wcsrtombs (on the whole string) to translate it
to UTF-8 and then mbsrtowcs to convert it back, one would assume that
the runtime would do the right thing and get rid of the surrogate
pairs.  But doing that would be less efficient, and it wouldn't prevent
an application from supplying the two halves in separate calls.

Of course, that wouldn't work on Windows either.
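
For illustration only, the round trip described above could look
something like the sketch below.  The helper name roundtrip_wcs is
hypothetical (it is not part of ncurses), and the sketch assumes the
caller has already selected a suitable locale with setlocale.  What the
runtime does with surrogates is implementation-dependent: it may merge
them, pass them through, or reject them with EILSEQ, in which case this
helper simply returns NULL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: round-trip a wide string through the locale's
 * multibyte encoding.  Returns a newly allocated wide string that the
 * caller must free, or NULL if allocation or either conversion fails
 * (for example, if the runtime rejects a surrogate code point). */
wchar_t *roundtrip_wcs(const wchar_t *in)
{
    assert(in != NULL);

    mbstate_t st;
    const wchar_t *wsrc = in;

    /* With a NULL destination, wcsrtombs only measures the byte length. */
    memset(&st, 0, sizeof st);
    size_t nbytes = wcsrtombs(NULL, &wsrc, 0, &st);
    if (nbytes == (size_t)-1)
        return NULL;

    char *mb = malloc(nbytes + 1);
    if (mb == NULL)
        return NULL;

    /* Second pass performs the actual wide-to-multibyte conversion. */
    wsrc = in;
    memset(&st, 0, sizeof st);
    if (wcsrtombs(mb, &wsrc, nbytes + 1, &st) == (size_t)-1) {
        free(mb);
        return NULL;
    }

    /* Convert back; the result never has more elements than bytes. */
    wchar_t *out = malloc((nbytes + 1) * sizeof *out);
    if (out != NULL) {
        const char *msrc = mb;
        memset(&st, 0, sizeof st);
        if (mbsrtowcs(out, &msrc, nbytes + 1, &st) == (size_t)-1) {
            free(out);
            out = NULL;
        }
    }
    free(mb);
    return out;
}
```

Two allocations and two conversions per string give a feel for the
efficiency cost mentioned above, and the helper still only sees what a
single call hands it.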

(since we've acquired a category of end-users who appear to be solely
interested in emojis, it's a bug that'll have to be fixed sometime)

>    At present,  I've got surrogate pairs combining regardless of
> encoding in PDCursesMod;  is there really a situation where I ought
> to instead be displaying glyphs of some sort for U+D800 to U+DFFF?

probably not
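
As a rough sketch of what "combining surrogate pairs regardless of
encoding" can mean in practice, the function below merges UTF-16
surrogate pairs found in an array of 32-bit values, in place.  It is
not taken from ncurses or PDCursesMod; the name merge_surrogates and
the choice to leave unpaired surrogates untouched are assumptions made
for the example.  For instance, the pair D834 DD1E becomes U+1D11E
(MUSICAL SYMBOL G CLEF).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: merge UTF-16 surrogate pairs stored as separate
 * 32-bit values into single code points, in place.  Returns the new
 * length.  Unpaired surrogates are copied through unchanged. */
size_t merge_surrogates(uint32_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t c = buf[i];

        /* A high surrogate followed by a low surrogate forms one pair. */
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < len
                && buf[i + 1] >= 0xDC00 && buf[i + 1] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (buf[i + 1] - 0xDC00);
            i++;    /* skip the low surrogate just consumed */
        }
        buf[out++] = c;
    }
    return out;
}
```

A caller could run this over a copy of the string before passing it to
the wide-character output routines.  It does not help when the two
halves of a pair arrive in separate calls.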

-- 
Thomas E. Dickey <dickey@invisible-island.net>
https://invisible-island.net
ftp://ftp.invisible-island.net
