bug-gawk

Re: [bug-gawk] Thai UTF-8 length bug


From: Eli Zaretskii
Subject: Re: [bug-gawk] Thai UTF-8 length bug
Date: Tue, 21 Jun 2016 18:25:28 +0300

> From: PePa <address@hidden>
> Date: Tue, 21 Jun 2016 13:25:47 +0700
> 
> Couldn't find any report about this. I read that gawk as of 3.1.5 is 
> supposed to report length in characters now. That is not true for Thai 
> characters (Ubuntu 16.04 gawk 4.1.3):
> 
> LC_ALL=th_TH.UTF-8 gawk 'BEGIN {print length("ค้ม")}'
> 3
> (should be 2)

I think you are confusing characters with grapheme clusters.  The
above string has 3 code points: U+0E04, U+0E49, and U+0E21.  On display
(assuming the display supports complex script shaping), we should see
2 grapheme clusters, because the first two characters (the consonant
and the tone mark above it) combine to form a single grapheme cluster.

But Gawk doesn't count grapheme clusters; it counts characters, that
is, code points.
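
To make the distinction concrete, here is a minimal sketch in Python (not
Gawk), using only the standard library.  The function names are
hypothetical, and the second function is only an approximation: it skips
code points whose Unicode general category is a mark, which is not the
full grapheme cluster segmentation algorithm but is enough to show the
difference for this string.

```python
import unicodedata


def count_code_points(text):
    # A Python str is a sequence of code points, so len() counts them.
    return len(text)


def count_base_characters(text):
    # Approximation (an assumption for illustration, not full grapheme
    # segmentation): ignore code points in the Mark categories (Mn, Mc, Me),
    # so a combining mark is counted together with the base it attaches to.
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


# The string from the report: U+0E04, U+0E49, U+0E21.
sample = "\u0e04\u0e49\u0e21"

print(count_code_points(sample))      # 3
print(count_base_characters(sample))  # 2
```

The first number is what length() reports in a UTF-8 locale; the second
is closer to what a reader perceives on screen.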


