Re: [bug-gawk] How to convert recognize a string as a Unicode char?
From: Assaf Gordon
Subject: Re: [bug-gawk] How to convert recognize a string as a Unicode char?
Date: Tue, 14 May 2019 00:30:34 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1
Hello,
On 2019-05-13 6:05 p.m., Neil R. Ormos wrote:
> Peng Yu wrote:
>> Suppose that there is a file that contains something like the
>> following, is there a way to recognize as the corresponding Unicode
>> chars instead of two strings starting with "0x"?
>> 0x2591
>> 0x2592
> [...]
> gawk --non-decimal-data '{c=0+$0; a=sprintf("%c", c); print length(a); printf "%s\n", a;}'
> For older versions of gawk, you might need a chicane.
Or, if you can use other tools, coreutils' printf can print
Unicode code points directly:
env printf '\u2591\u2592\n'
So just changing '0x' to '\u' and passing on to printf would do the job:
printf "%s\n" 0x2591 0x2592 | sed 's/^0x/\\\\u/g' | xargs -n1 printf
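The quoting in that pipeline is easier to follow one stage at a time.
A small sketch (the variable names stage1 and stage2 are only for
illustration, and it assumes an xargs that treats backslash as a quote
character, which is the default behaviour without -0):

```shell
# Stage 1: sed rewrites the "0x" prefix to TWO backslashes plus "u".
stage1=$(printf '%s\n' 0x2591 | sed 's/^0x/\\\\u/')
printf '%s\n' "$stage1"    # prints: \\u2591

# Stage 2: xargs strips one level of backslash quoting, so the command it
# runs receives a single backslash. printf '%s\n' shows the argument as-is.
stage2=$(printf '%s\n' 0x2591 | sed 's/^0x/\\\\u/' | xargs -n1 printf '%s\n')
printf '%s\n' "$stage2"    # prints: \u2591
```

In the original pipeline that final argument is used as the printf
format itself, which is why it ends up as the character.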
Or,
if you can convert the ASCII hex digits to raw bytes (i.e. strip the
'0' 'x' and turn '2' '5' '9' '1' into the two bytes '\x25\x91'), you can
use "iconv" to convert the resulting UTF-16BE to UTF-8.
(For brevity, this assumes that you only use Unicode code points up to
0xFFFF, which can mostly be treated as UTF-16, if ignoring some edge
cases is acceptable):
printf "%s\n" 2591 2592 \
| basenc --base16 --decode | iconv -f utf16be -t utf8
'basenc' (=base-encode) is a new program in coreutils 8.31.
If you don't have it, 'xxd' can also convert the ASCII hex to binary:
printf "%s\n" 2591 2592 | xxd -r -p | iconv -f utf16be -t utf8
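If the up-to-0xFFFF assumption above is a problem, or neither basenc
nor xxd is available, the UTF-8 bytes can also be computed directly in
the shell. This is only a rough sketch: byte and cp_to_utf8 are
made-up helper names, it relies on the shell accepting 0x constants in
$(( )), and it does not reject invalid input such as the surrogate
range D800-DFFF.

```shell
# byte N: emit the single byte with decimal value N via an octal escape.
byte() { printf "\\$(printf '%03o' "$1")"; }

# cp_to_utf8 HEX: emit the UTF-8 encoding of one code point,
# given as hex with or without a leading "0x" (e.g. 0x2591 or 1F600).
cp_to_utf8() {
    cp=$(( 0x${1#0x} ))
    if [ "$cp" -lt 128 ]; then            # 1 byte:  0xxxxxxx
        byte "$cp"
    elif [ "$cp" -lt 2048 ]; then         # 2 bytes: 110xxxxx 10xxxxxx
        byte $(( 0xC0 | ($cp >> 6) ))
        byte $(( 0x80 | ($cp & 0x3F) ))
    elif [ "$cp" -lt 65536 ]; then        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        byte $(( 0xE0 | ($cp >> 12) ))
        byte $(( 0x80 | (($cp >> 6) & 0x3F) ))
        byte $(( 0x80 | ($cp & 0x3F) ))
    else                                  # 4 bytes: 11110xxx then three 10xxxxxx
        byte $(( 0xF0 | ($cp >> 18) ))
        byte $(( 0x80 | (($cp >> 12) & 0x3F) ))
        byte $(( 0x80 | (($cp >> 6) & 0x3F) ))
        byte $(( 0x80 | ($cp & 0x3F) ))
    fi
}

# Usage: one character per input line.
printf '%s\n' 0x2591 0x2592 | while read -r h; do cp_to_utf8 "$h"; echo; done
```

Since the bytes are written with octal escapes, this does not depend on
the current locale, unlike printf's \u handling.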
There is also the 'uconv' program from the ICU package
(http://site.icu-project.org/), which can do a lot more Unicode conversions.
-assaf