bug-gawk

Re: [bug-gawk] How to convert recognize a string as a Unicode char?


From: Assaf Gordon
Subject: Re: [bug-gawk] How to convert recognize a string as a Unicode char?
Date: Tue, 14 May 2019 00:30:34 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1

Hello,


On 2019-05-13 6:05 p.m., Neil R. Ormos wrote:
Peng Yu wrote:

Suppose that there is a file that contains something like the
following, is there a way to recognize as the corresponding Unicode
chars instead of two strings starting with "0x"?
0x2591
0x2592

[...]

gawk --non-decimal-data '{c=0+$0; a=sprintf("%c", c); print length(a); printf "%s\n", a;}'

For older versions of gawk, you might need a chicane.


Or, if you can use other tools, coreutils' printf can print
Unicode code points directly:

   env printf '\u2591\u2592\n'

So just changing '0x' to '\u' and passing on to printf would do the job:

  printf "%s\n" 0x2591 0x2592 | sed 's/^0x/\\\\u/g' | xargs -n1 printf
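
The four backslashes in the sed replacement can look surprising. A small
sketch of the intermediate stages, assuming xargs in its default mode
(where it treats a backslash as a quoting character); the helper names
to_escapes and show_xargs_argument are made up for illustration:

```shell
# Stage 1 (same substitution as in the pipeline above): sed turns each
# "\\" in the replacement into one literal backslash, so two come out.
to_escapes() { sed 's/^0x/\\\\u/'; }

# Stage 2: xargs consumes one backslash while parsing its input. Printing
# the argument with %s shows what the final printf would receive as its
# format string, without expanding the \u escape yet.
show_xargs_argument() { xargs -n1 printf '%s\n'; }

printf '%s\n' 0x2591 | to_escapes                        # two backslashes, then u2591
printf '%s\n' 0x2591 | to_escapes | show_xargs_argument  # one backslash, then u2591
```

With only two backslashes in the sed replacement, xargs would strip the
single remaining backslash and printf would see a plain "u2591".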

Or,

If you can convert the ASCII hex digits to raw bytes (i.e. '0' 'x' '2' '5' '9' '1' to
'\x25\x91'), you can use "iconv" to convert the resulting UTF-16BE to UTF-8
(for brevity, this assumes you only use Unicode code points up to 0xFFFF,
each of which fits in a single UTF-16 code unit; code points above that
need surrogate pairs, so this works only if ignoring those edge cases is
acceptable):

    printf "%s\n" 2591 2592 \
       | basenc --base16 --decode | iconv -f utf16be -t utf8

'basenc' (=base-encode) is a new program in coreutils 8.31.
If you don't have it, 'xxd' can also convert the ASCII hex digits to binary:


    printf "%s\n" 2591 2592 | xxd -r -p | iconv -f utf16be -t utf8
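
If code points above 0xFFFF matter, one way around the UTF-16 limitation
is to build the UTF-8 bytes directly with shell arithmetic. A minimal
sketch, assuming a POSIX shell; the function name codepoint_to_utf8 is
made up for illustration, and it does no validation (surrogates and
out-of-range values are encoded as-is):

```shell
# codepoint_to_utf8 (hypothetical helper): print the UTF-8 bytes for one
# code point given in hex, with or without a leading "0x".
# Usage: printf '%s\n' 0x2591 0x2592 | while read -r cp; do codepoint_to_utf8 "$cp"; done
codepoint_to_utf8() {
    cp=$(( 0x${1#0x} ))
    if [ "$cp" -lt 128 ]; then
        # 1 byte: 0xxxxxxx
        set -- "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        set -- $(( 192 | (cp >> 6) )) $(( 128 | (cp & 63) ))
    elif [ "$cp" -lt 65536 ]; then
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        set -- $(( 224 | (cp >> 12) )) $(( 128 | ((cp >> 6) & 63) )) \
               $(( 128 | (cp & 63) ))
    else
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        set -- $(( 240 | (cp >> 18) )) $(( 128 | ((cp >> 12) & 63) )) \
               $(( 128 | ((cp >> 6) & 63) )) $(( 128 | (cp & 63) ))
    fi
    for byte in "$@"; do
        # Emit each byte through a three-digit octal escape in the format.
        printf "\\$(printf '%03o' "$byte")"
    done
}
```

Because the bytes are computed rather than looked up, this does not
depend on the current locale; whether the result displays correctly
still depends on the terminal.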


There is also the 'uconv' program from the ICU package
(http://site.icu-project.org/), which can do a lot more Unicode conversions.



-assaf







