coreutils
Re: Is there a way to print unicode characters and the actual code?


From: Assaf Gordon
Subject: Re: Is there a way to print unicode characters and the actual code?
Date: Sun, 25 Feb 2018 00:27:57 -0700
User-agent: Mutt/1.5.24 (2015-08-30)

Hello,

On Sat, Feb 24, 2018 at 08:12:01PM -0600, Peng Yu wrote:
> > $ od -An -tx1 -ta -tc <<< 'exámple'
> >   65  78  c3  a1  6d  70  6c  65  0a
> >    e   x   C   !   m   p   l   e  nl
> >    e   x 303 241   m   p   l   e  \n

Interestingly, FreeBSD's od(1) does support multibyte characters:

  $ printf "ex\303\241mple\n" | LC_ALL=en_CA.UTF-8 od -An -tx1c
             65  78  c3  a1  6d  70  6c  65  0a
             e   x   á  **   m   p   l   e  \n

Adding this functionality to coreutils is definitely on my TODO list (the most
recent patch includes a partially working implementation, but it is far from
complete).

> At this moment, I wrote some python code to do this, which prints both
> the decoded code as well as the encoded code in both hex and binary
> numbers in TSV format.

If you don't care about alignment, a simple perl script can do it:


  $ printf "ex\303\241mple\n" \
        | perl -C -MEncode -lne '$a=unpack("H*",encode("utf8",$_));
                                 $a=~s/(..)/$1 /g;
                                 print $a,"\n",$_'
  65 78 c3 a1 6d 70 6c 65
  exámple

If you do care about alignment, a slightly longer perl script works:

  $ printf "ex\303\241mple\n" \
      | perl -C -MEncode -lne 'foreach $c (split//) {
                                  $a=unpack("H*",encode("utf8",$c));
                                  $a=~s/(..)/$1 /g;
                                  $hex.=$a;
                                  $l=length($a)/3-1;
                                  $txt.=$c."  ".("** " x $l);
                               } ;
                               print $hex,"\n",$txt'
  65 78 c3 a1 6d 70 6c 65
  e  x  á  ** m  p  l  e
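
For comparison, the same alignment idea can be sketched in Python. This is
only an illustration, not something from coreutils: the function name
aligned_dump is made up here, and it assumes every character occupies one
terminal column (wide or combining characters would break the alignment).

```python
def aligned_dump(text):
    """Return (hex_row, char_row): UTF-8 bytes in hex, and the characters
    aligned under their first byte, with "**" under continuation bytes."""
    hex_cells = []
    char_cells = []
    for ch in text:
        encoded = ch.encode("utf-8")
        # One two-digit hex cell per encoded byte.
        hex_cells.extend(f"{b:02x}" for b in encoded)
        # The character sits under its first byte, padded to cell width.
        char_cells.append(ch.ljust(2))
        # Every further byte of the same character gets a "**" marker.
        char_cells.extend("**" for _ in encoded[1:])
    return " ".join(hex_cells), " ".join(char_cells).rstrip()


hex_row, char_row = aligned_dump("ex\u00e1mple")
print(hex_row)
print(char_row)
```

For the input "exámple" this gives the same two rows as the perl script
above.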

> $ ./dumpunicode0.py <<< á
> á    0xe1    0b11100001    0xa1c3    0b1010000111000011
> \n    0xa    0b1010    0xa    0b1010

In your example code you print one character per line
(which is not exactly what you previously asked about).

If one character per line is fine, the following sed+perl would work:

  $ printf "ex\303\241mple\n" \
        | sed 's/./&\n/g' \
        | perl -lne '$a=unpack("H*");$a=~s/(..)/$1 /g;print $_,"\t",$a'
  e       65
  x       78
  á       c3 a1
  m       6d
  p       70
  l       6c
  e       65


Or sed+awk:

  $ printf "ex\303\241mple\n" \
       | sed 's/./&\n/g' \
       | LC_ALL=C awk 'BEGIN{for(n=0;n<256;n++)ord[sprintf("%c",n)]=n}
                       {
                          n=split($0,a,"");
                          printf "%s\t", $0 ;
                          for (i=1; i<=n; i++) {
                              printf "%x ",ord[a[i]]
                          } ;
                          printf "\n"
                       }'
  e       65
  x       78
  á       c3 a1
  m       6d
  p       70
  l       6c
  e       65


And, if you don't care much about regular ASCII values, but want to
easily detect multibyte characters (and octal is acceptable), this
simple command would work:


  $ printf "ex\303\241mple\n" \
        | sed 's/./&\n/g' | sed -n 'p;l' | sed 's/\$$//' | paste - -
  e       e
  x       x
  á       \303\241
  m       m
  p       p
  l       l
  e       e
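
If Python is already in the picture, the one-character-per-line listings
above can also be sketched without sed. This is a rough sketch rather than a
polished tool: the function name per_char_dump is hypothetical, and the
extra code point column (U+XXXX) is my own addition to match what the
subject line asks for.

```python
def per_char_dump(text):
    """Return one tab-separated line per character:
    the character, its code point, and its UTF-8 bytes in hex."""
    lines = []
    for ch in text:
        # Space-separated two-digit hex for each byte of the UTF-8 encoding.
        hex_bytes = " ".join(f"{b:02x}" for b in ch.encode("utf-8"))
        lines.append(f"{ch}\tU+{ord(ch):04X}\t{hex_bytes}")
    return lines


for line in per_char_dump("ex\u00e1mple"):
    print(line)
```

To use it as a filter, read the text from sys.stdin and strip the trailing
newline from each line before passing it to the function.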


HTH,
 - assaf



