od command and unicode

coreutils

From:	Alain Williams
Subject:	od command and unicode
Date:	Thu, 4 Dec 2014 15:00:43 +0000
User-agent:	Mutt/1.5.20 (2009-12-10)

I am increasingly using Unicode multibyte characters (often UTF-8) in web
pages, etc.  There are times when I get something and need to work out what it
is; sometimes things are wrongly encoded, e.g. in ISO-8859-1 when they should be
UTF-8.  It can be hard to look at a string of bytes and work out the Unicode
code points from them.
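
To illustrate the by-hand decoding that is tedious today, here is a minimal
Python sketch for the 2-byte case only. The function name is made up for
illustration, and overlong forms (lead bytes 0xC0 and 0xC1) are not rejected.

```python
def decode_two_byte_utf8(b1, b2):
    """Combine a 2-byte UTF-8 sequence (110xxxxx 10yyyyyy) into a code point."""
    if b1 & 0xE0 != 0xC0 or b2 & 0xC0 != 0x80:
        raise ValueError("not a 2-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)


# The pound sign is the bytes 0xC2 0xA3 (octal 302 243) in UTF-8.
code_point = decode_two_byte_utf8(0xC2, 0xA3)
print(f"U+{code_point:04X}")

# In ISO-8859-1 the same character is the single byte 0xA3, which is a bare
# continuation byte and therefore not valid UTF-8 on its own.
print("\u00a3".encode("iso-8859-1").hex())
```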

Suggestion: the 'od' command should do decoding and print out Unicode code
points.
I propose a '-u' option to do this. It would work in a similar way to '-x'.

** BEWARE ** below is the UTF-8 encoding of the pound sign (GBP), U+00A3.

This might wrap horribly in your mail reader.

E.g.:
echo 'They cost £1 each' | od -cu

0000000   T        h        e        y                 c        o        s        t                 302 243  1                 e        a
          54       68       65       79       20       63       6f       73       74       20       A3       31       20       65       61
0000020   c        h        \n
          63       68       0a
0000023

Notes:

The UTF-8 encoding of the pound symbol takes 2 bytes; they are displayed as
2 octal values. There is no change here to the '-c' output other than
increased spacing (to a width of 9), so that the octal representation of a
multibyte character fits within 9 places.

The pound symbol takes 1 place on the Unicode line, although it is 2 bytes
on the line above.
This gives a mismatch; maybe the 'A3' should have a marker (e.g. '.') following
it to show this, e.g.:

    20       A3       .        31
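
A rough Python sketch of how the code point row with the '.' marker could be
built, one cell per input byte. The function name is hypothetical, the input
is assumed to be valid UTF-8, and this is not how od itself is implemented.

```python
def code_point_cells(data):
    """Return one cell per byte: the hex code point for the first byte of
    each character and '.' for each following byte of that character."""
    cells = []
    for ch in data.decode("utf-8"):  # assumes valid UTF-8 input
        n_bytes = len(ch.encode("utf-8"))
        cells.append(f"{ord(ch):X}")
        cells.extend(["."] * (n_bytes - 1))
    return cells


# Space, pound sign (2 bytes), digit one.
cells = code_point_cells(b" \xc2\xa31")
print("".join(cell.ljust(9) for cell in cells).rstrip())
```

With this layout the code point row always has as many cells as the '-c' row
has bytes, so the two rows stay in step.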

The line below gives the Unicode code points in hex, as is traditional.
I have suppressed leading zeros, but they could be put in, eg:

        0054     0068     0065     0079     0020     0063     006f     0073     0074     0020     00A3     0031     0020     0065     0061

or:

      000054   000068   000065   000079   000020   000063   00006f   000073   000074   000020   0000A3   000031   000020   000065   000061

Depending on how many Unicode digits are wanted, maybe there should be an
option to specify this?


How does the output look when a multibyte character is split between 2 lines
of '-c' output?
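
One possible answer, sketched in Python: carry the incomplete sequence over and
report the code point on the line where the character is completed. This uses
the standard incremental UTF-8 decoder; the function name and the 16-byte row
width are only illustrative assumptions.

```python
import codecs


def decode_rows(data, width=16):
    """Decode data in rows of `width` bytes, carrying any incomplete
    multibyte sequence over to the next row."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        final = offset + width >= len(data)
        rows.append((offset, decoder.decode(chunk, final)))
    return rows


# 15 ASCII bytes, then a pound sign whose 2 bytes straddle offset 16.
data = b"a" * 15 + b"\xc2\xa3"
for offset, text in decode_rows(data):
    print(f"{offset:07o}", [f"{ord(ch):X}" for ch in text])
```

Here the first row yields only the 15 ASCII code points, and A3 appears on the
second row, where the final byte of the character arrives.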


Which Unicode encoding should be used?

* One way would be to look at $LANG, which I set to 'en_GB.utf8' - so use UTF-8.

* This could be overridden with -U or --unicode-encoding options, e.g.:
  --unicode-encoding=iso-8859-1
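
The selection logic in the two points above might look roughly like this in
Python. It is a simplification: only $LANG is consulted (a real tool would
also respect LC_ALL and LC_CTYPE), the ASCII fallback is my assumption, and
the function name is made up.

```python
import codecs


def pick_encoding(environ, override=None):
    """Choose the input encoding: an explicit override wins, otherwise
    use the codeset part of $LANG (language_TERRITORY.codeset@modifier)."""
    if override:
        return codecs.lookup(override).name
    lang = environ.get("LANG", "")
    codeset = lang.partition(".")[2].partition("@")[0]
    if not codeset:
        return "ascii"  # assumption: fall back to ASCII when no codeset is given
    return codecs.lookup(codeset).name


print(pick_encoding({"LANG": "en_GB.utf8"}))
print(pick_encoding({"LANG": "en_GB.utf8"}, override="iso-8859-1"))
```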

It might be nice to add a -C option that simply outputs non-control characters,
i.e. leaves it up to the terminal driver to interpret them.
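
A small Python sketch of that idea: pass everything except control characters
straight through and escape the rest. The escape format and the function name
are my own assumptions, not anything od does today.

```python
import unicodedata


def show_non_control(text):
    """Return text with control characters (Unicode category Cc) escaped
    and every other character left for the terminal to render."""
    return "".join(
        ch if unicodedata.category(ch) != "Cc" else f"\\x{ord(ch):02x}"
        for ch in text
    )


print(show_non_control("They cost \u00a31 each\n"))
```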

This would make my life much easier.

See:

    http://en.wikipedia.org/wiki/Unicode

Discuss.

-- 
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT 
Lecturer.
+44 (0) 787 668 0256  http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: 
http://www.phcomp.co.uk/contact.php
#include <std_disclaimer.h>
