bug-sed

bug#42857: sed: handling utf8 non-breaking space 0xA0


From: Assaf Gordon
Subject: bug#42857: sed: handling utf8 non-breaking space 0xA0
Date: Fri, 14 Aug 2020 20:46:08 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0

tags 42857 notabug
close 42857
stop

Hello,

Thank you for sending a detailed bug report; it makes troubleshooting much easier.

On 2020-08-13 8:22 p.m., Dennis Nezic wrote:
> I'm not sure if this is a bug. It has to do with the weird utf8(?)
> character with hex code 0xa0.

There's an important issue here:

The Unicode character "NO-BREAK SPACE" has the code-point value 0xA0
(usually written as "U+00A0").

However, "UTF-8" is an encoding of Unicode (just as UTF-16 is a different
encoding of Unicode). It represents each code point as a sequence of one
to four bytes: ASCII characters stay as single bytes, and every other
code point becomes a multi-byte sequence of non-ASCII byte values.

In "UTF-8" the Unicode character "NO-BREAK SPACE" (U+00A0) is encoded as
two bytes: 0xC2 0xA0.

See more details here: https://codepoints.net/U+00A0

See some Q&A regarding unicode-vs-utf8 here: https://stackoverflow.com/q/643694
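
To make the code-point-versus-encoding distinction concrete, here is a
minimal sketch of the two-byte UTF-8 bit layout using shell arithmetic.
The function name utf8_two_byte is made up for this illustration, and it
only handles code points in the two-byte range (U+0080 to U+07FF):

```shell
# Hypothetical helper: encode a code point in U+0080..U+07FF as the
# two bytes UTF-8 uses for that range, printed in hex.
utf8_two_byte() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ] || [ "$cp" -gt 2047 ]; then
        echo "code point outside the two-byte range" >&2
        return 1
    fi
    b1=$(( 0xC0 | (cp >> 6) ))     # lead byte:         110xxxxx
    b2=$(( 0x80 | (cp & 0x3F) ))   # continuation byte: 10xxxxxx
    printf '%02X %02X\n' "$b1" "$b2"
}

utf8_two_byte 0xA0    # prints: C2 A0
```

The second byte happens to be 0xA0 again, which is probably why the
code point and the encoding are so easy to confuse for this character.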

The byte "0xA0" by itself is not a valid UTF-8 sequence (it is a
continuation byte with no lead byte before it).
This means that if your current locale is UTF-8,
and you have a string with 0xA0 in it by itself, it is considered an
invalid string (or at least not a valid text string, but valid binary
data).
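
One way to check whether some input is well-formed UTF-8, independent of
the current locale, is to round-trip it through iconv and look at the
exit status. This is only a sketch: is_valid_utf8 is a name invented
here, and it assumes an iconv that accepts "UTF-8" as an encoding name
and exits non-zero on invalid input:

```shell
# Hypothetical helper: succeeds only if stdin is well-formed UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Octal escapes are used because they are the portable printf form.
printf '\302\240' | is_valid_utf8 && echo "C2 A0: valid UTF-8"
printf '\240'     | is_valid_utf8 || echo "A0 alone: not valid UTF-8"
```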

Many programs (GNU sed included) will not match invalid UTF-8 bytes
with their regular expressions when running in a UTF-8 locale.

That is, the following simple regex of "." (any character) will NEVER match invalid UTF-8 characters:

   printf "\xA0" | LC_ALL=en_CA.utf8 sed 's/./x/'

It will be matched if you force a C/POSIX-locale, in which every single byte is valid:

   printf "\xA0" | LC_ALL=C sed 's/./x/'

[...]
> But it can't do a proper substitution/regex with it, for example:
>
>    echo $'hello\nte\xA0st\nworld' | sed 2s,^t.*,x,
>
> it seems to interpret 0xa0 as the end of the line.

With the above explanation (i.e. "0xA0" on its own is not a valid
character in a UTF-8 locale), it becomes clear why the 'sed' command
isn't working as you expected: it's not that "0xA0" is an "end of line",
it is that "^t.*" only matches "te". The invalid byte "0xA0"
causes the regex engine to stop matching there.

If you want to treat any byte value as a valid character, you can force
C/POSIX locale:

  $ echo $'hello\nte\xA0st\nworld' | LC_ALL=C sed 2s,^t.*,x,
  hello
  x
  world

But of course then you'd lose the ability to handle multibyte UTF-8
characters as a single character.
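
To illustrate that trade-off, here is a small sketch that feeds a
properly encoded NO-BREAK SPACE (bytes C2 A0, written as octal escapes)
through sed in the C locale and replaces every "character" with "x":

```shell
# "te", a valid UTF-8 NO-BREAK SPACE (2 bytes), then "st": 6 bytes total.
# In the C locale "." matches one byte at a time, so the single
# NO-BREAK SPACE character is replaced by two x's.
printf 'te\302\240st\n' | LC_ALL=C sed 's/./x/g'
# prints: xxxxxx
```

In a UTF-8 locale the same command should print five x's instead,
because the two bytes are then treated as one character.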

---

If you want to discard invalid byte values but keep valid UTF-8
characters, the "iconv(1)" program can help to some extent:

  $ echo $'he\xE2\x98\xBAllo\nte\xA0st\nworld' \
              | iconv -f utf8 -t utf8//IGNORE
  he☺llo
  test
  world
  iconv: illegal input sequence at position 21

In the above example the bytes "E2 98 BA" are the valid UTF-8 encoding
of the Unicode code point "U+263A WHITE SMILING FACE"
https://codepoints.net/U+263A
They are kept in the output stream, while the invalid "0xA0" is
discarded.

---

You are using the "echo" command with $'' to explicitly add
hex values into a string. Note that bash's $'' quoting understands
Unicode code points directly (not just raw bytes), so using something
like this:

   $ echo $'te\u00A0st'
   te st

allows you to specify a Unicode code point (e.g. U+00A0) instead of its
UTF-8 encoding, and bash will generate the character in the current
locale's encoding:

  $ echo $'te\u00A0st' | od -tx1c -An
    74  65  c2  a0  73  74  0a
     t   e 302 240   s   t  \n

See: https://www.gnu.org/software/bash/manual/html_node/ANSI_002dC-Quoting.html#ANSI_002dC-Quoting

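And lastly, echo with $'' is not portable (although it is very convenient
when using bash interactively). Using "printf" instead works similarly
and is more portable (though the \u and \x escapes are themselves
extensions supported by bash and GNU printf; POSIX printf only
guarantees octal escapes such as \302\240):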

  $ printf 'te\u00A0st\n'
  te st
  $ printf 'te\xC2\xA0st\n'
  te st
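
For scripts that may run under a shell other than bash, a sketch of the
most portable spelling uses octal escapes (302 240 is C2 A0 in octal),
with "od" confirming which bytes were actually produced:

```shell
# Emit "te", NO-BREAK SPACE (UTF-8 bytes C2 A0), "st", newline,
# using only octal escapes, then dump the bytes in hex.
printf 'te\302\240st\n' | od -An -tx1
```

The dump should show the byte sequence 74 65 c2 a0 73 74 0a, matching
the od output shown earlier for $'te\u00A0st'.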


I hope this helps.
I'm closing this as "not a bug", but discussion can continue
by replying to this thread.

regards,
 - assaf







