bug#36094: Possible sed bug

From: Assaf Gordon
Subject: bug#36094: Possible sed bug
Date: Wed, 5 Jun 2019 08:17:32 -0600
User-agent: Mutt/1.11.4 (2019-03-13)

tag 36094 notabug
close 36094


On Wed, Jun 05, 2019 at 10:38:53AM +1000, Roel Van de Paar wrote:
> $ cat test
> a�-�-
> $ sed -i "s|.*|allgone|gi" test && cat test
> allgone�allgone�allgone
> Expected output in both cases would seem to be "allgone" on the line and
> nothing else?

This is not a bug, but a side-effect of having invalid UTF8 characters in
the input file, while working with a UTF8 locale.

POSIX requires that the '.*' regular expression does not match invalid
multibyte sequences. The 'test' input file contains two bytes with value
255 (\xFF); these are invalid under a UTF8 locale, and the regex match
stops at each of them. The remaining characters in the file are therefore
matched as three separate runs, each of which is replaced (due to the "g"
flag).
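
To make the byte layout concrete, the input can be recreated with printf and dumped with od. This is only a sketch: it assumes the file holds 'a', 0xFF, '-', 0xFF, '-' and a newline, which is inferred from how the report renders the file rather than from the file itself, and the name test_bytes is made up for illustration.

```shell
# Recreate the (assumed) input: 'a', 0xFF, '-', 0xFF, '-', newline.
# Octal escapes (\377 == 0xFF) are used because they are portable in printf.
printf 'a\377-\377-\n' > test_bytes

# Dump the bytes in hex; the two ff bytes are the ones invalid in UTF-8.
dump=$(od -An -tx1 test_bytes)
echo "$dump"
```

The two ff bytes split the line into three valid runs, which is where the three replacements come from.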

The simplest solution when working with such files is to force the C locale,
where all bytes are considered valid (but then you lose UTF8
capabilities). Compare:

    $ LC_ALL=en_CA.utf8 sed "s|.*|allgone|g" test | od -An -c
       a   l   l   g   o   n   e 377   a   l   l   g   o   n   e 377
       a   l   l   g   o   n   e  \n

    $ LC_ALL=C sed "s|.*|allgone|g" test | od -An -c
       a   l   l   g   o   n   e  \n
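
In a script, one way to apply this is to override the locale for the sed invocation only, leaving the rest of the script in the user's locale. A minimal sketch, assuming the same byte layout as the test file above (recreated here with printf); sed_bytes is a hypothetical helper name, not something sed provides.

```shell
# Hypothetical helper: run sed under the C locale for this one command only,
# so every byte counts as a valid single-byte character.
sed_bytes() {
    LC_ALL=C sed "$@"
}

# Input with two 0xFF bytes (octal \377), which are invalid in UTF-8.
result=$(printf 'a\377-\377-\n' | sed_bytes 's|.*|allgone|g')
echo "$result"
```

Setting LC_ALL as a prefix to the command keeps the override from leaking into later commands.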

But then multi-byte UTF8 characters are processed as individual bytes:

    $ printf "\U1011\n" | LC_ALL=en_CA.utf8 sed 's/./A/g'
    A

    $ printf "\U1011\n" | LC_ALL=C sed 's/./A/g'
    AAA
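
The difference comes down to how many units '.' sees. The sketch below avoids relying on the shell's \U support by writing U+1011 as its three UTF-8 bytes (E1 80 91) in octal. Only the C locale result is computed, since that does not depend on which UTF-8 locales happen to be installed; the variable names are illustrative.

```shell
# U+1011 encoded in UTF-8 is the three bytes E1 80 91 (octal 341 200 221).
char_bytes='\341\200\221'

# Count the bytes in the encoded character.
nbytes=$(printf "$char_bytes" | wc -c | tr -d ' ')

# In the C locale each byte is its own character, so '.' matches three times.
c_result=$(printf "$char_bytes\n" | LC_ALL=C sed 's/./A/g')

echo "bytes: $nbytes, C locale result: $c_result"
```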

As a side note, this is the reason GNU sed has the non-standard 'z' command
to clear the pattern space: the more intuitive 's/.*//' command will fail
to clear a pattern space containing invalid characters.

    $ printf "FOO \xFF\n" | LC_ALL=en_CA.utf8 sed 's/.*//g' | od -An -tx1
     ff 0a
    $ printf "FOO \xFF\n" | LC_ALL=en_CA.utf8 sed 'z' | od -An -tx1
     0a
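
The same comparison can be scripted without hard-coding a locale name. This sketch assumes GNU sed (for 'z') and picks the first UTF-8 locale listed by `locale -a`, falling back to C if none is found; in that fallback case both commands leave only the newline. The variable names are illustrative.

```shell
# Pick any installed UTF-8 locale; fall back to C if none is listed.
utf8_locale=$(locale -a 2>/dev/null | grep -i -m 1 'utf-*8')
: "${utf8_locale:=C}"

# 's/.*//' can leave an invalid byte behind in a UTF-8 locale ...
s_out=$(printf 'FOO \377\n' | LC_ALL=$utf8_locale sed 's/.*//g' | od -An -tx1)

# ... while GNU sed's 'z' empties the pattern space regardless of content.
z_out=$(printf 'FOO \377\n' | LC_ALL=$utf8_locale sed 'z' | od -An -tx1)

echo "s/.*//:$s_out"
echo "z:$z_out"
```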

I'm closing this as "not a bug", but discussion can continue by replying
to this thread.

 - assaf
