bug#23234: unexpected results with charset handling in GNU grep 2.23

From: Eric Blake
Subject: bug#23234: unexpected results with charset handling in GNU grep 2.23
Date: Wed, 6 Apr 2016 17:15:25 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.7.1

On 04/06/2016 05:04 PM, Bjoern Jacke wrote:
> On 07.04.2016 00:33, Eric Blake wrote:
>> That behavior complies with POSIX requirements.
> Can you give a quote here? One thing that is not POSIX compliant is
> that the diagnostic message is written to stdout.
> http://pubs.opengroup.org/onlinepubs/9699919799/ says:
> --snip--
>     Determine the locale that should be used to affect the format and
> contents of diagnostic messages written to standard error.
> --snap--



    The standard input shall be used if no file operands are specified,
and shall be used if a file operand is '-' and the implementation treats
the '-' as meaning standard input. Otherwise, the standard input shall
not be used. See the INPUT FILES section.


    The input files shall be text files.

As soon as you supply grep with non-text-file input, POSIX no longer
applies, and we can do WHATEVER WE WANT.  The violation is not in grep's
behavior, but in yours for passing a binary file.

We have chosen that WHATEVER WE WANT means that by default, we will tell
you (on stdout) that the binary file matches, but if you use the
(non-standard extension) -a option, we will pretend the file is text
anyway.  And it's been documented that way for basically "forever" in
GNU grep.
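
As a rough illustration of the default behavior versus -a, here is a
minimal sketch. The sample input, the sample function, and the
text_matches variable are made up for this example, and the exact wording
of the binary-match notice (and the stream it is written to) varies
between grep versions.

```shell
# Hypothetical sample input: an embedded NUL byte makes grep treat it as binary.
sample() { printf 'foo\0bar\nbaz\n'; }

# Default: grep only reports that the binary input matches; it does not
# print the matching line itself.
sample | grep foo

# With -a (a non-standard extension), the input is treated as text.
# -c counts the matching lines instead of printing them.
text_matches=$(sample | grep -a -c foo)
echo "lines matched with -a: $text_matches"
```

Counting with -c keeps the NUL byte out of the shell variable, which
command substitution would otherwise drop or warn about.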

What's changed recently is what we've done under the hood (more
efficient recognition of binary files, treating '\0' and '\n'
identically as line terminators when -a is not in effect because of the
speed improvements it lets us gain, and attempts with heuristics to
avoid spamming terminals or downstream clients with encoding errors when
-a is not in effect).  But all of those still fall under the broad
category of WHATEVER WE WANT as it falls outside the POSIX standard.

And yes, maybe we could change grep to print the "Binary file matches"
message to stderr, but that in turn will probably break other scripts,
and lead to even more complaints from people doing non-standard things
and expecting consistent results.  That said, patches are still welcome,
if you think you have better heuristics than what we currently have, and
as long as it still falls within the realm of WHATEVER WE WANT.

> if you consider grepping text files with mixed encodings as invalid use
> of grep, then you should not return 0 and/or output the "Binary file
> (standard input) matches" on stdout. This makes the output of GNU grep
> look like a valid match.

Maybe changing the exit status when a binary file is encountered is
worth doing - but not returning status 0 when a match is detected is
more likely to do harm than good.
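
For reference, the current exit statuses can be checked with a short
sketch like the one below. The input and the match_status and
nomatch_status variable names are assumptions made for this example.

```shell
# A match in binary input yields exit status 0, just as with text input.
match_status=0
printf 'foo\0bar\n' | grep foo >/dev/null 2>&1 || match_status=$?

# No match yields exit status 1, whether or not the input is binary.
nomatch_status=0
printf 'foo\0bar\n' | grep xyz >/dev/null 2>&1 || nomatch_status=$?

echo "match: $match_status, no match: $nomatch_status"
```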

> You say "grep -a" is the friend of all the users who want to grep log
> files (because they tend to contain mixed encodings). Sure, -a is a
> workaround to make GNU grep work as before again. Realistically, 99.99%
> of the users will not know that, though, because I guess this is the
> first grep version ever that requires this. Also, -a is a GNU-only
> option, so portable scripts will not be able to use it.

Portable scripts are not able to grep binary files, period.  As long as
you don't mind non-portable extensions, 'grep -a' is what you want.

> I guess you are aware that you will break a lot of existing scripts
> with that change of treating mixed-encoding input files as binary,
> the way GNU grep >= 2.23 does now?

Yes, we are aware that lots of users are getting an education on the
subtleties of POSIX.  But that doesn't mean it is a bug.

Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org