bug#20526: BUG: text file is detected as binary

From: Ángel González
Subject: bug#20526: BUG: text file is detected as binary
Date: Thu, 21 May 2015 02:27:43 +0200

Paul Eggert wrote:
> Perhaps we can improve the behavior of grep by changing its heuristic 
> slightly. 
>   Currently grep reports "Binary file FOO matches" if it finds binary 
> data in FOO before it finds the first match.  Instead, perhaps we 
> could change grep to report "Binary file FOO matches" when it sees 
> that it's about to generate binary *output* copied from FOO, 
> regardless of whether this output represents the first match.  That 
> is, when grep sees that it's about to output binary 
> data, grep instead outputs "Binary file FOO matches" and then stops 
> output for FOO (even if it already output some lines for ordinary 
> matches in FOO).

Another option would be to escape the problematic binary data (but then
how do we escape the escape character itself?), or maybe even to replace
it with U+FFFD if our output is UTF-8 (though this has its own problems
when trying to determine what was really matched).

> This approach would fix the problem of grep trashing the output 
> stream, and it should be less drastic than grep's current approach, 
> in that it would make grep more likely to do what Kamil Dudka is 
> asking for (assuming grep is given mostly valid input interspersed 
> with small amounts of binary data).


When grep is the last component of a pipeline, it isn't too bad. The
danger comes from grep being in the middle of a pipeline instead.
Sebastian's Makefile is one such case. Another silly example: we
might have a list of people and be interested in knowing how many of
their names begin with J (but excluding pseudonyms):

 printf 'John Smith\nJohannes Meixner\nPaul Eggert\nJohn Doe\n' > 
 grep ^J defendants-2015-05-* | sort -u | grep -vc "John Doe"

works perfectly, until the day someone provides an incorrectly encoded entry:
 printf 'Pedro P\xe9rez\n' >> defendants-2015-05-15
and havoc ensues.
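
For comparison, here is a minimal sketch of the same count with grep told
to treat its input as text via -a (--binary-files=text). This is not
something proposed in this thread, only an illustration of how a caller
might protect such a pipeline today. The scratch directory and the file
name defendants-example are hypothetical, and the stray byte is written
in octal (\351 is the same byte as \xe9) so that the printf also works in
shells whose printf lacks \x escapes. Without -a, what the later stages
receive depends on the grep version and the locale.

```shell
# Hypothetical scratch directory and file name, for illustration only.
dir=$(mktemp -d)
file="$dir/defendants-example"

# Four valid entries, then one entry with a lone Latin-1 byte (0xE9).
printf 'John Smith\nJohannes Meixner\nPaul Eggert\nJohn Doe\n' > "$file"
printf 'Pedro P\351rez\n' >> "$file"

# -a makes grep treat the file as text, so the matching lines themselves
# (never a "Binary file ... matches" notice) reach sort and the final grep.
count=$(grep -a '^J' "$file" | sort -u | grep -vc 'John Doe')
echo "$count"

rm -rf "$dir"
```

The trade-off is the one discussed above: with -a, any binary bytes on a
matching line are copied to the output unchanged.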

It's something that should never happen, but someone else prepared the
file for you, or it comes from a third party (and sometimes it only
makes sense for them to be ANSI, yet one day there are unencoded high
