bug-grep

bug#22028: grep -Pc / grep -P | wc -l inconsistent results


From: Norihiro Tanaka
Subject: bug#22028: grep -Pc / grep -P | wc -l inconsistent results
Date: Sat, 28 Nov 2015 15:16:30 +0900

On Fri, 27 Nov 2015 06:29:31 -0500 (EST)
Jaroslav Skarvada <address@hidden> wrote:

> Hi,
> 
> It seems that for long files that start with non-binary data, when the PCRE
> matcher is used, grep works in TEXTBIN_UNKNOWN mode until it finds binary
> data, then switches to TEXTBIN_BINARY. But in -Pc mode, once in
> TEXTBIN_BINARY, it exits on the next match, causing bogus -Pc results.
> 
> Reproducer:
> $ grep -P -c 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt
> 1
> $ grep -P 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt | wc -l
> 2
> 
> The ./filtered.txt is a sufficiently long text file that contains some NULs
> after the first 32kB of text, e.g.
> https://bugzilla.redhat.com/attachment.cgi?id=1080646
> 
> Original downstream bugzilla:
> https://bugzilla.redhat.com/attachment.cgi?id=1080646
> 
> Attached is my attempt to fix it, but it may not be the right way
> to fix it. In particular, the question is whether grep should stop when
> it finds binary data or not. But at least grep -Pc and grep -P | wc -l
> should behave the same.
> 
> thanks & regards
> 
> Jaroslav

I see that filtered.txt is a binary file, since NULs appear at line 647.
However, the first 32768 bytes are correctly encoded.

If the first 32768 bytes of a file are correctly encoded, grep -P marks the
file not as TEXTBIN_TEXT but as TEXTBIN_UNKNOWN, and once it finds the first
match, it marks the file as TEXTBIN_TEXT.  However, grep -P -c does not
perform that last step.

So, as a workaround, you can get the number of matching lines with
grep -a -P -c.
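
The -a workaround can be checked on a small synthetic file. This is only a sketch: the file contents are made up for illustration (more than 32768 bytes of plain text, then a line containing a NUL, with one matching line before the NUL and one after it), and it assumes a grep built with PCRE support. With -a the file is treated as text, so the two counts should agree.

```shell
# Hypothetical sample file, not the reporter's filtered.txt.
sample=$(mktemp)
{
  echo 'Blocked by SpamAssassin'
  # About 42000 bytes of ordinary text, so the first 32768 bytes hold no NUL.
  i=0
  while [ "$i" -lt 1000 ]; do
    echo 'ordinary log line padding padding padding'
    i=$((i + 1))
  done
  # One line containing a NUL byte, placed after the first 32768 bytes.
  printf 'line with a NUL \0 byte\n'
  echo 'Blocked by Spamfilter'
} > "$sample"

# Count matching lines both ways, forcing text mode with -a.
count_c=$(grep -a -P -c 'Blocked by (SpamAssassin|Spamfilter)' "$sample")
count_wc=$(grep -a -P 'Blocked by (SpamAssassin|Spamfilter)' "$sample" | wc -l)

echo "grep -a -P -c:      $count_c"
echo "grep -a -P | wc -l: $count_wc"
```

Dropping -a from both commands shows whether a given grep build has the inconsistency reported here.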

Thanks,
Norihiro

Attachment: 0001-grep-P-grep-Pc-consistent-results.patch
Description: Text document

