bug#20678: new bug that Paul "asked" for... grep -P aborts on non-utf8 i

From: L. A. Walsh
Subject: bug#20678: new bug that Paul "asked" for... grep -P aborts on non-utf8 input.
Date: Wed, 27 May 2015 14:41:12 -0700
User-agent: Thunderbird

(Skip to the end if you don't care to read how I found this.)

Paul Eggert wrote:
> Linda Walsh wrote:
>
>> I had one file that it bailed on
>> saying it has an invalid UTF-8 encoding -- but the line was
>> recursive starting from '.' -- and it didn't name the file
>
> That's pretty vague. Can you reproduce that problem? I don't observe it:

I'm not quite *sure* how to tell someone else to reproduce this, but
I can reproduce it pretty reliably now. Some output from a checker:
*** file = libvtkUtilitiesPythonInitializer-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
*** file = libvtkPVClientServerCoreCore-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
*** file = libsystemd.so.0
grep: invalid UTF-8 byte sequence in input
*** file = libvtkParallelCore-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input

Now before you think I'm too daft, the code that produces those
messages is in perl and is:

for my $k (@sorted_missing) {
   P "*** file = %s", $k;
   open(my $gh, "grep -rP  '/$k'  /home/rpms/13.2|");
   while (<$gh>) {
   P "-----";

Those are files that came up "missing" as pre-reqs.
In /home/rpms/...., I have the *file listings* of each of
the rpms, laid out in the same structure as in the distro, so
each listing is a file under that dir, /home/rpms/13.2.. This is why I had
a problem finding it:
Ishtar:rpms/13.2/repo/oss/suse> file -bi x86_64/*>/tmp/x86files.txt
Ishtar:rpms/13.2/repo/oss/suse> sort </tmp/x86files.txt |uniq -c
     2 text/plain; charset=iso-8859-1
 13269 text/plain; charset=us-ascii
     2 text/plain; charset=utf-8
--- I'd say it's likely 1-2 files out of 13274 files that could
have the problem.  Yeah, I run into a lot of needles in haystacks,
but trying to find the needle... Just generating the file of types:
time file -i x86_64/*>/tmp/fullx86files.txt
27.71sec 27.07usr 0.63sys (99.99% cpu)

Then grep helps!

Ishtar:rpms/13.2/repo/oss/suse> grep iso-88 /tmp/fullx86files.txt
x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm:text/plain; charset=iso-8859-1
x86_64/aspell-nb-0.50.10-46.1.2.x86_64.rpm:text/plain; charset=iso-8859-1
Ishtar:rpms/13.2/repo/oss/suse> more x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm
/usr/lib64/aspell-0.60/icelandic.alias
/usr/lib64/aspell-0.60/355slenska.alias <<-- the 355 was in inverse color
Same w/the other file (it had this 1 'violation'):

/usr/lib64/aspell-0.60/bokm345l.alias <-3
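
A cheaper way to narrow the haystack than running file on every listing might be to let grep look for bytes outside ASCII. This is only a sketch: the function name list_non_ascii is made up for illustration, and it flags UTF-8 files as well as ISO-8859-1 ones, so it yields candidates rather than a verdict.

```shell
# Hypothetical helper (not from the original mail): list files containing
# any byte that is neither printable ASCII nor whitespace.
# LC_ALL=C makes grep work byte-by-byte, so no UTF-8 decoding is involved.
list_non_ascii() {
    LC_ALL=C grep -l '[^[:print:][:space:]]' -- "$@"
}
```

Something like `list_non_ascii x86_64/*` would then print only the listings worth a closer look.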

So those are 'octal' code points (using a little calc prog):
pcalc V0.1.8: Type 'constants' to see constants
(1)> 0355
= 237 (0x00ed) "í"
(2)> 0345
  = 229  (0x00e5)  "å"
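
The same conversion can be checked without a calculator. A small sketch, where octal_info is a hypothetical helper name: printf reads a numeric argument with a leading 0 as octal. Both results, 0xed and 0xe5, fall in the range 0xe0-0xef, which in UTF-8 starts a 3-byte sequence and must be followed by two continuation bytes (0x80-0xbf). In these file names the next byte is a plain ASCII letter, so the bytes are ISO-8859-1 characters and not valid UTF-8.

```shell
# Hypothetical helper: print an octal byte value as decimal and hex.
# printf treats a numeric argument with a leading 0 as octal.
octal_info() {
    printf '%d 0x%02x\n' "$1" "$1"
}

octal_info 0355   # prints: 237 0xed
octal_info 0345   # prints: 229 0xe5
```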
So the 1st part of the bug is the message w/no filename.

The 2nd part of the bug is this: looking for '^nobody' in
"/etc/passwd" works, as shown in the 1st example:

 grep -P '^nobody' /etc/passwd
nobody:x:65534:65533:(group Nobody):/var/lib/nobody:/bin/nologin

But the 'error' aborts any further file searches, so /etc/passwd is never searched:
grep -P '^nobody' x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm /etc/passwd
grep: invalid UTF-8 byte sequence in input
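
Until grep itself names the file and keeps going, one possible workaround is to run it on one file at a time. This is a sketch only: grep_each is a hypothetical wrapper, not a grep feature, and it assumes a grep built with -P support. It relies on grep's documented exit status (0 = match, 1 = no match, 2 = error).

```shell
# Hypothetical wrapper (not part of grep): run grep -P on one file at a time,
# so an error is reported next to the file that caused it and the remaining
# files are still searched.
grep_each() {
    pattern=$1
    shift
    rc=1                                  # 1 = no match anywhere, like grep
    for f in "$@"; do
        st=0
        grep -P -H -e "$pattern" -- "$f" || st=$?
        if [ "$st" -gt 1 ]; then          # status 2 or more means an error
            printf 'grep failed (status %s) on: %s\n' "$st" "$f" >&2
            rc=2
        elif [ "$st" -eq 0 ] && [ "$rc" -ne 2 ]; then
            rc=0                          # at least one file matched
        fi
    done
    return "$rc"
}
```

With this, `grep_each '^nobody' x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm /etc/passwd` would name the listing if grep fails on it and still print the match from /etc/passwd. It starts one grep per file, so it is slower than a single recursive run.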


This is why I objected to a file containing '\000' being treated as a
binary file (and why I think it's bad grep can't look for that):
if one works with Windows, such a file is far more likely
just to be text in UTF-16 encoding.

