bug#18266: handling bytes not part of the charset, and other garbage (was: grep -P and invalid exits with error)

bug-grep

From:	Vincent Lefevre
Subject:	bug#18266: handling bytes not part of the charset, and other garbage (was: grep -P and invalid exits with error)
Date:	Thu, 11 Sep 2014 13:07:00 +0200
User-agent:	Mutt/1.5.23-6361-vl-r59709 (2014-07-25)

On 2014-09-01 01:31:53 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >If there are many invalid UTF8 bytes, this would be slow, IMHO
> 
> That's OK.  We don't need grep -P to be fast on invalid input.

I can see an excessive slowdown in practical cases.

> >But is the copy of the buffer really needed? Couldn't the invalid
> >UTF8 sequences just be replaced by null bytes?
> 
> I'd rather not, because that changes the semantics of matching.  The null
> byte is valid input data that might get matched.

It appears that the current behavior in UTF-8 is incorrect, even
without -P. For instance:

$ printf 'tr\xe8s\n' > text
$ grep 'tr.s' text
$ LC_ALL=C grep 'tr.s' text
tr<E8>s

There's no reason for '.' to match a byte that doesn't belong to the
charset in the C locale but fail to match it in a UTF-8 locale.

The pattern tr.s is used here to match the French word "très" in files
that could be encoded in either ISO-8859-1 or UTF-8. In the past,
before using UTF-8 locales, I was doing something like:

  grep -E 'tr..?s' text

to match both encodings, and this worked. (I could get false positives,
but in practice one is often not interested in every line that grep
matches anyway, so false positives occurred even when the encoding was
known.) It's annoying that now, in a UTF-8 locale, one can no longer
match ISO-8859-1 text, and doing a pre-conversion would take too much
time.
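
A minimal sketch of the 'tr..?s' approach described above, assuming a
grep (such as a recent GNU grep) that treats every byte as one character
in the C locale. The scratch directory and the file names latin1.txt and
utf8.txt are hypothetical, and octal escapes are used instead of \x so
that printf behaves the same in any POSIX shell:

```shell
# Hypothetical scratch files holding the same word in two encodings.
dir=$(mktemp -d)
printf 'tr\350s\n'     > "$dir/latin1.txt"  # "très" in ISO-8859-1 (one byte for è)
printf 'tr\303\250s\n' > "$dir/utf8.txt"    # "très" in UTF-8 (two bytes for è)

# Assumption: in the C locale '.' matches any single byte, so '..?'
# covers both the one-byte and the two-byte encoding of è.
counts=$(cd "$dir" && LC_ALL=C grep -cE 'tr..?s' latin1.txt utf8.txt)
printf '%s\n' "$counts"
rm -rf "$dir"
```

With -c, grep prints a per-file count of matching lines rather than the
lines themselves, which keeps the raw non-ASCII bytes out of the output.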

Concerning binary files, I've never wanted to distinguish explicitly
between null bytes and invalid UTF-8 sequences: IMHO, both are just
garbage. This obviously makes no difference for patterns like
'some_word' or 'foo[0-9]*bar', but when I use a pattern like 'foo.bar'
or 'foo.*bar', I can see two valid reasons to handle these sequences
in a similar way with '.':

1. One may want to match "valid" (often in the sense "printable", in
the specified encoding) but unknown characters.

2. One may also want to match garbage (including null bytes, and also
bytes that do not have any meaning in the charset), with the drawback
that if the garbage contains a newline character, this won't work.
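
The newline drawback in point 2 can be sketched as follows. This is only
an illustration, again assuming a grep that treats every byte as a
character in the C locale; the bytes \377 and \376 stand in for
arbitrary garbage that has no meaning in UTF-8:

```shell
# Garbage bytes between foo and bar on one line: '.*' can span them.
same_line=$(printf 'foo\377\376bar\n' | LC_ALL=C grep -c 'foo.*bar' || true)

# A newline inside the garbage splits the input into two lines,
# so neither line contains both foo and bar.
split_line=$(printf 'foo\377\nbar\n' | LC_ALL=C grep -c 'foo.*bar' || true)

printf 'same line: %s, split by newline: %s\n' "$same_line" "$split_line"
```

The "|| true" is there only because grep exits with status 1 when no
line matches, which would otherwise abort a script run under "set -e".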

-- 
Vincent Lefèvre <address@hidden> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
