bug-grep

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales


From: Paul Eggert
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Thu, 11 Sep 2014 19:53:23 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

Vincent Lefevre wrote:
> Things could be done in grep:
>
> 1. Ignore -P when the pattern would have the same meaning without -P
>     (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
>     at least for the simplest cases).
>
> 2. Call PCRE in the C locale when this is equivalent.

I had already considered these ideas along with several others, but they would require grep to parse and analyze the Perl regular expression. I don't know the PCRE syntax and it would take some time to write a parser. And even if I wrote one, the next PCRE release would likely change the syntax. It sounds very painful to maintain.
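To illustrate why a real parser would be needed, here is a minimal Python sketch of the textual transformation quoted above. It is not grep code: `naive_rewrite` is a hypothetical helper, and it assumes the target syntax is a GNU basic regular expression in which `\+` means "one or more".

```python
def naive_rewrite(pcre_pattern: str) -> str:
    """Hypothetical helper: rewrite a PCRE pattern as a GNU BRE by plain
    text substitution, with no parsing of the pattern at all."""
    bre = pcre_pattern.replace(r"\d", "[0-9]")  # \d -> [0-9]
    bre = bre.replace("+", r"\+")               # PCRE + -> GNU BRE \+
    return bre


# The simple case from the quoted message comes out as intended.
simple = naive_rewrite(r"a\d+b")

# Here the PCRE pattern means: 'a', a literal backslash, one or more 'd', 'b'.
# The substitution still fires on the "\d" that follows the escaped backslash,
# so the rewritten pattern no longer has the same meaning.
broken = naive_rewrite(r"a\\d+b")

print(simple)
print(broken)
```

Getting the second case right requires tracking escapes, bracket expressions, and every other construct in the PCRE syntax, which is the parser in question.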

> 3. Transform invalid bytes to null bytes in-place before the PCRE
>     call. This changes the current semantic, but:
>     * the semantic on invalid bytes has never been specified, AFAIK;
>     * the best *practical* behavior may not be the current one

As we've already discussed, this would be incompatible with how invalid bytes are treated by other matchers. It would also have undesirable practical effects: for example, the pattern 'a..*b' would match data that looks like "ab" on many screens (because the null byte would vanish). It's a real kludge that will bite users.
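The 'a..*b' effect can be reproduced with a small Python sketch. It is only an approximation: `nul_out_invalid_utf8` is a hypothetical helper, it uses Python's own strict UTF-8 decoder to decide which bytes are invalid (libpcre may decide differently), and Python's `re` module stands in for the matcher.

```python
import re


def nul_out_invalid_utf8(data: bytes) -> bytes:
    """Hypothetical helper: replace every byte that Python's strict UTF-8
    decoder rejects with a NUL byte. This is not libpcre's validity check."""
    # surrogateescape maps each undecodable byte to a lone surrogate in
    # U+DC80..U+DCFF, so invalid bytes can be located one at a time.
    text = data.decode("utf-8", errors="surrogateescape")
    text = re.sub("[\udc80-\udcff]", "\0", text)
    return text.encode("utf-8")


line = b"a\xffb"                      # 0xFF can never appear in valid UTF-8
cleaned = nul_out_invalid_utf8(line)  # the invalid byte is now NUL

# After the substitution, '.' is free to match the NUL, so the pattern
# matches a line that contained no valid character between 'a' and 'b'.
match = re.search("a..*b", cleaned.decode("utf-8"))

# Rough stand-in for a screen that does not render NUL at all.
on_screen = cleaned.decode("utf-8").replace("\0", "")

print(match is not None, repr(on_screen))
```

The reported match then appears to the user as the two characters "ab", which the pattern should not be able to match.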

Even if we went along with the kludge, grep does not know which bytes PCRE considers invalid without invoking PCRE, which is what it's doing now. (Yes, PCRE says it's parsing UTF-8, but there are different ways to do that and they don't all agree.) I suppose grep could reverse-engineer libpcre's internals to exactly duplicate the algorithm that libpcre uses to decide when bytes are invalid (except to do it 10X faster :-), but then that'd be another thing to maintain in parallel with libpcre.

All of these changes sound like a lot of work, which nobody is willing to do.

Here's a different idea. How about invoking grep with the --binary-files=without-match option? This should avoid much of the libpcre performance problem, without having to change 'grep'.




