bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

bug-grep

From:	Paul Eggert
Subject:	bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date:	Thu, 11 Sep 2014 19:53:23 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1

Vincent Lefevre wrote:

> Things could be done in grep:
>
> 1. Ignore -P when the pattern would have the same meaning without -P
>     (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
>     at least for the simplest cases).
>
> 2. Call PCRE in the C locale when this is equivalent.

I had already considered these ideas along with several others, but they would require grep to parse and analyze the Perl regular expression. I don't know the PCRE syntax and it would take some time to write a parser. And even if I wrote one, the next PCRE release would likely change the syntax. It sounds very painful to maintain.
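To make the maintenance concern concrete, here is a minimal sketch of idea 1 for only the simplest cases. It is not grep code: the helper name `simple_pcre_to_bre` and the set of accepted tokens are assumptions chosen for illustration, and the sketch takes the quoted equivalence of `\d` and `[0-9]` as given. Anything it does not recognise is left to PCRE.

```python
import re

# Hypothetical helper, not part of grep. It accepts only ASCII letters,
# digits, spaces and \d, each optionally followed by a single '+'.
_SIMPLE_TOKEN = re.compile(r"(?:[A-Za-z0-9 ]|\\d)\+?")

def simple_pcre_to_bre(pattern):
    """Return a GNU BRE equivalent of pattern, or None if it is not trivial."""
    out = []
    pos = 0
    while pos < len(pattern):
        m = _SIMPLE_TOKEN.match(pattern, pos)
        if m is None:
            return None  # unrecognised syntax: keep using PCRE
        token = m.group(0)
        repeated = token.endswith("+")
        atom = token[:-1] if repeated else token
        out.append("[0-9]" if atom == r"\d" else atom)
        if repeated:
            out.append(r"\+")  # GNU BRE spells one-or-more as \+
        pos = m.end()
    return "".join(out)

print(simple_pcre_to_bre(r"a\d+b"))
print(simple_pcre_to_bre(r"(?<=a)b"))
```

Even this toy version has to refuse possessive and lazy quantifiers, groups, lookarounds and every other escape, so covering more of the syntax means growing a real parser that tracks PCRE releases.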

> 3. Transform invalid bytes to null bytes in-place before the PCRE
>     call. This changes the current semantic, but:
>     * the semantic on invalid bytes has never been specified, AFAIK;
>     * the best *practical* behavior may not be the current one

As we've already discussed, this would be incompatible with how invalid bytes are treated by other matchers. And would have undesirable practical effects, e.g., the pattern 'a..*b' would match data that would look like "ab" on many screens (because the null byte would vanish). It's a real kludge that will bite users.
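The 'a..*b' effect can be reproduced outside grep. The following is a rough sketch that uses Python's `re` module on bytes as a stand-in matcher and Python's own UTF-8 decoder to decide which bytes are invalid; both are assumptions, since PCRE may classify bytes differently, and `nul_out_invalid_utf8` is a hypothetical helper.

```python
import re

def nul_out_invalid_utf8(data: bytes) -> bytes:
    """Replace each byte rejected by Python's UTF-8 decoder with a NUL byte."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            out += data[pos:]          # the rest is valid, copy it unchanged
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into the slice data[pos:]
            out += data[pos:pos + err.start]
            out += b"\x00" * (err.end - err.start)
            pos += err.end
    return bytes(out)

line = b"a\xffb"                       # 'a', one invalid byte, 'b'
munged = nul_out_invalid_utf8(line)    # the invalid byte becomes NUL
print(re.search(rb"a..*b", munged) is not None)
print(re.search(rb"a..*b", b"ab") is not None)
```

The munged line matches even though a terminal that drops NUL would show it as "ab", while a genuine "ab" does not match, which is the surprise described above.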

Even if we went along with the kludge, grep does not know what bytes PCRE considers to be invalid without invoking PCRE, which is what it's doing now. (Yes, PCRE says it's parsing UTF-8, but there are different ways to do that and they don't all agree.) I suppose grep could reengineer libpcre's internals, to exactly duplicate the algorithm that libpcre uses to decide when bytes are invalid (except to do it 10X faster :-), but then that'd be another thing to maintain in parallel with libpcre.
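As an illustration of how two UTF-8 checkers can disagree, the sketch below compares Python's strict decoder with the same decoder using the `surrogatepass` error handler. Neither is claimed to match libpcre's rules; the function names are hypothetical and the point is only that "valid UTF-8" depends on who is checking.

```python
def is_valid_strict(data: bytes) -> bool:
    """Validity according to Python's default (strict) UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def is_valid_lenient(data: bytes) -> bool:
    """Validity when encoded surrogate code points are tolerated."""
    try:
        data.decode("utf-8", errors="surrogatepass")
        return True
    except UnicodeDecodeError:
        return False

sample = b"\xed\xa0\x80"   # three-byte encoding of the surrogate U+D800
print(is_valid_strict(sample), is_valid_lenient(sample))
```

A pre-pass in grep that guessed wrong on inputs like this would either hand PCRE bytes it rejects or rewrite bytes PCRE would have accepted.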

All of these changes sound like a lot of work, which nobody is willing to do.

Here's a different idea. How about invoking grep with the --binary-files=without-match option? This should avoid much of the libpcre performance problem, without having to change 'grep'.
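For scripts that drive grep, the suggested invocation might be wrapped as in the sketch below. The wrapper names `grep_pcre_argv` and `run_grep` are hypothetical; the options themselves (`-P`, `--binary-files=without-match`, `-e`, `--`) are existing grep options. Whether this removes the slowdown for a given data set is not something the sketch demonstrates.

```python
import subprocess

def grep_pcre_argv(pattern, paths):
    """Build a grep -P command line that treats binary files as non-matching."""
    return ["grep", "-P", "--binary-files=without-match",
            "-e", pattern, "--", *paths]

def run_grep(pattern, paths):
    # grep exits with 0 if a line was selected, 1 if none was, 2 on error.
    return subprocess.run(grep_pcre_argv(pattern, paths), capture_output=True)

print(" ".join(grep_pcre_argv(r"a\d+b", ["notes.txt"])))
```

On the command line, `-I` is the short form of `--binary-files=without-match`.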
