bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

bug-grep

From:	Paul Eggert
Subject:	bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date:	Sat, 27 Sep 2014 13:54:24 -0700
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2

Zoltán Herczeg wrote:

> I still want "." to match a single (valid) UTF-8 character.

That's what the GNU matchers do, yes. '.' does not match an invalid byte. It's a reasonable default. If you have some users who want '.' to match an invalid byte, you can add a flag for them, just as there's a PCRE_DOTALL flag for users who want '.' to match newline. That being said, I doubt whether users will care enough to need such a flag. (After all, they're evidently not caring *now*, as libpcre can't search such data at *all*.)

> In the regex world, matching performance is the key aspect of an engine

Absolutely. That's why we're having this discussion: libpcre is slow when matching binary data.

> A "simple" change like this would require a major redesign of the engine.

It'd be nontrivial, yes. But it's clearly doable. (Not that I'm volunteering....)

> What should happen, if the starting offset is inside an otherwise valid UTF
> character?

The same thing that would happen if an input file started with the tail end of a UTF-8 sequence. The leading bytes are invalid. 'grep' deals with this already; it's not a problem.
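
As a rough illustration of the point above (a sketch only, not code from grep or libpcre): if matching starts at an offset that lands inside a multibyte character, the continuation bytes found there have no lead byte in view and can simply be counted as invalid. The helper name leading_invalid_bytes is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from grep or libpcre.
   Treat buf[start..len) as if it were the beginning of the input and
   return how many bytes at its front are UTF-8 continuation bytes
   (10xxxxxx).  With no lead byte before them, each one is invalid.  */
static size_t
leading_invalid_bytes (const unsigned char *buf, size_t len, size_t start)
{
  size_t i = start;
  while (i < len && (buf[i] & 0xC0) == 0x80)
    i++;
  return i - start;
}
```

A caller would skip (or report as non-matching) that many bytes and then continue decoding normally from the first byte that can start a character.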

Filtering would not be needed if libpcre were like grep's other matchers
and simply worked with arbitrary binary data.


> This might be efficient for engines which scans the input only forward direction
> and read every character once.

It can also be efficient for matchers, like grep's, that don't necessarily do that. It just takes more implementation work, that's all. It's not rocket science to go backwards through a UTF-8 string and to catch decoding errors as you go.
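
To make the backward scan concrete, here is a minimal sketch in C. It is an assumption about how such a scan could look, not how grep or libpcre actually do it, and the names utf8_prev_len and count_invalid_backward are made up for the example. The idea is to step back over at most three continuation bytes, look at the candidate lead byte, and accept the sequence only if its length and second-byte range are valid (which also rejects overlong forms, surrogates, and values above U+10FFFF).

```c
#include <stddef.h>

/* Hypothetical helper, not taken from grep or libpcre.
   Return the length (1..4) of the valid UTF-8 character that ends
   exactly at buf[end - 1], or 0 if the bytes ending there do not
   form one.  */
static size_t
utf8_prev_len (const unsigned char *buf, size_t end)
{
  if (end == 0)
    return 0;

  /* Step back over continuation bytes (10xxxxxx), at most three.  */
  size_t n = 1;
  while (n < 4 && n < end && (buf[end - n] & 0xC0) == 0x80)
    n++;

  /* p[0] is the candidate lead byte; it must announce exactly n bytes.  */
  const unsigned char *p = buf + end - n;
  size_t expected;
  if (p[0] < 0x80)
    expected = 1;
  else if (p[0] >= 0xC2 && p[0] <= 0xDF)
    expected = 2;
  else if (p[0] >= 0xE0 && p[0] <= 0xEF)
    expected = 3;
  else if (p[0] >= 0xF0 && p[0] <= 0xF4)
    expected = 4;
  else
    return 0;           /* stray continuation byte, 0xC0, 0xC1, or > 0xF4 */
  if (expected != n)
    return 0;

  /* Reject overlong forms, surrogates, and code points above U+10FFFF.  */
  if ((p[0] == 0xE0 && p[1] < 0xA0)
      || (p[0] == 0xED && p[1] > 0x9F)
      || (p[0] == 0xF0 && p[1] < 0x90)
      || (p[0] == 0xF4 && p[1] > 0x8F))
    return 0;
  return n;
}

/* Walk the whole buffer from back to front, counting bytes that are
   not part of any valid character when scanned in that direction.  */
static size_t
count_invalid_backward (const unsigned char *buf, size_t len)
{
  size_t invalid = 0;
  size_t end = len;
  while (end > 0)
    {
      size_t n = utf8_prev_len (buf, end);
      if (n == 0)
        {
          invalid++;    /* treat the last byte as a single invalid byte */
          end--;
        }
      else
        end -= n;
    }
  return invalid;
}
```

Each backward step inspects at most four bytes, so the cost per character is bounded just as it is when decoding forward.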


