bug-grep

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales


From: Paul Eggert
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sat, 27 Sep 2014 13:54:24 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2

Zoltán Herczeg wrote:
> I still want "." to match a single (valid) UTF-8 character.

That's what the GNU matchers do, yes. '.' does not match an invalid byte. It's a reasonable default. If you have some users who want '.' to match an invalid byte, you can add a flag for them, just as there's a PCRE_DOTALL flag for users who want '.' to match newline. That being said, I doubt whether users will care enough to need such a flag. (After all, they're evidently not caring *now*, as libpcre can't search such data at *all*.)

> In the regex world, matching performance is the key aspect of an engine

Absolutely. That's why we're having this discussion: libpcre is slow when matching binary data.

> A "simple" change like this would require a major redesign of the engine.

It'd be nontrivial, yes. But it's clearly doable. (Not that I'm volunteering....)

> What should happen, if the starting offset is inside an otherwise valid UTF
> character?

The same thing that would happen if an input file started with the tail end of a UTF-8 sequence. The leading bytes are invalid. 'grep' deals with this already; it's not a problem.
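
To make the offset case concrete, here is a minimal sketch in C. It is not taken from grep or libpcre, and the function name leading_stray_bytes is hypothetical. It assumes a matcher that treats each undecodable byte as a single invalid byte, and it counts how many such bytes sit at the front of a buffer that begins partway through a character.

```c
#include <stddef.h>

/* Count the bytes at the start of buf[0..len) that are UTF-8
   continuation bytes (bit pattern 10xxxxxx).  With no lead byte in
   front of them they cannot begin a character, so a matcher that
   treats encoding errors as unmatchable bytes sees each one as an
   invalid byte.  (Hypothetical helper, for illustration only.)  */
size_t leading_stray_bytes(const unsigned char *buf, size_t len)
{
    size_t n = 0;
    while (n < len && (buf[n] & 0xC0) == 0x80)
        n++;
    return n;
}
```

For example, the three-byte sequence E2 82 AC followed by 'x', searched from offset 1, begins with two stray bytes; the search then proceeds normally from 'x'.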

>> Filtering would not be needed if libpcre were like grep's other matchers
>> and simply worked with arbitrary binary data.
>
> This might be efficient for engines which scans the input only forward direction
> and read every character once.

It can also be efficient for matchers, like grep's, that don't necessarily do that. It just takes more implementation work, that's all. It's not rocket science to go backwards through a UTF-8 string and to catch decoding errors as you go.
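
As an illustration of the backward walk, here is a minimal sketch in C. It is not grep's or libpcre's actual code, and all function names are hypothetical. It assumes strict UTF-8 (overlong forms, surrogates, and values above U+10FFFF are rejected) and treats each undecodable byte as one invalid byte.

```c
#include <stddef.h>

/* Return the length (1..4) of the valid UTF-8 sequence starting at
   s[0], given n available bytes, or 0 if s[0] does not start one.  */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of 2nd byte */
    size_t len;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;
    if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        len = 2;
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        len = 3;
        if (s[0] == 0xE0) lo = 0xA0;     /* reject overlong forms */
        if (s[0] == 0xED) hi = 0x9F;     /* reject surrogates */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        len = 4;
        if (s[0] == 0xF0) lo = 0x90;     /* reject overlong forms */
        if (s[0] == 0xF4) hi = 0x8F;     /* reject > U+10FFFF */
    } else {
        return 0;                        /* 80..C1 and F5..FF */
    }
    if (n < len || s[1] < lo || s[1] > hi)
        return 0;
    for (size_t i = 2; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)
            return 0;
    return len;
}

/* Step back from buf[pos].  Returns the number of bytes stepped over
   (0 only when pos == 0).  *valid is 1 if those bytes form one valid
   character ending exactly at pos, 0 if it is a single invalid byte.  */
static size_t utf8_step_back(const unsigned char *buf, size_t pos, int *valid)
{
    size_t max = pos < 4 ? pos : 4;

    *valid = 0;
    if (pos == 0)
        return 0;
    for (size_t k = 1; k <= max; k++) {
        if ((buf[pos - k] & 0xC0) != 0x80) {
            /* First non-continuation byte: the only possible start.  */
            if (utf8_seq_len(buf + pos - k, k) == k) {
                *valid = 1;
                return k;
            }
            break;
        }
    }
    return 1;
}

/* Walk the whole buffer backwards, counting valid characters and
   invalid bytes.  */
void utf8_count_backwards(const unsigned char *buf, size_t len,
                          size_t *chars, size_t *bad)
{
    size_t pos = len;

    *chars = *bad = 0;
    while (pos > 0) {
        int valid;
        pos -= utf8_step_back(buf, pos, &valid);
        if (valid)
            ++*chars;
        else
            ++*bad;
    }
}
```

Each backward step inspects at most four bytes, so the walk is linear in the buffer length. The sketch assumes pos starts at a character boundary or at the end of the buffer; starting in the middle of a valid character would report that character's trailing bytes as invalid.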




