bug-grep

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales


From: Zoltán Herczeg
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sat, 27 Sep 2014 20:16:45 +0200 (CEST)

Hi,

>Sorry, I assume you meant \x9c here?  Anyway, the point is that 
>conceptually you walk through the input byte sequence left-to-right, 
>converting it to characters as you go, and if you encounter an encoding 
>error in the process you convert the error to the corresponding 
>"character" outside the Unicode range.  You then do all matching against 
>the converted sequence.  So there is no question about interpretation: 
>it's the left-to-right interpretation.  This simple and easy-to-explain 
>approach is used by grep's other matchers, by Emacs, etc.

This was one of my proposals: we need a converter before we run PCRE. To be more 
precise, we likely need several converters, so that users can select the 
appropriate one for their use case.
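
As a rough sketch (and only a sketch: the 0x110000 + byte error range below is 
an assumed convention for illustration, not anything grep or PCRE has settled 
on), such a converter could decode one character at a time and map each 
encoding error to a value just past the Unicode maximum:

#include <stddef.h>
#include <stdint.h>

/* Decode one character from UTF-8 text starting at p and ending
   before q.  An encoding error consumes exactly one byte and maps
   it to a "character" outside the Unicode range.  Returns the
   number of bytes consumed; stores the result in *cp.  */
static size_t
decode_one (const uint8_t *p, const uint8_t *q, uint32_t *cp)
{
  uint8_t b = p[0];
  size_t len;
  uint32_t c;

  if (b < 0x80) { *cp = b; return 1; }
  if (b < 0xC2 || b > 0xF4) goto error;   /* continuation byte or bad lead */
  if (b < 0xE0) { len = 2; c = b & 0x1F; }
  else if (b < 0xF0) { len = 3; c = b & 0x0F; }
  else { len = 4; c = b & 0x07; }

  if ((size_t) (q - p) < len) goto error; /* truncated sequence */
  for (size_t i = 1; i < len; i++)
    {
      if ((p[i] & 0xC0) != 0x80) goto error;
      c = (c << 6) | (p[i] & 0x3F);
    }
  /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
  if ((len == 3 && (c < 0x800 || (0xD800 <= c && c <= 0xDFFF)))
      || (len == 4 && (c < 0x10000 || c > 0x10FFFF)))
    goto error;

  *cp = c;
  return len;

 error:
  *cp = 0x110000 + p[0];                  /* assumed error-character range */
  return 1;
}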

>Obviously you don't want to *implement* it the way I described; instead, 
>you want to convert on-the-fly, lazily.  But whatever optimizations you 
>do, you do consistently with the conceptual model.

I would implement it exactly as you described. PCRE is a complex backtracking 
engine: we need to decode the input in both forward and backward directions from 
any starting position, and several characters are parsed multiple times depending 
on the pattern and input. We also employ many optimizations to make this as fast 
as possible, especially in the JIT. The overhead of decoding invalid characters 
accumulates for every character, regardless of whether it is valid or not.
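
To illustrate why the cost cannot be avoided: once error characters are part of 
the model, stepping backward by one character cannot simply skip continuation 
bytes; the candidate sequence must be re-validated on every backward step. A 
minimal sketch, building on the decode_one() above:

/* Step one character backward from p (s is the start of the text,
   and p > s is assumed).  Scans back over at most three continuation
   bytes to a candidate lead byte, then re-decodes forward to confirm
   validity; when the sequence is invalid, exactly one byte is stepped
   over as an error character.  */
static const uint8_t *
step_back (const uint8_t *s, const uint8_t *p, uint32_t *cp)
{
  const uint8_t *r = p;

  while (r > s && p - r < 3 && (r[-1] & 0xC0) == 0x80)
    r--;
  if (r > s)
    r--;                                  /* candidate lead byte */

  /* Accept only if the forward decode ends exactly at p.  */
  if (decode_one (r, p, cp) == (size_t) (p - r))
    return r;

  /* Otherwise the byte just before p is a one-byte error character.  */
  decode_one (p - 1, p, cp);
  return p - 1;
}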

>In practice, the simple approach explained above works well enough to 
>satisfy the vast majority of users.  It's conceivable some special cases 
>in the PCRE world would have trouble fitting into this model, but to be 
>honest I expect this won't be a problem, and that there won't be any 
>serious conceptual issues here, though admittedly there will be some 
>nontrivial programming effort.

The approach might sound simple, but its side effects are non-trivial. For 
example, if we implemented it the way suggested before, the person you quoted 
would not be satisfied. He said 'I still want "." to match a single 
(valid) UTF-8 character.' The dot matches anything but a newline, and 
according to your model the invalid code points would have a "minimal" type, so 
they would match.
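
To make the conflict concrete, here is a tiny illustration (reusing the assumed 
decode_one() sketch from above): a lone 0xC3 byte is a truncated, invalid UTF-8 
sequence, yet it converts to a non-newline error character, so the dot would 
match it:

#include <stdio.h>

int
main (void)
{
  const uint8_t subject[1] = { 0xC3 };
  uint32_t cp;

  decode_one (subject, subject + 1, &cp);
  printf ("0xC3 decodes to 0x%X; '.' %s\n",
          cp, cp != '\n' ? "matches" : "does not match");
  return 0;
}

This is exactly what the quoted user did not want: dot matching something that 
is not a single valid UTF-8 character.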

In the regex world, matching performance is the key aspect of an engine, and 
this is our focus in PCRE. But every advantage has a trade-off, and in PCRE the 
trade-off is source code complexity. A "simple" change like this would require a 
major redesign of the engine.

>Again, the proposed change should not slow down libpcre.  It should 
>speed it up.  That's the point.  In the PCRE_NO_UTF8_CHECK case, libpcre 
>could use exactly the same code it has now, so performance would be 
>unaffected.  And in the non-PCRE_NO_UTF8_CHECK case, libpcre should 
>typically be faster than it is now, because it would avoid unnecessary 
>UTF-8 validation for the parts of the input string that it does not examine.

Partial matching was invented exactly for this purpose: you can divide the 
input into small chunks, filter them, and perform matches. By the way, how is 
partial matching affected by your proposed solution? What should happen if the 
starting offset is inside an otherwise valid UTF-8 character?
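
For reference, a rough sketch of the chunked approach with pcre_exec() and 
PCRE_PARTIAL_HARD (error handling is omitted, the buffer size is arbitrary, and 
a partial match longer than the buffer would need extra handling):

#include <stdio.h>
#include <string.h>
#include <pcre.h>

static void
match_in_chunks (pcre *re, FILE *fp)
{
  char buf[8192];
  int ovector[30];
  size_t kept = 0;         /* tail retained from the previous chunk */
  size_t n;

  while ((n = fread (buf + kept, 1, sizeof buf - kept, fp)) > 0)
    {
      int rc = pcre_exec (re, NULL, buf, (int) (kept + n), 0,
                          PCRE_PARTIAL_HARD, ovector, 30);
      if (rc >= 0)
        {
          printf ("match at offset %d\n", ovector[0]);
          return;
        }
      if (rc == PCRE_ERROR_PARTIAL)
        {
          /* ovector[0] is the start of the partial match: keep that
             tail and append the next chunk after it.  */
          kept = kept + n - ovector[0];
          memmove (buf, buf + ovector[0], kept);
        }
      else
        kept = 0;          /* no full or partial match: start fresh */
    }
}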

>Filtering would not be needed if libpcre were like grep's other matchers 
>and simply worked with arbitrary binary data.

This might be efficient for engines which scan the input only in the forward 
direction and read every character once. That is not true for PCRE.

Regards,
Zoltan