[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: |
Zoltán Herczeg |
Subject: |
bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales |
Date: |
Sat, 27 Sep 2014 20:16:45 +0200 (CEST) |
Hi,
>Sorry, I assume you meant \x9c here? Anyway, the point is that
>conceptually you walk through the input byte sequence left-to-right,
>converting it to characters as you go, and if you encounter an encoding
>error in the process you convert the error to the corresponding
>"character" outside the Unicode range. You then do all matching against
>the converted sequence. So there is no question about interpretation:
>it's the left-to-right interpretation. This simple and easy-to-explain
>approach is used by grep's other matchers, by Emacs, etc.
This was one of my proposal, we need a converter before we run PCRE. To be more
precise, we likely need several converters, and users can select the
appropriate for their use case.
>Obviously you don't want to *implement* it the way I described; instead,
>you want to convert on-the-fly, lazily. But whatever optimizations you
>do, you do consistently with the conceptual model.
I would implement exactly as you described. PCRE is a complex backtracking
engine, we need to decode the input forward and backward directions from any
starting position, and several characters parsed multiple times depending on
the pattern and input. We also employ many optimizations to make this as fast
as possible, especially in JIT. The overhead of decoding invalid characters
accumulates for every character regardless they are valid or not.
>In practice, the simple approach explained above works well enough to
>satisfy the vast majority of users. It's conceivable some special cases
>in the PCRE world would have trouble fitting into this model, but to be
>honest I expect this won't be a problem, and that there won't be any
>serious conceptual issues here, though admittedly there will be some
>nontrivial programming effort.
The approach might sound simple, but its side effects are non-trivial. For
example, if we would implement the way suggested before, the guy, who you
quoted, would not be satisfied. He said 'I still want "." to match a single
(valid) UTF-8 character.' The dot character matches anything but newline.
According to you, the invalid code points should have a "minimal" type, so they
would match.
In the regex world, matching performance is the key aspect of an engine, and
this is our focus in PCRE. But every advantage has a trade-of. In PCRE, this is
source code complexity. A "simple" change like this would require a major
redesign of the engine.
>Again, the proposed change should not slow down libpcre. It should
>speed it up. That's the point. In the PCRE_NO_UTF8_CHECK case, libpcre
>could use exactly the same code it has now, so performance would be
>unaffected. And in the non-PCRE_NO_UTF8_CHECK case, libpcre should
>typically be faster than it is now, because it would avoid unnecessary
>UTF-8 validation for the parts of the input string that it does not examine.
Partial matching was invented exactly for this purpose. You can divide the
input into small chunks, filter them, and perform matches. Btw, how partial
matching is affected by your proposed solution? What should happen, if the
starting offset is inside an otherwise valid UTF character?
>Filtering would not be needed if libpcre were like grep's other matchers
>and simply worked with arbitrary binary data.
This might be efficient for engines which scans the input only forward
direction and read every character once. This is not true for PCRE.
Regards,
Zoltan
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, (continued)
bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Zoltán Herczeg, 2014/09/22
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/25
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Zoltán Herczeg, 2014/09/26
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/26
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Zoltán Herczeg, 2014/09/26
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/26
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales,
Zoltán Herczeg <=
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/27
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Zoltán Herczeg, 2014/09/28
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/28
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Zoltán Herczeg, 2014/09/30
- bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales, Paul Eggert, 2014/09/30