
bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales


From: Zoltán Herczeg
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Tue, 30 Sep 2014 20:10:58 +0200 (CEST)

Hi,

>It's purely a performance question.  GNU grep already uses libpcre to search 
>binary data, and it works now.  It's just slow, that's all.  I'm willing to live 
>with this, and tell users "Sorry, but libpcre is not designed to search binary 
>data quickly; if you want speed then don't use grep's -P option."  If you're 
>willing to live with this too, we're done.

Yes, PCRE is not designed for matching binary data as UTF. Supporting that
would mean too much complexity for too little gain. Normal (non-UTF) search
can be used on binary data without limitations.

>Grep already does that sort of thing.  And it's smart enough to start matching 
>only at character boundaries.  It's not libpcre's job to worry about this; the 
>caller can worry about it.

Thank you for bringing this up. I don't see any point in reimplementing what is
already there. However, if PCRE says it supports UTF matching in binary data,
then it should, because "what is already there" depends on the environment.
This is clearly the best argument for why the environment should be responsible
for handling the binary part of the data: most environments need some kind of
validation anyway, and we would just duplicate code. It is good to hear that
everything is already in grep; perhaps a few more lines are needed to do it in
a thread.
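
To illustrate what "the environment handles the binary part" could mean, here
is a minimal sketch of caller-side validation. It is written in Python for
brevity and is not grep's or PCRE's code; the name valid_utf8_spans is
hypothetical. It splits a buffer into maximal runs of valid UTF-8, so that a
UTF matcher would only ever be handed data that is already known to be valid
(for libpcre, that is the situation the PCRE_NO_UTF8_CHECK option is meant for).

```python
def valid_utf8_spans(buf: bytes):
    """Yield (start, end) offsets of maximal runs of valid UTF-8 in buf.

    Hypothetical helper for illustration only. Bytes that are not part of
    any yielded span are invalid as UTF-8 and would have to be handled
    separately, for example with a non-UTF search.
    """
    pos = 0
    n = len(buf)
    while pos < n:
        try:
            # Strict decoding stops at the first invalid sequence.
            buf[pos:].decode("utf-8")
        except UnicodeDecodeError as err:
            bad_start = pos + err.start
            if bad_start > pos:
                yield (pos, bad_start)
            # Resume just after the invalid bytes reported by the decoder.
            pos += err.end
        else:
            # The rest of the buffer is valid.
            yield (pos, n)
            return
```

The slicing copies data on every restart, so this is only meant to show the
idea, not to be fast. A real implementation would validate in place, in a
single pass.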

>The code you posted could be made faster than that; among other things there 
>should not be an unbounded backward scan.  And even the code you posted would 
>often be faster than what's in libpcre now.  That early UTF-8 validity prepass 
>is a killer.

I would recommend disabling it. Its only purpose is to return early for
invalid buffers. I am sure grep already knows whether a buffer is invalid,
since it scans the buffer.
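
On the point quoted above about avoiding an unbounded backward scan: a UTF-8
character is at most 4 bytes long, so finding the start of the character that
contains a given offset never needs to look back more than 3 bytes. The
following is a rough Python sketch of that idea, not the code that was posted
earlier in this thread; the name char_start is hypothetical.

```python
def char_start(buf: bytes, i: int) -> int:
    """Return the offset of the first byte of the character containing buf[i].

    Looks back at most 3 bytes. If no lead byte covering offset i is found
    in that window, i itself is returned, which treats buf[i] as a
    standalone (binary) byte. Bytes after offset i are not checked.
    """
    lowest = max(0, i - 3)
    j = i
    # Step back over continuation bytes (10xxxxxx), but never past `lowest`.
    while j > lowest and (buf[j] & 0xC0) == 0x80:
        j -= 1
    lead = buf[j]
    if lead < 0x80:
        length = 1          # ASCII
    elif 0xC2 <= lead <= 0xDF:
        length = 2
    elif 0xE0 <= lead <= 0xEF:
        length = 3
    elif 0xF0 <= lead <= 0xF4:
        length = 4
    else:
        length = 0          # continuation byte or invalid lead byte
    # Accept j only if the sequence it announces actually reaches offset i.
    return j if j + length > i else i
```

For example, in the UTF-8 encoding of "a€b" the euro sign occupies offsets
1 to 3, so offsets 1, 2 and 3 all map back to 1.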

Regards,
Zoltan





