bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#47264: [PATCH] pcre: migrate to pcre2


From: Paul Eggert
Subject: bug#47264: [PATCH] pcre: migrate to pcre2
Date: Sun, 7 Nov 2021 16:30:30 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.1

On 11/7/21 11:26, Carlo Marcelo Arenas Belón wrote:
Mostly a bug by bug translation of the original code to the PCRE2 API.
but includes a couple of fixes as well that might be worth doing in
independent patches, if a straight translation is preferable.

Yes, that might be preferable so that if there are problems we can bisect better.


The API changes the sign of some types and therefore some ugly casts
were needed,

I see one ugly cast from char const * to PCRE2_SPTR8; it seems unavoidable unless we go through void * (and I see little point to doing that). Were there some other ugly casts that I missed?

some of the changes are just to make sure all variables
fit into the newer types better.

Yes, we do want to avoid overflow bugs.

Includes backward compatibility and could be made to build all the way
to 10.00, but assumes a recent version ~10.30; the configure rule sets
a strict minimum of 10.34 as that is required to pass all tests, even
if the issues are minimal and likely to be real bugs that the old PCRE
just hide, there is likely more work pending in this area.

pcre2 10.34 is two years old. Seems old enough to me (tho perhaps others can chime in).

-          int search_offset, int options, int *sub)
+          int search_offset, uint32_t options)

Let's stick with 'int' unless we absolutely need uint32_t. uint32_t is ugly and is not required by the C standard and although PCRE2 evidently requires it, I'd rather stick to signed types for the usual reasons.

+          if (lim > INT32_MAX)
+            return e;
+          lim *= 2;

This should use UINT32_MAX / 2 instead of INT32_MAX (the two expressions have different values on some oddball platforms). Better yet, replace the whole thing with "if (INT_MULTIPLY_WRAPV (lim, 2, &lim)) return e;".

+  unsigned char *re = xnmalloc (4, size + (fix_len_max + 4 - 1) / 4);

We don't need to multiply by 4 any more, right? Because we no longer need to do the trick of replacing NUL with \000 in the pattern.

+      flags |= PCRE2_UTF;
+#ifdef PCRE2_NEVER_BACKSLASH_C
+      /* do not match individual code units but only UTF-8  */
+      flags |= PCRE2_NEVER_BACKSLASH_C;
+#endif
+#ifdef PCRE2_MATCH_INVALID_UTF
+      /* consider invalid UTF-8 as a barrier, instead of error  */
+      flags |= PCRE2_MATCH_INVALID_UTF;
+#endif

Which versions of PCRE2 lack these flags? Should we bother to support these old versions?

If memory serves grep currently takes care to not pass invalid UTF-8 in the buffer or pattern. Does PCRE2_MATCH_INVALID_UTF make this work obsolete?

+  PCRE2_UCHAR8 ep[128];

How do we know 128 is long enough? Deserves a comment.

Performance seems equivalent, and it also seems functionally complete.

Very good news. Thanks for working on this.





reply via email to

[Prev in Thread] Current Thread [Next in Thread]