bug-grep

bug#47264: [PATCH v2] pcre: migrate to pcre2


From: Paul Eggert
Subject: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Mon, 15 Nov 2021 08:17:02 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.1

On 11/14/21 20:44, Carlo Arenas wrote:

This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for
forward compatibility to a potential future version of PCRE2 that may
define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier
PCRE2_SIZE is hardwired to size_t, so there is only one plausible
default for PCRE2_SIZE_MAX, namely SIZE_MAX.

which is why I mention that it will be better to at least document
that in a comment, as it was done everywhere else where assumptions
made in the pcre library were used.

What sort of documentation did you have in mind, exactly?

Interestingly enough this discussion gave me an idea for a feature in
PCRE where that value will be set to something else than SIZE_MAX and
that might break grep in a future release if it lands.

How would it break grep? I'm not following. If a future version of PCRE2 defines PCRE2_SIZE_MAX to something other than SIZE_MAX, grep should work just fine because it will use what PCRE2 defines.

As I mentioned before, in an early draft that also had this change
reversed, PCRE matches the Perl definition.

I see that PCRE2 documents that PCRE2_EXTRA_MATCH_WORD surrounds the
pattern with "\b(?:" and ")\b". However, this is bogus: it doesn't
correspond to the intuitive meaning of "match words", and it doesn't
correspond to how grep -w behaves for any grep that I know of.

It all comes from what perl defines[1] as a word character (\w)

No it doesn't. It comes merely from how PCRE2 documents and implements PCRE2_EXTRA_MATCH_WORD.

Perl's definition of \w does not determine how PCRE2_EXTRA_MATCH_WORD behaves; it determines only which characters are word characters and which are not. As things stand, PCRE2_EXTRA_MATCH_WORD is bizarre because it causes 'pcre2grep -w' to match strings consisting entirely of non-word (i.e., non-\w) characters. This cannot be right.


That is indeed likely a "bug", but it is one that PCRE shares with Perl
(and at least JavaScript, Java, .NET, Python and Ruby):

   $ echo 'a,a' | perl -nle '/\b(,)\b/ and print "$1"'

That is a different issue. \b matches word *boundaries*; it's different from -w which is supposed to match *words*. There is indeed a word boundary between "a" (a \w character) and "," (a non-\w character), and another word boundary between "," and the following "a", but this doesn't mean "," is a word.
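
The same effect can be reproduced in Python, one of the languages listed above as sharing Perl's \b semantics. This is only a minimal sketch: the helper names wrap_with_b and b_wrapped_match are hypothetical, and the wrapping merely imitates the documented "\b(?:" ... ")\b" behavior using Python's re module rather than calling PCRE2 itself.

```python
import re


def wrap_with_b(pattern):
    # Hypothetical helper: imitates the documented "\b(?:" ... ")\b" wrapping.
    return r"\b(?:" + pattern + r")\b"


def b_wrapped_match(pattern, line):
    # Return the matched text, or None when the wrapped pattern does not match.
    m = re.search(wrap_with_b(pattern), line)
    return m.group(0) if m else None


if __name__ == "__main__":
    # "," is accepted in "a,a" because a word boundary lies on each side of it,
    # even though "," contains no word character at all.
    print(repr(b_wrapped_match(",", "a,a")))
    print(repr(b_wrapped_match("foo", "foo bar")))
    print(repr(b_wrapped_match("foo", "foobar")))
```

The first call reports a match on a string with no \w characters in it, which is the behavior being objected to here.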

Attempting to implement -w with \b is a mistake. That mistake is made in PCRE2 and the mistake should be corrected. PCRE2 should implement PCRE2_EXTRA_MATCH_WORD the same way that grep -P implements -w.
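
One way to express the alternative is to require that the match not be adjacent to a word character on either side, which is how the GNU grep manual describes -w. The sketch below does this in Python with lookarounds. It is an assumption about how such a wrapping could look, not a quotation of grep's source, and the helper names wrap_as_word and word_match are hypothetical.

```python
import re


def wrap_as_word(pattern):
    # Assumed wrapping: the match must not be preceded or followed by a \w
    # character. Start and end of line satisfy both conditions automatically.
    return r"(?<!\w)(?:" + pattern + r")(?!\w)"


def word_match(pattern, line):
    # Return the matched text, or None when the wrapped pattern does not match.
    m = re.search(wrap_as_word(pattern), line)
    return m.group(0) if m else None


if __name__ == "__main__":
    # "," in "a,a" is rejected: it has a word character on each side.
    print(repr(word_match(",", "a,a")))
    print(repr(word_match("foo", "foo bar")))
    print(repr(word_match("foo", "foobar")))
    print(repr(word_match("foo", "(foo)")))
```

Under this wrapping the "a,a" case no longer matches, while ordinary whole-word cases such as "foo" in "foo bar" still do.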




