bug#47264: [PATCH v2] pcre: migrate to pcre2
From: Carlo Arenas
Subject: bug#47264: [PATCH v2] pcre: migrate to pcre2
Date: Sun, 14 Nov 2021 20:44:33 -0800
On Sun, Nov 14, 2021 at 7:18 PM Paul Eggert <eggert@cs.ucla.edu> wrote:
> On 11/14/21 14:25, Carlo Arenas wrote:
> > using idx_t instead of size_t should be fine (if only halves the max
> > size of the objects managed), but I am concerned that assuming
> > PCRE2_SIZE_MAX is always equivalent to SIZE_MAX (as done in patch 4)
> > might be risky (at least without a comment), and considering that is
> > part of the API anyway might be better if kept as PCRE2_SIZE_MAX IMHO.
>
> This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for
> forward compatibility to a potential future version of PCRE2 that may
> define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier
> PCRE2_SIZE is hardwired to size_t, so there is only one plausible
> default for PCRE2_SIZE_MAX, namely SIZE_MAX.
Which is why I mentioned that it would be better to at least document
that in a comment, as was done everywhere else the code relies on
assumptions about the PCRE library.
Interestingly enough, this discussion gave me an idea for a feature in
PCRE where that value would be set to something other than SIZE_MAX,
which might break grep in a future release if it lands.
> > As I mentioned before, PCRE matches the Perl definition as mentioned
> > before in an early draft that also had this change reversed.
>
> I see that PCRE2 documents that PCRE2_EXTRA_MATCH_WORD surrounds the
> pattern with "\b(?:" and ")\b". However, this is bogus: it doesn't
> correspond to the intuitive meaning of "match words", and it doesn't
> correspond to how grep -w behaves for any grep that I know of.
It all comes from what Perl defines[1] as a word character (\w), which
I presume stems from Perl's early use for text processing, when most
of that text was computer code (which nowadays can include Unicode).
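To illustrate the word-character definition discussed above, here is a minimal sketch using Python's re module rather than PCRE2 or Perl themselves. It assumes Python's \w is close enough for the characters shown; the helper name is_word_char is hypothetical.

```python
import re

def is_word_char(ch, ascii_only=False):
    # True when the single character ch matches \w in Python's re module.
    # With ascii_only=True, \w is restricted to [a-zA-Z0-9_].
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\w", ch, flags) is not None

# Letters, digits and underscore are word characters; punctuation is not.
# U+00E9 (e with acute) counts as a word character only in Unicode mode.
for ch in ["a", "7", "_", ",", "%", "\u00e9"]:
    print(repr(ch), is_word_char(ch), is_word_char(ch, ascii_only=True))
```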
> Which "early draft" are you talking about? This appears to be merely a
> bug in libpcre2's documentation and implementation.
https://lists.gnu.org/archive/html/grep-devel/2021-10/msg00000.html
> > I would suggest instead that -P should also follow perl convention
> > instead when used together with -w, but maybe that is something that a
> > -P feature flag could enable or disable as needed?
>
> I can't imagine anybody intuitively saying in an English locale that
> "%%" is a word in the string "aa%%aa". PCRE2 is broken, that's all. If a
> user really wants PCRE2's buggy interpretation, they can simply surround
> their regexp with "\b(?:" and ")\b" and not use -w; so there's no need
> to have a different flag for pcre2grep's bizarre interpretation of -w.
>
> Here's another reason why pcre2grep -w is obviously busted:
>
> $ pcre2grep -w ',' <<'EOF'
> > a,a
> > a, a
> > a,
> > EOF
> a,a
>
> Why is "," a word in the first input line, but not in the second or
> third? pcre2grep is simply wrong here.
That is indeed likely a "bug", but it is one that PCRE shares with Perl
(and at least JavaScript, Java, .NET, Python and Ruby):
$ echo 'a,a' | perl -nle '/\b(,)\b/ and print "$1"'
,
But it is also because the feature is not being used correctly: ','
is not a word, and therefore logically none of them should match.
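The same behavior can be reproduced with Python's re module, one of the engines listed above as sharing it. This is only a sketch: it wraps the pattern in "\b(?:" and ")\b" as the PCRE2 documentation is quoted as describing, and the helper name match_word_boundary_style is hypothetical.

```python
import re

def match_word_boundary_style(pattern, line):
    # Surround the pattern with \b(?: ... )\b, mirroring the wrapping
    # attributed to PCRE2_EXTRA_MATCH_WORD in the quoted text.
    return re.search(r"\b(?:" + pattern + r")\b", line) is not None

lines = ["a,a", "a, a", "a,"]
matched = [line for line in lines if match_word_boundary_style(",", line)]
print(matched)  # ['a,a']
```

Only the first line matches because \b needs a word character on exactly one side of each boundary: in "a, a" the comma is followed by a space, and in "a," it is followed by the end of the line.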
> > Note that "word" definition also has a different meaning in a post
> > Unicode world
>
> Yes, but that's an independent issue.
For the '%' example it was not independent: the reason it was not
matched as expected by grep is that '%' has a Unicode property marking
it as a punctuation character.
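For comparison, the "%%" case can be sketched in Python with two hypothetical helpers: boundary_match uses the \b wrapping, while grep_w_like approximates GNU grep's documented -w rule (the match must not be adjacent to a word constituent) with lookarounds. Neither is the actual grep or PCRE2 implementation.

```python
import re
import unicodedata

def boundary_match(pattern, line):
    # Perl/PCRE-style wrapping: \b(?: ... )\b
    return re.search(r"\b(?:" + pattern + r")\b", line) is not None

def grep_w_like(pattern, line):
    # Approximation of grep -w: no word character immediately before
    # or after the match (assumption, not grep's real code path).
    return re.search(r"(?<!\w)(?:" + pattern + r")(?!\w)", line) is not None

print(boundary_match("%%", "aa%%aa"))    # True
print(grep_w_like("%%", "aa%%aa"))       # False
print(boundary_match("%%", "aa %% aa"))  # False
print(grep_w_like("%%", "aa %% aa"))     # True
print(unicodedata.category("%"))         # Po (punctuation, other)
```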
Carlo
[1] https://perldoc.perl.org/perlrebackslash#%5Cb%7B%7D,-%5Cb,-%5CB%7B%7D,-%5CB