bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b

bug-grep


From:	Paul Eggert
Subject:	bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P
Date:	Mon, 9 Jan 2023 10:40:16 -0800
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.6.0

On 1/9/23 03:35, Ævar Arnfjörð Bjarmason wrote:

You almost never want "everything Unicode considers a digit", and if you
do using e.g. \p{Nd} instead of \d would be better in terms of
expressing your intent.

For GNU grep, PCRE2_UCP is needed because of examples like what Gro-Tsen and Karl Petterssen supplied. If there's some disagreement about how \d should behave with UTF-8 data the GNU grep hackers should let the Perl community decide that; that is, GNU grep can simply follow PCRE2's lead. But GNU grep does need PCRE2_UCP for \b etc.
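
As a rough illustration of why \b depends on Unicode-aware character properties, here is a minimal sketch using Python's re module rather than PCRE2 or grep. The assumption is that Python's re.ASCII flag behaves roughly like matching without PCRE2_UCP, and the default str behavior roughly like matching with it; the sample word is made up for the example and is not taken from the bug report.

```python
import re

# Hypothetical sample word; "É" is a letter outside the ASCII range.
text = "Échec"

# ASCII-only word characters: "É" is not treated as a word character,
# so a spurious word boundary appears between "É" and "c".
ascii_hit = re.search(r"\bchec", text, re.ASCII)

# Unicode-aware word characters: "É" is a word character, so there is
# no boundary inside the word and the pattern does not match.
unicode_hit = re.search(r"\bchec", text)

print("ASCII-only \\b matches inside the word:", ascii_hit is not None)
print("Unicode-aware \\b matches inside the word:", unicode_hit is not None)
```

The ASCII-only result is the kind of false boundary that Unicode-aware matching avoids.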

        $ diff <(git -P grep -P '\d+') <(git -P grep -P '(*UCP)\d')
        53360a53361,53362
        > git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
        > git-gui/po/ja.po:"- 第２行: 空白\n"

Although I don't speak Japanese I have dealt with quite a bit of Japanese text in a previous job, and personally I would prefer \d to match those two lines as they do contain digits. So to me this particular case is not a good argument that git grep should not match those lines.

Of course other people might prefer differently, and there are cases where I want to match only ASCII digits. I've learned in the past to use [0-9] for that. I hope PCRE2 never changes [0-9] to match anything but ASCII digits when searching UTF-8 text.
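
The difference between \d and [0-9] on text like the ja.po lines quoted above can be sketched as follows. This again uses Python's re module as a stand-in, not PCRE2; it assumes Python's default Unicode-aware \d is comparable to \d under PCRE2_UCP. The variable names are illustrative only.

```python
import re
import unicodedata

# One of the lines from the quoted diff; it contains U+FF11 FULLWIDTH DIGIT ONE.
line = "- 第１行: 何をしたか、を１行で要約。"

# Unicode-aware \d matches any character in general category Nd.
unicode_digits = re.findall(r"\d", line)

# An explicit [0-9] class matches ASCII digits only, regardless of mode.
ascii_class = re.findall(r"[0-9]", line)

# \d restricted to ASCII behaves like [0-9].
ascii_mode = re.findall(r"\d", line, re.ASCII)

print("Unicode \\d:", unicode_digits)
print("[0-9]:", ascii_class)
print("ASCII \\d:", ascii_mode)
print("category of fullwidth one:", unicodedata.category("１"))
```

Writing [0-9] states the ASCII-only intent explicitly, so the pattern does not depend on which mode the regex engine is in.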
