grep-devel

Re: Locale aware range expressions?


From: Ronan Pigott
Subject: Re: Locale aware range expressions?
Date: Mon, 29 Jan 2024 05:00:37 +0000

January 28, 2024 at 7:29 PM, "Paul Eggert" <eggert@cs.ucla.edu> wrote:

> Not exactly. 'sort' sorts strings using an algorithm that is more
> complicated than simply comparing characters according to the collation
> sequence, because it uses weights. This is true even for single-character
> strings. This means that in general, you cannot use 'sort' to deduce a
> locale's collation sequence.

Can you expand on how this applies to grep though? By my reading, it sounds
like the collation sequence referred to by that document, which defines the
sort order for sort, strcoll, strxfrm etc., is the same one referred to by the
grep(1) manual. I'm pretty sure that the case-insensitive ordering correctly
reflects the collation sequence in this case, and is not some quirk of sort:

  $ python -q
  >>> import string, locale
  >>> locale.setlocale(locale.LC_ALL, '')
  'en_US.UTF-8'
  >>> sorted(string.ascii_letters, key=locale.strxfrm)[:8]
  ['a', 'A', 'b', 'B', 'c', 'C', 'd', 'D']

The above implies to me that A, B, and C should be matched by '[a-d]' given
the description in grep(1).
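
To make that implication concrete, here is a minimal sketch of the check I
have in mind. It is not how grep or glibc implement ranges; chars_in_range
and interleaved_key are hypothetical helpers, and interleaved_key is only a
stand-in for the 'a A b B ...' order shown above, so that the result does not
depend on the locale installed on the machine running it.

```python
import locale
import string


def chars_in_range(candidates, lo, hi, key=locale.strxfrm):
    """Return the candidates whose sort key lies between key(lo) and key(hi), inclusive."""
    lo_key, hi_key = key(lo), key(hi)
    return [c for c in candidates if lo_key <= key(c) <= hi_key]


def interleaved_key(ch):
    # Hypothetical stand-in for the ordering a A b B c C ...:
    # compare by lowercase letter first, then lowercase before uppercase.
    return (ch.lower(), ch.isupper())


# Under the interleaved ordering, A, B and C fall between 'a' and 'd'.
print(chars_in_range(string.ascii_letters, "a", "d", key=interleaved_key))

# Under plain code point ordering, only the lowercase letters do.
print(chars_in_range(string.ascii_letters, "a", "d", key=ord))
```

Calling locale.setlocale(locale.LC_ALL, '') first and leaving key at its
default (locale.strxfrm) would run the same check against the real locale.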

Also, considering the example from the other response, Arch compiles grep with
simply './configure --prefix=/usr && make' [1], and my range expression does
match these non-ASCII Latin characters (and not their uppercase
counterparts):

  $ printf '%s\n' d ḑ D Ḑ e E é É f ḟ F Ḟ | grep '[d-f]'
  d
  ḑ
  e
  é
  f

Futzing around, I find this other interesting case:

  $ printf '%s\n' $'NFKC: \u00e9' $'NFKD: e\u0301' | grep '[a-e]'
  NFKD: é
  $ printf '%s\n' $'NFKC: \u00e9' $'NFKD: e\u0301' | grep '[a-f]'
  NFKC: é
  NFKD: é
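
For reference, a small sketch of what differs between those two input lines.
It only inspects code points and says nothing about how grep evaluates the
range; codepoints and has_codepoint_in_range are hypothetical helpers. The
decomposed line contains a plain 'e', so any range that includes 'e' can match
it no matter how the precomposed é is treated.

```python
import unicodedata


def codepoints(text):
    """Format each code point of text as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


def has_codepoint_in_range(text, lo, hi):
    """True if any single code point of text lies between lo and hi, inclusive."""
    return any(lo <= ch <= hi for ch in text)


composed = "\u00e9"                                   # precomposed é
decomposed = unicodedata.normalize("NFD", composed)   # 'e' followed by U+0301

print(codepoints(composed))
print(codepoints(decomposed))

# By code point alone, only the decomposed form has something in a..e.
print(has_codepoint_in_range(composed, "a", "e"))
print(has_codepoint_in_range(decomposed, "a", "e"))
```

That accounts for the '[a-e]' result, but not for '[a-f]' matching the
precomposed é, which code point order alone cannot explain.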

I still don't think we have an answer to my first question, then. Why is it
that the uppercase letters are not matched? Maybe more importantly, how can I
characterize the full set of characters which are matched by '[a-d]'? The grep
manual says it's all the characters that sort between 'a' and 'd' "using the
locale's collating sequence and character set". As far as I can tell, that is
not what actually happens.

For the record, I prefer that they are not matched. I worry that if grep were
changed to enable this behavior it would probably break many user scripts as
well. I would like to understand the expected behavior, though.

[1] https://gitlab.archlinux.org/archlinux/packaging/packages/grep/-/blob/main/PKGBUILD?ref_type=heads#L36-37

Thanks,

Ronan


