Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8

bug-grep

From:	Johannes Meixner
Subject:	Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8
Date:	Fri, 15 Jun 2012 15:00:29 +0200 (CEST)
User-agent:	Alpine 2.00 (LNX 1167 2008-08-23)


Hello,

On Jun 14 07:44 Paul Eggert wrote (excerpt):
> On 06/14/2012 04:07 AM, Johannes Meixner wrote:
>> Is grep's -i implemented via plain convert to lower case
>> or is it actually implemented via "case folding"?
>
> I'm not sure which you mean by "plain convert"
> and by "case folding", but it should handle
> the Greek sigma case correctly.  If there are
> bugs please let us know.


I meant the difference between "convert to lower case"
and "case folding" as described in
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf

But I am not at all an expert in this area, so I may
misunderstand things.



The Greek sigma case:


I think that the Greek sigma case is not handled correctly
in grep-2.7 which I use.

My steps to reproduce it:

1.

From
http://www.utf8-chartable.de/unicode-utf8-table.pl
I have this table:

Unicode | UTF-8 (oct.) | name
----------------------------------------------------------
U+03A3  | 0316 0243    | GREEK CAPITAL LETTER SIGMA
U+03C2  | 0317 0202    | GREEK SMALL LETTER FINAL SIGMA
U+03C3  | 0317 0203    | GREEK SMALL LETTER SIGMA

2.

I set a Greek UTF-8 locale:

$ export LC_ALL=el_GR.utf8 ; export LANG=el_GR.utf8

3.

I create four UTF-8 files with those characters:

For the file names I use the following ASCII characters
to denote the content of the file:
'S' means a GREEK CAPITAL LETTER SIGMA
'f' means a GREEK SMALL LETTER FINAL SIGMA
's' means a GREEK SMALL LETTER SIGMA

$ echo -e '\0316\0243\0316\0243' >SS

$ echo -e '\0316\0243\0317\0202' >Sf

$ echo -e '\0317\0203\0317\0202' >sf

$ echo -e '\0317\0203\0317\0203' >ss

4.

Testing what grep versus grep -i finds:

$ grep -q -i -f SS ss && echo yes || echo no
yes

$ grep -q -i -f ss SS && echo yes || echo no
yes

$ grep -q -i -f Sf sf && echo yes || echo no
yes

$ grep -q -i -f sf Sf && echo yes || echo no
yes

$ grep -q -i -f SS sf && echo yes || echo no
no

$ grep -q -i -f sf SS && echo yes || echo no
no
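
The test files from step 3 can also be written with printf, whose
octal escapes are specified by POSIX, instead of the less portable
"echo -e". The following is only a sketch: the helper names
make_sigma_files and hex_of are hypothetical, and the hex dump is
there to confirm that each file holds the intended UTF-8 bytes.

```shell
# Write the four test files with printf instead of echo -e.
# make_sigma_files is a hypothetical helper name, not part of the report.
make_sigma_files() {
  dir=${1:-.}
  printf '\316\243\316\243\n' >"$dir/SS"   # CAPITAL SIGMA, CAPITAL SIGMA
  printf '\316\243\317\202\n' >"$dir/Sf"   # CAPITAL SIGMA, FINAL SIGMA
  printf '\317\203\317\202\n' >"$dir/sf"   # SMALL SIGMA, FINAL SIGMA
  printf '\317\203\317\203\n' >"$dir/ss"   # SMALL SIGMA, SMALL SIGMA
}

# Print the bytes of a file as one hex string (hypothetical helper).
hex_of() {
  od -An -tx1 "$1" | tr -d ' \n'
}

tmp=$(mktemp -d)
make_sigma_files "$tmp"
hex_of "$tmp/Sf"; echo    # cea3cf820a
```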

My conclusion:

The last two "no" results are errors as far as I understand
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf

Therein in the "Caseless Matching" sub-section there is
an example given that GREEK CAPITAL LETTER SIGMA should
match GREEK SMALL LETTER FINAL SIGMA.

I think this is caused by "the way grep's -i is implemented:
it converts both the RE and the buffer-to-search to lower case"
so that GREEK CAPITAL LETTER SIGMA gets converted
to GREEK SMALL LETTER SIGMA which does not match
GREEK SMALL LETTER FINAL SIGMA.
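
The effect described above can be simulated at the byte level.
This sketch is not grep's code; it only assumes that "convert to
lower case" maps CAPITAL SIGMA to SMALL SIGMA and leaves FINAL SIGMA
alone, since it is already lower case. The name lower_sigma is
hypothetical.

```shell
# Simulate plain lower-casing for the sigma letters only:
# U+03A3 becomes U+03C3, U+03C2 is left unchanged.
lower_sigma() {
  cap=$(printf '\316\243')   # U+03A3 GREEK CAPITAL LETTER SIGMA
  sml=$(printf '\317\203')   # U+03C3 GREEK SMALL LETTER SIGMA
  LC_ALL=C sed "s/$cap/$sml/g"
}

pattern=$(printf '\316\243\316\243\n' | lower_sigma)   # content of SS
buffer=$(printf '\317\203\317\202\n' | lower_sigma)    # content of sf
if [ "$pattern" = "$buffer" ]; then echo match; else echo "no match"; fi
# prints "no match": the buffer still ends in FINAL SIGMA
```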

I think that if grep's -i were implemented by "case folding"
as I understand it from
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf
then something like the following should be done:

In both the RE and the buffer-to-search
GREEK CAPITAL LETTER SIGMA gets converted to GREEK SMALL LETTER SIGMA
and
GREEK SMALL LETTER FINAL SIGMA gets converted to GREEK SMALL LETTER SIGMA
so that then in the end there is only GREEK SMALL LETTER SIGMA
in both the RE and the buffer-to-search and that matches.
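
A minimal sketch of that folding, again at the byte level and only
for the sigma letters. The name fold_sigma is hypothetical, and a
real implementation would work on characters rather than on raw
bytes through sed.

```shell
# Fold both CAPITAL SIGMA and FINAL SIGMA to SMALL SIGMA.
fold_sigma() {
  cap=$(printf '\316\243')   # U+03A3 GREEK CAPITAL LETTER SIGMA
  fin=$(printf '\317\202')   # U+03C2 GREEK SMALL LETTER FINAL SIGMA
  sml=$(printf '\317\203')   # U+03C3 GREEK SMALL LETTER SIGMA
  LC_ALL=C sed -e "s/$cap/$sml/g" -e "s/$fin/$sml/g"
}

pattern=$(printf '\316\243\316\243\n' | fold_sigma)   # content of SS
buffer=$(printf '\317\203\317\202\n' | fold_sigma)    # content of sf
if [ "$pattern" = "$buffer" ]; then echo match; else echo "no match"; fi
# prints "match": both sides are now SMALL SIGMA, SMALL SIGMA
```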



The German sharp s case:


Another example is the LATIN SMALL LETTER SHARP S (U+00DF) which is
described in the "Complications for Case Mapping" sub-section in
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf

The LATIN SMALL LETTER SHARP S (U+00DF / octal 0303 0237)
expands when uppercased to the sequence of two characters "SS".

I use here in this mail the ASCII character 'f' to denote
a LATIN SMALL LETTER SHARP S (U+00DF / octal 0303 0237).

There is the German lowercase word 'heif' (English 'hot')
and when 'heif' is uppercased it becomes 'HEISS'.

Therefore for "grep -i" 'heif' and 'HEISS' should match.

$ export LC_ALL=de_DE.utf8 ; export LANG=de_DE.utf8

$ echo -e 'hei\0303\0237' >heif

$ echo 'HEISS' >HEISS

$ grep -q -i -f heif HEISS && echo yes || echo no
no

$ grep -q -i -f HEISS heif && echo yes || echo no
no

In this case it seems "case folding" can be implemented
as follows:

In both the RE and the buffer-to-search
LATIN SMALL LETTER SHARP S gets converted to 'SS'


To make it more complicated since Unicode 5.1 there exists

LATIN CAPITAL LETTER SHARP S (U+1E9E / octal 0341 0272 0236)

so that 'HEISS' could be also written as

$ echo -e 'HEI\0341\0272\0236' >HEIF

I use 'F' to denote a LATIN CAPITAL LETTER SHARP S.

For "grep -i" 'heif' and 'HEISS' and 'HEIF' should match.

Therefore "case folding" for LATIN SHARP S in general
might be implemented as follows:

In both the RE and the buffer-to-search
LATIN SMALL LETTER SHARP S gets converted to 'SS'
and
LATIN CAPITAL LETTER SHARP S gets converted to 'SS'
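
As a sketch, the two replacements could look as follows. The names
fold_sharp_s and upper_ascii are hypothetical. The ASCII-only
upper-casing step is an assumption added here so that the remaining
letters compare equal as well; it is not part of the mapping itself.

```shell
# Replace both sharp s forms with the two ASCII characters "SS".
fold_sharp_s() {
  small=$(printf '\303\237')         # U+00DF LATIN SMALL LETTER SHARP S
  capital=$(printf '\341\272\236')   # U+1E9E LATIN CAPITAL LETTER SHARP S
  LC_ALL=C sed -e "s/$small/SS/g" -e "s/$capital/SS/g"
}

# Upper-case the ASCII letters only (assumption of this sketch).
upper_ascii() {
  LC_ALL=C tr 'abcdefghijklmnopqrstuvwxyz' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
}

# 'heif', 'HEISS' and 'HEIF' in the notation of this mail
for word in 'hei\303\237' 'HEISS' 'HEI\341\272\236'; do
  printf "$word\n" | fold_sharp_s | upper_ascii
done
# prints HEISS three times
```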

I wonder whether it is right or wrong to convert 'SS' to 'ss'
in an additional step, which would mean that LATIN SHARP S
would match 'ss'. I don't know if this is the right meaning
of caseless matching, because there are the German words
'Masse' (English 'mass') and 'Mafe' (English 'measures');
again I write 'f' to denote a LATIN SMALL LETTER SHARP S.

Both words have different meanings, so I think that
for "grep -i" 'Masse' and 'Mafe' should not match.

Therefore I think "case folding" should result in something like
'Masse' is case folded to 'masse'
and
'Mafe' is case folded to 'maSSe'
without an additional step which converts 'SS' to 'ss'
so that 'Masse' and 'Mafe' do not match for caseless matching.
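
A sketch of this variant, assuming an ASCII-only lower-casing step
followed by replacing both sharp s forms with upper-case 'SS'. The
name fold_keep_SS is hypothetical. It keeps 'Masse' and 'Mafe'
apart, but under the same rule 'HEISS' folds to 'heiss' while 'heif'
folds to 'heiSS', so those two no longer compare equal; the sketch
makes that trade-off visible. For comparison, the full case folding
in Unicode's CaseFolding.txt maps U+00DF to 'ss'.

```shell
# Lower-case ASCII letters first, then turn sharp s into upper-case "SS",
# with no further step that would convert "SS" to "ss".
fold_keep_SS() {
  small=$(printf '\303\237')         # U+00DF LATIN SMALL LETTER SHARP S
  capital=$(printf '\341\272\236')   # U+1E9E LATIN CAPITAL LETTER SHARP S
  LC_ALL=C tr 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' 'abcdefghijklmnopqrstuvwxyz' |
    LC_ALL=C sed -e "s/$small/SS/g" -e "s/$capital/SS/g"
}

printf 'Masse\n'       | fold_keep_SS   # masse
printf 'Ma\303\237e\n' | fold_keep_SS   # maSSe
printf 'HEISS\n'       | fold_keep_SS   # heiss
printf 'hei\303\237\n' | fold_keep_SS   # heiSS
```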



In the end I think "case folding" means to have a list of mappings
for special characters how to convert them in both the RE and the
buffer-to-search into a fixed form (i.e. a sequence of bytes)
which is appropriate for caseless matching using a binary
comparison, see "Caseless Matching" in
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf
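
The idea of a mapping list plus a binary comparison can be sketched
as a wrapper around "grep -F". Everything here is an assumption for
illustration: the names mappings, fold and caseless_match are
hypothetical, the list holds only the two sigma entries, and the
octal notation is the one used with "echo -e" earlier in this mail.

```shell
# Hypothetical mapping list, one "source target" pair per line.
mappings='\0316\0243 \0317\0203
\0317\0202 \0317\0203'

# Turn the list into sed substitutions and apply them to stdin.
fold() {
  script=$(printf '%s\n' "$mappings" | while read -r from to; do
    printf 's/%b/%b/g\n' "$from" "$to"
  done)
  LC_ALL=C sed -e "$script"
}

# caseless_match PATTERN_FILE TEXT_FILE
# Folds both inputs, then compares them byte for byte as fixed strings.
caseless_match() {
  pat=$(fold <"$1")
  fold <"$2" | LC_ALL=C grep -q -F -e "$pat"
}

tmp=$(mktemp -d)
printf '\316\243\316\243\n' >"$tmp/SS"
printf '\317\203\317\202\n' >"$tmp/sf"
caseless_match "$tmp/SS" "$tmp/sf" && echo yes || echo no   # yes
caseless_match "$tmp/sf" "$tmp/SS" && echo yes || echo no   # yes
```

Further entries would go into the list in the same form; the ASCII
letters would still need their own case handling, which this sketch
leaves out.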


Kind Regards
Johannes Meixner
--
SUSE LINUX Products GmbH -- Maxfeldstrasse 5 -- 90409 Nuernberg -- Germany
HRB 16746 (AG Nuernberg) GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer
