
Re: [bug-gawk] different results from same patterns


From: arnold
Subject: Re: [bug-gawk] different results from same patterns
Date: Tue, 02 Aug 2016 02:28:11 -0600
User-agent: Heirloom mailx 12.4 7/29/08

Hi.

I have integrated this patch, and it's in the repo. I added a test
based on your test code, but in awk.  On my Ubuntu 16.04 system I have
the el_GR.iso88597 locale installed and the test passes.

On a different Ubuntu 14.04 system I have tried to install this locale,
and I think I have done so successfully, but the test fails.  Do you have
any advice about how to verify if this locale is usable for the test?
If not, I'm going to have to remove the test.

Please use the gawk-4.1-stable branch in the gawk repo from Savannah
for any testing.

Thanks,

Arnold

Norihiro Tanaka <address@hidden> wrote:

> Hi,
>
> \323, \362 and \363 have the following meanings in el_GR.iso88597:
>
>   \323 (0xd3) SIGMA
>   \362 (0xf2) final sigma
>   \363 (0xf3) sigma
>
> This locale has MB_CUR_MAX == 1.  \323 is upper case, whereas \362 and
> \363 are lower case in the locale.  The lower case of \323 is \363, and
> the upper case of both \362 and \363 is \323.
>
> I ran the following test.
>
> (
> LC_ALL=el_GR.iso88597
> export LC_ALL
>
> printf 'b\323\nb\362\nb\363\n' >in
>
> pat=$(printf '\323\n'); ./gawk "BEGIN { IGNORECASE = 1 } /b$pat/ { print }" in |
>   sed -e s'/^b//' >out1-dfa
> pat=$(printf '\362\n'); ./gawk "BEGIN { IGNORECASE = 1 } /b$pat/ { print }" in |
>   sed -e s'/^b//' >out2-dfa
> pat=$(printf '\363\n'); ./gawk "BEGIN { IGNORECASE = 1 } /b$pat/ { print }" in |
>   sed -e s'/^b//' >out3-dfa
> pat=$(printf '\323\n'); ./gawk 'BEGIN { IGNORECASE = 1 } /[a-c]'"$pat/ { print }" in |
>   sed -e s'/^b//' >out1-regex
> pat=$(printf '\362\n'); ./gawk 'BEGIN { IGNORECASE = 1 } /[a-c]'"$pat/ { print }" in |
>   sed -e s'/^b//' >out2-regex
> pat=$(printf '\363\n'); ./gawk 'BEGIN { IGNORECASE = 1 } /[a-c]'"$pat/ { print }" in |
>   sed -e s'/^b//' >out3-regex
>
> for out in out[1-3]-*; do echo "$out"; od -tx1 "$out" | head -1; done
> )
>
> result:
>
> out1-dfa
> 0000000 d3 0a f2 0a f3 0a
> out1-regex
> 0000000 d3 0a
> out2-dfa
> 0000000 d3 0a f2 0a f3 0a
> out2-regex
> 0000000 f2 0a f3 0a
> out3-dfa
> 0000000 d3 0a f2 0a f3 0a
> out3-regex
> 0000000 f2 0a f3 0a
>
> I expect out1-dfa to be the same as out1-regex, out2-dfa the same as
> out2-regex, and out3-dfa the same as out3-regex.
>
> I reported a fastmap issue in regex at
> https://sourceware.org/bugzilla/show_bug.cgi?id=20381, but I think that
> this is another issue.  Gawk uses RE_ICASE with the dfa matcher, but in
> single-byte locales it uses CASETABLE instead of RE_ICASE with regex.
>
> I propose a patch, but I am not confident that it is correct.
>
> Thanks,
> Norihiro
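
On the question of how to verify that the locale is usable for the test,
one possible check can be sketched as follows.  This is only a sketch and
does not come from the thread: the helper name check_locale is
hypothetical and not part of the gawk test suite, it assumes a
glibc-style locale(1) utility, and the expected charmap name ISO-8859-7
is an assumption about how the system names this encoding.  The idea is
that a locale which is not listed, or which does not report the expected
charmap, should cause the test to be skipped rather than fail.

```shell
#!/bin/sh
# Hypothetical helper (not part of the gawk test suite).
# Usage: check_locale LOCALE_NAME EXPECTED_CHARMAP
# Prints a one-line diagnosis; returns 0 only if the locale looks usable.
check_locale() {
    loc=$1
    want=$2

    # Step 1: the name must be listed by `locale -a`
    # (compared as a whole line, ignoring case).
    if ! locale -a 2>/dev/null | grep -Fqix "$loc"; then
        echo "missing: $loc is not listed by locale -a"
        return 1
    fi

    # Step 2: the locale must report the expected charmap when selected.
    got=$(LC_ALL=$loc locale charmap 2>/dev/null)
    if [ "$got" != "$want" ]; then
        echo "unusable: charmap is '$got', expected '$want'"
        return 1
    fi

    echo "usable: $loc ($got)"
}

# Decide whether to run or skip the locale-dependent test.
if check_locale el_GR.iso88597 ISO-8859-7; then
    echo "run the test"
else
    echo "skip the test"
fi
```

A check like this only shows that the locale can be selected and uses the
expected encoding.  It does not confirm that the locale's case mappings
for \323, \362 and \363 are the ones described above.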


