bug-grep

Re: UTF-8 locale and \n in regexps


From: Aharon Robbins
Subject: Re: UTF-8 locale and \n in regexps
Date: Tue, 24 Apr 2007 21:41:53 +0300

Greetings.  Concerning the below.

I can indeed reproduce this in my current code base.

This would seem to be a bug deep, deep, VERY deep in the guts of the dfa
matcher, which is failing to match when it should. You can see this by
setting the environment variable GAWK_NO_DFA to a non-empty value, which
disables the dfa matcher, and rerunning the test.
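
As a minimal sketch of that comparison, one of the patterns from the
script can be run both ways. This assumes gawk is on the PATH and the
en_US.UTF-8 locale is installed. The helper name try_match and the
shortened input are made up for illustration, and it is not confirmed
that this reduced case triggers the failure:

```shell
#! /bin/sh
# Sketch only: try_match is a hypothetical helper, not part of gawk.

try_match() {
  # $1 non-empty: export GAWK_NO_DFA so that gawk skips the dfa matcher.
  # $1 empty:     make sure GAWK_NO_DFA is unset so the dfa matcher is used.
  (
    if [ -n "$1" ]; then
      GAWK_NO_DFA=$1
      export GAWK_NO_DFA
    else
      unset GAWK_NO_DFA
    fi
    LC_ALL=en_US.UTF-8 gawk '
    BEGIN {
      # The \n matched by the pattern ends line2, an even-numbered line.
      s = "line1\nline2\nline3\nline4\n"
      if (match(s, /\n[^3\n]*3/)) print "match"; else print "no match"
    }'
  )
}

echo "dfa matcher enabled:  $(try_match "")"
echo "dfa matcher disabled: $(try_match 1)"
```

If the two lines of output differ, the dfa matcher is the likely culprit
for that pattern.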

I am sorry to admit that I am unable to identify the bug. The dfa code,
particularly for multibyte locales, is just too complicated for me to
follow, although I did try.

That code originally came from GNU grep, so I am cc-ing the grep bug list
in the hope that someone there may be able to help.

A separate question arises, which is why the match function is using
the dfa matcher in this case at all, which I will start investigating.
(In other words, there may be a workaround.)

Thanks,

Arnold

> Date: Thu, 19 Apr 2007 17:09:02 +0300
> From: Pekka Pessi <address@hidden>
> Subject: UTF-8 locale and \n in regexps
> To: address@hidden
> Cc: address@hidden
>
> --=-=-=
>
> Hello,
>
> It looks like a regexp with \n in [^] behaves badly if the locale has
> a UTF-8 ctype.
>
> It looks like if there is a \n and a range without \n, like /\n[^x\n]foo/,
> and the first \n ends an even-numbered line within the string, the regexp
> does not match.
>
> Please see the attached script for a demonstration.
>
> --Pekka Pessi
>
>
> --=-=-=
> Content-Disposition: inline; filename=gawk-test
>
> #! /bin/sh
>
> for LC_ALL in C UNKNOWN POSIX en_US.ISO-8859-1 en_US.UTF-8
> do
> export LC_ALL
> cat <<EOF |
> line1
> line2
> line3
> line4 
> line5
> line6
> line7
> line8
> line9
> EOF
> gawk '
> BEGIN { RS="\0"; }
> { 
>   if (match($0, /\n[^2\n]*2/)) { got2=1; } else { print "no match 2"; }
>   if (match($0, /\n[^3\n]*3/)) { got3=1; } else { print "no match 3"; }
>   if (match($0, /\n[^4\n]*4/)) { got4=1; } else { print "no match 4"; }
>   if (match($0, /\n[^5\t]*5/)) { got5=1; } else { print "no match 5"; }
>   if (match($0, /\n[^6\n]*6/)) { got6=1; } else { print "no match 6"; }
>   if (match($0, /\n[a-z]*7\n/)){ got7=1; } else { print "no match 7"; }
>   if (match($0, /\n[^8\n]*8/)) { got8=1; } else { print "no match 8"; }
>   if (match($0, /8.[^9\n]+9/)) { got9=1; } else { print "no match 9"; }
> }
>
> END { exit(!(got2 && got3 && got4 && got5 && got6 && got7 && got8 && got9)); }
> ' || { 
>   echo LC_ALL=$LC_ALL FAILED
>   exit 1
> }
> echo LC_ALL=$LC_ALL passed
> done
>
> --=-=-=--
>



