rebased utf-8 series

bug-grep
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
rebased utf-8 series

From:	Paolo Bonzini
Subject:	rebased utf-8 series
Date:	Tue, 16 Mar 2010 10:37:34 +0100
Jim said he had difficulties rebasing, so here it is again.  Single
message for simplicity.

Also available at git://github.com/bonzini/grep.git

Paolo Bonzini (7):
  dfa: rewrite handling of multibyte case_fold lexing
  dfa: speed up handling of brackets
  dfa: optimize simple character sets under UTF-8 charsets
  dfa: cache MB_CUR_MAX for dfaexec
  dfa: run simple UTF-8 regexps as a single-byte character set
  grep: remove check_multibyte_string, fix non-UTF8 missed match
  grep: match multibyte charsets line-by-line when using -i

 .x-sc_cast_of_argument_to_free |    1 -
 .x-sc_space_tab                |    1 -
 NEWS                           |   13 +
 src/dfa.c                      |  960 +++++++++++++++++++++-------------------
 src/dfa.h                      |    6 +
 src/grep.c                     |  108 ++---
 src/search.c                   |  245 ++++++-----
 tests/Makefile.am              |    2 +
 tests/case-fold-backslash-w    |   14 +
 tests/euc-mb                   |   23 +
 tests/foad1.sh                 |   10 +-
 11 files changed, 742 insertions(+), 641 deletions(-)
 delete mode 100644 .x-sc_cast_of_argument_to_free
 create mode 100755 tests/case-fold-backslash-w
 create mode 100644 tests/euc-mb

>From 726b62e3b64bf848422f7ec45786ad93148452a0 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <address@hidden>
Date: Sun, 23 Nov 2008 17:28:46 +0100
Subject: [PATCH 1/7] dfa: rewrite handling of multibyte case_fold lexing

Let dfacomp do the folding to lowercase of multibyte input strings,
and remove it from grep.c.  Input strings to kwset.c are still folded
outside kwset.c, so we still need to do mbtolower in search.c.

* NEWS: Document bugfixes.
* .x-sc_cast_of_argument_to_free: Remove.
* src/dfa.c (wctok, addtok_wc): New.
(cur_mb_index, update_mb_len_index): Remove.
(FETCH): Do not call it.
(parse_bracket_exp_mb) [GREP]: Disable case-folding of ranges and
characters.
(addtok): Extract part to...
(addtok_mb): ... this new function.
(lex): Call fetch_wc in the main loop for MB_CUR_MAX > 1.  Return WCHAR
for normal characters if MB_CUR_MAX > 1.
(atom): Handle WCHAR instead of treating multibyte characters specially.
Do case folding of multibyte characters here.
(dfacomp): Remove case_fold special casing.
* src/dfa.h (WCHAR): New.
* src/grep.c (mb_icase_keys): Remove.
(main): Do not call it.
* src/search.c (kwsinit): Init transition table only for MB_CUR_MAX == 1.
(mbtolower): New.
(kwsincr_case): New.
(kwsmusts): Call it instead of kwsincr.
(check_multibyte_string): Remove.
(check_multibyte_string_no_icase): Rename to check_multibyte_string.
(GEAcompile, EGexecute, Fcompile): Use mbtolower instead of the old
check_multibyte_string.
* tests/Makefile.am (TESTS): Add case-fold-backslash-w.
* tests/foad1.sh: Enable fixed tests.
* tests/case-fold-backslash-w: New.
---
 .x-sc_cast_of_argument_to_free |    1 -
 NEWS                           |    3 +
 src/dfa.c                      |  225 +++++++++++++++++-----------------------
 src/dfa.h                      |    3 +
 src/grep.c                     |   68 ------------
 src/search.c                   |  192 +++++++++++++++++++++--------------
 tests/Makefile.am              |    1 +
 tests/case-fold-backslash-w    |   14 +++
 tests/foad1.sh                 |   10 +-
 9 files changed, 238 insertions(+), 279 deletions(-)
 delete mode 100644 .x-sc_cast_of_argument_to_free
 create mode 100755 tests/case-fold-backslash-w

diff --git a/.x-sc_cast_of_argument_to_free b/.x-sc_cast_of_argument_to_free
deleted file mode 100644
index 3f02e3d..0000000
--- a/.x-sc_cast_of_argument_to_free
+++ /dev/null
@@ -1 +0,0 @@
-^src/search\.c$
diff --git a/NEWS b/NEWS
index 4f9c778..d326f2e 100644
--- a/NEWS
+++ b/NEWS
@@ -10,6 +10,9 @@ GNU grep NEWS                                    -*- outline 
-*-
   - for single characters, echo Y | grep -i '[y]'
   - for character types, echo Y | grep -i '[[:lower:]]'
 
+  grep -i -o would fail to report some matches; grep -i --color, while not
+  missing any line containing a match, would fail to color some matches.
+
   Various bugs in grep -P, caused by expressions such as [^b] or \S matching
   newlines, were fixed.  grep -P also supports the special sequences \Z and
   \z, and can be combined with the command-line option -z to perform searches
diff --git a/src/dfa.c b/src/dfa.c
index 90e3c18..455bd70 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -267,17 +267,10 @@ static int hard_LC_COLLATE;       /* Nonzero if 
LC_COLLATE is hard.  */
 #ifdef MBS_SUPPORT
 /* These variables are used only if (MB_CUR_MAX > 1).  */
 static mbstate_t mbs;          /* Mbstate for mbrlen().  */
-static int cur_mb_len;         /* Byte length of the current scanning
+static int cur_mb_len;         /* Length of the multibyte representation of
+                                  wctok.  */
+static wchar_t wctok;          /* Wide character representation of the current
                                   multibyte character.  */
-static int cur_mb_index;        /* Byte index of the current scanning multibyte
-                                   character.
-
-                                  single byte character : cur_mb_index = 0
-                                  multibyte character
-                                      1st byte : cur_mb_index = 1
-                                      2nd byte : cur_mb_index = 2
-                                        ...
-                                      nth byte : cur_mb_index = n  */
 static unsigned char *mblen_buf;/* Correspond to the input buffer in dfaexec().
                                   Each element store the amount of remain
                                   byte of corresponding multibyte character
@@ -299,38 +292,6 @@ static unsigned char const *buf_end;       /* reference to 
end in dfaexec().  */
 #endif /* MBS_SUPPORT  */
 
 #ifdef MBS_SUPPORT
-/* This function update cur_mb_len, and cur_mb_index.
-   p points current lexptr, len is the remaining buffer length.  */
-static void
-update_mb_len_index (char const *p, int len)
-{
-  /* If last character is a part of a multibyte character,
-     we update cur_mb_index.  */
-  if (cur_mb_index)
-    cur_mb_index = (cur_mb_index >= cur_mb_len)? 0
-                       : cur_mb_index + 1;
-
-  /* If last character is a single byte character, or the
-     last portion of a multibyte character, we check whether
-     next character is a multibyte character or not.  */
-  if (! cur_mb_index)
-    {
-      cur_mb_len = mbrlen(p, len, &mbs);
-      if (cur_mb_len > 1)
-       /* It is a multibyte character.
-          cur_mb_len was already set by mbrlen().  */
-       cur_mb_index = 1;
-      else if (cur_mb_len < 1)
-       /* Invalid sequence.  We treat it as a single byte character.
-          cur_mb_index is aleady 0.  */
-       cur_mb_len = 1;
-      /* Otherwise, cur_mb_len == 1, it is a single byte character.
-        cur_mb_index is aleady 0.  */
-    }
-}
-#endif /* MBS_SUPPORT */
-
-#ifdef MBS_SUPPORT
 /* Note that characters become unsigned here. */
 # define FETCH(c, eoferr)                      \
   do {                                         \
@@ -341,8 +302,6 @@ update_mb_len_index (char const *p, int len)
        else                                    \
          return lasttok = END;                 \
       }                                                \
-    if (MB_CUR_MAX > 1)                                \
-      update_mb_len_index(lexptr, lexleft);    \
     (c) = (unsigned char) *lexptr++;           \
     --lexleft;                                 \
   } while(0)
@@ -365,7 +324,7 @@ fetch_wc (char const *eoferr)
   if (cur_mb_len <= 0)
    {
       cur_mb_len = 1;
-      wc = *lexptr;
+      wc = (unsigned char) *lexptr;
     }
   lexptr += cur_mb_len;
   lexleft -= cur_mb_len;
@@ -585,6 +544,7 @@ parse_bracket_exp_mb (void)
          work_mbc->range_ends[work_mbc->nranges++] =
             case_fold ? towlower(wc2) : (wchar_t)wc2;
 
+#ifndef GREP
          if (case_fold)
             {
               REALLOC_IF_NECESSARY(work_mbc->range_sts, wchar_t,
@@ -594,21 +554,24 @@ parse_bracket_exp_mb (void)
                                    range_ends_al, work_mbc->nranges + 1);
               work_mbc->range_ends[work_mbc->nranges++] = towupper(wc2);
             }
+#endif
        }
       else if (wc != WEOF)
        /* build normal characters.  */
        {
          REALLOC_IF_NECESSARY(work_mbc->chars, wchar_t, chars_al,
                               work_mbc->nchars + 1);
-         work_mbc->chars[work_mbc->nchars++] = (wchar_t)wc;
-         if (case_fold && (iswlower(wc) || iswupper(wc)))
-           {
-             REALLOC_IF_NECESSARY(work_mbc->chars, wchar_t, chars_al,
-                                  work_mbc->nchars + 1);
-             work_mbc->chars[work_mbc->nchars++] =
-               (wchar_t) (iswlower(wc) ? towupper(wc) : towlower(wc));
-           }
-       }
+         work_mbc->chars[work_mbc->nchars++] =
+               (wchar_t) (case_fold ? towlower(wc) : wc);
+#ifndef GREP
+         if (case_fold)
+            {
+              REALLOC_IF_NECESSARY(work_mbc->chars, wchar_t, chars_al,
+                                   work_mbc->nchars + 1);
+              work_mbc->chars[work_mbc->nchars++] = towupper(wc);
+            }
+#endif
+        }
     }
   while ((wc = wc1) != L']');
   return MBCSET;
@@ -691,15 +654,20 @@ lex (void)
      "if (backslash) ...".  */
   for (i = 0; i < 2; ++i)
     {
-      FETCH(c, 0);
 #ifdef MBS_SUPPORT
-      if (MB_CUR_MAX > 1 && cur_mb_index)
-       /* If this is a part of a multi-byte character, we must treat
-          this byte data as a normal character.
-          e.g. In case of SJIS encoding, some character contains '\',
-               but they must not be backslash.  */
-       goto normal_char;
+      if (MB_CUR_MAX > 1)
+        {
+          wint_t wi = fetch_wc (NULL);
+          if (wi == WEOF)
+            return lasttok = EOF;
+          wctok = wi, c = wctob (wi);
+          if ((int)c == EOF)
+            goto normal_char;
+        }
+      else
 #endif /* MBS_SUPPORT  */
+        FETCH(c, NULL);
+
       switch (c)
        {
        case '\\':
@@ -1071,12 +1039,20 @@ lex (void)
        default:
        normal_char:
          laststart = 0;
+#ifdef MBS_SUPPORT
+         /* For multibyte character sets, folding is done in atom.  Always
+             return WCHAR.  */
+          if (MB_CUR_MAX > 1)
+            return lasttok = WCHAR;
+#endif
+
          if (case_fold && ISALPHA(c))
            {
              zeroset(ccl);
              setbit_case_fold (c, ccl);
              return lasttok = CSET + charclass_index(ccl);
            }
+
          return lasttok = c;
        }
     }
@@ -1096,29 +1072,18 @@ static int depth;               /* Current depth of a 
hypothetical stack
                                   required of the real stack later on in
                                   dfaanalyze(). */
 
-/* Add the given token to the parse tree, maintaining the depth count and
-   updating the maximum depth if necessary. */
 static void
-addtok (token t)
+addtok_mb (token t, int mbprop)
 {
 #ifdef MBS_SUPPORT
   if (MB_CUR_MAX > 1)
     {
       REALLOC_IF_NECESSARY(dfa->multibyte_prop, int, dfa->nmultibyte_prop,
                           dfa->tindex);
-      /* Set dfa->multibyte_prop.  See struct dfa in dfa.h.  */
-      if (t == MBCSET)
-       dfa->multibyte_prop[dfa->tindex] = ((dfa->nmbcsets - 1) << 2) + 3;
-      else if (t < NOTCHAR)
-       dfa->multibyte_prop[dfa->tindex]
-         = (cur_mb_len == 1)? 3 /* single-byte char */
-         : (((cur_mb_index == 1)? 1 : 0) /* 1st-byte of multibyte char */
-            + ((cur_mb_index == cur_mb_len)? 2 : 0)); /* last-byte */
-      else
-       /* It may be unnecessary, but it is safer to treat other
-          symbols as single byte characters.  */
-       dfa->multibyte_prop[dfa->tindex] = 3;
+      dfa->multibyte_prop[dfa->tindex] = mbprop;
     }
+#else
+  (void) mbprop;
 #endif
 
   REALLOC_IF_NECESSARY(dfa->tokens, token, dfa->talloc, dfa->tindex);
@@ -1147,6 +1112,41 @@ addtok (token t)
     dfa->depth = depth;
 }
 
+/* Add the given token to the parse tree, maintaining the depth count and
+   updating the maximum depth if necessary. */
+static void
+addtok (token t)
+{
+#ifdef MBS_SUPPORT
+  if (MB_CUR_MAX > 1 && t == MBCSET)
+    addtok_mb (MBCSET, ((dfa->nmbcsets - 1) << 2) + 3);
+  else
+#endif
+    addtok_mb (t, 3);
+}
+
+/* We treat a multibyte character as a single atom, so that DFA
+   can treat a multibyte character as a single expression.
+
+   e.g. We construct following tree from "<mb1><mb2>".
+   <mb1(1st-byte)><mb1(2nd-byte)><CAT><mb1(3rd-byte)><CAT>
+   <mb2(1st-byte)><mb2(2nd-byte)><CAT><mb2(3rd-byte)><CAT><CAT> */
+static void
+addtok_wc (wint_t wc)
+{
+  unsigned char buf[16];
+  mbstate_t s;
+  int i;
+  memset (&s, 0, sizeof(s));
+  cur_mb_len = wcrtomb ((char *) buf, wc, &s);
+  addtok_mb(buf[0], cur_mb_len == 1 ? 3 : 1);
+  for (i = 1; i < cur_mb_len; i++)
+    {
+      addtok_mb(buf[i], i == cur_mb_len - 1 ? 2 : 0);
+      addtok(CAT);
+    }
+}
+
 /* The grammar understood by the parser is as follows.
 
    regexp:
@@ -1185,6 +1185,23 @@ addtok (token t)
 static void
 atom (void)
 {
+#ifdef MBS_SUPPORT
+  if (tok == WCHAR)
+    {
+      addtok_wc (case_fold ? towlower(wctok) : wctok);
+#ifndef GREP
+      if (case_fold && iswalpha(wctok))
+        {
+          addtok_wc (towupper(wctok));
+          addtok (OR);
+        }
+#endif
+
+      tok = lex();
+      return;
+    }
+#endif /* MBS_SUPPORT  */
+
   if ((tok >= 0 && tok < NOTCHAR) || tok >= CSET || tok == BACKREF
       || tok == BEGLINE || tok == ENDLINE || tok == BEGWORD
 #ifdef MBS_SUPPORT
@@ -1194,24 +1211,6 @@ atom (void)
     {
       addtok(tok);
       tok = lex();
-#ifdef MBS_SUPPORT
-      /* We treat a multibyte character as a single atom, so that DFA
-        can treat a multibyte character as a single expression.
-
-         e.g. We construct following tree from "<mb1><mb2>".
-              <mb1(1st-byte)><mb1(2nd-byte)><CAT><mb1(3rd-byte)><CAT>
-              <mb2(1st-byte)><mb2(2nd-byte)><CAT><mb2(3rd-byte)><CAT><CAT>
-      */
-      if (MB_CUR_MAX > 1)
-       {
-         while (cur_mb_index > 1 && tok >= 0 && tok < NOTCHAR)
-           {
-             addtok(tok);
-             addtok(CAT);
-             tok = lex();
-           }
-       }
-#endif /* MBS_SUPPORT  */
     }
   else if (tok == LPAREN)
     {
@@ -1343,7 +1342,6 @@ dfaparse (char const *s, size_t len, struct dfa *d)
 #ifdef MBS_SUPPORT
   if (MB_CUR_MAX > 1)
     {
-      cur_mb_index = 0;
       cur_mb_len = 0;
       memset(&mbs, 0, sizeof(mbstate_t));
     }
@@ -2938,39 +2936,10 @@ dfainit (struct dfa *d)
 void
 dfacomp (char const *s, size_t len, struct dfa *d, int searchflag)
 {
-  if (case_fold && len)        /* dummy folding in service of dfamust() */
-    {
-      char *lcopy;
-      int i;
-
-      lcopy = malloc(len);
-      if (!lcopy)
-       dfaerror(_("memory exhausted"));
-
-      /* This is a kludge. */
-      case_fold = 0;
-      for (i = 0; i < len; ++i)
-       if (ISUPPER ((unsigned char) s[i]))
-         lcopy[i] = tolower ((unsigned char) s[i]);
-       else
-         lcopy[i] = s[i];
-
-      dfainit(d);
-      dfaparse(lcopy, len, d);
-      free(lcopy);
-      dfamust(d);
-      d->cindex = d->tindex = d->depth = d->nleaves = d->nregexps = 0;
-      case_fold = 1;
-      dfaparse(s, len, d);
-      dfaanalyze(d, searchflag);
-    }
-  else
-    {
-        dfainit(d);
-        dfaparse(s, len, d);
-       dfamust(d);
-        dfaanalyze(d, searchflag);
-    }
+  dfainit(d);
+  dfaparse(s, len, d);
+  dfamust(d);
+  dfaanalyze(d, searchflag);
 }
 
 /* Free the storage held by the components of a dfa. */
diff --git a/src/dfa.h b/src/dfa.h
index 685ce94..4ca55f0 100644
--- a/src/dfa.h
+++ b/src/dfa.h
@@ -129,6 +129,9 @@ typedef enum
 
   MBCSET,                      /* MBCSET is similar to CSET, but for
                                   multibyte characters.  */
+
+  WCHAR,                       /* Only returned by lex.  wctok contains
+                                  the wide character representation.  */
 #endif /* MBS_SUPPORT */
 
   CSET                         /* CSET and (and any value greater) is a
diff --git a/src/grep.c b/src/grep.c
index c1c6152..f1d341a 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -1781,69 +1781,6 @@ parse_grep_colors (void)
                "at remaining substring \"%s\"."), p, q);
 }
 
-/* mb_icase_keys() is called by main() to convert its "keys" string with
-   strlen() "len" to lowercase if match_icase is true.  Pointers are used
-   to implement in-out call-by-reference parameters.  */
-#ifdef MBS_SUPPORT
-static void
-mb_icase_keys (char **keys, size_t *len)
-{
-  wchar_t wc;
-  mbstate_t sti, stj;          /* i for input/old, j for output/new.  */
-  size_t i, j, li, lj;         /* l for total string length (minus '\0').  */
-  char *ki, *kj;               /* k for keys.  */
-  int mcm;
-
-  if ((mcm = MB_CUR_MAX) == 1)
-    return;
-
-  li = *len;
-  ki = *keys;
-  /* We use a new buffer because some multi-octet characters change
-     length through a lower-case conversion.  For example:
-       len(U+0049)=1 --> len(U+0131)=2   under tr_TR.UTF-8
-       len(U+0130)=2 --> len(U+0069)=1   under en_US.UTF-8
-       len(U+2126)=3 --> len(U+03C9)=2   under en_US.UTF-8
-       len(U+212A)=3 --> len(U+006B)=1   under en_US.UTF-8
-       len(U+212B)=3 --> len(U+00E5)=2   under en_US.UTF-8  */
-  lj = li + mcm;
-  kj = xmalloc(lj + 1);
-
-  memset(&sti, 0, sizeof(mbstate_t));
-  memset(&stj, 0, sizeof(mbstate_t));
-  for (i = j = 0; i < li ;)
-    {
-      size_t mbclen;
-      mbclen = mbrtowc(&wc, ki + i, li - i, &sti);
-      if (lj < j + mcm)
-       {
-         lj += mcm;
-         kj = xrealloc(kj, lj + 1);
-       }
-      if (mbclen == (size_t) -1 || mbclen == (size_t) -2 || mbclen == 0)
-       {
-         /* An invalid sequence, or a truncated multi-octet character.
-            We treat it as a single-octet character.  */
-         kj[j++] = ki[i++];
-       }
-      else
-       {
-         /* Doing towupper() before towlower() helps a few hairy cases and is
-            not too costly since this is the PATTERN and is done only once.  */
-         wc = towupper((wint_t)wc);
-         wc = towlower((wint_t)wc);
-         j += wcrtomb(kj + j, wc, &stj);
-         i += mbclen;
-       }
-    }
-  kj[j] = '\0';
-
-  free(ki);
-  *keys = kj;
-  *len = j;
-}
-#endif /* MBS_SUPPORT */
-
 int
 main (int argc, char **argv)
 {
@@ -2261,11 +2198,6 @@ There is NO WARRANTY, to the extent permitted by 
law.\n"),
 
   set_limits();
 
-#ifdef MBS_SUPPORT
-  if (match_icase)
-    mb_icase_keys (&keys, &keycc);
-#endif /* MBS_SUPPORT */
-
   compile(keys, keycc);
   free (keys);
 
diff --git a/src/search.c b/src/search.c
index d9b4462..d7cf0e8 100644
--- a/src/search.c
+++ b/src/search.c
@@ -30,7 +30,6 @@
 #endif
 
 #include "system.h"
-#include "ignore-value.h"
 #include "grep.h"
 #ifndef FGREP_PROGRAM
 # include <regex.h>
@@ -59,14 +58,83 @@ kwsinit (void)
   static char trans[NCHAR];
   int i;
 
-  if (match_icase)
-    for (i = 0; i < NCHAR; ++i)
-      trans[i] = TOLOWER (i);
+  if (match_icase && MB_CUR_MAX == 1)
+    {
+      for (i = 0; i < NCHAR; ++i)
+        trans[i] = TOLOWER (i);
+
+      kwset = kwsalloc (trans);
+    }
+  else
+    kwset = kwsalloc (NULL);
 
-  if (!(kwset = kwsalloc (match_icase ? trans : (char *) 0)))
+  if (!kwset)
     xalloc_die ();
 }
 
+#ifdef MBS_SUPPORT
+/* Convert the string from BEG to N to lowercase.  Overwrite N
+   with the length of the new string, and return a pointer to
+   the lowercase string.  Successive calls to mbtolower will
+   rewrite the output buffer.  */
+static char *
+mbtolower (const char *beg, size_t *n)
+{
+  static char *out;
+  static size_t outalloc;
+  size_t outlen, mb_cur_max;
+  mbstate_t is, os;
+  const char *end;
+  char *p;
+
+  if (*n > outalloc)
+    {
+      out = xrealloc (out, *n);
+      outalloc = *n;
+    }
+
+  memset (&is, 0, sizeof (is));
+  memset (&os, 0, sizeof (os));
+  end = beg + *n;
+
+  mb_cur_max = MB_CUR_MAX;
+  p = out;
+  outlen = 0;
+  while (beg < end)
+    {
+      wchar_t wc;
+      size_t mbclen = mbrtowc(&wc, beg, end - beg, &is);
+      if (outlen + mb_cur_max >= outalloc)
+        {
+          out = x2nrealloc (out, &outalloc, 1);
+          p = out + outlen;
+        }
+
+      if (mbclen == (size_t) -1 || mbclen == (size_t) -2 || mbclen == 0)
+        {
+          /* An invalid sequence, or a truncated multi-octet character.
+             We treat it as a single-octet character.  */
+          *p++ = *beg++;
+          outlen++;
+          memset (&is, 0, sizeof (is));
+          memset (&os, 0, sizeof (os));
+        }
+      else
+        {
+          beg += mbclen;
+          mbclen = wcrtomb (p, towlower ((wint_t) wc), &os);
+          p += mbclen;
+          outlen += mbclen;
+        }
+    }
+
+  *n = p - out;
+  *p++ = 0;
+  return out;
+}
+#endif
+
+
 #ifndef FGREP_PROGRAM
 /* DFA compiled regexp. */
 static struct dfa dfa;
@@ -94,6 +162,22 @@ dfaerror (char const *mesg)
    call the regexp matcher at all. */
 static int kwset_exact_matches;
 
+static char const *
+kwsincr_case (const char *must)
+{
+  const char *buf;
+  size_t n;
+
+  n = strlen (must);
+#ifdef MBS_SUPPORT
+  if (match_icase && MB_CUR_MAX > 1)
+    buf = mbtolower (must, &n);
+  else
+#endif
+    buf = must;
+  return kwsincr (kwset, buf, n);
+}
+
 /* If the DFA turns out to have some set of fixed strings one of
    which must occur in the match, then we build a kwset matcher
    to find those strings, and thus quickly filter out impossible
@@ -115,7 +199,7 @@ kwsmusts (void)
          if (!dm->exact)
            continue;
          ++kwset_exact_matches;
-         if ((err = kwsincr (kwset, dm->must, strlen (dm->must))) != NULL)
+         if ((err = kwsincr_case (dm->must)) != NULL)
            error (EXIT_TROUBLE, 0, "%s", err);
        }
       /* Now, we compile the substrings that will require
@@ -124,7 +208,7 @@ kwsmusts (void)
        {
          if (dm->exact)
            continue;
-         if ((err = kwsincr (kwset, dm->must, strlen (dm->must))) != NULL)
+         if ((err = kwsincr_case (dm->must)) != NULL)
            error (EXIT_TROUBLE, 0, "%s", err);
        }
       if ((err = kwsprep (kwset)) != NULL)
@@ -134,48 +218,9 @@ kwsmusts (void)
 #endif /* !FGREP_PROGRAM */
 
 #ifdef MBS_SUPPORT
-/* This function allocate the array which correspond to "buf".
-   Then this check multibyte string and mark on the positions which
-   are not single byte character nor the first byte of a multibyte
-   character.  Caller must free the array.  */
-static char*
-check_multibyte_string(char *buf, size_t size)
-{
-  char *mb_properties = xcalloc(size, 1);
-  mbstate_t cur_state;
-  wchar_t wc;
-  int i;
-
-  memset(&cur_state, 0, sizeof(mbstate_t));
-
-  for (i = 0; i < size ;)
-    {
-      size_t mbclen;
-      mbclen = mbrtowc(&wc, buf + i, size - i, &cur_state);
-
-      if (mbclen == (size_t) -1 || mbclen == (size_t) -2 || mbclen == 0)
-       {
-         /* An invalid sequence, or a truncated multibyte character.
-            We treat it as a single byte character.  */
-         mbclen = 1;
-       }
-      else if (match_icase)
-       {
-         if (iswupper((wint_t)wc))
-           {
-             wc = towlower((wint_t)wc);
-             ignore_value (wcrtomb(buf + i, wc, &cur_state));
-           }
-       }
-      mb_properties[i] = mbclen;
-      i += mbclen;
-    }
-
-  return mb_properties;
-}
 
 static char*
-check_multibyte_string_no_icase(const char *buf, size_t size)
+check_multibyte_string(const char *buf, size_t size)
 {
   char *mb_properties = xcalloc(size, 1);
   mbstate_t cur_state;
@@ -219,10 +264,8 @@ GEAcompile (char const *pattern, size_t size, reg_syntax_t 
syntax_bits)
   size_t total = size;
   char *motif;
 
-#if 0
   if (match_icase)
     syntax_bits |= RE_ICASE;
-#endif
   re_set_syntax (syntax_bits);
   dfasyntax (syntax_bits, match_icase, eolbyte);
 
@@ -334,18 +377,16 @@ EXECUTE_FCT(EGexecute)
     {
       if (match_icase)
         {
-          /* Add one for the sentinel byte dfaexec may add.  */
-          char *case_buf = xmalloc(size + 1);
-          memcpy(case_buf, buf, size);
+          /* mbtolower adds a NUL byte at the end.  That will provide
+            space for the sentinel byte dfaexec may add.  */
+          char *case_buf = mbtolower (buf, &size);
          if (start_ptr)
            start_ptr = case_buf + (start_ptr - buf);
-         if (kwset)
-           mb_properties = check_multibyte_string(case_buf, size);
           buf = case_buf;
         }
-      else
-       if (kwset)
-         mb_properties = check_multibyte_string_no_icase(buf, size);
+
+      if (kwset)
+        mb_properties = check_multibyte_string(buf, size);
     }
 #endif /* MBS_SUPPORT */
 
@@ -512,11 +553,7 @@ EXECUTE_FCT(EGexecute)
  out:
 #ifdef MBS_SUPPORT
   if (MB_CUR_MAX > 1)
-    {
-      if (match_icase)
-        free ((char *) buf);
-      free (mb_properties);
-    }
+    free (mb_properties);
 #endif /* MBS_SUPPORT */
   return ret_val;
 }
@@ -525,16 +562,23 @@ EXECUTE_FCT(EGexecute)
 #if defined(GREP_PROGRAM) || defined(FGREP_PROGRAM)
 COMPILE_FCT(Fcompile)
 {
-  char const *beg, *end, *lim, *err;
+  char const *beg, *end, *lim, *err, *pat;
+  size_t psize;
 
   kwsinit ();
-  beg = pattern;
+  psize = size;
+  if (match_icase && MB_CUR_MAX > 1)
+    pat = mbtolower (pattern, &psize);
+  else
+    pat = pattern;
+
+  beg = pat;
   do
     {
       for (lim = beg;; ++lim)
        {
          end = lim;
-         if (lim >= pattern + size)
+         if (lim >= pat + psize)
            break;
         if (*lim == '\n')
           {
@@ -542,18 +586,19 @@ COMPILE_FCT(Fcompile)
             break;
           }
 #if HAVE_DOS_FILE_CONTENTS
-        if (*lim == '\r' && lim + 1 < pattern + size && lim[1] == '\n')
+        if (*lim == '\r' && lim + 1 < pat + psize && lim[1] == '\n')
           {
             lim += 2;
             break;
           }
 #endif
        }
+
       if ((err = kwsincr (kwset, beg, end - beg)) != NULL)
        error (EXIT_TROUBLE, 0, "%s", err);
       beg = lim;
     }
-  while (beg < pattern + size);
+  while (beg < pat + psize);
 
   if ((err = kwsprep (kwset)) != NULL)
     error (EXIT_TROUBLE, 0, "%s", err);
@@ -572,14 +617,13 @@ EXECUTE_FCT(Fexecute)
     {
       if (match_icase)
         {
-          char *case_buf = xmemdup (buf, size);
+          char *case_buf = mbtolower (buf, &size);
          if (start_ptr)
            start_ptr = case_buf + (start_ptr - buf);
-         mb_properties = check_multibyte_string(case_buf, size);
           buf = case_buf;
         }
-      else
-       mb_properties = check_multibyte_string_no_icase(buf, size);
+
+      mb_properties = check_multibyte_string(buf, size);
     }
 #endif /* MBS_SUPPORT */
 
@@ -644,11 +688,7 @@ EXECUTE_FCT(Fexecute)
  out:
 #ifdef MBS_SUPPORT
   if (MB_CUR_MAX > 1)
-    {
-      if (match_icase)
-        free ((char *) buf);
-      free (mb_properties);
-    }
+    free (mb_properties);
 #endif /* MBS_SUPPORT */
   return ret_val;
 }
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 6a56dcf..e27d1d5 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -17,6 +17,7 @@
 TESTS =                                                \
   backref.sh                                   \
   bre.sh                                       \
+  case-fold-backslash-w                                \
   case-fold-char-class                         \
   case-fold-char-range                         \
   case-fold-char-type                          \
diff --git a/tests/case-fold-backslash-w b/tests/case-fold-backslash-w
new file mode 100755
index 0000000..6ae7046
--- /dev/null
+++ b/tests/case-fold-backslash-w
@@ -0,0 +1,14 @@
+#!/bin/sh
+# test that \W works on case-insensitive matches.  It used to become \w.
+# Derived from https://savannah.gnu.org/bugs/?28162
+: ${srcdir=.}
+. "$srcdir/init.sh"; path_prepend_ ../src
+
+if echo foo bar | LANG=C.ASCII grep '^foo\W'; then
+  echo foo bar | LANG=C.ASCII grep -i '^foo\W' || fail_ ASCII insensitive
+else
+  echo foo bar | LANG=C grep '^foo\W' || fail_ LANG=C sensitive
+  echo foo bar | LANG=C grep -i '^foo\W' || fail_ LANG=C insensitive
+fi
+echo foo bar | LANG=en_US.UTF-8 grep '^foo\W' || fail_ UTF-8 sensitive
+echo foo bar | LANG=en_US.UTF-8 grep -i '^foo\W' || fail_ UTF-8 insensitive
diff --git a/tests/foad1.sh b/tests/foad1.sh
index 7c16d00..68acc77 100755
--- a/tests/foad1.sh
+++ b/tests/foad1.sh
@@ -42,9 +42,8 @@ grep_test ()
 
 # "-o" with "-i" should output an exact copy of the matching input text.
 grep_test "WordA/wordB/WORDC/" "Word/word/WORD/" "word" -o -i
-# Comment out cases that are known to fail. These should be uncommented after 
the 2.5.4 release. TAA.
-#grep_test "WordA/wordB/WORDC/" "Word/word/WORD/" "Word" -o -i
-#grep_test "WordA/wordB/WORDC/" "Word/word/WORD/" "WORD" -o -i
+grep_test "WordA/wordB/WORDC/" "Word/word/WORD/" "Word" -o -i
+grep_test "WordA/wordB/WORDC/" "Word/word/WORD/" "WORD" -o -i
 
 # Should display the line number (-n), octet offset (-b), or file name
 # (-H) of every match, not just of the first match on each input line.
@@ -82,9 +81,8 @@ CE="[m[K"
 
 # "--color" with "-i" should output an exact copy of the matching input text.
 grep_test "WordA/wordb/WORDC/" 
"${CB}Word${CE}A/${CB}word${CE}b/${CB}WORD${CE}C/" "word" --color=always -i
-# Comment out cases that are known to fail. These should be uncommented after 
the 2.5.4 release. TAA.
-#grep_test "WordA/wordb/WORDC/" 
"${CB}Word${CE}A/${CB}word${CE}b/${CB}WORD${CE}C/" "Word" --color=always -i
-#grep_test "WordA/wordb/WORDC/" 
"${CB}Word${CE}A/${CB}word${CE}b/${CB}WORD${CE}C/" "WORD" --color=always -i
+grep_test "WordA/wordb/WORDC/" 
"${CB}Word${CE}A/${CB}word${CE}b/${CB}WORD${CE}C/" "Word" --color=always -i
+grep_test "WordA/wordb/WORDC/" 
"${CB}Word${CE}A/${CB}word${CE}b/${CB}WORD${CE}C/" "WORD" --color=always -i
 
 # End of a previous match should not match a "start of ..." expression.
 grep_test "word_word/" "${CB}word_${CE}word/" "^word_*" --color=always
-- 
1.6.6.1


>From 9af3f4dbe1f2897e17cafa775091cef61580fb25 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <address@hidden>
Date: Sun, 7 Mar 2010 11:22:00 +0100
Subject: [PATCH 2/7] dfa: speed up handling of brackets

This patch has two sides.  One is to fold the parsing of brackets in the
single- and multi-byte cases.  The second is to leverage this change,
and use a bitset to test for single-byte characters in the charset.
Splitting the two would be very hard.

Testcase:
   yes 'the quick brown fox jumps over the lazy dog' | sed 100000q | \
     time grep -c [ABCDEFGHIJKLMNOPQRSTUVWXYZ,]

Before: 59ms (best of three runs); after: 51ms (best of three runs).
Nice, but mostly providing infrastructure for the next patch.

* src/dfa.c (setbit_case_fold): Try applying towlower/towupper.
(looking_at): Remove.
(FETCH_WC): New.
(fetch_wc): Merge into FETCH_WC [MBS_SUPPORT].
(FETCH) [MBS_SUPPORT]: Call FETCH_WC.
(prednames, find_pred, is_blank and other predicates): Move above.
(parse_bracket_exp): New name of parse_bracket_exp_mb, rewritten to
include single-byte character set parsing of brackets.
(lex): Adjust for fetch_wc->FETCH_WC change, remove single-byte
character set parsing of brackets.
(match_mb_charset): Test against work_mbc->cset.
* src/dfa.h (struct mb_char_classes): Add cset.
---
 .x-sc_space_tab |    1 -
 src/dfa.c       |  631 ++++++++++++++++++++++++++++---------------------------
 src/dfa.h       |    1 +
 3 files changed, 319 insertions(+), 314 deletions(-)

diff --git a/.x-sc_space_tab b/.x-sc_space_tab
index be9bbd8..efa4369 100644
--- a/.x-sc_space_tab
+++ b/.x-sc_space_tab
@@ -1,2 +1 @@
 \.diff$
-^src/dfa\.c$
diff --git a/src/dfa.c b/src/dfa.c
index 455bd70..c98296d 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -236,17 +236,40 @@ dfasyntax (reg_syntax_t bits, int fold, unsigned char eol)
   eolbyte = eol;
 }
 
-/* Like setbit, but if case is folded, set both cases of a letter.  */
+/* Like setbit, but if case is folded, set both cases of a letter.
+   For MB_CUR_MAX > 1, one or both of the two cases may not be set,
+   so the resulting charset may only be used as an optimization.  */
 static void
 setbit_case_fold (unsigned b, charclass c)
 {
-  setbit (b, c);
   if (case_fold)
     {
-      if (ISUPPER (b))
-       setbit (tolower (b), c);
-      else if (ISLOWER (b))
-       setbit (toupper (b), c);
+#ifdef MBS_SUPPORT
+      if (MB_CUR_MAX > 1)
+        {
+          wint_t b1 = iswupper(b) ? towlower(b) : b;
+          wint_t b2 = iswlower(b) ? towupper(b) : b;
+          if (wctob ((unsigned char)b1) == b1)
+            setbit (b1, c);
+          if (b2 != b1 && wctob ((unsigned char)b2) == b2)
+            setbit (b2, c);
+        }
+      else
+        {
+#endif
+          unsigned char b1 = ISUPPER(b) ? tolower(b) : b;
+          unsigned char b2 = ISLOWER(b) ? toupper(b) : b;
+         setbit (b1, c);
+          if (b2 != b1)
+            setbit (b2, c);
+        }
+    }
+  else
+    {
+#ifdef MBS_SUPPORT
+      if (wctob ((unsigned char)b) == b)
+#endif
+        setbit (b, c);
     }
 }
 
@@ -293,57 +316,57 @@ static unsigned char const *buf_end;      /* reference to 
end in dfaexec().  */
 
 #ifdef MBS_SUPPORT
 /* Note that characters become unsigned here. */
-# define FETCH(c, eoferr)                      \
+# define FETCH_WC(c, wc, eoferr)               \
   do {                                         \
     if (! lexleft)                             \
-     {                                         \
-       if (eoferr != 0)                        \
+      {                                                \
+        if (eoferr != 0)                       \
          dfaerror (eoferr);                    \
-       else                                    \
+        else                                   \
          return lasttok = END;                 \
       }                                                \
-    (c) = (unsigned char) *lexptr++;           \
-    --lexleft;                                 \
+    else                                       \
+      {                                                \
+        cur_mb_len = mbrtowc(&wc, lexptr, lexleft, &mbs); \
+        if (cur_mb_len <= 0)                   \
+          {                                    \
+            cur_mb_len = 1;                    \
+            --lexleft;                         \
+            wc = c = (unsigned char) *lexptr++;        \
+          }                                    \
+        else                                   \
+          {                                    \
+            lexptr += cur_mb_len;              \
+            lexleft -= cur_mb_len;             \
+            (c) = wctob(wc);                   \
+          }                                    \
+      }                                                \
   } while(0)
 
-/* This function fetch a wide character, and update cur_mb_len,
-   used only if the current locale is a multibyte environment.  */
-static wint_t
-fetch_wc (char const *eoferr)
-{
-  wchar_t wc;
-  if (! lexleft)
-    {
-      if (eoferr != 0)
-       dfaerror (eoferr);
-      else
-       return WEOF;
-    }
+# define FETCH(c, eoferr)                      \
+  do {                                         \
+    wint_t _wc;                                        \
+    FETCH_WC(c, _wc, eoferr);                  \
+  } while(0)
 
-  cur_mb_len = mbrtowc(&wc, lexptr, lexleft, &mbs);
-  if (cur_mb_len <= 0)
-   {
-      cur_mb_len = 1;
-      wc = (unsigned char) *lexptr;
-    }
-  lexptr += cur_mb_len;
-  lexleft -= cur_mb_len;
-  return wc;
-}
 #else
 /* Note that characters become unsigned here. */
-# define FETCH(c, eoferr)            \
-  do {                               \
-    if (! lexleft)                   \
+# define FETCH(c, eoferr)            \
+  do {                               \
+    if (! lexleft)                   \
       {                                      \
        if (eoferr != 0)              \
          dfaerror (eoferr);          \
-       else                          \
+       else                          \
          return lasttok = END;       \
       }                                      \
     (c) = (unsigned char) *lexptr++;  \
-    --lexleft;                       \
+    --lexleft;                       \
   } while(0)
+
+# define FETCH_WC(c, unused, eoferr)           \
+  FETCH(c, eoferr)
+
 #endif /* MBS_SUPPORT */
 
 static int
@@ -353,13 +376,76 @@ in_coll_range (char ch, char from, char to)
   return strcoll (&c[0], &c[2]) <= 0 && 0 <= strcoll (&c[2], &c[4]);
 }
 
-#ifdef MBS_SUPPORT
+#ifdef __STDC__
+#define FUNC(F, P) static int F(int c) { return P(c); }
+#else
+#define FUNC(F, P) static int F(c) int c; { return P(c); }
+#endif
+
+FUNC(is_alpha, ISALPHA)
+FUNC(is_upper, ISUPPER)
+FUNC(is_lower, ISLOWER)
+FUNC(is_digit, ISDIGIT)
+FUNC(is_xdigit, ISXDIGIT)
+FUNC(is_space, ISSPACE)
+FUNC(is_punct, ISPUNCT)
+FUNC(is_alnum, ISALNUM)
+FUNC(is_print, ISPRINT)
+FUNC(is_graph, ISGRAPH)
+FUNC(is_cntrl, ISCNTRL)
+
+static int
+is_blank (int c)
+{
+   return (c == ' ' || c == '\t');
+}
+
+typedef int predicate (int);
+
+/* The following list maps the names of the Posix named character classes
+   to predicate functions that determine whether a given character is in
+   the class.  The leading [ has already been eaten by the lexical analyzer. */
+static struct {
+  const char *name;
+  predicate *pred;
+} const prednames[] = {
+  { "alpha", is_alpha },
+  { "upper", is_upper },
+  { "lower", is_lower },
+  { "digit", is_digit },
+  { "xdigit", is_xdigit },
+  { "space", is_space },
+  { "punct", is_punct },
+  { "alnum", is_alnum },
+  { "print", is_print },
+  { "graph", is_graph },
+  { "cntrl", is_cntrl },
+  { "blank", is_blank },
+  { 0, 0 }
+};
+
+static predicate *
+find_pred (const char *str)
+{
+  int i;
+  for (i = 0; prednames[i].name; ++i)
+    if (!strcmp(str, prednames[i].name))
+      break;
+
+  return prednames[i].pred;
+}
+
 /* Multibyte character handling sub-routine for lex.
    This function  parse a bracket expression and build a struct
    mb_char_classes.  */
 static token
-parse_bracket_exp_mb (void)
+parse_bracket_exp (void)
 {
+  int invert;
+  int c, c1, c2;
+  charclass ccl;
+
+#ifdef MBS_SUPPORT
   wint_t wc, wc1, wc2;
 
   /* Work area to build a mb_char_classes.  */
@@ -367,63 +453,68 @@ parse_bracket_exp_mb (void)
   int chars_al, range_sts_al, range_ends_al, ch_classes_al,
     equivs_al, coll_elems_al;
 
-  REALLOC_IF_NECESSARY(dfa->mbcsets, struct mb_char_classes,
-                      dfa->mbcsets_alloc, dfa->nmbcsets + 1);
-  /* dfa->multibyte_prop[] hold the index of dfa->mbcsets.
-     We will update dfa->multibyte_prop[] in addtok(), because we can't
-     decide the index in dfa->tokens[].  */
-
-  /* Initialize work are */
-  work_mbc = &(dfa->mbcsets[dfa->nmbcsets++]);
-
   chars_al = 1;
   range_sts_al = range_ends_al = 0;
   ch_classes_al = equivs_al = coll_elems_al = 0;
+  if (MB_CUR_MAX > 1)
+    {
+      REALLOC_IF_NECESSARY(dfa->mbcsets, struct mb_char_classes,
+                           dfa->mbcsets_alloc, dfa->nmbcsets + 1);
+
+      /* dfa->multibyte_prop[] hold the index of dfa->mbcsets.
+         We will update dfa->multibyte_prop[] in addtok(), because we can't
+         decide the index in dfa->tokens[].  */
+
+      /* Initialize work area.  */
+      work_mbc = &(dfa->mbcsets[dfa->nmbcsets++]);
+      work_mbc->nchars = work_mbc->nranges = work_mbc->nch_classes = 0;
+      work_mbc->nequivs = work_mbc->ncoll_elems = 0;
+      work_mbc->chars = NULL;
+      work_mbc->ch_classes = NULL;
+      work_mbc->range_sts = work_mbc->range_ends = NULL;
+      work_mbc->equivs = work_mbc->coll_elems = NULL;
+    }
+  else
+    work_mbc = NULL;
+#endif
 
-  work_mbc->nchars = work_mbc->nranges = work_mbc->nch_classes = 0;
-  work_mbc->nequivs = work_mbc->ncoll_elems = 0;
-  work_mbc->chars = NULL;
-  work_mbc->ch_classes = NULL;
-  work_mbc->range_sts = work_mbc->range_ends = NULL;
-  work_mbc->equivs = work_mbc->coll_elems = NULL;
-
-  wc = fetch_wc(_("unbalanced ["));
-  if (wc == L'^')
+  memset (ccl, 0, sizeof(ccl));
+  FETCH_WC (c, wc, _("unbalanced ["));
+  if (c == '^')
     {
-      wc = fetch_wc(_("unbalanced ["));
-      work_mbc->invert = 1;
+      FETCH_WC (c, wc, _("unbalanced ["));
+      invert = 1;
     }
   else
-    work_mbc->invert = 0;
+    invert = 0;
+
   do
     {
-      wc1 = WEOF; /* mark wc1 is not initialized".  */
+      c1 = EOF; /* mark c1 is not initialized".  */
 
       /* Note that if we're looking at some other [:...:] construct,
         we just treat it as a bunch of ordinary characters.  We can do
         this because we assume regex has checked for syntax errors before
         dfa is ever called. */
-      if (wc == L'[' && (syntax_bits & RE_CHAR_CLASSES))
+      if (c == '[' && (syntax_bits & RE_CHAR_CLASSES))
        {
 #define BRACKET_BUFFER_SIZE 128
          char str[BRACKET_BUFFER_SIZE];
-         wc1 = wc;
-         wc = fetch_wc(_("unbalanced ["));
+         FETCH_WC (c1, wc1, _("unbalanced ["));
 
          /* If pattern contains `[[:', `[[.', or `[[='.  */
-         if (cur_mb_len == 1 && (wc == L':' || wc == L'.' || wc == L'='))
+         if (c1 == ':'
+#ifdef MBS_SUPPORT
+              /* TODO: handle `[[.' and `[[=' also for MB_CUR_MAX == 1.  */
+             || (MB_CUR_MAX > 1 && (c1 == '.' || c1 == '='))
+#endif
+             )
            {
-             unsigned char c;
-             unsigned char delim = (unsigned char)wc;
              int len = 0;
              for (;;)
                {
-                 if (! lexleft)
-                   dfaerror (_("unbalanced ["));
-                 c = (unsigned char) *lexptr++;
-                 --lexleft;
-
-                 if ((c == delim && *lexptr == ']') || lexleft == 0)
+                 FETCH (c, _("unbalanced ["));
+                 if ((c == c1 && *lexptr == ']') || lexleft == 0)
                    break;
                  if (len < BRACKET_BUFFER_SIZE)
                    str[len++] = c;
@@ -433,18 +524,9 @@ parse_bracket_exp_mb (void)
                }
              str[len] = '\0';
 
-             if (lexleft == 0)
-               {
-                 REALLOC_IF_NECESSARY(work_mbc->chars, wchar_t, chars_al,
-                                      work_mbc->nchars + 2);
-                 work_mbc->chars[work_mbc->nchars++] = L'[';
-                 work_mbc->chars[work_mbc->nchars++] = delim;
-                 break;
-               }
-
-             if (--lexleft, *lexptr++ != ']')
-               dfaerror (_("unbalanced ["));
-             if (delim == ':')
+              /* Fetch bracket.  */
+             FETCH (c, _("unbalanced ["));
+             if (c1 == ':')
                /* build character class.  */
                {
                  char const *class
@@ -452,24 +534,39 @@ parse_bracket_exp_mb (void)
                                     || !strcmp (str, "lower"))
                                       ? "alpha"
                                       : str);
-                 /* Query the character class as wctype_t.  */
-                 wctype_t wt = wctype (class);
-
-                 if (ch_classes_al == 0)
-                   MALLOC(work_mbc->ch_classes, wctype_t, ++ch_classes_al);
-                 REALLOC_IF_NECESSARY(work_mbc->ch_classes, wctype_t,
-                                      ch_classes_al,
-                                      work_mbc->nch_classes + 1);
-                 work_mbc->ch_classes[work_mbc->nch_classes++] = wt;
-
-               }
-             else if (delim == '=' || delim == '.')
+#ifdef MBS_SUPPORT
+                  if (MB_CUR_MAX > 1)
+                    {
+                     /* Store the character class as wctype_t.  */
+                      wctype_t wt = wctype (class);
+
+                      if (ch_classes_al == 0)
+                        MALLOC(work_mbc->ch_classes, wctype_t, 
++ch_classes_al);
+                      REALLOC_IF_NECESSARY(work_mbc->ch_classes, wctype_t,
+                                           ch_classes_al,
+                                           work_mbc->nch_classes + 1);
+                      work_mbc->ch_classes[work_mbc->nch_classes++] = wt;
+                    }
+#endif
+
+                  {
+                    predicate *pred = find_pred (class);
+                    if (!pred)
+                      dfaerror(_("invalid character class"));
+                    for (c2 = 0; c2 < NOTCHAR; ++c2)
+                      if ((*pred)(c2))
+                        setbit_case_fold (c2, ccl);
+                  }
+                }
+
+#ifdef MBS_SUPPORT
+             else if (c1 == '=' || c1 == '.')
                {
                  char *elem;
                  MALLOC(elem, char, len + 1);
                  strncpy(elem, str, len + 1);
 
-                 if (delim == '=')
+                 if (c1 == '=')
                    /* build equivalent class.  */
                    {
                      if (equivs_al == 0)
@@ -480,7 +577,7 @@ parse_bracket_exp_mb (void)
                      work_mbc->equivs[work_mbc->nequivs++] = elem;
                    }
 
-                 if (delim == '.')
+                 if (c1 == '.')
                    /* build collating element.  */
                    {
                      if (coll_elems_al == 0)
@@ -490,159 +587,158 @@ parse_bracket_exp_mb (void)
                                           work_mbc->ncoll_elems + 1);
                      work_mbc->coll_elems[work_mbc->ncoll_elems++] = elem;
                    }
-               }
-             wc1 = wc = WEOF;
-           }
-         else
-           /* We treat '[' as a normal character here.  */
-           {
-             wc2 = wc1; wc1 = wc; wc = wc2; /* swap */
+               }
+#endif
+
+              /* Fetch new lookahead character.  */
+             FETCH_WC (c1, wc1, _("unbalanced ["));
+              continue;
            }
-       }
-      else
-       {
-         if (wc == L'\\' && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
-           wc = fetch_wc(("unbalanced ["));
+
+          /* We treat '[' as a normal character here.  c/c1/wc/wc1
+             are already set up.  */
        }
 
-      if (wc1 == WEOF)
-       wc1 = fetch_wc(_("unbalanced ["));
+      if (c == '\\' && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
+        FETCH_WC(c, wc, _("unbalanced ["));
 
-      if (wc1 == L'-')
+      if (c1 == EOF)
+       FETCH_WC(c1, wc1, _("unbalanced ["));
+
+      if (c1 == '-')
        /* build range characters.  */
        {
-         wc2 = fetch_wc(_("unbalanced ["));
-         if (wc2 == L']')
+         FETCH_WC(c2, wc2, _("unbalanced ["));
+         if (c2 == ']')
            {
              /* In the case [x-], the - is an ordinary hyphen,
                 which is left in c1, the lookahead character. */
              lexptr -= cur_mb_len;
              lexleft += cur_mb_len;
-             wc2 = wc;
-           }
-         else
-           {
-             if (wc2 == L'\\'
-                 && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
-               wc2 = fetch_wc(_("unbalanced ["));
-             wc1 = fetch_wc(_("unbalanced ["));
-           }
+            }
+        }
 
-         /* When case folding map a range, say [m-z] (or even [M-z]) to the
-            pair of ranges, [m-z] [M-Z].  */
-         if (range_sts_al == 0)
-           {
-             MALLOC(work_mbc->range_sts, wchar_t, ++range_sts_al);
-             MALLOC(work_mbc->range_ends, wchar_t, ++range_ends_al);
-           }
-         REALLOC_IF_NECESSARY(work_mbc->range_sts, wchar_t,
-                              range_sts_al, work_mbc->nranges + 1);
-         REALLOC_IF_NECESSARY(work_mbc->range_ends, wchar_t,
-                              range_ends_al, work_mbc->nranges + 1);
-         work_mbc->range_sts[work_mbc->nranges] =
-            case_fold ? towlower(wc) : (wchar_t)wc;
-         work_mbc->range_ends[work_mbc->nranges++] =
-            case_fold ? towlower(wc2) : (wchar_t)wc2;
+      if (c1 == '-' && c2 != ']')
+        {
+          if (c2 == '\\'
+              && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
+            FETCH_WC(c2, wc2, _("unbalanced ["));
 
-#ifndef GREP
-         if (case_fold)
+#ifdef MBS_SUPPORT
+          if (MB_CUR_MAX > 1)
             {
+             /* When case folding map a range, say [m-z] (or even [M-z])
+                to the pair of ranges, [m-z] [M-Z].  */
+             if (range_sts_al == 0)
+                {
+                  MALLOC(work_mbc->range_sts, wchar_t, ++range_sts_al);
+                  MALLOC(work_mbc->range_ends, wchar_t, ++range_ends_al);
+                }
               REALLOC_IF_NECESSARY(work_mbc->range_sts, wchar_t,
                                    range_sts_al, work_mbc->nranges + 1);
-              work_mbc->range_sts[work_mbc->nranges] = towupper(wc);
               REALLOC_IF_NECESSARY(work_mbc->range_ends, wchar_t,
                                    range_ends_al, work_mbc->nranges + 1);
-              work_mbc->range_ends[work_mbc->nranges++] = towupper(wc2);
+              work_mbc->range_sts[work_mbc->nranges] =
+                case_fold ? towlower(wc) : (wchar_t)wc;
+              work_mbc->range_ends[work_mbc->nranges++] =
+                case_fold ? towlower(wc2) : (wchar_t)wc2;
+
+#ifndef GREP
+              if (case_fold && (iswalpha(wc) || iswalpha(wc2)))
+                {
+                  REALLOC_IF_NECESSARY(work_mbc->range_sts, wchar_t,
+                                       range_sts_al, work_mbc->nranges + 1);
+                  work_mbc->range_sts[work_mbc->nranges] = towupper(wc);
+                  REALLOC_IF_NECESSARY(work_mbc->range_ends, wchar_t,
+                                       range_ends_al, work_mbc->nranges + 1);
+                  work_mbc->range_ends[work_mbc->nranges++] = towupper(wc2);
+                }
+#endif
             }
+          else
 #endif
+            {
+              c1 = c;
+              if (case_fold)
+                {
+                  c1 = tolower (c1);
+                  c2 = tolower (c2);
+                }
+              if (!hard_LC_COLLATE)
+                for (c = c1; c <= c2; c++)
+                  setbit_case_fold (c, ccl);
+              else
+                for (c = 0; c < NOTCHAR; ++c)
+                  if (!(case_fold && ISUPPER (c))
+                      && in_coll_range (c, c1, c2))
+                    setbit_case_fold (c, ccl);
+            }
+
+          FETCH_WC(c1, wc1, _("unbalanced ["));
+         continue;
        }
-      else if (wc != WEOF)
-       /* build normal characters.  */
-       {
-         REALLOC_IF_NECESSARY(work_mbc->chars, wchar_t, chars_al,
-                              work_mbc->nchars + 1);
-         work_mbc->chars[work_mbc->nchars++] =
-               (wchar_t) (case_fold ? towlower(wc) : wc);
-#ifndef GREP
-         if (case_fold)
+
+      setbit_case_fold (c, ccl);
+#ifdef MBS_SUPPORT
+      /* Build normal characters.  */
+      if (MB_CUR_MAX > 1)
+        {
+          if (case_fold && iswalpha(wc))
+            {
+              wc = towlower(wc);
+              c = wctob(wc);
+              if (c == EOF || (wint_t)c != (wint_t)wc)
+                {
+                  REALLOC_IF_NECESSARY(work_mbc->chars, wchar_t, chars_al,
+                                       work_mbc->nchars + 1);
+                  work_mbc->chars[work_mbc->nchars++] = wc;
+                }
+#ifdef GREP
+             continue;
+#else
+              wc = towupper(wc);
+              c = wctob(wc);
+#endif
+            } 
+          if (c == EOF || (wint_t)c != (wint_t)wc)
             {
               REALLOC_IF_NECESSARY(work_mbc->chars, wchar_t, chars_al,
                                    work_mbc->nchars + 1);
-              work_mbc->chars[work_mbc->nchars++] = towupper(wc);
+              work_mbc->chars[work_mbc->nchars++] = wc;
             }
 #endif
         }
     }
-  while ((wc = wc1) != L']');
-  return MBCSET;
-}
-#endif /* MBS_SUPPORT */
+  while ((wc = wc1, (c = c1) != L']'));
 
-#ifdef __STDC__
-#define FUNC(F, P) static int F(int c) { return P(c); }
-#else
-#define FUNC(F, P) static int F(c) int c; { return P(c); }
+#ifdef MBS_SUPPORT
+  if (MB_CUR_MAX > 1)
+    {
+      static charclass zeroclass;
+      work_mbc->invert = invert;
+      work_mbc->cset = equal(ccl, zeroclass) ? -1 : charclass_index(ccl);
+      return MBCSET;
+    }
 #endif
 
-FUNC(is_alpha, ISALPHA)
-FUNC(is_upper, ISUPPER)
-FUNC(is_lower, ISLOWER)
-FUNC(is_digit, ISDIGIT)
-FUNC(is_xdigit, ISXDIGIT)
-FUNC(is_space, ISSPACE)
-FUNC(is_punct, ISPUNCT)
-FUNC(is_alnum, ISALNUM)
-FUNC(is_print, ISPRINT)
-FUNC(is_graph, ISGRAPH)
-FUNC(is_cntrl, ISCNTRL)
+  if (invert)
+    {
+      notset(ccl);
+      if (syntax_bits & RE_HAT_LISTS_NOT_NEWLINE)
+        clrbit(eolbyte, ccl);
+    }
 
-static int
-is_blank (int c)
-{
-   return (c == ' ' || c == '\t');
+  return CSET + charclass_index(ccl);
 }
 
-/* The following list maps the names of the Posix named character classes
-   to predicate functions that determine whether a given character is in
-   the class.  The leading [ has already been eaten by the lexical analyzer. */
-static struct {
-  const char *name;
-  int (*pred) (int);
-} const prednames[] = {
-  { ":alpha:]", is_alpha },
-  { ":upper:]", is_upper },
-  { ":lower:]", is_lower },
-  { ":digit:]", is_digit },
-  { ":xdigit:]", is_xdigit },
-  { ":space:]", is_space },
-  { ":punct:]", is_punct },
-  { ":alnum:]", is_alnum },
-  { ":print:]", is_print },
-  { ":graph:]", is_graph },
-  { ":cntrl:]", is_cntrl },
-  { ":blank:]", is_blank },
-  { 0, 0 }
-};
-
 /* Return non-zero if C is a `word-constituent' byte; zero otherwise.  */
 #define IS_WORD_CONSTITUENT(C) (ISALNUM(C) || (C) == '_')
 
-static int
-looking_at (char const *s)
-{
-  size_t len;
-
-  len = strlen(s);
-  if (lexleft < len)
-    return 0;
-  return strncmp(s, lexptr, len) == 0;
-}
-
 static token
 lex (void)
 {
-  unsigned c, c1, c2;
-  int backslash = 0, invert;
+  unsigned c, c2;
+  int backslash = 0;
   charclass ccl;
   int i;
 
@@ -657,10 +753,7 @@ lex (void)
 #ifdef MBS_SUPPORT
       if (MB_CUR_MAX > 1)
         {
-          wint_t wi = fetch_wc (NULL);
-          if (wi == WEOF)
-            return lasttok = EOF;
-          wctok = wi, c = wctob (wi);
+          FETCH_WC (c, wctok, NULL);
           if ((int)c == EOF)
             goto normal_char;
         }
@@ -941,100 +1034,7 @@ lex (void)
          if (backslash)
            goto normal_char;
          laststart = 0;
-#ifdef MBS_SUPPORT
-         if (MB_CUR_MAX > 1)
-           {
-             /* In multibyte environment a bracket expression may contain
-                multibyte characters, which must be treated as characters
-                (not bytes).  So we parse it by parse_bracket_exp_mb().  */
-             return lasttok = parse_bracket_exp_mb();
-           }
-#endif
-         zeroset(ccl);
-         FETCH(c, _("unbalanced ["));
-         if (c == '^')
-           {
-             FETCH(c, _("unbalanced ["));
-             invert = 1;
-           }
-         else
-           invert = 0;
-         do
-           {
-             /* Nobody ever said this had to be fast. :-)
-                Note that if we're looking at some other [:...:]
-                construct, we just treat it as a bunch of ordinary
-                characters.  We can do this because we assume
-                regex has checked for syntax errors before
-                dfa is ever called. */
-             if (c == '[' && (syntax_bits & RE_CHAR_CLASSES))
-               for (c1 = 0; prednames[c1].name; ++c1)
-                 if (looking_at(prednames[c1].name))
-                   {
-                     int (*pred) (int) = prednames[c1].pred;
-
-                     for (c2 = 0; c2 < NOTCHAR; ++c2)
-                       if ((*pred)(c2))
-                         setbit_case_fold (c2, ccl);
-                     lexptr += strlen(prednames[c1].name);
-                     lexleft -= strlen(prednames[c1].name);
-                     FETCH(c1, _("unbalanced ["));
-                     goto skip;
-                   }
-             if (c == '\\' && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
-               FETCH(c, _("unbalanced ["));
-             FETCH(c1, _("unbalanced ["));
-             if (c1 == '-')
-               {
-                 FETCH(c2, _("unbalanced ["));
-                 if (c2 == ']')
-                   {
-                     /* In the case [x-], the - is an ordinary hyphen,
-                        which is left in c1, the lookahead character. */
-                     --lexptr;
-                     ++lexleft;
-                   }
-                 else
-                   {
-                     if (c2 == '\\'
-                         && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
-                       FETCH(c2, _("unbalanced ["));
-
-                      c1 = c;
-                     if (!hard_LC_COLLATE)
-                       for (c = c1; c <= c2; c++)
-                         setbit_case_fold (c, ccl);
-                     else
-                        {
-                          if (case_fold)
-                            {
-                              c1 = tolower (c1);
-                              c2 = tolower (c2);
-                            }
-                          for (c = 0; c < NOTCHAR; ++c)
-                            if (!(case_fold && ISUPPER (c))
-                                && in_coll_range (c, c1, c2))
-                              setbit_case_fold (c, ccl);
-                        }
-
-                     FETCH(c1, _("unbalanced ["));
-                     continue;
-                   }
-               }
-
-             setbit_case_fold (c, ccl);
-
-           skip:
-             ;
-           }
-         while ((c = c1) != ']');
-         if (invert)
-           {
-             notset(ccl);
-             if (syntax_bits & RE_HAT_LISTS_NOT_NEWLINE)
-               clrbit(eolbyte, ccl);
-           }
-         return lasttok = CSET + charclass_index(ccl);
+         return lasttok = parse_bracket_exp();
 
        default:
        normal_char:
@@ -2475,6 +2475,11 @@ match_mb_charset (struct dfa *d, int s, position pos, 
int idx)
   match = !work_mbc->invert;
   match_len = (mblen_buf[idx] == 0)? 1 : mblen_buf[idx];
 
+  /* Match in range 0-255?  */
+  if (wc < NOTCHAR && work_mbc->cset != -1
+      && tstbit((unsigned char)wc, d->charclasses[work_mbc->cset]))
+    goto charset_matched;
+
   /* match with a character class?  */
   for (i = 0; i<work_mbc->nch_classes; i++)
     {
diff --git a/src/dfa.h b/src/dfa.h
index 4ca55f0..b8eb0c2 100644
--- a/src/dfa.h
+++ b/src/dfa.h
@@ -243,6 +243,7 @@ struct dfamust
    e.g. [a-c], [[:alpha:]], etc.  */
 struct mb_char_classes
 {
+  int cset;
   int invert;
   wchar_t *chars;              /* Normal characters.  */
   int nchars;
-- 
1.6.6.1


>From dcc9436b6eb56f491fa2afee70deb91b66b2ec48 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <address@hidden>
Date: Mon, 8 Mar 2010 12:20:37 +0100
Subject: [PATCH 3/7] dfa: optimize simple character sets under UTF-8 charsets

Only use a bitset when possible without involving MBCSET.  Testcase:
   yes 'the quick brown fox jumps over the lazy dog' | sed 100000q | \
     time grep -c [ABCDEFGHIJKLMNOPQRSTUVWXYZ,]

Before: 51ms (best of three runs); after: 16ms(best of three runs).

* src/dfa.c (check_utf8, using_utf8): New.
(parse_bracket_exp): For simple bracket expressions under UTF-8,
use a CSET.
(dfacomp): Call check_utf8.
---
 src/dfa.c |   34 +++++++++++++++++++++++++++++++++-
 1 files changed, 33 insertions(+), 1 deletions(-)

diff --git a/src/dfa.c b/src/dfa.c
index c98296d..e158873 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -21,6 +21,7 @@
    Modified July, 1988 by Arthur David Olson to assist BMG speedups  */
 
 #include <config.h>
+#include <assert.h>
 #include <ctype.h>
 #include <stdio.h>
 #include <sys/types.h>
@@ -78,6 +79,7 @@
 /* We can handle multibyte strings. */
 # include <wchar.h>
 # include <wctype.h>
+# include <langinfo.h>
 #endif
 
 #include "regex.h"
@@ -312,8 +314,27 @@ static wchar_t *inputwcs;  /* Wide character 
representation of input
                                   And inputwcs[i] is the codepoint.  */
 static unsigned char const *buf_begin; /* reference to begin in dfaexec().  */
 static unsigned char const *buf_end;   /* reference to end in dfaexec().  */
+
+/* UTF-8 encoding allows some optimizations that we can't otherwise
+   assume in a multibyte encoding. */
+static int using_utf8;
+
+static void
+check_utf8 (void)
+{
+#ifdef HAVE_LANGINFO_CODESET
+  if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
+    using_utf8 = 1;
+#endif
+}
+#else
+static void
+check_utf8 (void)
+{
+}
 #endif /* MBS_SUPPORT  */
 
+
 #ifdef MBS_SUPPORT
 /* Note that characters become unsigned here. */
 # define FETCH_WC(c, wc, eoferr)               \
@@ -712,7 +733,14 @@ parse_bracket_exp (void)
   while ((wc = wc1, (c = c1) != L']'));
 
 #ifdef MBS_SUPPORT
-  if (MB_CUR_MAX > 1)
+  if (MB_CUR_MAX > 1
+      && (!using_utf8
+         || invert
+          || work_mbc->nchars != 0
+          || work_mbc->nch_classes != 0
+          || work_mbc->nranges != 0
+          || work_mbc->nequivs != 0
+          || work_mbc->ncoll_elems != 0))
     {
       static charclass zeroclass;
       work_mbc->invert = invert;
@@ -723,6 +751,9 @@ parse_bracket_exp (void)
 
   if (invert)
     {
+#ifdef MBS_SUPPORT
+      assert(MB_CUR_MAX == 1);
+#endif
       notset(ccl);
       if (syntax_bits & RE_HAT_LISTS_NOT_NEWLINE)
         clrbit(eolbyte, ccl);
@@ -2941,6 +2972,7 @@ dfainit (struct dfa *d)
 void
 dfacomp (char const *s, size_t len, struct dfa *d, int searchflag)
 {
+  check_utf8();
   dfainit(d);
   dfaparse(s, len, d);
   dfamust(d);
-- 
1.6.6.1


>From 3a4b1d660c3fa4e5bd169678b98a694a6340196a Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <address@hidden>
Date: Mon, 8 Mar 2010 12:42:58 +0100
Subject: [PATCH 4/7] dfa: cache MB_CUR_MAX for dfaexec

* src/dfa.c (state_index, dfaexec): Use d->mb_cur_max.
(dfainit): Initialize it.
(free_mbdata): New, extracted out of dfafree.
(dfafree): Use it.
---
 src/dfa.c |   73 ++++++++++++++++++++++++++++++++++++-------------------------
 src/dfa.h |    2 +
 2 files changed, 45 insertions(+), 30 deletions(-)

diff --git a/src/dfa.c b/src/dfa.c
index e158873..8b73199 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -2131,7 +2131,7 @@ dfastate (int s, struct dfa *d, int trans[])
          insert(d->follows[grps[i].elems[j].index].elems[k], &follows);
 
 #ifdef MBS_SUPPORT
-      if (MB_CUR_MAX > 1)
+      if (d->mb_cur_max > 1)
        {
          /* If a token in follows.elems is not 1st byte of a multibyte
             character, or the states of follows must accept the bytes
@@ -2166,7 +2166,7 @@ dfastate (int s, struct dfa *d, int trans[])
       /* If we are building a searching matcher, throw in the positions
         of state 0 as well. */
 #ifdef MBS_SUPPORT
-      if (d->searchflag && (MB_CUR_MAX == 1 || !next_isnt_1st_byte))
+      if (d->searchflag && (d->mb_cur_max == 1 || !next_isnt_1st_byte))
 #else
       if (d->searchflag)
 #endif
@@ -2790,7 +2790,7 @@ dfaexec (struct dfa *d, char const *begin, char *end,
   *end = eol;
 
 #ifdef MBS_SUPPORT
-  if (MB_CUR_MAX > 1)
+  if (d->mb_cur_max > 1)
     {
       int remain_bytes, i;
       buf_begin = (unsigned char *) begin;
@@ -2835,7 +2835,7 @@ dfaexec (struct dfa *d, char const *begin, char *end,
   for (;;)
     {
 #ifdef MBS_SUPPORT
-      if (MB_CUR_MAX > 1)
+      if (d->mb_cur_max > 1)
        while ((t = trans[s]))
          {
            if ((char *) p > end)
@@ -2872,7 +2872,7 @@ dfaexec (struct dfa *d, char const *begin, char *end,
              if (backref)
                *backref = (d->states[s].backref != 0);
 #ifdef MBS_SUPPORT
-             if (MB_CUR_MAX > 1)
+             if (d->mb_cur_max > 1)
                {
                  free(mblen_buf);
                  free(inputwcs);
@@ -2884,7 +2884,7 @@ dfaexec (struct dfa *d, char const *begin, char *end,
 
          s1 = s;
 #ifdef MBS_SUPPORT
-         if (MB_CUR_MAX > 1)
+         if (d->mb_cur_max > 1)
            {
               /* Can match with a multibyte character (and multicharacter
                  collating element).  Transition table might be updated.  */
@@ -2905,7 +2905,7 @@ dfaexec (struct dfa *d, char const *begin, char *end,
       if ((char *) p > end)
        {
 #ifdef MBS_SUPPORT
-         if (MB_CUR_MAX > 1)
+         if (d->mb_cur_max > 1)
            {
              free(mblen_buf);
              free(inputwcs);
@@ -2932,6 +2932,37 @@ dfaexec (struct dfa *d, char const *begin, char *end,
     }
 }
 
+static void
+free_mbdata (struct dfa *d)
+{
+  int i;
+
+  free(d->multibyte_prop);
+  d->multibyte_prop = NULL;
+
+  for (i = 0; i < d->nmbcsets; ++i)
+    {
+      int j;
+      struct mb_char_classes *p = &(d->mbcsets[i]);
+      free(p->chars);
+      free(p->ch_classes);
+      free(p->range_sts);
+      free(p->range_ends);
+
+      for (j = 0; j < p->nequivs; ++j)
+        free(p->equivs[j]);
+      free(p->equivs);
+
+      for (j = 0; j < p->ncoll_elems; ++j)
+        free(p->coll_elems[j]);
+      free(p->coll_elems);
+    }
+
+  free(d->mbcsets);
+  d->mbcsets = NULL;
+  d->nmbcsets = 0;
+}
+
 /* Initialize the components of a dfa that the other routines don't
    initialize for themselves. */
 void
@@ -2944,8 +2975,10 @@ dfainit (struct dfa *d)
   d->talloc = 1;
   MALLOC(d->tokens, token, d->talloc);
   d->tindex = d->depth = d->nleaves = d->nregexps = 0;
+
 #ifdef MBS_SUPPORT
-  if (MB_CUR_MAX > 1)
+  d->mb_cur_max = MB_CUR_MAX;
+  if (d->mb_cur_max > 1)
     {
       d->nmultibyte_prop = 1;
       MALLOC(d->multibyte_prop, int, d->nmultibyte_prop);
@@ -2990,28 +3023,8 @@ dfafree (struct dfa *d)
   free(d->tokens);
 
 #ifdef MBS_SUPPORT
-  if (MB_CUR_MAX > 1)
-    {
-      free(d->multibyte_prop);
-      for (i = 0; i < d->nmbcsets; ++i)
-       {
-         int j;
-         struct mb_char_classes *p = &(d->mbcsets[i]);
-         free(p->chars);
-         free(p->ch_classes);
-         free(p->range_sts);
-         free(p->range_ends);
-
-         for (j = 0; j < p->nequivs; ++j)
-           free(p->equivs[j]);
-         free(p->equivs);
-
-         for (j = 0; j < p->ncoll_elems; ++j)
-           free(p->coll_elems[j]);
-         free(p->coll_elems);
-       }
-      free(d->mbcsets);
-    }
+  if (d->mb_cur_max > 1)
+    free_mbdata(d);
 #endif /* MBS_SUPPORT */
 
   for (i = 0; i < d->sindex; ++i) {
diff --git a/src/dfa.h b/src/dfa.h
index b8eb0c2..89460aa 100644
--- a/src/dfa.h
+++ b/src/dfa.h
@@ -278,6 +278,8 @@ struct dfa
   int nregexps;                        /* Count of parallel regexps being built
                                   with dfaparse(). */
 #ifdef MBS_SUPPORT
+  int mb_cur_max;              /* Cached value of MB_CUR_MAX.  */
+
   /* These stuff are used only if MB_CUR_MAX > 1 or multibyte environments.  */
   int nmultibyte_prop;
   int *multibyte_prop;
-- 
1.6.6.1


>From 9b76db1f59230a7180498a070e1be331a75e2fd0 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <address@hidden>
Date: Mon, 8 Mar 2010 12:47:23 +0100
Subject: [PATCH 5/7] dfa: run simple UTF-8 regexps as a single-byte character 
set

This partially works around https://savannah.gnu.org/bugs/?29117,
but in general provides a speedup whenever fgrep is "almost" sufficient
but not quite (e.g. grep ^abc).  Speedup is too good to be true :-)
(can get to 1000x on some not-too-contrived testcases).

* src/dfa.c (dfaoptimize): New.
(dfacomp): Call it.
---
 src/dfa.c |   25 +++++++++++++++++++++++++
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/src/dfa.c b/src/dfa.c
index 8b73199..5a549b3 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -3001,6 +3001,30 @@ dfainit (struct dfa *d)
 #endif
 }
 
+static void
+dfaoptimize (struct dfa *d)
+{
+  unsigned i;
+  if (!using_utf8)
+    return;
+
+  for (i = 0; i < d->tindex; ++i)
+    {
+      switch(d->tokens[i])
+       {
+       case ANYCHAR:
+       case MBCSET:
+         /* Requires multi-byte algorithm.  */
+         return;
+       default:
+         break;
+       }
+    }
+
+  free_mbdata (d);
+  d->mb_cur_max = 1;
+}
+
 /* Parse and analyze a single string of the given length. */
 void
 dfacomp (char const *s, size_t len, struct dfa *d, int searchflag)
@@ -3008,6 +3032,7 @@ dfacomp (char const *s, size_t len, struct dfa *d, int 
searchflag)
   check_utf8();
   dfainit(d);
   dfaparse(s, len, d);
+  dfaoptimize(d);
   dfamust(d);
   dfaanalyze(d, searchflag);
 }
-- 
1.6.6.1


>From 4a9c00d48786fdb2e5da35f87582de9a6db49337 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <address@hidden>
Date: Mon, 8 Mar 2010 15:52:40 +0100
Subject: [PATCH 6/7] grep: remove check_multibyte_string, fix non-UTF8 missed 
match

Avoid computing ahead something that can be computed lazily as efficiently
(or more efficiently in the case of UTF-8, though this is left as TODO).
At the same time, "soften" the rejection condition for matching in the
middle of a multibyte sequence to fix bug 23814.

Multibyte "grep -i" is still very slow.

* NEWS: Document bugfix.
* src/search.c (check_multibyte_string): Rewrite as...
(is_mb_middle): ... this.
(EGexecute, Fexecute): Adjust.
* tests/Makefile.am (TESTS): Add euc-mb.
* tests/euc-mb: New testcase.
---
 NEWS              |    4 +++
 src/search.c      |   69 ++++++++++++++++++++++++----------------------------
 tests/Makefile.am |    1 +
 tests/euc-mb      |   23 +++++++++++++++++
 4 files changed, 60 insertions(+), 37 deletions(-)
 create mode 100644 tests/euc-mb

diff --git a/NEWS b/NEWS
index d326f2e..3211179 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,10 @@ GNU grep NEWS                                    -*- outline 
-*-
   grep -i -o would fail to report some matches; grep -i --color, while not
   missing any line containing a match, would fail to color some matches.
 
+  grep would fail to report a match in a multibyte character set other than
+  UTF-8, if another match occurred earlier in the line but started in the
+  middle of a multibyte character.
+
   Various bugs in grep -P, caused by expressions such as [^b] or \S matching
   newlines, were fixed.  grep -P also supports the special sequences \Z and
   \z, and can be combined with the command-line option -z to perform searches
diff --git a/src/search.c b/src/search.c
index d7cf0e8..6533c16 100644
--- a/src/search.c
+++ b/src/search.c
@@ -219,20 +219,22 @@ kwsmusts (void)
 
 #ifdef MBS_SUPPORT
 
-static char*
-check_multibyte_string(const char *buf, size_t size)
+static bool
+is_mb_middle(const char **good, const char *buf, const char *end)
 {
-  char *mb_properties = xcalloc(size, 1);
+  const char *p = *good, *prev = p;
   mbstate_t cur_state;
-  wchar_t wc;
-  int i;
 
+  /* TODO: can be optimized for UTF-8.  */
   memset(&cur_state, 0, sizeof(mbstate_t));
-
-  for (i = 0; i < size ;)
+  while (p < buf)
     {
       size_t mbclen;
-      mbclen = mbrtowc(&wc, buf + i, size - i, &cur_state);
+      mbclen = mbrlen(p, end - p, &cur_state);
+
+      /* Store the beginning of the previous complete multibyte character.  */
+      if (mbclen != (size_t) -2)
+        prev = p;
 
       if (mbclen == (size_t) -1 || mbclen == (size_t) -2 || mbclen == 0)
        {
@@ -240,11 +242,11 @@ check_multibyte_string(const char *buf, size_t size)
             We treat it as a single byte character.  */
          mbclen = 1;
        }
-      mb_properties[i] = mbclen;
-      i += mbclen;
+      p += mbclen;
     }
 
-  return mb_properties;
+  *good = prev;
+  return p > buf;
 }
 #endif /* MBS_SUPPORT */
 
@@ -366,13 +368,12 @@ COMPILE_FCT(Ecompile)
 
 EXECUTE_FCT(EGexecute)
 {
-  register char const *buflim, *beg, *end, *match, *best_match;
+  char const *buflim, *beg, *end, *match, *best_match, *mb_start;
   char eol = eolbyte;
   int backref, start, len, best_len;
   struct kwsmatch kwsm;
   size_t i, ret_val;
 #ifdef MBS_SUPPORT
-  char *mb_properties = NULL;
   if (MB_CUR_MAX > 1)
     {
       if (match_icase)
@@ -384,12 +385,10 @@ EXECUTE_FCT(EGexecute)
            start_ptr = case_buf + (start_ptr - buf);
           buf = case_buf;
         }
-
-      if (kwset)
-        mb_properties = check_multibyte_string(buf, size);
     }
 #endif /* MBS_SUPPORT */
 
+  mb_start = buf;
   buflim = buf + size;
 
   for (beg = end = buf; end < buflim; beg = end)
@@ -410,14 +409,18 @@ EXECUTE_FCT(EGexecute)
                end++;
               else
                 end = buflim;
-#ifdef MBS_SUPPORT
-             if (MB_CUR_MAX > 1 && mb_properties[beg - buf] == 0)
-               continue;
-#endif
+             match = beg;
              while (beg > buf && beg[-1] != eol)
                --beg;
              if (kwsm.index < kwset_exact_matches)
-               goto success;
+                {
+#ifdef MBS_SUPPORT
+                  if (mb_start < beg)
+                    mb_start = beg;
+                  if (MB_CUR_MAX == 1 || !is_mb_middle (&mb_start, match, 
buflim))
+#endif
+                    goto success;
+                }
              if (dfaexec (&dfa, beg, (char *) end, 0, NULL, &backref) == NULL)
                continue;
            }
@@ -551,10 +554,6 @@ EXECUTE_FCT(EGexecute)
   *match_size = len;
   ret_val = beg - buf;
  out:
-#ifdef MBS_SUPPORT
-  if (MB_CUR_MAX > 1)
-    free (mb_properties);
-#endif /* MBS_SUPPORT */
   return ret_val;
 }
 #endif /* defined(GREP_PROGRAM) || defined(EGREP_PROGRAM) */
@@ -606,13 +605,12 @@ COMPILE_FCT(Fcompile)
 
 EXECUTE_FCT(Fexecute)
 {
-  register char const *beg, *try, *end;
-  register size_t len;
+  char const *beg, *try, *end, *mb_start;
+  size_t len;
   char eol = eolbyte;
   struct kwsmatch kwsmatch;
   size_t ret_val;
 #ifdef MBS_SUPPORT
-  char *mb_properties = NULL;
   if (MB_CUR_MAX > 1)
     {
       if (match_icase)
@@ -622,19 +620,20 @@ EXECUTE_FCT(Fexecute)
            start_ptr = case_buf + (start_ptr - buf);
           buf = case_buf;
         }
-
-      mb_properties = check_multibyte_string(buf, size);
     }
 #endif /* MBS_SUPPORT */
 
-  for (beg = start_ptr ? start_ptr : buf; beg <= buf + size; beg++)
+  for (mb_start = beg = start_ptr ? start_ptr : buf; beg <= buf + size; beg++)
     {
       size_t offset = kwsexec (kwset, beg, buf + size - beg, &kwsmatch);
       if (offset == (size_t) -1)
        goto failure;
 #ifdef MBS_SUPPORT
-      if (MB_CUR_MAX > 1 && mb_properties[offset+beg-buf] == 0)
-       continue; /* It is a part of multibyte character.  */
+      if (MB_CUR_MAX > 1 && is_mb_middle (&mb_start, beg + offset, buf + size))
+        {
+          beg = mb_start - 1;
+          continue; /* It is a part of multibyte character.  */
+        }
 #endif /* MBS_SUPPORT */
       beg += offset;
       len = kwsmatch.size[0];
@@ -686,10 +685,6 @@ EXECUTE_FCT(Fexecute)
   *match_size = len;
   ret_val = beg - buf;
  out:
-#ifdef MBS_SUPPORT
-  if (MB_CUR_MAX > 1)
-    free (mb_properties);
-#endif /* MBS_SUPPORT */
   return ret_val;
 }
 #endif /* defined(GREP_PROGRAM) || defined(FGREP_PROGRAM) */
diff --git a/tests/Makefile.am b/tests/Makefile.am
index e27d1d5..6262af9 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -24,6 +24,7 @@ TESTS =                                               \
   dfaexec-multibyte                            \
   empty.sh                                     \
   ere.sh                                       \
+  euc-mb                                       \
   fedora                                       \
   file.sh                                      \
   fmbtest.sh                                   \
diff --git a/tests/euc-mb b/tests/euc-mb
new file mode 100644
index 0000000..a59c295
--- /dev/null
+++ b/tests/euc-mb
@@ -0,0 +1,23 @@
+#!/bin/sh
+# test that matches starting in the middle of a multibyte char aren't rejected
+# too greedily.
+# Derived from https://savannah.gnu.org/bugs/?23814
+: ${srcdir=.}
+. "$srcdir/init.sh"; path_prepend_ ../src
+
+make_input () {
+  echo "$1" | tr AB '\244\263'
+}
+
+euc_grep () {
+  LC_ALL=ja_JP.EUC-JP grep `make_input "$1"`
+}
+
+if make_input BABA |euc_grep AB ; then
+  skip_ 'EUC-JP locale seems not to work'
+fi
+
+make_input BABAAB |euc_grep AB || \
+  fail_ 'whole line rejected after matching in the middle of a multibyte char'
+
+exit 0
-- 
1.6.6.1


>From ba430e6d6e1fee517cbf90e49ad4fa6b7582984a Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <address@hidden>
Date: Mon, 8 Mar 2010 16:12:47 +0100
Subject: [PATCH 7/7] grep: match multibyte charsets line-by-line when using -i

The turtle combination -i + MB_CUR_MAX>1 requires case conversion ahead
of time.  Avoid doing this repeatedly when many matches succeed.  Together
with the previous changes, this fixes https://savannah.gnu.org/bugs/?29117
and https://savannah.gnu.org/bugs/?14472.

* NEWS: Document new speedup.
* src/grep.c (do_execute): New.
(grepbuf): Use it.
---
 NEWS       |    6 ++++++
 src/grep.c |   40 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index 3211179..42611fa 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,12 @@ GNU grep NEWS                                    -*- outline 
-*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** Speed improvements
+
+  grep is much faster on multibyte character sets, especially (but not
+  limited to) UTF-8 character sets.  The speed improvement is also very
+  pronounced with case-insensitive matches.
+
 ** Bug fixes
 
   Character classes would malfunction in multi-byte locales when using grep -i.
diff --git a/src/grep.c b/src/grep.c
index f1d341a..19e04e2 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -1025,6 +1025,42 @@ prtext (char const *beg, char const *lim, int *nlinesp)
   used = 1;
 }
 
+static EXECUTE_RET do_execute EXECUTE_ARGS
+{
+  const char *line_buf, *line_end, *line_next;
+  size_t result = (size_t) -1;
+
+  /* -i is a real turtle with multibyte character sts, so match
+     line-by-line.
+
+     FIXME: this is just an ugly workaround, and it doesn't really
+     belong here.  Also, PCRE is always using this same per-line
+     matching algorithm.  Either we fix -i, or we should refactor
+     this code---for example, we could adding another function pointer
+     to struct matcher to split the buffer passed to execute.  It would
+     perform the memchr if line-by-line matching is necessary, or just
+     returns buf + size otherwise.  */
+  if (MB_CUR_MAX == 1 || !match_icase)
+    return execute(buf, size, match_size, start_ptr);
+
+  for (line_next = buf; result == (size_t)-1 && line_next < buf + size; )
+    {
+      line_buf = line_next;
+      line_end = memchr (line_buf, eolbyte, (buf + size) - line_buf);
+      if (line_end == NULL)
+        line_next = line_end = buf + size;
+      else
+        line_next = line_end + 1;
+
+      if (start_ptr && start_ptr >= line_end)
+        continue;
+
+      result = execute (line_buf, line_next - line_buf, match_size, start_ptr);
+    }
+
+  return result == (size_t)-1 ? result : (line_buf - buf) + result;
+}
+
 /* Scan the specified portion of the buffer, matching lines (or
    between matching lines if OUT_INVERT is true).  Return a count of
    lines printed. */
@@ -1038,8 +1074,8 @@ grepbuf (char const *beg, char const *lim)
 
   nlines = 0;
   p = beg;
-  while ((match_offset = execute(p, lim - p, &match_size,
-                                NULL)) != (size_t) -1)
+  while ((match_offset = do_execute(p, lim - p, &match_size,
+                                   NULL)) != (size_t) -1)
     {
       char const *b = p + match_offset;
       char const *endp = b + match_size;
-- 
1.6.6.1
[Prev in Thread]
Current Thread
[Next in Thread]
rebased utf-8 series, Paolo Bonzini <=
Prev by Date: Re: [PATCH 2/9] dfa: fix handling of ranges in multibyte character sets
Next by Date: Re: [PATCH 9/9] grep: match multibyte charsets line-by-line when using -i
Previous by thread: [PATCH v2 0/9] UTF-8 speedups
Next by thread: [bug #17623] add option to grep -R to NOT follow symlinks to directory
Index(es):
- Date
- Thread