grep branch, master, updated. v2.20-69-g1519c4e


From: Jim Meyering
Subject: grep branch, master, updated. v2.20-69-g1519c4e
Date: Sun, 26 Oct 2014 15:06:36 +0000

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "grep".

The branch, master has been updated
       via  1519c4e5e4bf68ec348bfe4261f78768710aa985 (commit)
      from  a0a142906f09222fa0de40a7f4867997d31a909c (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=1519c4e5e4bf68ec348bfe4261f78768710aa985


commit 1519c4e5e4bf68ec348bfe4261f78768710aa985
Author: Norihiro Tanaka <address@hidden>
Date:   Sat Oct 11 11:38:09 2014 +0900

    dfa: avoid false match in a non-UTF8 multibyte locale
    
    This command should print nothing:
    
      printf '\263\244\263\244\n' \
        | LC_ALL=ja_JP.eucJP grep -E "$(printf '^x|\244\263')"
    
    Before this patch, it would print its sole input line.
    * src/dfa.c (struct dfa): Add new members: min_trcount,
    initstate_letter, initstate_others.
    (dfaanalyze): Build initial states not only for the newline context
    but also for the other contexts.
    (build_state): Don't release initial states.
    (skip_remains_mb): Add a parameter.
    Add a comment describing all parameters.
    (dfaexec_main): When there are multiple start states, we are about
    to transition from one state to another, and the current byte is not
    the first byte of a multibyte character, first advance past the
    current multibyte character.
    * tests/euc-mb: Add a new test.
    * NEWS (Bug fixes): Mention it.
    This addresses http://debbugs.gnu.org/18685
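The false match described in the log message can be reproduced without grep. The following is a minimal sketch, not code from the patch: `eucjp_char_len`, `find_bytes` and `find_at_char_boundary` are hypothetical helpers, and the length rule is a simplified view of EUC-JP that does not validate trailing bytes. It shows that the bytes `\244\263` occur in `\263\244\263\244\n` only across a character boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length in bytes of the EUC-JP character whose first byte is B.
   Simplified assumption: trailing bytes are not validated.  */
static size_t
eucjp_char_len (unsigned char b)
{
  if (b == 0x8F)
    return 3;                   /* SS3 followed by two bytes */
  if (b == 0x8E || (0xA1 <= b && b <= 0xFE))
    return 2;                   /* SS2 + one byte, or a two-byte character */
  return 1;                     /* ASCII and everything else */
}

/* Naive search: offset of the first occurrence of PAT at any byte
   offset in BUF, or -1 if there is none.  */
static long
find_bytes (unsigned char const *buf, size_t len,
            unsigned char const *pat, size_t patlen)
{
  assert (buf != NULL && pat != NULL);
  for (size_t i = 0; i + patlen <= len; i++)
    if (memcmp (buf + i, pat, patlen) == 0)
      return (long) i;
  return -1;
}

/* Boundary-aware search: like find_bytes, but only try offsets at
   which a character starts.  */
static long
find_at_char_boundary (unsigned char const *buf, size_t len,
                       unsigned char const *pat, size_t patlen)
{
  assert (buf != NULL && pat != NULL);
  for (size_t i = 0; i + patlen <= len; i += eucjp_char_len (buf[i]))
    if (memcmp (buf + i, pat, patlen) == 0)
      return (long) i;
  return -1;
}
```

With the input line and pattern from the log message, `find_bytes` reports a hit at offset 1 (the second byte of the first character), while `find_at_char_boundary` reports no match, which is the behavior the patch restores.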

diff --git a/NEWS b/NEWS
index 07a5d54..94eeeeb 100644
--- a/NEWS
+++ b/NEWS
@@ -38,6 +38,10 @@ GNU grep NEWS                                    -*- outline -*-
   implying that the match, "10" was on line 1.
   [bug introduced in grep-2.19]
 
+  grep in a non-UTF8 multibyte locale could mistakenly match in the middle
+  of a multibyte character when using a '^'-anchored alternate in a pattern,
+  leading it to print non-matching lines.  [bug present since "the beginning"]
+
   grep -E rejected unmatched ')', instead of treating it like '\)'.
   [bug present since "the beginning"]
 
diff --git a/src/dfa.c b/src/dfa.c
index 80510a8..5b9d154 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -405,6 +405,10 @@ struct dfa
                                    slots so far, not counting trans[-1].  */
   int trcount;                  /* Number of transition tables that have
                                    actually been built.  */
+  int min_trcount;              /* Minimum of number of transition tables.
+                                   Always keep the number, even after freeing
+                                   the transition tables.  It is also the
+                                   number of initial states.  */
   state_num **trans;            /* Transition tables for states that can
                                    never accept.  If the transitions for a
                                    state have not yet been computed, or the
@@ -423,6 +427,8 @@ struct dfa
                                    newline is stored separately and handled
                                    as a special case.  Newline is also used
                                    as a sentinel at the end of the buffer.  */
+  state_num initstate_letter;   /* Initial state for letter context.  */
+  state_num initstate_others;   /* Initial state for other contexts.  */
   struct dfamust *musts;        /* List of strings, at least one of which
                                    is known to appear in any r.e. matching
                                    the dfa.  */
@@ -2517,9 +2523,16 @@ dfaanalyze (struct dfa *d, int searchflag)
 
   /* Build the initial state.  */
   separate_contexts = state_separate_contexts (&merged);
-  state_index (d, &merged,
-               (separate_contexts & CTX_NEWLINE
-                ? CTX_NEWLINE : separate_contexts ^ CTX_ANY));
+  if (separate_contexts & CTX_NEWLINE)
+    state_index (d, &merged, CTX_NEWLINE);
+  d->initstate_others = d->min_trcount
+    = state_index (d, &merged, separate_contexts ^ CTX_ANY);
+  if (separate_contexts & CTX_LETTER)
+    d->initstate_letter = d->min_trcount
+      = state_index (d, &merged, CTX_LETTER);
+  else
+    d->initstate_letter = d->initstate_others;
+  d->min_trcount++;
 
   free (posalloc);
   free (stkalloc);
@@ -2855,17 +2868,17 @@ build_state (state_num s, struct dfa *d)
   /* Set an upper limit on the number of transition tables that will ever
      exist at once.  1024 is arbitrary.  The idea is that the frequently
      used transition tables will be quickly rebuilt, whereas the ones that
-     were only needed once or twice will be cleared away.  However, do
-     not clear the initial state, as it's always used.  */
+     were only needed once or twice will be cleared away.  However, do not
+     clear the initial D->min_trcount states, since they are always used.  */
   if (d->trcount >= 1024)
     {
-      for (i = 1; i < d->tralloc; ++i)
+      for (i = d->min_trcount; i < d->tralloc; ++i)
         {
           free (d->trans[i]);
           free (d->fails[i]);
           d->trans[i] = d->fails[i] = NULL;
         }
-      d->trcount = 1;
+      d->trcount = d->min_trcount;
     }
 
   ++d->trcount;
@@ -3236,15 +3249,22 @@ transit_state (struct dfa *d, state_num s, unsigned char const **pp,
    expression "\\" accepts the codepoint 0x5c, but should not accept the second
    byte of the codepoint 0x815c.  Then the initial state must skip the bytes
    that are not a single byte character nor the first byte of a multibyte
-   character.  */
+   character.
+
+   Given DFA state d, use mbs_to_wchar to advance MBP until it reaches or
+   exceeds P.  If WCP is non-NULL, set *WCP to the final wide character
+   processed, or if no wide character is processed, set it to WEOF.
+   Both P and MBP must be no larger than END.  */
 static unsigned char const *
 skip_remains_mb (struct dfa *d, unsigned char const *p,
-                 unsigned char const *mbp, char const *end)
+                 unsigned char const *mbp, char const *end, wint_t *wcp)
 {
-  wint_t wc;
+  wint_t wc = WEOF;
   while (mbp < p)
     mbp += mbs_to_wchar (&wc, (char const *) mbp,
                          end - (char const *) mbp, d);
+  if (wcp != NULL)
+    *wcp = wc;
   return mbp;
 }
 
@@ -3306,20 +3326,44 @@ dfaexec_main (struct dfa *d, char const *begin, char *end,
             {
               s1 = s;
 
-              if (s == 0)
+              if (s < d->min_trcount)
                 {
-                  if (d->states[s].mbps.nelem == 0)
+                  if (d->min_trcount == 1)
                     {
-                      do
+                      if (d->states[s].mbps.nelem == 0)
                         {
-                          while (t[*p] == 0)
-                            p++;
-                          p = mbp = skip_remains_mb (d, p, mbp, end);
+                          do
+                            {
+                              while (t[*p] == 0)
+                                p++;
+                              p = mbp = skip_remains_mb (d, p, mbp, end, NULL);
+                            }
+                          while (t[*p] == 0);
                         }
-                      while (t[*p] == 0);
+                      else
+                        p = mbp = skip_remains_mb (d, p, mbp, end, NULL);
                     }
                   else
-                    p = mbp = skip_remains_mb (d, p, mbp, end);
+                    {
+                      wint_t wc;
+                      mbp = skip_remains_mb (d, p, mbp, end, &wc);
+
+                      /* If d->min_trcount is greater than 1, maybe
+                         transit to another initial state after skip.  */
+                      if (p < mbp)
+                        {
+                          int context = wchar_context (wc);
+                          if (context == CTX_LETTER)
+                            s = d->initstate_letter;
+                          else
+                            /* It's CTX_NONE.  CTX_NEWLINE cannot happen,
+                               as we assume that a newline is always a
+                               single byte character.  */
+                            s = d->initstate_others;
+                          p = mbp;
+                          s1 = s;
+                        }
+                    }
                 }
 
               if (d->states[s].mbps.nelem == 0)
diff --git a/tests/euc-mb b/tests/euc-mb
index 6a9a845..b625046 100755
--- a/tests/euc-mb
+++ b/tests/euc-mb
@@ -39,6 +39,7 @@ make_input BABAAB |euc_grep AB > out || fail=1
 make_input BABAAB > exp || framework_failure_
 compare exp out || fail=1
 make_input BABABA |euc_grep AB; test $? = 1 || fail=1
+make_input BABABA |euc_grep '^x\|AB'; test $? = 1 || fail=1
 
 # -P supports only unibyte and UTF-8 locales.
 LC_ALL=$locale grep -P x /dev/null

-----------------------------------------------------------------------

Summary of changes:
 NEWS         |    4 +++
 src/dfa.c    |   80 +++++++++++++++++++++++++++++++++++++++++++++-------------
 tests/euc-mb |    1 +
 3 files changed, 67 insertions(+), 18 deletions(-)
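
The skip-and-restart step that the src/dfa.c change adds to dfaexec_main can be illustrated in isolation. This is only a sketch under stated assumptions, not the patched code: `skip_to_boundary`, `restart_state` and `struct start_states` are hypothetical stand-ins for skip_remains_mb, the new branch in dfaexec_main, and the initstate_letter/initstate_others members. Character lengths come from a simplified EUC-JP rule instead of mbs_to_wchar, and the letter test is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the EUC-JP character whose first byte is B.
   Simplified assumption: trailing bytes are not validated.  */
static size_t
eucjp_char_len (unsigned char b)
{
  if (b == 0x8F)
    return 3;
  if (b == 0x8E || (0xA1 <= b && b <= 0xFE))
    return 2;
  return 1;
}

/* Advance MBP one whole character at a time until it reaches or
   exceeds P.  If LEAD is non-NULL, set *LEAD to the first byte of the
   last character processed, or to -1 if none was processed.  Both P
   and MBP must be no larger than END.  */
static unsigned char const *
skip_to_boundary (unsigned char const *p, unsigned char const *mbp,
                  unsigned char const *end, int *lead)
{
  int last = -1;
  assert (p <= end && mbp <= end);
  while (mbp < p)
    {
      size_t n = eucjp_char_len (*mbp);
      size_t avail = (size_t) (end - mbp);
      last = *mbp;
      mbp += n < avail ? n : avail;     /* never step past END */
    }
  if (lead != NULL)
    *lead = last;
  return mbp;
}

/* Hypothetical pair of start states, one per context.  */
struct start_states
{
  int letter;                   /* used after a letter */
  int others;                   /* used after anything else */
};

/* If P was inside a character (the skip moved MBP past P), restart in
   the start state that matches the skipped character's context.
   Otherwise keep CURRENT.  */
static int
restart_state (int current, unsigned char const *p,
               unsigned char const *mbp, int last_is_letter,
               struct start_states const *st)
{
  if (p < mbp)
    return last_is_letter ? st->letter : st->others;
  return current;
}
```

A caller would follow the shape of the patched loop: call `skip_to_boundary`, pass its result to `restart_state`, and then continue matching from the returned position. Whether a skipped character counts as a letter depends on the locale, so the sketch takes that decision as a parameter.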


hooks/post-receive
-- 
grep


