bug-grep

Re: [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets


From: Jim Meyering
Subject: Re: [PATCH 5/9] dfa: optimize simple character sets under UTF-8 charsets
Date: Wed, 17 Mar 2010 11:10:31 +0100

Paolo Bonzini wrote:
> Use a bitset when not involving MBCSET is possible.  Testcase:
>    yes 'the quick brown fox jumps over the lazy dog' | sed 100000q | \
>      time grep -c [ABCDEFGHIJKLMNOPQRSTUVWXYZ,]
>
> Before: 51ms (best of three runs); after: 16ms (best of three runs).
>
> * src/dfa.c (check_utf8, using_utf8): New.
> (parse_bracket_exp): For simple bracket expressions under UTF-8,
> use a CSET.
> (dfacomp): Call check_utf8.
> ---
>  src/dfa.c |   34 +++++++++++++++++++++++++++++++++-
>  1 files changed, 33 insertions(+), 1 deletions(-)
>
> diff --git a/src/dfa.c b/src/dfa.c
> index ed4e1ae..da70aa1 100644
> --- a/src/dfa.c
> +++ b/src/dfa.c
> @@ -21,6 +21,7 @@
>     Modified July, 1988 by Arthur David Olson to assist BMG speedups  */
>
>  #include <config.h>
> +#include <assert.h>
>  #include <ctype.h>
>  #include <stdio.h>
>  #include <sys/types.h>
> @@ -78,6 +79,7 @@
>  /* We can handle multibyte strings. */
>  # include <wchar.h>
>  # include <wctype.h>
> +# include <langinfo.h>
>  #endif
>
>  #include "regex.h"
> @@ -312,8 +314,27 @@ static wchar_t *inputwcs;        /* Wide character representation of input
>                                  And inputwcs[i] is the codepoint.  */
>  static unsigned char const *buf_begin;       /* reference to begin in dfaexec().  */
>  static unsigned char const *buf_end; /* reference to end in dfaexec().  */
> +
> +/* UTF-8 encoding allows some optimizations that we can't otherwise
> +   assume in a multibyte encoding. */
> +static int using_utf8;
> +
> +static void
> +check_utf8 (void)
> +{
> +#ifdef HAVE_LANGINFO_CODESET
> +  if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
> +    using_utf8 = 1;
> +#endif
> +}
> +#else
> +static void
> +check_utf8 (void)
> +{
> +}
>  #endif /* MBS_SUPPORT  */

What do you think about dropping the global variable
and simply calling the function "using_utf8"?

static inline bool
using_utf8 (void)
{
  static bool utf8;
  static bool first_call = true;
  if (first_call)
    {
#ifdef HAVE_LANGINFO_CODESET
      utf8 = (strcmp (nl_langinfo (CODESET), "UTF-8") == 0);
#else
      utf8 = false;
#endif
      first_call = false;
    }

  return utf8;
}

Hmm... I guess we have to be leery of using "bool" in dfa.c since it's
slated to be shared with gawk (which lacks gnulib).  So we should
stick with "int" and 0/1.
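
For concreteness, the same function with plain "int" might look roughly like the sketch below. It is untested against the actual tree, and it assumes HAVE_LANGINFO_CODESET is supplied by config.h as in the patch; when the macro is absent, the function simply reports 0.

```c
#include <string.h>
#ifdef HAVE_LANGINFO_CODESET    /* assumed to come from config.h */
# include <langinfo.h>
#endif

/* Return 1 if the current locale's codeset is UTF-8, 0 otherwise.
   The answer is computed on the first call and cached afterwards.  */
static inline int
using_utf8 (void)
{
  static int utf8;
  static int first_call = 1;
  if (first_call)
    {
#ifdef HAVE_LANGINFO_CODESET
      utf8 = (strcmp (nl_langinfo (CODESET), "UTF-8") == 0);
#else
      utf8 = 0;
#endif
      first_call = 0;
    }

  return utf8;
}
```

Because the result is cached, the caller would need to have called setlocale before the first use, and a later locale change would not be noticed.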

Either way, ACK.



