bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: speed up u8_strstr


From: Bruno Haible
Subject: Re: speed up u8_strstr
Date: Fri, 21 Jan 2011 13:41:11 +0100
User-agent: KMail/1.9.9

Hi Pádraig,

> > But at the same time, can you also try to apply the O(n) algorithm to
> > u16_strstr and u32_strstr?
>
> Attached is a first attempt at this.

Thanks for working on this.

> Questions I have are:
> 
> Is it OK to include str-mp.h and malloca
> in unistr/ modules like I do?

Yes, sure. Why not?

Possible namespacing problems are solved by libunistring's Makefile.

Let me go through your patch.

1) It's a good choice to use the Knuth-Morris-Pratt algorithm for u16_strstr
and u32_strstr. It's O(n). Eric's two-way algorithm is O(n) as well, but is
more complicated code and requires to set up a table whose size is 1 << 
CHAR_BIT,
which would be 1 << 16 for uint16_t or 1 << 32 for uint32_t - infeasible.

2) It's a good choice to parameterize str-kmp.h, rather than to duplicate code.
But while doing so, please try to avoid gcc warnings. Always test your patches
with CPPFLAGS="-Wall" before submitting them:

mbscasestr.c:376: warning: pointer targets in passing argument 1 of 
‘knuth_morris_pratt’ differ in signedness
mbscasestr.c:376: warning: pointer targets in passing argument 2 of 
‘knuth_morris_pratt’ differ in signedness
mbscasestr.c:376: warning: passing argument 4 of ‘knuth_morris_pratt’ from 
incompatible pointer type
mbsstr.c:347: warning: pointer targets in passing argument 1 of 
‘knuth_morris_pratt’ differ in signedness
mbsstr.c:347: warning: pointer targets in passing argument 2 of 
‘knuth_morris_pratt’ differ in signedness

And there's no reason to remove part of the specification of CANON_ELEMENT in
lib/str-kmp.h. Removing reasonable comments is a good way to have your patches
rejected.

3) About the module descriptions:

 Makefile.am:
 if LIBUNISTRING_COMPILE_UNISTR_U16_STRSTR
-lib_SOURCES += unistr/u16-strstr.c
+lib_SOURCES += unistr/u16-strstr.c str-kmp.h
 endif

You don't need to add .h files to lib_SOURCES.

4) The line  gl_LIBUNISTRING_MODULE([0.9], [unistr/u32-strstr])
needs to be updated, because with a libunistring from versions 0.9 .. 0.9.3
the new unit tests will fail. So when a libunistring version < 0.9.4 is found,
you need to enforce compilation of the module from source. This is done by 
changing
the line to
             gl_LIBUNISTRING_MODULE([0.9.4], [unistr/u32-strstr])

5) The unit test tests/unistr/test-u-strstr.h: It is good to share code among 
tests.
rather than write down the same code 3 times. But please follow GNU coding 
style:
indentation by 2 spaces, not 4, and put spaces around arithmetic operators.
More importantly, though, the test code that you put there is part of
tests/test-strstr.c and tests/test-mbsstr1.c. Why only a part? u16_strstr and
u32_strstr should have all the properties that strstr and mbsstr have. So the
right thing to do is to adapt all of tests/test-strstr.c into
tests/unistr/test-u-strstr.h. I've done this, and now I get test failures:

/bin/sh: line 5: 24308 Alarm clock             EXEEXT='' srcdir='.' MAKE='make' 
${dir}$tst
FAIL: test-u16-strstr
/bin/sh: line 5: 24325 Alarm clock             EXEEXT='' srcdir='.' MAKE='make' 
${dir}$tst
FAIL: test-u32-strstr

6) In unit tests, the order of the #includes is the following:
  - First comes <config.h>.
  - Then comes the header file supposed to declare the particular functionality.
    This comes immediately after <config.h> in order to check that the header is
    self-contained.
  - Then come system includes <...>.
  - Then come gnulib includes "...".

7) Also please go into more details when writing a ChangeLog entry. Always 
mention
the function names, especially when you're renaming a function.


So, the first part that can go in is this:


2011-01-21  Pádraig Brady  <address@hidden>
            Bruno Haible  <address@hidden>

        Prepare for faster uN_strstr functions.
        * lib/str-kmp.h: Support definable UNITs.
        (knuth_morris_pratt): Renamed from knuth_morris_pratt_unibyte. Add
        needle_len argument.
        * lib/mbsstr.c (mbsstr): Adjust for the changed str-kmp.h.
        * lib/mbscasestr.c (mbscasestr): Likewise.

--- lib/mbscasestr.c.orig       Fri Jan 21 13:29:32 2011
+++ lib/mbscasestr.c    Fri Jan 21 13:05:06 2011
@@ -30,6 +30,7 @@
 #define TOLOWER(Ch) (isupper (Ch) ? tolower (Ch) : (Ch))
 
 /* Knuth-Morris-Pratt algorithm.  */
+#define UNIT unsigned char
 #define CANON_ELEMENT(c) TOLOWER (c)
 #include "str-kmp.h"
 
@@ -368,10 +369,12 @@
                   if (needle_last_ccount == NULL)
                     {
                       /* Try the Knuth-Morris-Pratt algorithm.  */
-                      const char *result;
+                      const unsigned char *result;
                       bool success =
-                        knuth_morris_pratt_unibyte (haystack, needle - 1,
-                                                    &result);
+                        knuth_morris_pratt ((const unsigned char *) haystack,
+                                            (const unsigned char *) (needle - 
1),
+                                            strlen (needle - 1),
+                                            &result);
                       if (success)
                         return (char *) result;
                       try_kmp = false;
--- lib/mbsstr.c.orig   Fri Jan 21 13:29:32 2011
+++ lib/mbsstr.c        Fri Jan 21 13:05:06 2011
@@ -27,6 +27,7 @@
 #include "mbuiter.h"
 
 /* Knuth-Morris-Pratt algorithm.  */
+#define UNIT unsigned char
 #define CANON_ELEMENT(c) c
 #include "str-kmp.h"
 
@@ -339,10 +340,12 @@
                   if (needle_last_ccount == NULL)
                     {
                       /* Try the Knuth-Morris-Pratt algorithm.  */
-                      const char *result;
+                      const unsigned char *result;
                       bool success =
-                        knuth_morris_pratt_unibyte (haystack, needle - 1,
-                                                    &result);
+                        knuth_morris_pratt ((const unsigned char *) haystack,
+                                            (const unsigned char *) (needle - 
1),
+                                            strlen (needle - 1),
+                                            &result);
                       if (success)
                         return (char *) result;
                       try_kmp = false;
--- lib/str-kmp.h.orig  Fri Jan 21 13:29:32 2011
+++ lib/str-kmp.h       Fri Jan 21 13:05:06 2011
@@ -1,4 +1,4 @@
-/* Substring search in a NUL terminated string of 'char' elements,
+/* Substring search in a NUL terminated string of UNIT elements,
    using the Knuth-Morris-Pratt algorithm.
    Copyright (C) 2005-2011 Free Software Foundation, Inc.
    Written by Bruno Haible <address@hidden>, 2005.
@@ -18,21 +18,26 @@
    Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.  */
 
 /* Before including this file, you need to define:
+     UNIT                    The element type of the needle and haystack.
      CANON_ELEMENT(c)        A macro that canonicalizes an element right after
-                             it has been fetched from one of the two strings.
-                             The argument is an 'unsigned char'; the result
-                             must be an 'unsigned char' as well.  */
+                             it has been fetched from needle or haystack.
+                             The argument is of type UNIT; the result must be
+                             of type UNIT as well.  */
 
 /* Knuth-Morris-Pratt algorithm.
    See http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm
+   HAYSTACK is the NUL terminated string in which to search for.
+   NEEDLE is the string to search for in HAYSTACK, consisting of NEEDLE_LEN
+   units.
    Return a boolean indicating success:
    Return true and set *RESULTP if the search was completed.
    Return false if it was aborted because not enough memory was available.  */
 static bool
-knuth_morris_pratt_unibyte (const char *haystack, const char *needle,
-                            const char **resultp)
+knuth_morris_pratt (const UNIT *haystack,
+                    const UNIT *needle, size_t needle_len,
+                    const UNIT **resultp)
 {
-  size_t m = strlen (needle);
+  size_t m = needle_len;
 
   /* Allocate the table.  */
   size_t *table = (size_t *) nmalloca (m, sizeof (size_t));
@@ -66,14 +71,14 @@
            The inequality needle[x..i-1] != needle[0..i-1-x] is known to hold
            for x < table[i-1], by induction.
            Furthermore, if j>0: needle[i-1-j..i-2] = needle[0..j-1].  */
-        unsigned char b = CANON_ELEMENT ((unsigned char) needle[i - 1]);
+        UNIT b = CANON_ELEMENT (needle[i - 1]);
 
         for (;;)
           {
             /* Invariants: The inequality needle[x..i-1] != needle[0..i-1-x]
                is known to hold for x < i-1-j.
                Furthermore, if j>0: needle[i-1-j..i-2] = needle[0..j-1].  */
-            if (b == CANON_ELEMENT ((unsigned char) needle[j]))
+            if (b == CANON_ELEMENT (needle[j]))
               {
                 /* Set table[i] := i-1-j.  */
                 table[i] = i - ++j;
@@ -108,17 +113,16 @@
   /* Search, using the table to accelerate the processing.  */
   {
     size_t j;
-    const char *rhaystack;
-    const char *phaystack;
+    const UNIT *rhaystack;
+    const UNIT *phaystack;
 
     *resultp = NULL;
     j = 0;
     rhaystack = haystack;
     phaystack = haystack;
     /* Invariant: phaystack = rhaystack + j.  */
-    while (*phaystack != '\0')
-      if (CANON_ELEMENT ((unsigned char) needle[j])
-          == CANON_ELEMENT ((unsigned char) *phaystack))
+    while (*phaystack != 0)
+      if (CANON_ELEMENT (needle[j]) == CANON_ELEMENT (*phaystack))
         {
           j++;
           phaystack++;



reply via email to

[Prev in Thread] Current Thread [Next in Thread]