Re: hash resizing bug

bug-gnulib

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: hash resizing bug

From:	Eric Blake
Subject:	Re: hash resizing bug
Date:	Thu, 18 Jun 2009 07:29:52 -0600
User-agent:	Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.21) Gecko/20090302 Thunderbird/2.0.0.21 Mnenhy/0.7.6.666

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Jim Meyering on 6/17/2009 1:21 PM:
>> is not quite right.  A hashing function may be somewhat ill-conditioned for 
>> one
>> bucket size, but much better conditioned for another size.  At any rate, it
>> looks like the correct fix is to check for rehashing thresholds based on 
>> number
>> of entries, not number of buckets, and that this check needs to happen on 
>> every
>> insertion, not just insertions that land in an empty bucket.  Or, at the very
>> least, check for threshold PRIOR to insertion, rather than after, such that 
>> in
>> the above case with m4, the very next call to hash_insert would notice the
>> table was at 100%, and force another resize above 907 buckets.
>>
>> Thoughts, before I implement something along these lines?
> 
> Sounds good.

Hey - I just noticed that this also fixed the bug where hash_insert was
inconsistent on NULL return whether an entry had been inserted or not.
Now, you are guaranteed that a NULL return means the entry was not inserted.

Here's two simple patches.  I'm committing the first, as there should be
enough cases where you end up searching for an element already known to be
hashed, where making an indirect function call is needless overhead
(particularly if the comparator function is expensive, and neglected to do
the pointer comparison filter itself).  And by documenting this behavior,
the user's comparator functions can now be safely rewritten to avoid the
pointer comparison filter.

I haven't pushed the second one to savannah yet; what do you think of it?
 Sometimes it is desirable to hash distinct pointers, in which case
allowing a NULL user function to request this seems to make sense.  My
design choice was to provide trivial functions at initialization rather
than penalize normal users by checking at every use of the callback
whether the function was NULL.

I've rebased my patch series, although the last couple of patches (my
first cut at dealing with hash_rehash allocation failure, and rewriting
overflow management to be more efficient) are rather untested at the
moment, and might not even compile:
http://repo.or.cz/w/gnulib/ericb.git
$ git pull git://repo.or.cz/gnulib/ericb.git master

- --
Don't work too hard, make some time for fun as well!

Eric Blake             address@hidden
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAko6QU8ACgkQ84KuGfSFAYDwkACfZUk5ZbgoMc96pkGNCOdkhMFE
YE8AoKbbrBBMYhrdBcX+DAt9SQXFZE7K
=FWag
-----END PGP SIGNATURE-----

>From 16908fc83e83a08e4f0e8c004141ff01ae12ca86 Mon Sep 17 00:00:00 2001
From: Eric Blake <address@hidden>
Date: Thu, 18 Jun 2009 06:56:13 -0600
Subject: [PATCH 1/2] hash: minor optimization

* lib/hash.c (hash_lookup, hash_find_entry): Avoid function call
when possible.
(hash_initialize): Document this promise.
(hash_do_for_each, hash_clear, hash_free): Use C89 syntax.
* tests/test-hash.c (hash_compare_strings): Test this.

Signed-off-by: Eric Blake <address@hidden>
---
 ChangeLog         |    9 +++++++++
 lib/hash.c        |   20 ++++++++++----------
 tests/test-hash.c |    1 +
 3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 1b7cd78..09cc027 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,12 @@
+2009-06-18  Eric Blake  <address@hidden>
+
+       hash: minor optimization
+       * lib/hash.c (hash_lookup, hash_find_entry): Avoid function call
+       when possible.
+       (hash_initialize): Document this promise.
+       (hash_do_for_each, hash_clear, hash_free): Use C89 syntax.
+       * tests/test-hash.c (hash_compare_strings): Test this.
+
 2009-06-18  Bruno Haible  <address@hidden>

        * m4/strstr.m4 (gl_FUNC_STRSTR): Skip linear time test if strstr is
diff --git a/lib/hash.c b/lib/hash.c
index 4de1069..e464ce3 100644
--- a/lib/hash.c
+++ b/lib/hash.c
@@ -263,7 +263,7 @@ hash_lookup (const Hash_table *table, const void *entry)
     return NULL;

   for (cursor = bucket; cursor; cursor = cursor->next)
-    if (table->comparator (entry, cursor->data))
+    if (entry == cursor->data || table->comparator (entry, cursor->data))
       return cursor->data;

   return NULL;
@@ -373,7 +373,7 @@ hash_do_for_each (const Hash_table *table, Hash_processor 
processor,
        {
          for (cursor = bucket; cursor; cursor = cursor->next)
            {
-             if (!(*processor) (cursor->data, processor_data))
+             if (! processor (cursor->data, processor_data))
                return counter;
              counter++;
            }
@@ -535,7 +535,8 @@ check_tuning (Hash_table *table)
    The user-supplied COMPARATOR function should be provided.  It accepts two
    arguments pointing to user data, it then returns true for a pair of entries
    that compare equal, or false otherwise.  This function is internally called
-   on entries which are already known to hash to the same bucket index.
+   on entries which are already known to hash to the same bucket index,
+   but which are distinct pointers.

    The user-supplied DATA_FREER function, when not NULL, may be later called
    with the user data as an argument, just before the entry containing the
@@ -628,7 +629,7 @@ hash_clear (Hash_table *table)
          for (cursor = bucket->next; cursor; cursor = next)
            {
              if (table->data_freer)
-               (*table->data_freer) (cursor->data);
+               table->data_freer (cursor->data);
              cursor->data = NULL;

              next = cursor->next;
@@ -640,7 +641,7 @@ hash_clear (Hash_table *table)

          /* Free the bucket head.  */
          if (table->data_freer)
-           (*table->data_freer) (bucket->data);
+           table->data_freer (bucket->data);
          bucket->data = NULL;
          bucket->next = NULL;
        }
@@ -670,9 +671,7 @@ hash_free (Hash_table *table)
          if (bucket->data)
            {
              for (cursor = bucket; cursor; cursor = cursor->next)
-               {
-                 (*table->data_freer) (cursor->data);
-               }
+               table->data_freer (cursor->data);
            }
        }
     }
@@ -769,7 +768,7 @@ hash_find_entry (Hash_table *table, const void *entry,
     return NULL;

   /* See if the entry is the first in the bucket.  */
-  if ((*table->comparator) (entry, bucket->data))
+  if (entry == bucket->data || table->comparator (entry, bucket->data))
     {
       void *data = bucket->data;

@@ -796,7 +795,8 @@ hash_find_entry (Hash_table *table, const void *entry,
   /* Scan the bucket overflow.  */
   for (cursor = bucket; cursor->next; cursor = cursor->next)
     {
-      if ((*table->comparator) (entry, cursor->next->data))
+      if (entry == cursor->next->data
+         || table->comparator (entry, cursor->next->data))
        {
          void *data = cursor->next->data;

diff --git a/tests/test-hash.c b/tests/test-hash.c
index 2266545..73a3512 100644
--- a/tests/test-hash.c
+++ b/tests/test-hash.c
@@ -46,6 +46,7 @@
 static bool
 hash_compare_strings (void const *x, void const *y)
 {
+  ASSERT (x != y);
   return STREQ (x, y) ? true : false;
 }

-- 
1.6.3.rc3.2.g4b51


>From 3096059ada9eba36cf7bbfcaa6f4d766b0f481a9 Mon Sep 17 00:00:00 2001
From: Eric Blake <address@hidden>
Date: Thu, 18 Jun 2009 07:11:39 -0600
Subject: [PATCH 2/2] hash: provide default callback functions

* lib/hash.c (raw_hasher, raw_comparator): New functions.
(hash_initialize): Use them as defaults.
* tests/test-hash.c (main): Test this.

Signed-off-by: Eric Blake <address@hidden>
---
 ChangeLog         |    5 +++++
 lib/hash.c        |   25 +++++++++++++++++++++----
 tests/test-hash.c |   12 ++++++++++++
 3 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 09cc027..6861f6b 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,10 @@
 2009-06-18  Eric Blake  <address@hidden>

+       hash: provide default callback functions
+       * lib/hash.c (raw_hasher, raw_comparator): New functions.
+       (hash_initialize): Use them as defaults.
+       * tests/test-hash.c (main): Test this.
+
        hash: minor optimization
        * lib/hash.c (hash_lookup, hash_find_entry): Avoid function call
        when possible.
diff --git a/lib/hash.c b/lib/hash.c
index e464ce3..52a211e 100644
--- a/lib/hash.c
+++ b/lib/hash.c
@@ -479,6 +479,21 @@ hash_reset_tuning (Hash_tuning *tuning)
   *tuning = default_tuning;
 }

+/* If the user passes a NULL hasher, we hash the raw pointer.  */
+static size_t
+raw_hasher (const void *data, size_t n)
+{
+  return (size_t) data % n;
+}
+
+/* If the user passes a NULL comparator, we use pointer comparison.  */
+static bool
+raw_comparator (const void *a, const void *b)
+{
+  return a == b;
+}
+
+
 /* For the given hash TABLE, check the user supplied tuning structure for
    reasonable values, and return true if there is no gross error with it.
    Otherwise, definitively reset the TUNING field to some acceptable default
@@ -527,12 +542,12 @@ check_tuning (Hash_table *table)
    provided but the values requested are out of bounds or might cause
    rounding errors, return NULL.

-   The user-supplied HASHER function should be provided.  It accepts two
+   The user-supplied HASHER function, when not NULL, accepts two
    arguments ENTRY and TABLE_SIZE.  It computes, by hashing ENTRY contents, a
    slot number for that entry which should be in the range 0..TABLE_SIZE-1.
    This slot number is then returned.

-   The user-supplied COMPARATOR function should be provided.  It accepts two
+   The user-supplied COMPARATOR function, when not NULL, accepts two
    arguments pointing to user data, it then returns true for a pair of entries
    that compare equal, or false otherwise.  This function is internally called
    on entries which are already known to hash to the same bucket index,
@@ -553,8 +568,10 @@ hash_initialize (size_t candidate, const Hash_tuning 
*tuning,
 {
   Hash_table *table;

-  if (hasher == NULL || comparator == NULL)
-    return NULL;
+  if (hasher == NULL)
+    hasher = raw_hasher;
+  if (comparator == NULL)
+    comparator = raw_comparator;

   table = malloc (sizeof *table);
   if (table == NULL)
diff --git a/tests/test-hash.c b/tests/test-hash.c
index 73a3512..6bb9652 100644
--- a/tests/test-hash.c
+++ b/tests/test-hash.c
@@ -135,6 +135,18 @@ main (void)
       hash_clear (ht);
       ASSERT (hash_get_n_entries (ht) == 0);
       hash_free (ht);
+
+      /* Test pointer hashing.  */
+      ht = hash_initialize (sz, NULL, NULL, NULL, NULL);
+      ASSERT (ht);
+      {
+       char *str = xstrdup ("a");
+       insert_new (ht, "a");
+       insert_new (ht, str);
+       ASSERT (hash_lookup (ht, str) == str);
+       free (str);
+      }
+      hash_free (ht);
     }

   /* Now, each entry is malloc'd.  */
-- 
1.6.3.rc3.2.g4b51

[Prev in Thread]

Current Thread

[Next in Thread]

Re: hash_rehash allocation failure, (continued)

Prev by Date: Re: strstr
Next by Date: gettime build failure with recent gnulib for mingw systems
Previous by thread: Re: hash resizing bug
Next by thread: Re: hash resizing bug
Index(es):
- Date
- Thread