bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC] Adding a real HashTable implementation to gnulib


From: Tim Rühsen
Subject: Re: [RFC] Adding a real HashTable implementation to gnulib
Date: Mon, 3 Dec 2018 10:13:20 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.3.1

On 12/2/18 2:41 PM, Bruno Haible wrote:
> Hi,
> 
> Darshit Shah wrote:
>> I recently tried to use the hash table implementation in gnulib which resides
>> in the "hash" module. However, I quickly realised that the hash table in 
>> gnulib
>> seems to be what is otherwise popularly known as a hash set, i.e., it 
>> supports
>> storing and retrieving just values from the structure. 
>>
>> On the other hand, a hash table is usually expected to have a key->value
>> mapping that is stored.
> 
> I agree that the gnulib 'hash' module is just a particular case, and
> probably the module name is not very descriptive.
> 
>> Within GNU Wget, we have a fairly portable version of a hash table 
>> implemented
>> which I think would be a good addition for gnulib. What do you think?
> 
> There's not only the one from wget
>   https://git.savannah.gnu.org/gitweb/?p=wget.git;a=blob;f=src/hash.h
>   https://git.savannah.gnu.org/gitweb/?p=wget.git;a=blob;f=src/hash.c
> 
> but also the one from gettext
>   
> https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=gnulib-local/lib/hash.h
>   
> https://git.savannah.gnu.org/gitweb/?p=gettext.git;a=blob;f=gnulib-local/lib/hash.c
> 
> and the one from glib
>   https://gitlab.gnome.org/GNOME/glib/blob/master/glib/ghash.h
>   https://gitlab.gnome.org/GNOME/glib/blob/master/glib/ghash.c
> 
> and the one from libxml
>   https://gitlab.gnome.org/GNOME/libxml2/blob/master/include/libxml/hash.h
>   https://gitlab.gnome.org/GNOME/libxml2/blob/master/hash.c
> 
> and the ones from CLN
>   https://www.ginac.de/CLN/cln.git/?p=cln.git;a=tree;f=src/base/hash
> 
> and many more.
> 
> The implementation you are proposing is an "open-addressed table with linear
> probing collision resolution". To me that is unacceptable. When I used Kyoto
> Common Lisp (KCL) many years ago, I got an endless loop during a hash table
> access, and it was precisely because of this open-addressed table structure.
> I don't want a code which requires careful setting of parameters in order
> not to run into an endless loop. Instead better have a code that cannot run
> into an endless loop *by design*.
> 
> The hash_string function that you propose shifts by 5 bits at each step;
> I suspect that it has the same problem as the one I tested and discussed in
> https://haible.de/bruno/hashfunc.html .
> 
> For Gnulib, I would want a generic container, i.e. a "map", like we have
> "list" and "ordered set" already (modules 'list' and 'oset'). Other GNU
> maintainers have reported that they like this approach.

We have 'hashmap' [1] in GNU Wget2. It does not work with linear probing
collision resolution, but with a linked list as fallback for collisions
(that could be replaced by a different algorithm).

So it's a key/value store where you can have NULL as value - in case you
just want to know if a key exists or not.

You apply your own hash and compare function when you create a hashmap
instance. So it's up to the dev to decide for an optimal hash strategy
for a certain use case.

You can add entries (key+value) with or without malloc/copying.

You can easily make wrappers around hashmap for special purposes, e.g.
we have a 'stringmap' API for case sensitive and insensitive keys [2].

The API is documented at [3] (hashmap) and [4] (stringmap).

It's a naive implementation but fast enough for out purposes (70ms to
read in the german translation of the holy bible (both parts) and count
unique words).

Surely many details have to change, but I am absolutely willing to
accept critics and work together with you - and anyone else who likes -
to get it into shape. The API hasn't been released, so we are currently
pretty flexible with changes.


[1] https://gitlab.com/gnuwget/wget2/blob/master/libwget/hashmap.c
[2] https://gitlab.com/gnuwget/wget2/blob/master/libwget/stringmap.c
[3] https://gnuwget.gitlab.io/wget2/reference/group__libwget-hashmap.html
[4] https://gnuwget.gitlab.io/wget2/reference/group__libwget-stringmap.html

> 
> However, this will still not fit all possible needs because there are
> special cases that people want to see optimized:
>   - The case when the key is a string; additionally when the key is
>     allocated in an obstack and there is no remove.
>   - The struniq function (as in localename.c).
> 
> Then, what about extra requirements?
>   - The existing gnulib 'hash' module is pretty unique: it keeps statistics.
>     But is anyone really using this feature?
>   - malloc vs. xmalloc.
>   - Multithread-safety should IMO not be considered as an extra requirement.
>     This is better done in application logic, because typically in the scope
>     of the lock the application will do more than just the hash table lookup.

There are no stats in Wget2's implementation. The malloc/free functions
are currently not configurable (but that's trivial).

Regards, Tim

Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]