m4-discuss
[Top][All Lists]

 From: Eric Blake Subject: Re: linkedhash-list vs. hash Date: Sat, 26 Jul 2008 14:06:40 -0600 User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.16) Gecko/20080708 Thunderbird/2.0.0.16 Mnenhy/0.7.5.666

```-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Bruno Haible on 7/23/2008 5:40 AM:
| Eric Blake wrote:
|> both linkedhash-list and hash appear to fit the bill
|
| The main difference between linkedhash-list and hash is that the former
| implements a list in its own right, i.e. iterating over the elements always
| returns them in the defined order. Whereas in 'hash', you are effectively
| iterating over a set, not a list; i.e. the elements come out in random
order.

m4 doesn't care about the order of the iteration.  So I went with the hash
module.

|
|> Does it really matter whether the set size is prime vs. 2^n-1 in how
|> likely a modulo operation in the hash is to cause collisions?
|
| I would not compromise on this - a modulo operation is fast on all hardware
| since 1995, and the time to search for the next prime around n is
| O(sqrt(n)*log(n)) - much less than the O(n) that you need for resizing the
| table.

Reasonable enough; so the dynamic calculation of a prime is not a strike
against the hash module.

|
| Also, when your hash table size is 2^n-1, you are calling malloc(8*(2^n-1)),
| which - in some malloc implementations - may end up allocating twice as much
| memory, because sizeof(malloc_header_size) + 8*(2^n-1) is just slightly
larger
| than a power of two.

Good point.

|
|> Another difference: linkedlist-hash is hard-coded on its growth
|> parameters, while hash allows the user to specify both fullness threshold
|> and growth factors
|
| Whether you need this or not, depends on the ratio between the number of
| lookups in the table vs. the number of insertions in the table. If this
| ratio is low, you profit from a larger growth factor (maybe around 1.5
or 2.0);
| if it is high, a smaller growth factor (around 1.2 or 1.4) is likely better.
| Anyway, I would first do some profiling to see whether it matters at all.

For 'autoconf -f' on coreutils: pre-patch noticed 2246308 lookups, 209873
collisions, using 509 buckets and 0 resizes.  Overall time was 22.296 seconds.

post-patch, there were still 2246308 lookups, but only 20784 collisions
(an order of magnitude better).  Also, this counted 2326034 hash
callbacks, 3433800 compare callbacks, using 3659 buckets and 5 resizes
from the default size (when m4 -H is not specified, the default is 509,
but since I used NULL for the tuning parameters, the hash module
pre-scales that to 641 buckets to anticipate 80% capacity at 509 entries).
~ Overall time was 22.169 seconds (barely noticeable).

At any rate, here's the series of patches I've pushed to m4 to switch from
hand-rolled to the gnulib hash module.

- --
Don't work too hard, make some time for fun as well!

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkiLg9AACgkQ84KuGfSFAYC3+gCcCOurXGasT0LrzQ21G5AYbYvR
k7wAoNbgzj2ptKMrBFwZk2Tr5kzBa4LK
=KE1d
-----END PGP SIGNATURE-----
```
```>From 0735b27919f180fcbc03e90681c0c849f40a63c5 Mon Sep 17 00:00:00 2001
Date: Tue, 22 Jul 2008 16:58:26 -0600
Subject: [PATCH] Backport faster hash lookups of pushdef stack collisions.

* src/m4.h (struct symbol): Replace shadowed bit by stack field.
* src/symtab.c (lookup_symbol): Manage pushdef stacks
independently from hash buckets, so that collisions with large
pushdef stacks can be quickly skipped over.
* src/builtin.c (dump_symbol): Likewise.
* src/freeze.c (reverse_symbol_list, produce_frozen_state):
Likewise.
* doc/m4.texinfo (Using frozen files): Test the change.

---
ChangeLog      |   12 ++++++++
doc/m4.texinfo |   16 +++++++++++
src/builtin.c  |    2 +-
src/freeze.c   |   25 +++++++++--------
src/m4.h       |    5 +--
src/symtab.c   |   77 ++++++++++++++++++++++++++++++++++---------------------
6 files changed, 91 insertions(+), 46 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index c393ee2..dda5fc0 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,17 @@

+       Backport faster hash lookups of pushdef stack collisions.
+       * src/m4.h (struct symbol): Replace shadowed bit by stack field.
+       * src/symtab.c (lookup_symbol): Manage pushdef stacks
+       independently from hash buckets, so that collisions with large
+       pushdef stacks can be quickly skipped over.
+       * src/builtin.c (dump_symbol): Likewise.
+       * src/freeze.c (reverse_symbol_list, produce_frozen_state):
+       Likewise.
+       * doc/m4.texinfo (Using frozen files): Test the change.
+
Mention patch to make earlier autoconf happy.
* NEWS: Give URL to autoconf patch to avoid undefined popdef.

diff --git a/doc/m4.texinfo b/doc/m4.texinfo
index 79ea41d..e78061c 100644
--- a/doc/m4.texinfo
+++ b/doc/m4.texinfo
@@ -7084,6 +7084,22 @@ traceon(`undefined')dnl

@c Make sure freezing is successful.

+ifdef(`__unix__', ,
+      `errprint(` skipping: syscmd does not have unix semantics
+')m4exit(`77')')dnl
+changequote(`[', `]')dnl
+syscmd([echo 'changequote([,])pushdef([divnum],[hi])dnl' \
+       | ]__program__[ -F in.m4f \
+     && echo 'divnum popdef([divnum])divnum' \
+       | ]__program__[ -R in.m4f \
+     && rm in.m4f])status sysval
+
+
@comment options: -F /none/such
@comment status: 1
@example
diff --git a/src/builtin.c b/src/builtin.c
index 0ea814a..f8a3f3c 100644
--- a/src/builtin.c
+++ b/src/builtin.c
@@ -826,7 +826,7 @@ static void
dump_symbol (symbol *sym, void *arg)
{
struct dump_symbol_data *data = (struct dump_symbol_data *) arg;
-  if (!SYMBOL_SHADOWED (sym) && SYMBOL_TYPE (sym) != TOKEN_VOID)
+  if (SYMBOL_TYPE (sym) != TOKEN_VOID)
{
obstack_blank (data->obs, sizeof (symbol *));
data->base = (symbol **) obstack_base (data->obs);
diff --git a/src/freeze.c b/src/freeze.c
index 5e35c81..4d0eeb4 100644
--- a/src/freeze.c
+++ b/src/freeze.c
@@ -36,8 +36,8 @@ reverse_symbol_list (symbol *sym)
result = NULL;
while (sym)
{
-      next = SYMBOL_NEXT (sym);
-      SYMBOL_NEXT (sym) = result;
+      next = sym->stack;
+      sym->stack = result;
result = sym;
sym = next;
}
@@ -53,6 +53,7 @@ produce_frozen_state (const char *name)
{
FILE *file;
int h;
+  symbol *stack;
symbol *sym;
const builtin *bp;

@@ -95,15 +96,16 @@ produce_frozen_state (const char *name)

/* Dump all symbols.  */

-  for (h = 0; h < hash_table_size; h++)
+  for (stack = symtab[h = 0]; h < hash_table_size;
+       stack = (stack ? SYMBOL_NEXT (stack) : symtab[h++]))
{
-
-      /* Process all entries in one bucket, from the last to the first.
+      if (!stack)
+       continue;
+      /* Process all entries in each stack from the last to the first.
This order ensures that, at reload time, pushdef's will be
executed with the oldest definitions first.  */
-
-      symtab[h] = reverse_symbol_list (symtab[h]);
-      for (sym = symtab[h]; sym; sym = SYMBOL_NEXT (sym))
+      sym = stack = reverse_symbol_list (stack);
+      while (sym)
{
switch (SYMBOL_TYPE (sym))
{
@@ -140,11 +142,10 @@ produce_frozen_state (const char *name)
abort ();
break;
}
+         sym = sym->stack;
}
-
-      /* Reverse the bucket once more, putting it back as it was.  */
-
-      symtab[h] = reverse_symbol_list (symtab[h]);
+      /* Reverse the stack once more, putting it back as it was.  */
+      stack = reverse_symbol_list (stack);
}

/* Let diversions be issued from output.c module, its cleaner to have this
diff --git a/src/m4.h b/src/m4.h
index 3afe476..dcb4613 100644
--- a/src/m4.h
+++ b/src/m4.h
@@ -417,9 +417,9 @@ enum symbol_lookup
/* Symbol table entry.  */
struct symbol
{
-  struct symbol *next;
+  struct symbol *next; /* Next symbol with the same hash.  */
+  struct symbol *stack; /* Stack of shadowed symbols of the same name.  */
bool_bitfield traced : 1;
bool_bitfield macro_args : 1;
bool_bitfield blind_no_args : 1;
bool_bitfield deleted : 1;
@@ -432,7 +432,6 @@ struct symbol

#define SYMBOL_NEXT(S)         ((S)->next)
#define SYMBOL_TRACED(S)       ((S)->traced)
#define SYMBOL_MACRO_ARGS(S)   ((S)->macro_args)
#define SYMBOL_BLIND_NO_ARGS(S)        ((S)->blind_no_args)
#define SYMBOL_DELETED(S)      ((S)->deleted)
diff --git a/src/symtab.c b/src/symtab.c
index 7e76cfc..381d025 100644
--- a/src/symtab.c
+++ b/src/symtab.c
@@ -215,7 +215,6 @@ lookup_symbol (const char *name, size_t len, symbol_lookup
mode)
SYMBOL_TRACED (sym) = SYMBOL_TRACED (old);
SYMBOL_NAME (sym) = xmemdup0 (name, len);
SYMBOL_NAME_LEN (sym) = len;
SYMBOL_MACRO_ARGS (sym) = false;
SYMBOL_BLIND_NO_ARGS (sym) = false;
SYMBOL_DELETED (sym) = false;
@@ -223,7 +222,9 @@ lookup_symbol (const char *name, size_t len, symbol_lookup
mode)

SYMBOL_NEXT (sym) = SYMBOL_NEXT (old);
SYMBOL_NEXT (old) = NULL;
-             (*spp) = sym;
+             sym->stack = old->stack;
+             old->stack = NULL;
+             *spp = sym;
}
return sym;
}
@@ -240,19 +241,21 @@ lookup_symbol (const char *name, size_t len,
symbol_lookup mode)
SYMBOL_TRACED (sym) = false;
SYMBOL_NAME (sym) = xmemdup0 (name, len);
SYMBOL_NAME_LEN (sym) = len;
SYMBOL_MACRO_ARGS (sym) = false;
SYMBOL_BLIND_NO_ARGS (sym) = false;
SYMBOL_DELETED (sym) = false;
SYMBOL_PENDING_EXPANSIONS (sym) = 0;

SYMBOL_NEXT (sym) = *spp;
-      (*spp) = sym;
+      sym->stack = NULL;
+      *spp = sym;

if (mode == SYMBOL_PUSHDEF && cmp == 0)
{
-         SYMBOL_SHADOWED (SYMBOL_NEXT (sym)) = true;
-         SYMBOL_TRACED (sym) = SYMBOL_TRACED (SYMBOL_NEXT (sym));
+         sym->stack = sym->next;
+         sym->next = sym->stack->next;
+         sym->stack->next = NULL;
+         SYMBOL_TRACED (sym) = SYMBOL_TRACED (sym->stack);
}
return sym;

@@ -270,24 +273,30 @@ lookup_symbol (const char *name, size_t len,
symbol_lookup mode)
return NULL;
{
bool traced = false;
-        symbol *result = sym;
-       if (SYMBOL_NEXT (sym) != NULL
-           && mode == SYMBOL_POPDEF)
+       symbol *result = sym;
+       if (sym->stack && mode == SYMBOL_POPDEF)
{
-           SYMBOL_SHADOWED (SYMBOL_NEXT (sym)) = false;
-           SYMBOL_TRACED (SYMBOL_NEXT (sym)) = SYMBOL_TRACED (sym);
+           SYMBOL_TRACED (sym->stack) = SYMBOL_TRACED (sym);
+           sym->stack->next = sym->next;
+           *spp = sym->stack;
+           sym->next = NULL;
+           sym->stack = NULL;
+           free_symbol (sym);
}
else
-         traced = SYMBOL_TRACED (sym);
-       do
{
-           *spp = SYMBOL_NEXT (sym);
-           free_symbol (sym);
-           sym = *spp;
+           traced = SYMBOL_TRACED (sym);
+           *spp = sym->next;
+           do
+             {
+               symbol *old = sym;
+               sym = sym->stack;
+               old->next = NULL;
+               old->stack = NULL;
+               free_symbol (old);
+             }
+           while (sym);
}
-       while (*spp != NULL && SYMBOL_SHADOWED (*spp)
-              && mode == SYMBOL_DELETE);
if (traced)
{
sym = (symbol *) xmalloc (sizeof (symbol));
@@ -295,16 +304,16 @@ lookup_symbol (const char *name, size_t len,
symbol_lookup mode)
SYMBOL_TRACED (sym) = true;
SYMBOL_NAME (sym) = xmemdup0 (name, len);
SYMBOL_NAME_LEN (sym) = len;
SYMBOL_MACRO_ARGS (sym) = false;
SYMBOL_BLIND_NO_ARGS (sym) = false;
SYMBOL_DELETED (sym) = false;
SYMBOL_PENDING_EXPANSIONS (sym) = 0;

SYMBOL_NEXT (sym) = *spp;
-           (*spp) = sym;
+           sym->stack = NULL;
+           *spp = sym;
}
-        return result;
+       return result;
}

default:
@@ -389,19 +398,27 @@ static void
symtab_print_list (int i)
{
symbol *sym;
+  symbol *stack;
size_t h;

xprintf ("Symbol dump #%d:\n", i);
for (h = 0; h < hash_table_size; h++)
for (sym = symtab[h]; sym != NULL; sym = sym->next)
-      xprintf ("\tname %s, len %zu, bucket %lu, addr %p, next %p, "
-              "flags%s%s%s, pending %d\n",
-              SYMBOL_NAME (sym), SYMBOL_NAME_LEN (sym),
-              (unsigned long int) h, sym, SYMBOL_NEXT (sym),
-              SYMBOL_TRACED (sym) ? " traced" : "",
-              SYMBOL_DELETED (sym) ? " deleted" : "",
-              SYMBOL_PENDING_EXPANSIONS (sym));
+      {
+       stack = sym;
+       do
+         {
+           xprintf ("\tname %s, len %zu, bucket %lu, addr %p, next %p, "
+                    "stack %p, flags%s%s, pending %d\n",
+                    SYMBOL_NAME (stack), SYMBOL_NAME_LEN (stack),
+                    (unsigned long int) h, stack, SYMBOL_NEXT (stack),
+                    stack->stack, SYMBOL_TRACED (stack) ? " traced" : "",
+                    SYMBOL_DELETED (stack) ? " deleted" : "",
+                    SYMBOL_PENDING_EXPANSIONS (stack));
+           stack = stack->stack;
+         }
+       while (stack);
+      }
}

#endif /* DEBUG_SYM */
--
1.5.6.4

>From 38a274a6e3c1e13e2d9d1c803d64b1a5414ff603 Mon Sep 17 00:00:00 2001
Date: Tue, 22 Jul 2008 17:18:10 -0600
Subject: [PATCH] Make symbol table opaque.

* src/m4.h (symtab, hash_table_size): No longer export.
* src/symtab.c (symtab): Make static, so we can later change
implementations.
(hash_table_size): New private size variable.
(symtab_init): Take size hint, rather than reading a global.
* src/m4.c (hash_table_size): Delete.
(main): Track -H as local variable instead.
* src/freeze.c (dump_symbol_CB): New function, extracted from...
(produce_frozen_state): ...here, to use hack_all_symbols rather
than raw traversal.

---
ChangeLog    |   13 +++++++
src/freeze.c |  115 ++++++++++++++++++++++++++++++----------------------------
src/m4.c     |    7 ++--
src/m4.h     |    7 +---
src/symtab.c |   17 ++++-----
5 files changed, 86 insertions(+), 73 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index dda5fc0..ed2acce 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,18 @@

+       Make symbol table opaque.
+       * src/m4.h (symtab, hash_table_size): No longer export.
+       * src/symtab.c (symtab): Make static, so we can later change
+       implementations.
+       (hash_table_size): New private size variable.
+       (symtab_init): Take size hint, rather than reading a global.
+       * src/m4.c (hash_table_size): Delete.
+       (main): Track -H as local variable instead.
+       * src/freeze.c (dump_symbol_CB): New function, extracted from...
+       (produce_frozen_state): ...here, to use hack_all_symbols rather
+       than raw traversal.
+
Backport faster hash lookups of pushdef stack collisions.
* src/m4.h (struct symbol): Replace shadowed bit by stack field.
diff --git a/src/freeze.c b/src/freeze.c
index 4d0eeb4..3a48b03 100644
--- a/src/freeze.c
+++ b/src/freeze.c
@@ -44,6 +44,65 @@ reverse_symbol_list (symbol *sym)
return result;
}

+/*-------------------------------------------------------------------.
+| Dump a stack of pushdef references to the stream F.  Designed as a |
+| callback for hack_all_symbols.                                     |
+`-------------------------------------------------------------------*/
+
+static void
+dump_symbol_CB (symbol *sym, void *f)
+{
+  FILE *file = (FILE *) f;
+  symbol *stack;
+  const builtin *bp;
+
+  /* Process all entries in each stack from the last to the first.
+     This order ensures that, at reload time, pushdef's will be
+     executed with the oldest definitions first.  */
+  sym = stack = reverse_symbol_list (sym);
+  while (sym)
+    {
+      switch (SYMBOL_TYPE (sym))
+       {
+       case TOKEN_TEXT:
+         xfprintf (file, "T%d,%d\n",
+                   (int) SYMBOL_NAME_LEN (sym),
+                   (int) strlen (SYMBOL_TEXT (sym)));
+         fwrite (SYMBOL_NAME (sym), 1, SYMBOL_NAME_LEN (sym), file);
+         fputs (SYMBOL_TEXT (sym), file);
+         fputc ('\n', file);
+         break;
+
+       case TOKEN_FUNC:
+         bp = find_builtin_by_addr (SYMBOL_FUNC (sym));
+         if (bp == NULL)
+           {
+             assert (!"dump_symbol_CB");
+             abort ();
+           }
+         xfprintf (file, "F%d,%d\n",
+                   (int) SYMBOL_NAME_LEN (sym),
+                   (int) strlen (bp->name));
+         fwrite (SYMBOL_NAME (sym), 1, SYMBOL_NAME_LEN (sym), file);
+         fputs (bp->name, file);
+         fputc ('\n', file);
+         break;
+
+       case TOKEN_VOID:
+         /* Ignore placeholder tokens that exist due to traceon.  */
+         break;
+
+       default:
+         assert (!"dump_symbol_CB");
+         abort ();
+         break;
+       }
+      sym = sym->stack;
+    }
+  /* Reverse the stack once more, putting it back as it was.  */
+  reverse_symbol_list (stack);
+}
+
/*------------------------------------------------.
| Produce a frozen state to the given file NAME.  |
`------------------------------------------------*/
@@ -52,10 +111,6 @@ void
produce_frozen_state (const char *name)
{
FILE *file;
-  int h;
-  symbol *stack;
-  symbol *sym;
-  const builtin *bp;

file = fopen (name, O_BINARY ? "wb" : "w");
if (!file)
@@ -96,57 +151,7 @@ produce_frozen_state (const char *name)

/* Dump all symbols.  */

-  for (stack = symtab[h = 0]; h < hash_table_size;
-       stack = (stack ? SYMBOL_NEXT (stack) : symtab[h++]))
-    {
-      if (!stack)
-       continue;
-      /* Process all entries in each stack from the last to the first.
-        This order ensures that, at reload time, pushdef's will be
-        executed with the oldest definitions first.  */
-      sym = stack = reverse_symbol_list (stack);
-      while (sym)
-       {
-         switch (SYMBOL_TYPE (sym))
-           {
-           case TOKEN_TEXT:
-             xfprintf (file, "T%d,%d\n",
-                       (int) SYMBOL_NAME_LEN (sym),
-                       (int) strlen (SYMBOL_TEXT (sym)));
-             fwrite (SYMBOL_NAME (sym), 1, SYMBOL_NAME_LEN (sym), file);
-             fputs (SYMBOL_TEXT (sym), file);
-             fputc ('\n', file);
-             break;
-
-           case TOKEN_FUNC:
-             bp = find_builtin_by_addr (SYMBOL_FUNC (sym));
-             if (bp == NULL)
-               {
-                 assert (!"produce_frozen_state");
-                 abort ();
-               }
-             xfprintf (file, "F%d,%d\n",
-                       (int) SYMBOL_NAME_LEN (sym),
-                       (int) strlen (bp->name));
-             fwrite (SYMBOL_NAME (sym), 1, SYMBOL_NAME_LEN (sym), file);
-             fputs (bp->name, file);
-             fputc ('\n', file);
-             break;
-
-           case TOKEN_VOID:
-             /* Ignore placeholder tokens that exist due to traceon.  */
-             break;
-
-           default:
-             assert (!"produce_frozen_state");
-             abort ();
-             break;
-           }
-         sym = sym->stack;
-       }
-      /* Reverse the stack once more, putting it back as it was.  */
-      stack = reverse_symbol_list (stack);
-    }
+  hack_all_symbols (dump_symbol_CB, file);

/* Let diversions be issued from output.c module, its cleaner to have this
piece of code there.  */
diff --git a/src/m4.c b/src/m4.c
index c73e275..7f0af7e 100644
--- a/src/m4.c
+++ b/src/m4.c
@@ -38,9 +38,6 @@ int sync_output = 0;
/* Debug (-d[flags]).  */
int debug_level = 0;

-/* Hash table size (should be a prime) (-Hsize).  */
-size_t hash_table_size = HASHMAX;
-
/* Disable GNU extensions (-G).  */
int no_gnu_extensions = 0;

@@ -412,6 +409,8 @@ main (int argc, char *const *argv, char *const *envp)
const char *frozen_file_to_write = NULL;
const char *macro_sequence = "";
+  /* Hash table size (should be a prime) (-Hsize).  */
+  size_t hash_table_size = HASHMAX;

set_program_name (argv[0]);
retcode = EXIT_SUCCESS;
@@ -592,7 +591,7 @@ main (int argc, char *const *argv, char *const *envp)

input_init ();
output_init ();
-  symtab_init ();
+  symtab_init (hash_table_size);
set_macro_sequence (macro_sequence);
include_env_init ();

diff --git a/src/m4.h b/src/m4.h
index dcb4613..011439e 100644
--- a/src/m4.h
+++ b/src/m4.h
@@ -127,7 +127,6 @@ typedef unsigned int bool_bitfield;
/* Option flags.  */
extern int sync_output;                        /* -s */
extern int debug_level;                        /* -d */
-extern size_t hash_table_size;         /* -H */
extern int no_gnu_extensions;          /* -G */
extern int prefix_all_builtins;                /* -P */
extern size_t max_debug_argument_length;/* -l */
@@ -448,10 +447,8 @@ typedef void hack_symbol (symbol *, void *);

#define HASHMAX 509            /* default, overridden by -Hsize */

-extern symbol **symtab;
-
-void free_symbol (symbol *sym);
-void symtab_init (void);
+void free_symbol (symbol *);
+void symtab_init (size_t);
symbol *lookup_symbol (const char *, size_t, symbol_lookup);
void hack_all_symbols (hack_symbol *, void *);

diff --git a/src/symtab.c b/src/symtab.c
index 381d025..7420e50 100644
--- a/src/symtab.c
+++ b/src/symtab.c
@@ -93,18 +93,17 @@ profile_memcmp (const char *s1, const char *s2, size_t l)
`----------------------------------------------------------------------*/

/* Pointer to symbol table.  */
-symbol **symtab;
+static symbol **symtab;
+
+/* Number of buckets in symbol table.  */
+static size_t hash_table_size;

void
-symtab_init (void)
+symtab_init (size_t size)
{
-  size_t i;
-  symbol **s;
-
-  s = symtab = (symbol **) xnmalloc (hash_table_size, sizeof (symbol *));
-
-  for (i = 0; i < hash_table_size; i++)
-    s[i] = NULL;
+  hash_table_size = size;
+  symtab = (symbol **) xnmalloc (hash_table_size, sizeof *symtab);
+  memset (symtab, 0, hash_table_size * sizeof *symtab);

#ifdef DEBUG_SYM
atexit (show_profile); /* Ignore failure, since this is debug code.  */
--
1.5.6.4

>From ffaa885fc0efb39bf0ca0df0a592a1c39f92729c Mon Sep 17 00:00:00 2001
Date: Sat, 26 Jul 2008 13:27:02 -0600
Subject: [PATCH] Use hash module for self-growing symbol table.

* m4/gnulib-cache.m4: Import hash module.
* src/m4.h (struct symbol): Remove next member, change stack to be
circular.
(SYMBOL_NEXT): Delete.
(symtab_free): New prototype.
* src/symtab.c (show_profile) [DEBUG_SYM]: Track more hash
statistics, and dump to /dev/tty rather than stderr.
(symtab): Change type.
(hash_table_size): Delete.
(symtab_hasher, symtab_comparator, symtab_free_entry): New
functions.
(symtab_init, lookup_symbol, hack_all_symbols): Rewrite to wrap
external hash table.
(symtab_free): New function.
* src/m4.c (main): Call symbol table cleanup.
* src/freeze.c (dump_symbol_CB, reverse_symbol_list): Deal with
fact that pushdef stack is now circular.

---
ChangeLog          |   22 ++++
m4/gnulib-cache.m4 |    3 +-
src/freeze.c       |   25 ++++--
src/m4.c           |    5 +
src/m4.h           |    5 +-
src/symtab.c       |  271 +++++++++++++++++++++++++++++++--------------------
6 files changed, 213 insertions(+), 118 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index ed2acce..de44226 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,25 @@
+
+       Use hash module for self-growing symbol table.
+       * m4/gnulib-cache.m4: Import hash module.
+       * src/m4.h (struct symbol): Remove next member, change stack to be
+       circular.
+       (SYMBOL_NEXT): Delete.
+       (symtab_free): New prototype.
+       * src/symtab.c (show_profile) [DEBUG_SYM]: Track more hash
+       statistics, and dump to /dev/tty rather than stderr.
+       (symtab): Change type.
+       (hash_table_size): Delete.
+       (symtab_hasher, symtab_comparator, symtab_free_entry): New
+       functions.
+       (symtab_init, lookup_symbol, hack_all_symbols): Rewrite to wrap
+       external hash table.
+       (symtab_free): New function.
+       * src/m4.c (main): Call symbol table cleanup.
+       * src/freeze.c (dump_symbol_CB, reverse_symbol_list): Deal with
+       fact that pushdef stack is now circular.
+

Make symbol table opaque.
diff --git a/m4/gnulib-cache.m4 b/m4/gnulib-cache.m4
index f716908..a512830 100644
--- a/m4/gnulib-cache.m4
+++ b/m4/gnulib-cache.m4
@@ -15,7 +15,7 @@

# Specification in the form of a command-line invocation:
-#   gnulib-tool --import --dir=. --local-dir=local --lib=libm4
--source-base=lib --m4-base=m4 --doc-base=doc --aux-dir=build-aux --with-tests
--no-libtool --macro-prefix=M4 announce-gen assert autobuild avltree-oset
binary-io clean-temp cloexec close-stream closein config-h error fdl fflush
flexmember fopen-safer fseeko gendocs getopt git-version-gen gnumakefile
gnupload gpl-3.0 intprops memmem mkstemp obstack obstack-printf-posix progname
quote regex stdbool stdint stdlib-safer strtod strtol unlocked-io
vasnprintf-posix verror version-etc version-etc-fsf xalloc xmemdup0 xprintf
xvasprintf-posix
+#   gnulib-tool --import --dir=. --local-dir=local --lib=libm4
--source-base=lib --m4-base=m4 --doc-base=doc --aux-dir=build-aux --with-tests
--no-libtool --macro-prefix=M4 announce-gen assert autobuild avltree-oset
binary-io clean-temp cloexec close-stream closein config-h error fdl fflush
flexmember fopen-safer fseeko gendocs getopt git-version-gen gnumakefile
gnupload gpl-3.0 hash intprops memmem mkstemp obstack obstack-printf-posix
progname quote regex stdbool stdint stdlib-safer strtod strtol unlocked-io
vasnprintf-posix verror version-etc version-etc-fsf xalloc xmemdup0 xprintf
xvasprintf-posix

# Specification in the form of a few gnulib-tool.m4 macro invocations:
gl_LOCAL_DIR([local])
@@ -42,6 +42,7 @@ gl_MODULES([
gnumakefile
gpl-3.0
+  hash
intprops
memmem
mkstemp
diff --git a/src/freeze.c b/src/freeze.c
index 3a48b03..2a7d9dc 100644
--- a/src/freeze.c
+++ b/src/freeze.c
@@ -30,17 +30,24 @@
static symbol *
reverse_symbol_list (symbol *sym)
{
-  symbol *result;
+  symbol *first = sym;
symbol *next;
+  symbol *prev = sym;
+  symbol *result;

-  result = NULL;
-  while (sym)
+  assert (sym);
+  if (sym->stack == sym)
+    return sym;
+  next = sym->stack;
+  do
{
-      next = sym->stack;
-      sym->stack = result;
-      result = sym;
+      result = prev;
sym = next;
+      next = sym->stack;
+      sym->stack = prev;
+      prev = sym;
}
+  while (prev != first);
return result;
}

@@ -59,8 +66,9 @@ dump_symbol_CB (symbol *sym, void *f)
/* Process all entries in each stack from the last to the first.
This order ensures that, at reload time, pushdef's will be
executed with the oldest definitions first.  */
-  sym = stack = reverse_symbol_list (sym);
-  while (sym)
+  stack = sym;
+  sym = reverse_symbol_list (sym);
+  do
{
switch (SYMBOL_TYPE (sym))
{
@@ -99,6 +107,7 @@ dump_symbol_CB (symbol *sym, void *f)
}
sym = sym->stack;
}
+  while (sym != stack->stack);
/* Reverse the stack once more, putting it back as it was.  */
reverse_symbol_list (stack);
}
diff --git a/src/m4.c b/src/m4.c
index 7f0af7e..551d80c 100644
--- a/src/m4.c
+++ b/src/m4.c
@@ -683,8 +683,13 @@ main (int argc, char *const *argv, char *const *envp)
undivert_all ();
}
output_exit ();
+#ifndef NDEBUG
+  /* Only spend time freeing memory to help isolate leaks; if
+     assertions are disabled, save the time and exit now.  */
free_regex ();
quotearg_free ();
+  symtab_free ();
+#endif /* NDEBUG */
#ifdef DEBUG_REGEX
if (trace_file)
fclose (trace_file);
diff --git a/src/m4.h b/src/m4.h
index 011439e..ff0377a 100644
--- a/src/m4.h
+++ b/src/m4.h
@@ -416,8 +416,7 @@ enum symbol_lookup
/* Symbol table entry.  */
struct symbol
{
-  struct symbol *next; /* Next symbol with the same hash.  */
-  struct symbol *stack; /* Stack of shadowed symbols of the same name.  */
+  struct symbol *stack; /* Circular list for pushdef stack of symbol.  */
bool_bitfield traced : 1;
bool_bitfield macro_args : 1;
bool_bitfield blind_no_args : 1;
@@ -429,7 +428,6 @@ struct symbol
token_data data;  /* Type should be only TOKEN_TEXT or TOKEN_FUNC.  */
};

-#define SYMBOL_NEXT(S)         ((S)->next)
#define SYMBOL_TRACED(S)       ((S)->traced)
#define SYMBOL_MACRO_ARGS(S)   ((S)->macro_args)
#define SYMBOL_BLIND_NO_ARGS(S)        ((S)->blind_no_args)
@@ -449,6 +447,7 @@ typedef void hack_symbol (symbol *, void *);

void free_symbol (symbol *);
void symtab_init (size_t);
+void symtab_free (void);
symbol *lookup_symbol (const char *, size_t, symbol_lookup);
void hack_all_symbols (hack_symbol *, void *);

diff --git a/src/symtab.c b/src/symtab.c
index 7420e50..a9160c8 100644
--- a/src/symtab.c
+++ b/src/symtab.c
@@ -33,6 +33,8 @@

#include "m4.h"

+#include "hash.h"
+
#ifdef DEBUG_SYM
/* When evaluating hash table performance, this profiling code shows
how many collisions were encountered.  */
@@ -47,19 +49,28 @@ struct profile

static struct profile profiles[5];
static symbol_lookup current_mode;
+static unsigned long long hash_entry;
+static unsigned long long comparator_entry;
+static size_t current_size;
+static unsigned int resizes;

/* On exit, show a profile of symbol table performance.  */
static void
show_profile (void)
{
int i;
+  FILE *f = fopen ("/dev/tty", "w");
for (i = 0; i < 5; i++)
{
-      xfprintf(stderr, "m4: lookup mode %d called %d times, %d compares, "
+      xfprintf(f, "m4: lookup mode %d called %d times, %d compares, "
"%d misses, %lld bytes\n",
i, profiles[i].entry, profiles[i].comparisons,
profiles[i].misses, profiles[i].bytes);
}
+  xfprintf(f, "m4: %llu hash callbacks, %llu compare callbacks, "
+          "%zu buckets, %u resizes\n",
+          hash_entry, comparator_entry, current_size, resizes - 1);
+  fclose (f);
}

/* Like memcmp (S1, S2, L), but also track profiling statistics.  */
@@ -87,33 +98,12 @@ profile_memcmp (const char *s1, const char *s2, size_t l)
#endif /* DEBUG_SYM */

-/*----------------------------------------------------------------------.
-| Initialise the symbol table, by allocating the necessary storage, and |
-| zeroing all the entries.                                             |
-`----------------------------------------------------------------------*/
-
/* Pointer to symbol table.  */
-static symbol **symtab;
-
-/* Number of buckets in symbol table.  */
-static size_t hash_table_size;
-
-void
-symtab_init (size_t size)
-{
-  hash_table_size = size;
-  symtab = (symbol **) xnmalloc (hash_table_size, sizeof *symtab);
-  memset (symtab, 0, hash_table_size * sizeof *symtab);
-
-#ifdef DEBUG_SYM
-  atexit (show_profile); /* Ignore failure, since this is debug code.  */
-#endif /* DEBUG_SYM */
-}
+static Hash_table *symtab;

/*--------------------------------------------------.
| Return a hashvalue for a string S of length LEN.  |
`--------------------------------------------------*/
-
static size_t
hash (const char *s, size_t len)
{
@@ -126,6 +116,82 @@ hash (const char *s, size_t len)
return val;
}

+/*----------------------------------------------------.
+| Wrap our hash inside signature expected by hash.h.  |
+`----------------------------------------------------*/
+static size_t
+symtab_hasher (const void *entry, size_t buckets)
+{
+#ifdef DEBUG_SYM
+  hash_entry++;
+  if (buckets != current_size)
+    {
+      resizes++;
+      current_size = buckets;
+    }
+#endif /* DEBUG_SYM */
+  const symbol *sym = (const symbol *) entry;
+  return hash (SYMBOL_NAME (sym), SYMBOL_NAME_LEN (sym)) % buckets;
+}
+
+/*----------------------------------------------.
+| Compare two hash table entries for equality.  |
+`----------------------------------------------*/
+static bool
+symtab_comparator (const void *entry_a, const void *entry_b)
+{
+#ifdef DEBUG_SYM
+  comparator_entry++;
+#endif /* DEBUG_SYM */
+  const symbol *sym_a = (const symbol *) entry_a;
+  const symbol *sym_b = (const symbol *) entry_b;
+  return (SYMBOL_NAME_LEN (sym_a) == SYMBOL_NAME_LEN (sym_b)
+         && memcmp (SYMBOL_NAME (sym_a), SYMBOL_NAME (sym_b),
+                    SYMBOL_NAME_LEN (sym_a)) == 0);
+}
+
+/*---------------------------.
+| Reclaim an entry on exit.  |
+`---------------------------*/
+static void
+symtab_free_entry (void *entry)
+{
+  symbol *sym = (symbol *) entry;
+  while (sym->stack != sym)
+    {
+      symbol *old = sym->stack;
+      sym->stack = old->stack;
+      assert (!SYMBOL_PENDING_EXPANSIONS (old));
+      free_symbol (old);
+    }
+  assert (!SYMBOL_PENDING_EXPANSIONS (sym));
+  free_symbol (sym);
+}
+
+/*--------------------------------------------------------------.
+| Initialize the symbol table, with SIZE as a hint for expected |
+| number of entries.                                           |
+`--------------------------------------------------------------*/
+void
+symtab_init (size_t size)
+{
+  symtab = hash_initialize (size, NULL, symtab_hasher, symtab_comparator,
+                           symtab_free_entry);
+
+#ifdef DEBUG_SYM
+  atexit (show_profile); /* Ignore failure, since this is debug code.  */
+#endif /* DEBUG_SYM */
+}
+
+/*------------------------.
+| Clean up entire table.  |
+`------------------------*/
+void
+symtab_free (void)
+{
+  hash_free (symtab);
+}
+
/*--------------------------------------------.
| Free all storage associated with a symbol.  |
`--------------------------------------------*/
@@ -162,38 +228,23 @@ free_symbol (symbol *sym)
symbol *
lookup_symbol (const char *name, size_t len, symbol_lookup mode)
{
-  size_t h;
-  int cmp = 1;
-  symbol *sym, *prev;
-  symbol **spp;
+  symbol *sym;
+  symbol *entry;
+  symbol tmp;

#if DEBUG_SYM
current_mode = mode;
profiles[mode].entry++;
#endif /* DEBUG_SYM */

-  h = hash (name, len);
-  sym = symtab[h % hash_table_size];
-
-  for (prev = NULL; sym != NULL; prev = sym, sym = sym->next)
-    {
-      cmp = (len < SYMBOL_NAME_LEN (sym) ? -1 : len > SYMBOL_NAME_LEN (sym) ? 1
-            : memcmp (SYMBOL_NAME (sym), name, len));
-      if (cmp >= 0)
-       break;
-    }
-
-  /* If just searching, return status of search.  */
-
-  if (mode == SYMBOL_LOOKUP)
-    return cmp == 0 ? sym : NULL;
-
-
-  spp = (prev != NULL) ?  &prev->next : &symtab[h % hash_table_size];
+  tmp.name = (char *) name;
+  tmp.len = len;
+  entry = (symbol *) hash_lookup (symtab, &tmp);

switch (mode)
{
+    case SYMBOL_LOOKUP:
+      return entry ? entry->stack : NULL;

case SYMBOL_INSERT:

@@ -202,14 +253,15 @@ lookup_symbol (const char *name, size_t len,
symbol_lookup mode)
a new one; if not, just return the symbol.  If not found, just
insert the name, and return the new symbol.  */

-      if (cmp == 0 && sym != NULL)
+      if (entry)
{
+         sym = entry->stack;
if (SYMBOL_PENDING_EXPANSIONS (sym) > 0)
{
symbol *old = sym;
SYMBOL_DELETED (old) = true;

-             sym = (symbol *) xmalloc (sizeof (symbol));
+             sym = (symbol *) xmalloc (sizeof *sym);
SYMBOL_TYPE (sym) = TOKEN_VOID;
SYMBOL_TRACED (sym) = SYMBOL_TRACED (old);
SYMBOL_NAME (sym) = xmemdup0 (name, len);
@@ -219,11 +271,20 @@ lookup_symbol (const char *name, size_t len,
symbol_lookup mode)
SYMBOL_DELETED (sym) = false;
SYMBOL_PENDING_EXPANSIONS (sym) = 0;

-             SYMBOL_NEXT (sym) = SYMBOL_NEXT (old);
-             SYMBOL_NEXT (old) = NULL;
-             sym->stack = old->stack;
+             if (old == entry)
+               {
+                 old = (symbol *) hash_delete (symtab, entry);
+                 assert (entry == old);
+                 sym->stack = sym;
+                 entry = (symbol *) hash_insert (symtab, sym);
+                 assert (sym == entry);
+               }
+             else
+               {
+                 entry->stack = sym;
+                 sym->stack = old->stack;
+               }
old->stack = NULL;
-             *spp = sym;
}
return sym;
}
@@ -231,11 +292,13 @@ lookup_symbol (const char *name, size_t len,
symbol_lookup mode)

case SYMBOL_PUSHDEF:

-      /* Insert a name in the symbol table.  If there is already a symbol
-        with the name, insert this in front of it, and mark the old
-
-      sym = (symbol *) xmalloc (sizeof (symbol));
+      /* Insert a name in the symbol table.  If there is already a
+        symbol with the name, add it to the pushdef stack.  Since the
+        hash table does not allow the insertion of duplicates, the
+        pushdef stack is a circular chain; the hash entry is the
+        oldest entry, which points to the newest entry; all other
+        entries point to the next older entry.  */
+      sym = (symbol *) xmalloc (sizeof *sym);
SYMBOL_TYPE (sym) = TOKEN_VOID;
SYMBOL_TRACED (sym) = false;
SYMBOL_NAME (sym) = xmemdup0 (name, len);
@@ -245,17 +308,19 @@ lookup_symbol (const char *name, size_t len,
symbol_lookup mode)
SYMBOL_DELETED (sym) = false;
SYMBOL_PENDING_EXPANSIONS (sym) = 0;

-      SYMBOL_NEXT (sym) = *spp;
-      sym->stack = NULL;
-      *spp = sym;
-
-      if (mode == SYMBOL_PUSHDEF && cmp == 0)
+      if (entry)
{
-         sym->stack = sym->next;
-         sym->next = sym->stack->next;
-         sym->stack->next = NULL;
+         assert (mode == SYMBOL_PUSHDEF);
+         sym->stack = entry->stack;
+         entry->stack = sym;
SYMBOL_TRACED (sym) = SYMBOL_TRACED (sym->stack);
}
+      else
+       {
+         sym->stack = sym;
+         entry = (symbol *) hash_insert (symtab, sym);
+         assert (sym == entry);
+       }
return sym;

case SYMBOL_DELETE:
@@ -268,37 +333,36 @@ lookup_symbol (const char *name, size_t len,
symbol_lookup mode)
definition is still in use, let the caller free the memory
after it is done with the symbol.  */

-      if (cmp != 0 || sym == NULL)
+      if (!entry)
return NULL;
{
bool traced = false;
-       symbol *result = sym;
-       if (sym->stack && mode == SYMBOL_POPDEF)
+       symbol *result = sym = entry->stack;
+       if (sym != entry && mode == SYMBOL_POPDEF)
{
SYMBOL_TRACED (sym->stack) = SYMBOL_TRACED (sym);
-           sym->stack->next = sym->next;
-           *spp = sym->stack;
-           sym->next = NULL;
+           entry->stack = sym->stack;
sym->stack = NULL;
free_symbol (sym);
}
else
{
traced = SYMBOL_TRACED (sym);
-           *spp = sym->next;
-           do
+           while (sym != entry)
{
symbol *old = sym;
sym = sym->stack;
-               old->next = NULL;
old->stack = NULL;
free_symbol (old);
}
-           while (sym);
+           sym = (symbol *) hash_delete (symtab, entry);
+           assert (sym == entry);
+           sym->stack = NULL;
+           free_symbol (sym);
}
if (traced)
{
-           sym = (symbol *) xmalloc (sizeof (symbol));
+           sym = (symbol *) xmalloc (sizeof *sym);
SYMBOL_TYPE (sym) = TOKEN_VOID;
SYMBOL_TRACED (sym) = true;
SYMBOL_NAME (sym) = xmemdup0 (name, len);
@@ -308,9 +372,9 @@ lookup_symbol (const char *name, size_t len, symbol_lookup
mode)
SYMBOL_DELETED (sym) = false;
SYMBOL_PENDING_EXPANSIONS (sym) = 0;

-           SYMBOL_NEXT (sym) = *spp;
-           sym->stack = NULL;
-           *spp = sym;
+           sym->stack = sym;
+           entry = (symbol *) hash_insert (symtab, sym);
+           assert (sym == entry);
}
return result;
}
@@ -335,20 +399,17 @@ lookup_symbol (const char *name, size_t len,
symbol_lookup mode)
void
hack_all_symbols (hack_symbol *func, void *data)
{
-  size_t h;
-  symbol *sym;
+  symbol *sym = (symbol *) hash_get_first (symtab);
symbol *next;

-  for (h = 0; h < hash_table_size; h++)
+  while (sym)
{
/* We allow func to call SYMBOL_POPDEF, which can invalidate
sym, so we must grab the next element to traverse before
calling func.  */
-      for (sym = symtab[h]; sym != NULL; sym = next)
-       {
-         next = SYMBOL_NEXT (sym);
-         func (sym, data);
-       }
+      next = (symbol *) hash_get_next (symtab, sym);
+      func (sym->stack, data);
+      sym = next;
}
}

@@ -396,28 +457,26 @@ symtab_debug (void)
static void
symtab_print_list (int i)
{
-  symbol *sym;
+  symbol *sym = (symbol *) hash_get_first (symtab);
symbol *stack;
-  size_t h;

xprintf ("Symbol dump #%d:\n", i);
-  for (h = 0; h < hash_table_size; h++)
-    for (sym = symtab[h]; sym != NULL; sym = sym->next)
-      {
-       stack = sym;
-       do
-         {
-           xprintf ("\tname %s, len %zu, bucket %lu, addr %p, next %p, "
-                    "stack %p, flags%s%s, pending %d\n",
-                    SYMBOL_NAME (stack), SYMBOL_NAME_LEN (stack),
-                    (unsigned long int) h, stack, SYMBOL_NEXT (stack),
-                    stack->stack, SYMBOL_TRACED (stack) ? " traced" : "",
-                    SYMBOL_DELETED (stack) ? " deleted" : "",
-                    SYMBOL_PENDING_EXPANSIONS (stack));
-           stack = stack->stack;
-         }
-       while (stack);
-      }
+  while (sym)
+    {
+      stack = sym->stack;
+      do
+       {
+         xprintf ("\tname %s, len %zu, addr %p, "
+                  "stack %p, flags%s%s, pending %d\n",
+                  SYMBOL_NAME (stack), SYMBOL_NAME_LEN (stack),
+                  stack, stack->stack, SYMBOL_TRACED (stack) ? " traced" : "",
+                  SYMBOL_DELETED (stack) ? " deleted" : "",
+                  SYMBOL_PENDING_EXPANSIONS (stack));
+         stack = stack->stack;
+       }
+      while (stack != sym);
+      sym = (symbol *) hash_get_next (symtab, sym);
+    }
}

#endif /* DEBUG_SYM */
--
1.5.6.4

```