help-gawk
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: How does gawk allocate memory for arrays?


From: Ed Morton
Subject: Re: How does gawk allocate memory for arrays?
Date: Mon, 30 May 2022 21:46:09 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1

On 5/30/2022 9:41 PM, Andrew J. Schorr wrote:
On Mon, May 30, 2022 at 08:54:23AM -0500, Ed Morton wrote:
A question came up recently that can be reduced to comparing the
line numbers associated with unique key values in 2 files that are
over 10G lines each. While unique, those values only need to be
compared within groups of 10 contiguous lines (in reality there's
other things going on irrelevant to this discussion).

So we had:

  * Approach 1:  while reading file1 create an array that's 10G of
    entries then while reading file2 just do a hash lookup.
  * Approach 2: while reading file1 for every 10 lines create an array
    that holds those 10 entries, then getline 10 entries from file2 into
    a second array of 10 entries, then loop through all the values in
    the file1 array to compare to the file2 array then delete both
    arrays and start over.

Obviously the 2nd approach was going to use far less memory but
according to the OP it was also an order of magnitude faster than
the 1st approach so that got me wondering about how awk arrays are
allocated, e.g. is there a default size that gets allocated
initially and then new chunks of memory get allocated as needed? If
so what is that size?  I expect I could find and read the code but
I'm really hoping to not have to do that and the design is
documented somewhere or someone can just tell me what it is.
There is a lot of overhead to allocating array entries. My guess
is that the system is paging in approach 1. Did the OP check for paging?
I doubt it.
There is an initial default array size, but the array is expanded as
necessary.
That's what I suspected and I was asking out of curiosity if anyone know what that size was and if the expansion occurs in chunks that are the same size as the original or different sizes.
  The problem is that each array entry has a lot of overhead.
The array format is not memory-efficient.
Good to know.

Thanks,

    Ed.

Regards,
Andy


reply via email to

[Prev in Thread] Current Thread [Next in Thread]