How does gawk allocate memory for arrays?


From: Ed Morton
Subject: How does gawk allocate memory for arrays?
Date: Mon, 30 May 2022 08:54:23 -0500
User-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1

A question came up recently that can be reduced to comparing the line numbers associated with unique key values in two files that are each over 10G lines long. Although the key values are unique, they only need to be compared within groups of 10 contiguous lines (in reality there are other things going on that are irrelevant to this discussion).

So we had:

 * Approach 1: while reading file1, create an array of 10G entries,
   then while reading file2 just do a hash lookup for each line.
 * Approach 2: while reading file1, for every 10 lines create an array
   that holds those 10 entries, then getline 10 entries from file2 into
   a second 10-entry array, loop through all the values in the file1
   array comparing them to the file2 array, then delete both arrays
   and start over (rough sketches of both approaches are below).
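
To make this concrete, here are minimal sketches of the two approaches (simplified, not the actual scripts; I'm assuming the key value is in $1 and, for the second one, that the name of file2 is passed in with -v file2=...):

    # Approach 1: gawk -f approach1.awk file1 file2
    NR == FNR  { seen[$1] = FNR; next }      # file1: key -> line number, ~10G entries
    $1 in seen { print $1, seen[$1], FNR }   # file2: one hash lookup per line

    # Approach 2: gawk -v file2=file2 -f approach2.awk file1
    {
        a[FNR % 10] = $1                     # collect a group of 10 keys from file1
        if (FNR % 10 == 0) {
            for (i = 1; i <= 10; i++)        # getline the matching 10 lines from file2
                if ((getline line < file2) > 0) {
                    split(line, f)
                    b[i % 10] = f[1]
                }
            for (k in a)                     # compare the two small arrays
                if (a[k] == b[k])
                    print a[k]
            delete a                         # delete both arrays and start over
            delete b
        }
    }

(The second sketch ignores a final group of fewer than 10 lines, but it shows the shape of the fill/compare/delete cycle.)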

Obviously the second approach was going to use far less memory, but according to the OP it was also an order of magnitude faster than the first approach, so that got me wondering how gawk allocates memory for arrays. For example, is there a default size that gets allocated initially, with new chunks of memory allocated as needed, and if so, what is that size? I expect I could find and read the code, but I'm really hoping I don't have to do that because the design is documented somewhere, or someone can just tell me what it is.

    Ed.

