Re: [bug-gawk] Memory leak
From: Stephane Delsert
Subject: Re: [bug-gawk] Memory leak
Date: Mon, 27 Mar 2017 18:03:42 +0000
Hi,
I've attached a small sample and a small script in case you want to create a
bigger file. This script doesn't change the initial order. My user-defined sort
function uses 2 internal tables, which could be a line of investigation; I tried
a test that sets up those tables in the BEGIN block, but without success.
Normally I use gawk as a filter for simple processing. The number of input and
output lines is huge, but the processing remains simple. This tool is already
very powerful, and I have processed several billion lines with high
performance; nevertheless, I will study everything that this extension can
offer.
Many thanks,
Regards,
Stéphane.
-----Original Message-----
From: Andrew J. Schorr [mailto:address@hidden]
Sent: Monday, 27 March 2017 17:20
To: Stephane Delsert <address@hidden>
Cc: address@hidden; Fatima Aliane <address@hidden>; Vihan_Sharma - Vihan Sharma
(LiveRamp) <address@hidden>
Subject: Re: [bug-gawk] Memory leak
Hi,
Thanks for the bug report. Is it possible for you to supply a small sample dataset
that can be used with this script?
Also, gawk's array implementation currently incurs a lot of overhead for each
array entry saved. I think the last time I measured this, it was around 253
bytes per array element when the index and the value were both strings. Since
you are using numeric indices, the overhead should be less, but it still can
consume a tremendous amount of memory. If you load 320 million records, that
might come to tens of GB of overhead. Are you certain that the
PROCINFO["sorted_in"] setting really matters? I wonder if this is simply a
problem with gawk array overhead.
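To put rough numbers on that estimate, here is a back-of-the-envelope sketch. It assumes the 253 bytes per element quoted above applies uniformly and that all 320 million records sit in one array at the same time; both are simplifications, and the function name estimate_overhead is made up for illustration.

```sh
# Rough upper bound on gawk array overhead: elements * bytes per element.
# The 253-byte figure is the string-index/string-value measurement quoted
# above; numeric indices should cost less, so treat the result as a ceiling.
estimate_overhead() {
    awk -v n="$1" -v per="$2" \
        'BEGIN { printf "%.1f GiB\n", n * per / (1024 ^ 3) }'
}

estimate_overhead 320000000 253   # prints "75.4 GiB"
```

That is consistent with "tens of GB" if every record were retained. It says nothing about an array that is emptied regularly, so it cannot by itself separate plain overhead from a leak.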
For working with massive datasets, you might consider trying the gawkextlib
lmdb extension. It is very fast and handles large key-value stores. You can
download it here:
https://sourceforge.net/projects/gawkextlib/files/
Regards,
Andy
On Mon, Mar 27, 2017 at 02:42:28PM +0000, Stephane Delsert wrote:
> Hi,
>
> We hit a memory leak with gawk when running the attached script. This script
> takes a file already sorted on its primary keys and sorts it on additional
> keys. To achieve this, I used a user-defined function and set it as follows:
> PROCINFO["sorted_in"]="__sort_subsort"
> We noticed that the memory required by gawk grows with the number of
> processed records. Gawk ended after more than 320 million records. The memory
> size was over 20 GB. A post-mortem analysis showed that the maximum size of
> the tables in the script was 121 elements.
> I ran different tests, and it appears that this issue doesn't occur when I
> don't use the PROCINFO mechanism. For small files, this script works correctly.
>
> I didn't see this kind of bug in the bug reports. I made tests with version
> 4.1.3 and version 4.1.4 without success.
>
> Thank you for your help.
>
> Best regards,
>
> Stéphane Delsert.
>
> BEGIN {
>     FS="|"
>     OFS="|"
>
>     sort_old_key_1=""
>     sort_old_key_2=""
>     sort_old_key_3=""
>     sort_old_key_4=""
>     sort_old_key_5=""
>     sort_old_key_6=""
>     sort_old_key_7=""
>     sort_old_key_8=""
>     sort_old_key_9=""
>     split("", tab_store);
>     split("", subsort_tab1);
>     split("", subsort_tab2);
>     nb_tab_store=0;
>     PROCINFO["sorted_in"]="__sort_subsort"
> }
> {
>     FIELD0=$1
>     FIELD1=$2
>     FIELD2=$3
>     FIELD3=$4
>     FIELD4=$5
>     FIELD5=$6
>     FIELD6=$7
>     FIELD7=$8
>     FIELD8=$9
>     FIELD9=$10
>     FIELD10=$11
>     FIELD11=$12
>
>     sort_key_1=" " FIELD2
>     sort_key_2=" " FIELD3
>     sort_key_3=" " FIELD4
>     sort_key_4=" " FIELD5
>     sort_key_5=" " FIELD6
>     sort_key_6=" " FIELD7
>     sort_key_7=" " FIELD8
>     sort_key_8=" " FIELD1
>     sort_key_9=" " FIELD9
>     sort_prim_compare = ( ( sort_old_key_1 < sort_key_1 ) ? -1 : ( ( sort_old_key_1 == sort_key_1 ) ? 0 : 1 ) );
>     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_2 < sort_key_2 ) ? -1 : ( ( ( sort_old_key_2 == sort_key_2 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_3 < sort_key_3 ) ? -1 : ( ( ( sort_old_key_3 == sort_key_3 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_4 < sort_key_4 ) ? -1 : ( ( ( sort_old_key_4 == sort_key_4 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_5 < sort_key_5 ) ? -1 : ( ( ( sort_old_key_5 == sort_key_5 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_6 < sort_key_6 ) ? -1 : ( ( ( sort_old_key_6 == sort_key_6 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_7 < sort_key_7 ) ? -1 : ( ( ( sort_old_key_7 == sort_key_7 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_8 < sort_key_8 ) ? -1 : ( ( ( sort_old_key_8 == sort_key_8 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_9 < sort_key_9 ) ? -1 : ( ( ( sort_old_key_9 == sort_key_9 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>
>     if ( ( sort_prim_compare > 0 ) && ( NR > 1 ) ) {
>         print "file not correctly sorted at " NR " line " > ".sortcsv.sh_14831_S.acx_error_message.9d"
>         exit 9
>     }
>
>     sort_sec_key_1=" " FIELD11
>     if ( ( sort_prim_compare != 0 ) || ( NR == 1 ) ) {
>         if ( nb_tab_store > 1 ) {
>             for ( sort_tmp_line in tab_store ) {
>                 print tab_store[sort_tmp_line] ;
>             }
>         }
>         else {
>             if ( nb_tab_store > 0 ) {
>                 print tab_store[0] ;
>             }
>         }
>
>         sort_old_key_1= sort_key_1
>         sort_old_key_2= sort_key_2
>         sort_old_key_3= sort_key_3
>         sort_old_key_4= sort_key_4
>         sort_old_key_5= sort_key_5
>         sort_old_key_6= sort_key_6
>         sort_old_key_7= sort_key_7
>         sort_old_key_8= sort_key_8
>         sort_old_key_9= sort_key_9
>         split("", tab_store);
>         nb_tab_store=0;
>     }
>     $1=$1
>     tab_store[nb_tab_store] = sort_sec_key_1 OFS $0
>     nb_tab_store += 1;
> }
>
> END {
>     for ( sort_tmp_line in tab_store ) {
>         print tab_store[sort_tmp_line] ;
>     }
> }
> function __sort_subsort(i1,v1,i2,v2)
> {
>     nb_subsort_tab1 = split(v1, subsort_tab1 );
>     nb_subsort_tab2 = split(v2, subsort_tab2 );
>
>     sort_sec_compare = ( ( subsort_tab1[1] < subsort_tab2[1] ) ? -1 : ( ( subsort_tab1[1] == subsort_tab2[1] ) ? 0 : 1 ) );
>
>     return(sort_sec_compare)
> }
Attachment: samplegnu.zip