bug-gawk
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [bug-gawk] Is the speed of array index checking same as string compa


From: Peng Yu
Subject: Re: [bug-gawk] Is the speed of array index checking same as string comparison?
Date: Mon, 26 Nov 2018 17:03:50 -0600

On Mon, Nov 26, 2018 at 4:17 PM Peng Yu <address@hidden> wrote:
>
> > Awk arrays are hash tables.  As they fill up, access will slow down.
> > That said, gawk goes to a lot of trouble to try to keep access fast,
> > with a small number of collisions per bucket on average.  I suspect you'd
> > have to REALLY load an array before you'd notice a big difference
> > (millions of elements on modern machines).
>
> Here is my test case. It seems that only when the hash size reaches a
> billion, it can cause more than 80% slow-down. Before that, the
> slowdown is only 10%~20% which is quite acceptable in practice. Isn't
> it?
>
> ns=(
>     10000
>     100000
>     1000000
>     10000000
>     100000000
>     1000000000
> )
> m=100000000
> for n in "address@hidden"
> do
>     time1=$(( time awk -v n="$n" 'BEGIN { for(i=1;i<=n;++i) d[i]=""}' ) 2>&1)
>     time2=$(( time awk -v n="$n" -v m="$m" 'BEGIN { for(i=1;i<=n;++i)
> d[i]=""; d["NA"]=1; for(i=1;i<=m;++i) ("NA" in d) }') 2>&1)
>     time3=$(( time awk -v n="$n" -v m="$m" 'BEGIN { for(i=1;i<=n;++i)
> d[i]=""; x="NA"; for(i=1;i<=m;++i) x=="NA" }' ) 2>&1)
>     printf '%s\t%s\t%s\t%s\t%s\n' "$n" "$time1" "$time2" "$time3"
> "$(bc -l <<< "($time2-$time1)/($time3-$time1)")"
> done
> 10000    0.016    7.430    6.313    1.17738605685246942988
> 100000    0.045    7.404    6.354    1.16642891107941036614
> 1000000    0.220    7.581    6.471    1.17757158854583266677
> 10000000    1.938    9.298    8.170    1.18100128369704749679
> 100000000    19.468    26.804    25.726    1.17225950782997762863
> 1000000000    200.543    205.224    203.093    1.83568627450980392156

The above test case has a bug --- the baseline was not correct.

Here is the correct one. It seems that there is only a constant factor
(~40%) slow-down no matter how large the hash size is. Is it?

ns=(
        10
        100
        1000
        10000
        100000
        1000000
        10000000
        100000000
        1000000000
)
m=100000000
for n in "address@hidden"
do
        time1=$(( time awk -v n="$n" -v m="$m" 'BEGIN {
for(i=1;i<=n;++i) d[i]=""; for(i=1;i<=m;++i) {} }' ) 2>&1)
        time2=$(( time awk -v n="$n" -v m="$m" 'BEGIN {
for(i=1;i<=n;++i) d[i]=""; d["NA"]=1; for(i=1;i<=m;++i) ("NA" in d) }'
) 2>&1)
        time3=$(( time awk -v n="$n" -v m="$m" 'BEGIN {
for(i=1;i<=n;++i) d[i]=""; x="NA"; for(i=1;i<=m;++i) x=="NA" }' )
2>&1)
        printf '%s\t%s\t%s\t%s\t%s\n' "$n" "$time1" "$time2" "$time3"
"$(bc -l <<< "($time2-$time1)/($time3-$time1)")"
done
10      3.695   7.461   6.350   1.41845574387947269303
100     3.630   7.420   6.297   1.42107236595425571803
1000    3.694   7.431   6.345   1.40965673330818559034
10000   3.686   7.424   6.344   1.40632054176072234762
100000  3.693   7.440   6.347   1.41183119819140919366
1000000 3.776   7.576   6.496   1.39705882352941176470
10000000        5.692   9.302   8.230   1.42237982663514578408
100000000       23.647  26.845  25.913  1.41129744042365401588
1000000000      201.134 204.302 203.471 1.35558408215661103979

-- 
Regards,
Peng



reply via email to

[Prev in Thread] Current Thread [Next in Thread]