[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [bug-gawk] Is the speed of array index checking same as string compa
From: |
Peng Yu |
Subject: |
Re: [bug-gawk] Is the speed of array index checking same as string comparison? |
Date: |
Mon, 26 Nov 2018 17:03:50 -0600 |
On Mon, Nov 26, 2018 at 4:17 PM Peng Yu <address@hidden> wrote:
>
> > Awk arrays are hash tables. As they fill up, access will slow down.
> > That said, gawk goes to a lot of trouble to try to keep access fast,
> > with a small number of collisions per bucket on average. I suspect you'd
> > have to REALLY load an array before you'd notice a big difference
> > (millions of elements on modern machines).
>
> Here is my test case. It seems that only when the hash size reaches a
> billion, it can cause more than 80% slow-down. Before that, the
> slowdown is only 10%~20% which is quite acceptable in practice. Isn't
> it?
>
> ns=(
> 10000
> 100000
> 1000000
> 10000000
> 100000000
> 1000000000
> )
> m=100000000
> for n in "address@hidden"
> do
> time1=$(( time awk -v n="$n" 'BEGIN { for(i=1;i<=n;++i) d[i]=""}' ) 2>&1)
> time2=$(( time awk -v n="$n" -v m="$m" 'BEGIN { for(i=1;i<=n;++i)
> d[i]=""; d["NA"]=1; for(i=1;i<=m;++i) ("NA" in d) }') 2>&1)
> time3=$(( time awk -v n="$n" -v m="$m" 'BEGIN { for(i=1;i<=n;++i)
> d[i]=""; x="NA"; for(i=1;i<=m;++i) x=="NA" }' ) 2>&1)
> printf '%s\t%s\t%s\t%s\t%s\n' "$n" "$time1" "$time2" "$time3"
> "$(bc -l <<< "($time2-$time1)/($time3-$time1)")"
> done
> 10000 0.016 7.430 6.313 1.17738605685246942988
> 100000 0.045 7.404 6.354 1.16642891107941036614
> 1000000 0.220 7.581 6.471 1.17757158854583266677
> 10000000 1.938 9.298 8.170 1.18100128369704749679
> 100000000 19.468 26.804 25.726 1.17225950782997762863
> 1000000000 200.543 205.224 203.093 1.83568627450980392156
The above test case has a bug --- the baseline was not correct.
Here is the correct one. It seems that there is only a constant factor
(~40%) slow-down no matter how large the hash size is. Is it?
ns=(
10
100
1000
10000
100000
1000000
10000000
100000000
1000000000
)
m=100000000
for n in "address@hidden"
do
time1=$(( time awk -v n="$n" -v m="$m" 'BEGIN {
for(i=1;i<=n;++i) d[i]=""; for(i=1;i<=m;++i) {} }' ) 2>&1)
time2=$(( time awk -v n="$n" -v m="$m" 'BEGIN {
for(i=1;i<=n;++i) d[i]=""; d["NA"]=1; for(i=1;i<=m;++i) ("NA" in d) }'
) 2>&1)
time3=$(( time awk -v n="$n" -v m="$m" 'BEGIN {
for(i=1;i<=n;++i) d[i]=""; x="NA"; for(i=1;i<=m;++i) x=="NA" }' )
2>&1)
printf '%s\t%s\t%s\t%s\t%s\n' "$n" "$time1" "$time2" "$time3"
"$(bc -l <<< "($time2-$time1)/($time3-$time1)")"
done
10 3.695 7.461 6.350 1.41845574387947269303
100 3.630 7.420 6.297 1.42107236595425571803
1000 3.694 7.431 6.345 1.40965673330818559034
10000 3.686 7.424 6.344 1.40632054176072234762
100000 3.693 7.440 6.347 1.41183119819140919366
1000000 3.776 7.576 6.496 1.39705882352941176470
10000000 5.692 9.302 8.230 1.42237982663514578408
100000000 23.647 26.845 25.913 1.41129744042365401588
1000000000 201.134 204.302 203.471 1.35558408215661103979
--
Regards,
Peng