bug-gawk
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large


From: Jason C. Kwan
Subject: Slowness in bignum mode ( gmp | gawk -M ) when doubling extremely large inputs
Date: Wed, 26 Jan 2022 22:35:22 +0000 (UTC)

Hi GAWK team,
Not sure if I should be reporting this to the gawk team or the GnuMP team - 
it's neither a bug or nor new feature per se, but merely a performance one - 
I've noticed on that for extremely large inputs, say, integers with more than 5 
million digits, the bignum gawk -M mode could be somewhat slow, with possible 
room for speed improvement (proof-of-concept attached below). my gawk version 
information : 
GNU Awk 5.1.1, API: 3.1 (GNU MPFR 4.1.0, GNU MP 6.2.1)
Darwin MacBook-Pro.local 21.2.0 Darwin Kernel Version 21.2.0: Sun Nov 28 
20:28:41 PST 2021; root:xnu-8019.61.5~1/RELEASE_ARM64_T6000 arm64
On my test-case of 71.5 million digits, it's possible to to save 63.7% of time 
, even in regular gawk, when compared to gawk -M  : 
==================
gawk -be '{ print length($0) }' test.txt71591842
test command ::
echo; time ( pv -q < test.txt | LC_ALL=C gawkmx -b -e '{ print _2x_($0) }' ) | 
xxh128sum ; echo; time (pv -q < test.txt | LC_ALL=C gawk -M -b -e '{ print 
$0+$0 }' ) | xxh128sum ; echo
a56c2d2302d9ea8751f810b848e6f354  stdin( pv -q < test.txt | LC_ALL=C LC_ALL=C 
gawk -e "${mfx}" -b) 8.16s user 0.60s system 99% cpu 8.789 totalxxh128sum  
0.01s user 0.00s system 0% cpu 8.789 total
a56c2d2302d9ea8751f810b848e6f354  stdin( pv -q < test.txt | LC_ALL=C gawk -M -b 
-e ; )  23.26s user 0.96s system 99% cpu 24.240 totalxxh128sum  0.01s user 
0.00s system 0% cpu 24.239 total
=====================
In another test case of slightly over 275 milion digits, the time savings are 
71.7% : 
fc2231bdff375b7870586d8dffc0841c  stdin( pv -q < jwengowengonoewgnwoegn.txt | 
LC_ALL=C gawk -b -e "${mfx}" -e ; )  27.25s user 6.48s system 98% cpu 34.350 
totalxxh128sum  0.04s user 0.02s system 0% cpu 34.349 total
fc2231bdff375b7870586d8dffc0841c  stdin( pv -q < jwengowengonoewgnwoegn.txt | 
LC_ALL=C gawk -M -b -e ; )  116.58s user 4.78s system 99% cpu 2:01.42 
totalxxh128sum  0.04s user 0.02s system 0% cpu 2:01.42 total
====================
Attached below is the full proof-of-concept code for function _2x_( ) to 
demostrate that the time savings are very much concrete and possible, not 
simply theoretical. The test file, being 70MB+, is a big large for email, but 
basically any file using ASCII digits 0-9 to represent any integer over 5 
million digits will do. The speed difference isn't noticeable for smaller 
inputs, and for inputs fewer than 7 digits, most likely it would be slower than 
gawk -M. 
I tried maximizing portability of the proof-of-concept function by refraining 
from any gawk-specific extensions of awk - this same code has also been tested 
in mawk 1.3.4, mawk 1.9.9.6, and macos 12.1 awk/nawk. 
It's entirely self-contained, performs no recursion, has no external 
dependencies, no bit-wise ops, and doesn't include any advanced/fancy math - 
just straight up grade-school long-form addition, doubling them 15-digits per 
chunk, and 2 chunks per while loop. The carrying part is performed by gsub( ) 
prior to the while( ) loop, and thus, eliminating the need to track them along 
the way. Performance scales linearly at the log-base-10 level.
( I didn't include any copyright notice or credits to a priori since I don't 
think grade school addition is something copyrightable) 
Obviously this is simply awk scripting code and can't be directly incorporated 
into the C code base - I'm merely raising awareness of the issue.
Thanks for your time.
Jason
===================================================================================
( I can reformat the code for readability if you prefer )  : 
function _2x_(__,_,_______,____,_________,      
___________,__________,____________,              ________,_____,______,___) {  
  if(__!~/[1-9]/) {        return +_ }    ___=(__~/^[-]/)    
sub("^[+-]?["(-"")"]*","",__)    if (length(__)<(_____=((_+=++_)^_^_-!!_))+_) { 
       if (_________^((_____+(_~____))^(_____-_)<+__)) {            return 
(-!-"")^___*(+__+__)    } }    ___=substr("-",_~"",___)    if (__!~"[5-9]") {   
     gsub(/4/,"8",__)-gsub(/3/,"6",__)        gsub(/2/,"4",__)-gsub(/1/,"2",__) 
       return (___)__    }; _______=____=________="";    
__=(_____=substr(________=(_++^_--+_)^_____,_))(__)    gsub(/./,".",_____)      
   sub("("(_____)")+$","_&",__)    __=substr(__,index(__,"_")+!+"")    
________+=+________;_="";    if 
((gsub(_____,"&_",__)+gsub(/[_][5-9]/,".5&",__)*(+""))%2) {        
_=(":")(________+__+__)        __=substr(__,index(__,"_")+!+"")    }    
for(_____ in ______) { }    if (______[_____=split(__,______,/[_]/)]=="") {     
   delete ______[_____--] };__=____~____;    __________=____________*=\    
____________=___________=_________=32;    while(__<_____) { if(!--__________){  
      __________=____________;_______=(_______)_;_="";        
if(!--___________){        ___________=_________;____=(____)_______;_______=""; 
       } }        _=(_)(":")(+______[__++]*2+________)\             
(":")(+______[__++]*2+________) }    ____=(____)(_______)(_)\         
(__==_____?(":")(________+2*______[_____]):"")    gsub(/[:]+[^:]/,"",____)      
    sub(/^0*/,"",____); return (___)____ }

reply via email to

[Prev in Thread] Current Thread [Next in Thread]