Re: Help-gawk Digest, Vol 1, Issue 3


From:	J Naman
Subject:	Re: Help-gawk Digest, Vol 1, Issue 3
Date:	Mon, 19 Jul 2021 13:58:38 -0400
BTW: I benchmarked sprintf()+gsub() at 16.15 secs for 1,000 loops, which is
more than one HUNDRED times slower per loop than doubling & substr() (in a
loop!) at 12.36 secs for 100,000 loops (unless my benchmark code had a bug ...).
Someone mentioned a "bug" in gsub() ...
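
For reference, the per-loop ratio implied by those two timings can be worked out directly. This is only arithmetic on the figures quoted above, not a new measurement, and the helper name per_loop_ratio is made up for illustration:

```shell
# Per-loop cost of each approach, taken from the two timings quoted above,
# and the ratio between them.
per_loop_ratio() {
    awk 'BEGIN {
        gsub_per_loop   = 16.15 / 1000      # seconds per loop, sprintf()+gsub()
        double_per_loop = 12.36 / 100000    # seconds per loop, doubling + substr()
        printf "%.1f\n", gsub_per_loop / double_per_loop
    }'
}

per_loop_ratio
```

That works out to a factor of roughly 130, so "more than one hundred times" holds if the two benchmarks did comparable work per loop.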

On Mon, Jul 19, 2021 at 4:24 AM <help-gawk-request@gnu.org> wrote:

> Today's Topics:
>
>    1. Why string can be added with 0? (Peng Yu)
>    2. Re: Why string can be added with 0? (Neil R. Ormos)
>    3. Re: Why string can be added with 0? (Bob Proulx)
>    4. Re: How to Generate a Long String of the Same Character
>       (Bob Proulx)
>    5. Re: How to Generate a Long String of the Same Character
>       (Neil R. Ormos)
>    6. Re: Why string can be added with 0? (Wolfgang Laun)
>    7. Re: How to Generate a Long String of the Same Character
>       (Wolfgang Laun)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Sun, 18 Jul 2021 21:41:16 -0500
> From: Peng Yu <pengyu.ut@gmail.com>
> To: help-gawk@gnu.org
> Subject: Why string can be added with 0?
> Message-ID:
>         <CABrM6w=xSPGzqU=bExg8_ujO7ycDtuY8T6jKcGk4S=bAvdcUwA@mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> I see this. I don't find anything about it in 6.2.1 Arithmetic Operators.
>
> $ gawk '{ print typeof($1), $1 + 0 }' <<< a
> string 0
>
> But it seems that there should be an error to add a string to 0? Is it
> better to show some error instead of assuming a string as 0 in the
> context of arithmetic operations? Thanks.
>
> --
> Regards,
> Peng
>
>
>
> ------------------------------
>
> Message: 2
> Date: Sun, 18 Jul 2021 23:28:55 -0500 (CDT)
> From: "Neil R. Ormos" <ormos-gnulists17@ormos.org>
> To: Help Gawk List <help-gawk@gnu.org>
> Subject: Re: Why string can be added with 0?
> Message-ID: <Pine.GSO.4.64.2107182252020.27912@shell3.ripco.com>
> Content-Type: TEXT/PLAIN; charset=US-ASCII
>
> Peng Yu wrote:
>
> > I see this. I don't find anything about it in
> > 6.2.1 Arithmetic Operators.
>
> > $ gawk '{ print typeof($1), $1 + 0 }' <<< a
> > string 0
>
> > But it seems that there should be an error to
> > add a string to 0?
>
> It is not an error.
>
> | 6.1.4.1 How awk Converts Between Strings and Numbers
>
> | Strings are converted to numbers and numbers are
> | converted to strings, if the context of the awk
> | program demands it. For example, if the value of
> | either foo or bar in the expression 'foo + bar'
> | happens to be a string, it is converted to a
> | number before the addition is performed. [...]
>
> | [...] To force a string to be converted to a
> | number, add zero to that string. A string is
> | converted to a number by interpreting any
> | numeric prefix of the string as numerals: [...]
> | Strings that can't be interpreted as valid
> | numbers convert to zero.
>
> > Is it better to show some error instead of
> > assuming a string as 0 in the context of
> > arithmetic operations?
>
> No. Awk's behavior, when an arithmetic operation
> involving a string is attempted, of interpreting
> as numeric however much of the string appears to
> be numeric, without noisy error messages, is a
> feature that makes it easier to write concise
> programs that handle mixed string and numeric
> input.
>
> Besides, it has worked that way for eons, and
> programmers rely on it.  Changing it now would
> break zillions of working programs.
>
> If you prefer a programming language that does
> intrusive type checking or routinely changes out
> from under you so existing programs are rendered
> useless, there are plenty of choices.
>
>
>
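
The conversion rule quoted from the manual can be tried directly. The following is a minimal sketch using only string constants, so it should not depend on gawk-specific features; the helper name to_number_demo is made up for illustration:

```shell
# Adding 0 forces the string-to-number conversion described in the manual:
# the numeric prefix is used, and a string with no numeric prefix becomes 0.
to_number_demo() {
    awk 'BEGIN {
        print "2.5" + 0      # numeric string
        print "1e3" + 0      # exponent notation
        print "25fix" + 0    # only the numeric prefix counts
        print "a" + 0        # no numeric prefix at all
    }'
}

to_number_demo
```

The four lines printed should be 2.5, 1000, 25 and 0, which matches the "string 0" result in the original question.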
> ------------------------------
>
> Message: 3
> Date: Sun, 18 Jul 2021 22:35:06 -0600
> From: Bob Proulx <bob@proulx.com>
> To: help-gawk@gnu.org
> Subject: Re: Why string can be added with 0?
> Message-ID: <20210718221845320319290@bob.proulx.com>
> Content-Type: text/plain; charset=us-ascii
>
> Peng Yu wrote:
> > I see this. I don't find anything about it in 6.2.1 Arithmetic Operators.
>
> You were very close to the best section of the manual.  In the section
> before that one is where the gawk manual talks about strings and
> numbers.  In my manual it is section 6.1.4.1 "How 'awk' Converts
> Between Strings and Numbers". The answer you seek is there.
>
>
> https://www.gnu.org/software/gawk/manual/html_node/Strings-And-Numbers.html
>
>     6.1.4.1 How 'awk' Converts Between Strings and Numbers
>     ......................................................
>
>     Strings are converted to numbers and numbers are converted to strings,
>     if the context of the 'awk' program demands it.  For example, if the
>     value of either 'foo' or 'bar' in the expression 'foo + bar' happens to
>     be a string, it is converted to a number before the addition is
>     performed.  If numeric values appear in string concatenation, they are
>     converted to strings.  Consider the following:
>
>          two = 2; three = 3
>          print (two three) + 4
>
>     This prints the (numeric) value 27.  The numeric values of the variables
>     'two' and 'three' are converted to strings and concatenated together.
>     The resulting string is converted back to the number 23, to which 4 is
>     then added.
>
>        If, for some reason, you need to force a number to be converted to a
>     string, concatenate that number with the empty string, '""'.  To force a
>     string to be converted to a number, add zero to that string.  A string
>     is converted to a number by interpreting any numeric prefix of the
>     string as numerals: '"2.5"' converts to 2.5, '"1e3"' converts to 1,000,
>     and '"25fix"' has a numeric value of 25.  Strings that can't be
>     interpreted as valid numbers convert to zero.
>
> I abbreviated the information here.  See the manual for the full
> section with more detail than I included here.
>
> > $ gawk '{ print typeof($1), $1 + 0 }' <<< a
> > string 0
>
> That's correct.  It's a string but then adding 0 to the string forces
> it to be a number.
>
> > But it seems that there should be an error to add a string to 0? Is it
> > better to show some error instead of assuming a string as 0 in the
> > context of arithmetic operations? Thanks.
>
> AWK was one of the first of the little languages to try to dynamically
> do the right thing to simplify the programmer's task of writing a
> program.  Following in the tradition of AWK are Perl, Python,
> Ruby, and many other dynamic languages that all behave the same way.
> It's a design paradigm used to enhance programmer productivity.  If it
> is used like a string then it is converted to a string.  If it is used
> like a number then it is converted to a number.
>
> Bob
>
>
>
> ------------------------------
>
> Message: 4
> Date: Sun, 18 Jul 2021 22:59:53 -0600
> From: Bob Proulx <bob@proulx.com>
> To: help-gawk@gnu.org
> Subject: Re: How to Generate a Long String of the Same Character
> Message-ID: <20210718224110340130477@bob.proulx.com>
> Content-Type: text/plain; charset=us-ascii
>
> Neil R. Ormos wrote:
> > In a message on the bug-gawk list, Ed Mortin wrote:
> > That should have been "Ed Morton".
> > > On an online forum someone asked how to generate a
> > > string of 100,000,000 "x"s. They had tried this in
> > > a BEGIN section:
> > >
> > >    for(i=1;i<=100000000;i++) s = s "x"
> >...
> > Building a big string by iterating in tiny chunks
> > would seem to invite poor performance.
>
> Agreed.  Growing by one character at a time definitely seems
> inefficient.
>
> > Instead, why not append the string to itself,
> > doubling its size with each iteration?  For
> > example:
> >
> > time ~/.local/bin/gawk-5.1.0 \
> >   'BEGIN{sizelim=100000000; a="x"; while (length(a) < sizelim) {a=a a}; a=substr(a, 1, sizelim); print length(a);}'
>
> I think that is probably one of the best ways with awk.
>
> My mind first thought that it would be better to produce a file that
> contained 100 million "x"s and then read it into awk.
>
>     awk '{print length($0)}' < bigfileofx
>
> Of course that simply changes the problem around to creating that
> file!  This is rather a silly response but it's fun just the same.
>
> Well...  There are certainly many ways to do it.  I would use dd for
> creating the byte stream of the right size.  But there seems no way to
> use dd to produce "x" characters.  But it can read /dev/zero okay.
> And tr can translate zeros to other characters such as an "x".
>
>     $ dd status=none if=/dev/zero bs=1 count=10 | tr "\0" "x"; echo
>     xxxxxxxxxx
>
>     $ dd status=none if=/dev/zero bs=1 count=10 | tr "\0" "x" | wc -c
>     10
>
> That looks promising.  Let's fire it up for the requested 100 million
> size.
>
>     $ time dd status=none if=/dev/zero bs=1M count=100 | tr "\0" "x" | wc -c
>     104857600
>
>     real    0m0.179s
>     user    0m0.126s
>     sys     0m0.167s
>
> Looks like the right size.  Let's get it into awk.
>
>     $ time dd status=none if=/dev/zero bs=1M count=100 | tr "\0" "x" | awk '{print length($0)}'
>     104857600
>
>     real    0m0.624s
>     user    0m0.451s
>     sys     0m0.398s
>
> That's looking pretty good.  Let's compare it against the reference
> above so one can see how slow my machine is about such things.
>
>     $ time awk 'BEGIN{sizelim=100000000; a="x"; while (length(a) < sizelim) {a=a a}; a=substr(a, 1, sizelim); print length(a);}'
>     100000000
>
>     real    0m1.469s
>     user    0m0.815s
>     sys     0m0.654s
>
> I am running this on an older Intel Core i5 CPU 750 2.67GHz.
>
> > On my not-very-fast machine, according to the time
> > built-in, that takes 0.17 seconds of elapsed time.
>
> Faster than my daily driving desktop!  :-)
>
> > Yes, worst-case, if the intended string has length
> > (2^N)+1, you wastefully build a string of size
> > 2^(N+1) and trim off almost half.  So maybe on
> > some machines, building the string in
> > single-character units would work but the doubling
> > would not.
>
> Fun stuff!  And illustrates the usefulness of benchmarking to collect
> data.
>
> Bob
>
>
>
> ------------------------------
>
> Message: 5
> Date: Mon, 19 Jul 2021 01:46:40 -0500 (CDT)
> From: "Neil R. Ormos" <ormos-gnulists17@ormos.org>
> To: Help Gawk List <help-gawk@gnu.org>
> Subject: Re: How to Generate a Long String of the Same Character
> Message-ID: <Pine.GSO.4.64.2107190129410.3912@shell3.ripco.com>
> Content-Type: TEXT/PLAIN; charset=US-ASCII
>
> Bob Proulx wrote:
>
> > That's looking pretty good.  Let's compare it against the reference
> > above so one can see how slow my machine is about such things.
> >
> >     $ time awk 'BEGIN{sizelim=100000000; a="x"; while (length(a) < sizelim) {a=a a}; a=substr(a, 1, sizelim); print length(a);}'
> >     100000000
> >
> >     real    0m1.469s
> >     user    0m0.815s
> >     sys     0m0.654s
> >
> > I am running this on an older Intel Core i5 CPU 750 2.67GHz.
>
> That seems really odd.  It takes under 0.5 seconds
> of elapsed time on a machine with a 25-watt mobile
> Core 2 Duo CPU that maxes out at 2.26 GHz.
>
> I tried your dd | tr | gawk solution and found the
> times vary bizarrely on machines where the pure
> gawk solution has run-times roughly in-line with
> what I'd expect.  Even the elapsed times of
> consecutive individual runs of the dd | tr | gawk
> solution vary strangely.
>
> Also, I think the blocksize parameter should be
> bs=1MB to get blocks of 10^6 bytes and not 2^20
> bytes.
>
>
>
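
The size difference comes down to the block-size suffix: per the message above, GNU dd reads bs=1M as 2^20 bytes and bs=1MB as 10^6 bytes. One way to sidestep the suffix question is to give the block size in plain bytes. A minimal sketch follows; make_x is a hypothetical helper name, and 2>/dev/null stands in for status=none on the assumption that not every dd accepts that option:

```shell
# 100 blocks of 2^20 bytes versus 100 blocks of 10^6 bytes.
echo $((100 * 1024 * 1024))
echo $((100 * 1000 * 1000))

# Hypothetical helper: emit $1 million "x" characters by giving dd the
# block size in plain bytes, which does not depend on suffix rules.
make_x() {
    dd if=/dev/zero bs=1000000 count="$1" 2>/dev/null | tr '\0' x
}

make_x 100 | wc -c
```

The first echo reproduces the 104857600 seen in the earlier run, and the final pipeline should report the requested 100000000 bytes.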
> ------------------------------
>
> Message: 6
> Date: Mon, 19 Jul 2021 06:07:44 +0200
> From: Wolfgang Laun <wolfgang.laun@gmail.com>
> To: Peng Yu <pengyu.ut@gmail.com>
> Cc: help-gawk@gnu.org
> Subject: Re: Why string can be added with 0?
> Message-ID:
>         <CANaj1LfpCAA9k_KomStu0mB9O2AZ72yzJP7oja7sZrv12spEaQ@mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> See 6.1.4.1 How awk Converts Between Strings and Numbers.
> There are languages where the operator defines the kind of operation, and
> languages where the type of the argument decides what to do.
> If there can be doubts as to the correctness of the data, check.
>
> -W
>
>
>
> On Mon, 19 Jul 2021 at 05:36, Peng Yu <pengyu.ut@gmail.com> wrote:
>
> > I see this. I don't find anything about it in 6.2.1 Arithmetic Operators.
> >
> > $ gawk '{ print typeof($1), $1 + 0 }' <<< a
> > string 0
> >
> > But it seems that there should be an error to add a string to 0? Is it
> > better to show some error instead of assuming a string as 0 in the
> > context of arithmetic operations? Thanks.
> >
> > --
> > Regards,
> > Peng
> >
> >
>
> --
> Wolfgang Laun
>
>
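
The advice to check doubtful data can be sketched as a pattern test in front of the arithmetic. This is one possible check, not something taken from the thread: the helper name checked_sum and the regular expression (optional sign, decimal digits, optional exponent) are assumptions about what should count as a number.

```shell
# Hypothetical helper: sum the first field of each input line, but report
# fields that are not plain decimal numbers instead of counting them as 0.
checked_sum() {
    awk '
    # Optional sign, digits with an optional fraction, optional exponent.
    $1 ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/ {
        sum += $1
        next
    }
    # Everything else is reported rather than silently converted.
    { printf "line %d: not a number: %s\n", NR, $1 }
    END { print "sum", sum + 0 }
    '
}

printf '%s\n' 12 a 25fix 1e3 | checked_sum
```

With that input, "a" and "25fix" are reported and only 12 and 1e3 contribute to the sum, whereas plain addition would have counted 25fix as 25.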
> ------------------------------
>
> Message: 7
> Date: Mon, 19 Jul 2021 08:51:24 +0200
> From: Wolfgang Laun <wolfgang.laun@gmail.com>
> To: "Neil R. Ormos" <ormos-gnulists17@ormos.org>, help-gawk@gnu.org
> Subject: Re: How to Generate a Long String of the Same Character
> Message-ID:
>         <CANaj1Ldm_NOWCreZKSiAcehW0Z-kz6jkRYkMCkzAtrK_fbgV2Q@mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> Neil R. Ormos suggests the following code, which I put here as a function:
> function srep(n, s){
>     while( length(s) < n )
>         s = s s;
>     return substr( s, 1, n );
> }
> Neil points out that doubling in the while loop may overshoot the desired
> length by almost 100%, potentially causing the algorithm to fail. However,
> it is quite simple to avoid this:
> function srep(n, s){   # *dbl*
>     while( length(s)*2 <= n )
>         s = s s;
>     return s substr( s, 1, n - length(s) );
> }
>
> I have tried to keep track of all the solutions to the simple original
> problem, extending the functionality to string repetition (because this
> makes it more useful), and done some performance testing.
>
> The original question was whether this:
> function srep(n, s,  res, i){  # *rpt*; i is declared local so it does not leak
>      for( i = 1; i <= n; ++i )
>          res = res s
>      return res;
> }
> could be improved. This was proposed as an improvement over the *rpt*
> version:
> function srep(n, s,  res){    # *sub*
>      res = sprintf("%*s", n, "");
>      gsub( / /, s, res );
>      return res;
> }
> and I contributed:
> function srep(n, s,   h){  # *rec*
>      if( n == 0 ) return "";
>      h = srep( int(n/2), s )
>      return n % 2 == 1 ? h h s : h h;
> }
>
> I have used this code together with /usr/bin/time:
> BEGIN {
>     for( j = 1; j <= 300000; ++j ){
>         srep( j%1000, "a" );
>         srep( j%1000, "abcde" );
>     }
> }
> The results for the four versions:
>    *rec*   0m1.436s
>    *dbl*   0m2.322s
>    *rpt*   0m13.543s
>    *sub*   0m27.290s
>
> Note 1: It should be noted that version *sub* has a defect: using "&" or
> some combination with "\" is not handled correctly. I have read section
> 9.1.3.1, *More about ‘\’ and ‘&’ with sub(), gsub(), and gensub()*, of the
> GUM and, although it didn't cause me a headache, it made me gawk. I did not
> try to cook *sub*.
>
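
The "&" defect in *sub* is easy to reproduce. The sketch below is a stand-in for that version, not the posted code: the helper name sub_demo is made up, and the width is spliced into the format string instead of using "%*s", an assumption made for awks that may lack it. It covers only a lone "&"; combinations with "\" are left alone.

```shell
# Stand-in for the *sub* approach: make n spaces, then replace each
# space with the string to repeat.
sub_demo() {
    awk -v n=3 'BEGIN {
        res = sprintf("%" n "s", "")
        gsub(/ /, "&", res)       # "&" means "the matched text", so spaces stay
        print "[" res "]"

        res = sprintf("%" n "s", "")
        gsub(/ /, "\\&", res)     # "\\&" in source yields a literal "&"
        print "[" res "]"
    }'
}

sub_demo
```

The first line printed should still contain three spaces, and the second should contain three ampersands, so a caller of *sub* would have to escape the replacement string before passing it in.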
> Note 2: I have provoked the aforementioned failure in *dbl*, resulting in
> the somewhat laconic error message:
>     $ gawk -f srepDoubl.awk
>     Killed
> See the bug list for my comment on this message.
>
> Cheers
> Wolfgang
>
>
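
One detail worth noting when comparing the versions: as posted, *dbl* treats n as the length of the result, while *rpt*, *sub* and *rec* treat n as a repeat count. The two readings coincide for a one-character string but not for "abcde", and *dbl* returns s unchanged when n is smaller than length(s), including n == 0. The following is a hedged sketch of a count-based doubling variant. It is not Wolfgang's code, and the names srep_count_demo, res and len are assumptions introduced here.

```shell
srep_count_demo() {
    awk '
    # Hypothetical count-based variant of the doubling idea:
    # returns s repeated n times, or "" when n <= 0 or s is empty.
    function srep(n, s,    res, len) {
        if (n <= 0 || s == "") return ""
        len = n * length(s)              # target length in characters
        res = s
        while (length(res) * 2 <= len)   # double without overshooting
            res = res res
        # res holds whole copies of s, so this prefix is whole copies too
        return res substr(res, 1, len - length(res))
    }
    BEGIN {
        print srep(3, "ab")
        print length(srep(1000, "abcde"))
        print "[" srep(0, "a") "]"
    }'
}

srep_count_demo
```

This should print ababab, then 5000, then an empty pair of brackets. Because the loop stops before overshooting, the largest intermediate string is never longer than the final result.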