bug-gawk

Re: [bug-gawk] gawk 4.x series mmap attempts to alocates 32GB of memory


From: Eli Zaretskii
Subject: Re: [bug-gawk] gawk 4.x series mmap attempts to alocates 32GB of memory ...
Date: Sat, 12 Jul 2014 11:00:27 +0300

> Date: Sat, 12 Jul 2014 16:26:31 +0900
> From: green fox <address@hidden>
> Cc: address@hidden
> 
> > Thus, because UTF-nn encodings of Unicode do not permit all possible
> > byte combinations, it is quite easy to have filenames that must be
> > handled by software, and yet whose names are not describable as valid
> > Unicode character strings.
> Yes, and we work around that by treating everything but 0x2f || 0x00 as blobs.
> We had to do that, and it worked nicely.
> Until some random American came up with the idea that the worlds language can
> fit into 16 bit. haha....then 32bit,...hahaha... then
> trim-byte-5,byte6 off utf8...hahahahahah....
> and CJK...and Arabic bidi.... hahahahahahahahah.
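
The quoted point about filenames that are legal on disk but not valid Unicode can be illustrated with a small sketch. It is written in Python rather than Awk, and the helper names `is_valid_utf8` and `is_legal_posix_component` are hypothetical, chosen only for this illustration. It assumes the POSIX rule mentioned in the quote, namely that only 0x2f ('/') and 0x00 are forbidden inside a filename component.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def is_legal_posix_component(name: bytes) -> bool:
    """Assumption: a filename component is any non-empty byte string
    that contains neither '/' (0x2f) nor NUL (0x00)."""
    return len(name) > 0 and b"/" not in name and b"\x00" not in name


if __name__ == "__main__":
    samples = [
        b"report.txt",             # plain ASCII
        "na\u00efve.txt".encode("utf-8"),  # well-formed UTF-8
        b"report-\xff.txt",        # 0xff never appears in UTF-8
        b"\xf8\x88\x80\x80\x80",   # 5-byte form, no longer permitted
    ]
    for raw in samples:
        print(raw, is_legal_posix_component(raw), is_valid_utf8(raw))
```

The last two samples are acceptable as filename blobs yet fail strict UTF-8 decoding, which is the situation the quoted paragraph describes.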

What does bidi have to do with Gawk?  Bidi is a display-time feature,
it has nothing to do with batch-style text processing that Gawk is
doing.

Are you just trying to lump together unrelated issues, perhaps to
impress the uninitiated?  If so, please don't: some of the people here
do know what all this is about.  Let's stay focused on the target,
whatever it is.

> > The awk, mawk, nawk, and oawk implementations treat files as character
> > streams, where NUL (0x00) is a string terminator.  By contrast, gawk's
> > view of files is that they are simply byte streams, and no byte value
> > has any more significance than any other byte value: 0x00 is just a
> > normal data byte.  Thus, with care, gawk can be used to read and write
> > arbitrary files.  From that point of view, the less it knows about
> > `characters', the better.
> Agreed.
> The sad part is, we no longer have the capability in gawk to return to
> 'byte stream'.
> Once you're in 'character stream', no way out of it...no such function as
>  sir_gawk_I_want_to_be_fed_byte_stream_data_please()
> 
> I ask for power to have the _do_what_I_ask_ option, so I can write
> 0x80-0xff range.
> It would be nicer if I had always_byte_based_length() and
> always_byte_based_substr()
> then the rest of the routine can be written on top of them...
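
To make the request above concrete, here is a minimal sketch of what byte-based length and substr operations would compute. It is written in Python rather than Awk. The names `byte_length` and `byte_substr` are hypothetical stand-ins for the `always_byte_based_length()` and `always_byte_based_substr()` mentioned in the quote, not existing Gawk functions. The sketch assumes Awk's usual `substr` convention of 1-based positions, clipped to the start of the string.

```python
from typing import Optional


def byte_length(data: bytes) -> int:
    """Length in bytes, regardless of any character encoding."""
    return len(data)


def byte_substr(data: bytes, start: int, length: Optional[int] = None) -> bytes:
    """Awk-style substr over bytes: 1-based start, length counted in bytes."""
    first = max(start, 1)            # clip positions before the string
    if length is None:
        return data[first - 1:]
    last = start + length - 1        # last 1-based position requested
    return data[first - 1:max(last, first - 1)]


if __name__ == "__main__":
    text = "na\u00efve"              # 5 characters
    raw = text.encode("utf-8")       # 6 bytes: the i-diaeresis takes two
    print(len(text), byte_length(raw))
    print(byte_substr(raw, 3, 2))    # the two bytes of the i-diaeresis
    print(byte_length(b"a\x00b"))    # NUL is counted as ordinary data
    print(bytes(range(0x80, 0x84)))  # bytes in 0x80-0xff are just values
```

With a character-based view, position 3 of this string is a single character; with the byte-based view it is the first of two bytes, which is the distinction the quoted request turns on.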

I'm sorry for being somewhat blunt, but do you actually have some
actionable proposals for these issues?  If so, please show them, and
preferably do that in English, without a bunch of obscure Awk scripts
that make it much harder to understand what you mean and want.

Please describe in clear terms:

  . what problem(s), exactly, do you want to solve?

  . what solution(s) do you propose for that?

  . why do you think that having those solutions as loadable
    extensions (which are always distributed and installed with Gawk)
    is not TRT?

Thank you.


