
Re: [bug-gawk] gawk 4.x series mmap attempts to allocate 32GB of memory


From: green fox
Subject: Re: [bug-gawk] gawk 4.x series mmap attempts to allocate 32GB of memory ...
Date: Sun, 13 Jul 2014 10:08:58 +0900

Hi Eli, thanks for the reply.
This is getting somewhat off topic from gawk, so for those who care,
the most important point is at the bottom.
To summarize, it is:
-IPA updates; separating rendering from formatting before passing text on is
 (a) possible for ASCII, (b) not that nice for multibyte characters,
especially with IVS.
-How do I specify my name correctly?
-gawkapi is nice; being able to do more from the script is, in my
opinion, nicer.
That is all.

And Eli, you and I clearly view this differently.
I am partly at fault here, for trying to explain everything.
But challenging everything does not help either. I had to dig up old
memories, find links, etc., to support my claims. For a university
discussion class, that would be fine.

On the gawk mailing list, it would be nicer if we could just talk about
technical aspects. Like why printf("%c",0xffffff00+0x80) gives 0x80 as
output, and why I would want to do that; whether there is a chance of
getting some function so that foo(0x80) outputs the byte 0x80; and how
gawkapi may or may not help.
Let me know about the sample gawk extension attached at the end.
I believe the code is lacking some things; feedback is welcome.
Is moving in that direction OK for you?
It is OK, and desirable, for me.
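To make the first point concrete, here is the hack as it stands today.
The behavior is what I observe with gawk 4.1 in a UTF-8 locale, and
foo() is only a placeholder name for a cleaner interface:

  # Current workaround: adding the desired byte to 0xffffff00 makes
  # "%c" emit the raw low byte instead of the UTF-8 encoding of U+0080.
  # Observed with gawk 4.1 in a UTF-8 locale; not guaranteed elsewhere.
  BEGIN {
      printf("%c", 0xffffff00 + 0x80)   # emits the single byte 0x80
  }
  # What I would like instead (foo() is hypothetical):
  #   printf("%s", foo(0x80))           # emit exactly the byte 0x80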

On 7/12/14, Eli Zaretskii <address@hidden> wrote:
>> Date: Sat, 12 Jul 2014 19:47:12 +0900
>> From: green fox <address@hidden>
>
> Why private mail?  I've CC'ed the list again.
Probably my fault; I used the "Reply To:" button.

>> I use a language that needs multiple byte to represent correctly.
>
> Which language is that?  And which multibyte representation do you
> want to support in Gawk for that language?  Please be specific; these
> are very technical issues that cannot be discussed on such a high
> level, without any details.
Agreed.
Japanese and Chinese are what I use daily: JIS X 0213 and extended Big5.
And as you know, these two have lots of extensions.
I know that in theory a round-trip conversion through Unicode code
points is possible. In reality, I need to check for certain characters,
handle exceptions, and be really careful.
And these are all supposed to be included in ISO/IEC 10646
(in a perfect world).
We agreed to use IVS; however, the process is not complete yet. It is ongoing.
Please check the Unicode IVD section, and the update listings from IPA
and the ITSCJ committee, for detail on what is going on. I know you
know, but still.
To summarize: some letters are still not included.

To explain in detail: in many places, and in lots of text, we use
vendor- or data-specific extensions.
Yes, I know the theory of why that is not good.
And we are in the process of converting much of that data to Unicode.

But I must ask you a question first.
_IF_ the code page (or should I say, the code points and glyphs)
that is used daily in my country by lots of people is not registered
as a well-known _international_ standard, do you object even to helping
convert it to Unicode?

In my country, it is a good working standard.
We also have a lookup table for converting to Unicode now
(the table is not complete, but most characters can be converted, and
the rest are left for later). And I need ways to do that conversion.

>> The language is _very_ sensitive in how it must be presented.
>> Location,order, width, height, place of dot, where to split line, is
>> all very important.
>
> These are all very important in many languages, I agree.  But they are
> not, and should not be, Gawk problems.  Gawk is a text processing
> tool, so it processes text in the logical order, disregarding any
> display-time features.  If you are looking for solutions for
> display-time problems, Gawk is not the right place to look.

Hmm... sorry, I cannot understand the humor you put there.

For those who do not understand, I will explain.
Let's say I am crafting an HTML page from some text.
I use regexp patterns to strip out what I do not need.
Then I print the necessary tags, the text, then more tags, and it is all good.
This only works for single-byte characters, and a limited set of
multibyte ones.

The reason is that for some characters, if I do not supply the correct
'hint', the text becomes useless. In the case of HTML, I must pass the
language used as separate tags. Without such a hint, the same
character gets rendered as a different language.
This is all in UTF-8, by the way. I must also add the correct IVS value
for some characters when processing.

Therefore, being able to 'hint' correctly is crucial.
Think of it as leaving out the umlaut in German. Not nice.

I think building HTML / TeX / PostScript from some text, using gawk, is
a valid use case. Maybe not for you, Eli, so we may differ here.
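As a sketch of what I mean (the lang value is illustrative; the point is
that the same code point needs an explicit hint):

  # Sketch: wrap extracted text with a language hint so that a CJK
  # unified ideograph renders with the intended (Japanese vs. Chinese)
  # glyph. Without lang=, a renderer is free to pick the wrong style.
  {
      gsub(/<[^>]*>/, "")                        # strip unwanted tags
      printf("<span lang=\"ja\">%s</span>\n", $0)
  }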

Why make life so hard for other languages?
Processing text 'in the logical order', as you say, is not automated
for us. UCA and ISO 14651 are two very different beasts.
Language-specific use cases and sorting issues are very difficult to
implement correctly (if at all; we have many cases where the standard is
sort of there and there is no implementation...).

Sometimes, I know the exact 'hint' that I must supply.
Being able to spit out exact bytes really helps here.

>
>> And, my country has cities and places, and name of people who are alive,
>> where the letter is not included in Unicode.
>> The situation was very very bad. My country made there own code page to
>> solve the problem.
>
> What codepage is that? does it have a name or a number?  Is it
> supported by any popular system out there, and if so, which systems
> support it?

The 在留カード用文字コード (residence-card character code), the
戸籍統一文字 (unified koseki characters), and leftover gaiji for various
separate systems, to name a few. Fujitsu, NEC, 日本加除出版株式会社
(Nihon Kajo Publishing), Hitachi, and a few others sell such systems
that you can buy.
You can also try some out on the net:
http://kosekimoji.moj.go.jp/kosekimojidb/mjko/PeopleSearch/EXECUTE
Sample query:
http://kosekimoji.moj.go.jp/kosekimojidb/mjko/PeopleList/EXECUTE?
ihid_SelectedKskMjBng=552710&itxt_Code=552710&
ihid_clickedButtonName=iimag_Moji&ihid_SearchCount=1&
irdo_Code=1&itxt_Yomi1=&itxt_Yomi2=&itxt_Yomi3=&
islc_Kakusu1=&islc_Jis=&ihid_StartPosition=1

And as you know, IPA has the font and a lookup table between such code
pages and Unicode.

For _my_ daily use cases, I have files with names of people,
and scans of old documents. They include letters that are not available
in Unicode (at the moment), so we use some extensions so that they can
be resolved later.

Say I had the character U+7950, used as a person's name.
If this person's proper letter is U+7950,U+E0104, then
I would not want it displayed as U+7950,U+E0101, or as a bare U+7950.
Maybe it is a very minor difference, but it is a very large difference
for us. If your name were Eli, but it was displayed as Fli and you were
told that F and E are close enough, no need to be precise, would you
agree? Probably not.
We have agreed to use IVS, so we want a good way to specify it.
Simple as that.
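In a UTF-8 locale, something like this is the best I can do today
(assuming "%c" encodes a numeric code point as UTF-8, which holds where
wchar_t is 32 bits wide):

  # Sketch: base ideograph U+7950 followed by VARIATION SELECTOR-21
  # (U+E0104, UTF-8 bytes f3 a0 84 84), selecting the intended glyph.
  BEGIN {
      s = sprintf("%c%c", 0x7950, 0xE0104)
      printf("%s\n", s)
  }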

>> For America/Europe, Multibyte, CJK, bidi, is all Render issue.
>> For me and my country, it is all REAL day-to-day problems on handling
>> text.
>>
>> bidi is very important too. It is not only Arabic. Old and new
>> Chinese, Japanese,
>> we write in right-to-left, and in some cases, up-to-down as well.
>
> Displaying glyphs from right to left is not bidi, it's just a
> different layout.  Bidi is about _bidirectional_ display, where the
> direction changes from L2R to R2L and back within the same paragraph
> of text.
Without thinking about (calculating) how the text gets rendered, it is
really impossible to perform kinsoku processing, repeated-letter
substitution, and the like. There are lots of operations, uncommon in
English, that must be performed in other languages.
There is more to this than just 'rendering', as you say.
And the render layer is useless for it
(the render layer already has enough to do).
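A toy illustration of just one such operation (real kinsoku rules are
far richer; this only shows that the text layer, not the renderer, must
inspect the characters):

  # Flag lines that violate a basic kinsoku rule: a line must not
  # begin with a closing mark such as 、 。 」 ）.
  BEGIN { prohibited = "、。」』）" }
  {
      first = substr($0, 1, 1)       # per character in a UTF-8 locale
      if (first != "" && index(prohibited, first) > 0)
          print "line " NR " breaks kinsoku: starts with " first
  }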

You do not have to take my word for it.
But at least try to build a system for our language. Then you will know.
If you already have such a system that works nicely for our language,
tell me, I want to evaluate it. I am not being sarcastic or irrational.
If such a nicely working system exists, I really want to use it now.

In an ideal world, the render layer and the string-manipulation layer
are separate. I wish I could just handle multibyte characters like
ASCII. But in reality I cannot. I _can_, but without heavy lifting the
outcome is terrible, in both the character order
_and_ the rendering
(some things must be handled before passing text upstream to the render
layer).
> And it is still about display, so it's unrelated to Gawk.
>
>> Where to split, or insert
>> data? We check type of character, calculate length, or lookup for next
>> character with
>> matching type.
>
> I understand all that (and knew about it before), but what does this
> have to do with Gawk?  Gawk doesn't split words, or calculate their
> width or position on display, or consider any other display layout
> issues.  Gawk _produces_ text, which some other piece of software (a
> text terminal or emulator, a GUI rendering program, etc.) should then
> present to the user.  It is that other software's job to select the
> correct font and glyphs, reorder the text for display, be it bidi or
> otherwise, and display it so that the result is legible for users who
> speak that language.

Preparing text for the next layer is gawk's job, as I see it.
Being able to manipulate byte streams / character streams is,
_as you say_, gawk's job. Yet your conclusion seems to contradict that.

When gawk is in character-stream mode, we have no good / sane way to
output a specific byte at the moment (v4.1.60). We used to have such
capability in the past (v3.2 plus patches):
jlength() vs. length(), and such.
I am just asking for the same kind of capability in the UTF-8 era.
That is, the same kind, but one that works with UTF-8.

Without the capability to output the proper sequence of bytes /
character streams to files and pipes, the next program (or the render
engine) cannot handle our language correctly.

>> Please, take back your statement for the language you do not know about.
>
> I'm not taking back anything.  I knew everything you told about
> already, there's nothing new here for me, at least not on the level on
> which you presented them.

My apologies if I appeared to offend you in some way. But if you knew
the problem, it would be nice to share the solution with us, and with
the rest of the world, if you do not mind.
Input on the issue is always welcome.

I attempted to opt in to a workaround, because there is a daily problem.
If an alternative, better solution (other than saying "stop using gawk")
is provided, I would be happy to consider using it, if it matches our
language's use case.

> You are not the only one who understands these issues, or have ever
> worked on them.
If it appeared that I presented myself like that, then I am very
disappointed to hear it. So let me be clear.
I am a single individual. For many years, I have used various libraries
that others have built to process text. I am an ordinary user of such
routines. I stand on the shoulders of giants, and respect those
who figured out working solutions to difficult problems.
I am not, and never intend to claim or act as, the first or a front
runner in this field, as I am not.
The only thing I have done in the past 20 years is learn how
the system works, occasionally report glitches, and, very rarely,
write patches. Most of what I have done is small fixes and reports.

>> >   . what problem(s), exactly, do you want to solve?
>> Easier handling of Multi byte characters.
>> Meaning, If absolute bare minimum necessary, just enough so
>>  I can calculate code point by my self and print necessary byte stream to
>> disk
>
> Not sure what that means.  You cannot possibly have this if Gawk does
> not understand the multibyte encoding of your language, because
> there's very little Gawk can do with bytes if it doesn't know how to
> break them into characters.  You will have a very crippled, perhaps
> even unusable, Gawk.  This "bare minimum" makes very little sense to
> me.
It makes sense to me, and will not affect your use case.
It will not affect performance either.

About 'Gawk does not understand the multibyte encoding of your language':
there is a small difference in understanding between you and me.

In the past, we used gawk's LANG=C setting, that is, the
hands_off_my_data flag, to make gawk work on byte-based streams.
We collected input byte by byte, and built 'characters' from that.
We played with those characters, specially crafted regular expressions
that were heavily escaped, and sent them out as streams of bytes.
It was sad, but better than nothing.

When primitive multibyte support came, we heavily modified and patched
regexp engines (it was never complete) and used various workarounds
to make it work. We used tricks of changing environment settings, too.

Today, most text is UTF-8. That is good.
But the grain of control over a UTF-8 string is very coarse.
We can still get a single 'character', calculate code points, and use
lookup tables, but on output, there is no good way.

> 'it [gawk] doesn't know how to break them into characters'
If we run gawk with the hands_off_my_data flag, we can reuse scripts
written in the past. It is rather the opposite: _when_gawk_knows_
is when we have a problem. Sometimes we want to output what gawk knows
in a form that we know, but gawk does not like it, and modifies or
strips our data.
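For reference, a sketch of that old byte-level style (simplified; real
code also had to validate sequences). The script would be run as
LC_ALL=C gawk -f count.awk file, the file name being illustrative:

  # Under LC_ALL=C, FS = "" splits each record into single bytes.
  # An ord[] table recovers their numeric values; any byte outside
  # the 0x80-0xBF continuation range starts a new UTF-8 character.
  BEGIN {
      FS = ""
      for (i = 1; i < 256; i++)
          ord[sprintf("%c", i)] = i
  }
  {
      n = 0
      for (i = 1; i <= NF; i++) {
          v = ord[$i]
          if (v < 0x80 || v >= 0xC0)   # ASCII or a lead byte
              n++
      }
      print n " characters on line " NR
  }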


>> >   . what solution(s) do you propose for that?
>> Not solution. Basic least intrusive addition to gawk.
>> Will be used as building block for complex features we need.
>
> Please elaborate: what exactly do you want from Gawk to be able to do
> that it cannot do now?
>
>> >   . why do you think that having those solutions as loadable
>> >     extensions (which are always distributed and installed with Gawk)
>> >     is not TRT?
>> I have pondered on the gawk extension idea myself for a while.
>> In general, the idea itself is very good. And hard work has been put in.
>>
>> The problem that needs ironing out is
>> 1) How to detect if lib is missing / handle gracefully when function
>> is not available.
>
> Why would you need that, if the library is _always_ present.  It is
> part of the Gawk distribution.
Can I not write extensions that call other libraries?
In particular, being able to call iconv or nkf would be nice.
But not all systems have them. I can use iconv most of the time, but
on some occasions the fine-grained control of nkf is nice to have.

Something like the following pseudocode, from within the awk script,
would be a nice thing to have:

  if (some_way_to_check_if_libs_are_available("nkf")) {
      @load "nkf"
  } else if (some_way_to_check_if_libs_are_available("iconv")) {
      @load "iconv"
  } else {
      # fall back to a stripped-down implementation written in awk
      @include "my_gawk_script_to_manipulate_strings.awk"
  }

You get a fallback; in the worst-case scenario it still works to a degree.

> And assuming that the library _is_ missing -- what would your Gawk
> script do in that case, except abort (something Gawk already does)?
I can provide a backup routine that is less capable but still performs
the minimum work. For my use case, anyway; YMMV.
Giving the script the chance to decide is, I believe, very important.

Must I write the check routine inside the extension?
Being able to do it directly from the script makes sense to me.
But I guess this is a separate issue.

If you insist, would you care to start a thread on this issue
(pros and cons of controlling extension loading from the script)?
I believe we had a similar discussion in the past about this
(for the @include operator).

>> 2) Environmental difference. MSWIN/UNIX the awk script works nicely if
>> taken care.
>>     Needs much tougher deep thinking and planning on deploy using libs.
>>     I do not have a good solution to this problem yet.
>
> Gawk already puts the libraries where it will automatically find them,
> both on Windows and on Posix systems.  So I'm not sure which problem
> are you alluding to here.

I guess we view this matter differently.

>> 3) Feasibility of having something similar to libffi, so other things
>> can be loaded
>> without wrapper.
>> We have lots of conversion routines, and if we can specify the calling
>> convention
>> from within the script, it would be very nice.
>
> That's what the dynamic loading feature in Gawk is all about.  Just
> use it; no need for libffi.
Hmm... I will continue to look into it.
Just a question: have you used libffi before?

The samples in ./extension/ suggest that I must write lots of wrappers.

So it becomes
  script <--call()--> gawk <--gawkapi--> wrapper <--std C API--> iconv
and such.

If a function equivalent to libffi were available, it could be
  script <--call()--> gawk <--std C API--> iconv
with things implemented in the script.
Being able to directly load and unload gives me more power.

Interpreters that have this capability (in contrast to having to write
wrappers for each and every library one needs) allow more flexibility.

Many have it under a name like loadlib(); Lua and MATLAB, to name a few.
Most have the ability to do
loadlib(), getfuncaddress(), setfunctype(), setfuncarg(), callfunc(),
a special strcpy(), and freelib(): just enough to load, call, and free,
while keeping the language clean.
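Purely hypothetically, to show the shape of it in gawk syntax (none of
these functions exist in gawk; the names are borrowed from that
loadlib() family):

  # Hypothetical script-level FFI: load iconv without a C wrapper.
  BEGIN {
      h  = loadlib("libiconv.so")              # hypothetical
      f  = getfuncaddress(h, "iconv_open")     # hypothetical
      setfunctype(f, "pointer")                # declare return type
      setfuncarg(f, "string", "string")        # declare argument types
      cd = callfunc(f, "UTF-8", "EUC-JP")
      # ... call iconv() itself through the same machinery ...
      freelib(h)
  }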

I guess we have different views on this; that is OK.
If I write a libffi wrapper with gawkapi, the problem is solved.
But that kind of misses the spirit.
What do you think?

>> The minimum capability I need was read / write of binary data.
>
> You were suggested to write an extension to do that.  If you think
> this would be impossible with an extension, please tell the details of
> why you think so.  I don't see any obstacles here.
>
I could write a my_binary_sprintf() that does "%c" on an int; it is only:

#include <stdio.h>
#include "gawkapi.h"

static const gawk_api_t *api;
static awk_ext_id_t *ext_id;
static const char *ext_version = "my_binary_sprintf extension: version 2.0";
int plugin_is_GPL_compatible;

/* my_binary_sprintf(num): return a one-byte string holding num & 0xff. */
static awk_value_t *
do_my_binary_sprintf(int nargs, awk_value_t *result)
{
        awk_value_t myval;
        int x;
        char *text;

        make_null_string(result);
        if (nargs != 1)
                return result;
        if (! get_argument(0, AWK_NUMBER, & myval))
                return result;
        x = myval.num_value;
        text = gawk_malloc(2);   /* buffer must be malloc'ed for gawk */
        text[0] = (char) (x & 0xff);
        text[1] = '\0';
        make_malloced_string(text, 1, result);
        return result;
}

static awk_ext_func_t func_table[] = {
        { "my_binary_sprintf", do_my_binary_sprintf, 1 },
};

static awk_bool_t (*init_func)(void) = NULL;   /* nothing to initialize */

dl_load_func(func_table, my_binary_sprintf, "")

...(code not checked, but it should be close)...

The current printf() and friends handle UTF-8 correctly when "%c" is
used, and that is good.
But when I must print a byte in the range 0x80 - 0xff,
the only known way is buggy and unreliable.
It is:
gawk 'BEGIN{printf("%c",0xffffff00+0x80);}' | xxd
We want something that is future-proof, so this kind of hack becomes
unnecessary. And it must also coexist with correct UTF-8 handling.
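For completeness, this is how the extension above would be used,
assuming it builds and installs where gawk can find it:

  @load "my_binary_sprintf"
  # Emits exactly one byte, 0x80, regardless of locale.
  # Verify with: gawk -f test.awk | xxd
  BEGIN {
      printf("%s", my_binary_sprintf(0x80))
  }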


