Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters

Hi Arnold!

Sorry for the late reply; I couldn't access my Hotmail account from
abroad because of Hotmail's security measures, but now I'm fully online again.

> The problem is that you're feeding gawk invalid multibyte data for
> the locale you're in. When gawk tries to figure out where, in terms of
> characters, the match starts, it gets confused because of this invalid
> data.

Obviously.

My view is that (a) I expect *consistency* in the functions, and (b) I should
be able to process any data (from unknown locales). I can achieve (b) by
the two means I posted, so *functionally* I'm fine now. I think that (a)
should be addressed (i.e. by a consistent implementation that does not
"confuse" awk, and that lets awk's set of functions work with the same "metric").

YMMV, and all I can do is post the issue (at your request).
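
For the archive, here is a minimal sketch of the byte-oriented workaround
named in the quoted reply (LC_ALL=C; gawk's -b option is the other one
mentioned there). The function name, the one-line awk program and the sample
input are hypothetical stand-ins, not the attached 'testprog' and 'testdata'.
With LC_ALL=C, match() and substr() both count bytes, so RSTART and RLENGTH
stay usable even when the input is not valid in a multibyte locale:

```shell
# Hypothetical stand-in for 'testprog': print the first run of four digits
# on each input line, using match() followed by substr().
# LC_ALL=C is set only for the awk process, so offsets are counted in bytes.
extract_year() {
    LC_ALL=C awk 'match($0, /[0-9][0-9][0-9][0-9]/) {
        print substr($0, RSTART, RLENGTH)
    }'
}

# Sample line containing a Latin-1 byte (octal 351), which is invalid UTF-8.
printf 'caf\351 2007 x\n' | extract_year    # prints 2007
```

This only sidesteps the inconsistency; it does not address point (a) above.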

Thanks.

Janis

---

> From: address@hidden
> Date: Mon, 24 Aug 2015 18:30:48 +0300
> To: address@hidden; address@hidden
> Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
>
> Hi.
>
> > From: Janis Papanagnou <address@hidden>
> > To: "address@hidden" <address@hidden>
> > Date: Sat, 22 Aug 2015 22:33:52 +0200
> > Subject: [bug-gawk] Problem with substr() after match() with non-ASCII
> > characters
> >
> > The issue was observed using GNU awk 4.1.2 and confirmed to show the
> > same behaviour in GNU awk 4.1.3.
> >
> > With the attached program 'testprog' applied on the attached data 'testdata'
> > I do *not* get the expected result of four lines containing "2007" each, but
> > instead I get:
> >
> > 2007
> > 0703
> > 2007
> > 0071
> >
> > The problem is caused/triggered by non-ASCII characters in 'testdata'.
> >
> > Note: I can run 'testprog' with LC_ALL=C and the output is as expected.
>
> The problem is that you're feeding gawk invalid multibyte data for
> the locale you're in. When gawk tries to figure out where, in terms of
> characters, the match starts, it gets confused because of this invalid
> data.
>
> $ LC_ALL=en_US.UTF-8 gawk --lint -f testprog testdata
> 2007
> gawk: testprog:2: (FILENAME=testdata FNR=2) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
> 0703
> 2007
> 0071
>
> > My understanding is, though, that the implicit results from the match()
> > function, RSTART and RLENGTH, should be consistently usable in substr(),
> > independent of the locale setting.
>
> *When the data is valid*, this is correct and things work as expected.
> In your case, it's Garbage In, Garbage Out. :-(
>
> If there's a way to set the locale to latin-whatever for where you
> are, then things will probably work ok. Otherwise, you should use
> LC_ALL=C or the -b option.
>
> There really is no way around this; the underlying C library routines
> depend on the value of the locale variables in order to interpret
> the input data.
>
> HTH,
>
> Arnold

From:	Janis Papanagnou
Subject:	Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
Date:	Tue, 15 Sep 2015 23:35:58 +0200