Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters

> Date: Wed, 16 Sep 2015 10:10:00 +0300
> From: address@hidden
> Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
> To: address@hidden
> CC: address@hidden; address@hidden
>
> > From: Janis Papanagnou <address@hidden>
> > Date: Tue, 15 Sep 2015 23:35:58 +0200
> >
> > > The problem is that you're feeding gawk invalid multibyte data for
> > > the locale you're in. When gawk tries to figure out where, in terms of
> > > characters, the match starts, it gets confused because of this invalid
> > > data.
> >
> > Obviously.
> >
> > My view is that (a) I expect *consistency* in the functions, and (b) I should
> > be able to process any data (from unknown locales). I can achieve (b) by
> > the two means I posted, so *functionally* I'm fine now. I think that (a)
> > should be addressed (i.e. a consistent implementation that does not
> > "confuse" awk, and let awk's set of functions work with the same "metric").
>
> You cannot have locale-independent processing as long as Gawk relies
> on locale-dependent functions such as mbrtowc, mbrlen, and strcoll.
> If we want to be locale-independent, we need to have
> locale-indifferent versions of those functions (and others like
> them). And even then, some users will _want_ locale dependency,
> e.g. when sorting text or displaying date/time values.
>
> So you are asking for something that is (a) a lot of work, and (b) is
> practically an unreachable goal, if you insist on 100% locale
> independence.

No. All I was asking for was to remove an inconsistency (or let gawk give
some hint that it has problems operating on the data).
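
For illustration only, a minimal sketch of what "the same metric" can look like. It is not necessarily one of the two workarounds mentioned above, which are not reproduced in this message. It assumes that forcing the C locale is acceptable, so that match() and substr() both count bytes. The helper name byte_match and the sample input are made up for the example.

```shell
# Hypothetical helper; the input is 'a', a lone 0xE4 byte (not valid
# UTF-8 on its own), then 'b'.
byte_match() {
    printf 'a\344b\n' |
        # LC_ALL=C makes awk treat the input as single bytes, so RSTART
        # from match() and the index used by substr() agree.
        LC_ALL=C awk '{ if (match($0, /b/)) print RSTART, substr($0, RSTART) }'
}

byte_match    # prints "3 b"
```

The price of this approach is that character-based semantics are lost for valid multibyte input as well, so it only fits cases where byte offsets are what is wanted.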

The statement "is a lot of work" needs no reply in the context of a bug report;
it's your decision, anyway, what you do and what you don't do (or ignore).

Thanks.

Janis

From:	Janis Papanagnou
Subject:	Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
Date:	Wed, 16 Sep 2015 13:40:40 +0200