From: Janis Papanagnou
Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
Date: Wed, 16 Sep 2015 13:40:40 +0200
> Date: Wed, 16 Sep 2015 10:10:00 +0300
> From: address@hidden
> Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
> To: address@hidden
> CC: address@hidden; address@hidden
>
> > From: Janis Papanagnou <address@hidden>
> > Date: Tue, 15 Sep 2015 23:35:58 +0200
> >
> > > The problem is that you're feeding gawk invalid multibyte data for
> > > the locale you're in. When gawk tries to figure out where, in terms of
> > > characters, the match starts, it gets confused because of this invalid
> > > data.
> >
> > Obviously.
> >
> > My view is that (a) I expect *consistency* in the functions, and (b) I should
> > be able to process any data (from unknown locales). I can achieve (b) by
> > the two means I posted, so *functionally* I'm fine now. I think that (a)
> > should be addressed (i.e. a consistent implementation that does not
> > "confuse" awk, and let awk's set of functions work with the same "metric").
>
> You cannot have locale-independent processing as long as Gawk relies
> on locale-dependent functions such as mbrtowc, mbrlen, and strcoll.
> If we want to be locale-independent, we need to have
> locale-indifferent versions of those functions (and others like
> them). And even then, some users will _want_ locale dependency,
> e.g. when sorting text or displaying date/time values.
>
> So you are asking for something that is (a) a lot of work, and (b) is
> practically an unreachable goal, if you insist on 100% locale
> independence.

No. All I was asking for was to remove an inconsistency (or let gawk give
some hint that it has problems operating on the data).

The statement "is a lot of work" needs no reply in context of a bug report;
it's your decision, anyway, what you do and what you don't do (or ignore).

Thanks.

Janis