From: Janis Papanagnou
Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
Date: Tue, 15 Sep 2015 23:35:58 +0200

Hi Arnold!

Sorry for the late reply; I couldn't access my Hotmail account from abroad
because of Hotmail's security measures, but now I'm fully online again.

> The problem is that you're feeding gawk invalid multibyte data for
> the locale you're in. When gawk tries to figure out where, in terms of
> characters, the match starts, it gets confused because of this invalid
> data.

Obviously. My view is that (a) I expect *consistency* in the functions,
and (b) I should be able to process any data (from unknown locales).

I can achieve (b) by the two means I posted, so *functionally* I'm fine now.
I think that (a) should be addressed (i.e. a consistent implementation
that does not "confuse" awk, and lets awk's set of functions work with the
same "metric").

YMMV, and all I can do is post (on your demand) the issue.

Thanks.

Janis

---
> From: address@hidden
> Date: Mon, 24 Aug 2015 18:30:48 +0300
> To: address@hidden; address@hidden
> Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
>
> Hi.
>
> > From: Janis Papanagnou <address@hidden>
> > To: "address@hidden" <address@hidden>
> > Date: Sat, 22 Aug 2015 22:33:52 +0200
> > Subject: [bug-gawk] Problem with substr() after match() with non-ASCII
> > characters
> >
> > The issue was observed using GNU awk 4.1.2 and confirmed to show the
> > same behaviour in GNU awk 4.1.3.
> >
> > With the attached program 'testprog' applied on the attached data 'testdata'
> > I do *not* get the expected result of four lines containing "2007" each, but
> > instead I get:
> >
> > 2007
> > 0703
> > 2007
> > 0071
> >
> > The problem is caused/triggered by non-ASCII characters in 'testdata'.
> >
> > Note: I can run 'testprog' with LC_ALL=C and the output is as expected.
>
> The problem is that you're feeding gawk invalid multibyte data for
> the locale you're in. When gawk tries to figure out where, in terms of
> characters, the match starts, it gets confused because of this invalid
> data.
>
> $ LC_ALL=en_US.UTF-8 gawk --lint -f testprog testdata
> 2007
> gawk: testprog:2: (FILENAME=testdata FNR=2) warning: Invalid multibyte data detected. There may be a mismatch between your data and your locale.
> 0703
> 2007
> 0071
>
> > My understanding is, though, that the implicit results from the match()
> > function, RSTART and RLENGTH, should be consistently usable in substr(),
> > independent of the locale setting.
>
> *When the data is valid*, this is correct and things work as expected.
> In your case, it's Garbage In, Garbage Out. :-(
>
> If there's a way to set the locale to latin-whatever for where you
> are, then things will probably work ok. Otherwise, you should use
> LC_ALL=C or the -b option.
>
> There really is no way around this; the underlying C library routines
> depend on the value of the locale variables in order to interpret
> the input data.
>
> HTH,
>
> Arnold
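
For reference, the byte-oriented workaround mentioned above (LC_ALL=C) can be sketched roughly as follows. The thread's 'testprog' and 'testdata' attachments are not reproduced here, so the sample input line, the regular expression, and the function name extract_year are hypothetical stand-ins, not the original program. The sample line carries a lone Latin-1 byte (octal 344), which is not valid UTF-8 on its own.

```shell
# Hypothetical helper (not the thread's 'testprog'): print the first run of
# four digits found on each input line.
extract_year() {
    # LC_ALL=C makes awk treat every byte as one character, so the
    # RSTART/RLENGTH values set by match() are in the same units that
    # substr() expects, whatever encoding the data is in.
    LC_ALL=C awk '{
        if (match($0, /[0-9][0-9][0-9][0-9]/))
            print substr($0, RSTART, RLENGTH)
    }'
}

# Assumed sample data: "J", byte 0xE4 (Latin-1 a-umlaut), "nner 2007".
printf 'J\344nner 2007\n' | extract_year
```

With gawk, the -b option suggested in the quoted reply should give the same byte-wise treatment without changing the environment for the whole command.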