bug-gawk



From: Eli Zaretskii
Subject: Re: [bug-gawk] Problem with substr() after match() with non-ASCII characters
Date: Wed, 16 Sep 2015 10:10:00 +0300

> From: Janis Papanagnou <address@hidden>
> Date: Tue, 15 Sep 2015 23:35:58 +0200
> 
> > The problem is that you're feeding gawk invalid multibyte data for
> > the locale you're in. When gawk tries to figure out where, in terms of
> > characters, the match starts, it gets confused because of this invalid
> > data.
> 
> Obviously.
> 
> My view is that (a) I expect *consistency* in the functions, and (b) I should
> be able to process any data (from unknown locales). I can achieve (b) by
> the two means I posted, so *functionally* I'm fine now. I think that (a)
> should be addressed (i.e. a consistent implementation that does not
> "confuse" awk, and let awk's set of functions work with the same "metric").

You cannot have locale-independent processing as long as Gawk relies
on locale-dependent functions such as mbrtowc, mbrlen, and strcoll.
If we want to be locale-independent, we need to have
locale-indifferent versions of those functions (and others like
them).  And even then, some users will _want_ locale dependency,
e.g. when sorting text or displaying date/time values.
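To make the dependency concrete, here is a minimal sketch of how a
byte offset maps to a character offset under different encodings. It
is only an analogy: Python's codecs stand in for the locale's
multibyte decoding (what mbrtowc and mbrlen do in C), and the helper
name char_offset is hypothetical. It is not how Gawk is implemented.

```python
# Illustrative only: Python's codecs stand in for the locale's multibyte
# decoding (mbrtowc/mbrlen in C). char_offset is a hypothetical helper,
# not part of gawk.

def char_offset(data: bytes, byte_offset: int, encoding: str):
    """Return the character index corresponding to byte_offset, or None
    if the bytes before that offset are invalid in the given encoding."""
    try:
        return len(data[:byte_offset].decode(encoding))
    except UnicodeDecodeError:
        return None

text = "Grüße x"
latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # "ü" and "ß" take two bytes each

# Byte position of the match "x" in each buffer.
pos_latin1 = latin1_bytes.index(b"x")
pos_utf8 = utf8_bytes.index(b"x")

# Data decoded with the encoding it was written in: offsets agree.
same_latin1 = char_offset(latin1_bytes, pos_latin1, "latin-1")
same_utf8 = char_offset(utf8_bytes, pos_utf8, "utf-8")

# Latin-1 data read as UTF-8: the prefix is invalid multibyte data,
# so there is no well-defined character offset at all.
mismatched = char_offset(latin1_bytes, pos_latin1, "utf-8")

# UTF-8 data read as Latin-1: decoding succeeds, but every byte counts
# as one character, so the offset differs from the UTF-8 view.
bytewise = char_offset(utf8_bytes, pos_utf8, "latin-1")

print(same_latin1, same_utf8, mismatched, bytewise)
```

The same bytes give a different character position, or none, depending
on the encoding assumed, which is why a character-based function cannot
give consistent answers on data that is invalid for the current locale.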

So you are asking for something that is (a) a lot of work and (b)
practically an unreachable goal, if you insist on 100% locale
independence.


