From: Erik Auerswald
Subject: Re: [BUG] fractional bin sizes do not work in some locales (e.g., de_DE.UTF-8)
Date: Sat, 25 Jun 2022 00:36:05 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1

Hi Tim,

On 24.06.22 23:36, Tim Rice wrote:
Hey Erik,

while looking at the binning issues reported by Andreas Schamanek[0] I
noticed that providing floating point numbers as bin sizes does not work
when using a locale where comma (',') is used as decimal separator:

   $ echo $LC_NUMERIC


   $ echo 1,15 | datamash bin:0,1 1
   datamash: missing field for operation ‘bin’

I was having a play around with this, and (plot twist!), things work as expected when using LC_ALL instead of LC_NUMERIC:

$ datamash sum 1 <<< 1,1
datamash: invalid numeric value in line 1 field 1: '1,1'

$ LC_ALL=de_DE.utf8 datamash sum 1 <<< 1,1

I cannot reproduce that, both LC_NUMERIC and LC_ALL work for me.

Reading numbers in de_DE.UTF-8 format works:

$ printf '%s\n' 1,1 2,2 | ./datamash sum 1

They can be binned into buckets, too:

$ printf '%s\n' 1,1 2,2 | ./datamash --full bin:1 1
1,1     1
2,2     2

But the bucket size cannot be a floating point number:

$ printf '%s\n' 1,1 2,2 | ./datamash --full bin:1,1 1
./datamash: missing field for operation ‘bin’

$ printf '%s\n' 1,1 2,2 | ./datamash --full bin:1.1 1
./datamash: invalid operand ‘.1 1’

$ printf '%s\n' 1,1 2,2 | ./datamash --full bin:1\\,1 1
./datamash: invalid operation ‘1’

But with a locale using '.' as decimal separator, the bucket size
can be floating point:

$ printf '%s\n' 1.1 2.2 \
> | env LC_NUMERIC=en_US.UTF-8 ./datamash --full bin:1.1 1
1.1     1.1
2.2     2.2

I agree it should also work with LC_NUMERIC. So far, it is mysterious to me why it doesn't. I tried explicitly using `setlocale(LC_NUMERIC,"")` in the main function (where LC_ALL is set), but nothing seems to "stick".

Because the problem is not reading locale specific input, it is
parsing an operation specification comprising a floating point
number using ',' as decimal separator.  The comma has a special
meaning in operation parsing.
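
A minimal sketch of that ambiguity, assuming a tokenizer that treats
every ',' in the operation specification as a separator.  The helper
name split_on_comma is hypothetical, and this is not datamash's actual
parser; it only illustrates why a ',' decimal separator collides with
the operation syntax while '.' does not:

```shell
# Hypothetical illustration only; NOT datamash's real parsing code.
# Split a specification on ',' as a comma-as-separator tokenizer would,
# printing one token per line.
split_on_comma() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# The de_DE-style bucket size is torn into two tokens: "bin:1" and "1".
split_on_comma 'bin:1,1'

# With '.' as decimal separator the specification stays a single token.
split_on_comma 'bin:1.1'
```

Under that assumption the fractional part after the comma never reaches
the number parser as part of the bucket size.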

Do you have any insight about what the problem might be?

Not yet.  I suppose the operation parser does not take the locale
setting into account.

I tried checking what other GNU projects do. I thought GNU Awk or GNU bc might point me in the right direction. In fact, it seems like they don't even respect LC_ALL:

Yes, they just use '.' as decimal separator and do not honor the
locale setting.  I think that is fine.

$ awk '{printf "%f %f\n", $1, $2}' <<< "1,1 1.1"
1.000000 1.100000

$ LC_ALL=de_DE.utf8 awk '{printf "%f %f\n", $1, $2}' <<< "1,1 1.1"
1.000000 1.100000

$ LC_ALL=de_DE.utf8 bc <<< '1,1+1,1'
(standard_in) 1: syntax error
(standard_in) 1: syntax error

$ LC_ALL=de_DE.utf8 bc <<< '1.1+1.1'

So if we can figure this out for GNU Datamash, we may need to raise some bugs and submit some patches to other GNU projects too :)

I do not think so.  I actually prefer the behavior of GNU Awk or bc.

But GNU Datamash has used the locale setting for a long time, so IMHO
we should look into making it work better.

