bug-datamash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bug-datamash] Unable to use comma as field separator when the local


From: Erik Auerswald
Subject: Re: [Bug-datamash] Unable to use comma as field separator when the locale uses it as a decimal separator too
Date: Sun, 17 Jul 2022 15:49:36 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0

Hi,

On 17.07.22 13:42, Erik Auerswald wrote:
On 14.07.22 14:41, Erik Auerswald wrote:
On Wed, Sep 12, 2018 at 05:42:09PM +0200, Jérémie Roquet wrote:
[...]
With datamash using the locale to determine which decimal separator to
use, the behavior becomes inconsistent when the field separator and
the decimal separator are the same.
[...]
I can reproduce this with a German locale and current datamash, too:
[...]
Not all operations are affected:

This affects numeric operations, i.e., whenever the field contents need
to be interpreted as a (floating point) number.

The "getnum" operation is affected, too:

    $ echo bar-1.2soom | env LC_ALL=C ./datamash -t. getnum 1,2
    1.2.2

This is most likely in a different code path.

[...]
This issue affects using a period ('.') as field separator in locales
where a period is used as decimal point, too:
[...]
I think that this is a bug.  [...]

This problem results from the way GNU Datamash parses its input.
The input data is kept as-is, and the individual fields are designated
via pointers to the beginning and the field length.  Thus the input
field separator is part of the string containing the field data unless
processing the last field of a line.

Just replacing the field separator with a NUL breaks many tests.
The code obviously assumes generally unchanged input lines.

The start pointer of the field is given to the floating point parsing
function strtold().  If the field separator is the same as the decimal
separator, strtold() does not stop at the field separator, but continues
parsing a number.

After strtold() has possibly returned a number, it is checked if
parsing continued after the end of the field, which is interpreted as
an "invalid number".

This can result in additional surprising results with unusual field
separators:

    $ echo 1e2 | ./datamash -te sum 1
    ./datamash: invalid numeric value in line 1 field 1: '1'
    $ echo 1E2 | ./datamash -tE sum 1
    ./datamash: invalid numeric value in line 1 field 1: '1'
    $ echo 0x1p2 | ./datamash -tp sum 1
    ./datamash: invalid numeric value in line 1 field 1: '0x1'
    $ echo 0x1P2 | ./datamash -tP sum 1
    ./datamash: invalid numeric value in line 1 field 1: '0x1'

This is done in the field_op_collect() function in the src/field-ops.c
file.

This function contains code to copy the field contents into a temporary
string with NUL termination that is only used if the system's strtold()
does not stop at the default field separator TAB.  Always using this
code should fix this problem.

It would also introduce additional data copying that is unnecessary
in most use-cases.

As an optimization, the data copy could be used only if it is needed,
i.e., when either the field separator is the same as the (locale
specific) decimal separator, or when the field separator is a TAB
and strtold() is known not to stop at a TAB (this is checked during
./configure).

The temporary copy to give just the field contents to strtold() is
needed whenever the field separator is a character that can possibly
continue a floating point value.  This includes at least hexadecimal
digits, the locale specific decimal separator, and the exponent
indicator (e, E, p, P).  The strtold() function supports additional
words (inf, inifinity, nan) and even allows an implementation defined
suffix for nan to provide additional information about the type of
NaN.  This might possibly result in additional edge cases.

I'd say that always using a temporary copy to hold the string
representing a (floating point) number is the safest way forward.

Tim, Shawn, what do you think?

[...]

Br,
Erik



reply via email to

[Prev in Thread] Current Thread [Next in Thread]