Subject: bug#13127: [PATCH] cut: use only one data structure
From: Jim Meyering
Date: Sun, 09 Dec 2012 21:45:03 +0100
Cojocaru Alexandru wrote:
> Subject: [PATCH] cut: use only one data structure
>
> The current implementation of cut uses a bit array,
> an array of `struct range_pair's, and (when --output-delimiter
> is specified) a hash table. The new implementation uses
> only an array of `struct range_pair's.
> The old implementation is inefficient for the following reasons:
> 1. When -b with a large number is specified, it allocates a lot
> of useless memory for `printable_field'.
> 2. When --output-delimiter is specified, it allocates 31 hash
> buckets, even if only a few ranges are specified.
...
> -/* Given the list of field or byte range specifications FIELDSTR, set
> - MAX_RANGE_ENDPOINT and allocate and initialize the PRINTABLE_FIELD
> - array. If there is a right-open-ended range, set EOL_RANGE_START
> - to its starting index. FIELDSTR should be composed of one or more
> - numbers or ranges of numbers, separated by blanks or commas.
> - Incomplete ranges may be given: '-m' means '1-m'; 'n-' means 'n'
> - through end of line. Return true if FIELDSTR contains at least
> - one field specification, false otherwise. */
> -
> -/* FIXME-someday: What if the user wants to cut out the 1,000,000-th
> - field of some huge input file? This function shouldn't have to
> - allocate a table of a million bits just so we can test every
> - field < 10^6 with an array dereference. Instead, consider using
> - an adaptive approach: if the range of selected fields is too large,
> - but only a few fields/byte-offsets are actually selected, use a
> - hash table. If the range of selected fields is too large, and
> - too many are selected, then resort to using the range-pairs (the
> - 'rp' array) directly. */
Thanks for the patch.
This is large enough that you'll have to file a copyright assignment.
For details, see the "Copyright assignment" section in the file
named HACKING.
Have you considered performance in the common case?
I suspect that a byte or field number larger than 1000 is
not common. That is why, in the FIXME comment above,
I suggested using an adaptive approach. I had the feeling
(I don't remember whether I profiled it) that testing a bit per
input field would be more efficient than an in-range test.
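To make the trade-off concrete, here is a minimal sketch of the two
membership tests being compared. The names (`struct range_pair`,
`in_ranges`, `bit_is_set`) are illustrative only, not the actual
identifiers in cut.c:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch, not the cut.c implementation.  */
struct range_pair { size_t lo; size_t hi; };

/* In-range test: scan the sorted, non-overlapping range pairs for N.
   Cost grows with the number of ranges preceding the match.  */
static bool
in_ranges (size_t n, struct range_pair const *rp, size_t n_rp)
{
  for (size_t i = 0; i < n_rp; i++)
    {
      if (n < rp[i].lo)
        return false;   /* sorted: no later pair can contain N */
      if (n <= rp[i].hi)
        return true;
    }
  return false;
}

/* Bit-array test: one dereference plus a shift and mask, but it
   requires a table as large as the maximum selected endpoint.  */
static bool
bit_is_set (size_t n, unsigned char const *bits)
{
  return (bits[n / CHAR_BIT] >> (n % CHAR_BIT)) & 1;
}
```

The adaptive idea in the FIXME is to choose between these at startup:
keep the bit array while the maximum endpoint is small, and fall back
to the range pairs (or a hash table, for sparse selections) only when
it is not.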
If you construct test cases and gather timing data, please do so
in a reproducible manner and include the details when you report back,
so we can compare results on different types of systems. Cache
behavior matters a lot these days.
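For example, a reproducible timing run might look like the sketch
below. The input size, field lists, and the file name cut-bench.txt
are arbitrary choices for illustration; adjust them to taste:

```shell
# Generate a deterministic input: 100k lines of 1000
# colon-separated numeric fields.
yes "$(seq -s: 1 1000)" | head -n 100000 > cut-bench.txt

# Time a sparse selection and a wide open-ended range separately,
# since they stress the membership test differently.
time cut -d: -f3,999 cut-bench.txt > /dev/null
time cut -d: -f2- cut-bench.txt > /dev/null
```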