From: Pádraig Brady
Subject: bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length?
Date: Wed, 13 May 2015 02:20:27 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

On 06/05/15 11:53, Pádraig Brady wrote:
> On 06/05/15 05:29, Ben Rusholme wrote:
>> As you say, this can always be fixed by the "--suffix-length" argument, but
>> it’s only required for certain combinations of FROM and CHUNK (and “split”
>> already has all the information it needs).
>>
>>> Now you could bump the suffix length based on the start number,
>>> though I don't think we should as that would impact on future
>>> processing (ordering) of the resultant files. I.E. specifying
>>> a FROM value to --numeric-suffixes should only impact the
>>> start value, rather than the width.
>>
>> Could you clarify this for me? Doesn’t the zero-padding ensure correct
>> processing order?
>
> There are two use cases supported by specifying FROM.
> 1. Setting the start for a single run (FROM is usually 1 in this case)
> 2. Setting the offset for multiple independent split runs.
> In the second case we can't infer the size of the total set
> in any particular run, and thus require that --suffix-length is specified
> appropriately.
> I.E. for multiple independent runs, the suffix length needs to be
> fixed width across the entire set for the total ordering to be correct.
>
>
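To illustrate the ordering point, here is a minimal sketch in Python. The file
names are hypothetical (prefix "x", a second run started at FROM=100), and a
plain byte-wise sort is assumed for whatever later processes the files. Mixed
suffix widths do not sort into numeric order, while a fixed width does.

```python
# Hypothetical output names from two independent split runs.
# Run 1 ended with 2-digit suffixes; run 2 started at 100 and needed 3 digits.
mixed = ["x98", "x99", "x100", "x101"]

# The same files with a fixed suffix length of 3 (as with -a 3 on every run).
fixed = ["x098", "x099", "x100", "x101"]


def lexical_order(names):
    """Return names in plain lexicographic (byte-wise) order."""
    return sorted(names)


print(lexical_order(mixed))  # ['x100', 'x101', 'x98', 'x99']
print(lexical_order(fixed))  # ['x098', 'x099', 'x100', 'x101']
```

With mixed widths the files from the second run sort ahead of the first run's,
which is the silent ordering breakage that a fixed --suffix-length avoids.
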
> Things we could change are...
>
> 1. Special case FROM=1 to assume a single run and thus
> enable auto suffix expansion or appropriately sized suffix with CHUNK.
> This would be a backward-incompatible change, and FROM=1 does not
> guarantee a single run, so I'm reluctant to do that.
>
> 2. Give an early error with specified FROM and CHUNK
> that would overflow the suffix size for CHUNK.
> This would save some processing, though doesn't add
> any protections against latent issues. I.E. you still get
> the error which is dependent on the parameters rather than the input data
> size.
> Therefore it's probably not worth the complication.
>
> 3. Leave suffix length at 2 when both FROM and CHUNK are specified.
> In retrospect, this would probably have been the best option
> to avoid ambiguities like this. However, we would now be breaking
> compatibility with scripts that use FROM=1 and CHUNK=200 etc.,
> although CHUNK values > 100 would be unusual.
>
> 4. Auto set the suffix len based on FROM + CHUNK.
> That would support use case 1 (single run),
> but _silently_ break subsequent processing order
> of outputs from multiple split runs
> (as FROM is increased in multiples of CHUNK size).
> We could mitigate the _silent_ breakage though
> by limiting this change to when FROM < CHUNK.
>
> 5. Document in the man page, and in more detail in the info docs,
> that -a is recommended when specifying FROM.
>
> So I'll do 4 and 5 I think.
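As a rough sketch of the rule in option 4 (not the patch itself): the Python
below uses a hypothetical helper, suffix_length(), and assumes decimal suffixes
and a default width of 2. It widens the suffix to fit FROM + CHUNK only when
FROM < CHUNK; otherwise it sizes for CHUNK alone.

```python
def suffix_length(chunks, start=0, default=2):
    """Digits needed for numeric suffixes when splitting into `chunks` files.

    Hypothetical helper illustrating option 4: the start value only widens
    the suffix when start < chunks (taken as a hint of a single run).
    """
    end = chunks
    if start < chunks:
        end += start  # make room for the offset
    # The largest suffix considered is end - 1; never go below the default.
    return max(default, len(str(end - 1)))


print(suffix_length(100))             # 2: suffixes 00..99
print(suffix_length(100, start=1))    # 3: suffixes 001..100
print(suffix_length(200, start=1))    # 3: suffixes 001..200
print(suffix_length(100, start=100))  # 2: FROM is not < CHUNK, no widening
```

In the last case the suffixes would run up to 199 and not fit in 2 digits, so
this is where -a would still have to be given explicitly, as in option 5.
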
Attached.
cheers,
Pádraig
split-from-width.patch
Description: Text Data