octave-maintainers

Re: New strsplit function


From: Ben Abbott
Subject: Re: New strsplit function
Date: Thu, 16 May 2013 14:52:44 +0800

On May 16, 2013, at 2:39 PM, John W. Eaton wrote:

> On 05/16/2013 02:19 AM, Ben Abbott wrote:
> 
>> hmmm ... I took a look at Matlab 2013a.  It's not clear to me that we'd want 
>> to copy this.
>> 
> 
> Well, Matlab users apparently want compatibility here.  That's why I
> received the report.
> 
>> matlab>  strsplit('', 'a')
>> 
>> ans =
>> 
>>     {''}
>> 
>> matlab>  strsplit('a', 'a')
>> 
>> ans =
>> 
>>     ''    ''
>> 
>> matlab>  strsplit('aa', 'a')
>> 
>> ans =
>> 
>>     ''    ''
>> 
>> matlab>  strsplit('aaa', 'a')
>> 
>> ans =
>> 
>>     ''    ''
>> 
>> matlab>  strsplit('aaaa', 'a')
>> 
>> ans =
>> 
>>     ''    ''
>> 
>> matlab>  strsplit ('abc', {'a','b','c'})
>> 
>> ans =
>> 
>>     ''    ''
>> 
>> In case it isn't clear, the output is a cellstring containing two empty 
>> strings.
> 
> Oh, so collapsedelimiters means that if multiple consecutive delimiters
> appear in the string that is being split, they should be treated as
> one?

That is my understanding.  A moment ago, it occurred to me that I should check 
how regexp () behaves.

octave> regexp ('aaaaa', '(a)+', 'split')
ans = 
{
  [1,1] = 
  [1,2] = 
}
octave> strsplit ('aaaaa', 'a', 'delimitertype', 'regularexpression')
ans = 
{
  [1,1] = 
  [1,2] = 
}

So it looks like this is not a Matlab bug after all, but a misunderstanding 
on my part.

> Then I think my guess about what was happening was wrong, and the
> behavior above is correct.  If the string is 'aa' and the delimiter is
> 'a', then it is the same as strsplit ('a', 'a') and the result should
> be two empty strings (one for before and one for after the
> delimiter).  That's the result we used to get for the simpler case of
> strsplit ('a', 'a').  Now we get an empty cell array, which looks
> wrong to me.

ahhh ... ok, that makes sense to me!
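
The equivalence described above can be checked with regexp () alone. This is only a small sketch, and it assumes the pattern 'a+' is a fair stand-in for "collapse consecutive 'a' delimiters"; the variable names are made up for illustration.

```octave
## With consecutive delimiters collapsed, 'a' and 'aa' both contain a
## single delimiter run, so each splits into one piece before the run
## and one piece after it.
one_delim = regexp ('a',  'a+', 'split');
two_delim = regexp ('aa', 'a+', 'split');

## Both results hold exactly two pieces, and every piece is empty.
assert (numel (one_delim) == 2);
assert (numel (two_delim) == 2);
assert (all (cellfun ('isempty', one_delim)));
assert (all (cellfun ('isempty', two_delim)));
```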

> So in this code
> 
>    ## Get substring lengths.
>    if (isempty (idx))
>      strlens = length (str);
>    else
>      strlens = [idx(1)-1, diff(idx)-1, numel(str)-idx(end)];
>    endif
>    if (nargout > 1)
>      ## Grab the separators
>      matches = num2cell (str(idx)(:)).';
>      if (args.collapsedelimiters)
>        ## Collapse the consecutive delimiters
>        ## TODO - is there a vectorized way?
>        for m = numel(matches):-1:2
>          if (strlens(m) == 0)
>            matches{m-1} = [matches{m-1:m}];
>            matches(m) = [];
>          endif
>        end
>      endif
>    endif
>    ## Remove separators.
>    str(idx) = [];
>    if (args.collapsedelimiters)
>      ## Omit zero lengths.
>      strlens = strlens(strlens != 0);
>    endif
> 
>    ## Convert!
>    result = mat2cell (str, 1, strlens);
> 
> it seems like we should be performing the "omit zero lengths" part on
> the output of diff, then tacking on the beginning and ending strings.
> But I don't understand what the "if (nargout > 1)" part in between is
> doing.

The (nargout > 1) part is there to allow the block to be skipped if "matches" 
(the 2nd output) isn't requested.  I'll take a look at your suggested change.
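
For reference, one possible reading of that suggestion is sketched below. It is not the committed fix: sketch_strlens is a hypothetical helper name, it assumes single-character delimiters whose positions are already in idx, and it leaves the "matches" collapsing loop untouched.

```octave
1;  # mark this as a script file so the function can be defined inline

## Hypothetical helper, not part of strsplit.m: compute substring lengths,
## dropping zero lengths only between delimiters.
function strlens = sketch_strlens (str, idx, collapse)
  if (isempty (idx))
    strlens = numel (str);           # no delimiter: one piece, whole string
    return;
  endif
  inner = diff (idx) - 1;            # lengths between consecutive delimiters
  if (collapse)
    inner = inner(inner != 0);       # omit zero lengths from the diff only
  endif
  ## Tack on the leading and trailing pieces, even when they are empty.
  strlens = [idx(1)-1, inner, numel(str)-idx(end)];
endfunction

## 'aa' split on 'a' with collapsing: two empty pieces, as for 'a'.
assert (sketch_strlens ('aa', [1 2], true), [0 0]);
## Without collapsing, the empty piece between the delimiters is kept.
assert (sketch_strlens ('aa', [1 2], false), [0 0 0]);
## Interior run collapsed, non-empty ends preserved.
assert (sketch_strlens ('xaay', [2 3], true), [1 1]);
```

Because the first and last lengths are appended after the filtering, a string made up only of delimiters still yields two empty strings, which matches the Matlab output quoted earlier.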

Ben


