[bug-libunistring] A bug in u-strtok.h and the fix in libunistring-0.9.5

Hi,

First of all, thank you for the great Unicode library in C. Recently I've been using the library intensively.

During my experiments with libunistring-0.9.5, I found what looks like a bug in u-strtok.h, shown below. The lines starting with "////" are my proposed changes.

/* Move past the token. */
{
    UNIT *token_end = U_STRPBRK (str, delim);
    if (token_end)
      {
        /* NUL-terminate the token. */
        *token_end = 0;
        *ptr = token_end + 1;
        //// These lines should be something like below: measure the
        //// delimiter before overwriting it, then skip all of its units.
        //// *ptr = token_end + u8_strmblen (token_end);
        //// *token_end = 0;
      }
    else
      *ptr = NULL;
}

The original code resumes the next search at "token_end + 1", which assumes that the matched delimiter occupies exactly one UNIT, without checking how many units it actually takes. When the delimiter takes more than one UNIT, such as a Japanese delimiter character, this assumption fails and the next search starts from an invalid location in the middle of a Unicode character.
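
To make the failure concrete, here is a minimal sketch that does not use libunistring. Both helper names are hypothetical and exist only for this illustration. It assumes UTF-8 input in which the delimiter is U+3001 (bytes E3 80 81), and it mimics only the two assignments from the snippet above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true if b is a UTF-8 continuation byte
   (bit pattern 10xxxxxx), i.e. a byte that cannot start a character. */
static int
is_utf8_continuation (uint8_t b)
{
  return (b & 0xC0) == 0x80;
}

/* Hypothetical helper that mimics the original step: NUL-terminate the
   token at the delimiter's first unit and resume exactly one unit later.
   delim_start must point at the first byte of the matched delimiter. */
static uint8_t *
resume_after_one_unit (uint8_t *delim_start)
{
  assert (delim_start != 0);
  *delim_start = 0;          /* *token_end = 0;       */
  return delim_start + 1;    /* *ptr = token_end + 1; */
}
```

For the buffer 'a', E3, 80, 81, 'b', calling resume_after_one_unit on the address of the E3 byte returns a pointer to the 80 byte, for which is_utf8_continuation is true. The next search would therefore begin inside the delimiter rather than at 'b'.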

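To solve the issue, one can define U_STRMBLEN as u8_strmblen, u16_strmblen or u32_strmblen accordingly and use *ptr = token_end + U_STRMBLEN (token_end) instead of *ptr = token_end + 1. The length has to be taken before *token_end is set to 0, because afterwards the first unit of the delimiter is gone. No sizeof (UNIT) factor is needed, since arithmetic on a UNIT pointer already counts in units.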
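
The following is a rough, self-contained sketch of the corrected logic for the UTF-8 case only. It is not the libunistring implementation: utf8_char_len, is_delim and utf8_strtok_sketch are hypothetical names, with utf8_char_len standing in for u8_strmblen, and the sketch classifies characters by their lead byte without full validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for u8_strmblen: bytes in the character at s,
   judged from the lead byte.  Returns 0 at the terminating NUL and never
   steps over a NUL, even if the sequence is truncated. */
static size_t
utf8_char_len (const uint8_t *s)
{
  size_t expected = 1, n;
  if (*s == 0)
    return 0;
  if ((*s & 0xE0) == 0xC0) expected = 2;
  else if ((*s & 0xF0) == 0xE0) expected = 3;
  else if ((*s & 0xF8) == 0xF0) expected = 4;
  for (n = 1; n < expected && (s[n] & 0xC0) == 0x80; n++)
    ;
  return n;
}

/* True if the character at c is one of the characters in delim. */
static int
is_delim (const uint8_t *c, const uint8_t *delim)
{
  size_t len = utf8_char_len (c);
  for (; *delim; delim += utf8_char_len (delim))
    if (utf8_char_len (delim) == len && memcmp (c, delim, len) == 0)
      return 1;
  return 0;
}

/* strtok_r-style tokenizer that skips the whole delimiter character. */
static uint8_t *
utf8_strtok_sketch (uint8_t *str, const uint8_t *delim, uint8_t **ptr)
{
  uint8_t *token_end;
  assert (delim != NULL && ptr != NULL);
  if (str == NULL && (str = *ptr) == NULL)
    return NULL;
  /* Skip leading delimiters, one whole character at a time. */
  while (is_delim (str, delim))
    str += utf8_char_len (str);
  if (*str == 0)
    {
      *ptr = NULL;
      return NULL;
    }
  /* Find the end of the token. */
  for (token_end = str; *token_end && !is_delim (token_end, delim); )
    token_end += utf8_char_len (token_end);
  if (*token_end)
    {
      /* Measure the delimiter BEFORE overwriting its first unit. */
      size_t delim_len = utf8_char_len (token_end);
      *token_end = 0;
      *ptr = token_end + delim_len;
    }
  else
    *ptr = NULL;
  return str;
}
```

With the input "a、bc、d" and the delimiter string "、" (U+3001), successive calls return "a", "bc" and "d", then NULL. The only difference from the original step is that the resume pointer advances by the delimiter's length in units instead of by 1.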

I checked the source code and the git log but could not find a relevant change. If I have missed a fix or misunderstood the logic, and the code works as intended as it is, please disregard this email.

Thank you once again for your great work.
Seiya

Seiya Kawashima

Intermediate Application Programmer | Department of Radiology

The University of Chicago Biological Science
5841 S. Maryland Ave. | Rm. IB-012 | Chicago, IL 60637
Office: 773-834-1791

From:	Seiya Kawashima
Subject:	[bug-libunistring] A bug in u-strtok.h and the fix in libunistring-0.9.5
Date:	Thu, 2 Jul 2015 19:29:34 +0000