coreutils

From: Assaf Gordon
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Sat, 23 Jul 2016 14:05:25 -0400

> On Jul 23, 2016, at 06:51, Pádraig Brady <address@hidden> wrote:
> I was wondering about the tool being line/record oriented.
> 
> Disadvantages are:
>  requires arbitrary large buffers for arbitrary long lines
>  relatively slow in the presence of short/normal lines
>  sensitive to the current stdio buffering mode
>  requires -z option to support NUL termination
> 
> Processing a block at a time instead avoids these issues.
> UTF-8 at least is self-synchronising, so after reading a block
> you just have to look at the last 3 bytes to know
> how many to append to the start of the next block.

Block-at-a-time processing would work well for detecting/fixing invalid
multibyte sequences, especially in UTF-8.
But I'm not sure about other multibyte encodings (I'll have to investigate).
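The carryover trick Pádraig describes — check at most the last 3 bytes of each block for the start of a truncated UTF-8 sequence, and prepend them to the next block — could be sketched roughly like this (Python rather than C, and the function names here are hypothetical, not coreutils code):

```python
def utf8_incomplete_tail(block: bytes) -> int:
    """Number of trailing bytes (0-3) that start an incomplete UTF-8
    sequence and should be carried over to the next block."""
    # A UTF-8 sequence is at most 4 bytes, so a truncated one leaves
    # at most 3 bytes behind; scan backwards over at most 3 bytes.
    for i in range(1, min(3, len(block)) + 1):
        b = block[-i]
        if b < 0x80:                 # ASCII byte: nothing pending
            return 0
        if b >= 0xC0:                # lead byte: how long should its sequence be?
            need = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            return i if need > i else 0
        # 0x80..0xBF is a continuation byte: keep scanning backwards.
    return 0   # no lead byte within 3 bytes: the tail is invalid, not pending


def split_carry(carry: bytes, block: bytes):
    """Prepend the previous carry, then split off the new incomplete tail."""
    block = carry + block
    n = utf8_incomplete_tail(block)
    return (block[:-n], block[-n:]) if n else (block, b"")
```

With this, feeding e.g. "héllo" in 2-byte blocks never hands a split-up sequence to whatever does the per-block validation; only genuinely invalid bytes remain inside a block.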

However, for Unicode normalization, I am not sure there's a stream interface to
it (gnulib's uninorm takes a whole string to normalize). IIUC, normalization
requires being able to look ahead at some of the following Unicode characters.
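A small illustration of why lookahead is unavoidable (using Python's unicodedata here just to show the behaviour): NFC cannot emit a base character until it has seen any combining marks that follow it, so normalizing a prefix and then appending the rest gives a different result.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT
decomposed = "e\u0301"

# NFC composes the pair into the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00e9"

# Streaming consequence: after reading just the 'e' you cannot output
# anything yet -- the next character may still combine with it.
prefix_then_rest = unicodedata.normalize("NFC", "e") + "\u0301"
assert prefix_then_rest != composed
```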

-assaf



