coreutils

From: Assaf Gordon
Subject: Re: multibyte processing - handling invalid sequences (long)
Date: Sat, 23 Jul 2016 14:05:25 -0400

> On Jul 23, 2016, at 06:51, Pádraig Brady <address@hidden> wrote:
> I was wondering about the tool being line/record oriented.
> 
> Disadvantages are:
>  requires arbitrary large buffers for arbitrary long lines
>  relatively slow in the presence of short/normal lines
>  sensitive to the current stdio buffering mode
>  requires -z option to support NUL termination
> 
> Processing a block at a time instead avoids these issues.
> UTF-8 at least is self-synchronising, so after reading a block
> you just have to look at the last 3 bytes to know
> how many to append to the start of the next block.

Block-at-a-time processing would work well for detecting/fixing invalid
multibyte sequences, especially in UTF-8.
But I'm not sure about other multibyte encodings (I'll have to investigate).
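The carryover trick Pádraig describes — check at most the last 3 bytes of each block for the start of a truncated UTF-8 sequence, and prepend them to the next block — could be sketched roughly like this (Python rather than C, and the function names here are hypothetical, not coreutils code):

```python
def utf8_incomplete_tail(block: bytes) -> int:
    """Number of trailing bytes (0-3) that start an incomplete UTF-8
    sequence and should be carried over to the next block."""
    # A UTF-8 sequence is at most 4 bytes, so a truncated one leaves
    # at most 3 bytes behind; scan backwards over at most 3 bytes.
    for i in range(1, min(3, len(block)) + 1):
        b = block[-i]
        if b < 0x80:                 # ASCII byte: nothing pending
            return 0
        if b >= 0xC0:                # lead byte: how long should its sequence be?
            need = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            return i if need > i else 0
        # 0x80..0xBF is a continuation byte: keep scanning backwards.
    return 0   # no lead byte within 3 bytes: the tail is invalid, not pending


def split_carry(carry: bytes, block: bytes):
    """Prepend the previous carry, then split off the new incomplete tail."""
    block = carry + block
    n = utf8_incomplete_tail(block)
    return (block[:-n], block[-n:]) if n else (block, b"")
```

With this, feeding e.g. "héllo" in 2-byte blocks never hands a split-up sequence to whatever does the per-block validation; only genuinely invalid bytes remain inside a block.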

However, for Unicode normalization, I am not sure there's a stream interface to
it (gnulib's uninorm takes a whole string to normalize). IIUC, normalization
requires being able to look ahead at some of the following Unicode characters.
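A small illustration of why lookahead is unavoidable (using Python's unicodedata here just to show the behaviour): NFC cannot emit a base character until it has seen any combining marks that follow it, so normalizing a prefix and then appending the rest gives a different result.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT
decomposed = "e\u0301"

# NFC composes the pair into the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00e9"

# Streaming consequence: after reading just the 'e' you cannot output
# anything yet -- the next character may still combine with it.
prefix_then_rest = unicodedata.normalize("NFC", "e") + "\u0301"
assert prefix_then_rest != composed
```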

-assaf



