[coreutils] [PATCH] join: support multi-byte character encodings

From:	Pádraig Brady
Subject:	[coreutils] [PATCH] join: support multi-byte character encodings
Date:	Mon, 13 Sep 2010 16:37:35 +0100
User-agent:	Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3

This is my start at applying robust and efficient multi-byte
processing to coreutils. I'm not about to merge this, but
would _really_ appreciate a review. A good place to start with the
new functionality is the included test, which tries to cover all
the new code paths.  BTW, I picked `join` as the first util to
update as it does field matching, collation, delimiter searching,
and subset field output, and so is a good test of the support
routines in gnulib and libunistring.

I took the approach of changing the existing logic as little as possible,
so that when the changes mature we can look at refactoring further,
for example by merging line and field processing across various utils.

One consequence of linking with libunistring (which is now in Debian and Fedora)
is that printf no longer uses the static u8_uctomb_aux() (since lib/unistr.h
is not created when /usr/include/unistr.h is present). While this reduces the
size of printf a little, just dynamically linking to the extra libunistring lib
incurs a 16% startup overhead, which is bad for a frequently called,
short-lived util like printf. I tested the difference with:
  time seq 1000 | xargs -n1 ./printf
Ideas for working around that overhead are appreciated.
It's worth noting also that many of the multi-byte routines remain in gnulib,
and thus increase the size of the binary. This is not seen as an issue,
but the size increase of a couple of routines is noted in the patch.

On a related note, I noticed that bootstrap doesn't update ./configure
correctly when switching between libunistring enabled and not. When
switching back to a non libunistring branch, I have to do the following:
 rm m4/libunistring*; ./bootstrap; ...

On a more general note, I was wondering about unconditionally converting
input to UTF-8 before processing. I haven't done this yet because:
  Currently we support invalid input chars by processing as in the C locale.
  I was unsure how to generally correlate portions of a UTF-8 string with its
  corresponding source string, and wanted to transform the input as little
  as possible.
I may revisit the idea of processing using UTF-8 internally if we support
normalization internally to the utils. I got a bit hung up on the details
and edge cases with this, so I've not included it at this time.
More on this later...
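
As an illustration of the "process invalid input as in the C locale" behaviour
mentioned above, here is a minimal sketch using the standard mbrtowc()
interface. The helper names char_len() and count_chars() are hypothetical and
are not taken from the patch; the patch itself may well use gnulib or
libunistring routines instead. The assumption is simply that a byte which does
not start a valid character in the current locale is treated as a
one-byte character.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not from the patch: return the byte length of the
   character starting at S, where N > 0 bytes remain.  Invalid or
   truncated sequences count as a single byte, which matches what
   byte-at-a-time processing in the C locale would do.  */
static size_t
char_len (const char *s, size_t n, mbstate_t *state)
{
  size_t len = mbrtowc (NULL, s, n, state);
  if (len == (size_t) -1 || len == (size_t) -2)
    {
      /* The conversion state is unusable after an error, so reset it
         and consume exactly one byte.  */
      memset (state, 0, sizeof *state);
      return 1;
    }
  return len == 0 ? 1 : len;    /* a NUL character occupies one byte */
}

/* Hypothetical helper: count the characters in the N bytes at BUF under
   the current locale, never failing on invalid input.  */
static size_t
count_chars (const char *buf, size_t n)
{
  mbstate_t state;
  size_t count = 0;

  memset (&state, 0, sizeof state);
  while (n > 0)
    {
      size_t len = char_len (buf, n, &state);
      buf += len;
      n -= len;
      count++;
    }
  return count;
}
```

Because the input bytes are never rewritten, offsets into the buffer remain
valid for output, which sidesteps the problem of correlating a converted
string with its source.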

I do try to special-case UTF-8 where beneficial, as
it can be quite a bit more efficient to process, and it's also very common:
http://www.w3.org/QA/2008/05/utf8-web-growth.html
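
One reason UTF-8 can be cheaper to process is that it is self-synchronizing:
the encoding of one character never appears inside the encoding of another, so
a valid UTF-8 delimiter can be located in valid UTF-8 text with a plain byte
search, without decoding each character. The sketch below illustrates that
property only; find_delim_utf8() is a hypothetical name, it is not claimed to
be how the patch does it, and it assumes the caller has already established
that the locale is UTF-8 and that both arguments are valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the patch: return a pointer to the first
   occurrence of the DLEN-byte delimiter DELIM in the LEN bytes at LINE,
   or NULL if there is none.  Assumes both are valid UTF-8, so a byte
   match can only begin on a character boundary.  */
static const char *
find_delim_utf8 (const char *line, size_t len,
                 const char *delim, size_t dlen)
{
  if (dlen == 0 || len < dlen)
    return NULL;

  /* One past the last position at which a match could start.  */
  const char *end = line + len - dlen + 1;

  for (const char *p = line; p < end; p++)
    {
      /* Jump to the next candidate lead byte.  */
      p = memchr (p, delim[0], (size_t) (end - p));
      if (!p)
        return NULL;
      if (memcmp (p, delim, dlen) == 0)
        return p;
    }
  return NULL;
}
```

In other multi-byte encodings a trailing byte of one character can coincide
with the first byte of another, so the same search would need to step through
the line character by character.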

cheers,
Pádraig.

join-i18n.diff
Description: Text Data
