bug-libunistring

Re: [bug-libunistring] observations on the manual


From: Bruno Haible
Subject: Re: [bug-libunistring] observations on the manual
Date: Wed, 29 Apr 2009 02:54:41 +0200
User-agent: KMail/1.9.9

Hello Paolo,

Thank you for the feedback!

> wwww

Fixed, thanks.

> > or --- if @code{wchar_t *}
> 
> I'm not sure if space before/after --- are good.

I find these spaces more aesthetic than no spaces. Maybe I'm influenced by
typesetting in French and Russian books.

> > These functions are locale dependent.  The @var{iso639_language} argument
> > identifies the language (e.g. @code{"tr"} for Turkish).  NULL means to use
> > locale independent case mappings.
> 
> Is it possible to pass just a POSIX locale name like tr_TR.UTF-8
> directly as @var{iso639_language}, with everything but the language part
> discarded?

No, such preprocessing ("tr_TR.UTF-8" -> "tr") needs to be done before
calling the u*_tolower etc. functions. It's because this preprocessing
most often needs to be done only once, for many u*_tolower calls.
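
A minimal sketch of that one-time preprocessing, assuming a POSIX-style locale name of the form language[_territory][.codeset][@modifier]. The helper name locale_name_to_language is hypothetical and not part of libunistring; its result would be computed once and then passed as the iso639_language argument to each later u*_tolower call.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libunistring.
   Copies the language part of a POSIX locale name such as "tr_TR.UTF-8"
   into buf (of size bufsize) and returns buf.
   Returns NULL if the language part is empty or does not fit in buf.  */
static const char *
locale_name_to_language (const char *locale_name, char *buf, size_t bufsize)
{
  /* The language part ends at the first '_', '.' or '@'.  */
  size_t len = strcspn (locale_name, "_.@");

  if (len == 0 || len >= bufsize)
    return NULL;
  memcpy (buf, locale_name, len);
  buf[len] = '\0';
  return buf;
}
```

Since NULL requests the locale independent mappings, a NULL result from this helper can be passed on unchanged.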

> Also, maybe you could add a special constant for the current locale
> name, like ((void *)1), or even make that the default

The function uc_locale_language() exists precisely for this purpose.
You are supposed to call it once and reuse its result. This way the
u*_tolower etc. functions are faster, because they do not have to look up
the current locale over and over again.

> and specify "" for locale-independent case mappings?

For the locale independent mappings, you can use NULL, or "", or any
other invalid language name.

> It seems to me that there is a limitation, in that you cannot turn to
> lowercase/uppercase/titlecase parts of a string; for that you have to
> use uc_toupper/lower/title and forget about the locale-specific mappings.

Good point. A function for lowercasing part of a string would be useful.
I'll add it.

> However, in many cases the context is available.  For example, if I
> modified sed to use u8_tolower, this:
> 
>   s/[Α-Ωα-ω]/\L&/g
> 
> should have the same effect as doing the conversion on the entire string
> (maybe more slowly).

Well, I cannot really speak about 'sed'; but that sed command appears to
request character-by-character processing. I don't know of a sed command
that would allow applying an operation to an entire substring of the current
line _without_ doing it character by character.

> I have not thought about the API so far, but it 
> seems to me that only the following character is needed, which makes it
> noticeably easier.

No, these functions have an arbitrarily long lookahead and an arbitrarily
long "look backwards". They don't need to look across lines, though.
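
One well-known case of such context dependence is the Greek final sigma rule from Unicode's special casing. The following is only a sketch of that rule on an array of code points, not the libunistring implementation. The predicates is_cased and is_case_ignorable are crude stand-ins for the real Unicode properties, and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the Unicode properties "Cased" and
   "Case_Ignorable"; the real properties cover far more characters.  */
static bool
is_cased (uint32_t uc)
{
  return (uc >= 'A' && uc <= 'Z') || (uc >= 'a' && uc <= 'z')
         || (uc >= 0x0391 && uc <= 0x03C9);   /* Greek letters, roughly */
}

static bool
is_case_ignorable (uint32_t uc)
{
  /* Apostrophe and the combining diacritical marks block only.  */
  return uc == 0x0027 || (uc >= 0x0300 && uc <= 0x036F);
}

/* Lowercase mapping of GREEK CAPITAL LETTER SIGMA (U+03A3) at s[i].
   Returns U+03C2 (final sigma) if the sigma is preceded by a cased letter
   and not followed by one, skipping any number of case-ignorable
   characters in both directions; otherwise returns U+03C3.  */
static uint32_t
lower_sigma_at (const uint32_t *s, size_t n, size_t i)
{
  size_t j;
  bool cased_before = false;
  bool cased_after = false;

  /* Look backwards past an unbounded run of case-ignorable characters.  */
  for (j = i; j > 0; j--)
    if (!is_case_ignorable (s[j - 1]))
      {
        cased_before = is_cased (s[j - 1]);
        break;
      }

  /* Look ahead, again past an unbounded run.  */
  for (j = i + 1; j < n; j++)
    if (!is_case_ignorable (s[j]))
      {
        cased_after = is_cased (s[j]);
        break;
      }

  return (cased_before && !cased_after) ? 0x03C2 : 0x03C3;
}
```

Neither loop has a fixed bound, so looking only at the character immediately following the sigma would not be enough.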

> > @code{memcmp2}
> 
> This function is provided by gnulib and should be defined somewhere in
> the documentation.  It is also mentioned in unistr.texi.

Oops. I have to write "the gnulib function memcmp2".

> > Converts the string @var{s} of length @var{n} to a string in locale encoding,
> 
> The output of xfrm functions is not guaranteed to be in locale encoding.
>  In fact, it is just a sequence of bytes that represent the
> locale-specific collation rules.

Oops, right you are. I'm correcting this to say "a NUL-terminated byte
sequence". Thanks.

> I noticed that there are no functions accepting NUL-terminated strings.
>  Is this by design, or in the future they could be introduced (either as
> u8_strtoupper, or for example with something like a -1 value for the
> length)?

It is more or less by design, so as not to expose too many functions that
are not very different but blow up the documentation. The user who
passes a string S can easily write 'strlen (S) + 1'. It's so much of an
idiom that it's not hard for the user to remember.

Regarding a length of -1, that's a convention that is used in BASIC or
in ICU. I find it a horrible convention:
  1) Simple errors in arithmetic will cause a program to do something
     totally different than what the programmer intended, and *not*
     report the mistake through an exception or a core dump.
  2) Additionally, all calls to the function will pay a price (in the form
     of a conditional jump) for the lazy programmer who cannot write
     'strlen (S) + 1' by himself. No, if a programmer is lazy,
     he should use a scripting language, not C.
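
Both points can be illustrated with a small sketch. All three functions below are hypothetical and exist only for this illustration; they are not libunistring or ICU functions. Each "API" merely reports how many bytes it would process.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical API with the "-1 means NUL-terminated" convention.
   Returns the number of bytes it would process.  */
static size_t
bytes_processed_sentinel (const char *s, ptrdiff_t n)
{
  if (n < 0)                      /* conditional jump paid on every call */
    return strlen (s) + 1;
  return (size_t) n;
}

/* Hypothetical API that takes an explicit length only.  */
static size_t
bytes_processed_explicit (const char *s, size_t n)
{
  (void) s;                       /* the length alone decides */
  return n;
}

/* An off-by-one slip: the caller meant end - start.  */
static ptrdiff_t
span_length_buggy (size_t start, size_t end)
{
  return (ptrdiff_t) end - (ptrdiff_t) start - 1;
}
```

With an empty span, span_length_buggy (0, 0) yields -1, and bytes_processed_sentinel ("hello", -1) quietly processes all 6 bytes instead of none. The caller of the explicit variant writes bytes_processed_explicit (S, strlen (S) + 1) and pays for strlen only where it is actually needed.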

> > @deftypefun {uint8_t *} u8_cpy_alloc (const uint8_t *@var{s}, size_t @var{n})
> 
> Why not u8_dup?

Indeed, that would make a better analogy with u8_strdup. But OTOH, the
C function dup() does something entirely different... OpenBSD has a
memdup() function. But if I used the name u8_memdup, I would also have to
add a 'mem' infix to many other functions, from 'u8_memset' to 
'u8_memnormalize'.
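
For readers who do not know the function under discussion: it makes a freshly allocated copy of the n units at s, much like memdup() does for bytes. A rough sketch of that behavior follows. The name sketch_u8_cpy_alloc is hypothetical, and this is not the libunistring implementation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in, not the libunistring implementation.
   Returns a freshly malloc'ed copy of the n units at s, or NULL on
   allocation failure.  The caller frees the result.  */
static uint8_t *
sketch_u8_cpy_alloc (const uint8_t *s, size_t n)
{
  /* malloc (0) may return NULL, so allocate at least one byte.  */
  uint8_t *dest = malloc (n > 0 ? n : 1);

  if (dest != NULL && n > 0)
    memcpy (dest, s, n);
  return dest;
}
```

Unlike u8_strdup, such a function takes an explicit length, so the copy includes a terminating NUL only if the caller counts it in n.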

> In uniwidth.texi:
> 
> > These functions are locale dependent.  The @var{encoding} argument identifies
> > the encoding (e.g. @code{"ISO-8859-2"} for Polish).
> 
> The manual does not explain why an encoding is required rather than a
> language. I found this comment in the code: 
> 
>   /* In ancient CJK encodings, Cyrillic and most other characters are
>      double-width as well.  */

This is a tricky issue. The current API is probably not perfect, but is
designed to minimize differences between UTF-8 in Europe and UTF-8 in
Japan.

> I believe it should be possible to make the encoding argument optional
> (NULL = assume not in ancient CJK encodings).

If you pass "UTF-8", it will assume that ancient CJK encodings are not in
use.

> In unistdio.texi:
> 
> > The following functions take an ASCII format string and produce output in
> > locale encoding to a @code{FILE} stream.
> 
> I think these should be moved up with the other ulc_* functions, like:
> 
> "The following functions take an ASCII format string and produce output
> in locale encoding---either returning it as a @code{char *} string or
> emitting it to a @code{FILE} stream".

Well, I find the difference in the destination (string vs. FILE *) to be
more important than the difference in encoding. Therefore the FILE * functions
come at the end.

> Finally, I think that you should put somewhere information about the
> intended ABI/API stability of libunistring (e.g. will be changed
> incompatibly until 1.0).

The fact that the tarball is on ftp.gnu.org, not alpha.gnu.org, gives
some hint already.

Bruno



