Re: [bug-libunistring] observations on the manual
From: Bruno Haible
Subject: Re: [bug-libunistring] observations on the manual
Date: Wed, 29 Apr 2009 02:54:41 +0200
User-agent: KMail/1.9.9
Hello Paolo,
Thank you for the feedback!
> wwww
Fixed, thanks.
> > or --- if @code{wchar_t *}
>
> I'm not sure if space before/after --- are good.
I find these spaces more aesthetic than no spaces. Maybe I'm influenced by
typesetting in French and Russian books.
> > These functions are locale dependent. The @var{iso639_language} argument
> > identifies the language (e.g. @code{"tr"} for Turkish). NULL means to use
> > locale independent case mappings.
>
> Is it possible to pass just a POSIX locale name like tr_TR.UTF-8
> directly as @var{iso639_language}, with everything but the language part
> discarded?
No, such preprocessing ("tr_TR.UTF-8" -> "tr") needs to be done before
calling the u*_tolower etc. functions. It's because this preprocessing
most often needs to be done only once, for many u*_tolower calls.
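A minimal sketch of that preprocessing step, in plain C. The helper name
locale_to_language is hypothetical and not part of libunistring; it only
assumes that the language part of a POSIX locale name ends at the first
'_', '.', or '@'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libunistring): copies the language part
   of a POSIX locale name such as "tr_TR.UTF-8" into OUT.
   Returns OUT, or NULL if the language part is empty or does not fit.  */
const char *
locale_to_language (const char *locale_name, char *out, size_t out_size)
{
  assert (locale_name != NULL && out != NULL);
  /* The language part ends at the first '_', '.', or '@'.  */
  size_t len = strcspn (locale_name, "_.@");
  if (len == 0 || len >= out_size)
    return NULL;
  memcpy (out, locale_name, len);
  out[len] = '\0';
  return out;
}
```

The caller would do this once and then pass the resulting string as the
iso639_language argument to as many u*_tolower etc. calls as needed. Note
that a name like "C" or "POSIX" passes through unchanged; it is not a
language code.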
> Also, maybe you could add a special constant for the current locale
> name, like ((void *)1), or even make that the default
The function uc_locale_language() exists precisely for this purpose.
You are supposed to call it once. This way the u*_tolower etc. functions
are faster, because they don't have to look up the current locale over and
over again.
> and specify "" for locale-independent case mappings?
For the locale independent mappings, you can use NULL, or "", or any
other invalid language name.
> It seems to me that there is a limitation, in that you cannot turn to
> lowercase/uppercase/titlecase parts of a string; for that you have to
> use uc_toupper/lower/title and forget about the locale-specific mappings.
Good point. A function for lowercasing part of a string would be useful.
I'll add it.
> However, in many cases the context is available. For example, if I
> modified sed to use u8_tolower, this:
>
> s/[Α-Ωα-ω]/\L&/g
>
> should have the same effect as doing the conversion on the entire string
> (maybe more slowly).
Well, I cannot really speak about 'sed'; but that sed command appears to
request character-by-character processing. I don't know of a sed command
that would allow applying an operation to an entire substring of the current
line _without_ doing it character by character.
> I have not thought about the API so far, but it
> seems to me that only the following character is needed, which makes it
> noticeably easier.
No, these functions have an arbitrarily long lookahead and an arbitrarily
long "look backwards". They don't need to look across lines, though.
> > @code{memcmp2}
>
> This function is provided by gnulib and should be defined somewhere in
> the documentation. It is also mentioned in unistr.texi.
Oops. I have to write "the gnulib function memcmp2".
> > Converts the string @var{s} of length @var{n} to a string in locale
> > encoding,
>
> The output of xfrm functions is not guaranteed to be in locale encoding.
> In fact, it is just a sequence of bytes that represent the
> locale-specific collation rules.
Oops, right you are. I'm correcting this to say "a NUL-terminated byte
sequence". Thanks.
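To illustrate the point with the standard C analogue (strxfrm, not the
libunistring functions themselves): the transformed key is only meant to be
compared bytewise with strcmp, not displayed or interpreted as text. The
helper names below are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns a freshly allocated strxfrm() key for S, or NULL if memory
   allocation fails.  The key is a NUL-terminated byte sequence whose only
   defined use is comparison with strcmp().  The caller frees it.  */
char *
collation_key (const char *s)
{
  size_t needed = strxfrm (NULL, s, 0) + 1;
  char *key = malloc (needed);
  if (key != NULL)
    strxfrm (key, s, needed);
  return key;
}

/* Compares A and B through their keys.  The sign of the result agrees
   with strcoll (A, B) in the current locale.  */
int
compare_by_key (const char *a, const char *b)
{
  char *ka = collation_key (a);
  char *kb = collation_key (b);
  assert (ka != NULL && kb != NULL);
  int result = strcmp (ka, kb);
  free (ka);
  free (kb);
  return result;
}
```

Computing the key once per string pays off when the same strings are
compared many times, for example while sorting.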
> I noticed that there are no functions accepting NULL-terminated strings.
> Is this by design, or in the future they could be introduced (either as
> u8_strtoupper, or for example with something like a -1 value for the
> length)?
This is more or less by design, so as not to expose too many functions that
are not very different but blow up the documentation. The user who
passes a string S can easily write 'strlen (S) + 1'. It's so much of an
idiom that it's not hard for the user to remember.
Regarding a length of -1, that's a convention that is used in BASIC or
in ICU. I find it a horrible convention:
  1) Simple errors in arithmetic will cause a program to do something
     totally different than what the programmer intended, and *not*
     report the mistake through an exception or a core dump.
  2) Additionally, all calls to the function will pay a price (in form
     of a conditional jump) for the lazy programmer who cannot write
     'strlen (S) + 1' by himself. No, if a programmer is lazy,
     he should use a scripting language, not C.
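A small sketch of the explicit-length style in plain C. The function
count_upper is hypothetical and only stands in for any function that takes
a pointer and a length; it is not a libunistring function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical function in the explicit-length style: counts the ASCII
   uppercase letters among the N bytes starting at S.  N is always a real
   length, so there is no sentinel value and no branch to test for one.  */
size_t
count_upper (const char *s, size_t n)
{
  assert (s != NULL || n == 0);
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if (s[i] >= 'A' && s[i] <= 'Z')
      count++;
  return count;
}

/* For a NUL-terminated string the caller computes the length itself.
   strlen (s) + 1 includes the terminating NUL in the processed range.  */
size_t
count_upper_cstring (const char *s)
{
  return count_upper (s, strlen (s) + 1);
}
```

With this style, a length that comes out wrong through an arithmetic slip
stays a wrong length; it never silently switches the function into a
different "scan until NUL" mode.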
> > @deftypefun {uint8_t *} u8_cpy_alloc (const uint8_t *@var{s}, size_t
> > @var{n})
>
> Why not u8_dup?
Indeed, that would make a better analogy with u8_strdup. But OTOH, the
C function dup() does something entirely different... OpenBSD has a
memdup() function. But if I used the name u8_memdup, I would also have to
add a 'mem' infix to many other functions, from 'u8_memset' to
'u8_memnormalize'.
> In uniwidth.texi:
>
> > These functions are locale dependent. The @var{encoding} argument
> > identifies
> > the encoding (e.g.@: @code{"ISO-8859-2"} for Polish).
>
> The manual does not explain why an encoding is required rather than a
> language. I found this comment in the code:
>
> /* In ancient CJK encodings, Cyrillic and most other characters are
> double-width as well. */
This is a tricky issue. The current API is probably not perfect, but is
designed to minimize differences between UTF-8 in Europe and UTF-8 in
Japan.
> I believe it should be possible to make the encoding argument optional
> (NULL = assume not in ancient CJK encodings).
If you pass "UTF-8", it will assume that ancient CJK encodings are not in
use.
> In unistdio.texi:
>
> > The following functions take an ASCII format string and produce output in
> > locale encoding to a @code{FILE} stream.
>
> I think these should be moved up with the other ulc_* functions, like:
>
> "The following functions take an ASCII format string and produce output
> in locale encoding---either returning it a @code{char *} string or
> emitting it to a @code{FILE} stream".
Well, I find the difference in the destination (string vs. FILE *) to be
more important than the difference in encoding. Therefore the FILE * functions
come at the end.
> Finally, I think that you should put somewhere information about the
> intended ABI/API stability of libunistring (e.g. will be changed
> incompatibly until 1.0).
The fact that the tarball is on ftp.gnu.org, not alpha.gnu.org, gives
some hint already.
Bruno