From: Gavin Smith
Subject: Re: Displaying characters in user's locale
Date: Sat, 1 Feb 2014 15:21:51 +0000

Thanks for the feedback. I'll work on these suggestions. Some comments -

On Sat, Feb 1, 2014 at 8:11 AM, Eli Zaretskii <address@hidden> wrote:

>> +/* Look for local variables section in FB and set encoding */
>> +static void
>> +set_file_lc_ctype (FILE_BUFFER *fb)
>
> I think this function should return UTF-8 if it doesn't find any
> coding: cookies in the file.  UTF-8 is probably the best default
> nowadays.
>

This depends on what files are out there with no encoding specified.
Do you know how long makeinfo has been outputting an encoding
section? Is it still possible today for makeinfo to output a UTF-8
file with no encoding specified?
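
If we do default to UTF-8, something like this is roughly what I have
in mind - an untested sketch; the function name and the ENCODING_*
values are made up here, standing in for whatever fb->lc_ctype
actually uses:

#include <string.h>
#include <strings.h>

enum { ENCODING_UTF_8, ENCODING_ISO_8859_1 /* , ... */ };

/* Guess the encoding of CONTENTS (SIZE bytes), defaulting to UTF-8
   when no "coding:" cookie is found.  */
static int
guess_file_encoding (const char *contents, size_t size)
{
  /* The local variables section is at the end of the file, so only
     scan the tail.  */
  size_t start = size > 1024 ? size - 1024 : 0;
  const char *p = contents + start, *end = contents + size;

  for (; p + 7 <= end; p++)
    if (!strncasecmp (p, "coding:", 7))
      {
        p += 7;
        while (p < end && (*p == ' ' || *p == '\t'))
          p++;
        if (end - p >= 5 && !strncasecmp (p, "utf-8", 5))
          return ENCODING_UTF_8;
        if (end - p >= 10 && !strncasecmp (p, "iso-8859-1", 10))
          return ENCODING_ISO_8859_1;
        /* ... and so on for the other documented encodings ... */
        break;
      }

  return ENCODING_UTF_8;   /* the suggested default */
}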

>> +static void
>> +convert_characters (FILE_BUFFER *fb)
>> +{
>> +  long node = 0, nextnode;
>> +  SEARCH_BINDING binding;
>> +  char *to_locale;
>> +
>> +  iconv_t iconv_state;
>> +  int iconv_available = 0;
>> +
>> +  void (*degrade_funcs[5])(char **, size_t *,
>> +                           char **, size_t *) = {
>> +    degrade_dummy, degrade_utf8, degrade_dummy,
>> +    degrade_dummy, degrade_dummy };
>
> Why do we need any degrade_* functions except degrade_utf8?  Can you
> tell what possible features can benefit from this?

We rely on iconv to find an exact conversion between characters;
failing that, we look for an ASCII replacement. These will be useful
if the user's locale doesn't support a character in a file. For
example, an Info file could be in ISO-8859-1 and contain a character
that doesn't exist in the user's locale (if they aren't using UTF-8);
we could then degrade "é" to "e" or "e'". This feature doesn't matter
much except for a few characters like directional quotation marks -
mostly it will be characters in people's names that look funny. It
would require a full conversion table for each supported input
encoding.
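
To make that concrete, here is an untested sketch of the table-driven
degradation I mean, for the UTF-8 case only (the table and the
function name are illustrative - a real table would cover many more
characters):

#include <stddef.h>
#include <string.h>

struct degrade_entry { const char *from; const char *to; };

/* A few illustrative replacements.  */
static const struct degrade_entry utf8_table[] = {
  { "\xC3\xA9", "e'" },      /* e with acute accent */
  { "\xE2\x80\x98", "`" },   /* left single quotation mark */
  { "\xE2\x80\x99", "'" },   /* right single quotation mark */
  { NULL, NULL }
};

/* Same interface as the degrade_* function pointers quoted above:
   consume bytes from *INBUF, write ASCII approximations to *OUTBUF.  */
static void
degrade_utf8_sketch (char **inbuf, size_t *inbytes,
                     char **outbuf, size_t *outbytes)
{
  while (*inbytes > 0 && *outbytes > 0)
    {
      const struct degrade_entry *e;

      if ((unsigned char) **inbuf < 128)
        {
          /* Plain ASCII: copy through unchanged.  */
          *(*outbuf)++ = *(*inbuf)++;
          (*inbytes)--; (*outbytes)--;
          continue;
        }
      for (e = utf8_table; e->from; e++)
        {
          size_t flen = strlen (e->from), tlen = strlen (e->to);
          if (flen <= *inbytes && tlen <= *outbytes
              && !memcmp (*inbuf, e->from, flen))
            {
              memcpy (*outbuf, e->to, tlen);
              *inbuf += flen;   *inbytes -= flen;
              *outbuf += tlen;  *outbytes -= tlen;
              break;
            }
        }
      if (!e->from)
        {
          /* Unknown byte: emit '?' and move on.  */
          *(*outbuf)++ = '?';  (*outbytes)--;
          (*inbuf)++;  (*inbytes)--;
        }
    }
}
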
>
>> +  /* Read environment locale */
>> +  to_locale = nl_langinfo(CODESET);
>> +
>> +  /* Don't degrade the contents if we are in fact
>> +   * in the right locale for the file */
>> +  if (!strcasecmp(to_locale, encoding_names[fb->lc_ctype]))
>> +    return;
>> +
>> +  degrade = degrade_funcs [fb->lc_ctype];
>
> One of the disadvantages of those degrade_* functions is that you must
> match each encoding with a function, and there are an awful lot of
> possible encodings out there.
>
I've covered all the encodings listed in the Texinfo manual
(http://www.gnu.org/software/texinfo/manual/texinfo/html_node/_0040documentencoding.html#g_t_0040documentencoding).
Hopefully no more will be added in the future and everyone will use
UTF-8 instead.

>> +  /* Convert sections of the file separated by node separators. These
>> +   * will be preambles, nodes, tag tables, or local variable sections.
>> +   * We convert all of them, although probably only the nodes need to
>> +   * be converted.
>
> I would indeed suggest to convert only the node that is about to be
> displayed.  Some manuals are very large, so converting them in their
> entirety might produce an annoying delay at startup.  Did you try the
> Emacs Lisp manual, for example?
>
Possibly, but this would be a harder change to make, given the way
the info program stores the contents of files. At the moment all the
nodes are stored together in a file buffer, and the offsets of the
starts of nodes within this buffer are recorded. This problem
overlaps with the other alterations I suggested for cleaning up
notation from the buffer, because that also changes the byte length
of nodes - maybe it can be done the same way as there.
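
If we did go per-node (or per-subfile), the conversion would have to
report the converted size so the stored offsets can be adjusted. A
rough, untested sketch of that piece, with a made-up function name
and minimal error handling:

#include <iconv.h>
#include <stdlib.h>

/* Convert LEN bytes at IN using the open conversion descriptor CD.
   Return a malloc'd buffer holding the result and store its length
   in *OUT_LEN; return NULL on failure.  The difference between LEN
   and *OUT_LEN is what the node offsets would need to shift by.  */
static char *
convert_region (iconv_t cd, char *in, size_t len, size_t *out_len)
{
  size_t outsize = len * 4 + 16;       /* crude worst-case guess */
  char *out = malloc (outsize);
  char *inp = in, *outp = out;
  size_t inleft = len, outleft = outsize;

  if (!out)
    return NULL;
  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
    {
      free (out);
      return NULL;
    }
  *out_len = outsize - outleft;
  return out;
}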

Note also that some files are split - each subfile should only be
converted as it is loaded.

I've tried the elisp manual and my alterations don't seem to work
properly - I'll look into this.

>> +  while ((nextnode = find_node_separator (&binding)) != -1
>> +    || (node != fb->filesize && (nextnode = fb->filesize)))
>
> In this loop, I suggest an optimization: only call iconv for portions
> of text that include bytes above 127, unless the file's encoding is
> known to require conversion even in that case (some CJK encodings,
> like the ISO-2022 family, are known to belong to the latter class).  This
> could save you some cycles.
>

No Info file should be in a CJK encoding - see the link above.
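
The pure-ASCII check itself is cheap, though - something like this
(untested) could skip iconv entirely for spans with no bytes above
127:

#include <stddef.h>

/* Return 1 if the LEN bytes at P contain anything outside ASCII.  */
static int
has_high_bytes (const char *p, size_t len)
{
  size_t i;
  for (i = 0; i < len; i++)
    if ((unsigned char) p[i] > 127)
      return 1;
  return 0;
}

Calling this on each span before converting it would avoid
conversions for manuals that are effectively all ASCII.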


