pspp-dev

Re: i18n


From: Ben Pfaff
Subject: Re: i18n
Date: Sat, 18 Mar 2006 17:38:17 -0800
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)

John Darrington <address@hidden> writes:

> 0. Data strings that need internationalisation include: 
>
>    * String Variable Data.
>    * Variable Names.
>    * Value Labels.
>    * Variable Labels.
>    * File Labels.
>    * Document Text.

OK.

> 1. If the system file format had been properly defined, it would
>    have stored the encoding used for its strings somewhere in the
>    file.   The fact of the matter is that it doesn't.

Yes.

> 2. Therefore, we have to a) make a reasonable guess as to what a
>    system file's encoding is; and  b) ensure that reasonable behaviour
>    ensues if that assumption is incorrect.  We have to bear in mind
>    that PSPP can deal with more than one system file at the same time
>    eg: through the MATCH FILES command, and these could have been
>    written in different encodings.
>
>    2a might be achieved by i) using the LC_CTYPE environment variable;
>    ii) using the value set by SET LOCALE; or iii) we could introduce
>    an optional subcommand to the GET command to specify the locale.

I assume you mean that we should do *all* of these.  That's what
I've been thinking.

>    2b might be achieved by heuristics, using a library such as unac
>    http://home.gna.org/unac/unac.en.html or, if all else fails, by
>    replacing unknown byte sequences with "...."

I assumed that we'd just use the iconv library (which is
standardized) to convert between character encodings.

I don't know about the unac library.  What are its advantages
over iconv?

> 3. At some level within PSPP we need to decide on an interface where
>    all strings will have a common encoding.  For instance, one
>    possibility would be to decide that all strings contained within
>    the dictionary would be utf8.  In this case, we'd need to convert
>    all string data to utf8 within the struct variable (except short_name).

Of course it's sensible to keep everything in a common encoding
(at least within a dictionary).  But I don't think it's a good
idea to insist that this encoding be UTF-8 (or any other specific
encoding).  Instead, I would suggest that we use the local
encoding (the one from LC_CTYPE or from SET LOCALE) and convert
everything else we encounter into that.

>    Whilst that's feasible, casefiles cannot possibly (in the
>    current system) have this invariant, because the system files which
>    implement them may not in fact be utf8 and converting a casefile
>    doesn't scale.

You mean, to convert all the string data in a casefile to a
common encoding?  I think that's a bad idea for other reasons
too.  First, we don't know that all the string data in the
casefile is actually alphanumeric.  It could just be binary bits;
SPSS provides expression operators that can extract and pack
data from strings, even though they're not all that convenient.
Second, conversions between encodings can lengthen or shorten
strings, whereas string variables are fixed-length.

>    An alternative, would be to decide that it is the responsibility of
>    the user interface and output subsystem to convert to utf8.  In
>    which case, both these entities need to know the encoding of the
>    data they receive.  Since (as in the case of MATCH FILES)
>    variables can come from different system sources, each variable
>    within a dictionary may have a different encoding.  Thus it may be
>    desirable to add an encoding property to struct variable.

I think that I disagree (but I may not quite understand what
you're saying).  I would think that the encoding would be a
property of the dictionary.  When we do something like MATCH
FILES that reads from multiple sources, we convert from the
encoding used by each source dictionary to the one used by the
target dictionary.  We'd assume that the source dictionaries and
the target dictionary are in the local encoding unless told
otherwise.

As for converting the case data in string variables in the
various source files to a common encoding, I doubt we'd want to
try to do that automatically because there's no way to tell that
they even have character data in them.  Instead, I'd suggest
adding some way to convert character data in the active file from
one encoding to another.  (I can think of several possible
syntaxes: a new feature for RECODE, or a function for use with
COMPUTE, or adding a new command altogether.)

As for the UI, I guess we'd want to convert from dictionary
encoding to display encoding at the UI interface.

> 4. However, when writing a system file, it would be sensible to
>    convert all variables to a common encoding first.

The way I have been thinking about it, this would simply be a
consequence of having just one encoding within a dictionary.
-- 
"The sound of peacocks being shredded can't possibly be
 any worse than the sound of peacocks not being shredded."
Tanuki the Raccoon-dog in the Monastery



