pspp-dev

Re: i18n


From: John Darrington
Subject: Re: i18n
Date: Sun, 19 Mar 2006 20:41:06 +0800
User-agent: Mutt/1.5.4i

On Sat, Mar 18, 2006 at 05:38:17PM -0800, Ben Pfaff wrote:
     
     >    2b might be achieved by heuristics, using a library such as unac
     >    http://home.gna.org/unac/unac.en.html or if all else fails, replace
     >    unknown byte sequences by "...."
     
     I assumed that we'd just use the iconv library (which is
     standardized) to convert between character encodings.
     
     I don't know about the unac library.  What are its advantages
     over iconv?

Iconv is only useful if we know the source encoding.  If we don't know
it, we have to guess, and if we guess wrong, iconv will fail.  Also,
it won't convert between encodings where data would be lost.  Unac, on
the other hand, is more robust, but lossy.  For example, given
character 0xe1 (acute a) in iso-8859-1, it'll convert it to 'a' in
ascii.  I don't know how it would handle converting from Japanese
characters to ascii, though.
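A rough Python sketch of the two behaviours (using unicodedata to
stand in for unac's accent stripping; this is not the actual unac
API, just an illustration of strict vs. lossy conversion):

```python
import unicodedata

def unac_like(s: str) -> str:
    """Approximate unac's behaviour: decompose, then drop accent marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

ch = b"\xe1".decode("iso-8859-1")   # 0xe1 is acute a in iso-8859-1
try:
    ch.encode("ascii")              # strict conversion, as iconv does by default
    strict_ok = True
except UnicodeEncodeError:          # data would be lost, so it refuses
    strict_ok = False

assert strict_ok is False
assert unac_like(ch) == "a"         # the lossy route maps acute a to plain 'a'
```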
     
     Of course it's sensible to keep everything in a common encoding
     (at least within a dictionary).  But I don't think it's a good
     idea to insist that this encoding be UTF-8 (or any other specific
     encoding).  Instead, I would suggest that we use the local
     encoding (the one from LC_CTYPE or from SET LOCALE) and convert
     everything else we encounter into that.

It's just that utf-8 can encompass just about every other encoding.
If we try to convert from, say, Korean script into the local encoding
(say ascii), then we're not going to do a very good job.
     
     >    Whilst that's feasible, casefiles cannot possibly (in the
     >    current system) have this invariant, because the system files which
     >    implement them may not in fact be utf8 and converting a casefile
     >    doesn't scale.
     
     You mean, to convert all the string data in a casefile to a
     common encoding?  I think that's a bad idea for other reasons
     too.  First, we don't know that all the string data in the
     casefile is actually alphanumeric.  It could just be binary bits;
     SPSS provides expression operators that can extract and pack
     data from strings, even though they're not all that convenient.
     Second, conversions between encodings can lengthen or shorten
     them, whereas string variables are fixed length.
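The length problem is easy to demonstrate (a quick Python sketch;
PSPP itself would of course be doing this in C with iconv):

```python
raw = b"\xe1"                                  # acute a: one byte in iso-8859-1
utf8 = raw.decode("iso-8859-1").encode("utf-8")
assert utf8 == b"\xc3\xa1"                     # two bytes in utf-8
assert len(raw) == 1 and len(utf8) == 2
# A fixed-width A1 string field holding 0xe1 cannot hold its utf-8 form.
```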

So we agree then that casefile data must not be meddled with.
However, this also means that both a) the keys in Value Labels and
b) the Missing Values must be left verbatim.  Otherwise, they'll no
longer match.  And this has the rather unfortunate consequence that
the dictionary cannot be guaranteed to have a consistent encoding.
Hence my suggestion of a per-variable encoding attribute.
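Something along these lines (a hypothetical Python sketch of the idea;
all names here are made up, and the real struct variable is C):

```python
from dataclasses import dataclass, field

@dataclass
class Variable:
    name: str
    encoding: str        # the proposed per-variable encoding attribute
    value_labels: dict = field(default_factory=dict)  # raw bytes -> raw bytes

def display_label(var: Variable, key: bytes) -> str:
    """Decode only at the UI boundary; the stored bytes stay verbatim."""
    return var.value_labels[key].decode(var.encoding)

v = Variable("region", "iso-8859-1")
v.value_labels[b"Espa\xf1a"] = b"Espa\xf1a central"  # key matches case data byte-for-byte
assert display_label(v, b"Espa\xf1a") == "Espa\u00f1a central"
```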


     >    An alternative, would be to decide that it is the responsibility of
     >    the user interface and output subsystem to convert to utf8.  In
     >    which case, both these entities need to know the encoding of the
     >    data they receive.  Since, (as in the case of MATCH FILES)
     >    variables can come from different system sources, each variable
     >    within a dictionary may have a different encoding.   Thus it may be
     >    desirable to add an encoding property to struct variable.
     
     I think that I disagree (but I may not quite understand what
     you're saying).  I would think that the encoding would be a
     property of the dictionary.  When we do something like MATCH
     FILES that reads from multiple sources, we convert from the
     encoding used by each source dictionary to the one used by the
     target dictionary.  We'd assume that the source dictionaries and
     the target dictionary are in the local encoding unless told
     otherwise.
     
     As for converting the case data in string variables in the
     various source files to a common encoding, I doubt we'd want to
     try to do that automatically because there's no way to tell that
     they even have character data in them.  

Is it not the case that all variables with (Aw) format are intended to
contain character data?  I thought that bit patterns, blobs and the
like were supposed to use (AHEXw).

     Instead, I'd suggest
     adding some way to convert character data in the active file from
     one encoding to another.  (I can think of several possible
     syntaxes: a new feature for RECODE, or a function for use with
     COMPUTE, or adding a new command altogether.)

So you're suggesting converting only when explicitly requested by the
user.
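In effect, the helper behind such a RECODE/COMPUTE feature would be
little more than this (a Python sketch with a made-up name; none of
this syntax or naming is decided):

```python
def recode_encoding(data: bytes, src: str, dst: str) -> bytes:
    """Hypothetical helper behind an explicit, user-requested conversion.
    Strict: raises rather than silently losing data."""
    return data.decode(src).encode(dst)

# iso-8859-1 0xdf (sharp s) becomes the two-byte utf-8 sequence 0xc3 0x9f.
assert recode_encoding(b"Stra\xdfe", "iso-8859-1", "utf-8") == b"Stra\xc3\x9fe"
```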

     As for the UI, I guess we'd want to convert from dictionary
     encoding to display encoding at the UI interface.

I'm thinking that too.
     
     > 4. However, when writing a system file, it would be sensible to
     >    convert all variables to a common encoding first.
     
     The way I have been thinking about it, this would simply be a
     consequence of having just one encoding within a dictionary.

If we can have the dictionary in one common encoding (and for reasons
above, I'm not sure that we can) this is fine.  But that still leaves
the case data.  I think it'll open up a real can of worms to have a
variable whose name and value labels are in one encoding, but the data
corresponding to that variable in another.

Consider the scenario where I'm conducting a global survey.  I
have representatives in various parts of the world who collate
information in their region and then each send me a system file.  The
system files I receive have identical variables, but come in encodings
appropriate to that locale (they might include personal names which
cannot be written in ascii).  I then want to combine all these system
files into one big system file before analysing it.  The only way I
can do this without data loss is to use a universal encoding (such as
utf-8). 
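A sketch of that combining step (Python, with made-up example data;
the point is only that utf-8 can hold all of it losslessly):

```python
# Each regional system file arrives in its own local encoding.
sources = [
    (b"M\xfcller", "iso-8859-1"),     # "Mueller" with u-umlaut, Western European office
    (b"\xc8\xe2\xe0\xed", "cp1251"),  # "Ivan" in Cyrillic, Russian office
]

# Converting either name into the other's encoding (or into ascii)
# would fail or lose data, but utf-8 accommodates both.
combined = [raw.decode(enc).encode("utf-8") for raw, enc in sources]
assert combined == ["M\u00fcller".encode("utf-8"),
                    "\u0418\u0432\u0430\u043d".encode("utf-8")]
```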


In summary I think the logic of my argument goes like this:

1.  Case Data must not be changed (unless explicitly requested by the
    user).
 

2.  Missing Value and Value Label keys must have the same encoding as
    the data to which they refer.


3.  1 ^ 2 --> Missing Values and Value Label keys must never change
    encodings. 


4.  Casefiles from different sources may come with arbitrary and
    distinct encodings and may need to be combined into a common
    casefile.  Further, every casefile has a corresponding dictionary.


5.  1 ^ 2 ^ 4 --> Missing Value and Value Label keys in the same
    dictionary must, in general, be allowed to have different
    encodings.


Hope this makes sense.

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.



