pspp-dev

Re: i18n


From: John Darrington
Subject: Re: i18n
Date: Mon, 20 Mar 2006 10:36:32 +0800
User-agent: Mutt/1.5.9i

On Sun, Mar 19, 2006 at 05:26:47PM -0800, Ben Pfaff wrote:
     >      
     >      I don't know about the unac library.  What are its advantages
     >      over iconv?
     >
     > Iconv is only useful if we know the source encoding.  If we don't
     > know it, we have to guess, and if we guess wrong, iconv will fail.
     > Also, it won't convert between encodings where data would be lost.
     > Unac, on the other hand, is a (more) robust but lossy thing.  For example,
     > given character 0xe1 (acute a) in iso-8859-1 it'll convert to 'a' in
     > ascii.  I don't know how it would handle converting from Japanese
     > characters to ascii .... 
     
     I do not understand how unac could remove accents from text
     without knowing the source encoding.  I don't see any indication
     that it can do so, now that I have read the unac manpage from the
     webpage you pointed out.  In fact, the first argument to the
     unac_string() function is the name of the source encoding, and
     unac is documented to use iconv internally to convert to UTF-16.
     
     (Why would we want to remove accents, by the way?)

Ideally we wouldn't.  I've only looked very briefly at the unac web
page.  As I understood it, it was supposed to convert a string from an
arbitrary encoding into a reasonable approximation of that string
which could be represented in plain ascii.  Perhaps I need to read
the web page more closely.
     
     > So we agree then that casefile data must not be meddled with.
     > However, this also means that both a) The keys in Value Labels ; and
     > b) the Missing Values must also be left verbatim.  Otherwise, they'll
     > no longer match.  And this has a rather unfortunate consequence that
     > the dictionary cannot be guaranteed to have a consistent encoding.
     > Hence my suggestion of a per-variable encoding attribute.
     
     This sounds like a mess.  Any reference to more than one string
     variable will have to deal with coding translation.  The most
     obvious place where this happens is in string expressions,
     e.g. consider the CONCAT function especially.  I'm sure we'll get
     confused when we have to fix up code all over to do that.  I bet
     that our users will get even more confused.

True.  I hadn't considered that.
     
     
     Let me elaborate.  Here is the plan that I envision:
     
     i. PSPP adopts a single locale that defaults to the system locale
        but can be changed with SET LOCALE.  (I'll call this the "PSPP
        locale".)
     
     ii. All string data in all casefiles and dictionaries is in the
         PSPP locale, or at least we make that assumption.
     
     iii. The GET command assumes by default that data read in is in
          the PSPP locale.  If the user provides a LOCALE subcommand
          specifying something different, then missing values and
          value label keys are converted as the dictionary is read and
          string case data is converted "on the fly" as data is read
          from the file.  We can also provide a NOCONVERT subcommand
          (with a better name, I hope) that flags string variables
          that are not to be converted.
     
     iv. The SAVE command assumes by default that data written out is
         to be in the PSPP locale.  If the user provides a LOCALE
         subcommand specifying something different, then we convert
         string data, etc., as we write it, and again exceptions can
         be accommodated.
     
     v. Users who want accurate translations, as in your survey
        example, choose a reasonable PSPP locale, e.g. something based
        on UTF-8.
     
     vi. We look into the possibility of tagging system files with a
         locale.  The system file format is extensible enough that
         this would really just be a matter of testing whether SPSS
         will complain loudly about our extension records or just
         silently ignore them.


I think there is no ideal solution to this problem.  Your proposal
may be as good as any other, and it is certainly simpler than what I
had suggested.  However, I'm worried about what happens if our
assumption at (ii) turns out to be wrong.  We need to ensure some
sensible behaviour in that case (hence my idea of unac).

Regarding (vi), I don't think SPSS would complain (at least not
loudly) about unrecognised records.  But all hell might break loose if
we commandeered an unused record type for this purpose and a later
version of SPSS chose to use it for something else.
Incidentally, SPSS v14 writes system files with a Type 7, Subtype 16
record.  I haven't been able to determine the purpose of this record.
Perhaps it specifies the encoding?

J'

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.



