pspp-dev

Re: [bug #15820] Can not read sav file


From: John Darrington
Subject: Re: [bug #15820] Can not read sav file
Date: Fri, 24 Feb 2006 09:03:29 +0800
User-agent: Mutt/1.5.9i

On Thu, Feb 23, 2006 at 01:39:38PM -0800, Ben Pfaff wrote:

     Do you think we can assume that variable names are encoded in
     UTF-8?  

No.  The sample file provided with bug #15820 seems to be encoded in
ISO-8859-1.
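
To illustrate why UTF-8 cannot simply be assumed: the ISO-8859-1 bytes
for a name such as "Äpfel" are not valid UTF-8, so a strict decode
fails.  The Python sketch below is only an illustration.  The function
guess_name_encoding and its fallback argument are hypothetical (not part
of PSPP), and the byte string is an invented example, not the contents
of the file attached to the bug.

```python
def guess_name_encoding(raw, fallback="iso-8859-1"):
    """Return "utf-8" if raw decodes strictly as UTF-8, else the fallback.

    The fallback is an assumption supplied by the caller; nothing in the
    bytes themselves proves that they are ISO-8859-1.
    """
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


# "Äpfel" as it would be stored by an ISO-8859-1 environment.
latin1_name = b"\xc4pfel"
# The same name as it would be stored by a UTF-8 environment.
utf8_name = b"\xc3\x84pfel"

print(guess_name_encoding(latin1_name))
print(guess_name_encoding(utf8_name))
```

Pure ASCII names pass the UTF-8 test as well, so a check like this can
only rule UTF-8 out; it cannot identify which 8-bit encoding was used.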

     Then it is fairly easy to convert variable names to/from
     the current locale on system file input/output.
     
     I have not experimented with non-ASCII variable names in SPSS.  A
     few experiments might turn up the encoding.

Reading between the lines of the SPSS documentation, it seems to
suggest that the encoding is whatever was in use in the environment of
the machine that created the file.
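
If the creating machine's encoding were somehow known, the conversion
to the current locale might look roughly like the sketch below.  The
name recode_name and the choice to return None on failure are
assumptions made for illustration, not anything taken from PSPP.

```python
def recode_name(raw, file_charset, locale_charset):
    """Convert a variable name from the file's charset to the locale's.

    Returns None when the bytes are not valid in file_charset or when
    the name cannot be represented in locale_charset.
    """
    try:
        return raw.decode(file_charset).encode(locale_charset)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return None


# An ISO-8859-1 name converts cleanly to UTF-8 ...
print(recode_name(b"\xc4pfel", "iso-8859-1", "utf-8"))
# ... but has no representation in plain ASCII.
print(recode_name(b"\xc4pfel", "iso-8859-1", "ascii"))
```

The hard part remains the first argument: the file does not say which
charset it was written in.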
     
     
     I think it'd still be a good idea to sanity-check variable names,
     assuming that we can figure out the variable name encoding used
     in system files.


It would be nice, but in view of the above, I don't think we know what
"sane" is.  We just have to presume sanity unless proved otherwise.
     
     
     > Instead, let's do all that sort of checking in the lexer, and the
     > output routines.  Thus, 
     >
     >  DATA LIST LIST /Äpfel *.
     >
     > Will give an error (or perhaps just a warning) in the default "C"
     > locale, but continue happily if the LC_CTYPE locale has been set to
     > say "de_DE".  Similarly, if I generate output from a system file which
     > was created in the "de_DE" locale, but my current locale is "en_US",
     > then the output routine will generate a warning when it encounters a
     > variable name for which isalpha returns false.
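
The behaviour proposed in the quoted paragraph can be approximated as
follows.  This is a sketch under stated assumptions: the function
name_is_acceptable is hypothetical, the "ascii" codec stands in for the
"C" locale, and Python's str.isalpha is Unicode-based, so it only
imitates what a locale-aware isalpha would report.

```python
def name_is_acceptable(raw, charset):
    """Check that raw decodes in charset and starts with a letter.

    Stand-in for a lexer check; a real one would also validate the
    remaining characters and any length limit.
    """
    try:
        text = raw.decode(charset)
    except UnicodeDecodeError:
        return False
    return bool(text) and text[0].isalpha()


name = b"\xc4pfel"  # "Äpfel" in ISO-8859-1

# Rejected when only ASCII is allowed (roughly the "C" locale) ...
print(name_is_acceptable(name, "ascii"))
# ... accepted when the charset is ISO-8859-1 (roughly "de_DE").
print(name_is_acceptable(name, "iso-8859-1"))
```

Note that every byte sequence decodes in ISO-8859-1, so with that
charset the letter test is the only check doing any work.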
     
     Is that the way that other languages with support for
     internationalization parse variable names?  e.g. how does Java
     work?  I must admit that I have a pretty weak grasp of how this
     sort of thing is supposed to work.

Most languages that I've encountered insist on ASCII for identifiers.
The only exception I know of is TeX, which allows one to change it at
will.
     
     I found out what GCC does.  It assumes input files are in the
     locale's character set, or UTF-8 if there's no locale, and
     there's a command line option to override.  Maybe we should do
     the same.

Seems reasonable, except that I don't see how there can be "no
locale" on any *NIX system.  I suppose that section of the GCC manual
is just describing what the code does if setlocale(LC_CTYPE, 0) returns
NULL.
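
The precedence described for GCC (explicit override, then the locale's
character set, then UTF-8) could be sketched like this.  Both function
names are hypothetical, and silently skipping an unknown charset name is
a simplification chosen for the example; a real program would more
likely report the bad override as an error.

```python
import codecs
import locale


def choose_input_charset(override=None, locale_charset=None):
    """Pick a charset: explicit override, then locale, then UTF-8."""
    for candidate in (override, locale_charset):
        if candidate:
            try:
                # Normalise aliases such as "latin-1" and "ISO-8859-1".
                return codecs.lookup(candidate).name
            except LookupError:
                continue  # unknown charset name; try the next source
    return "utf-8"


def current_locale_charset():
    """Ask the C library which charset the LC_CTYPE locale implies."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        return None  # the "no locale" case
    return locale.getpreferredencoding(False) or None


print(choose_input_charset(locale_charset=current_locale_charset()))
```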

     Do you think that the "short" variable names in system files
     should be all ASCII?

It would seem not.  See above.

J'

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.