pspp-dev


Re: Character encoding again.


From: John Darrington
Subject: Re: Character encoding again.
Date: Sun, 31 Oct 2010 12:44:43 +0000
User-agent: Mutt/1.5.18 (2008-05-17)

On Sat, Oct 30, 2010 at 10:01:37PM -0700, Ben Pfaff wrote:
     
     Character set names and their aliases are listed by IANA:
             http://www.iana.org/assignments/character-sets

Wouldn't it have been nice if SPSS had used IANA MIB numbers instead of
these "codepage" numbers whose definition is so elusive!
     
     > Moreover, there are a lot of SPSS data files which I have seen
     > which have this "character_code" set to 2, yet contain data
     > which are clearly not 7 bit ascii.
     
     It was only a few SPSS versions back that SPSS appeared to start
     putting values other than 2 into that field, and there are still
     many older SPSS system files on the web.  I guess that we will
     have to either guess the encoding or depend on the user to tell
     us the encoding for these files.

Based on what users have reported, SPSS treats character_code 2 as windows-1252
(even on non-Windows OSes).
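
To make that concrete, here is a rough sketch in Python (not PSPP code) of
what reading such a file might look like.  The function names are made up for
illustration, and the treatment of character_code 2 as windows-1252 rests only
on the user reports mentioned above, not on any documented SPSS behaviour.

```python
def is_seven_bit_ascii(raw: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in raw)


def decode_legacy_string(raw: bytes, character_code: int) -> str:
    """Decode a string from a system file whose header claims character_code.

    Assumption (from user reports only): character_code 2 is nominally
    7-bit ASCII but is in practice treated as windows-1252.
    """
    if character_code == 2:
        # windows-1252 is a superset of ASCII, so pure-ASCII data decodes
        # identically; errors="replace" covers the few bytes (e.g. 0x81)
        # that Python's cp1252 codec leaves undefined.
        return raw.decode("cp1252", errors="replace")
    raise ValueError(f"unhandled character_code {character_code}")
```

Since windows-1252 agrees with ASCII on the low 128 values, files which really
are 7-bit ASCII are unaffected by this choice.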
     
     and here's the current output:

Your table seems to be the most comprehensive I've seen yet.  I suggest we'll
have to hash it with gperf or something.  For codepage numbers which we cannot
resolve, I suppose we'll have to devise some fallback heuristic.  As for
encodings for which we cannot find a codepage number, we could just convert
everything to UTF-8.
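
Something along these lines, perhaps, sketched in Python rather than C for
brevity.  The table below is NOT your table: it holds only a handful of
well-known Windows code page identifiers as placeholders.  The "try cpNNNN,
else fall back to windows-1252" heuristic and all the names are assumptions of
mine, just to show the shape of the lookup.

```python
import codecs

# Illustrative entries only (Windows code page identifier -> codec name).
# A real table would be far larger and generated from the full list.
CODEPAGE_TO_ENCODING = {
    437: "cp437",
    1250: "cp1250",
    1251: "cp1251",
    1252: "cp1252",
    28591: "iso8859-1",
    65001: "utf-8",
}

# Assumed default when nothing else matches.
FALLBACK_ENCODING = "cp1252"


def encoding_for_codepage(codepage: int) -> str:
    """Resolve a codepage number to a codec name, with a fallback heuristic."""
    name = CODEPAGE_TO_ENCODING.get(codepage)
    if name is not None:
        return name
    # Heuristic: many codecs are registered under the name "cp<number>".
    candidate = f"cp{codepage}"
    try:
        codecs.lookup(candidate)
        return candidate
    except LookupError:
        return FALLBACK_ENCODING


def to_utf8(raw: bytes, codepage: int) -> bytes:
    """Re-encode raw bytes from the resolved encoding to UTF-8."""
    text = raw.decode(encoding_for_codepage(codepage), errors="replace")
    return text.encode("utf-8")
```

In C the dictionary would become the gperf-generated perfect hash, and the
decode/encode pair would presumably be an iconv call.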

J'

     

-- 
PGP Public key ID: 1024D/2DE827B3 
fingerprint = 8797 A26D 0854 2EAB 0285  A290 8A67 719C 2DE8 27B3
See http://pgp.mit.edu or any PGP keyserver for public key.



