bug-gnu-pspp

Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts


From: Ben Pfaff
Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts in variable names
Date: Sat, 26 Apr 2014 13:54:58 -0700
User-agent: Mutt/1.5.21 (2010-09-15)

On Tue, Feb 18, 2014 at 06:37:37PM +0000, Müller, Andre wrote:
> > -----Original Message-----
> > From: Ben Pfaff [mailto:address@hidden]
> > Sent: Tuesday, February 18, 2014 18:33
> > To: Müller, Andre
> > Cc: address@hidden
> > Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing
> > umlauts in variable names
> > 
> > On Tue, Feb 18, 2014 at 03:17:26PM +0000, Müller, Andre wrote:
> > > Another rather unfair testcase is a failure to identify a source file
> > > in DIN_66003 coding, but that really is to be expected -- DIN_66003 is
> > > a 7-bit-safe codepage for German, where äöüÄÖÜß§ take the place
> > > of us-ascii's {|}[\]~@, respectively. An evil solution for problems
> > > long gone.  I think it's sane to not try and handle 7-bit non-ascii
> > > codings, so that's just to let you know.  Really I cannot think of any
> > > way of handling them short of looking at oddities in character counts
> > > or success rates with matches against dictionaries.
> > 
> > The code that I wrote doesn't really identify encodings at all.
> > Instead, it just tries to recode all the strings in the file from each
> > of several possible encodings to UTF-8.  That means that it's easy to
> > add more encodings, including DIN_66003.  The encodings that I chose are
> > fairly arbitrary: I took them from the list at
> > http://encoding.spec.whatwg.org/.  I can add DIN_66003; no problem.  Are
> > there other encodings I should add?
> 
> Yes, I found that by "reading" your code... with reading in quotes because of
> my utter lack of C knowledge. At least I can read the commentary, and it's
> quite thorough.
> 
> In any case, I indeed missed one codepage on my first tests: IBM850. 
> That is the predecessor to windows-1252, also called ms-dos latin1. 
> To my surprise, it is not listed on the encoding.spec page. 
> I think that would be a worthwhile addition.
> 
> More worthwhile than the really strange and old DIN_66003.
> It would show up every time the file actually is pure us-ascii.
> But nevertheless, this obviously has been used, so you may want to add it.
> I really leave that up to you, it may be opening a can of worms.
> DIN_66003 is just the German variant of ISO_646 and there are a whole bunch
> of national variants to it: https://en.wikipedia.org/wiki/ISO/IEC_646
> That may end up in a list from hell for each dataset coded in plain us-ascii.

At long last, I've added IBM850 and DIN_66003 to the encodings that
SYSFILE INFO checks for.
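
For anyone following the thread, the "try each encoding" approach described
in the quoted text above can be sketched roughly as follows. This is a
Python illustration, not PSPP's actual C code: the candidate list, the
function names (try_encodings, decode_din_66003), and the hand-written
DIN_66003 table are assumptions made for the example. Python ships codecs
for IBM850 (cp850) and windows-1252 (cp1252) but not for DIN_66003, hence
the small translation table.

```python
# Sketch only: illustrates trial recoding to Unicode, not PSPP's real code.

# DIN 66003 (German ISO 646 variant) reuses these ASCII code points
# for German letters and the section sign.
DIN_66003 = str.maketrans("@[\\]{|}~", "§ÄÖÜäöüß")


def decode_din_66003(raw: bytes) -> str:
    """Decode 7-bit DIN 66003 bytes; raises on any byte >= 0x80."""
    return raw.decode("ascii").translate(DIN_66003)


# Hypothetical candidate list, chosen for this example only.
CANDIDATES = {
    "UTF-8": lambda b: b.decode("utf-8"),
    "windows-1252": lambda b: b.decode("cp1252"),
    "IBM850": lambda b: b.decode("cp850"),
    "DIN_66003": decode_din_66003,
}


def try_encodings(strings):
    """Return {encoding: decoded strings} for each candidate encoding
    that decodes every string without error."""
    results = {}
    for name, decode in CANDIDATES.items():
        try:
            results[name] = [decode(s) for s in strings]
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the file's strings
    return results


if __name__ == "__main__":
    for sample in (b"M\x81ller", b"M}ller", b"AGE"):
        print(sample, try_encodings([sample]))
```

In this sketch, b"M\x81ller" survives only as IBM850, where byte 0x81 is
"ü". The 7-bit string b"M}ller" decodes under all four candidates and
only the DIN_66003 reading gives "Müller". A pure us-ascii string such as
b"AGE" decodes identically under every candidate, which is the "it would
show up every time" effect mentioned in the quoted text.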


