Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts
From: Ben Pfaff
Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts in variable names
Date: Sat, 26 Apr 2014 13:54:58 -0700
User-agent: Mutt/1.5.21 (2010-09-15)
On Tue, Feb 18, 2014 at 06:37:37PM +0000, M?ller, Andre wrote:
> > -----Original Message-----
> > From: Ben Pfaff [mailto:address@hidden
> > Sent: Tuesday, February 18, 2014 18:33
> > To: M?ller, Andre
> > Cc: address@hidden
> > Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing
> > umlauts in variable names
> >
> > On Tue, Feb 18, 2014 at 03:17:26PM +0000, M??ller, Andre wrote:
> > > Another rather unfair testcase is a failure to identify a source file
> > > in DIN_66003 coding, but that really is to be expected -- DIN_66003 is
> > > a 7-bit-safe codepage for German, where äöüÄÖÜß§ take the place
> > > of US-ASCII's {|}[\]~@, respectively. An evil solution for problems
> > > long gone. I think it's sane to not try and handle 7-bit non-ASCII
> > > codings, so that's just to let you know. Really I cannot think of any
> > > way of handling them short of looking at oddities in character counts
> > > or success rates with matches against dictionaries.
> >
> > The code that I wrote doesn't really identify encodings at all.
> > Instead, it just tries to recode all the strings in the file from each
> > of several possible encodings to UTF-8. That means that it's easy to
> > add more encodings, including DIN_66003. The encodings that I chose are
> > fairly arbitrary: I took them from the list at
> > http://encoding.spec.whatwg.org/. I can add DIN_66003; no problem. Are
> > there other encodings I should add?
>
> Yes, I found that by "reading" your code... with reading in quotes because
> of my utter lack of C knowledge. At least I can read the commentary, and
> it's quite thorough.
>
> In any case, I indeed missed one codepage on my first tests: IBM850.
> That is the predecessor to windows-1252, also called MS-DOS Latin 1.
> To my surprise, it is not listed on the encoding.spec page.
> I think that would be a worthwhile addition.
>
> More worthwhile than the really strange and old DIN_66003.
> DIN_66003 would show up as a candidate every time the file actually is
> pure US-ASCII.
> But nevertheless, this obviously has been used, so you may want to add it.
> I really leave that up to you, it may be opening a can of worms.
> DIN_66003 is just the German variant of ISO_646 and there are a whole bunch
> of national variants to it: https://en.wikipedia.org/wiki/ISO/IEC_646
> That may end up in a list from hell for each dataset coded in plain US-ASCII.
At long last, I've added IBM850 and DIN_66003 to the encodings that
SYSFILE INFO checks for.
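
For readers following the thread, here is a rough Python sketch of the
approach described in the quoted text above: rather than detecting an
encoding, try to recode every string from each candidate encoding to Unicode
and keep the candidates that succeed. This is not PSPP's actual C code. The
names try_encodings, decode_din_66003 and CANDIDATES are made up for
illustration, and the DIN_66003 table is written by hand on the assumption
that the Python standard library ships no codec under that name.

```python
# Hypothetical sketch only; PSPP's real implementation is in C and differs.

# DIN_66003 (the German ISO 646 variant) reuses eight ASCII positions.
DIN_66003 = str.maketrans("@[\\]{|}~", "§ÄÖÜäöüß")


def decode_din_66003(raw: bytes) -> str:
    # The encoding is 7-bit, so decode as ASCII first (this raises
    # UnicodeDecodeError on any byte >= 0x80), then swap the eight
    # national positions.
    return raw.decode("ascii").translate(DIN_66003)


# Candidate encodings to try, in no particular order.
CANDIDATES = {
    "windows-1252": lambda b: b.decode("cp1252"),
    "IBM850": lambda b: b.decode("cp850"),
    "DIN_66003": decode_din_66003,
}


def try_encodings(strings):
    """Return {encoding name: decoded strings} for every candidate
    encoding under which all of the given byte strings decode."""
    results = {}
    for name, decode in CANDIDATES.items():
        try:
            results[name] = [decode(s) for s in strings]
        except UnicodeDecodeError:
            continue  # at least one string is invalid in this encoding
    return results


if __name__ == "__main__":
    # 0x99 is a valid byte in both windows-1252 and IBM850, so both survive
    # and a human has to pick the reading that makes sense.
    for name, decoded in try_encodings([b"GR\x99SSE"]).items():
        print(name, decoded)
```

A name made only of plain ASCII letters decodes to the same text under all
three candidates, which is the "list from hell" concern raised above: every
ISO 646 national variant added to the list will match any pure US-ASCII file.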