bug-gnu-pspp
Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts


From: Ben Pfaff
Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts in variable names
Date: Tue, 18 Feb 2014 09:33:16 -0800
User-agent: Mutt/1.5.21 (2010-09-15)

On Tue, Feb 18, 2014 at 03:17:26PM +0000, Müller, Andre wrote:
> I have tried the SYSFILE INFO and it works quite well. 
> For now, I have piped some examples of uncommon codepages through it,
> and it does well for SHIFT_JIS and IBM850 (or similar), for example.
> 
> The broken files I have, which actually contain entries in more than
> one codepage, are not a valid test, but even then I found at least
> some of the codepages they contain among the suggestions.  That's nice.

I agree that mixed-codepage files are not a valid test ;-)

> Another rather unfair testcase is a failure to identify a source file
> in DIN_66003 coding, but that really is to be expected -- DIN_66003 is
> a 7-bit-safe codepage for German, where äöüÄÖÜß§ take the place
> of US-ASCII's {|}[\]~@, respectively. An evil solution for problems
> long gone.  I think it's sane to not try and handle 7-bit non-ASCII
> codings, so that's just to let you know.  Really I cannot think of any
> way of handling them short of looking at oddities in character counts
> or success rates with matches against dictionaries.

The code that I wrote doesn't really identify encodings at all.
Instead, it just tries to recode all the strings in the file from each
of several possible encodings to UTF-8.  That means that it's easy to
add more encodings, including DIN_66003.  The encodings that I chose are
fairly arbitrary: I took them from the list at
http://encoding.spec.whatwg.org/.  I can add DIN_66003; no problem.  Are
there other encodings I should add?
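
A rough illustration of the "recode everything from each candidate and see
what comes out" idea described above. This is a sketch in Python, not PSPP's
actual C code; the names DIN_66003, decode_din_66003, CANDIDATES and
recode_all are hypothetical, and the candidate list is only an example.
Python has no built-in DIN_66003 codec, so the sketch maps the eight
positions that differ from US-ASCII by hand.

```python
# Sketch only, not PSPP's code: decode the same raw strings under each
# candidate encoding and keep every candidate that decodes all of them.

# DIN 66003 replaces eight US-ASCII positions with German characters.
DIN_66003 = str.maketrans("@[\\]{|}~", "§ÄÖÜäöüß")


def decode_din_66003(raw):
    """Decode 7-bit bytes as ASCII, then swap in the DIN 66003 characters."""
    return raw.decode("ascii").translate(DIN_66003)


# Hypothetical candidate list: display name -> decoder function.
CANDIDATES = {
    "UTF-8": lambda raw: raw.decode("utf-8"),
    "IBM850": lambda raw: raw.decode("cp850"),
    "SHIFT_JIS": lambda raw: raw.decode("shift_jis"),
    "DIN_66003": decode_din_66003,
}


def recode_all(raw_strings, candidates=CANDIDATES):
    """Return {name: [decoded strings]} for candidates that decode every string."""
    results = {}
    for name, decode in candidates.items():
        try:
            results[name] = [decode(raw) for raw in raw_strings]
        except UnicodeDecodeError:
            continue  # this candidate cannot decode at least one string
    return results


if __name__ == "__main__":
    # A variable name stored as 7-bit DIN 66003 bytes.
    for name, strings in recode_all([b"M}ller"]).items():
        print(name, strings)
```

Note that a 7-bit input such as b"M}ller" decodes without error under every
candidate here, so nothing in this loop can single out DIN_66003; a person
still has to look at the results and pick the one that reads correctly.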


