bug-gnu-pspp

Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts


From: Ben Pfaff
Subject: Re: PSPP-BUG: Failure to handle an antique SPSS file containing umlauts in variable names
Date: Mon, 10 Feb 2014 10:52:46 -0800

Here's the approach I'm trying so far, in case you (or anyone) have ideas:

   * Extract all the raw string data from the .sav file, without trying to
     determine its encoding.

   * Try converting all of the raw string data from every significant encoding
     to UTF-8.  Discard any encodings that actually fail.

   * Group the remaining encodings into equivalence classes, where two
     encodings are equivalent if they yield identical UTF-8 for every string.

   * For each equivalence class, present the user with the strings that
     differ between classes, along with the meaning of each string.  Allow
     the user to choose one of the encodings.
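
To make the steps above concrete, here is a rough Python sketch of the decode,
discard, and merge steps.  It is only an illustration, not PSPP code: the
candidate encoding list and the function names are made up for the example,
and the raw strings are assumed to have been extracted from the .sav file
already as byte strings.

```python
from collections import defaultdict

# Hypothetical candidate list; which encodings count as "significant"
# is left open here.
CANDIDATE_ENCODINGS = ["ascii", "utf-8", "cp1252", "iso8859-15", "cp437", "cp850"]


def decode_all(raw_strings, encoding):
    """Decode every raw string, or return None if any one of them fails."""
    try:
        return tuple(raw.decode(encoding) for raw in raw_strings)
    except UnicodeDecodeError:
        return None


def group_encodings(raw_strings, encodings=CANDIDATE_ENCODINGS):
    """Map each distinct decoding of the file to the encodings producing it.

    Encodings that fail on any string are discarded.  Encodings that give
    identical text for every string land in the same equivalence class.
    """
    classes = defaultdict(list)
    for enc in encodings:
        decoded = decode_all(raw_strings, enc)
        if decoded is not None:
            classes[decoded].append(enc)
    return dict(classes)
```

For example, calling group_encodings([b"GR\xd6SSE", b"age"]) drops ascii and
utf-8 (byte 0xD6 is not valid there), puts cp1252 and iso8859-15 in one class,
and leaves cp437 and cp850 in classes of their own.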

So what you end up with is a table.  Along the y axis are string meanings,
e.g. "Variable Name 1", "Variable Name 2", ..., "Value Label 1".  Along
the x axis are encodings.  The entries are the strings for those encodings.
The user should be able to figure out which set of variable names
(etc.) makes the most sense and choose that encoding.
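
A sketch of how such a table might be built, again in Python and again only
as an illustration: comparison_table is a hypothetical helper, and it assumes
the caller passes (meaning, raw bytes) pairs plus one representative encoding
per equivalence class.  Rows where every encoding gives the same text are
left out, since they do not help the user decide.

```python
def comparison_table(labeled_raw, encodings):
    """Build the table: one row per string meaning, one column per encoding.

    labeled_raw is a list of (meaning, raw_bytes) pairs.  Returns
    (header, rows); encodings that fail on any string get no column.
    """
    columns = {}
    for enc in encodings:
        try:
            columns[enc] = [raw.decode(enc) for _, raw in labeled_raw]
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file; drop its column

    header = ["Meaning"] + list(columns)
    rows = []
    for i, (meaning, _) in enumerate(labeled_raw):
        cells = [col[i] for col in columns.values()]
        if len(set(cells)) > 1:  # keep only strings that differ by encoding
            rows.append([meaning] + cells)
    return header, rows


if __name__ == "__main__":
    labeled = [("Variable Name 1", b"GR\xd6SSE"), ("Variable Name 2", b"age")]
    header, rows = comparison_table(labeled, ["cp1252", "cp850"])
    for line in [header] + rows:
        print(" | ".join(line))
```

With the sample data, only "Variable Name 1" appears in the table, because
"age" decodes the same way under both encodings.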


