pspp-users

Re: ZSAV format support [ZCOMPRESSED subcommand]


From: Ben Pfaff
Subject: Re: ZSAV format support [ZCOMPRESSED subcommand]
Date: Thu, 3 Oct 2013 21:38:57 -0700
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Oct 02, 2013 at 09:40:03PM -0700, Ben Pfaff wrote:
> On Wed, Oct 02, 2013 at 01:03:27PM -0400, Hugo Alejandro wrote:
> > A few days ago I was recruited to work on the analysis of large surveys.
> > What caught my attention is the use of the *.zsav format instead of *.sav.
> > 
> > Apparently this file format supports a higher compression ratio and is more
> > efficient with large databases: it reduces their size on disk and is faster
> > than compressing and decompressing a ZIP file (or other format) containing
> > a *.sav file.
> > 
> > This file type is very recent, included in SPSS version 21 and improved in
> > the current version 22.
> 
> This is very interesting.  Thank you for bringing this to our
> attention.
> 
> The .zsav file format appears to be the same as .sav format up to the
> data portion of the file, except that the "magic" at the beginning of
> the file is $FL3 instead of $FL2.
> 
> The data portion of the file starts at offset 837 (0x345).  Its
> contents, with my speculation about their meaning, are:
> 
> 00000345  45 03 00 00 00 00 00 00 - Byte offset of this block, 0x345.
> 0000034d  14 07 00 00 00 00 00 00 - Byte offset of the next block, 0x714.
> 00000355  30 00 00 00 00 00 00 00 - Length of next block's header, 0x30 bytes.
> 
> It is followed by 951 (0x3b7) bytes of data compressed with the
> "deflate" algorithm.  When inflated, these expand to 1120 (0x460) bytes
> that exactly match the data portion of the original physiology.sav,
> which starts at offset 729 (0x2d9) in the original file.
> 
> The file ends with an additional 48 (0x30) bytes starting at offset 1812
> (0x714).  Their contents, with my speculation about their meaning, are:
> 
> 00000714  9c ff ff ff ff ff ff ff - Value -100, dunno why (compression bias?)
> 0000071c  00 00 00 00 00 00 00 00 - ?
> 00000724  00 f0 3f 00 01 00 00 00 - ?
> 0000072c  45 03 00 00 00 00 00 00 - Starting offset of previous block, 0x345.
> 00000734  5d 03 00 00 00 00 00 00 - Starting offset of data block, 0x35d.
> 0000073c  60 04 00 00             - Inflated data size, 0x460 bytes.
> 00000740  b7 03 00 00             - Compressed data size, 0x3b7 bytes.
> 
> From here, I think that the next step would have to be to look at both
> the .sav and .zsav versions of files.  I would be most interested in
> larger files (say, 1 MB in size), because I think that it is likely that
> some of the mysteries above would be cleared up if there were more
> compressed blocks in the file (or perhaps we would find out that there
> is only ever a single compressed block).
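
The offsets in the quoted dump can be checked mechanically.  Here is a
minimal Python sketch, assuming the three header fields are little-endian
signed 64-bit integers.  The names this_ofs, next_ofs and next_len are made
up for illustration, and the bytes are copied from the dump above:

```python
import struct

# 24-byte header found at offset 0x345 in the small .zsav file.
header = bytes.fromhex(
    "4503000000000000"   # offset of this header
    "1407000000000000"   # offset of the trailing block
    "3000000000000000"   # length of the trailing block
)

# Assumption: three little-endian signed 64-bit integers.
this_ofs, next_ofs, next_len = struct.unpack("<3q", header)

# The compressed data would then start right after the header.
data_start = this_ofs + len(header)

# Compressed size reported at offset 0x740 in the trailing block.
compressed_size = 0x3B7

print(hex(this_ofs), hex(next_ofs), hex(next_len), hex(data_start))

# Header + compressed data should end exactly where the trailer begins.
assert data_start + compressed_size == next_ofs
```

Under these assumptions the compressed data starts at 0x35d and ends
exactly at 0x714, which agrees with the "starting offset of data block"
field in the trailing block.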

Some of this matches up, but some of it is weird:

000035a  5a 03 00 00 00 00 00 00 - Byte offset of this block, 0x35a
0000362  12 94 03 00 00 00 00 00 - Byte offset of the next block, 0x39412.
000036a  48 00 00 00 00 00 00 00 - Length of next block's header, 0x48 bytes.

...then compressed data, then...

0039412  9c ff ff ff ff ff ff ff - Value -100, dunno why (compression bias?)
003941a  00 00 00 00 00 00 00 00 - ?
0039422  00 f0 3f 00 02 00 00 00 - ?
003942a  5a 03 00 00 00 00 00 00 - Starting offset of previous block, 0x35a.
0039432  72 03 00 00 00 00 00 00 - Starting offset of data block, 0x372.
003943a  00 f0 3f 00             - Inflated data size, 0x3ff000 bytes.
003943e  49 7c 03 00             - Compressed data size, 0x37c49 bytes.
0039442  5a f3 3f 00 00 00 00 00 - 0x3ff35a = 0x35a + 0x3ff000
                                   = current byte offset if no compression
003944a  bb 7f 03 00 00 00 00 00 - ?
0039452  00 bf 06 00             - ?
0039456  57 14 00 00             - ?
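
One tentative reading of this 0x48-byte block, not confirmed anywhere in
this thread, is a 24-byte fixed part followed by one 24-byte descriptor per
compressed block.  The Python sketch below decodes the bytes above under
that assumption.  All field names (bias, zero, block_size, n_blocks and the
descriptor fields) are invented for illustration:

```python
import struct

# The 0x48 bytes found at offset 0x39412, copied from the dump above.
trailer = bytes.fromhex(
    "9cffffffffffffff"      # -100
    "0000000000000000"      # 0
    "00f03f00" "02000000"   # 0x3ff000, then 2
    "5a03000000000000"      # 0x35a
    "7203000000000000"      # 0x372
    "00f03f00" "497c0300"   # 0x3ff000, 0x37c49
    "5af33f0000000000"      # 0x3ff35a
    "bb7f030000000000"      # 0x37fbb
    "00bf0600" "57140000"   # 0x6bf00, 0x1457
)
TRAILER_OFS = 0x39412

# Assumed layout: two int64 values, then two uint32 values.
bias, zero, block_size, n_blocks = struct.unpack_from("<qqII", trailer, 0)

# Assumed per-block descriptor:
# (uncompressed offset, compressed offset, uncompressed size, compressed size)
blocks = [struct.unpack_from("<qqII", trailer, 24 + 24 * i)
          for i in range(n_blocks)]

for (u_ofs, c_ofs, u_size, c_size), nxt in zip(blocks, blocks[1:]):
    # Each block should start where the previous one ends,
    # in both the compressed and the uncompressed numbering.
    assert nxt[0] == u_ofs + u_size
    assert nxt[1] == c_ofs + c_size

# The last compressed block should end where the trailer begins.
last = blocks[-1]
assert last[1] + last[3] == TRAILER_OFS

for b in blocks:
    print([hex(x) for x in b])
```

With this reading, the three lines marked "?" at the end would describe a
second compressed block: 0x1457 bytes at offset 0x37fbb that inflate to
0x6bf00 bytes.  The sums work out (0x372 + 0x37c49 = 0x37fbb, and
0x37fbb + 0x1457 = 0x39412), but that is only arithmetic consistency, not
proof that the interpretation is right.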

In particular, when I decompress the compressed data block, only the
beginning of it looks the same as in the not-compressed version of the
file.  There is something weird going on.  Before I go to a lot of
trouble to try to chase that down, would you mind making sure for me
that both versions of the file really have the same data in them?
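
For anyone who wants to repeat the comparison, here is a rough Python
sketch of how it could be done.  The helper names inflate_block and
first_mismatch are hypothetical, and it is an assumption that the block is
either a zlib stream or raw deflate data, so the code tries both:

```python
import zlib


def inflate_block(buf, ofs, size):
    """Inflate `size` compressed bytes of `buf` starting at `ofs`."""
    chunk = buf[ofs:ofs + size]
    try:
        # Assumption: the data carries a zlib header and checksum.
        return zlib.decompress(chunk)
    except zlib.error:
        # Fallback assumption: raw deflate data without a wrapper.
        return zlib.decompress(chunk, wbits=-15)


def first_mismatch(a, b):
    """Return the first index where a and b differ, or None if equal."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    # Equal up to the shorter length: they differ only if lengths differ.
    return None if len(a) == len(b) else n


if __name__ == "__main__":
    # Synthetic stand-in data, not taken from any real .sav or .zsav file.
    original = bytes(range(256)) * 4
    blob = b"HDR" + zlib.compress(original)
    inflated = inflate_block(blob, 3, len(blob) - 3)
    print(first_mismatch(inflated, original))
```

With real files, `buf` would be the whole .zsav file, `ofs` and `size`
would come from the block offsets and sizes in the dumps above, and the
result would be compared against the data portion of the matching .sav
file.  The offset returned by first_mismatch shows where the two versions
start to diverge.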

Thanks,

Ben.


