Re: [pdf-devel] Bug in LZW filter?


From: Georg Gottleuber
Subject: Re: [pdf-devel] Bug in LZW filter?
Date: Tue, 20 Sep 2011 01:11:22 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; de; rv:1.9.2.20) Gecko/20110903 Lightning/1.0b2 Lanikai/3.1.12

Hello.

On 19.09.2011 09:59, Aleksander Morgado wrote:
>> Additional tests showed that encoding with EarlyChange = 0 (very
>> unusual) needs a bigger table (LZW_MAX_DICTSIZE + 1). I have done a lot
>> of testing and the LZW encoder (with EarlyChange = 0) now outputs the
>> same files as PDFlib-Lite-5.0.4p1 does. (PDFlib-Lite-5.0.4p1 is the only
>> encoder with EarlyChange = 0 I found)
>>
> 
> So the changes done will make it work both with EarlyChange=0 and
> EarlyChange=1, I am assuming.

Yes.

> Could you maybe try to explain one by one the changes in
> 'src/base/pdf-stm-f-lzw.c'? They seem pretty straightforward, but I
> would like to know the reasoning behind each of them. Are they all due
> to needing a bigger table with EarlyChange=0?

First of all: the standard is very vague. It says:
"[EarlyChange:] An indication of when to increase the code length. If
the value of this entry is 0, code length increases shall be postponed
as long as possible. If the value is 1, code length increases shall
occur one code early."

As I could not figure out which code is meant (input, output, or
dictionary), this remains unclear to me. So I used pdfshow (MuPDF) to
get real-world examples, compared the decoded bytes with vbindiff, and
decoded examples by hand. This shows that lzw_buffer_inc_bitsize comes
too early [1].
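
To make the effect of [1] concrete, here is a minimal sketch that
evaluates the old and the patched decoder condition side by side. It
assumes that maxval is the largest code representable at the current
width, i.e. (1 << bitsize) - 1, which the patch does not show. The
helper names are made up for illustration and are not part of GNU PDF.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: maxval is the largest code at the current code width. */
static unsigned
lzw_maxval (unsigned bitsize)
{
  return (1u << bitsize) - 1u;
}

/* Decoder condition before the patch:
   dict.size == maxval - 1 - early_change */
static bool
dec_grow_old (unsigned dict_size, unsigned bitsize, unsigned early_change)
{
  return dict_size == lzw_maxval (bitsize) - 1u - early_change;
}

/* Decoder condition after the patch [1]:
   dict.size == maxval + 1 - early_change */
static bool
dec_grow_new (unsigned dict_size, unsigned bitsize, unsigned early_change)
{
  return dict_size == lzw_maxval (bitsize) + 1u - early_change;
}
```

Under that assumption, with 9-bit codes and EarlyChange=1 the old
condition fires at a dictionary size of 509 and the patched one at 511,
i.e. two entries later. With EarlyChange=0 the patched condition fires
at 512.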

This fix broke my dec(enc(rand.bin)) == rand.bin "test". So I used
pdfshow examples again. They show that the reset code is encoded two
codes too early [2].

I did the same for EarlyChange=0 and found (after looking at the
source code of pdflib5) that the dictionary is too small [3].
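
One way to read [2] and [3] together is to solve the encoder conditions
for the dictionary size at which they become true. This is only a
sketch: it assumes that maxval equals (1 << bitsize) - 1 and that
LZW_MAX_BITSIZE is 12, neither of which is shown in the patch, and the
helper names are hypothetical.

```c
#include <assert.h>

#define MAX_BITSIZE 12u   /* assumed value of LZW_MAX_BITSIZE */

/* Dictionary size at which the old encoder condition
   (maxval - early_change == dict.size) becomes true. */
static unsigned
enc_trigger_old (unsigned bitsize, unsigned early_change)
{
  return ((1u << bitsize) - 1u) - early_change;
}

/* Dictionary size at which the patched encoder condition
   (maxval - early_change + 2 == dict.size) becomes true. */
static unsigned
enc_trigger_new (unsigned bitsize, unsigned early_change)
{
  return ((1u << bitsize) - 1u) - early_change + 2u;
}
```

At 12 bits with EarlyChange=1 the old condition fires at 4094 and the
patched one at 4096. With EarlyChange=0 the patched condition fires at
4097, which is (1 << 12) + 1 and so is consistent with the enlarged
LZW_MAX_DICTSIZE in [3].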

At [4],[5] I adjusted the numbers because of the increased DICTSIZE (to
get the same result), but I am not 100% sure about this. Someone with
experience in LZW should double-check this.
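
The arithmetic behind [4],[5] can be checked with a small sketch. It
assumes LZW_MAX_BITSIZE is 12 (not shown in the patch). The OLD_/NEW_
macro names and the two helper functions are only for illustration.

```c
#include <assert.h>

#define LZW_MAX_BITSIZE   12   /* assumed; not shown in the patch */
#define OLD_MAX_DICTSIZE  (1 << LZW_MAX_BITSIZE)         /* before [3] */
#define NEW_MAX_DICTSIZE  ((1 << LZW_MAX_BITSIZE) + 1)   /* after [3]  */

/* Offset of st->decoded into dec_buf before the patch. */
static int
old_decoded_offset (void)
{
  return OLD_MAX_DICTSIZE - 2;
}

/* Offset of st->decoded into dec_buf after the patch. */
static int
new_decoded_offset (void)
{
  return NEW_MAX_DICTSIZE - 3;
}
```

Both helpers return the same offset, so under this assumption the
change from -2 to -3 keeps st->decoded pointing at the same position as
before the dictionary was enlarged.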

[6] is irrelevant here because it belongs to a further bug fix that I
will commit later. Sorry for that. I will remove it.

> Are we testing these fixes with more than one dataset? 
The unit test uses two test sets (each of which has at least one dict
reset): one with EarlyChange == 0, one with EarlyChange == 1.

But I tested LZW decoding with several files from the following creators:
* Acrobat PDFWriter 2.01 for Windows
* Acrobat PDFWriter 3.0 for Windows
* Acrobat 3.0 Import Plug-in
* Acrobat Distiller 2.0 for Windows
* GPL Ghostscript 8.71

For encoding see next answer.


> How sure are we
> that we're not fixing one test case and breaking all the others?

Good question. As the standard remains unclear (to me), I can only test
and look at the source code of other open projects. Without my fix I
can show you dozens of badly decoded PDFs. With my fix I cannot.

With the encoder it is much harder. Because the decoder obeys the reset
code, the encoder can reset too early and the decoded result is still
correct. My tests showed that there are in fact different LZW encoder
implementations. For example, with EarlyChange = 1, the output of GNU
PDF is binary identical to that of Ghostscript 8.71, but "Acrobat
Distiller 2.0" and "Acrobat PDFWriter 2.01" place the reset code later.

I think my tests for the encoder are quite good. I tested (with
EarlyChange = 1) 3 different PDFs created with Ghostscript 8.71, with
LZW streams of 48KB to 50KB. The decoded bytes (from GNU PDF or
pdfshow) encode back to the original bytes.

To give you some numbers: I tested with 11 different LZW streams from 8
different PDFs:
With my fixes all 11 LZW streams decode correctly (without: none).
With my fixes all 3 Ghostscript streams encode to the same bytes
(without: none).

But I have to admit that all my test streams are > 2KB and therefore
need a bitsize increase (most of them also dict resets).

Regards,
Georg



Changes sorted by line number (as in patch):
[3]:
-#define LZW_MAX_DICTSIZE  (1 << LZW_MAX_BITSIZE)
+#define LZW_MAX_DICTSIZE  ((1 << LZW_MAX_BITSIZE) + 1)
 #define LZW_NULL_INDEX    ~0U
--------------------------------------------------------------------

[6]:
@@ -407,6 +407,8 @@
   if (st->must_reset)
     {
       lzw_buffer_put_code (&st->buffer, LZW_RESET_CODE);
+      lzw_buffer_set_bitsize (&st->buffer, LZW_MIN_BITSIZE);
+      lzw_dict_reset (&st->dict);
       st->must_reset = PDF_FALSE;
     }
--------------------------------------------------------------------

[2]:
@@ -419,7 +421,7 @@
           lzw_buffer_put_code (&st->buffer, st->string.prefix);
           st->string.prefix = st->string.suffix;

-          if (st->buffer.maxval - st->early_change == st->dict.size)
+          if (st->buffer.maxval - st->early_change + 2 == st->dict.size)
             {
               if (!lzw_buffer_inc_bitsize (&st->buffer))
                 {

@@ -434,7 +436,7 @@
   if (finish)
     {
       lzw_buffer_put_code (&st->buffer, st->string.prefix);
-      if ((st->buffer.maxval - st->early_change) == st->dict.size)
+      if ((st->buffer.maxval + st->early_change) == st->dict.size)
         {
           lzw_buffer_inc_bitsize (&st->buffer);
         }
--------------------------------------------------------------------

[4],[5]:
@@ -530,7 +532,7 @@
   lzw_buffer_init (&filter_state->buffer, LZW_MIN_BITSIZE);
   lzw_dict_init (&filter_state->dict);
   filter_state->old_code = LZW_NULL_INDEX;
-  filter_state->decoded = filter_state->dec_buf + (LZW_MAX_DICTSIZE-2);
+  filter_state->decoded = filter_state->dec_buf + (LZW_MAX_DICTSIZE - 3);
   filter_state->dec_size = 0;
   filter_state->state_pos = LZWDEC_STATE_START;
   filter_state->tmp_error = NULL;
@@ -664,7 +666,7 @@
       while (st->new_code != LZW_EOD_CODE &&
              st->new_code != LZW_RESET_CODE)
         {
-          st->decoded = st->dec_buf + (LZW_MAX_DICTSIZE - 2);
+          st->decoded = st->dec_buf + (LZW_MAX_DICTSIZE - 3);

           /* Is new code in the dict? */
           if (st->new_code < st->dict.size)
--------------------------------------------------------------------

[1]:
@@ -687,7 +689,7 @@
           if (!lzwdec_put_decoded (st, out))
             return PDF_STM_FILTER_APPLY_STATUS_NO_OUTPUT; /* No more output */

-          if (st->dict.size == st->buffer.maxval - 1 - st->early_change)
+          if (st->dict.size == st->buffer.maxval + 1 - st->early_change)
             {
               lzw_buffer_inc_bitsize (&st->buffer);





reply via email to

[Prev in Thread] Current Thread [Next in Thread]