classpath
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: java.util.zip speedup


From: Artur Biesiadowski
Subject: Re: java.util.zip speedup
Date: Mon, 23 Dec 2002 15:41:18 +0100
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.3a) Gecko/20021212

John Leuner wrote:

Did you do any profiling of the code?

Yes, many times, but only under sun jre 1.4.1 client hotspot on windows. Both cpu and memory. For the cpu, I do not think there are any suprises - most of time is spend in Inflater.decodeHuffman, with InflaterHuffmanTree.getSymbol contributing the most there (most methods called from decodeHuffman get inlined probably). Next parts are Adler32.update and OutputWindow.repeat (last one can be an artifact due to System.arraycopy happening there, providing good point for gc break).

I have done small optimalizations in few places, mostly in zip entry reading (it is now about 5 times faster) - but not in actual decompression routines.

That's quite an improvement. Are you saying that the current version
generates 130MB of garbage, and yours creates 1MB?

Yes. This 1MB is not real 'garbage' - most of it are ZipEntry objects, which are actually used by test program.

As for 130MB of garbage, most of it comes from OutputWindow allocation. I'm reading over 3000 files from zip, with 32kb byte[] allocated per file, it is already around 90MB. Next part comes from InflaterInputStream (4kb buffer), and quite a few from InflaterHuffmanTree, where temporary arrays are created each time. There is also an incredible amount of small arrays created in ZipEntry creation (for each read, small 2 or 4 byte array is created) - does not influence byte-count of garbage much, but it means that around 10 arrays are allocated per entry.

I have done improvements in two major ways.

First important one is changing way zip entries are read - it used to read header pieces word after word, I read entry at once - _very_ big improvement at first zip read. I have also removed most of temporary buffer allocations there.

Second is caching Inflater. Unfortunately, I have had to hide this caching from world very deep, so there is no chance of somebody messing with cached Inflater in any way. This basically means that only InputStream created by ZipFile will gain this benefit - with ZipInputStream, GZIPInputStream and generic InflaterInputStream we have almost same situation as before. Inflater in turn cache OutputWindow and HuffmanTrees, which cache their temporary arrays. In reset methods I'm doing quite extensive cleaning - somebody with more zip experience probably will be able to remove some of these precautions (I was not sure if at some place there is not dependency on not-touched array element to be zero, so I clear almost all of them, except obvious cases). Cache works ok with multiple threads - just if you have more than one of InputStreams open at same time, second and following ones will produce garbage (I think that in majority of cases single thread is doing most of reading from zip).

Could you send me a diff of your changes?

Sure, I'm sending them to you in separate mail. Please note that work is not finished - no documentation is done, some in fact is outdated. I would probably also want to move factory methods to separate class.

Bearing in mind that the java.util.zip implementation in Classpath is
also available as a seperate library, we should consider the needs of
people who might want to use this in an embedded system where space is
at a premium.

Three basic choices here.

1) In current state, there is zero problem with making it configurable. Everything boils down to two 'factory' free methods, which needs to be no-op for embedded case - so some kind of system or static variable will do the trick.

2) We can make Inflater cached at ZipFile instead of system level. When ZipFile is closed, Inflater would be freed - until then it would get reused for all streams from this ZipFile (as long as they are in single thread of course). Problem here is with the systems which would like to keep many ZipFile instances open, 'just in case' - they would pay multiple cost with not benefit. Variation of it is having reference counting for open ZipFiles - to have single static Inflater cache, but empty it when last ZipFile is closed (and recreate if new one is opened). But still you pay 36kb for having ZipFile open (in addition to possibly a lot for ZipEntries).

Artur




reply via email to

[Prev in Thread] Current Thread [Next in Thread]