octave-maintainers

Re: Faster imread and imwrite


From: Daniel J Sebald
Subject: Re: Faster imread and imwrite
Date: Mon, 11 Dec 2017 14:28:54 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1

On 12/10/2017 12:35 PM, Mihai Babiac wrote:
Hello everyone,

First of all thanks for making this great app!

Recently I've been working on a project which involves generating a lot of images, processing them externally and then getting some statistics from the resulting images in Octave. The images are 4096x4096 16-bit uncompressed TIFFs, so around 100MB each. I noticed that a big chunk of the time in the entire process is spent just reading and writing the files, so I started to look at how I could optimize that. Here are my findings.

Currently it's faster to convert the TIFFs to raw RGB images and read those manually with fread (~1.7x faster for a 16-bit image and 3x faster for an 8-bit image on my machine):
system(["gm convert " fname " " fname ".rgb"]);  # TIFF -> raw interleaved RGB
fid = fopen([fname ".rgb"]);
## Read big-endian 16-bit samples, then reshape the interleaved
## R,G,B stream into a 4096x4096x3 image.
im_fast = permute(reshape(fread(fid, [3 Inf], "*uint16", "ieee-be")', 4096, 4096, 3), [2 1 3]);
fclose(fid);
delete([fname ".rgb"]);

Thinking that hack was a bit silly, I decided to look at the source code of Octave itself. I did some profiling and noticed that a big part of the read time is spent adjusting the variable range from the GraphicsMagick internal representation to the actual bit depth. This is currently done by first converting everything to double, dividing, rounding and converting back to an unsigned integer (I'm mostly investigating the "TrueColorType" part of read_images). In almost all cases (except for 32-bit file depth, where for some reason Octave wants to return a normalized double result) these operations can be done by bit shifting. A simple bit shift improves performance by 3.4x for 16-bit and 3x for 8-bit images. If it is done with an if-branch for each shift amount (0, 8 or 24), so the shift values are constants inside the per-pixel for-loops, it is even faster, around 5x for both 16- and 8-bit.

I wonder how/if this code could be written in a nice and concise way, so as not to have three copies of the same code. I also don't know how the 32-bit case should be handled, but it really looks like a corner case to me: I guess most users don't have GraphicsMagick compiled with quantum-depth 32, and floating-point images are quite rare. Having users do the conversion from uint32 to float themselves might not be such a bad idea, but the thought of breaking existing code sounds bad.

In the case of imwrite, things are much nicer. Here I'm mostly investigating the "TrueColorType" part of encode_uint_image. The big time-waster is the construction and destruction of a Magick::Color object for each and every pixel, which internally calls new and delete. It turns out that the output values can be written directly to the output vector, without an intermediary Color object; that alone improves performance by more than 2.5x. It gets even faster if integer operations are used (4x with multiplications), but there are two cases: when the Octave variables are narrower than the quantum depth and when they are wider. In the first case we need to multiply by a constant that depends on the width of the template type and the quantum depth; in the second we need to shift by a value that depends on those depths, just as was done for imread. Unfortunately I don't have a lot of experience with templates, so I don't know how this can be done without duplicating code.

Just a final note: I stored and read the images to and from a tmpfs filesystem, so the speedups might be a lot smaller on an HDD. In case anyone wonders (I know I did) whether division and rounding can really be replaced by bit shifting, here's some proof for one of the cases:
  v = uint8(0:2^8-1);
  v_mag = uint16(v)*uint16((2^16-1)/(2^8-1));
  v_rec = bitshift(v_mag, -8);
  isequal(v_rec, v)

What do you think about all this? Is it worth the effort? If yes, I could try cleaning my code up, extending it to all the cases (RGB, grayscale, with/without alpha, etc.) and sending you a changeset. I'm not exactly sure what the process is for getting involved.


All the best,
Mihai B

Hi Mihai,

I wasn't aware that a division was done as part of the main loop in a lot of these image load/save routines. I'm looking at the code right now and I'll give my thoughts.

First, efficiency is always welcome, but keep in mind that these patches often get lost in the ether because bug fixes tend to take higher priority, so be aware of that before diving in. For now, just try a few things and think it over. (Octave has a patch tracker which can be used to keep track of things.)

Any routine that processes large data should have as efficient a central loop as possible. The code does seem to shape the row/column loops with that philosophy in mind. But yes, division is the worst operation to have there: if it is software-emulated division it's very slow, and even hardware floating-point division, while faster, still carries some overhead, since loading the FPU and dividing can take more machine cycles than, say, a simple multiplication or a bit shift/load.

I used the word "load" because sometimes processors can accomplish a 16-bit shift simply by loading a high versus low register. That kind of thing.

Given that, I'd say yes, there is room for optimization, and it would entail a bit of analysis at the front of the routine to see whether there are simpler means of accomplishing the division. For example, analyze the divisor and ask: "Is this division by a power of 2, i.e., a bit shift?" "Does this division boil down to an 8-bit or 16-bit shift, and if so, can it be written as some type of integer load that you know will translate to just one or two instructions via an optimizing C compiler?" That sort of thing. So the code might have a subset of routines or macros that are really efficient bit shifts and register loads, each called as appropriate for the analyzed scale factor.

There is another benefit to what I described. A floating-point division can introduce some inaccuracy if not done correctly, whereas bit shifts and register loads are sure to be accurate in terms of image data bits. That is, if we were to save image data with imwrite() we'd like to be assured that imread() will read back that data to bit-wise accuracy, i.e., the same exact data... not something for which an LSB was altered due to floating-point math. (There should be a set of tests for write/read accuracy.)

There is one very simple thing you could try here, in terms of efficiency, and evaluate against the other numbers you are telling us. Rather than have that division in the main loop, change it to something like (pseudocode):

loop_multiplier = 1.0 / divisor;  /* compute the reciprocal once, outside the loop */
for (column loop)
  for (row loop)
    img_fvec[idx++] = pix->red * loop_multiplier;
  end
end

Multiplies are often faster than divisions. What types of speeds might you see for that kind of construct? If that improves efficiency in and of itself, you could make a very simple patch first and then leave greater improvement for later.

Dan


