octave-maintainers

Re: Faster imread and imwrite


From: Daniel J Sebald
Subject: Re: Faster imread and imwrite
Date: Mon, 11 Dec 2017 14:28:54 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1

On 12/10/2017 12:35 PM, Mihai Babiac wrote:
Hello everyone,

First of all thanks for making this great app!

Recently I've been working on a project which involves generating a lot of images, processing them externally and then getting some statistics from the resulting images in Octave. The images are 4096x4096 16-bit uncompressed TIFFs, so around 100MB each. I noticed that a big chunk of the time in the entire process is spent just reading and writing the files, so I started to look at how I could optimize that. Here are my findings.

Currently it's faster to convert the TIFFs to raw RGB images and read those manually with fread (~1.7x faster for a 16-bit image and 3x faster for an 8-bit image on my machine):
system(["gm convert " fname " " fname ".rgb"]);  # TIFF -> raw interleaved RGB
fid = fopen([fname ".rgb"]);
## Read big-endian 16-bit samples, then reshape the interleaved
## R,G,B stream into a 4096x4096x3 image.
im_fast = permute(reshape(fread(fid, [3 Inf], "*uint16", "ieee-be")', 4096, 4096, 3), [2 1 3]);
fclose(fid);
delete([fname ".rgb"]);

Thinking that hack was a bit silly, I decided to look at the source code of Octave itself. I did some profiling and noticed that a big part of the read time is spent adjusting the variable range from the GraphicsMagick internal representation to the actual bit depth. This is currently done by first converting everything to double, dividing, rounding and converting back to an unsigned integer (I'm mostly investigating the "TrueColorType" part of read_images). In almost all cases (except for 32-bit file depth, where for some reason Octave wants to return a normalized double result) these operations can be done by bit shifting. A simple bit shift improves performance by 3.4x for 16-bit and 3x for 8-bit images. If it is done with an if-branch for each shift amount (0, 8 or 24), so the shift values are constants inside the per-pixel for-loops, it is even faster, around 5x for both 16- and 8-bit.

I wonder how/if this code could be written in a nice and concise way, so as not to have three copies of the same code. I also don't know how the 32-bit case should be handled, but it really looks like a corner case to me: I guess most users don't have GraphicsMagick compiled with quantum-depth 32, and floating-point images are quite rare. Having users do the conversion from uint32 to float themselves might not be such a bad idea, but the thought of breaking existing code sounds bad.

In the case of imwrite, things are much nicer. Here I'm mostly investigating the "TrueColorType" part of encode_uint_image. The big time-waster is the construction and destruction of a Magick::Color object for each and every pixel, which internally calls new and delete. It turns out that the output values can be written directly to the output vector, without an intermediary Color object; that alone improves performance by more than 2.5x. It gets even faster if integer operations are used (4x with multiplications), but there are two cases: when the Octave variables are narrower than the quantum depth and when they are wider. In the first case we need to multiply by a constant that depends on the width of the template type and the quantum depth; in the second we need to shift by a value that depends on those depths, just as was done for imread. Unfortunately I don't have a lot of experience with templates, so I don't know how this can be done without duplicating code.

Just a final note: I stored and read the images to and from a tmpfs filesystem, so the speedups might be a lot smaller on an HDD. In case anyone wonders (I know I did) whether division and rounding can really be replaced by bit shifting, here's some proof for one of the cases:
  v = uint8(0:2^8-1);
  v_mag = uint16(v)*uint16((2^16-1)/(2^8-1));
  v_rec = bitshift(v_mag, -8);
  isequal(v_rec, v)

What do you think about all this? Is it worth the effort? If yes, I could try cleaning my code up, extending it to all the cases (RGB, grayscale, with/without alpha, etc.) and sending you a changeset. I'm not exactly sure what the process is for getting involved.


All the best,
Mihai B

Hi Mihai,

I wasn't aware that a division was done as part of the main loop in a lot of these image load/save routines. I'm looking at the code right now and I'll give my thoughts.

First, efficiency is always welcome, but keep in mind that these patches often get lost in the ether because bug fixes tend to take higher priority, so be aware of that before diving in. For now, just try a few things and think it over. (Octave has a patch tracker which can be used to keep track of things.)

Any routine that processes large data should have as efficient a central loop as possible. The code does seem to shape the row/column loops with that philosophy in mind. But yes, division is the worst operation to have there: if it is software-emulated division it's very slow, and even hardware floating-point division, while faster, still carries some overhead, since loading the FPU and dividing can take more machine cycles than, say, a simple multiplication or a bit shift/load.

I used the word "load" because sometimes processors can accomplish a 16-bit shift simply by loading a high versus low register. That kind of thing.

Given that, I'd say yes, there is room for optimization, and it would entail a bit of analysis at the front of the routine to see whether there are simpler means of accomplishing the division. For example, analyze the divisor and ask: "Is this division by a power of 2, i.e., a bit shift?" "Does this division boil down to an 8-bit or 16-bit shift, and if so, can it be written as some type of integer load that you know will translate to just one or two instructions via an optimizing C compiler?" That sort of thing. So the code might have a subset of routines or macros that are really efficient bit shifts and register loads, each called as appropriate for the analyzed scale factor.

There is another benefit to what I described. A floating-point division can introduce some inaccuracy if not done correctly, whereas bit shifts and register loads are sure to be accurate in terms of image data bits. That is, if we were to save image data with imwrite() we'd like to be assured that imread() will read back that data to bit-wise accuracy, i.e., the same exact data... not something for which an LSB was altered due to floating-point math. (There should be a set of tests for write/read accuracy.)

There is one very simple thing you could try here, in terms of efficiency, and evaluate against the other numbers you are telling us. Rather than have that division in the main loop, change it to something like (pseudocode):

loop_multiplier = 1.0 / divisor;  /* compute the reciprocal once, outside the loop */
for (column loop)
  for (row loop)
    img_fvec[idx++] = pix->red * loop_multiplier;
  end
end

Multiplies are often faster than divisions. What types of speeds might you see for that kind of construct? If that improves efficiency in and of itself, you could make a very simple patch first and then leave greater improvement for later.

Dan


