Hello everyone,
First of all, thanks for making this great app!
Recently I've been working on a project which involves generating a lot
of images, processing them externally and then getting some statistics
from the resulting images in Octave. The images are 4096x4096 16-bit
uncompressed TIFFs, so around 100 MB each. I noticed that a big chunk of
the total time is spent just reading and writing the files, so I started
looking at how I could optimize that. Here are my findings.
Currently it's faster to convert the TIFFs to raw RGB images and read
those manually with fread (~1.7x faster for a 16-bit image and 3x faster
for an 8-bit image on my machine):
system(["gm convert " fname " " fname ".rgb"]);
fid = fopen([fname ".rgb"], "r");
im_fast = permute(reshape(fread(fid, [3 Inf], "*uint16", 0, "ieee-be")', ...
                          4096, 4096, 3), [2 1 3]);
fclose(fid);
delete([fname ".rgb"]);
Thinking that hack was a bit silly, I decided to look at the source code
of Octave itself. Some profiling showed that a big part of the read time
is spent rescaling values from the GraphicsMagick internal
representation to the actual bit depth. This is currently done by first
converting everything to double, dividing, rounding and converting back
to an unsigned integer (I'm mostly looking at the "TrueColorType" part
of read_images). In almost all cases (except for a 32-bit file depth,
where for some reason Octave wants to return a normalized double result)
these operations can be done with a bit shift.

A simple bit shift improves performance by 3.4x for 16-bit and 3x for
8-bit images. If it is done with an if-branch for each shift amount
(0, 8 or 24), so that the shift value is a constant inside the per-pixel
for-loops, it is even faster: around 5x for both 16-bit and 8-bit. I
wonder how/if this code could be written in a nice and concise way, so
as not to have three copies of the same loop. I also don't know how the
32-bit case should be handled, but it really looks like a corner case to
me: most users probably don't have GraphicsMagick compiled with
quantum-depth 32, and floating-point images are quite rare. Having users
do the conversion from uint32 to float on their own might not be such a
bad idea, but I don't like the thought of breaking existing code.
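To make the idea concrete, here is a rough sketch of how the three copies could collapse into one by making the shift amount a compile-time template parameter, so the compiler sees a constant shift inside the hot loop. The function names, the plain uint32_t quantum type and the loop shape are my own invention for illustration, not Octave's actual read_images code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: scale GraphicsMagick quantum values down to the target bit
// depth with a compile-time shift instead of a double divide + round.
template <int SHIFT, typename T>
void scale_quanta (const std::vector<uint32_t>& quanta, std::vector<T>& out)
{
  out.resize (quanta.size ());
  for (std::size_t i = 0; i < quanta.size (); i++)
    out[i] = static_cast<T> (quanta[i] >> SHIFT);  // constant shift in the loop
}

// One runtime dispatch on the depth difference, three instantiations --
// the loop body exists only once in the source.
template <typename T>
void scale_dispatch (int shift, const std::vector<uint32_t>& quanta,
                     std::vector<T>& out)
{
  switch (shift)
    {
    case 0:  scale_quanta<0>  (quanta, out); break;
    case 8:  scale_quanta<8>  (quanta, out); break;
    case 24: scale_quanta<24> (quanta, out); break;
    }
}
```

The per-pixel branch disappears; the only branch left is the single switch before the loop.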
In the case of imwrite, things are much nicer. Here I'm mostly looking
at the "TrueColorType" part of encode_uint_image. The big time-waster is
the construction and destruction of a Magick::Color object for each and
every pixel, which internally calls new and delete. It turns out that
the output values can be written directly to the output vector, without
an intermediary Color object; that alone improves performance by more
than 2.5x. It gets even faster if integer operations are used (4x with
multiplications), but there are two cases: when the Octave type is
narrower than the quantum depth, we need to multiply by a constant that
depends on the width of the template type and the quantum depth; when it
is wider, we need to shift by a value that depends on those depths, just
as for imread. Unfortunately I don't have a lot of experience with
templates, so I don't know how this can be done without duplicating
code.
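For illustration, one way to cover both cases without duplication would be a single helper templated on the quantum depth and the Octave pixel type, choosing multiply or shift at compile time (C++17 if constexpr). The name to_quantum and the restriction to 8/16-bit quantum depths are my own sketch, not the actual encode_uint_image code:

```cpp
#include <cstdint>

// Sketch: convert an Octave pixel value to a GraphicsMagick quantum,
// picking the right integer operation at compile time.
template <int QUANTUM_BITS, typename T>
uint32_t to_quantum (T v)
{
  static_assert (QUANTUM_BITS == 8 || QUANTUM_BITS == 16,
                 "sketch only handles the common quantum depths");
  constexpr int t_bits = 8 * static_cast<int> (sizeof (T));
  if constexpr (t_bits <= QUANTUM_BITS)
    {
      // Octave type narrower than (or equal to) the quantum: replicate
      // the bit pattern, e.g. 0xAB * 257 == 0xABAB for 8 -> 16 bits.
      constexpr uint32_t factor
        = ((1u << QUANTUM_BITS) - 1) / ((1u << t_bits) - 1);
      return static_cast<uint32_t> (v) * factor;
    }
  else
    {
      // Octave type wider than the quantum: drop the low bits,
      // the same shift trick used on the imread side.
      return static_cast<uint32_t> (v) >> (t_bits - QUANTUM_BITS);
    }
}
```

Both the multiply constant and the shift amount come from the two bit widths, so the same template serves every width combination.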
Just a final note: I stored and read the images on a tmpfs filesystem,
so the speedups might be a lot smaller on an HDD. In case anyone wonders
(I know I did) whether division and rounding can really be replaced by
bit shifting, here's some proof for one of the cases:
v = uint8(0:2^8-1);                            % every possible 8-bit value
v_mag = uint16(v) * uint16((2^16-1)/(2^8-1));  % scale up the way GraphicsMagick does (factor 257)
v_rec = bitshift(v_mag, -8);                   % recover with a shift instead of divide + round
isequal(v_rec, v)                              % ans = 1
What do you think about all this? Is it worth the effort? If so, I could
try cleaning up my code, extending it to all the cases (RGB, grayscale,
with/without alpha, etc.) and sending you a changeset. I'm not exactly
sure what the process is for getting involved.
All the best,
Mihai B