octave-patch-tracker

[Octave-patch-tracker] [patch #7668] Enhancement, speedup of loading par


From: anonymous
Subject: [Octave-patch-tracker] [patch #7668] Enhancement, speedup of loading partial data from a hdf5 file
Date: Thu, 17 Nov 2011 09:00:24 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1

URL:
  <http://savannah.gnu.org/patch/?7668>

                 Summary: Enhancement, speedup of loading partial data from a
hdf5 file
                 Project: GNU Octave
            Submitted by: None
            Submitted on: Thu 17 Nov 2011 09:00:22 AM UTC
                Category: None
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
        Originator Email: address@hidden
             Open/Closed: Open
         Discussion Lock: Any

    _______________________________________________________

Details:

I work with "big" datasets in hdf5 format; files of 20-40 GB are not
uncommon.

If possible, I load the entire file at once:

octave:1> tic(); all = load("filename.hdf5"); toc()
Elapsed time is 418.209 seconds.
octave:2>

But when the dataset is bigger than the available RAM, I want to do partial
loads to get out-of-core behavior:

octave:1> tic(); extr = load("filename.hdf5", "data000100"); toc()
Elapsed time is 301.926 seconds.
octave:2>

The same file is used in both examples. It is ~20 GB and contains 2700 "data
elements", each of which is returned as a struct. The test machine has 24 GB
of RAM. Because other things were running, some swapping occurred while
reading the entire file, so the numbers should be seen as rough estimates.

My hope was that reading 1/2700th of the data would take roughly 1/2700th of
the time needed to read the entire file. Unfortunately, that is not the
case.
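
To put a number on that expectation, here is a small back-of-the-envelope
sketch. It assumes all 2700 elements are about the same size and that read
time scales linearly with the amount of data, neither of which has been
verified. The function names are made up for this illustration, and the
inputs are the rough timings quoted above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: time one element "should" take if cost were
// proportional to the amount of data read (assumes equal-sized elements).
double expected_partial_seconds(double full_load_seconds,
                                std::size_t element_count) {
  assert(element_count > 0);
  return full_load_seconds / static_cast<double>(element_count);
}

// Hypothetical helper: how many times slower the measured partial load is
// than the proportional estimate.
double slowdown_factor(double measured_partial_seconds,
                       double full_load_seconds,
                       std::size_t element_count) {
  return measured_partial_seconds /
         expected_partial_seconds(full_load_seconds, element_count);
}
```

With the timings above (418.209 s for the full load, 301.926 s for one
element out of 2700), the proportional estimate is about 0.155 seconds, so
the measured partial load is nearly 2000 times slower than hoped.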

Why?

do_load keeps calling read_hdf5_data for as long as there is more data to
read. Only after read_hdf5_data has returned does do_load check whether the
data just read matches the variables that should be extracted, and then it
calls read_hdf5_data again.

This results in the entire hdf5 file being parsed in both examples above.

I suggest that when only some variables are to be read from an hdf5 file, the
name tests be done inside read_hdf5_data, so that only the corresponding
nodes in the file are parsed. That saves a lot of time. If the entire file is
to be read, things work just as before.
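
The difference between the two orderings can be illustrated with a toy
model. This is not Octave's code and not the attached patch: Node, Stats,
make_file, parse_then_filter and filter_then_parse are all hypothetical
names, and "parsing" is reduced to incrementing a counter. It only shows why
testing the name before parsing touches one node instead of all of them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one top-level node in the file.
struct Node {
  std::string name;
  std::size_t cost;  // pretend cost of parsing this node's payload
};

// Bookkeeping for how much work a strategy performed.
struct Stats {
  std::size_t parsed_nodes = 0;
  std::size_t parsed_cost = 0;
  std::size_t matches = 0;
};

// Build a fake file with nodes named data000000, data000001, ...
std::vector<Node> make_file(std::size_t count) {
  std::vector<Node> file;
  file.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    std::string digits = std::to_string(i);
    assert(digits.size() <= 6);
    file.push_back({"data" + std::string(6 - digits.size(), '0') + digits, 1});
  }
  return file;
}

// Ordering described above: every node is parsed, the name is tested after.
Stats parse_then_filter(const std::vector<Node>& file,
                        const std::string& wanted) {
  Stats s;
  for (const Node& n : file) {
    ++s.parsed_nodes;          // the full parse happens unconditionally
    s.parsed_cost += n.cost;
    if (n.name == wanted) ++s.matches;
  }
  return s;
}

// Suggested ordering: the name is tested first, only matches are parsed.
Stats filter_then_parse(const std::vector<Node>& file,
                        const std::string& wanted) {
  Stats s;
  for (const Node& n : file) {
    if (n.name != wanted) continue;  // skip without parsing the payload
    ++s.parsed_nodes;
    s.parsed_cost += n.cost;
    ++s.matches;
  }
  return s;
}
```

For a fake file of 2700 nodes and the name "data000100", the first function
parses all 2700 nodes and the second parses exactly one, while both find the
same single match.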

The attached patch implements this. When I repeat the test
"tic(); extr = load("filename.hdf5", "data000100"); toc()", it takes less
than 0.2 seconds.

I hope this patch is of interest. If it needs changes to be considered,
let me know and I'll try to adapt it.

/ Mattias Linde




    _______________________________________________________

File Attachments:


-------------------------------------------------------
Date: Thu 17 Nov 2011 09:00:22 AM UTC  Name: octave-hdf5patch.txt  Size: 3kB  
By: None

<http://savannah.gnu.org/patch/download.php?file_id=24391>

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/patch/?7668>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/

