Re: [rdiff-backup-users] Proposal: Storing excess file information

From: Ben Escoto
Subject: Re: [rdiff-backup-users] Proposal: Storing excess file information
Date: Mon, 02 Dec 2002 12:58:55 -0800

Ok, I just ran a few tests, pretending that I was trying to write and
read the user and group ownership information of 500000 files in
order.  I compared Bud Bruegger's suggestion of shelve and a normal
text file.

First the sizes of the files:
-rw-r--r--    1 ben      ben      82911232 Dec  2 12:21 shelf.db
-rw-r--r--    1 ben      ben       1252698 Dec  2 12:17 text.gz

and now the times (ran each one twice):

Writing the text file:

real    0m23.902s
user    0m23.650s
sys     0m0.150s

real    0m24.811s
user    0m23.820s
sys     0m0.180s

Reading from that text file:

real    0m47.725s
user    0m47.310s
sys     0m0.080s

real    0m47.443s
user    0m47.360s
sys     0m0.080s

Writing using python's shelve module:

real    1m51.196s
user    0m34.540s
sys     0m24.880s

real    2m49.347s
user    0m29.180s
sys     0m23.810s

Reading from that shelf file:

real    1m37.679s
user    1m25.480s
sys     0m7.370s

real    1m32.652s
user    1m24.510s
sys     0m7.710s

So it seems the database format took longer to read and write.  At
least in the writing case this was apparently because it took up much
more space on disk and disk writes are relatively slow.
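
To put the size difference in per-entry terms, the arithmetic can be
sketched as below.  The byte counts and entry count come from the
listing and script in this message; the function and variable names
are only illustrative, and the snippet is written for Python 3 rather
than the Python 2 of the test script.  The text file is probably so
small partly because the test data is very repetitive (same user and
group on every entry), so real metadata would likely compress less
well.

```python
# File sizes in bytes, taken from the ls listing above.
SHELF_BYTES = 82911232
TEXT_GZ_BYTES = 1252698
ENTRIES = 500000  # "count" in the test script


def bytes_per_entry(total_bytes, entries=ENTRIES):
    """Average on-disk bytes used per file entry."""
    return total_bytes / entries


shelf_per_entry = bytes_per_entry(SHELF_BYTES)
text_per_entry = bytes_per_entry(TEXT_GZ_BYTES)
size_ratio = SHELF_BYTES / TEXT_GZ_BYTES

print("shelve:  %.1f bytes/entry" % shelf_per_entry)
print("text.gz: %.1f bytes/entry" % text_per_entry)
print("shelf.db is about %.0f times larger" % size_ratio)
```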

    The shelve format would have done better at random access.  This
doesn't happen much, but could happen at the start of a selective
restore.  However 'zcat text.gz | grep "File filename"' took about half
a second even for filenames near the bottom, so in theory there
shouldn't be a big speed issue.
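
A sequential lookup of that kind could also be done from Python.  The
following is a minimal sketch in Python 3, assuming the same
"File ..." / indented "Key value" layout that the test script writes;
find_entry is a hypothetical helper, not part of rdiff-backup.  It
compares the whole line, so looking up foo/48 will not also match
foo/482323 the way a plain substring grep would.

```python
import gzip


def find_entry(path, filename):
    """Return the attribute dict for filename, or None if it is absent."""
    target = "File %s\n" % filename
    with gzip.open(path, "rt") as fin:
        for line in fin:
            if line != target:
                continue
            # Found the entry: collect the indented attribute lines
            # that follow it, stopping at the next unindented line.
            entry = {}
            for attr_line in fin:
                if not attr_line.startswith("    "):
                    break
                key, _, value = attr_line.strip().partition(" ")
                entry[key] = value
            return entry
    return None
```

For example, find_entry("text.gz", "foo/482323") would scan from the
start of the file until it reaches that entry.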

    Dave Steinberg mentioned cdb.  This is probably faster than
shelve (assuming the python interface is reasonably fast) but I doubt
it will be much faster than the text version.  Also it doesn't seem to
be released under a GPL compatible license, which is a showstopper for

    The script used is below.  The text reading code could use much
improvement as it is surely faster to read in large blocks instead of
line by line (as was shown by the scan_text function).
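
One way to read in large blocks while still seeing whole lines is to
carry the unfinished tail of each block over to the next read.  This
is only a sketch in Python 3 with hypothetical function names; the
64 KB block size is the one scan_text uses, and whether this actually
beats readline() would have to be measured.

```python
import gzip


def iter_lines_blockwise(path, block_size=64 * 1024):
    """Yield each line as bytes (newline stripped), reading in blocks."""
    leftover = b""
    with gzip.open(path, "rb") as fin:
        while True:
            buf = fin.read(block_size)
            if not buf:
                break
            lines = (leftover + buf).split(b"\n")
            # The last piece may be an incomplete line; keep it for
            # the next block instead of yielding it now.
            leftover = lines.pop()
            for line in lines:
                yield line
    if leftover:
        yield leftover


def count_user_lines(path, user, block_size=64 * 1024):
    """Count the 'User' lines whose value equals user."""
    wanted = b"    User " + user.encode("utf-8")
    return sum(1 for line in iter_lines_blockwise(path, block_size)
               if line == wanted)
```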

So anyway I think I will just use a flat text format, with all the
metadata in one file.  It seems fast enough, takes up very little
disk space, and the format should be easy to process on any platform,
or read by hand.
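
As an illustration of how little code such a format needs, here is a
round-trip sketch in Python 3.  It assumes the layout used in the test
script (a "File" line followed by indented "Key value" lines); the
final format is not settled in this message, and write_metadata and
read_metadata are hypothetical names.  Filenames containing newlines
would need some form of quoting, which this sketch does not attempt.

```python
import gzip


def write_metadata(path, entries):
    """Write (filename, attrs) pairs as a gzipped flat text file."""
    with gzip.open(path, "wt") as fout:
        for filename, attrs in entries:
            fout.write("File %s\n" % filename)
            for key, value in attrs.items():
                fout.write("    %s %s\n" % (key, value))


def read_metadata(path):
    """Yield (filename, attrs) pairs in the order they were written."""
    filename, attrs = None, {}
    with gzip.open(path, "rt") as fin:
        for line in fin:
            line = line.rstrip("\n")
            if line.startswith("File "):
                # A new entry begins; emit the one collected so far.
                if filename is not None:
                    yield filename, attrs
                filename, attrs = line[5:], {}
            elif line.startswith("    "):
                key, _, value = line.strip().partition(" ")
                attrs[key] = value
    if filename is not None:
        yield filename, attrs
```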

Ben Escoto

import shelve, re, gzip, sys

user = "larry"
group = "losers"
count = 500000

def write_shelf():
        """Write shelve DB"""
        d = shelve.open("shelf.db")
        for i in xrange(count):
                d["foo/" + str(i)] = {"user": user, "group": group}
                if i % 100000 == 0: print i
        d.close()  # make sure everything is flushed to disk

def read_shelf():
        """Read every file from shelf"""
        d = shelve.open("shelf.db")
        for i in xrange(count):
                assert d["foo/" + str(i)]['user'] == user
                if i % 100000 == 0: print i

def write_text():
        """Write gzipped text file"""
        fout = gzip.GzipFile("text.gz", "wb")
        for i in xrange(count):
                filename = "foo/" + str(i)
                fout.write("File %s\n" % filename)
                fout.write("    User %s\n" % user)
                fout.write("    Group %s\n" % group)
                if i % 100000 == 0: print i
        assert not fout.close()

def read_text():
        """Read gzipped text file"""
        fin = gzip.GzipFile("text.gz", "rb")
        while 1:
                line = fin.readline()
                if not line: break
                if line.startswith("File "): filename = line[5:-1]
                elif line.startswith("    User "):
                        line_user = line[9:-1]
                        assert line_user == user
        assert not fin.close()

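def scan_text():
        """Scan the gzipped text file for one given filename"""
        fin = gzip.GzipFile("text.gz", "rb")
        tail = ""
        while 1:
                buf = fin.read(64 * 1024)
                assert buf, "filename not found"
                # Prepend the end of the previous block so a match that
                # straddles a block boundary is not missed
                if re.search("File foo/482323\n", tail + buf): break
                tail = buf[-32:]
        assert not fin.close()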

