
Re: [Monotone-devel] Not storing hashes in hex


From: Christof Petig
Subject: Re: [Monotone-devel] Not storing hashes in hex
Date: Tue, 16 Jan 2007 12:18:26 +0100
User-agent: Thunderbird 1.5.0.9 (X11/20070104)

Zack Weinberg wrote:
> It occurred to me that we store a lot of SHA1 hashes in our databases
> and they're all twice as big as they need to be because they're in
> hex.

I added a project like this to the list of summit projects, and I
strongly second the move (I coded the earlier base64->binary move).

> I'm not sure whether this means we actually want to *do* this for
> real.  It will make manual database queries have more binary garbage
> in them; there are a lot of places in the code that will have to
> change; we'll have to jump through hoops in a few places to get the
> hashes to stay the same; we probably don't want to do this to the
> netsync protocol, so there will be more conversions to do.  Still,
> nearly 10% disk space savings is not to sneeze at, and I bet there
> would be speed gains too, just from not having to read so much off the
> disk.

Having to write x'abcdef' instead of 'abcdef' is not that much
overhead IMHO. Having to write quote(id) hurts a bit; perhaps mtn exec
sql should print BLOBs quoted by default.
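
To make the x'...' and quote() point concrete, here is a minimal
sketch using Python's standard sqlite3 module. The table "ids" and its
column "id" are made up for this example; they are not monotone's
schema.

```python
import hashlib
import sqlite3


def demo():
    """Store one SHA1 as a raw BLOB and query it the 'manual' way."""
    db = sqlite3.connect(":memory:")
    # Hypothetical table for illustration only, not monotone's schema.
    db.execute("CREATE TABLE ids (id BLOB PRIMARY KEY)")

    raw = hashlib.sha1(b"example").digest()  # 20 raw bytes
    hex_id = raw.hex()                       # 40 hex characters
    db.execute("INSERT INTO ids (id) VALUES (?)", (raw,))

    # A hand-written query uses a BLOB literal, x'...', where the hex
    # schema would have used a plain string literal, '...'.
    found = db.execute(
        "SELECT count(*) FROM ids WHERE id = x'%s'" % hex_id
    ).fetchone()[0]

    # quote() renders the BLOB back as a hex literal for display.
    quoted = db.execute("SELECT quote(id) FROM ids").fetchone()[0]
    return raw, hex_id, found, quoted


if __name__ == "__main__":
    raw, hex_id, found, quoted = demo()
    print(len(raw), len(hex_id), found, quoted)
```

Without quote() the SELECT would hand back the 20 raw bytes, which is
the "binary garbage" in question.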

> There is another factor to consider.  There are 217,055 hashes in the
> "mtn.ids" file; however, there are only 91,223 *unique* hashes.  (This
> is because many of the hashes are used as pointers between tables.)
> The ratio is similar for OE.ids.  Thus, it might be worthwhile to yank
> all the hashes out into a separate table and reference them by row
> number from the rest of the database.  Depending on how sqlite decides
> to do things, this might be a *lot* better, as we could use INTEGER
> PRIMARY KEYs in a whole bunch of tables where we currently have string
> keys.  Technically this is orthogonal to the idea of storing the
> hashes as raw data, but it might be enough of a gain by itself that we
> don't want to bother with the de-hex-ification too (and, while the
> code changes for it would be substantial, I think they'd also be in
> fewer places).

A good thing to talk about at the summit. E.g. revision_certs could
easily refer to revision[_delta]s.
Storing delta and plain objects in one table (plain indicated by a NULL
base) might be a good idea to disambiguate the key and simplify queries.
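
A rough sketch of how the two ideas could fit together, again with
Python's sqlite3 module. Every table and column name here ("hashes",
"objects", "n", "base") and the intern() helper are invented for the
example, and it assumes one row per object; the real layout would need
discussion.

```python
import hashlib
import sqlite3

# Hypothetical schema: each unique hash is stored once and referenced
# by integer row number; plain and delta objects share one table.
SCHEMA = """
CREATE TABLE hashes (
    n    INTEGER PRIMARY KEY,
    hash BLOB NOT NULL UNIQUE
);
CREATE TABLE objects (
    id   INTEGER PRIMARY KEY REFERENCES hashes(n),
    base INTEGER REFERENCES hashes(n),  -- NULL: plain object, else delta
    data BLOB NOT NULL
);
"""


def intern(db, raw_hash):
    """Return the row number for raw_hash, inserting it if it is new."""
    db.execute("INSERT OR IGNORE INTO hashes (hash) VALUES (?)", (raw_hash,))
    row = db.execute("SELECT n FROM hashes WHERE hash = ?", (raw_hash,))
    return row.fetchone()[0]


def build_demo():
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    h1 = intern(db, hashlib.sha1(b"version 1").digest())
    h2 = intern(db, hashlib.sha1(b"version 2").digest())
    # Plain object: base is NULL.
    db.execute("INSERT INTO objects VALUES (?, NULL, ?)", (h1, b"version 1"))
    # Delta object: base points at the row number of the plain object.
    db.execute("INSERT INTO objects VALUES (?, ?, ?)", (h2, h1, b"<delta>"))
    return db, h1, h2


if __name__ == "__main__":
    db, h1, h2 = build_demo()
    plain = db.execute("SELECT id FROM objects WHERE base IS NULL").fetchall()
    print(plain)
```

With this shape, "is it a delta?" becomes a test for base IS NULL on a
single table, and the pointers between tables are integers rather than
40-character strings.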

   Christof



