Re: [Monotone-devel] speed of 0.99

monotone-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Monotone-devel] speed of 0.99

From:	Markus Wanner
Subject:	Re: [Monotone-devel] speed of 0.99
Date:	Mon, 29 Nov 2010 20:05:01 +0100
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.15) Gecko/20101030 Icedove/3.0.10

Timothy,

thanks for measuring.

On 11/29/2010 02:40 PM, Timothy Brownawell wrote:
> When doing an indexed lookup of a row, the usual procedure is to do a binary
> search on the index to find the index entry, then extract the rowid from the
> index and use that rowid to do a binary search on the original table. Thus a
> typical indexed lookup involves two binary searches. If, however, all columns
> that were to be fetched from the table are already available in the index
> itself, SQLite will use the values contained in the index and will never look
> up the original table row. This saves one binary search for each row and can
> make many queries run twice as fast. 

Okay, so SQLite can do index-only lookups - unlike Postgres - cool! Then
it certainly makes more sense to add attributes that aren't part of the
scan key, yes.

> Each revision typically has 4 certs (author, branch, date, changelog), so the
> old index on only rev_id should usually give 5 lookups. Indexing on (rev_id,
> name, value) should usually give 2 lookups, and indexing on everything gives
> 1 lookup.

..at the cost of increasing the size of the index. Which isn't just disk
space, but also costs memory space, bandwidth and affects CPU caching.
These certainly need to be taken into account as well.

> ... strace -c shows 534 calls to read() and 150 to lseek() with everything in
> the index, vs 2050 calls to read() and 1666 to lseek() with only (rev_id, 
> name,
> value). That doesn't make sense, it shouldn't be more than about twice as 
> much.

Maybe because there are more tuples per block, so you need to read fewer
blocks overall.

On a ~ 380 MiB test database, I measured about 6.1 MiB for the index on
(revision_id, name, value, keypair_id, signature), while the one on just
(revision_id, name, value) weighted 2.2 MiB. So a single block covers
almost 3x as many tuples in the later case.

> sqlite> .schema revision_certs
> CREATE TABLE revision_certs
>       (
>       hash not null unique,   -- hash of remaining fields separated by ":"
>       revision_id not null,   -- joins with revisions.id
>       name not null,          -- opaque string chosen by user
>       value not null,         -- opaque blob
>       keypair_id not null,    -- joins with public_keys.id
>       signature not null,     -- RSA/SHA1 signature of "address@hidden:val]"
>       unique(name, value, revision_id, keypair_id, signature)
>       );
> CREATE INDEX revision_certs__revnameval ON revision_certs (revision_id,
>        name, value, keypair_id, signature);
> sqlite> explain query plan select revision_id, name, value, keypair_id, 
> signature from revision_certs where revision_id = 'abc' and name = 'branch';
> 0|0|TABLE revision_certs WITH INDEX revision_certs__revnameval
>
> sqlite> explain query plan select revision_id, name, value, keypair_id, 
> signature from revision_certs where revision_id = 'abc' and name = 'branch';
> 0|0|TABLE revision_certs WITH INDEX sqlite_autoindex_revision_certs_2

I cannot reproduce that behavior. In both cases, I see the manually
created index being used - except if I remove all indexes and only keep
the uniqueness constraint. Then it uses the underlying index for that.
That's with sqlite 3.7.3, always using ANALYZE before EXPLAIN.

Looking at the schema it might make sense to simply rearrange the
uniqueness constraint, because the name attribute isn't very selective
(i.e. most tuples have one of few often used values). Instead, we should
put the revision_id at the first place, as that's way more selective.

I don't even think we need an additional index on (name, ...) anymore.
So let's drop that additional index for space and speed!

> Using the INDEXED BY with an index on (rev_id, name, value) takes 2.5s for 
> 'mtn st',
> vs... now I'm seeing 2.0s instead of the 1.1s I got yesterday.

I'd agree with the warning, that 'hinting' the database is messy and not
something we should need (nor want) to deal with.

> ...time for 'mtn st' varies drastically depending on exactly which revision 
> the
> workspace is at (even with inodeprints on and only a small change in the 
> number of
> files), but INDEXED BY with the 3-field index seems to be consistently .3s - 
> .5s
> slower than with everything in the index and the workspace on the same 
> revision.

I bet that's because of the low selectivity of the name attribute using
the uniqueness index.

Regards

Markus Wanner

[Prev in Thread]

Current Thread

[Next in Thread]

[Monotone-devel] speed of 0.99, Markus Wanner, 2010/11/19
- Re: [Monotone-devel] speed of 0.99, Aaron W. Hsu, 2010/11/27
- Re: [Monotone-devel] speed of 0.99, Timothy Brownawell, 2010/11/27
  - Re: [Monotone-devel] speed of 0.99, Timothy Brownawell, 2010/11/28
    - Re: [Monotone-devel] speed of 0.99, Thomas Keller, 2010/11/28
    - Re: [Monotone-devel] speed of 0.99, Richard Levitte, 2010/11/29
    - Re: [Monotone-devel] speed of 0.99, Markus Wanner, 2010/11/29
    - Re: [Monotone-devel] speed of 0.99, Timothy Brownawell, 2010/11/29
    - Re: [Monotone-devel] speed of 0.99, Markus Wanner <=

Prev by Date: Re: [Monotone-devel] speed of 0.99
Next by Date: [Monotone-devel] Issue 115 - 'mtn pull' over network fails to bring over certs or revision (monotone)
Previous by thread: Re: [Monotone-devel] speed of 0.99
Next by thread: [Monotone-devel] Updated Issue 97 - test diff_on_missing_trailing_newline_at_end_of_file fails on Cygwin (monotone)
Index(es):
- Date
- Thread