[bug#39258] [PATCH v2 0/3] Xapian for Guix package search

From: Arun Isaac
Subject: [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
Date: Sat, 7 Mar 2020 19:01:13 +0530


Here is the second iteration of my Xapian Guix package search patchset. I have
found the reason the earlier patchset did not show a significant speedup. It
turns out that most of the time is spent in printing and Texinfo rendering of
the search results. So, in this patchset, I pre-render the search results
while building the Xapian index and store them in the Xapian database
itself. Then, during `guix search`, I just pull out the pre-rendered search
results and print them on the screen. This is much faster. See the comparison
below.

--8<---------------cut here---------------start------------->8---
With a warm cache,
$ time guix search inkscape

real    0m1.787s
user    0m1.745s
sys     0m0.111s
--8<---------------cut here---------------end--------------->8---

--8<---------------cut here---------------start------------->8---
$ time /tmp/test/bin/guix search inkscape

real    0m0.199s
user    0m0.182s
sys     0m0.024s
--8<---------------cut here---------------end--------------->8---
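
To make the pre-rendering idea concrete, here is a minimal sketch in Guile. It
is not the code from the patches: a hash table stands in for the Xapian
database, and the names %index, render-package, index-package! and
show-package are made up for illustration. The point is only that the
expensive rendering happens once, at index time, and search time reduces to a
lookup.

```scheme
;; Sketch only: a hash table stands in for the Xapian database, and
;; `render-package' stands in for the recutils/Texinfo rendering.
;; All names here are hypothetical, not taken from the patches.
(define %index (make-hash-table))

(define (render-package name version synopsis)
  ;; The expensive step; done once, when the index is built.
  (format #f "name: ~a~%version: ~a~%synopsis: ~a~%" name version synopsis))

(define (index-package! name version synopsis)
  ;; Store the already-rendered text under the package name.
  (hash-set! %index name (render-package name version synopsis)))

(define (show-package name)
  ;; At search time nothing is rendered: just fetch the stored string.
  (hash-ref %index name))

(index-package! "inkscape" "0.92.4" "Vector graphics editor")
(display (show-package "inkscape"))
```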

If most of the speedup comes from pre-rendering the results, it might seem
that the Xapian search is not so useful. We might as well have stuffed the
pre-rendered search results into the existing package cache generated by
generate-package-cache, or so it might seem. But, there are the following
arguments in favor of Xapian.

- The package cache would grow in size, and lookup would be slowed down
  because we need to load the entire cache into memory. Xapian, on the other
  hand, need only look up the specific packages that match the search query.
- Xapian can provide superior search results due to its stemming and language
- Xapian can provide spelling correction and query expansion -- that is,
  suggest search terms to improve search results. Note that I haven't
  implemented this yet, and it is out of scope for this patchset.

* Simplify our package search results

Why not use a simpler package search results format like Arch Linux or Debian
does? We could just display the package name, version and synopsis like so.

inkscape 0.92.4
    Vector graphics editor
inklingreader 0.8
    Wacom Inkling sketch format conversion and manipulation

Why do we need the entire recutils format? If the user is interested, they can
always use `guix package --show` to get the full recutils formatted
info. Having shorter search results will make everything even faster and much
more readable. WDYT?
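
As a rough illustration of the shorter format, here is a minimal sketch. The
procedure name short-result is hypothetical, and plain strings stand in for
real package records.

```scheme
;; Sketch of the proposed short format.  `short-result' is a hypothetical
;; name; plain strings stand in for package records.
(define (short-result name version synopsis)
  ;; First line: name and version.  Second line: indented synopsis.
  (format #f "~a ~a~%    ~a~%" name version synopsis))

(display (short-result "inkscape" "0.92.4" "Vector graphics editor"))
(display (short-result "inklingreader" "0.8"
                       "Wacom Inkling sketch format conversion and manipulation"))
```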

* How to test this patchset

To get guile-xapian, run a `guix pull`, if you haven't already. Then, in your
Guix source directory, drop into an environment with guix dependencies and
guile-xapian.

$ guix environment guix --ad-hoc guile-xapian

Apply patches and build.

$ git am v2-0000-cover-letter.patch 
$ make

Run a test guix pull.

$ ./pre-inst-env guix pull --url=$PWD --branch=xapian -p /tmp/test

where xapian is the name of the branch you committed the patches to.

Then, run the guix search in /tmp/test.

$ /tmp/test/bin/guix search game

* Comments

Pierre Neidhardt <address@hidden> writes:

>> +(define (search-package-index profile querystring)
> Maybe `query-string'?

Done in this patchset.

>> +  (define (regexp? str)
>> +    (string-any
>> +     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
>> +     str))
>> +
>> +  (if (and (current-profile)
>> +           (not (any regexp? patterns)))
> I would not put characters like ".", "$", or "+" here, lest we mistake a
> Xapian pattern for a regexp.
> As you said, I don't think both are compatible without ambiguity
> anyways, so we should probably drop regexp (or at least toggle them with
> a command line argument).

I agree.
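
To show the ambiguity, here is a small standalone sketch. The regexp?
predicate is copied from the quoted patch; search-backend is a hypothetical
helper that only reports which path a list of patterns would take, and it
leaves out the (current-profile) check from the patch.

```scheme
(use-modules (srfi srfi-1))

;; Same predicate as in the quoted patch.
(define (regexp? str)
  (string-any
   (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
   str))

;; Hypothetical helper: report which search path a list of patterns takes.
(define (search-backend patterns)
  (if (any regexp? patterns) 'regexp 'xapian))

(for-each (lambda (patterns)
            (format #t "~s -> ~a~%" patterns (search-backend patterns)))
          '(("inkscape") ("gtk+") ("vector" "editor") ("emacs.*mode")))
```

With this heuristic, a plain query such as "gtk+" takes the regexp path just
because it contains a "+".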

zimoun <address@hidden> writes:

> In the commit message, I would capitalize Xapian.

Done in this patchset.

>> +(define (generate-package-search-index directory)
>> +  "Generate under DIRECTORY a xapian index of all the available packages."
> Xapian with capital.

Done in this patchset.

> Is (make-stem "en") for the locale?

I still have English hard-coded. I haven't yet figured out how to detect the
locale and stem accordingly. But, there is a larger problem. Since we cannot
anticipate what locale the user will run guix search with, should we build the
Xapian index for all locales? That is, should we index not only the English
versions of the packages but also all other translations as well?
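
For the detection half of the problem, one possible starting point is
sketched below. It is an assumption, not something in the patches:
locale->language is a hypothetical helper, and I have not checked that every
language code it returns is accepted by make-stem. It does not address the
question of which locales to index.

```scheme
;; Hypothetical helper: derive a language code from a locale name such
;; as "fr_FR.UTF-8".  Falls back to "en" for "C", "POSIX" or an empty name.
(define (locale->language locale)
  (let* ((end  (or (string-index locale (char-set #\_ #\. #\@))
                   (string-length locale)))
         (lang (string-take locale end)))
    (if (member lang '("" "C" "POSIX"))
        "en"
        lang)))

;; Calling setlocale with only a category returns the current setting.
(display (locale->language (setlocale LC_MESSAGES)))
(newline)
```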

> package-search-index and package-cache-file could be refactored
> because they share all the same code.

Yes, they could be. However, I'll postpone that to the next iteration of the
patchset.

> I do not know what is the convention for the bindings.
> But there is 'fold-packages' so I would be inclined to 'fold-msets' or
> something in this flavour.

Well, everywhere else in Guile we have such things as vhash-fold, string-fold,
hash-fold, stream-fold, etc. That's why I went with mset-fold. Also, we are
folding over a single mset (match set). So, mset should be in the singular.
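
A short sketch of the convention follows. Note that the mset-fold defined
here is a toy stand-in operating on a plain list of (name . weight) pairs; it
is not guile-xapian's definition, and top-names is a hypothetical helper.

```scheme
(use-modules (srfi srfi-1))

;; Existing Guile convention: TYPE-fold, with the container as last argument.
(display (string-fold cons '() "abc"))
(newline)

;; Toy stand-in, NOT guile-xapian's definition: a plain list of
;; (name . weight) pairs plays the role of a single mset.
(define (mset-fold proc init mset)
  (fold proc init mset))

;; Hypothetical helper: collect the names in an mset, in match order.
(define (top-names mset)
  (reverse (mset-fold (lambda (item acc) (cons (car item) acc))
                      '()
                      mset)))

(display (top-names '(("inkscape" . 0.9) ("inklingreader" . 0.4))))
(newline)
```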

> And more importantly, 'make as-derivations' to avoid a "guix pull" breakage,
> Ah do not forget to adapt some tests.

Will do this once we have consensus about the other features of this patchset.

>  b. The xapian relevance should truncated

Done in this patchset.

> Xapian does not return the package 'emacs' itself as the first. And worse,
> it is not returned at all.

In this patchset, since we're indexing the package name as well, emacs is
returned, but it is still far from the top of the results.

> I propose the value of 4294967295 for pagesize.

In this patchset, I pass (database-document-count db) as the #:maximum-items
keyword argument to enquire-mset. This is the upstream-recommended way to get
all search results. I hadn't done this earlier since I hadn't yet wrapped
database-document-count in guile-xapian.

>> In this patchset, I have only indexed the package descriptions. In the next
>> version of this patchset, I will index all other terms as specified in
>> %package-metrics of guix/ui.scm.
> Yes, it appears to me a detail that should be easy to fix. I mean, it
> does not seems blocking.

Done in this patchset.

Ludovic Courtès <address@hidden> writes:

> Note that ‘guix search’ time is largely dominated by I/O.

Yes, `guix search` is I/O intensive. That is why I expect Xapian to do better,
since it only needs to access the matching packages, not all packages. Also,
the Xapian index does not depend much on a warm filesystem cache, so it should
stay fast in the cold-cache case too.

> On my laptop,
> I get (first measurement is cold cache, second one is warm cache):
> --8<---------------cut here---------------start------------->8---
> $ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
> $ time guix search foo >/dev/null
> real    0m2.631s
> user    0m1.134s
> sys     0m0.124s
> $ time guix search foo >/dev/null
> real    0m0.836s
> user    0m1.027s
> sys     0m0.053s
> --8<---------------cut here---------------end--------------->8---
> It’s hard to do better on the warm cache case because at this level,
> there may be other things to optimize having little to do with searching
> itself.
> Note that this is on an SSD; the cold-cache case must be worse on NFS or
> on a spinning disk, and there we could gain a lot.

My laptop is quite old with a particularly slow HDD. Hence my motivation to
improve guix search performance!

> I think we should weigh the pros and cons on all these aspects: speed,
> complexity and maintenance cost, search result quality, search features,
> etc.

I agree.

> PS: I have not yet looked at the whole series as I’m just coming back to
>     the keyboard.  :-)

Welcome back! :-)

Arun Isaac (3):
  build-self: Add guile-xapian to Guix dependencies.
  gnu: Generate Xapian package search index.
  gnu: Use Xapian index for package search.

 build-aux/build-self.scm | 11 +++++++
 gnu/packages.scm         | 62 +++++++++++++++++++++++++++++++++++++++-
 guix/channels.scm        | 34 +++++++++++++++++++++-
 guix/scripts/package.scm |  7 +++--
 guix/self.scm            |  7 ++++-
 guix/ui.scm              | 37 ++++++++++++++++++++++++
 6 files changed, 153 insertions(+), 5 deletions(-)

