guix-patches

[bug#39258] [PATCH v2 0/3] Xapian for Guix package search


From: zimoun
Subject: [bug#39258] [PATCH v2 0/3] Xapian for Guix package search
Date: Mon, 9 Mar 2020 13:28:08 +0100

Hi,

On Sat, 7 Mar 2020 at 14:31, Arun Isaac <address@hidden> wrote:

> --8<---------------cut here---------------start------------->8---
> With a warm cache,
> $ time guix search inkscape
>
> real    0m1.787s
> user    0m1.745s
> sys     0m0.111s
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> $ time /tmp/test/bin/guix search inkscape
>
> real    0m0.199s
> user    0m0.182s
> sys     0m0.024s
> --8<---------------cut here---------------end--------------->8---

IMHO, it would be interesting to compare the list of results and their
order for both queries, as I did with Emacs.
Speed is one thing, and it was the initial motivation. But accuracy is
maybe more important.
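
A minimal sketch of such a comparison (in Python for brevity, not Guix
code). The function name and the two result lists are made up for
illustration; real input would be the package names printed by each
'guix search' variant, best match first.

```python
def compare_rankings(old, new):
    """Compare two ranked result lists (best match first)."""
    old_rank = {name: i for i, name in enumerate(old)}
    new_rank = {name: i for i, name in enumerate(new)}
    common = [name for name in old if name in new_rank]
    return {
        "only_old": [n for n in old if n not in new_rank],
        "only_new": [n for n in new if n not in old_rank],
        # Positive shift: the package moved down in the new ranking.
        "shift": {n: new_rank[n] - old_rank[n] for n in common},
    }

# Hypothetical result lists, not real 'guix search' output.
current = ["emacs", "emacs-minimal", "emacs-magit"]
xapian = ["emacs-magit", "emacs-minimal", "emacs", "emacs-org"]
report = compare_rankings(current, xapian)
```

Here report["shift"]["emacs"] is 2, i.e., 'emacs' dropped two places
in the second list, which is the kind of regression worth spotting.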


> - The package cache would grow in size, and lookup would be slowed down
>   because we need to load the entire cache into memory. Xapian, on the other
>   hand, need only look up the specific packages that match the search query.

I agree that 'fold-packages' could soon become a bottleneck.

IMHO, 'mset-fold' should be a drop-in replacement for 'fold-packages'
in the search function.
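
To illustrate what "drop-in" could mean here, a rough Python sketch
(not Guix code). 'fold_all' and 'fold_matches' are hypothetical
stand-ins for 'fold-packages' and 'mset-fold'; their signatures are
assumptions, and the index lookup is faked with a substring test.

```python
from functools import reduce

PACKAGES = ["emacs", "emacs-magit", "inkscape", "vim"]  # hypothetical names

def fold_all(proc, init, query):
    """Stand-in for 'fold-packages': PROC sees every package."""
    return reduce(lambda acc, pkg: proc(pkg, acc), PACKAGES, init)

def fold_matches(proc, init, query):
    """Stand-in for 'mset-fold': PROC sees only what the index matched."""
    matched = [pkg for pkg in PACKAGES if query in pkg]  # pretend index lookup
    return reduce(lambda acc, pkg: proc(pkg, acc), matched, init)

def search(fold, query):
    """Collect matching packages; only the FOLD argument differs per backend."""
    def collect(pkg, acc):
        return acc + [pkg] if query in pkg else acc
    return fold(collect, [], query)
```

Both search(fold_all, "emacs") and search(fold_matches, "emacs")
return the same list; the second just visits fewer packages.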


> - Xapian can provide superior search results due to its stemming and language
>   models.
> - Xapian can provide spelling correction and query expansion -- that is,
>   suggest search terms to improve search results. Note that I haven't
>   implemented this yet and it is out of scope for this patchset.

I agree too that Xapian should improve the user experience when searching.


> * Simplify our package search results
>
> Why not use a simpler package search results format like Arch Linux or Debian
> does? We could just display the package name, version and synopsis like so.
>
> inkscape 0.92.4
>     Vector graphics editor
> inklingreader 0.8
>     Wacom Inkling sketch format conversion and manipulation
>
> Why do we need the entire recutils format? If the user is interested, they can
> always use `guix package --show` to get the full recutils formatted
> info. Having shorter search results will make everything even faster and much
> more readable. WDYT?

I disagree.

What I proposed some time ago was to have different flavours of the
search output, as with, e.g., 'git log --pretty=oneline'.
For example, the default would be what you suggest. Then "guix
search --format=full" would output the current format. And we could
imagine mimicking the Git log strategy, e.g.,
"guix search --format='%name %version\n%license'".
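
A minimal sketch of how such placeholders could be expanded (Python
for brevity, not Guix code). The '--format' option does not exist yet;
'render', 'PRESETS' and the "oneline" preset are hypothetical, and the
license value is only a placeholder.

```python
import re

def render(fmt, package):
    """Expand %field placeholders in FMT using the PACKAGE dict."""
    def expand(match):
        field = match.group(1)
        # Leave unknown placeholders untouched instead of failing.
        return str(package.get(field, match.group(0)))
    return re.sub(r"%([a-z]+)", expand, fmt)

# Hypothetical named preset, in the spirit of 'git log --pretty=oneline'.
PRESETS = {"oneline": "%name %version\n    %synopsis"}

pkg = {"name": "inkscape", "version": "0.92.4",
       "synopsis": "Vector graphics editor",
       "license": "LICENSE-PLACEHOLDER"}

short = render(PRESETS["oneline"], pkg)
custom = render("%name %version\n%license", pkg)
```

With this, 'short' is the two-line name/version/synopsis form quoted
above, and the full recutils output would just be one more preset.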

WDYT?



> > Is (make-stem "en") for the locale?
>
> I still have English hard-coded. I haven't yet figured out how to detect the
> locale and stem accordingly. But, there is a larger problem. Since we cannot
> anticipate what locale the user will run guix search with, should we build the
> Xapian index for all locales? That is, should we index not only the English
> versions of the packages but also all other translations as well?

I understand. Let us consider that for the next round.


> > package-search-index and package-cache-file could be refactored
> > because they share all the same code.
>
> Yes, they could be. However, I'll postpone to the next iteration of the
> patchset.

Ok.


> > I do not know what is the convention for the bindings.
> > But there is 'fold-packages' so I would be inclined to 'fold-msets' or
> > something in this flavour.
>
> Well, everywhere else in guile we have such things as vhash-fold, string-fold,
> hash-fold, stream-fold, etc. That's why I went with mset-fold. Also, we are
> folding over a single mset (match-set). So, mset should be in the singular.

I understand.


> > And more importantly, 'make as-derivations' to avoid a "guix pull" breakage,
> > Ah do not forget to adapt some tests.
>
> Will do this once we have consensus about the other features of this patchset.

And we should test that on different machines and states.



> > Xapian does not return the package 'emacs' itself as the first. And worse,
> > it is not returned at all.
>
> In this patchset, since we're indexing the package name as well, emacs is
> returned but it is still far from the beginning.

This is an issue.

IMHO, it is because of the BM25 score. It is too rough and some weight
should be applied. But that is another story.
The fix is:
 a- provide a scoring function to Xapian, as the doc explains
 b- adapt 'fold-packages' to 'mset-fold' in
    'find-packages-by-description', implement our own version of BM25,
    and then use it in 'relevance'
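
To make the weighting idea concrete, here is a rough Python sketch of
plain Okapi BM25 plus one possible weighting: counting the terms of
the package name several times. This is an assumption about what
"some weight" could mean, not what the patchset or Xapian does; the
'terms' helper, the 'name_weight' parameter and the toy packages are
all hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (a list of terms) against QUERY with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def terms(package, name_weight=1):
    """Tokenize a package; name terms are repeated NAME_WEIGHT times."""
    name_terms = package["name"].split("-") * name_weight
    return name_terms + package["description"].lower().split()

packages = [  # hypothetical toy data, not real Guix metadata
    {"name": "emacs", "description": "extensible text editor"},
    {"name": "emacs-magit", "description": "interface to git for emacs"},
    {"name": "vim", "description": "text editor"},
]

plain = bm25_scores(["emacs"], [terms(p) for p in packages])
boosted = bm25_scores(["emacs"], [terms(p, 5) for p in packages])
```

On this toy data, 'emacs-magit' scores above 'emacs' without the
boost and below it with name_weight=5. The margin is small because
BM25 saturates with term frequency, so whether such a boost is enough
on the real package set would have to be measured.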


> > I propose the value of 4294967295 for pagesize.
>
> In this patchset, I pass (database-document-count db) as the #:maximum-items
> keyword argument to enquire-mset. This is the upstream recommended way to get
> all search results. I hadn't done this earlier since I hadn't yet wrapped
> database-document-count in guile-xapian.

Cool!



> My laptop is quite old with a particularly slow HDD. Hence my motivation to
> improve guix search performance!

I agree.
But performance is not all. Accuracy counts more! :-)


> > I think we should weigh the pros and cons on all these aspects: speed,
> > complexity and maintenance cost, search result quality, search features,
> > etc.
>
> I agree.

I agree too.
We should write a benchmark. For example, using "emacs" as the query,
or more complex queries we could think of.
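
A minimal sketch of what such a benchmark could measure (Python for
brevity, not Guix code): time per query, plus the mean reciprocal rank
of the package we expect to see first. 'benchmark' and 'fake_search'
are hypothetical; a real backend would run 'guix search' and collect
the 'name:' fields in order.

```python
import time

def benchmark(search, cases):
    """Run SEARCH on each (query, expected) pair; return (mean seconds, MRR)."""
    total_time = 0.0
    total_rr = 0.0
    for query, expected in cases:
        start = time.perf_counter()
        results = search(query)
        total_time += time.perf_counter() - start
        # Reciprocal rank: 1 for first place, 1/2 for second, 0 if missing.
        if expected in results:
            total_rr += 1.0 / (results.index(expected) + 1)
    return total_time / len(cases), total_rr / len(cases)

# Stand-in search backend with canned, hypothetical results.
def fake_search(query):
    canned = {"emacs": ["emacs-magit", "emacs"], "inkscape": ["inkscape"]}
    return canned.get(query, [])

cases = [("emacs", "emacs"), ("inkscape", "inkscape")]
mean_seconds, mrr = benchmark(fake_search, cases)
```

Running the same cases against both the current search and the Xapian
one would give two (time, accuracy) pairs to compare.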


All the best,
simon




