guix-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] git-download: Speed up 'git-predicate'.


From: Christopher Baines
Subject: Re: [PATCH] git-download: Speed up 'git-predicate'.
Date: Mon, 19 Jun 2017 08:24:24 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0

On 07/06/17 13:52, Ludovic Courtès wrote:
> Christopher Baines <address@hidden> skribis:
> 
>> Adjust 'git-predicate' to use data structures that perform better when used
>> with git repositories with a large number of files.
>>
>> Previously when matching either a regular file or directory, 'git-predicate'
>> would search a list with a length equal to the number of files in the
>> repository. As a search operation happens for roughly every file in the
>> repository, this meant that the time taken to use 'git-predicate' to traverse
>> all the files in a repository was roughly exponential with respect to the
>> number of files in the repository.
>>
>> Now, for matching regular files or symlinks, 'git-predicate' uses a vhash
>> using the inode value as the key. This should perform roughly in constant
>> amount of time, instead of linear with respect to the number of files in the
>> repository.
>>
>> For matching directories, 'git-predicate' now uses a tree structure stored in
>> association lists. To check if a directory is in the tree, the tree is
>> traversed from the root. The time complexity of this depends on the shape of
>> the tree, but it should be an improvement on searching through the list of 
>> all
>> files.
> 
> Great, more than welcome it seems.  :-)
> 
> Do you know how much the inode optimization vs. the tree structure
> contributes to the performance.

I've got some more data. I ran the test script for smart-answers, and
let it complete this time:

  real    97m21.291s
  user    120m50.400s
  sys     0m31.020s

Just applying the inode optimization gives this result:

  real    90m28.116s
  user    100m44.784s
  sys     0m18.524s

I'm going to assume that using the tree structure for directories makes
up the rest of the difference. This will vary between repositories
though, I think smart answers has a unusually large directory to file ratio.

>> * guix/git-download.scm (git-predicate): Use different data structures to
>>   speed up 'git-predicate' with a large number of files.
> 
> [...]
> 
>> +  (define (create-directory-tree files)
>> +    (define (directory-lists->tree directory-lists)
>> +      (map (lambda (top-level-dir)
>> +             (cons top-level-dir
>> +                   (directory-lists->tree
>> +                    (filter-map
>> +                     (lambda (directory-list)
>> +                       (if (eq? (length directory-list) 1)
>> +                           #f
>> +                           (cdr directory-list)))
>> +                     ;; Find all the directory lists under this 
>> top-level-dir
>> +                     (filter
>> +                      (lambda (directory-list)
>> +                        (equal? (car directory-list)
>> +                                top-level-dir))
>> +                      directory-lists)))))
>> +           (delete-duplicates
>> +            (map car directory-lists))))
>> +
>> +    (directory-lists->tree
>> +     (filter-map (lambda (path)
>> +                   (let ((split-path (string-split path #\/)))
>> +                     ;; If this is a file in the top of the repository?
>> +                     (if (eq? (length split-path) 1)
>> +                         #f
>> +                         ;; drop-right to remove the filename, as it's
>> +                         ;; just the directory tree that's important
>> +                         (drop-right (string-split path #\/) 1))))
>> +                 files)))
>> +
>> +  (define (directory-in-tree? directory tree)
>> +    (define (directory-list-in-tree? directory-list tree)
>> +      (if (eq? (length directory-list) 1)
>> +          (list? (member (car directory-list)
>> +                         (map car tree)))
>> +          (and=> (find (match-lambda
>> +                         ((top-level-dir . subtree)
>> +                          (equal? top-level-dir
>> +                                  (car directory-list))))
>> +                       tree)
>> +                 (match-lambda
>> +                   ((top-level-dir . subtree)
>> +                    (directory-list-in-tree? (cdr directory-list)
>> +                                             subtree))))))
>> +
>> +    (directory-list-in-tree? (string-split directory #\/)
>> +                             tree))
> 
> Note that ‘length’ and ‘list?’ are O(n).  I think ‘directory-in-tree?’
> should be written using ‘match’, which would avoid that altogether.

I've sent an updated patch now, and I think I've made some progress
towards this.

> Likewise, the (map car …) call for ‘match’.  :-)

I'm stuck on this bit though, in the latest patch it reads:

  (list? (member top-directory (map car tree))

The list? call is just to turn the list or #f returned by member to #t
or #f. The (map car tree) converts the tree to a list of top level
directories. This bit of code is used when the last directory in the
input list has been reached (e.g. when checking for foo/bar/baz
top-directory would be baz) so the last check to perform is to check if
baz exists at the current level of the tree. Any suggestions on
restructuring this?

> I find the tree implementation hard to grasp.  Perhaps it would help to
> move it outside of the ‘git-predicate’ function and perhaps decompose it
> a bit more?  Thoughts?

I've moved it directly above git-predicate for now.

>> +         (inodes-vhash   (alist->vhash
>> +                          (map
>> +                           (lambda (file)
>> +                             (let ((stat
>> +                                    (lstat (string-append directory "/" 
>> file))))
>> +                               (cons (stat:ino stat) (stat:dev stat))))
>> +                           files)))
> 
> I would call it ‘inodes’ simply.  Also, we could use ‘list->set’ from
> (guix sets) here.

I've made the inodes-vhash -> inodes rename, but I was undecided about
using (guix sets), is there a reason you recommended it?

Thnaks for your review,

Chris




Attachment: signature.asc
Description: OpenPGP digital signature


reply via email to

[Prev in Thread] Current Thread [Next in Thread]