[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#24937: "deleting unused links" GC phase is too slow

From: Ludovic Courtès
Subject: bug#24937: "deleting unused links" GC phase is too slow
Date: Tue, 09 Nov 2021 15:44:49 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)


ludo@gnu.org (Ludovic Courtès) skribis:

> ‘LocalStore::removeUnusedLinks’ traverses all the entries in
> /gnu/store/.links and calls lstat(2) on each one of them and checks
> ‘st_nlink’ to determine whether they can be deleted.
> There are two problems: lstat(2) can be slow on spinning disks as found
> on hydra.gnu.org, and the algorithm is proportional in the number of
> entries in /gnu/store/.links, which is a lot on hydra.gnu.org.

Taking a step back, we could perhaps mitigate this with heuristics to
reduce the number of entries in .links:

  1. Do not deduplicate files with a size lower than some threshold;

  2. Delete links with st_nlink <= 3 (instead of <= 2); that would
     prevent *further* deduplication of those files, but they’d already
     have two instances sharing the same inode;

  3. Stop deduplicating once the number of entries in .links has reached
     a certain threshold.

For #1, a key insight is that about 30% of the files actually
deduplicated (in my store, where /gnu/store/.links has 2.2M entries) are
smaller than 1 KiB:

PNG image

… but 85% of them have at most 4 links (thus, saving up to 2 KiB by

PNG image

On my laptop, we’re talking about space savings of 325 MiB, a tiny
fraction of my store:

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> (saved-space (filter (lambda (file)
                                            (< (deduplicated-file-size file) 
$40 = 325914739
--8<---------------cut here---------------end--------------->8---

Files smaller than 1 KiB represent 35% of the entries in .links:

PNG image

By not deduplicating files smaller than 1 KiB, we’d reduce the number of
entries by 35%, which should already have a tangible impact on
performance.  It’d be a “mitigation” more than a “fix”, but it has a
good work/reward ratio.

We could conduct a similar analysis for #2.

#3 is more difficult to implement because you cannot know the number of
entries in .links until you’ve traversed it (note that currently
deduplication stops when link(2) returns ENOSPC in .links).

I’m attaching the script I’ve used for that, derived from an earlier
experiment¹.  You’re welcome to give it a spin!



¹ https://lists.gnu.org/archive/html/guix-devel/2014-09/msg00422.html

(use-modules (charting)
             ((guix store) #:select (%store-prefix))
             (ice-9 ftw)
             (ice-9 match)
             (srfi srfi-1)
             (srfi srfi-9))

(define-record-type <deduplicated-file>
  (deduplicated-file name size links)
  (name    deduplicated-file-name)
  (size    deduplicated-file-size)
  (links   deduplicated-file-link-count))

(define %links-directory
  (string-append (%store-prefix) "/.links"))

(define (links)
  "Return a list of <deduplicated-file>."
  (file-system-fold (const #t)
                    (lambda (file stat result)      ;leaf
                      (cons (deduplicated-file file
                                               (stat:size stat)
                                               (stat:nlink stat))
                    (lambda (directory stat result) ;down
                    (lambda (directory stat result) ;up
                    (const #f)                      ;skip
                    (lambda (file stat errno result)
                      (error "i/o error" file errno))

(define KiB (expt 2 10))
(define MiB (* KiB KiB))
(define GiB (* KiB MiB))

(define (saved-space files)
  "Return the total amount of saved space given FILES, a list of
  (fold (lambda (df result)
          (match df
            (($ <deduplicated-file> name size links)
             (when (< links 2)
               (error "too few links" name links))
             (+ result (* size (- links 2))))))

(define (cumulative-distribution files property)
  "Return a list of (VALUE . COUNT) pairs representing the number of FILES
whose PROPERTY is VALUE or less."
  (define (file<? df1 df2)
    (< (property df1) (property df2)))

  (fold (lambda (df result)
          (match result
            (((value . count) . rest)
             (let ((current (property df)))
               (if (= value current)
                   (alist-cons value (+ count 1) rest)
                   (alist-cons current (+ count 1) result))))
             (alist-cons (property df) 1 result))))
        (sort files file<?)))

(define* (plot-distribution distribution output
                            #:key (subtitle "SUBTITLE") max-x
                            (group-name "GROUP") x-axis-label)
  "Plot DISTRIBUTION, and produce file OUTPUT."
  (define (log2 n)
    (let loop ((result 1))
      (if (zero? (ash n (- result)))
          (- result 1)
          (loop (+ 1 result)))))

  (define (format-log2-tick tick)
    ;; (string-append "2^"
    ;;                (number->string (log2 (inexact->exact tick))))
    (number->string (inexact->exact tick)))

  (define (adjust-items total)
    (lambda (x)
      (match x
        ;; XXX: Filter out the two cases that would give us a numerical
        ;; overflow.
        ((0 . _) #f)
        ((1 . _) #f)
        ((value . count)
         (and (or (not max-x) (< value max-x))
              (cons value (* 100. (/ count total))))))))

  (match distribution
    (((_ . total) . rest)
     (let ((percent (filter-map (adjust-items total) distribution)))
       (make-scatter-plot #:title (string-append "Cumulative distribution by "
                          #:data `((,group-name ,@percent))
                          #:x-axis-label x-axis-label
                          #:y-axis-label "%"
                          #:tick-label-formatter format-log2-tick
                          #:log-x-base 2
                          #:min-x 1
                          #:max-y 101
                          #:write-to-png output)))))

#! Examples
(define l (links))  ;this is the expensive part
(plot-distribution (cumulative-distribution l deduplicated-file-link-count)
                   "/tmp/nlink.png" #:x-axis-label "number of hard links"
                   #:subtitle "hard link count" #:max-x 2048
                   #:group-name "nlinks")

(plot-distribution (cumulative-distribution
                    (filter (lambda (file)
                              (< (deduplicated-file-size file) 1024))
                   "/tmp/nlink-small.png" #:x-axis-label "number of hard links"
                   #:subtitle "hard link count for files < 1KiB" #:max-x 2048
                   #:group-name "nlinks")

(plot-distribution (cumulative-distribution l deduplicated-file-size)
                   "/tmp/size.png" #:x-axis-label "file size"
                   #:subtitle "file size" #:max-x 32768
                   #:group-name "size (B)")

(plot-distribution (cumulative-distribution
                    (filter (lambda (f)
                              (> (deduplicated-file-link-count f) 2))
                   "/tmp/size-deduplicated.png" #:x-axis-label "file size" 
                   "size for files actually deduplicated" #:max-x 32768
                   #:group-name "size (B)")

(plot-distribution (cumulative-distribution
                    (filter (lambda (file)
                              (< (deduplicated-file-size file) 1024))
                    (lambda (file)
                      (* (deduplicated-file-size file)
                         (- (deduplicated-file-link-count file) 2))))
                   "/tmp/size-savings.png" #:x-axis-label "savings" #:subtitle
                   "savings for files < 1KiB" #:max-x 32768
                   #:group-name "savings (B)")

reply via email to

[Prev in Thread] Current Thread [Next in Thread]