
From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH v5 08/11] qcow2: Rebuild refcount structure during check
Date: Sat, 11 Oct 2014 12:17:20 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.0

On 09.10.2014 at 01:09, Eric Blake wrote:
On 08/29/2014 03:41 PM, Max Reitz wrote:
The previous commit introduced the "rebuild" variable to qcow2's
implementation of the image consistency check. Now make use of this by
adding a function which creates a completely new refcount structure
based solely on the in-memory information gathered before.

The old refcount structure will be leaked, however.
Might be worth mentioning in the commit message that a later commit will
deal with the leak.

Signed-off-by: Max Reitz <address@hidden>
---
  block/qcow2-refcount.c | 286 ++++++++++++++++++++++++++++++++++++++++++++++++-
  1 file changed, 283 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 6300cec..318c152 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1603,6 +1603,266 @@ static void compare_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
  }
/*
+ * Allocates a cluster using an in-memory refcount table (IMRT) in contrast to
+ * the on-disk refcount structures.
+ *
+ * *first_free_cluster does not necessarily point to the first free cluster, but
+ * may point to one cluster as close as possible before it. The offset returned
+ * will never be before that cluster.
Took me a couple reads of the comment and code to understand that.  If
I'm correct, this alternative wording may be better:

On input, *first_free_cluster tells where to start looking, and need not
actually be a free cluster; the returned offset will not be before that
cluster.  On output, *first_free_cluster points to the actual first free
cluster found.

Or, depending on the semantics you intended [1]:

On input, *first_free_cluster tells where to start looking, and need not
actually be a free cluster; the returned offset will not be before that
cluster.  On output, *first_free_cluster points to the first gap found,
even if that gap was too small to be used as the returned offset.

Yes, *first_free_cluster has nothing to do with the allocated cluster range; the function returns the offset of that range. *first_free_cluster should merely always point at or before the first gap (of any size), so that alloc_clusters_imrt() does not have to start searching at the beginning of the IMRT the next time it is called. However, the caller can also set it to an arbitrary value to restrict the search to the end of the IMRT (or even beyond the end), so it has a dual use.

+ *
+ * Note that *first_free_cluster is a cluster index whereas the return value is
+ * an offset.
+ */
+static int64_t alloc_clusters_imrt(BlockDriverState *bs,
+                                   int cluster_count,
+                                   uint16_t **refcount_table,
+                                   int64_t *nb_clusters,
+                                   int64_t *first_free_cluster)
+{
+    BDRVQcowState *s = bs->opaque;
+    int64_t cluster = *first_free_cluster, i;
+    bool first_gap = true;
+    int contiguous_free_clusters;
+
+    /* Starting at *first_free_cluster, find a range of at least cluster_count
+     * continuously free clusters */
+    for (contiguous_free_clusters = 0;
+         cluster < *nb_clusters && contiguous_free_clusters < cluster_count;
+         cluster++)
+    {
+        if (!(*refcount_table)[cluster]) {
+            contiguous_free_clusters++;
+            if (first_gap) {
+                /* If this is the first free cluster found, update
+                 * *first_free_cluster accordingly */
+                *first_free_cluster = cluster;
+                first_gap = false;
+            }
+        } else if (contiguous_free_clusters) {
+            contiguous_free_clusters = 0;
+        }
[1] Should you be resetting first_gap in the 'else'?  If you don't, then
*first_free_cluster is NOT the start of the cluster just allocated, but
the first free cluster encountered on the way to the eventual
allocation.  I guess it depends on how the callers are using the
information; since the function is static, I guess I'll find out later
in my review.

Yes, *first_free_cluster is not that allocated cluster. I don't keep the offset in a dedicated variable, because it'll always be cluster - contiguous_free_clusters (see the next comment).
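
To make the two roles concrete, here is a minimal sketch of the search loop on a plain array. It is a simplified stand-in, not the patch's code: find_free_range() is a hypothetical name, the table never grows, and the result is a cluster index rather than a byte offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the search loop in alloc_clusters_imrt():
 * no table growth and no shift by cluster_bits. Returns the index of the
 * first cluster of a free range of `count` clusters, or -1 if none fits. */
static int64_t find_free_range(const uint16_t *table, int64_t nb_clusters,
                               int count, int64_t *first_free_cluster)
{
    int64_t cluster = *first_free_cluster;
    int contiguous = 0;
    bool first_gap = true;

    assert(count > 0);

    for (; cluster < nb_clusters && contiguous < count; cluster++) {
        if (table[cluster] == 0) {
            contiguous++;
            if (first_gap) {
                /* Search hint: the first free cluster seen, even if the
                 * gap it belongs to turns out to be too small */
                *first_free_cluster = cluster;
                first_gap = false;
            }
        } else {
            contiguous = 0;
        }
    }

    if (contiguous < count) {
        return -1;
    }
    /* Start of the gap that actually fits */
    return cluster - contiguous;
}
```

With the table {1, 0, 1, 0, 0, 1} and a request for two clusters, the returned index is 3, while the hint is left at 1 (the one-cluster gap that was skipped).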

+    }
+
+    /* If contiguous_free_clusters is greater than zero, it contains the number
+     * of continuously free clusters until the current cluster; the first free
+     * cluster in the current "gap" is therefore
+     * cluster - contiguous_free_clusters */
+
+    /* If no such range could be found, grow the in-memory refcount table
+     * accordingly to append free clusters at the end of the image */
+    if (contiguous_free_clusters < cluster_count) {
+        int64_t old_nb_clusters = *nb_clusters;
+
+        /* There already is a gap of contiguous_free_clusters; we need
s/gap/tail/, since we are at the end of the table?

Well, tail doesn't imply that it's empty. I could change it to "contiguous_free_clusters clusters are already empty at the image end".

+         * cluster_count clusters; therefore, we have to allocate
+         * cluster_count - contiguous_free_clusters new clusters at the end of
+         * the image (which is the current value of cluster; note that cluster
+         * may exceed old_nb_clusters if *first_free_cluster pointed beyond the
+         * image end) */
+        *nb_clusters = cluster + cluster_count - contiguous_free_clusters;
+        *refcount_table = g_try_realloc(*refcount_table,
+                                        *nb_clusters * sizeof(uint16_t));
+        if (!*refcount_table) {
+            return -ENOMEM;
+        }
+
+        memset(*refcount_table + old_nb_clusters, 0,
+               (*nb_clusters - old_nb_clusters) * sizeof(uint16_t));
Is this calculation unnecessarily hard-coded to refcount_order==4?

Seems like it. Shame on me. ;-)
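
One possible way to avoid the sizeof(uint16_t) assumption is to derive the byte size from refcount_order, since a qcow2 refcount entry is 1 << refcount_order bits wide. The helper below is only a sketch: refcount_array_bytes() is a hypothetical name, the 0..6 range check is an assumption, and overflow of the bit count is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the patch): number of bytes needed to
 * store `entries` refcounts of (1 << refcount_order) bits each, rounded up
 * to whole bytes. */
static size_t refcount_array_bytes(uint64_t entries, int refcount_order)
{
    uint64_t bits;

    /* Assumed valid range for this sketch */
    assert(refcount_order >= 0 && refcount_order <= 6);

    bits = entries << refcount_order;
    return (size_t)((bits + 7) / 8);
}
```

For refcount_order == 4 this reduces to entries * sizeof(uint16_t), which is what the patch currently hard-codes.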

+    }
+
+    /* Go back to the first free cluster */
+    cluster -= contiguous_free_clusters;
+    for (i = 0; i < cluster_count; i++) {
+        (*refcount_table)[cluster + i] = 1;
+    }
+
+    return cluster << s->cluster_bits;
+}
+
+/*
+ * Creates a new refcount structure based solely on the in-memory information
+ * given through *refcount_table. All necessary allocations will be reflected
+ * in that array.
+ *
+ * On success, the old refcount structure is leaked (it will be covered by the
+ * new refcount structure).
+ */
+static int rebuild_refcount_structure(BlockDriverState *bs,
+                                      BdrvCheckResult *res,
+                                      uint16_t **refcount_table,
+                                      int64_t *nb_clusters)
+{
+    BDRVQcowState *s = bs->opaque;
+    int64_t first_free_cluster = 0, rt_ofs = -1, cluster = 0;
+    int64_t rb_ofs, rb_start, rb_index;
+    uint32_t reftable_size = 0;
+    uint64_t *reftable = NULL;
+    uint16_t *on_disk_rb;
+    int i, ret = 0;
ret is 0...

+    struct {
+        uint64_t rt_offset;
+        uint32_t rt_clusters;
+    } QEMU_PACKED rt_offset_and_clusters;
+
+    qcow2_cache_empty(bs, s->refcount_block_cache);
+
+write_refblocks:
+    for (; cluster < *nb_clusters; cluster++) {
+        if (!(*refcount_table)[cluster]) {
+            continue;
+        }
+
+        rb_index = cluster >> s->refcount_block_bits;
+        rb_start = rb_index << s->refcount_block_bits;
+
+        /* Don't allocate a cluster in a refblock already written to disk */
+        if (first_free_cluster < rb_start) {
+            first_free_cluster = rb_start;
+        }
+        rb_ofs = alloc_clusters_imrt(bs, 1, refcount_table, nb_clusters,
+                                     &first_free_cluster);
[1] looking back at my earlier question, you are starting each iteration
no earlier than the current rb_start.  But if you end up jumping back to
write_refblocks, are you guaranteed that rb_start is safely far enough
into the file, even if first_free_cluster is pointing to a gap that was
too small for an allocation?

"cluster" is never decreased; it only grows. Therefore, rb_start will also grow or at least stay the same. The if condition before the alloc_clusters_imrt() call should keep first_free_cluster far enough into the file, shouldn't it? Other than that, what do you mean by "safely"? All that's important is that we don't allocate a cluster before the current refblock ("Don't allocate a cluster in a refblock already written to disk"). Apart from that, all allocated offsets are fine, and alloc_clusters_imrt() will always return a correctly allocated range at or after first_free_cluster, so there shouldn't be any problem.

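The clamping argument can be sketched in isolation. clamp_search_start() is a hypothetical helper that only mirrors the rb_index/rb_start computation and the if condition from the loop; the value 15 used below for refcount_block_bits assumes 64 KiB clusters with 16-bit refcounts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns the search start for the next allocation,
 * i.e. the hint raised to the start of the refblock covering `cluster`,
 * so nothing is allocated in a refblock already written to disk. */
static int64_t clamp_search_start(int64_t first_free_cluster, int64_t cluster,
                                  int refcount_block_bits)
{
    int64_t rb_index, rb_start;

    assert(refcount_block_bits >= 0 && refcount_block_bits < 63);

    rb_index = cluster >> refcount_block_bits;
    rb_start = rb_index << refcount_block_bits;

    return first_free_cluster < rb_start ? rb_start : first_free_cluster;
}
```

Because cluster never decreases, rb_start never decreases either, so the clamped value can only move forward from one iteration to the next.
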
+        if (rb_ofs < 0) {
+            fprintf(stderr, "ERROR allocating refblock: %s\n", strerror(-ret));
...but if we hit this error on the first time through the for loop,
strerror(0) is NOT what you meant to print.  Did you mean
strerror(-rb_ofs) here?

Yes, it should be -rb_ofs.

+            res->check_errors++;
+            ret = rb_ofs;
Narrowing from int64_t to int; but I guess we know that if rb_ofs < 0,
it is only -1, and not something weird like -0x100000000.  Is the goal
that ret is -1/0, or are you trying to encode negative errno values in
the return?

On failure, alloc_clusters_imrt() returns -errno (only -ENOMEM, actually), so this is -errno as well.
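
The intended pattern (report the value the allocator returned, then narrow it to int) might look like the sketch below. fake_alloc() and alloc_and_report() are hypothetical stand-ins and not functions from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an allocator that returns an offset on
 * success or a negative errno value on failure */
static int64_t fake_alloc(int fail)
{
    return fail ? -ENOMEM : 65536;
}

/* Returns 0 on success or -errno on failure */
static int alloc_and_report(int fail, int64_t *offset)
{
    int64_t ofs;

    assert(offset != NULL);

    ofs = fake_alloc(fail);
    if (ofs < 0) {
        /* Print the error that was actually returned; a separate `ret`
         * variable may still hold 0 at this point */
        fprintf(stderr, "ERROR allocating refblock: %s\n",
                strerror((int)-ofs));
        /* Narrowing is safe here: negative errno values fit in an int */
        return (int)ofs;
    }

    *offset = ofs;
    return 0;
}
```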

+            goto fail;
+        }
+
+        if (reftable_size <= rb_index) {
+            uint32_t old_rt_size = reftable_size;
+            reftable_size = ROUND_UP((rb_index + 1) * sizeof(uint64_t),
+                                     s->cluster_size) / sizeof(uint64_t);
+            reftable = g_try_realloc(reftable,
+                                     reftable_size * sizeof(uint64_t));
+            if (!reftable) {
+                res->check_errors++;
+                ret = -ENOMEM;
+                goto fail;
+            }
+
+            memset(reftable + old_rt_size, 0,
+                   (reftable_size - old_rt_size) * sizeof(uint64_t));
+
+            /* The offset we have for the reftable is now no longer valid;
+             * this will leak that range, but we can easily fix that by running
+             * a leak-fixing check after this rebuild operation */
+            rt_ofs = -1;
+        }
+        reftable[rb_index] = rb_ofs;
+
+        /* If this is apparently the last refblock (for now), try to squeeze the
+         * reftable in */
+        if (rb_index == (*nb_clusters - 1) >> s->refcount_block_bits &&
+            rt_ofs < 0)
+        {
+            rt_ofs = alloc_clusters_imrt(bs, size_to_clusters(s, reftable_size *
+                                                              sizeof(uint64_t)),
+                                         refcount_table, nb_clusters,
+                                         &first_free_cluster);
+            if (rt_ofs < 0) {
+                fprintf(stderr, "ERROR allocating reftable: %s\n",
+                        strerror(-ret));
Again, -ret looks wrong here.

Yes, should be -rt_ofs.

+                res->check_errors++;
+                ret = rt_ofs;
+                goto fail;
+            }
+        }
+
+        ret = qcow2_pre_write_overlap_check(bs, 0, rb_ofs, s->cluster_size);
+        if (ret < 0) {
+            fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
+            goto fail;
+        }
+
+        on_disk_rb = g_malloc0(s->cluster_size);
Why g_try_malloc earlier, but abort()ing g_malloc0 here?

I use g_try_realloc/g_try_malloc for the reftable and g_malloc for the refblocks. The reftable can be arbitrarily large; a refblock is pretty limited in size (it's a cluster). If g_malloc fails because there's no room for a single cluster anymore, two things will happen: (1) The whole qcow2 driver will explode (because it still uses g_malloc for most if not all cluster allocations, afaik); (2) Linux will kill qemu because of OOM, we won't even be able to catch that by using g_try_malloc. As far as I've seen, using the try variants is only really useful if you may have some absolutely absurd size value.

+        for (i = 0; i < s->cluster_size / sizeof(uint16_t) &&
+                    rb_start + i < *nb_clusters; i++)
+        {
+            on_disk_rb[i] = cpu_to_be16((*refcount_table)[rb_start + i]);
+        }
+
+        ret = bdrv_write(bs->file, rb_ofs / BDRV_SECTOR_SIZE,
+                         (void *)on_disk_rb, s->cluster_sectors);
+        g_free(on_disk_rb);
+        if (ret < 0) {
+            fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
+            goto fail;
+        }
+
+        /* Go to the end of this refblock */
+        cluster = rb_start + s->cluster_size / sizeof(uint16_t) - 1;
+    }
+
+    if (rt_ofs < 0) {
+        int64_t post_rb_start = ROUND_UP(*nb_clusters,
+                                         s->cluster_size / sizeof(uint16_t));
+
+        /* Not pretty but simple */
+        if (first_free_cluster < post_rb_start) {
+            first_free_cluster = post_rb_start;
+        }
+        rt_ofs = alloc_clusters_imrt(bs, size_to_clusters(s, reftable_size *
+                                                          sizeof(uint64_t)),
+                                     refcount_table, nb_clusters,
+                                     &first_free_cluster);
+        if (rt_ofs < 0) {
+            fprintf(stderr, "ERROR allocating reftable: %s\n", strerror(-ret));
Another wrong -ret?

I guess it's just a habit to type strerror(-ret)...


