[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH 2/2] qcow2: Avoid memory over-allocation on comp
From: |
Eric Blake |
Subject: |
Re: [Qemu-devel] [PATCH 2/2] qcow2: Avoid memory over-allocation on compressed images |
Date: |
Wed, 21 Feb 2018 09:59:25 -0600 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0 |
On 02/21/2018 09:00 AM, Eric Blake wrote:
On 02/21/2018 04:04 AM, Alberto Garcia wrote:
On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:
I was also preparing a patch to change this, but you arrived first :-)
So, it's time to cut back on the waste. A compressed cluster
will NEVER occupy more than an uncompressed cluster
And here, csize can only get smaller than the length picked by
nb_csectors. Therefore, csize is GUARANTEED to be <= c->sector_size.
- Solution a: check that csize < s->cluster_size and return an error if
it's not. However! although QEMU won't produce an image with a
compressed cluster that is larger than the uncompressed one, the qcow2
on-disk format in principle allows for that, so arguably we should
accept it.
No, the qcow2 on-disk format does not have enough bits reserved for
that. It is impossible to store an inflated cluster, because you don't
have enough bits to record it.
Okay, the spec is WRONG, compared to our code base.
That said, we MAY have a bug, more likely to be visible in 512-byte
clusters but possible on any size. While the 'number sectors' field IS
sufficient for any compressed cluster starting at offset 0 in the
cluster, we run into issues if the starting offset is later in the
cluster. That is, even though the compressed length of a 512-byte
cluster is <= 512 (or we wouldn't compress it), if we start at offset
510, we NEED to read the next cluster to get the rest of the compressed
stream - but at 512-byte clusters, there are 0 bits reserved for 'number
sectors'. Per the above calculations with the example offset of 510,
nb_csectors would be 1 (it can't be anything else for a 512-byte cluster
image), and csize would then be 2 bytes, which is insufficient for
reading back enough data to reconstruct the cluster.
In fact, here's a demonstration of a discrepancy, where qemu-img and
John's qcheck tool [1] disagree about the validity of an image created
by qemu (although it may just be that qcheck hasn't yet learned about
compressed clusters):
[1]https://github.com/jnsnow/qcheck
$ f=12345678
$ f=$f$f$f$f # 32
$ f=$f$f$f$f # 128
$ f=$f$f$f$f # 512
$ f=$f$f$f$f # 2k
$ f=$f$f$f$f # 8k
$ f=$f$f$f$f # 32k
$ f=$f$f$f$f # 128k
$ printf "$f" > data
$ qemu-img convert -c -f raw -O qcow2 -o cluster_size=512 \
data data.qcow2
$ qemu-img check data.qcow2
No errors were found on the image.
256/256 = 100.00% allocated, 100.00% fragmented, 100.00% compressed clusters
Image end offset: 18944
$ ./qcheck data.qcow2
...
== L2 Tables ==
== L2 cluster l1[0] : 0x0000000000000800 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
== L2 cluster l1[1] : 0x0000000000000e00 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
== L2 cluster l1[2] : 0x0000000000001400 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
== L2 cluster l1[3] : 0x0000000000001a00 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
L2 tables: Could not complete analysis, 257 problems found
== Reference Count Analysis ==
Refcount analysis: 00 vacant clusters
Refcount analysis: 04 leaked clusters
Refcount analysis: 00 ghost clusters
Refcount analysis: 04 miscounted clusters
Refcount analysis: 8 problems found
== Cluster Counts ==
Metadata: 0x1000
Data: 0x800
Leaked: 0x800
Vacant: 0x0
total: 0x2000
qcheck: 73 problems found
Not true. It is (cluster_bits - 9) or (cluster_size / 512). Remember,
x = 62 - (cluster_bits - 8); for a 512-byte cluster, x = 61. The
'number sectors' field is then bits x+1 - 61 (but you can't have a
bitfield occupying bit 62 upto 61; especially since bit 62 is the bit
for compressed cluster).
So instead of blindly reading the spec, I'm now going to single-stepping
through the 'qemu-img convert' command line above, to see what REALLY
happens:
Line numbers from commit a6e0344fa0
$ gdb --args ./qemu-img convert -c -f raw -O qcow2 -o cluster_size=512
data data.qcow2
...
(gdb) b qcow2_alloc_bytes
Breakpoint 1 at 0x57610: file block/qcow2-refcount.c, line 1052.
(gdb) r
Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
address@hidden, address@hidden)
at block/qcow2-refcount.c:1052
1052 {
(gdb)
So we are compressing 512 bytes down to 15 every time, which means that
after 34 clusters compressed, we should be at offset 510. Let's resume
debugging:
(gdb) c 34
Will ignore next 33 crossings of breakpoint 1. Continuing.
[Thread 0x7fffe3cfe700 (LWP 32229) exited]
[New Thread 0x7fffe3cfe700 (LWP 32300)]
[New Thread 0x7fffe25ed700 (LWP 32301)]
Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
address@hidden, address@hidden)
at block/qcow2-refcount.c:1052
1052 {
(gdb) n
1053 BDRVQcow2State *s = bs->opaque;
(gdb)
1058 BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_BYTES);
(gdb)
1059 assert(size > 0 && size <= s->cluster_size);
(gdb) p s->free_byte_offset
$2 = 3070
(gdb) p 3070%512
$3 = 510
...
(gdb)
1076 free_in_cluster = s->cluster_size - offset_into_cluster(s, offset);
(gdb)
1078 if (!offset || free_in_cluster < size) {
(gdb) p free_in_cluster
$4 = 2
1079 int64_t new_cluster = alloc_clusters_noref(bs,
s->cluster_size);
(gdb)
1080 if (new_cluster < 0) {
(gdb)
1084 if (new_cluster == 0) {
(gdb)
1091 if (!offset || ROUND_UP(offset, s->cluster_size) !=
new_cluster) {
(gdb)
1095 free_in_cluster += s->cluster_size;
(gdb)
1099 assert(offset);
so we got a contiguous cluster, and our goal is to let the caller bleed
the compressed cluster into to the tail of the current sector and into
the head of the next cluster. Continuing:
(gdb) fin
Run till exit from #0 qcow2_alloc_bytes (address@hidden,
address@hidden) at block/qcow2-refcount.c:1118
[Thread 0x7fffe25ed700 (LWP 32301) exited]
[Thread 0x7fffe3cfe700 (LWP 32300) exited]
qcow2_alloc_compressed_cluster_offset (address@hidden,
address@hidden, address@hidden)
at block/qcow2-cluster.c:768
768 if (cluster_offset < 0) {
Value returned is $5 = 3070
(gdb) n
773 nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
(gdb)
774 (cluster_offset >> 9);
(gdb)
773 nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
(gdb)
777 ((uint64_t)nb_csectors << s->csize_shift);
(gdb) l
772
773 nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
774 (cluster_offset >> 9);
775
776 cluster_offset |= QCOW_OFLAG_COMPRESSED |
777 ((uint64_t)nb_csectors << s->csize_shift);
778
779 /* update L2 table */
780
781 /* compressed clusters never have the copied flag */
(gdb) p nb_csectors
$6 = 1
(gdb) p s->csize_shift
$7 = 61
(gdb) p/x cluster_offset
$8 = 0xbfe
(gdb) n
776 cluster_offset |= QCOW_OFLAG_COMPRESSED |
(gdb)
783 BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
(gdb) p/x cluster_offset
$9 = 0x6000000000000bfe
Where is s->csize_shift initialized? In qcow2_do_open():
s->csize_shift = (62 - (s->cluster_bits - 8));
s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;
s->cluster_offset_mask = (1LL << s->csize_shift) - 1;
Revisiting the wording in the spec:
Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
Bit 0 - x: Host cluster offset. This is usually _not_ aligned to a
cluster boundary!
x+1 - 61: Compressed size of the images in sectors of 512 bytes
which says bits 0-61 are the host cluster offset, and 62-61 is the
number of sectors. But our code sets s->csize_shift to divide this
differently, at 0-60 and 61-61. Which means your earlier claim that
there are enough 'number sector' bits to allow for up to 2*cluster_size
as the size of the compressed stream (rather than my claim of exactly
cluster_size) is right, and other implementations CAN inflate a cluster
(if we don't document otherwise), and that even if they DON'T inflate,
they can STILL cause a read larger than a cluster size when the offset
is near the tail of one sector (most likely at 512-byte clusters, but
remotely possible at other cluster sizes as well).
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org