[Qemu-devel] [RFC PATCH v2] Specification for qcow2 version 3

From:	Kevin Wolf
Subject:	[Qemu-devel] [RFC PATCH v2] Specification for qcow2 version 3
Date:	Mon, 27 Jun 2011 17:11:54 +0200
This is the second draft for what I think could be added when we increase
qcow2's version number to 3. This includes points that have been made by
several people over the past few months. We're probably not going to implement
this next week, but I think it's important to get discussions started early, so
here it is.

Changes implemented in this RFC:

- Added compatible/incompatible/auto-clear feature bits plus an optional
  feature name table to allow useful error messages even if an older version
  doesn't know some feature at all.

- Added a dirty flag which indicates that the refcounts may not be accurate
  ("QED mode"). This means that we can save writes to the refcount table with
  cache=writethrough, but it isn't really useful otherwise now that we have
  Qcow2Cache.

- Configurable refcount width. If you don't want to use internal snapshots,
  make refcounts one bit and save cache space and I/O.

- Added subclusters. This separates the COW size (one subcluster, I'm thinking
  of 64k default size here) from the allocation size (one cluster, 2M). Less
  fragmentation, less metadata, but still reasonable COW granularity.

  This also allows preallocating clusters without allocating any of their
  subclusters. You can have an image that is like raw + COW metadata, and you
  can also preallocate metadata for images with backing files.

- Zero cluster flags. This allows discard even with a backing file that doesn't
  contain zeros. It is also useful for copy-on-read/image streaming, as you'll
  want to keep sparseness without accessing the remote image for an unallocated
  cluster all the time.

- Fixed internal snapshot metadata to use 64 bit VM state size. You can't save
  a snapshot of a VM with >= 4 GB RAM today.
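
To illustrate the feature bits item above, here is a minimal Python sketch of
how an implementation might treat the three bitmaps when opening an image. The
function name, the KNOWN_* masks and the exception type are hypothetical and
not part of this proposal; only the rules (fail on unknown incompatible bits,
ignore unknown compatible bits, clear unknown auto-clear bits before writing)
are taken from the spec text in the patch.

```python
# Hypothetical masks of the feature bits this implementation understands.
KNOWN_INCOMPATIBLE = 0b11  # bit 0: refcounts may be inaccurate, bit 1: subclusters
KNOWN_COMPATIBLE = 0       # no compatible features are defined in this draft
KNOWN_AUTOCLEAR = 0        # no auto-clear features are defined in this draft


def check_features(incompatible, compatible, autoclear, writable):
    """Return the autoclear value to keep; raise if the image must not be opened."""
    unknown = incompatible & ~KNOWN_INCOMPATIBLE
    if unknown:
        raise ValueError("unknown incompatible feature bits: %#x" % unknown)
    # Unknown compatible bits are deliberately ignored.
    if writable:
        # Unknown auto-clear bits have to be cleared before the image is written.
        autoclear &= KNOWN_AUTOCLEAR
    return autoclear
```

A read-only open leaves the auto-clear field untouched, so an implementation
that never writes can ignore it.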

Possible future additions:

- Add per-L2-table dirty flag to L1?
- Add per-refcount-block full flag to refcount table?
---
 docs/specs/qcow2.txt |  135 +++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 112 insertions(+), 23 deletions(-)
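
The subcluster bitmap proposed for bits 64-127 of an L2 table entry (see the
patch below) could be decoded roughly as in this Python sketch. The names are
hypothetical, and the bitmap is assumed to have been read into an integer
already, so on-disk byte order is out of scope here.

```python
# Meaning of each two-bit value, following the proposed L2 entry layout.
SUBCLUSTER_STATES = {
    0: "unallocated",
    1: "allocated",
    2: "zero",       # unallocated, reads as zeros instead of the backing file
    3: "reserved",
}


def subcluster_states(bitmap):
    """Decode a 64-bit subcluster bitmap into a list of 32 states.

    The first subcluster is represented by the least significant two bits.
    """
    return [SUBCLUSTER_STATES[(bitmap >> (2 * i)) & 0x3] for i in range(32)]
```

For example, a bitmap of 0b1001 describes an allocated first subcluster, a
zero second subcluster, and 30 unallocated ones.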

diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
index 8fc3cb2..e4722bc 100644
--- a/docs/specs/qcow2.txt
+++ b/docs/specs/qcow2.txt
@@ -18,7 +18,7 @@ The first cluster of a qcow2 image contains the file header:
                     QCOW magic string ("QFI\xfb")
 
           4 -  7:   version
-                    Version number (only valid value is 2)
+                    Version number (valid values are 2 and 3)
 
           8 - 15:   backing_file_offset
                     Offset into the image file at which the backing file name
@@ -67,12 +67,53 @@ The first cluster of a qcow2 image contains the file header:
                     Offset into the image file at which the snapshot table
                     starts. Must be aligned to a cluster boundary.
 
+If the version is 3 or higher, the header has the following additional fields.
+For version 2, the values are assumed to be zero, unless specified otherwise
+in the description of a field.
+
+         72 -  79:  incompatible_features
+                    Bitmask of incompatible features. An implementation must
+                    fail to open an image if an unknown bit is set.
+
+                    Bit 0:      The reference counts in the image file may be
+                                inaccurate. Implementations must check/rebuild
+                                them if they rely on them.
+
+                    Bit 1:      Enable subclusters. This affects the L2 table
+                                format.
+
+                    Bits 2-63:  Reserved (set to 0)
+
+         80 -  87:  compatible_features
+                    Bitmask of compatible features. An implementation can
+                    safely ignore any unknown bits that are set.
+
+                    Bits 0-63:  Reserved (set to 0)
+
+         88 -  95:  autoclear_features
+                    Bitmask of auto-clear features. An implementation may only
+                    write to an image with unknown auto-clear features if it
+                    clears the respective bits from this field first.
+
+                    Bits 0-63:  Reserved (set to 0)
+
+         96 -  99:  refcount_bits
+                    Size of a reference count block entry in bits. For version 2
+                    images, the size is always assumed to be 16 bits. The size
+                    must be a power of two.
+                    [ TODO: Define order in sub-byte sizes ]
+
+        100 - 103:  header_length
+                    Length of the header structure in bytes. For version 2
+                    images, the length is always assumed to be 72 bytes.
+
 Directly after the image header, optional sections called header extensions can
 be stored. Each extension has a structure like the following:
 
     Byte  0 -  3:   Header extension type:
                         0x00000000 - End of the header extension area
                         0xE2792ACA - Backing file format name
+                        0x6803f857 - Feature name table
                         other      - Unknown header extension, can be safely
                                      ignored
 
@@ -84,8 +125,32 @@ be stored. Each extension has a structure like the following:
                     multiple of 8.
 
 The remaining space between the end of the header extension area and the end of
-the first cluster can be used for other data. Usually, the backing file name is
-stored there.
+the first cluster can be used for the backing file name. It is not allowed to
+store other data here, so that an implementation can safely modify the header
+and add extensions without harming data of compatible features that it
+doesn't support. Compatible features that need space for additional data can
+use a header extension.
+
+
+== Feature name table ==
+
+A feature name table is an optional header extension that contains the name for
+features used by the image. It can be used by applications that don't know
+the respective feature (e.g. because the feature was introduced only later) to
+display a useful error message.
+
+The number of entries in the feature name table is determined by the length of
+the header extension data. Its entries look like this:
+
+    Byte       0:   Type of feature (select feature bitmap)
+                        0: Incompatible feature
+                        1: Compatible feature
+                        2: Autoclear feature
+
+               1:   Bit number within the selected feature bitmap
+
+          2 - 47:   Feature name (padded with zeros, but not necessarily null
+                    terminated if it has full length)
 
 
 == Host cluster management ==
@@ -138,7 +203,8 @@ guest clusters to host clusters. They are called L1 and L2 table.
 
 The L1 table has a variable size (stored in the header) and may use multiple
 clusters, however it must be contiguous in the image file. L2 tables are
-exactly one cluster in size.
+exactly one cluster in size if subclusters are disabled, and two clusters if
+they are enabled.
 
 Given a offset into the virtual disk, the offset into the image file can be
 obtained as follows:
@@ -168,9 +234,38 @@ L1 table entry:
                     refcount is exactly one. This information is only accurate
                     in the active L1 table.
 
-L2 table entry (for normal clusters):
+L2 table entry:
 
-    Bit  0 -  8:    Reserved (set to 0)
+    Bit  0 -  61:   Cluster descriptor
+
+              62:   0 for standard clusters
+                    1 for compressed clusters
+
+              63:   0 for a cluster that is unused or requires COW, 1 if its
+                    refcount is exactly one. This information is only accurate
+                    in L2 tables that are reachable from the active L1
+                    table.
+
+        64 - 127:   If subclusters are enabled, this contains a bitmask that
+                    describes the allocation status of all 32 subclusters (two
+                    bits for each). The first subcluster is represented by the
+                    LSB. The values for each subcluster are:
+
+                     0: Subcluster is unallocated
+                     1: Subcluster is allocated
+                     2: Subcluster is unallocated and reads as all zeros
+                        instead of referring to the backing file
+                     3: Reserved
+
+Standard Cluster Descriptor:
+
+    Bit       0:    If set to 1, the cluster reads as all zeros instead of
+                    referring to the backing file if the (sub-)cluster is
+                    unallocated.
+
+                    With version 2, this is always 0.
+
+         1 -  8:    Reserved (set to 0)
 
          9 - 55:    Bits 9-55 of host cluster offset. Must be aligned to a
                     cluster boundary. If the offset is 0, the cluster is
@@ -178,29 +273,17 @@ L2 table entry (for normal clusters):
 
         56 - 61:    Reserved (set to 0)
 
-             62:    0 (this cluster is not compressed)
 
-             63:    0 for a cluster that is unused or requires COW, 1 if its
-                    refcount is exactly one. This information is only accurate
-                    in L2 tables that are reachable from the the active L1
-                    table.
-
-L2 table entry (for compressed clusters; x = 62 - (cluster_size - 8)):
+Compressed Cluster Descriptor (x = 62 - (cluster_bits - 8)):
 
     Bit  0 -  x:    Host cluster offset. This is usually _not_ aligned to a
                     cluster boundary!
 
        x+1 - 61:    Compressed size of the images in sectors of 512 bytes
 
-             62:    1 (this cluster is compressed using zlib)
-
-             63:    0 for a cluster that is unused or requires COW, 1 if its
-                    refcount is exactly one. This information is only accurate
-                    in L2 tables that are reachable from the the active L1
-                    table.
-
-If a cluster is unallocated, read requests shall read the data from the backing
-file. If there is no backing file or the backing file is smaller than the image,
+If a cluster or a subcluster is unallocated, read requests shall read the data
+from the backing file (except if bit 0 in the Standard Cluster Descriptor is
+set). If there is no backing file or the backing file is smaller than the image,
 they shall read zeros for all parts that are not covered by the backing file.
 
 
@@ -253,7 +336,13 @@ Snapshot table entry:
         36 - 39:    Size of extra data in the table entry (used for future
                     extensions of the format)
 
-        variable:   Extra data for future extensions. Must be ignored.
+        variable:   Extra data for future extensions. Unknown fields must be
+                    ignored. Currently defined are (offset relative to snapshot
+                    table entry):
+
+                    Byte 40 - 47:   Size of the VM state in bytes. 0 if no VM
+                                    state is saved. If this field is present,
+                                    the 32-bit value in bytes 32-35 is ignored.
 
         variable:   Unique ID string for the snapshot (not null terminated)
 
-- 
1.7.5.4
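
As an illustration of the feature name table layout described in the patch
(48-byte entries: one byte for the bitmap type, one byte for the bit number,
46 bytes for the zero-padded name), a reader might parse the extension data as
in this Python sketch. The function and constant names are hypothetical, and
decoding the name as ASCII is an assumption, since the draft does not specify
an encoding.

```python
import struct

# One entry: bitmap type, bit number, 46-byte zero-padded name (48 bytes total).
ENTRY = struct.Struct(">BB46s")
BITMAP_NAMES = {0: "incompatible", 1: "compatible", 2: "autoclear"}


def parse_feature_names(data):
    """Map (bitmap, bit) to a feature name for each complete entry in data."""
    names = {}
    for off in range(0, len(data) - ENTRY.size + 1, ENTRY.size):
        ftype, bit, raw = ENTRY.unpack_from(data, off)
        # A full-length name has no terminating zero, so split instead of
        # relying on one being present.
        name = raw.split(b"\0", 1)[0].decode("ascii", "replace")
        names[(BITMAP_NAMES.get(ftype, ftype), bit)] = name
    return names
```

The resulting mapping can then be used to name an unknown bit in an error
message instead of printing only its number.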