From: Max Reitz
Subject: Re: [Qemu-devel] [PATCH 2/2] vmdk: Add read-only support for seSparse snapshots
Date: Mon, 27 May 2019 19:39:16 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1
On 24.04.19 09:49, Sam Eiderman wrote:
> Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
> QEMU).
>
> This format was lacking in the following:
>
> * Grain directory (L1) and grain table (L2) entries were 32-bit,
> allowing access to only 2TB (slightly less) of data.
> * The grain size (default) was 512 bytes - leading to data
> fragmentation and many grain tables.
> * For space reclamation purposes, it was necessary to find all the
> grains which are not pointed to by any grain table - so a reverse
> mapping of "offset of grain in vmdk" to "grain table" must be
> constructed - which takes large amounts of CPU/RAM.
>
> The format specification can be found in VMware's documentation:
> https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
>
> In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
> introduced: SESparse (Space Efficient).
>
> This format fixes the above issues:
>
> * All entries are now 64-bit.
> * The grain size (default) is 4KB.
> * Grain directory and grain tables are now located at the beginning
> of the file.
> + seSparse format reserves space for all grain tables.
> + Grain tables can be addressed using an index.
> + Grains are located in the end of the file and can also be
> addressed with an index.
> - seSparse vmdks of large disks (64TB) have huge preallocated
> headers - mainly due to L2 tables, even for empty snapshots.
> * The header contains a reverse mapping ("backmap") of "offset of
> grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
> specifies for each grain - whether it is allocated or not.
> Using these data structures we can implement space reclamation
> efficiently.
> * Due to the fact that the header now maintains two mappings:
> * The regular one (grain directory & grain tables)
> * A reverse one (backmap and free bitmap)
> These data structures can lose consistency upon crash and result
> in a corrupted VMDK.
> Therefore, a journal is also added to the VMDK and is replayed
> when the VMware reopens the file after a crash.
>
> Since ESXi 6.7 - SESparse is the only snapshot format available.
>
> Unfortunately, VMware does not provide documentation regarding the new
> seSparse format.
>
> This commit is based on black-box research of the seSparse format.
> Various in-guest block operations and their effect on the snapshot file
> were tested.
>
> The only VMware provided source of information (regarding the underlying
> implementation) was a log file on the ESXi:
>
> /var/log/hostd.log
>
> Whenever an seSparse snapshot is created - the log is being populated
> with seSparse records.
>
> Relevant log records are of the form:
>
> [...] Const Header:
> [...] constMagic = 0xcafebabe
> [...] version = 2.1
> [...] capacity = 204800
> [...] grainSize = 8
> [...] grainTableSize = 64
> [...] flags = 0
> [...] Extents:
> [...] Header : <1 : 1>
> [...] JournalHdr : <2 : 2>
> [...] Journal : <2048 : 2048>
> [...] GrainDirectory : <4096 : 2048>
> [...] GrainTables : <6144 : 2048>
> [...] FreeBitmap : <8192 : 2048>
> [...] BackMap : <10240 : 2048>
> [...] Grain : <12288 : 204800>
> [...] Volatile Header:
> [...] volatileMagic = 0xcafecafe
> [...] FreeGTNumber = 0
> [...] nextTxnSeqNumber = 0
> [...] replayJournal = 0
>
> The sizes that are seen in the log file are in sectors.
> Extents are of the following format: <offset : size>
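The sector-to-byte conversion is trivial, but for reference, the `<offset : size>` pairs above translate to byte values like this (a standalone sketch; `sectors_to_bytes` is an illustrative helper, not code from the patch):

```c
#include <stdint.h>

#define SECTOR_SIZE 512 /* VMDK sector size */

/* Convert a sector offset or size from the hostd.log dump to bytes. */
static inline uint64_t sectors_to_bytes(uint64_t sectors)
{
    return sectors * SECTOR_SIZE;
}

/* E.g. "GrainDirectory : <4096 : 2048>" means the grain directory
 * starts at byte 4096 * 512 = 2097152 and spans 2048 * 512 = 1048576
 * bytes. */
```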
>
> This commit is a strict implementation which enforces:
> * magics
> * version number 2.1
> * grain size of 8 sectors (4KB)
> * grain table size of 64 sectors
> * zero flags
> * extent locations
>
> Additionally, this commit provides only a subset of the functionality
> offered by seSparse's format:
> * Read-only
> * No journal replay
> * No space reclamation
> * No unmap support
>
> Hence, journal header, journal, free bitmap and backmap extents are
> unused, only the "classic" (L1 -> L2 -> data) grain access is
> implemented.
>
> However there are several differences in the grain access itself.
> Grain directory (L1):
> * Grain directory entries are indexes (not offsets) to grain
> tables.
> * Valid grain directory entries have their highest nibble set to
> 0x1.
> * Since grain tables are always located in the beginning of the
> file - the index can fit into 32 bits - so we can use its low
> part if it's valid.
> Grain table (L2):
> * Grain table entries are indexes (not offsets) to grains.
> * If the highest nibble of the entry is:
> 0x0:
> The grain is not allocated.
> The rest of the bytes are 0.
> 0x1:
> The grain is unmapped - guest sees a zero grain.
> The rest of the bits point to the previously mapped grain,
> see 0x3 case.
> 0x2:
> The grain is zero.
> 0x3:
> The grain is allocated - to get the index calculate:
> ((entry & 0x0fff000000000000) >> 48) |
> ((entry & 0x0000ffffffffffff) << 12)
> * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
> grain which results from the guest using sg_unmap to unmap the
> grain - but the grain itself still exists in the grain extent - a
> space reclamation procedure should delete it.
> Unmapping a zero grain has no effect (0x2 will not change to 0x1)
> but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
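The entry rules above can be sketched as plain C (based purely on the black-box description in this commit message, not on any VMware documentation; the names are illustrative):

```c
#include <stdint.h>

/* The highest nibble of an seSparse grain table entry selects its type. */
enum se_grain_type {
    SE_GRAIN_UNALLOCATED = 0x0,
    SE_GRAIN_UNMAPPED    = 0x1, /* reads as zero; old grain still on disk */
    SE_GRAIN_ZERO        = 0x2,
    SE_GRAIN_ALLOCATED   = 0x3,
};

static enum se_grain_type se_grain_type(uint64_t entry)
{
    return (enum se_grain_type)(entry >> 60);
}

/* Grain index of an allocated (0x3) entry, per the formula above:
 * bits 48-59 become the low 12 bits of the index, and the low 48 bits
 * are shifted up by 12. */
static uint64_t se_grain_index(uint64_t entry)
{
    return ((entry & 0x0fff000000000000ULL) >> 48) |
           ((entry & 0x0000ffffffffffffULL) << 12);
}
```

So an entry of 0x3001000000000000 decodes to grain index 1, while 0x3000000000000001 decodes to index 4096.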
>
> In order to implement seSparse some fields had to be changed to support
> both 32-bit and 64-bit entry sizes.
>
> Read-only is implemented by failing on pwrite since qemu-img opens the
> vmdk with read-write flags for "convert" (which is a read-only
> operation).
Does it? The code doesn’t look like it (in img_convert(), src_flags is
never set to anything with BDRV_O_RDWR).
In theory, we have bdrv_apply_auto_read_only() for this case. More on
that below. [1]
> Reviewed-by: Karl Heubaum <address@hidden>
> Reviewed-by: Eyal Moscovici <address@hidden>
> Reviewed-by: Arbel Moshe <address@hidden>
> Signed-off-by: Sam Eiderman <address@hidden>
> ---
> block/vmdk.c | 475 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 459 insertions(+), 16 deletions(-)
>
> diff --git a/block/vmdk.c b/block/vmdk.c
> index fc7378da78..e599c08b95 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
First, a general observation: BDRV_SECTOR_SIZE is indeed equal to VMDK’s
sector size (both are 512), but I’d consider that a coincidence. This
file defines a SECTOR_SIZE macro. I think you should use that instead
of BDRV_SECTOR_SIZE whenever referring to VMDK’s sector size.
(BDRV_SECTOR_SIZE is actually the sector size the qemu block layer uses.
Yes, I know, there are plenty of places in vmdk.c that use it. I still
think it isn’t entirely correct to use it.)
> @@ -91,6 +91,44 @@ typedef struct {
> uint16_t compressAlgorithm;
> } QEMU_PACKED VMDK4Header;
>
> +typedef struct VMDKSESparseConstHeader {
> + uint64_t magic;
> + uint64_t version;
> + uint64_t capacity;
> + uint64_t grain_size;
> + uint64_t grain_table_size;
> + uint64_t flags;
> + uint64_t reserved1;
> + uint64_t reserved2;
> + uint64_t reserved3;
> + uint64_t reserved4;
> + uint64_t volatile_header_offset;
> + uint64_t volatile_header_size;
> + uint64_t journal_header_offset;
> + uint64_t journal_header_size;
> + uint64_t journal_offset;
> + uint64_t journal_size;
> + uint64_t grain_dir_offset;
> + uint64_t grain_dir_size;
> + uint64_t grain_tables_offset;
> + uint64_t grain_tables_size;
> + uint64_t free_bitmap_offset;
> + uint64_t free_bitmap_size;
> + uint64_t backmap_offset;
> + uint64_t backmap_size;
> + uint64_t grains_offset;
> + uint64_t grains_size;
> + uint8_t pad[304];
> +} QEMU_PACKED VMDKSESparseConstHeader;
> +
> +typedef struct VMDKSESparseVolatileHeader {
Is this the official name? (I know you don’t have a specification, but
maybe you do have some half-official information, I don’t know.)
Generally, I’d call the opposite of “const” “mutable”.
> + uint64_t magic;
> + uint64_t free_gt_number;
> + uint64_t next_txn_seq_number;
> + uint64_t replay_journal;
> + uint8_t pad[480];
> +} QEMU_PACKED VMDKSESparseVolatileHeader;
> +
> #define L2_CACHE_SIZE 16
>
> typedef struct VmdkExtent {
> @@ -99,19 +137,23 @@ typedef struct VmdkExtent {
> bool compressed;
> bool has_marker;
> bool has_zero_grain;
> + bool sesparse;
> + uint64_t sesparse_l2_tables_offset;
> + uint64_t sesparse_clusters_offset;
> + int32_t entry_size;
> int version;
> int64_t sectors;
> int64_t end_sector;
> int64_t flat_start_offset;
> int64_t l1_table_offset;
> int64_t l1_backup_table_offset;
> - uint32_t *l1_table;
> + void *l1_table;
> uint32_t *l1_backup_table;
> unsigned int l1_size;
> uint32_t l1_entry_sectors;
>
> unsigned int l2_size;
> - uint32_t *l2_cache;
> + void *l2_cache;
> uint32_t l2_cache_offsets[L2_CACHE_SIZE];
> uint32_t l2_cache_counts[L2_CACHE_SIZE];
>
> @@ -434,6 +476,8 @@ static int vmdk_add_extent(BlockDriverState *bs,
> * cluster size: 64KB, L2 table size: 512 entries
> * 1PB - for default "ESXi Host Sparse Extent"
> (VMDK3/vmfsSparse)
> * cluster size: 512B, L2 table size: 4096 entries
> + * 8PB - for default "ESXI seSparse Extent"
> + * cluster size: 4KB, L2 table size: 4096 entries
> */
> error_setg(errp, "L1 size too big");
> return -EFBIG;
Hm, now I wonder about this limit. With this new format, it’d be 512M *
8 = 4 GB of RAM usage. That seems an awful lot, and I don’t think we
really need to support 8 PB images. Like, if we want to prevent
unbounded allocations, 4 GB is far above the limit of what I’d consider
sane. (And I don’t know too much about vmdk, but can’t a user also
specify multiple extents with a single descriptor file, and thus make
qemu allocate these 4 GB multiple times?)
Some comment in this file says the maximum supported size is 64 TB
anyway. Shouldn’t we just limit the L1 size accordingly, then?
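To put numbers on this (a back-of-the-envelope sketch using the defaults this patch enforces: 4 KiB grains, 4096-entry grain tables, 8-byte entries; `l1_table_bytes` is an illustrative helper, not a function from vmdk.c):

```c
#include <stdint.h>

/* RAM needed for the in-memory L1 table of an seSparse extent,
 * assuming the defaults enforced by this patch. */
static uint64_t l1_table_bytes(uint64_t capacity_bytes)
{
    const uint64_t grain_size  = 4096;                    /* 4 KiB */
    const uint64_t l2_entries  = 4096;                    /* per grain table */
    const uint64_t l2_coverage = grain_size * l2_entries; /* 16 MiB */

    return capacity_bytes / l2_coverage * sizeof(uint64_t);
}

/* 8 PiB image  -> 4 GiB of L1 table
 * 64 TiB image -> 32 MiB of L1 table */
```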
> @@ -459,6 +503,7 @@ static int vmdk_add_extent(BlockDriverState *bs,
> extent->l2_size = l2_size;
> extent->cluster_sectors = flat ? sectors : cluster_sectors;
> extent->next_cluster_sector = ROUND_UP(nb_sectors, cluster_sectors);
> + extent->entry_size = sizeof(uint32_t);
>
> if (s->num_extents > 1) {
> extent->end_sector = (*(extent - 1)).end_sector + extent->sectors;
> @@ -480,7 +525,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
> int i;
>
> /* read the L1 table */
> - l1_size = extent->l1_size * sizeof(uint32_t);
> + l1_size = extent->l1_size * extent->entry_size;
Related to the above: This can overflow. (512M * 8 == 4G)
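Concretely (a sketch of the arithmetic; I believe `l1_size` in vmdk_init_tables() is a 32-bit `int`, so the product wraps before it is ever widened):

```c
#include <stdint.h>

/* 32-bit product: wraps around for 512M entries * 8-byte entries. */
static uint32_t l1_bytes_narrow(uint32_t l1_entries, uint32_t entry_size)
{
    return l1_entries * entry_size; /* 512M * 8 == 2^32 == 0 mod 2^32 */
}

/* Widening one operand before multiplying avoids the wrap. */
static uint64_t l1_bytes_wide(uint32_t l1_entries, uint32_t entry_size)
{
    return (uint64_t)l1_entries * entry_size;
}
```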
> extent->l1_table = g_try_malloc(l1_size);
> if (l1_size && extent->l1_table == NULL) {
> return -ENOMEM;
> @@ -498,10 +543,15 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
> goto fail_l1;
> }
> for (i = 0; i < extent->l1_size; i++) {
> - le32_to_cpus(&extent->l1_table[i]);
> + if (extent->sesparse) {
Shouldn’t matter in practice, but I think evaluating extent->entry_size
would be more natural.
> + le64_to_cpus((uint64_t *)extent->l1_table + i);
> + } else {
> + le32_to_cpus((uint32_t *)extent->l1_table + i);
> + }
> }
>
> if (extent->l1_backup_table_offset) {
> + assert(!extent->sesparse);
> extent->l1_backup_table = g_try_malloc(l1_size);
> if (l1_size && extent->l1_backup_table == NULL) {
> ret = -ENOMEM;
> @@ -524,7 +574,7 @@ static int vmdk_init_tables(BlockDriverState *bs, VmdkExtent *extent,
> }
>
> extent->l2_cache =
> - g_new(uint32_t, extent->l2_size * L2_CACHE_SIZE);
> + g_malloc(extent->entry_size * extent->l2_size * L2_CACHE_SIZE);
> return 0;
> fail_l1b:
> g_free(extent->l1_backup_table);
> @@ -570,6 +620,331 @@ static int vmdk_open_vmfs_sparse(BlockDriverState *bs,
> return ret;
> }
>
> +#define SESPARSE_CONST_HEADER_MAGIC 0x00000000cafebabe
> +#define SESPARSE_VOLATILE_HEADER_MAGIC 0x00000000cafecafe
The upper 32 bit are 0, so it doesn’t really matter, but still, if these
are 64-bit constants (which they are), they should have a 64-bit type
(e.g. UINT64_C(0x00000000cafebabe)). (According to my very personal
taste at least.)
> +
> +static const char zero_pad[BDRV_SECTOR_SIZE];
> +
> +/* Strict checks - format not officially documented */
> +static int check_se_sparse_const_header(VMDKSESparseConstHeader *header,
> + Error **errp)
> +{
> + uint64_t grain_table_coverage;
> + uint64_t needed_grain_tables;
> + uint64_t needed_grain_dir_size;
> + uint64_t needed_grain_tables_size;
> + uint64_t needed_free_bitmap_size;
> +
> + header->magic = le64_to_cpu(header->magic);
> + header->version = le64_to_cpu(header->version);
> + header->grain_size = le64_to_cpu(header->grain_size);
> + header->grain_table_size = le64_to_cpu(header->grain_table_size);
> + header->flags = le64_to_cpu(header->flags);
> + header->reserved1 = le64_to_cpu(header->reserved1);
> + header->reserved2 = le64_to_cpu(header->reserved2);
> + header->reserved3 = le64_to_cpu(header->reserved3);
> + header->reserved4 = le64_to_cpu(header->reserved4);
> +
> + header->volatile_header_offset =
> + le64_to_cpu(header->volatile_header_offset);
> + header->volatile_header_size = le64_to_cpu(header->volatile_header_size);
> +
> + header->journal_header_offset = le64_to_cpu(header->journal_header_offset);
> + header->journal_header_size = le64_to_cpu(header->journal_header_size);
> +
> + header->journal_offset = le64_to_cpu(header->journal_offset);
> + header->journal_size = le64_to_cpu(header->journal_size);
> +
> + header->grain_dir_offset = le64_to_cpu(header->grain_dir_offset);
> + header->grain_dir_size = le64_to_cpu(header->grain_dir_size);
> +
> + header->grain_tables_offset = le64_to_cpu(header->grain_tables_offset);
> + header->grain_tables_size = le64_to_cpu(header->grain_tables_size);
> +
> + header->free_bitmap_offset = le64_to_cpu(header->free_bitmap_offset);
> + header->free_bitmap_size = le64_to_cpu(header->free_bitmap_size);
> +
> + header->backmap_offset = le64_to_cpu(header->backmap_offset);
> + header->backmap_size = le64_to_cpu(header->backmap_size);
> +
> + header->grains_offset = le64_to_cpu(header->grains_offset);
> + header->grains_size = le64_to_cpu(header->grains_size);
> +
> + if (header->magic != SESPARSE_CONST_HEADER_MAGIC) {
> + error_setg(errp, "Bad const header magic: 0x%016" PRIx64,
> + header->magic);
> + return -EINVAL;
> + }
> +
> + if (header->version != 0x0000000200000001) {
> + error_setg(errp, "Unsupported version: 0x%016" PRIx64,
> + header->version);
> + return -ENOTSUP;
> + }
> +
> + if (header->grain_size != 8) {
> + error_setg(errp, "Unsupported grain size: %" PRIu64,
> + header->grain_size);
> + return -ENOTSUP;
> + }
> +
> + if (header->grain_table_size != 64) {
> + error_setg(errp, "Unsupported grain table size: %" PRIu64,
> + header->grain_table_size);
> + return -ENOTSUP;
> + }
> +
> + if (header->flags != 0) {
> + error_setg(errp, "Unsupported flags: 0x%016" PRIx64,
> + header->flags);
> + return -ENOTSUP;
> + }
> +
> + if (header->reserved1 != 0 || header->reserved2 != 0 ||
> + header->reserved3 != 0 || header->reserved4 != 0) {
> + error_setg(errp, "Unsupported reserved bits:"
> + " 0x%016" PRIx64 " 0x%016" PRIx64
> + " 0x%016" PRIx64 " 0x%016" PRIx64,
> + header->reserved1, header->reserved2,
> + header->reserved3, header->reserved4);
> + return -ENOTSUP;
> + }
> +
> + if (header->volatile_header_offset != 1) {
> + error_setg(errp, "Unsupported volatile header offset: %" PRIu64,
> + header->volatile_header_offset);
> + return -ENOTSUP;
> + }
> +
> + if (header->volatile_header_size != 1) {
> + error_setg(errp, "Unsupported volatile header size: %" PRIu64,
> + header->volatile_header_size);
> + return -ENOTSUP;
> + }
> +
> + if (header->journal_header_offset != 2) {
> + error_setg(errp, "Unsupported journal header offset: %" PRIu64,
> + header->journal_header_offset);
> + return -ENOTSUP;
> + }
> +
> + if (header->journal_header_size != 2) {
> + error_setg(errp, "Unsupported journal header size: %" PRIu64,
> + header->journal_header_size);
> + return -ENOTSUP;
> + }
> +
> + if (header->journal_offset != 2048) {
> + error_setg(errp, "Unsupported journal offset: %" PRIu64,
> + header->journal_offset);
> + return -ENOTSUP;
> + }
> +
> + if (header->journal_size != 2048) {
> + error_setg(errp, "Unsupported journal size: %" PRIu64,
> + header->journal_size);
> + return -ENOTSUP;
> + }
> +
> + if (header->grain_dir_offset != 4096) {
> + error_setg(errp, "Unsupported grain directory offset: %" PRIu64,
> + header->grain_dir_offset);
> + return -ENOTSUP;
> + }
Do these values (metadata structure offsets and sizes) really matter?
> + /* in sectors */
> + grain_table_coverage = ((header->grain_table_size *
Hm, if header->grain_table_size is measured in sectors, it’d probably be
better to rename it to grain_table_sectors or something.
(“size” is usually in bytes; sometimes it’s the number of entries, but
that’s already kind of weird, I think)
> + BDRV_SECTOR_SIZE / sizeof(uint64_t)) *
> + header->grain_size);
> + needed_grain_tables = header->capacity / grain_table_coverage;
> + needed_grain_dir_size = DIV_ROUND_UP(needed_grain_tables * sizeof(uint64_t),
> + BDRV_SECTOR_SIZE);
> + needed_grain_dir_size = ROUND_UP(needed_grain_dir_size, 2048);
> +
> + if (header->grain_dir_size != needed_grain_dir_size) {
Wouldn’t it suffice to check that header->grain_dir_size >=
needed_grain_dir_size? (The ROUND_UP() would become unnecessary, then.)
(And also, maybe s/grain_dir_size/grain_dir_sectors/)
> + error_setg(errp, "Invalid grain directory size: %" PRIu64
> + ", needed: %" PRIu64,
> + header->grain_dir_size, needed_grain_dir_size);
> + return -EINVAL;
> + }
> +
> + if (header->grain_tables_offset !=
> + header->grain_dir_offset + header->grain_dir_size) {
> + error_setg(errp, "Grain directory must be followed by grain tables");
> + return -EINVAL;
> + }
> +
> + needed_grain_tables_size = needed_grain_tables * header->grain_table_size;
> + needed_grain_tables_size = ROUND_UP(needed_grain_tables_size, 2048);
> +
> + if (header->grain_tables_size != needed_grain_tables_size) {
> + error_setg(errp, "Invalid grain tables size: %" PRIu64
> + ", needed: %" PRIu64,
> + header->grain_tables_size, needed_grain_tables_size);
> + return -EINVAL;
> + }
> +
> + if (header->free_bitmap_offset !=
> + header->grain_tables_offset + header->grain_tables_size) {
> + error_setg(errp, "Grain tables must be followed by free bitmap");
> + return -EINVAL;
> + }
> +
> + /* in bits */
> + needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
> + header->grain_size);
> + /* in bytes */
> + needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
> + BITS_PER_BYTE);
> + /* in sectors */
> + needed_free_bitmap_size = DIV_ROUND_UP(needed_free_bitmap_size,
> + BDRV_SECTOR_SIZE);
> + needed_free_bitmap_size = ROUND_UP(needed_free_bitmap_size, 2048);
Er, now this is really confusing. I’d start with the size in bytes:
> needed_free_bitmap_size = DIV_ROUND_UP(header->capacity,
> header->grain_size * BITS_PER_BYTE);
and then translate it to sectors, but use a new variable for it
(needed_free_bitmap_sectors).
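The suggested computation would look roughly like this (a sketch; the `free_bitmap_*` names are illustrative, and the on-disk value is additionally rounded up to a 2048-sector boundary):

```c
#include <stdint.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define BITS_PER_BYTE 8
#define SECTOR_SIZE   512

/* One free-bitmap bit per grain; capacity and grain_size are given in
 * sectors, as in the seSparse const header. */
static uint64_t free_bitmap_bytes(uint64_t capacity, uint64_t grain_size)
{
    return DIV_ROUND_UP(capacity, grain_size * BITS_PER_BYTE);
}

static uint64_t free_bitmap_sectors(uint64_t capacity, uint64_t grain_size)
{
    return DIV_ROUND_UP(free_bitmap_bytes(capacity, grain_size), SECTOR_SIZE);
}
```

For the log sample above (capacity = 204800, grainSize = 8) this gives 3200 bytes, i.e. 7 sectors before the 2048-sector alignment that produces the FreeBitmap extent size seen in the header.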
> +
> + if (header->free_bitmap_size != needed_free_bitmap_size) {
Again, I suppose just checking that header->free_bitmap_size >=
needed_free_bitmap_size should be enough, I think.
> + error_setg(errp, "Invalid free bitmap size: %" PRIu64
> + ", needed: %" PRIu64,
> + header->free_bitmap_size, needed_free_bitmap_size);
> + return -EINVAL;
> + }
> +
> + if (header->backmap_offset !=
> + header->free_bitmap_offset + header->free_bitmap_size) {
> + error_setg(errp, "Free bitmap must be followed by backmap");
> + return -EINVAL;
> + }
> +
> + if (header->backmap_size != header->grain_tables_size) {
> + error_setg(errp, "Invalid backmap size: %" PRIu64
> + ", needed: %" PRIu64,
> + header->backmap_size, header->grain_tables_size);
> + return -EINVAL;
> + }
> +
> + if (header->grains_offset !=
> + header->backmap_offset + header->backmap_size) {
> + error_setg(errp, "Backmap must be followed by grains");
> + return -EINVAL;
> + }
> +
> + if (header->grains_size != header->capacity) {
> + error_setg(errp, "Invalid grains size: %" PRIu64
> + ", needed: %" PRIu64,
> + header->grains_size, header->capacity);
> + return -EINVAL;
> + }
> +
> + /* check that padding is 0 */
> + if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {
buffer_is_zero() should be sufficient.
> + error_setg(errp, "Unsupported non-zero const header padding");
> + return -ENOTSUP;
> + }
> +
> + return 0;
> +}
> +
> +static int check_se_sparse_volatile_header(VMDKSESparseVolatileHeader *header,
> + Error **errp)
> +{
> + header->magic = le64_to_cpu(header->magic);
> + header->free_gt_number = le64_to_cpu(header->free_gt_number);
> + header->next_txn_seq_number = le64_to_cpu(header->next_txn_seq_number);
> + header->replay_journal = le64_to_cpu(header->replay_journal);
> +
> + if (header->magic != SESPARSE_VOLATILE_HEADER_MAGIC) {
> + error_setg(errp, "Bad volatile header magic: 0x%016" PRIx64,
> + header->magic);
> + return -EINVAL;
> + }
> +
> + if (header->replay_journal) {
> + error_setg(errp, "Replaying journal not supported");
Hmmm, maybe "Image is dirty, replaying journal not supported"?
> + return -ENOTSUP;
> + }
> +
> + /* check that padding is 0 */
> + if (memcmp(header->pad, zero_pad, sizeof(header->pad))) {
> + error_setg(errp, "Unsupported non-zero volatile header padding");
> + return -ENOTSUP;
> + }
> +
> + return 0;
> +}
> +
> +static int vmdk_open_se_sparse(BlockDriverState *bs,
> + BdrvChild *file,
> + int flags, Error **errp)
> +{
> + int ret;
> + VMDKSESparseConstHeader const_header;
> + VMDKSESparseVolatileHeader volatile_header;
> + VmdkExtent *extent;
> +
> + assert(sizeof(const_header) == BDRV_SECTOR_SIZE);
> +
[1] I think if you put a
ret = bdrv_apply_auto_read_only(bs,
"No write support for seSparse images available", errp);
if (ret < 0) {
return ret;
}
here, that should do the trick.
> + ret = bdrv_pread(file, 0, &const_header, sizeof(const_header));
> + if (ret < 0) {
> + bdrv_refresh_filename(file->bs);
> + error_setg_errno(errp, -ret,
> + "Could not read const header from file '%s'",
> + file->bs->filename);
> + return ret;
> + }
> +
> + /* check const header */
> + ret = check_se_sparse_const_header(&const_header, errp);
> + if (ret < 0) {
> + return ret;
> + }
> +
> + assert(sizeof(volatile_header) == BDRV_SECTOR_SIZE);
> +
> + ret = bdrv_pread(file,
> + const_header.volatile_header_offset * BDRV_SECTOR_SIZE,
> + &volatile_header, sizeof(volatile_header));
> + if (ret < 0) {
> + bdrv_refresh_filename(file->bs);
> + error_setg_errno(errp, -ret,
> + "Could not read volatile header from file '%s'",
> + file->bs->filename);
> + return ret;
> + }
> +
> + /* check volatile header */
> + ret = check_se_sparse_volatile_header(&volatile_header, errp);
> + if (ret < 0) {
> + return ret;
> + }
> +
> + ret = vmdk_add_extent(bs, file, false,
> + const_header.capacity,
> + const_header.grain_dir_offset * BDRV_SECTOR_SIZE,
> + 0,
> + const_header.grain_dir_size *
> + BDRV_SECTOR_SIZE / sizeof(uint64_t),
> + const_header.grain_table_size *
> + BDRV_SECTOR_SIZE / sizeof(uint64_t),
> + const_header.grain_size,
> + &extent,
> + errp);
> + if (ret < 0) {
> + return ret;
> + }
> +
> + extent->sesparse = true;
> + extent->sesparse_l2_tables_offset = const_header.grain_tables_offset;
> + extent->sesparse_clusters_offset = const_header.grains_offset;
> + extent->entry_size = sizeof(uint64_t);
> +
> + ret = vmdk_init_tables(bs, extent, errp);
> + if (ret) {
> + /* free extent allocated by vmdk_add_extent */
> + vmdk_free_last_extent(bs);
> + }
> +
> + return ret;
> +}
> +
> static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
> QDict *options, Error **errp);
>
> @@ -847,6 +1222,7 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
> * RW [size in sectors] SPARSE "file-name.vmdk"
> * RW [size in sectors] VMFS "file-name.vmdk"
> * RW [size in sectors] VMFSSPARSE "file-name.vmdk"
> + * RW [size in sectors] SESPARSE "file-name.vmdk"
> */
> flat_offset = -1;
> matches = sscanf(p, "%10s %" SCNd64 " %10s \"%511[^\n\r\"]\" %" SCNd64,
> @@ -869,7 +1245,8 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
>
> if (sectors <= 0 ||
> (strcmp(type, "FLAT") && strcmp(type, "SPARSE") &&
> - strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE")) ||
> + strcmp(type, "VMFS") && strcmp(type, "VMFSSPARSE") &&
> + strcmp(type, "SESPARSE")) ||
> (strcmp(access, "RW"))) {
> continue;
> }
> @@ -922,6 +1299,13 @@ static int vmdk_parse_extents(const char *desc, BlockDriverState *bs,
> return ret;
> }
> extent = &s->extents[s->num_extents - 1];
> + } else if (!strcmp(type, "SESPARSE")) {
> + ret = vmdk_open_se_sparse(bs, extent_file, bs->open_flags, errp);
> + if (ret) {
> + bdrv_unref_child(bs, extent_file);
> + return ret;
> + }
> + extent = &s->extents[s->num_extents - 1];
> } else {
> error_setg(errp, "Unsupported extent type '%s'", type);
> bdrv_unref_child(bs, extent_file);
> @@ -956,6 +1340,7 @@ static int vmdk_open_desc_file(BlockDriverState *bs, int flags, char *buf,
> if (strcmp(ct, "monolithicFlat") &&
> strcmp(ct, "vmfs") &&
> strcmp(ct, "vmfsSparse") &&
> + strcmp(ct, "seSparse") &&
> strcmp(ct, "twoGbMaxExtentSparse") &&
> strcmp(ct, "twoGbMaxExtentFlat")) {
> error_setg(errp, "Unsupported image type '%s'", ct);
> @@ -1206,10 +1591,12 @@ static int get_cluster_offset(BlockDriverState *bs,
> {
> unsigned int l1_index, l2_offset, l2_index;
> int min_index, i, j;
> - uint32_t min_count, *l2_table;
> + uint32_t min_count;
> + void *l2_table;
> bool zeroed = false;
> int64_t ret;
> int64_t cluster_sector;
> + unsigned int l2_size_bytes = extent->l2_size * extent->entry_size;
>
> if (m_data) {
> m_data->valid = 0;
> @@ -1224,7 +1611,31 @@ static int get_cluster_offset(BlockDriverState *bs,
> if (l1_index >= extent->l1_size) {
> return VMDK_ERROR;
> }
> - l2_offset = extent->l1_table[l1_index];
> + if (extent->sesparse) {
Maybe add assertions that extent->entry_size == 8 here and... [2]
> + uint64_t l2_offset_u64 = ((uint64_t *)extent->l1_table)[l1_index];
> + if (l2_offset_u64 == 0) {
> + l2_offset = 0;
> + } else if ((l2_offset_u64 & 0xffffffff00000000) != 0x1000000000000000) {
> + /*
> + * Top most nibble is 0x1 if grain table is allocated.
> + * strict check - top most 4 bytes must be 0x10000000 since max
> + * supported size is 64TB for disk - so no more than 64TB / 16MB
> + * grain directories which is smaller than uint32,
> + * where 16MB is the only supported default grain table coverage.
> + */
> + return VMDK_ERROR;
> + } else {
> + l2_offset_u64 = l2_offset_u64 & 0x00000000ffffffff;
> + l2_offset_u64 = extent->sesparse_l2_tables_offset +
> + l2_offset_u64 * l2_size_bytes / BDRV_SECTOR_SIZE;
> + if (l2_offset_u64 > 0x00000000ffffffff) {
> + return VMDK_ERROR;
> + }
> + l2_offset = (unsigned int)(l2_offset_u64);
> + }
> + } else {
[2] ...that extent->entry_size == 4 here?
> + l2_offset = ((uint32_t *)extent->l1_table)[l1_index];
> + }
> if (!l2_offset) {
> return VMDK_UNALLOC;
> }
> @@ -1236,7 +1647,7 @@ static int get_cluster_offset(BlockDriverState *bs,
> extent->l2_cache_counts[j] >>= 1;
> }
> }
> - l2_table = extent->l2_cache + (i * extent->l2_size);
> + l2_table = (char *)extent->l2_cache + (i * l2_size_bytes);
> goto found;
> }
> }
> @@ -1249,13 +1660,13 @@ static int get_cluster_offset(BlockDriverState *bs,
> min_index = i;
> }
> }
> - l2_table = extent->l2_cache + (min_index * extent->l2_size);
> + l2_table = (char *)extent->l2_cache + (min_index * l2_size_bytes);
> BLKDBG_EVENT(extent->file, BLKDBG_L2_LOAD);
> if (bdrv_pread(extent->file,
> (int64_t)l2_offset * 512,
> l2_table,
> - extent->l2_size * sizeof(uint32_t)
> - ) != extent->l2_size * sizeof(uint32_t)) {
> + l2_size_bytes
> + ) != l2_size_bytes) {
> return VMDK_ERROR;
> }
>
> @@ -1263,16 +1674,45 @@ static int get_cluster_offset(BlockDriverState *bs,
> extent->l2_cache_counts[min_index] = 1;
> found:
> l2_index = ((offset >> 9) / extent->cluster_sectors) % extent->l2_size;
> - cluster_sector = le32_to_cpu(l2_table[l2_index]);
>
> - if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
> - zeroed = true;
> + if (extent->sesparse) {
> + cluster_sector = le64_to_cpu(((uint64_t *)l2_table)[l2_index]);
> + switch (cluster_sector & 0xf000000000000000) {
> + case 0x0000000000000000:
> + /* unallocated grain */
> + if (cluster_sector != 0) {
> + return VMDK_ERROR;
> + }
> + break;
> + case 0x1000000000000000:
> + /* scsi-unmapped grain - fallthrough */
> + case 0x2000000000000000:
> + /* zero grain */
> + zeroed = true;
> + break;
> + case 0x3000000000000000:
> + /* allocated grain */
> + cluster_sector = (((cluster_sector & 0x0fff000000000000) >> 48) |
> + ((cluster_sector & 0x0000ffffffffffff) << 12));
That’s just insane. I trust you, though.
> + cluster_sector = extent->sesparse_clusters_offset +
> + cluster_sector * extent->cluster_sectors;
> + break;
> + default:
> + return VMDK_ERROR;
> + }
> + } else {
> + cluster_sector = le32_to_cpu(((uint32_t *)l2_table)[l2_index]);
> +
> + if (extent->has_zero_grain && cluster_sector == VMDK_GTE_ZEROED) {
> + zeroed = true;
> + }
> }
>
> if (!cluster_sector || zeroed) {
> if (!allocate) {
> return zeroed ? VMDK_ZEROED : VMDK_UNALLOC;
> }
> + assert(!extent->sesparse);
>
> if (extent->next_cluster_sector >= VMDK_EXTENT_MAX_SECTORS) {
> return VMDK_ERROR;
> @@ -1296,7 +1736,7 @@ static int get_cluster_offset(BlockDriverState *bs,
> m_data->l1_index = l1_index;
> m_data->l2_index = l2_index;
> m_data->l2_offset = l2_offset;
> - m_data->l2_cache_entry = &l2_table[l2_index];
> + m_data->l2_cache_entry = ((uint32_t *)l2_table) + l2_index;
> }
> }
> *cluster_offset = cluster_sector << BDRV_SECTOR_BITS;
> @@ -1622,6 +2062,9 @@ static int vmdk_pwritev(BlockDriverState *bs, uint64_t offset,
> if (!extent) {
> return -EIO;
> }
> + if (extent->sesparse) {
> + return -ENOTSUP;
> + }
([1] This should still stay here, even with bdrv_apply_auto_read_only()
above.)
> offset_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
> n_bytes = MIN(bytes, extent->cluster_sectors * BDRV_SECTOR_SIZE
> - offset_in_cluster);
>
Overall, looks reasonable to me. Then again, I have no clue of the
format and I trust you that it works.
Max