Re: Open qcow2 on multiple hosts simultaneously.
From: kvaps
Subject: Re: Open qcow2 on multiple hosts simultaneously.
Date: Wed, 21 Jun 2023 11:28:02 +0200
> Good to hear that. Alberto also has been working on a CSI driver
> which makes use of qemu-storage-daemon and qcow2 files either with
> local storage or shared storage like NFS. At this point of time it
> focusses on filesystem backends as that's where it is easiest to
> manage qcow2 files. But I think that could be extended to support
> block device backends (ex. LVM) too.
>
> https://gitlab.com/subprovisioner/subprovisioner
>
> This is still work in progress. But I think there might be some overlaps
> in your work and the subprovisioner project.
>
Wow that is amazing! Yeah, Alice told me about this project, thanks
for the link.
I wasn't able to reach Alberto. But now I'm happy that I can see the code.
> Hmm..., I will need to spend more time going through numbers and setup.
> This result is little surprising to me though. If you are using
> vduse, nbd, ublk kind of exports, that means all IO will go to kernel
> first, then to userspace(qsd) and then back into kernel. But with
> pure LVM based approach, I/O path is much shorter (user space to
> kernel). Given that, it's a little surprising that qcow2 is still
> faster as compared to LVM.
Raw LVM is, of course, faster than qcow2, but its performance drops as
soon as any snapshot is created.
This test compared three technologies (LVM, LVMThin, and QCOW2) to
determine which one provides snapshots with the least impact on
performance.
> If you somehow managed to use vhost-user-blk export instead, then I/O
> path is shorter for qcow2 as well and that might perform well.
Yeah, but the problem is that I need a solution for both containers and
virtual machines.
Containers can only get block devices through the kernel. Also,
KubeVirt has no vhost-user support (yet).
> NBD will be slow. I am curious to know how do UBLK and VDUSE block
> compare. Technically there does not seem to be any reason why VDUSE
> virtio-vdpa device will be faster as compared to ublk. But I could
> be wrong.
The tests on this page measure the performance of the local block device,
not of virtual machines.
UBLK is faster in 4K T1Q128 and much faster in 4K T1Q1 with fsync, but
slower in 4K T4Q128 and in 4M sequential writes.
> What about vhost-user-blk export. Have you considered that? That
> probably will be fastest.
For VMs yeah, but I need to align the design with KubeVirt, thus I
decided to do this later.
> > Despite two independent instances of qemu-storage-daemon for same
> > qcow2 disk running successfully on different hosts, I have concerns
> > about their proper functioning. Similar to live migration, I think
> > they should share the state between each other.
>
> Is it same LV on both the nodes? How are you activating same LV on
> two nodes? IIUC, LVM does not allow that.
LVM can work on multiple nodes even without the clustered extension.
If nothing competes to modify the LVM metadata, and you refresh the
metadata before every operation, you can live without LVM-level locks.
The locks can be implemented on the CSI driver side using the Kubernetes API.
Some solutions already implement this approach, e.g. OpenNebula and Proxmox.
But I'm going to use the traditional approach: lvmlockd with the sanlock backend.
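
To make the "lock outside LVM, refresh before every operation" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `vg_lock` is a hypothetical in-process stand-in for a cluster-wide lock (a real driver would need something shared between nodes, such as a lock kept through the Kubernetes API), `lvchange --refresh` is just one plausible choice of refresh command, and `run` is a caller-supplied function so nothing is executed here.

```python
import threading
from contextlib import contextmanager

# Hypothetical stand-in for a cluster-wide lock. An in-process lock only
# illustrates the ordering; it does NOT protect against other nodes.
_locks = {}
_locks_guard = threading.Lock()


@contextmanager
def vg_lock(vg):
    with _locks_guard:
        lock = _locks.setdefault(vg, threading.Lock())
    lock.acquire()
    try:
        yield
    finally:
        lock.release()


def locked_lvm_operation(vg, lv, operation, run):
    """Refresh the LV, then run `operation`, all under the per-VG lock.

    `operation` is an argv list; `run` is whatever executes argv lists
    (for example subprocess.check_call, or list.append for a dry run).
    """
    refresh = ["lvchange", "--refresh", f"{vg}/{lv}"]  # assumed refresh step
    with vg_lock(vg):
        run(refresh)
        run(operation)


# Dry run: record the commands instead of executing them.
planned = []
locked_lvm_operation("vg0", "disk1",
                     ["lvextend", "-L", "+1G", "vg0/disk1"],
                     planned.append)
for argv in planned:
    print(" ".join(argv))
```

The names `vg0` and `disk1` are made up. With lvmlockd and sanlock the locking is handled by LVM itself, so a wrapper like this would not be needed.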
> >
> > The question is how to make qemu-storage-daemon to share the state
> > between multiple nodes, or is qcow2 format inherently stateless and
> > does not requires this?
>
> That's a good question. For simplicity we could think of NFS backed
> storage and a qcow2 file providing storage. Can two QSD instances
> actually work with same qcow2 file?
>
> I am not sure this can be made to work with writable storage. Read-only
> storage, probably yes.
I already found an answer, on the qemu wiki page "Migration with
shared storage". It describes how live-migration works for various
formats:
QCOW2 caches two forms of data, cluster metadata (L1/L2 data, refcount
table, etc) and mutable header information (file size, snapshot
entries, etc). This data is discarded after the last piece of incoming
migration data is received but before the guest starts, hence QCOW2
images are safe to use with migration.
Thus, qemu does not actually share state between the instances; it just
drops the cache after live migration.
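
A toy model of why the cache drop matters, assuming nothing about QEMU internals: `SharedImage` and `Instance` are hypothetical classes, and the dictionary only stands in for the cached cluster metadata that the wiki text mentions. It is not how qcow2 is implemented; it only shows that a second opener keeps seeing old metadata until it discards its cache.

```python
class SharedImage:
    """Toy stand-in for the image file on shared storage."""

    def __init__(self):
        self.l2_table = {}  # guest cluster index -> host offset


class Instance:
    """Toy stand-in for one process that has the image open."""

    def __init__(self, image):
        self.image = image
        self.cache = dict(image.l2_table)  # metadata cached at open time

    def allocate(self, cluster, offset):
        # Update the private cache and write through to the shared file.
        self.cache[cluster] = offset
        self.image.l2_table[cluster] = offset

    def lookup(self, cluster):
        # Served from the private cache only; the file is never re-read.
        return self.cache.get(cluster)

    def invalidate_cache(self):
        # Analogue of dropping the cache at the end of migration.
        self.cache = dict(self.image.l2_table)


image = SharedImage()
source = Instance(image)
target = Instance(image)

source.allocate(0, 0x50000)      # source maps a new cluster
stale = target.lookup(0)         # target still has its old view
target.invalidate_cache()
fresh = target.lookup(0)         # now the mapping is visible

print(stale, fresh)
```

In the migration case there is a single well-defined moment to call the equivalent of `invalidate_cache()`. With two instances writing at the same time there is no such moment.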
The problem is that a CSI driver can't catch the moment when the VM has
finished migrating and do the same, because it knows nothing
about the workload that uses the block device.
But Kubernetes expects the driver to provide fully featured
ReadWriteMany on both nodes.
I did some experiments:
experiment 1:
I exported the qcow2 via NBD from the target node and attached it on the
source node, then started 'blockdev-mirror' with `sync=none` from the
local qcow2 to the target node.
It worked. The target node's block device became aware of the
changes on the source. But since this is still an asynchronous
operation, I think it will not work for intensive workloads. Also, the
reverse 'blockdev-mirror' from the target node to the source node did
not work, because I had actually created a loop (the bytes kept moving
from source to target and vice versa continuously).
Thus I decided that qemu does not really support bidirectional sync of
changes for a block device.
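
For reference, a rough reconstruction of the kind of QMP commands experiment 1 involves, built as Python dictionaries so they can be printed as JSON. These are not the exact commands that were run: the node names (`disk0`, `remote0`), the export name, the job id, the host address and the port are all placeholders, and the argument shapes should be checked against the QMP reference for the QEMU version in use.

```python
import json


def experiment1_commands(target_host, port="10809",
                         target_node="disk0", source_node="disk0"):
    """Return the QMP commands per node (all names are placeholders)."""
    on_target = [
        # Start an NBD server on the target node ...
        {"execute": "nbd-server-start",
         "arguments": {"addr": {"type": "inet",
                                "data": {"host": "0.0.0.0", "port": port}}}},
        # ... and export the target's qcow2 node through it, writable.
        {"execute": "block-export-add",
         "arguments": {"type": "nbd", "id": "export0",
                       "node-name": target_node, "name": "disk0",
                       "writable": True}},
    ]
    on_source = [
        # Attach the remote export on the source node.
        {"execute": "blockdev-add",
         "arguments": {"driver": "nbd", "node-name": "remote0",
                       "server": {"type": "inet", "host": target_host,
                                  "port": port},
                       "export": "disk0"}},
        # Mirror only new writes (sync=none) from local qcow2 to the remote.
        {"execute": "blockdev-mirror",
         "arguments": {"job-id": "mirror0", "device": source_node,
                       "target": "remote0", "sync": "none"}},
    ]
    return {"target": on_target, "source": on_source}


commands = experiment1_commands("192.0.2.10")
for node, cmds in commands.items():
    for cmd in cmds:
        print(node, json.dumps(cmd))
```

Starting the same mirror in the opposite direction as well gives the loop described above: each side forwards to the other the writes it has just received.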
experiment 2:
I exported the local qcow2 via NBD from the source node and attached it on
the target node. Thus the target node used the qcow2 from the source instead
of its own local qcow2. Unfortunately, writes on the target were not
synchronised to the source node for some reason.
There were also problems with the stability of this configuration.
E.g. when the source node got disconnected, the target node got stuck
waiting for the NBD export.
I was thinking of playing with the `quorum` driver in qemu-storage-daemon to
specify two backends (remote and local). But I understood that it will
not guarantee the consistency of metadata anyway. There must be only one
actual user of the qcow2.
Thus I concluded that this whole scheme with replication over
NBD will not work well, and I need to find another workaround.
One option might be to detect the moment when KubeVirt finishes
live migration and disconnect/reconnect the qcow2 file to drop and
reload the caches. But this would require modifying the code on
the KubeVirt side, and would also make the driver dependent on the workload,
which is completely wrong from the point of view of Kubernetes and
KubeVirt.
Storage must be storage. Workload must be workload. They should
communicate with each other only via a dedicated interface (the block
device) and know nothing about each other's control plane.
> For example, even if QSD could handle that, we will be having some
> local filesystem visible to client on this block device (say
> ext4/xfs/btrfs). These are built for one user and they don't expect
> any other client is changing the blocks at the same time.
>
> So I am not sure how one can export ReadWriteMany volumes using qcow2
> or LVM for that matter. We probably need a shared filesystem for that
> (NFS, GFS etc).
>
> Am I missing something?
I'm talking only about ReadWriteMany for block devices, not for
filesystems. The main purpose of this is live migration of virtual
machines.
In theory there might be other uses, e.g. highly available iSCSI targets in
Pods or cluster filesystems, but for now I don't consider them.