From: Walid Nouri
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
Date: Wed, 17 Sep 2014 22:53:32 +0200
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0
Thank you for your time and the detailed answer!
I needed some time to work through it ;-)
What MC needs is a block device agnostic, controlled and asynchronous
approach for replicating the contents of block devices and their state changes
to the secondary VM while the primary VM is running. Asynchronous block
transfer is important to allow maximum performance for the primary VM, while
keeping the secondary VM updated with state changes.
The block device replication should be possible in two stages or modes.
The first stage is the live copy of all block devices of the primary to the
secondary. This is necessary if the secondary doesn't have an existing
image which is in sync with the primary at the time MC is started. This is
not very convenient, but as far as I know there is currently no mechanism for
a persistent dirty bitmap in QEMU.
I think you are trying to address the non-shared storage case where the
secondary needs to acquire the initial state of the primary.
That's correct!
drive-mirror copies the contents of a source disk image to a
destination. If the guest is running while copying takes place then new
writes will also be mirrored.
drive-mirror should be sufficient for the initial phase where primary
and secondary get in sync.
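For the initial full-copy phase the QMP call could look roughly like this
(a minimal Python sketch; the socket path, device name and target path are
placeholders, the target image is assumed to be pre-created, and error/event
handling is reduced to the bare minimum):

import json, socket

def qmp(sock_path, cmd, **args):
    """Minimal QMP client: connect, negotiate capabilities, run one command."""
    s = socket.socket(socket.AF_UNIX)
    s.connect(sock_path)
    f = s.makefile("rw")
    json.loads(f.readline())                       # server greeting
    f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
    f.flush()
    json.loads(f.readline())                       # capabilities ack
    f.write(json.dumps({"execute": cmd, "arguments": args}) + "\n")
    f.flush()
    while True:
        resp = json.loads(f.readline())
        if "return" in resp or "error" in resp:    # skip asynchronous events
            return resp

# Copy the whole disk of the running primary to a pre-created target image;
# new guest writes are mirrored as they happen.
print(qmp("/var/run/qemu-primary.qmp", "drive-mirror",
          device="drive0",
          target="/mnt/secondary/drive0.img",
          sync="full",
          mode="existing"))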
Fam Zheng sent a patch series earlier this year to add dirty bitmaps for
block devices to QEMU. It only supported in-memory bitmaps but
persistent bitmaps are fairly straightforward to implement. I'm
interested in these patches for the incremental backup use case.
https://lists.gnu.org/archive/html/qemu-devel/2014-03/msg05250.html
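For illustration, the bitmap interface that eventually landed in QEMU looks
roughly like this (the command name and the "persistent" flag may differ from
Fam's series; this reuses the qmp() helper from the sketch above):

# Track guest writes to drive0 in a named dirty bitmap.  With persistent=True
# the bitmap would survive a shutdown of the primary, so only blocks dirtied
# since the last sync need to be mirrored when the secondary rejoins.
qmp("/var/run/qemu-primary.qmp", "block-dirty-bitmap-add",
    node="drive0", name="mc-resync", persistent=True)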
I guess the reason you mention persistent bitmaps is to save time when
adding a host that previously participated and has an older version of
the disk image?
Yes, it is desirable not to always mirror the whole image before the MC
protection can become active. This would save time in case of lost
communication, a shutdown or maintenance on the secondary.
The persistent dirty bitmap must have a mechanism to identify that a
pair of images belong to each other and which of the two is the primary
with the current valid data. I think that's a self-contained "little"
project of its own...but the next logical step :)
The second stage (mode) is the replication of block device state changes
(modified blocks) to keep the image on the secondary in sync with the
primary. The mirrored blocks must be buffered in RAM (block buffer) until
the complete checkpoint (RAM, vCPU, device state) can be committed.
To keep the complete system state consistent on the secondary system,
MC must be able to commit or discard block device state
changes. In normal operation the mirrored block device state changes (block
buffer) are committed to disk when the complete checkpoint is committed. In
case of a crash of the primary system while transferring a checkpoint, the
data in the block buffer corresponding to the failed checkpoint must be
discarded.
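A rough sketch of the intended buffer semantics on the secondary (plain
Python for brevity; a real implementation would be a block filter inside
QEMU, and all names here are made up):

import os

class EpochBlockBuffer:
    """Buffer mirrored writes of the current epoch and apply or drop them
    together with the rest of the checkpoint."""

    def __init__(self, image_path):
        self.fd = os.open(image_path, os.O_RDWR)
        self.pending = []                    # [(offset, data), ...] of this epoch

    def buffer_write(self, offset, data):
        # Called for every mirrored write received during the epoch.
        self.pending.append((offset, data))

    def commit(self):
        # Checkpoint committed: make the buffered writes visible on disk.
        for offset, data in self.pending:
            os.pwrite(self.fd, data, offset)
        os.fsync(self.fd)                    # flush the host disk cache
        self.pending = []

    def discard(self):
        # Primary failed while transferring the checkpoint: drop the epoch.
        self.pending = []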
Thoughts:
Writing data safely to disk can take milliseconds. Not sure how that
figures into your commit step, but I guess commit needs to be fast.
We have no time to waste ;) but the disk semantics expected by the primary
guest should be preserved. Acknowledging a
checkpoint from the secondary will be delayed by the time needed to
write all pending I/O requests of that checkpoint to disk.
I think for normal operation (just replication) the secondary can use
the same semantics for the disk writes as the primary. Wouldn't that be
safe enough?
I/O requests happen in parallel with CPU execution, so could an I/O
request be pending across a checkpoint commit? Live migration does not
migrate in-flight requests, although it has special-case code for
migrating requests that have failed at the host level and need to be
retried. Another way of putting this is that live migration uses
bdrv_drain_all() to quiesce disks before migrating device state - I
don't think you have that luxury since bdrv_drain_all() can take a long
time and is not suitable for microcheckpointing.
Block devices have the following semantics:
1. There is no ordering between parallel in-flight I/O requests.
2. The guest sees the disk state for completed writes but it may not see
disk state of in-flight writes (due to #1).
3. Completed writes are only guaranteed to be persistent across power
failure if a disk cache flush was submitted and completed after the
writes completed.
I'm not sure if I got your point.
The proposed MC block device protocol sends all block device state
updates to the secondary directly after writing them to the primary
block devices. This keeps the disk semantics for the primary, and the
secondary stays updated with the disk state changes of the current epoch.
At the end of an epoch the primary gets paused to create a system state
snapshot. At that moment there could be some pending write I/O requests
on the primary which overlap with the generation of the system state
snapshot. Do you mean a situation like that?
If this is your point then I think you are right, this is possible...and
that raises your interesting question: how to deal with pending requests
at the end of an epoch, or how to be sure that all disk state changes of
an epoch have been replicated?
Currently the MC protocol only cares about a part of the system state
(RAM, vCPU, devices) and excludes the block device state changes.
To correctly use the drive-mirror functionality, the MC protocol must also be
extended to check that all disk state changes of the primary
corresponding to the current epoch have been delivered to the secondary.
When all state data has been completely sent, the checkpoint transaction can be
committed.
When the checkpoint transaction is complete the secondary commits its
disk state buffer and the rest (RAM, vCPU, devices) of the checkpoint and
ACKs the complete checkpoint to the primary.
IMHO the easiest way for MC to track that all block device changes have
been replicated would be to ask drive-mirror if the paused primary has
unprocessed write requests.
As long as there are dirty blocks or in-flight requests, the checkpoint
transaction of the current epoch is not complete.
Maybe you can give me a hint on what you think is the best way (API
call(s)) to ask drive-mirror if there are pending write operations?
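One idea would be to look at the mirror job's own progress counters via
query-block-jobs (reusing the qmp() helper from the first sketch; whether
offset == len is a strong enough statement about in-flight requests is
exactly what I am unsure about):

def mirror_in_sync(sock_path, device="drive0"):
    """True if the mirror job for 'device' reports no remaining dirty data."""
    for job in qmp(sock_path, "query-block-jobs")["return"]:
        if job["device"] == device and job["type"] == "mirror":
            # 'ready' means the job has caught up with the source at least once,
            # offset == len means no dirty blocks are outstanding right now.
            return job.get("ready", False) and job["offset"] == job["len"]
    return False

# At the end of an epoch, with the primary paused:
#     wait until mirror_in_sync("/var/run/qemu-primary.qmp") becomes true,
#     then commit the checkpoint.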
I think this can be achieved by drive-mirror and a filter block driver.
Another approach could be to exploit the block migration functionality of
live migration with a filter block driver.
block-migration.c should be avoided because it may be dropped from QEMU.
It is unloved code and has been replaced by drive-mirror.
Good to know!!!
I will avoid using block-migration.c.
drive-mirror (and live migration) do not rely on shared storage and
allow live block device copying and incremental syncing.
A block buffer can be implemented with a QEMU filter block driver. It should
sit at the same position as the Quorum driver in the block driver hierarchy.
When using the block filter approach MC will be transparent and block device
agnostic.
The block buffer filter must have an interface which allows MC to control the
commit or discard of block device state changes. I have no idea where to
put such an interface so that it conforms with the QEMU coding style.
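Just to make the idea concrete, the MC side could drive the filter through
two monitor commands like the following; the command names are invented for
this sketch and do not exist in QEMU (again using the qmp() helper from the
first sketch):

# Hypothetical commands exposed by the block buffer filter on the secondary.
def commit_epoch(secondary_qmp):
    # Write the buffered epoch to disk and flush the host disk cache.
    return qmp(secondary_qmp, "x-mc-block-commit")

def discard_epoch(secondary_qmp):
    # Drop the buffer of an incomplete/failed checkpoint.
    return qmp(secondary_qmp, "x-mc-block-discard")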
I'm sure there are alternative and better approaches and I'm open to
any ideas.
You can use drive-mirror and the run-time NBD server in QEMU without
modification:
Primary (drive-mirror) ---writes---> Secondary (NBD export in QEMU)
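Concretely, that unmodified setup might be wired up like this (a sketch
using the qmp() helper from above; host name, port and device names are
placeholders, and the exact NBD target syntax accepted by drive-mirror may
vary between QEMU versions):

# Secondary: export its copy of the image via QEMU's built-in NBD server.
qmp("/var/run/qemu-secondary.qmp", "nbd-server-start",
    addr={"type": "inet", "data": {"host": "0.0.0.0", "port": "10809"}})
qmp("/var/run/qemu-secondary.qmp", "nbd-server-add",
    device="drive0", writable=True)

# Primary: mirror its disk into that export while the guest keeps running.
qmp("/var/run/qemu-primary.qmp", "drive-mirror",
    device="drive0",
    target="nbd://secondary-host:10809/drive0",
    format="raw",
    sync="full",
    mode="existing")          # the export already exists, nothing to create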
Your block filter idea can work and must have the logic so that a commit
operation sent via the microcheckpointing protocol causes the block
filter to write buffered data to disk and flush the host disk cache.
That's exactly what the block filter has to do. Would "blockdev.c" be the
right place to put the API call to the block filter flush logic?
To ensure that the disk image on the secondary is always in a crash
consistent state (i.e. the state you get from power failure), the
secondary needs to know when disk cache flush requests were sent and the
write ordering. That way, even if there is a power failure while the
secondary is committing, the disk will be in a crash consistent state.
After the secondary (or primary) is booted again file systems or
databases will be able to fsck and resume.
(In other words, in a catastrophic failure you won't be any worse off
than with a power failure on an unprotected single machine.)
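To capture that, the buffer on the secondary would have to record the
primary's flush requests as barriers and preserve write ordering when
committing, roughly like this (again plain Python pseudocode rather than the
real block filter API; it extends the earlier buffer sketch):

import os

class OrderedEpochBuffer:
    """Replay buffered writes in arrival order and honour flush barriers, so a
    power failure while committing leaves a crash-consistent image."""

    def __init__(self, image_path):
        self.fd = os.open(image_path, os.O_RDWR)
        self.ops = []                        # ("write", offset, data) or ("flush",)

    def buffer_write(self, offset, data):
        self.ops.append(("write", offset, data))

    def buffer_flush(self):
        # The primary guest issued a disk cache flush at this point in the stream.
        self.ops.append(("flush",))

    def commit(self):
        for op in self.ops:
            if op[0] == "write":
                os.pwrite(self.fd, op[2], op[1])
            else:
                os.fsync(self.fd)            # honour the guest's flush barrier
        os.fsync(self.fd)                    # final flush before ACKing the checkpoint
        self.ops = []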
In case of a failover the secondary must drain all disks before becoming
the new primary, even if there is a delay caused by flushing disk buffers.
Otherwise the state of the block devices could be inconsistent with the
rest of the system state when the (new) primary starts processing.
Walid