Re: [Qemu-devel] [PATCH VERSION 3] Disk image exclusive and shared locks
From: Anthony Liguori
Subject: Re: [Qemu-devel] [PATCH VERSION 3] Disk image exclusive and shared locks
Date: Tue, 15 Dec 2009 12:45:13 -0600
User-agent: Thunderbird 2.0.0.23 (X11/20090825)
Richard W.M. Jones wrote:
> Thanks for looking at this.
Thanks for following up :-)
> The use case is "cluster filesystem with an admin tool that must be
> run exclusively". Cluster nodes open the block device for write, but
> with a shared lock. The admin tool needs exclusive access (no nodes
> must be writing), so it tries to open the device with lock=exclusive.
> This only succeeds if the normal cluster nodes have backed off.
This seems like a reasonably uncommon scenario, and my concern is that it
really hurts the interface. lock=on|off I think qualifies as "The
obvious use is the correct one", whereas I think
lock=exclusive|shared|none would qualify as "Read the implementation and
you'll get it right."
And that's mainly due to the number of corner cases that have to be
considered in order for exclusive to degrade into shared.
> However, the patch would be a bit simpler if we just had lock=on|off
> at the moment, and it wouldn't stop us from adding the shared case in
> future. (For my needs I don't care about the shared case.)
Let's stick with lock=on|off. If anyone comes along and actually wants
shared|exclusive, we can revisit the thread :-)
>> I really dislike this as an interface. I think we need to make a
>> decision about whether we delay open until after migration has
>> completed. I think unless there's a really compelling argument against
>> it, this is probably what we should do.
> I'm guessing this needs some quite major surgery to qemu so that block
> device opening is delayed in the case of migration. I can take a look
> at this.
I think it's actually easier than I expected it to be.
We really just need to delay lock acquisition until it's really needed
(read/write op). For probing, we can get away without having a lock
held since it's read only and we only use it to decide which bdrv to use.
So what I'd expect is that we wouldn't acquire a lock until we actually
ran the VM. We would basically have a bdrv_lock_all() that could fail
if some lock couldn't be acquired. We would execute that right before
running a VM (which means it happens after incoming migration completes).
After we completed migration (and did an fsck), we would call
bdrv_unlock_all() to drop the lock. Before we did 'cont' again, we
would issue bdrv_lock_all().
As long as we do a bdrv_unlock_all() before writing the last byte of the
migration stream, it's not at all racy. We also don't have to close the
fd on the source which means that we maintain the same level of robustness.
>> As it stands, we cannot make lock=!none the default if it takes an extra
>> monitor command to allow for live migration. I think if we're going to
>> introduce this functionality, we probably should be enabling it by
>> default.
> Xen has a similar feature, and it is enabled by default.
And I really, really dislike it :-) There are a lot of corner cases
with requiring the use of '!' to the point where we ended up just using
it for everything in our management tools. IIRC, they don't use
advisory locking and instead try to be clever with something similar to
lsof.
Regards,
Anthony Liguori