[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] Block Filters
From: |
Fam Zheng |
Subject: |
Re: [Qemu-devel] Block Filters |
Date: |
Fri, 6 Sep 2013 17:18:20 +0800 |
User-agent: |
Mutt/1.5.21 (2010-09-15) |
On Fri, 09/06 10:45, Kevin Wolf wrote:
> Am 06.09.2013 um 09:56 hat Fam Zheng geschrieben:
> > On Tue, 09/03 18:24, Benoît Canet wrote:
> > >
> > > Hello list,
> > >
> > > I am thinking about QEMU block filters lately.
> > >
> > > I am not a block.c/blockdev.c expert so tell me what you think of the
> > > following.
> > >
> > > The use cases I see would be:
> > >
> > > -$user want to have some real cryptography on top of qcow2/qed or another
> > > format.
> > > snapshots and other block features should continue to work
> > >
> > > -$user want to use a raid like feature like QUORUM in QEMU.
> > > other features should continue to work
> > >
> > > -$user want to use the future SSD deduplication implementation with
> > > metadata on
> > > SSD and data on spinning disks.
> > > other features should continue to work
> > >
> > > -$user want to I/O throttle one drive of his vm.
> > >
> > > -$user want to do Copy On Read
> > >
> > > -$user want to do a combination of the above
> > >
> > > -$developer want to make the minimum of required steps to keep changes
> > > small
> > >
> > > -$developer want to keep user interface changes for later
> > >
> > > Lets take a example case of an user wanting to do I/O throttled encrypted
> > > QUORUM
> > > on top of QCOW2.
> > >
> > > Assuming we want to implement throttle and encryption as something
> > > remotely
> > > being like a block filter this makes a pretty complex BlockDriverState
> > > tree.
> > >
> > > The tree would look like the following:
> > >
> > > I/O throttling BlockDriverState (bs)
> > > |
> > > |
> > > |
> > > |
> > > Encryption BlockDriverState (bs)
> > > |
> > > |
> > > |
> > > |
> > > Quorum BlockDriverState (bs)
> > > / | \
> > > / | \
> > > / | \
> > > / | \
> > > QCOW2 bs QCOW2 b s QCOW2 bs
> > > | | |
> > > | | |
> > > | | |
> > > | | |
> > > RAW bs RAW bs RAW bs
> > >
> > > An external snapshot should result in a tree like the following.
> > > I/O throttling BlockDriverState (bs)
> > > |
> > > |
> > > |
> > > |
> > > Encryption BlockDriverState (bs)
> > > |
> > > |
> > > |
> > > |
> > > Quorum BlockDriverState (bs)
> > > / | \
> > > / | \
> > > / | \
> > > / | \
> > > QCOW2 bs QCOW2 bs QCOW2 bs
> > > | | |
> > > | | |
> > > | | |
> > > | | |
> > > QCOW2 bs QCOW2 bs QCOW2 bs
> > > | | |
> > > | | |
> > > | | |
> > > | | |
> > > RAW bs RAW bs RAW bs
> > >
> > > In the current state of QEMU we can code some block drivers to implement
> > > this
> > > tree.
> > >
> > > However when doing operations like snapshots blockdev.c would have no
> > > real idea
> > > of what should be snapshotted and how. (The 3 top bs should be kept on
> > > top)
> > >
> > > Moreover it would have no way to manipulate easily this tree of
> > > BlockDriverState
> > > has each one is encapsulated in it's parent.
> > >
> > > Also there no generic way to tell the block layer that two or more
> > > BlockDriverState
> > > are siblings.
> > >
> > > The current mail is here to propose some additionals structures in order
> > > to cope
> > > with these problems.
> > >
> > > The overall strategy of the proposed structures is to push out the
> > > BlockDriverStates relationships out of each BlockDriverState.
> > >
> > > The idea is that it would make it easier for the block layer to
> > > manipulate a
> > > well known structure instead of being forced to enter into each
> > > BlockDriverState
> > > specificity.
> > >
> > > The first structure is the BlockStackNode.
> > >
> > > The BlockStateNode would be used to represent the relationship between the
> > > various BlockDriverStates
> > >
> > > struct BlockStackNode {
> > > BlockDriverState *bs; /* the BlockDriverState holded by this node */
> > >
> > > /* this doubly linked list entry points to the child node and the
> > > parent
> > > * node
> > > */
> > > QLIST_ENTRY(BlockStateNode) down;
> > >
> > > /* This doubly linked list entry point to the siblings of this node
> > > */
> > > QLIST_ENTRY(BlockStateNode) siblings;
> > >
> > > /* a hash or an array of the sibbling of this node for fast access
> > > * should be recomputed when updating the tree */
> > > QHASH_ENTRY<BlockStateNode, index> sibblings_hash;
> > > }
> > >
> > > The BlockBackend would be the structure used to hold the "drive" the
> > > guest use.
> > >
> > > struct BlockBackend {
> > > /* the following doubly linked list header point to the top
> > > BlockStackNode
> > > * in our case it's the one containing the I/O throttling bs
> > > */
> > > QLIST_HEAD(, BlockStateNode) block_stack_head;
> > > /* this is a pointer to the topest node below the block filter chain
> > > * in our case the first QCOW2 sibling
> > > */
> > > BlockStackNode *top_node_below_filters;
> > > }
> > >
> > >
> > > Updated diagram:
> > >
> > > (Here bsn means BlockStacknode)
> > >
> > > ------------------------BlockBackend
> > > | |
> > > | block_stack_head
> > > | |
> > > | |
> > > | I/O throttling BlockStackNode (contains it's
> > > bs)
> > > | |
> > > | down
> > > | |
> > > | |
> > > top_node_below_filter Encryption BlockStacknode (contains it's bs)
> > > | |
> > > | down
> > > | |
> > > | |
> > > | Quorum BlockStackNode (contain's it's bs)
> > > | /
> > > | down
> > > | /
> > > | / S S
> > > ------ QCOW2 bsn--i---QCOW2 bsn--i------ QCOW2 bsn (each bsn
> > > contains a bs)
> > > | b | b |
> > > down l down l down
> > > | i | i |
> > > | n | n |
> > > | g | g |
> > > | s | s |
> > > | | |
> > > RAW bsn RAW bsn RAW bsn (each bsn contains
> > > a bs)
> > >
> > >
> > > Block driver point of view:
> > >
> > > to construct the tree each BlockDriver would have some utility functions
> > > looking
> > > like.
> > >
> > > bdrv_register_child_bs(bs, child_bs, int index);
> > >
> > > multiples calls to this function could be done to register multiple
> > > siblings
> > > childs identified by their index.
> > >
> > > This way something like quorum could register multiple QCOW2 instances.
> > >
> > > driver would have a
> > > BlockDriverSTate *bdrv_access_child(bs, int index);
> > >
> > > to access their childs.
> > >
> > > These functions can be implemented without the driver knowing about
> > > BlockStateNodes using container_of.
> > >
> > > blockdev point of view: (here I need your help)
> > >
> > > When doing a snapshot blockdev.c would access
> > > BlockBackend->top_node_below_filter and make a snapshot of the bs
> > > contained in
> > > this node and it's sibblings.
> > >
> > Since BlockDriver.bdrv_snapshot_create() is an optional operation,
> > blockdev.c
> > can navigate down the tree from top node, until hitting some layer where
> > the op
> > is implemented (the QCow2 bs), so we get rid of this top_node_below_filter
> > pointer.
>
> Is it even inherent to a block driver (like a filter), if a snapshot is
> to be taken at its level? Or is it rather a policy decision that should
> be made by the user?
>
OK, getting the point that user should have full flexibility and fine operation
granularity. It also stands against block_backend->top_node_below_filter. Do we
really have the assumption that all the filters are on top of the tree and
linear?
Shouldn't this be possible?
Block Backend
|
|
Quodrum BDS
/ | \
iot filter | \
/ | \
qcow2 qcow2 qcow2
So we throttle only a particular image, not the whole device. But this will
make a top_node_below_filter pointer impossible.
> In our example, the quorum driver, it's not at all clear to me that you
> want to snapshot all children. In order to roll back to a previous
> state, one snapshot is enough, you don't need multiple copies of the
> same one. Perhaps you want two so that we can still compare them for
> verification. Or all of them because you can afford the disk space and
> want ultimate safety. I don't think qemu can know which one is true.
>
Only if quorum ever knows about and operates on snapshots, it should be
considered specifically, but no. So we need to achieve this in the general
design: allow user to take snapshot, or set throttle limits on particular
BDSes, as above graph.
> In the same way, in a typical case you may want to keep I/O throttling
> for the whole drive, including the new snapshot. But what if the
> throttling was used in order to not overload the network where the image
> is stored, and you're now doing a local snapshot, to which you want to
> stream the image? The I/O throttling should apply only to the backing
> file, not the new snapshot.
>
Yes, and OTOH, throttling really suits to be a filter only if it can be a non
top one, otherwise it's no better than what we have now.
> So perhaps what we really need is a more flexible snapshot/BDS tree
> manipulation command that describes in detail which structure you want
> to have in the end.
>
Fam
- Re: [Qemu-devel] Block Filters, (continued)
- Re: [Qemu-devel] Block Filters, Benoît Canet, 2013/09/04
- Re: [Qemu-devel] Block Filters, Kevin Wolf, 2013/09/05
- Re: [Qemu-devel] Block Filters, Stefan Hajnoczi, 2013/09/05
- Re: [Qemu-devel] Block Filters, Fam Zheng, 2013/09/05
- Re: [Qemu-devel] Block Filters, Stefan Hajnoczi, 2013/09/05
- Re: [Qemu-devel] Block Filters, Benoît Canet, 2013/09/05
- Re: [Qemu-devel] Block Filters, Kevin Wolf, 2013/09/06
- Re: [Qemu-devel] Block Filters, Fam Zheng, 2013/09/06
Re: [Qemu-devel] Block Filters, Fam Zheng, 2013/09/06
- Re: [Qemu-devel] Block Filters, Kevin Wolf, 2013/09/06
- Re: [Qemu-devel] Block Filters,
Fam Zheng <=
- Re: [Qemu-devel] Block Filters, Kevin Wolf, 2013/09/06
- Re: [Qemu-devel] Block Filters, Fam Zheng, 2013/09/06
- Re: [Qemu-devel] Block Filters, Benoît Canet, 2013/09/15
- Re: [Qemu-devel] Block Filters, Fam Zheng, 2013/09/16
- Re: [Qemu-devel] Block Filters, Benoît Canet, 2013/09/16
- Re: [Qemu-devel] Block Filters, Benoît Canet, 2013/09/16