[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [RFC] qcow2 journalling draft
From: |
Fam Zheng |
Subject: |
Re: [Qemu-devel] [RFC] qcow2 journalling draft |
Date: |
Fri, 6 Sep 2013 17:20:53 +0800 |
User-agent: |
Mutt/1.5.21 (2010-09-15) |
On Wed, 09/04 11:39, Kevin Wolf wrote:
> First of all, excuse any inconsistencies in the following mail. I wrote
> it from top to bottom, and there was some thought process involved in
> almost every paragraph...
>
> Am 04.09.2013 um 10:03 hat Stefan Hajnoczi geschrieben:
> > On Tue, Sep 03, 2013 at 03:45:52PM +0200, Kevin Wolf wrote:
> > > @@ -103,7 +107,11 @@ in the description of a field.
> > > write to an image with unknown auto-clear features
> > > if it
> > > clears the respective bits from this field first.
> > >
> > > - Bits 0-63: Reserved (set to 0)
> > > + Bit 0: Journal valid bit. This bit indicates
> > > that the
> > > + image contains a valid main journal
> > > starting at
> > > + journal_offset.
> >
> > Whether the journal is used can be determined from the journal_offset
> > value (header length must be large enough and journal offset must be
> > valid).
> >
> > Why do we need this autoclear bit?
>
> Hm, I introduced this one first and the journal dirty incompatible bit
> later, perhaps it's unnecessary now. Let's check...
>
> The obvious thing we need to protect against is applying stale journal
> data to an image that has been changed by an older version. As long as
> the journal is clean, this can't happen, and the journal dirty bit will
> ensure that the old version can only open the image if it is clean.
>
> However, what if we run 'qemu-img check -r leaks' with an old qemu-img
> version? It will reclaim the clusters used by the journal, and if we
> continue using the journal we'll corrupt whatever new data is there
> now.
>
Why can old version qemu-img open the image with dirty journal in the first
place? It's incompatible bit.
> Can we protect against this without using an autoclear bit?
>
> > > +Journals are used to allow safe updates of metadata without impacting
> > > +performance by requiring flushes to order updates to different parts of
> > > the
> > > +metadata.
> >
> > This sentence is hard to parse. Maybe something shorter like this:
> >
> > Journals allow safe metadata updates without the need for carefully
> > ordering and flushing between update steps.
>
> Okay, I'll update the text with your proposal.
>
> > > +They consist of transactions, which in turn contain operations that
> > > +are effectively executed atomically. A qcow2 image can have a main image
> > > +journal that deals with cluster management operations, and additional
> > > specific
> > > +journals can be used by other features like data deduplication.
> >
> > I'm not sure if multiple journals will work in practice. Doesn't this
> > re-introduce the need to order update steps and flush between them?
>
> This is a question for Benoît, who made this requirement. I asked him
> the same a while ago and apparently his explanation made some sense to
> me, or I would have remembered that I don't want it. ;-)
>
> It might have something to do with the fact that deduplication uses the
> journal more as a kind of cache for hash values that can be dropped and
> rebuilt after a crash.
>
> > > +A journal is organised in journal blocks, all of which have a reference
> > > count
> > > +of exactly 1. It starts with a block containing the following journal
> > > header:
> > > +
> > > + Byte 0 - 7: Magic ("qjournal" ASCII string)
> > > +
> > > + 8 - 11: Journal size in bytes, including the header
> > > +
> > > + 12 - 15: Journal block size order (block size in bytes = 1 <<
> > > order)
> > > + The block size must be at least 512 bytes and must
> > > not
> > > + exceed the cluster size.
> > > +
> > > + 16 - 19: Journal block index of the descriptor for the last
> > > + transaction that has been synced, starting with 1
> > > for the
> > > + journal block after the header. 0 is used for empty
> > > + journals.
> > > +
> > > + 20 - 23: Sequence number of the last transaction that has been
> > > + synced. 0 is recommended as the initial value.
> > > +
> > > + 24 - 27: Sequence number of the last transaction that has been
> > > + committed. When replaying a journal, all transactions
> > > + after the last synced one up to the last commit one
> > > must be
> > > + synced. Note that this may include a wraparound of
> > > sequence
> > > + numbers.
> > > +
> > > + 28 - 31: Checksum (one's complement of the sum of all bytes
> > > in the
> > > + header journal block except those of the checksum
> > > field)
> > > +
> > > + 32 - 511: Reserved (set to 0)
> >
> > I'm not sure if these fields are necessary. They require updates (and
> > maybe flush) after every commit and sync.
> >
> > The fewer metadata updates, the better, not just for performance but
> > also to reduce the risk of data loss. If any metadata required to
> > access the journal is corrupted, the image will be unavailable.
> >
> > It should be possible to determine this information by scanning the
> > journal transactions.
>
> This is rather handwavy. Can you elaborate how this would work in detail?
>
>
> For example, let's assume we get to read this journal (a journal can be
> rather large, too, so I'm not sure if we want to read it in completely):
>
> - Descriptor, seq 42, 2 data blocks
> - Data block
> - Data block
> - Data block starting with "qjbk"
> - Data block
> - Descriptor, seq 7, 0 data blocks
> - Descriptor, seq 8, 1 data block
> - Data block
>
> Which of these have already been synced? Which have been committed?
>
>
> I guess we could introduce an is_commited flag in the descriptor, but
> wouldn't correct operation look like this then:
>
> 1. Write out descriptor commit flag clear and any data blocks
> 2. Flush
> 3. Rewrite descriptor with commit flag set
>
> This ensures that the commit flag is only set if all the required data
> is indeed stable on disk. What has changed compared to this proposal is
> just the offset at which you write in step 3 (header vs. descriptor).
>
> For sync I suppose the same option exists.
>
>
> One unrelated thing we need to take care of is this 'data block starting
> with "qjbk"' thing I mentioned in the example. I can see two solutions:
> The first is never creating such a journal, by always blanking the next
> block after a transaction that we write, so that the descriptor chain is
> terminated. The second one is never creating such a journal, by
> zeroing an initial "qjbk" in data blocks and setting a flag in the
> descriptor instead.
>
> I was thinking of the first, but I guess it would be worth at least
> mentioning the problem in the spec.
>
> > > +A wraparound may not occur in the middle of a single transaction, but
> > > only
> > > +between two transactions. For the necessary padding an empty descriptor
> > > with
> > > +any number of data blocks can be used as the last entry of the ring.
> >
> > Why have this limitation?
>
> I thought it would make implementations easier if they didn't have to
> read/write in data blocks in two parts in some cases.
>
> > > +All descriptors start with a common part:
> > > +
> > > + Byte 0 - 1: Descriptor type
> > > + 0 - No-op descriptor
> > > + 1 - Write data block
> > > + 2 - Copy data
> > > + 3 - Revoke
> > > + 4 - Deduplication hash insertion
> > > + 5 - Deduplication hash deletion
> > > +
> > > + 2 - 3: Size of the descriptor in bytes
> >
> > Data blocks are not included in the descriptor size? I just want to
> > make sure that we don't be limited to 64 KB for the actual data.
>
> Right, this only for the descriptors, without data. It implies a maximum
> journal block size of 64k, which I think is okay.
>
> > > +
> > > + 4 - n: Type-specific data
> > > +
> > > +The following section specifies the purpose (i.e. the action that is to
> > > be
> > > +performed when syncing) and type-specific data layout of each descriptor
> > > type:
> > > +
> > > + * No-op descriptor: No action is to be performed when syncing this
> > > descriptor
> > > +
> > > + 4 - n: Ignored
> > > +
> > > + * Write data block: Write literal data associated with this
> > > transaction from
> > > + the journal to a given offset.
> > > +
> > > + 4 - 7: Length of the data to write in bytes
> > > +
> > > + 8 - 15: Offset in the image file to write the data to
> > > +
> > > + 16 - 19: Index of the journal block at which the data to write
> > > + starts. The data must be stored sequentially and be
> > > fully
> > > + contained in the data blocks associated with the
> > > + transaction.
> > > +
> > > + The type-specific data can be repeated, specifying multiple chunks
> > > of data
> > > + to be written in one operation. This means the size of the
> > > descriptor must
> > > + be 4 + 16 * n.
> >
> > Why is the necessary? Multiple data descriptors could be used, is it
> > worth the additional logic and testing?
>
> What does a typical journal transaction look like?
>
> My assumption is that initially it has lots of L2 table and refcount
> block updates and little else. All of these are data writes. Once we
> implement Delayed COW using the journal, we get some data copy
> operations as well. Assuming a stupid guest, we get two copies and two
> writes for each cluster allocation.
>
> The space required for these is 2*(4 + 16n) + 2*(4 + 20n) = 16 + 72n. In
> one 4k descriptor block you can fit 56 cluster allocations.
>
> If you used separate descriptors, you get 2*20n + 2*24n = 88n. In one 4k
> descriptor block you could fit 46 cluster allocations.
>
> Considering that in both cases the space used by descriptors is dwarved
> by the updated tables to be written, I guess it's not worth it for the
> main journal. Things may look different for the dedpulication journal,
> where no data blocks are used and the space taken is determined only by
> the descriptors.
>
> And in fact... All of the above sounded great, but is in fact wrong.
> Typically you'd get _one_ L2 update for all (sequential) allocations,
> the COW entries without data should dominate in the main journal as well.
>
> > > +
> > > + * Copy data: Copy data from one offset in the image to another one.
> > > This can
> > > + be used for journalling copy-on-write operations.
> >
> > This reminds me to ask what the plan is for journal scope: metadata only
> > or also data? For some operations like dedupe it seems that full data
> > journalling may be necessary. But for an image without dedupe it would
> > not be necessary to journal the rewrites to an already allocated
> > cluster, for example.
>
> Generally only metadata.
>
> The COW case is a bit tricky. Here we really store only metadata in the
> journal as well ("this COW is still pending"), but it directly affects
> data and the qemu read/write paths for data must take such pending COWs
> into consideration. This will require some ordering between the journal
> and data (e.g. only free the COWed cluster after the journal is
> committed), but I think that's okay.
>
> > > + 4 - 7: Length of the data to write in bytes
> > > +
> > > + 8 - 15: Target offset in the image file
> > > +
> > > + 16 - 23: Source offset in the image file
> >
> > Source and target cannot overlap?
>
> I don't think there's a use case for it, so let's forbid it.
>
> > > +
> > > + The type-specific data can be repeated, specifying multiple chunks
> > > of data
> > > + to be copied in one operation. This means the size of the descriptor
> > > must
> > > + be 4 + 20 * n.
> > > +
> > > + * Revoke: Marks operations on a given range in the imag file invalid
> > > for all
> >
> > s/imag/image/
> >
> > > + earlier transactions (this does not include the transaction
> > > containing the
> > > + revoke). They must not be executed on a sync operation (e.g. because
> > > the
> > > + range in question has been freed and may have been reused for other,
> > > not
> > > + journalled data structures that must not be overwritten with stale
> > > data).
> > > + Note that this may mean that operations are to be executed partially.
> >
> > Example scenario?
>
> Once I had one... Well, let's try this:
>
> 1. Cluster allocation that allocates a new L2 table, i.e. L1 update gets
> journalled
>
> 2. Another cluster allocation, this time the L1 has to grow. The write
> to the new L1 position is journalled, the old table is freed.
>
> 3. Next allocation reuses the clusters where the old L1 table was stored
>
> 4. Journal sync. The L1 update from step 1 is written out to the
> clusters that are now used for data. Oops.
>
> Kevin
- Re: [Qemu-devel] [RFC] qcow2 journalling draft, (continued)
- Re: [Qemu-devel] [RFC] qcow2 journalling draft, Stefan Hajnoczi, 2013/09/05
- Re: [Qemu-devel] [RFC] qcow2 journalling draft, Kevin Wolf, 2013/09/05
- Re: [Qemu-devel] [RFC] qcow2 journalling draft, Stefan Hajnoczi, 2013/09/05
- Re: [Qemu-devel] [RFC] qcow2 journalling draft, Kevin Wolf, 2013/09/05
- Re: [Qemu-devel] [RFC] qcow2 journalling draft, Eric Blake, 2013/09/05
- Re: [Qemu-devel] [RFC] qcow2 journalling draft,
Fam Zheng <=
- Re: [Qemu-devel] [RFC] qcow2 journalling draft, Kevin Wolf, 2013/09/06
- Re: [Qemu-devel] [RFC] qcow2 journalling draft, Fam Zheng, 2013/09/06
Re: [Qemu-devel] [RFC] qcow2 journalling draft, Max Reitz, 2013/09/04
Re: [Qemu-devel] [RFC] qcow2 journalling draft, Stefan Hajnoczi, 2013/09/05
Re: [Qemu-devel] [RFC] qcow2 journalling draft, Fam Zheng, 2013/09/06