
Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)


From: Kevin Wolf
Subject: Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1)
Date: Wed, 21 Nov 2012 13:37:00 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120605 Thunderbird/13.0

On 21.11.2012 12:10, Dietmar Maurer wrote:
>>> +Note: It turned out that taking a qcow2 snapshot can take a very long
>>> +time on larger files.
>>
>> Hm, really? What are "larger files"? It has always been relatively quick
>> when I tested it, though internal snapshots are not my focus, so that need
>> not mean much.
> 
> 300GB or larger
>  
>> If this is really an important use case for someone, I think qcow2 internal
>> snapshots still have some potential for relatively easy performance
>> optimisations.
> 
> I guess the problem is the small cluster size, so the reference table gets
> quite large (for example fvd uses 2GB to minimize table size).

qemu-img check gives an idea of what it costs to read in the whole
metadata of an image. Updating some of it should add no more than a
factor of two on top of that. I'm seeing much bigger differences, so I
suspect there's something wrong.

Somebody should probably try tracing where the performance is lost.

>> But that just as an aside...
>>
>>> +
>>> +=Make it more efficient=
>>> +
>>> +To be more efficient, we simply need to avoid unnecessary steps. The
>>> +following steps are always required:
>>> +
>>> +1.) read old data before it gets overwritten
>>> +2.) write that data into the backup archive
>>> +3.) write new data (VM write)
>>> +
>>> +As you can see, this involves only one read and two writes.
>>
>> Looks like a nice approach to backup indeed.
>>
>> The question is how to fit this into the big picture of qemu's live block
>> operations. Much of it looks like an active mirror (which is still to be
>> implemented), with the difference that it doesn't write the new, but the old
>> data, and that it keeps a bitmap of clusters that should not be mirrored.
>>
>> I'm not sure if this means that code should be shared between these two or
>> if the differences are too big. However, both of them have things in common
>> regarding the design. For example, both have a background part (copying the
>> existing data) and an active part (mirroring/backing up data on writes).
>> Block jobs are the right tool for the background part.
> 
> I already use block jobs. Or do you want to share more?

I was thinking about sharing code between a future active mirror and the
backup job. Which may or may not make sense. I'm mostly hoping for input
from Paolo here.

>> The active part is a bit more tricky. You're putting some code into
>> block.c to achieve it, which is kind of ugly.
> 
> yes. but I tried to keep that small ;-)

Yup, it's already not too bad. I haven't looked into it in much detail,
but I'd like to reduce it even a bit more. In particular, the
backup_info field in the BlockDriverState feels wrong to me. In the long
term the generic block layer shouldn't know at all what a backup is, and
baking it into BDS couples it very tightly.

>> We have been talking about "block filters"
>> previously that would provide a generic infrastructure, and at least in
>> the mid term the additions to block.c must disappear. (Same for block.h
>> and block_int.h - keep things as separated from the core as possible.)
>> Maybe we should introduce this infrastructure now.
> 
> I have no idea what you are talking about. Can you point me to the
> relevant discussion?

Not sure if a single discussion explains it, and I can't even find one
at the moment.

In short, the idea is that you can stick filters on top of a
BlockDriverState, so that any read/writes (and possibly more requests,
if necessary) are routed through the filter before they are passed to
the block driver of this BDS. Filters would be implemented as
BlockDrivers, i.e. you could implement .bdrv_co_write() in a filter to
intercept all writes to an image.

>> Another interesting point is how (or whether) to link block jobs with block
>> filters. I think when the job is started, the filter should be inserted
>> automatically, and when you cancel it, it should be stopped.
>> When you pause the job... no idea. :-)

>> Essentially, what you need is an image format. You want to be independent
>> from the source image formats, but you're okay with using a specific format
>> for the backup (or you wouldn't have defined a new format for it).
>>
>> The one special thing that you need is storing multiple images in one file.
>> There's something like this already in qemu: qcow2 with its internal
>> snapshots is basically a flat file system.
>>
>> Not saying that this is necessarily the best option, but I think reusing
>> existing formats and implementations is always a good thing, so it's an
>> idea to consider.
> 
> AFAIK a qcow2 file cannot store data out of order. In general, a backup
> fd is not seekable, and we only want to do sequential writes. Do image
> formats always require seekable fds?

Ah, this is what you mean by "out of order". Just out of curiosity, what
are these non-seekable backup fds usually?

In principle, qcow2 could be used as an image format even for this;
however, the existing implementation wouldn't be of much use to you, so
it loses quite a bit of its attractiveness.

> Anyway, a qcow2 file is a really complex beast - I am quite unsure
> whether I would use it for backup even if it were possible.
> 
> That would require any external tool to include >=50000 LOC
> 
> The vma reader code is about 700 LOC (quite easy).

So what? qemu-img is already there.

Kevin


