qemu-block

Re: [PATCH 0/7] qcow2: compressed write cache


From: Denis V. Lunev
Subject: Re: [PATCH 0/7] qcow2: compressed write cache
Date: Tue, 9 Feb 2021 19:52:59 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.1

On 2/9/21 5:47 PM, Max Reitz wrote:
> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 16:25, Max Reitz wrote:
>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> I know, I have several series waiting for a resend, but I had to switch
>>>> to another task spawned from our customer's bug.
>>>>
>>>> Original problem: we use O_DIRECT for all VM images in our product; it's
>>>> the policy. The only exception is the backup target qcow2 image for
>>>> compressed backup, because compressed backup is extremely slow with
>>>> O_DIRECT (due to unaligned writes). The customer complains that backup
>>>> produces a lot of page cache.
>>>>
>>>> So we can either implement some internal cache or use fadvise somehow.
>>>> Backup has several async workers, which write simultaneously, so either
>>>> way we have to track host cluster filling (before dropping the cache
>>>> corresponding to the cluster). So, if we have to track it anyway, let's
>>>> try to implement the cache.
>>>
>>> I wanted to be excited here, because that sounds like it would be
>>> very easy to implement caching.  Like, just keep the cluster at
>>> free_byte_offset cached until the cluster it points to changes, then
>>> flush the cluster.
>>
>> The problem is that chunks are written asynchronously. That's why
>> all this is not so easy.
>>
>>>
>>> But then I see like 900 new lines of code, and I’m much less excited...
>>>
>>>> The idea is simple: cache small unaligned writes and flush the cluster
>>>> when filled.
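For illustration, a minimal sketch of that cached-cluster idea in C (not
code from the series; the cluster size, type and function names are
assumptions, and the real backup writes come from several async workers):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CLUSTER_SIZE (64 * 1024)    /* assumed qcow2 cluster size */

typedef struct ClusterCache {
    uint64_t cluster_offset;        /* host offset of the cached cluster */
    uint64_t filled;                /* bytes accumulated so far */
    uint8_t data[CLUSTER_SIZE];     /* compressed chunks collected here */
} ClusterCache;

/* Copy one unaligned compressed chunk into the cluster buffer.  Returns
 * true when the cluster is full and can be flushed to the image with a
 * single aligned (O_DIRECT-friendly) write of CLUSTER_SIZE bytes. */
static bool cluster_cache_write(ClusterCache *c, uint64_t host_offset,
                                const void *buf, uint64_t bytes)
{
    memcpy(c->data + (host_offset - c->cluster_offset), buf, bytes);
    c->filled += bytes;
    return c->filled == CLUSTER_SIZE;
}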
>>>>
>>>> The performance result is very good (the table shows the time, in
>>>> seconds, of a compressed backup of a 1000M disk filled with ones):
>>>
>>> “Filled with ones” really is an edge case, though.
>>
>> Yes, I think all clusters are compressed to rather small chunks :)
>>
>>>
>>>> ---------------  -----------  -----------
>>>>                   backup(old)  backup(new)
>>>> ssd:hdd(direct)  3e+02        4.4
>>>>                                  -99%
>>>> ssd:hdd(cached)  5.7          5.4
>>>>                                  -5%
>>>> ---------------  -----------  -----------
>>>>
>>>> So, we have a benefit even in cached mode! And the fastest option is
>>>> O_DIRECT with the newly implemented cache. So, I suggest enabling the
>>>> new cache by default (which is done by the series).
>>>
>>> First, I’m not sure how relevant O_DIRECT really is, because I don’t
>>> really see the point of using it for writing compressed images.
>>
>> compressed backup is a point
>
> (Perhaps irrelevant, but just to be clear:) I meant the point of using
> O_DIRECT, which one can decide to not use for backup targets (as you
> have done already).
>
>>> Second, I find it a bit cheating if you say there is a huge
>>> improvement for the no-cache case, when actually, well, you just
>>> added a cache.  So the no-cache case just became faster because
>>> there is a cache now.
>>
>> Still, the performance comparison is relevant to show that O_DIRECT as-is
>> is unusable for compressed backup.
>
> (Again, perhaps irrelevant, but:) Yes, but my first point was exactly
> whether O_DIRECT is even relevant for writing compressed images.
>
>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>> sense for compressed images, qemu’s format drivers are free to
>>> introduce some caching (because technically the cache.direct option
>>> only applies to the protocol driver) for collecting compressed writes.
>>
>> Yes, I was thinking along those lines, enabling the cache by default.
>>
>>> That conclusion makes both of my complaints kind of moot.
>>>
>>> *shrug*
>>>
>>> Third, what is the real-world impact on the page cache?  You
>>> described that that’s the reason why you need the cache in qemu,
>>> because otherwise the page cache is polluted too much.  How much is
>>> the difference really?  (I don’t know how good the compression ratio
>>> is for real-world images.)
>>
>> Hm. I don't know the ratio. The customer reported that most of RAM is
>> polluted by QEMU's cache, and we use O_DIRECT for everything except
>> the target of compressed backup. Still, the pollution may relate to
>> several backups, and of course it is simple enough to drop the cache
>> after each backup. But I think that even one backup of a 16T disk may
>> pollute RAM enough.
>
> Oh, sorry, I just realized I had a brain fart there.  I was referring
> to whether this series improves the page cache pollution.  But
> obviously it will if it allows you to re-enable O_DIRECT.
>
>>> Related to that, I remember a long time ago we had some discussion
>>> about letting qemu-img convert set a special cache mode for the
>>> target image that would make Linux drop everything before the last
>>> offset written (i.e., I suppose fadvise() with
>>> POSIX_FADV_SEQUENTIAL).  You discarded that idea based on the fact
>>> that implementing a cache in qemu would be simple, but it isn’t,
>>> really.  What would the impact of POSIX_FADV_SEQUENTIAL be?  (One
>>> advantage of using that would be that we could reuse it for
>>> non-compressed images that are written by backup or qemu-img convert.)
>>
>> The problem is that writes are async, and therefore not sequential.
>
> In theory, yes, but all compressed writes still go through
> qcow2_alloc_bytes() right before submitting the write, so I wonder
> whether in practice the writes aren’t usually sufficiently sequential
> to make POSIX_FADV_SEQUENTIAL work fine.
>
>> So I have to track the writes and wait until the whole cluster is
>> filled. It's simple to use fadvise as an option to my cache: instead of
>> caching data and writing when the cluster is filled, we can instead mark
>> the cluster POSIX_FADV_DONTNEED.
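For what it's worth, that fadvise option for a finished cluster would boil
down to something like this (an illustrative sketch, not code from the
series; drop_cluster_cache and its parameters are made-up names):

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Once the last chunk of a host cluster has been written, flush the dirty
 * pages and tell the kernel their cache is no longer needed.
 * POSIX_FADV_DONTNEED drops only clean pages, hence the fdatasync()
 * (or sync_file_range()) first. */
static int drop_cluster_cache(int fd, off_t cluster_offset, off_t cluster_size)
{
    if (fdatasync(fd) < 0) {
        return -1;
    }
    /* posix_fadvise() returns an errno value directly, 0 on success */
    return posix_fadvise(fd, cluster_offset, cluster_size,
                         POSIX_FADV_DONTNEED);
}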
>>
>>>
>>> (I don’t remember why that qemu-img discussion died back then.)
>>>
>>>
>>> Fourth, regarding the code, would it be simpler if it were a pure
>>> write cache?  I.e., on read, everything is flushed, so we don’t have
>>> to deal with that.  I don’t think there are many valid cases where a
>>> compressed image is both written to and read from at the same time. 
>>> (Just asking, because I’d really want this code to be simpler.  I
>>> can imagine that reading from the cache is the least bit of
>>> complexity, but perhaps...)
>>>
>>
>> Hm. I really didn't want to support reads, and did it only to make it
>> possible to enable the cache by default. Still, the read function is
>> really simple, and I don't think that dropping it would simplify the
>> code significantly.
>
> That’s too bad.
>
> So the only question I have left is what POSIX_FADV_SEQUENTIAL
> actually would do in practice.
>
> (But even then, the premise just doesn’t motivate me sufficiently yet...)
>
POSIX_FADV_SEQUENTIAL influences only the amount of read-ahead.

int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
{
.....
    case POSIX_FADV_NORMAL:
        file->f_ra.ra_pages = bdi->ra_pages;
        spin_lock(&file->f_lock);
        file->f_mode &= ~FMODE_RANDOM;
        spin_unlock(&file->f_lock);
        break;
    case POSIX_FADV_SEQUENTIAL:
        file->f_ra.ra_pages = bdi->ra_pages * 2;
        spin_lock(&file->f_lock);
        file->f_mode &= ~FMODE_RANDOM;
        spin_unlock(&file->f_lock);
        break;

Thus the only difference from the usual NORMAL mode would be a
doubled amount of read-ahead pages.
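For reference, issuing that hint from userspace would be a single call per
file, something like:

#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>

/* offset 0 with len 0 applies the advice to the whole file.  As the kernel
 * code above shows, this only enlarges the read-ahead window; it neither
 * drops nor writes back dirty pages, so it would not help with page cache
 * pollution caused by writes. */
static int hint_sequential(int fd)
{
    return posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}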

Den


