Re: [PATCH 0/7] qcow2: compressed write cache
From: Denis V. Lunev
Subject: Re: [PATCH 0/7] qcow2: compressed write cache
Date: Tue, 9 Feb 2021 21:41:00 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.1
On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:
> 09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 17:47, Max Reitz wrote:
>>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>>> 09.02.2021 16:25, Max Reitz wrote:
>>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> Hi all!
>>>>>>
>>>>>> I know, I have several series waiting for a resend, but I had to
>>>>>> switch
>>>>>> to another task spawned from our customer's bug.
>>>>>>
>>>>>> Original problem: we use O_DIRECT for all VM images in our
>>>>>> product; it's the policy. The only exception is the backup target
>>>>>> qcow2 image for compressed backup, because compressed backup is
>>>>>> extremely slow with O_DIRECT (due to unaligned writes). The
>>>>>> customer complains that backup produces a lot of pagecache.
>>>>>>
>>>>>> So we can either implement some internal cache or use fadvise
>>>>>> somehow. Backup has several async workers, which write
>>>>>> simultaneously, so either way we have to track host cluster
>>>>>> filling (before dropping the cache corresponding to the cluster).
>>>>>> So, if we have to track anyway, let's try to implement the cache.
>>>>>
>>>>> I wanted to be excited here, because that sounds like it would be
>>>>> very easy to implement caching. Like, just keep the cluster at
>>>>> free_byte_offset cached until the cluster it points to changes,
>>>>> then flush the cluster.
>>>>
>>>> The problem is that chunks are written asynchronously.. That's why
>>>> all this is not so easy.
>>>>
>>>>>
>>>>> But then I see like 900 new lines of code, and I’m much less
>>>>> excited...
>>>>>
>>>>>> Idea is simple: cache small unaligned writes and flush the cluster
>>>>>> when it is filled.
>>>>>>
>>>>>> Performance results are very good (the table shows the time, in
>>>>>> seconds, of a compressed backup of a 1000M disk filled with ones):
>>>>>
>>>>> “Filled with ones” really is an edge case, though.
>>>>
>>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>>
>>>>>
>>>>>> ---------------  -----------  -----------
>>>>>>                  backup(old)  backup(new)
>>>>>> ssd:hdd(direct)  3e+02        4.4 (-99%)
>>>>>> ssd:hdd(cached)  5.7          5.4 (-5%)
>>>>>> ---------------  -----------  -----------
>>>>>>
>>>>>> So we have a benefit even in cached mode! And the fastest thing is
>>>>>> O_DIRECT with the newly implemented cache. So I suggest enabling
>>>>>> the new cache by default (which is what the series does).
>>>>>
>>>>> First, I’m not sure how O_DIRECT really is relevant, because I
>>>>> don’t really see the point for writing compressed images.
>>>>
>>>> compressed backup is a point
>>>
>>> (Perhaps irrelevant, but just to be clear:) I meant the point of
>>> using O_DIRECT, which one can decide to not use for backup targets
>>> (as you have done already).
>>>
>>>>> Second, I find it a bit cheating if you say there is a huge
>>>>> improvement for the no-cache case, when actually, well, you just
>>>>> added a cache. So the no-cache case just became faster because
>>>>> there is a cache now.
>>>>
>>>> Still, the performance comparison is relevant to show that O_DIRECT
>>>> as-is is unusable for compressed backup.
>>>
>>> (Again, perhaps irrelevant, but:) Yes, but my first point was
>>> exactly whether O_DIRECT is even relevant for writing compressed
>>> images.
>>>
>>>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>>>> sense for compressed images, qemu’s format drivers are free to
>>>>> introduce some caching (because technically the cache.direct
>>>>> option only applies to the protocol driver) for collecting
>>>>> compressed writes.
>>>>
>>>> Yes, I thought along the same lines, hence enabling the cache by
>>>> default.
>>>>
>>>>> That conclusion makes both of my complaints kind of moot.
>>>>>
>>>>> *shrug*
>>>>>
>>>>> Third, what is the real-world impact on the page cache? You
>>>>> described that that’s the reason why you need the cache in qemu,
>>>>> because otherwise the page cache is polluted too much. How much
>>>>> is the difference really? (I don’t know how good the compression
>>>>> ratio is for real-world images.)
>>>>
>>>> Hm. I don't know the ratio.. The customer reported that most of the
>>>> RAM was polluted by QEMU's cache, and we use O_DIRECT for everything
>>>> except the target of compressed backup.. Still, the pollution may
>>>> relate to several backups, and of course it is simple enough to drop
>>>> the cache after each backup. But I think that even one backup of a
>>>> 16T disk may pollute RAM enough.
>>>
>>> Oh, sorry, I just realized I had a brain fart there. I was
>>> referring to whether this series improves the page cache pollution.
>>> But obviously it will if it allows you to re-enable O_DIRECT.
>>>
>>>>> Related to that, I remember a long time ago we had some discussion
>>>>> about letting qemu-img convert set a special cache mode for the
>>>>> target image that would make Linux drop everything before the last
>>>>> offset written (i.e., I suppose fadvise() with
>>>>> POSIX_FADV_SEQUENTIAL). You discarded that idea on the grounds
>>>>> that implementing a cache in qemu would be simple, but it isn’t,
>>>>> really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One
>>>>> advantage of using that would be that we could reuse it for
>>>>> non-compressed images that are written by backup or qemu-img
>>>>> convert.)
>>>>
>>>> The problem is that writes are async. And therefore, not sequential.
>>>
>>> In theory, yes, but all compressed writes still go through
>>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>>> whether in practice the writes aren’t usually sufficiently
>>> sequential for POSIX_FADV_SEQUENTIAL to work fine.
>>
>> Yes, allocation is sequential, but writes are not.. Reasonable, I
>> should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for
>> the whole backup target before the backup? Will try. Still, I expect
>> that my cache will show better performance anyway. Actually, comparing
>> cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e. 20% faster,
>> which is significant (still, yes, it would be good to check it on a
>> more realistic case than all-ones).
>>
>>>
>>>> So I have to track the writes and wait until the whole cluster is
>>>> filled. It's simple to use fadvise as an option in my cache: instead
>>>> of caching the data and writing it out when the cluster is filled, we
>>>> can instead mark the cluster POSIX_FADV_DONTNEED.
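
For illustration, a minimal sketch of that fadvise variant, with
illustrative names only (ClusterTracker and cluster_account_write are
made up, not code from the series; it also assumes compressed writes
never cross a host cluster boundary):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>

typedef struct ClusterTracker {
    off_t cluster_size;
    off_t *filled;          /* bytes written so far, per host cluster */
} ClusterTracker;

/* Called after each async compressed write completes. */
static void cluster_account_write(ClusterTracker *t, int fd,
                                  off_t offset, off_t bytes)
{
    off_t idx = offset / t->cluster_size;

    t->filled[idx] += bytes;
    if (t->filled[idx] == t->cluster_size) {
        /* The cluster is fully written and the backup never reads it
         * back: advise the kernel to drop its pagecache. */
        posix_fadvise(fd, idx * t->cluster_size, t->cluster_size,
                      POSIX_FADV_DONTNEED);
    }
}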
>>>>
>>>>>
>>>>> (I don’t remember why that qemu-img discussion died back then.)
>>>>>
>>>>>
>>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>>> write cache? I.e., on read, everything is flushed, so we don’t
>>>>> have to deal with that. I don’t think there are many valid cases
>>>>> where a compressed image is both written to and read from at the
>>>>> same time. (Just asking, because I’d really want this code to be
>>>>> simpler. I can imagine that reading from the cache is the least
>>>>> bit of complexity, but perhaps...)
>>>>>
>>>>
>>>> Hm. I really didn't want to support reads, and did it only to make
>>>> it possible to enable the cache by default.. Still, the read function
>>>> is really simple, and I don't think that dropping it would simplify
>>>> the code significantly.
>>>
>>> That’s too bad.
>>>
>>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>>> actually would do in practice.
>>
>> will check.
>>
>
> I checked: if I mark the whole file with FADV_SEQUENTIAL, the
> pagecache is not dropped.
>
> Test:
> [root@kvm fadvise]# cat a.c
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <getopt.h>
> #include <string.h>
> #include <stdbool.h>
>
> int main(int argc, char *argv[])
> {
>     int fd;
>     int i;
>     char mb[1024 * 1024];
>     int open_flags = O_RDWR | O_CREAT | O_EXCL;
>     int fadv_flags = 0;
>     int fadv_final_flags = 0;
>     int len = 1024 * 1024;
>     bool do_fsync = false;
>
>     for (i = 1; i < argc - 1; i++) {
>         const char *arg = argv[i];
>
>         if (!strcmp(arg, "direct")) {
>             open_flags |= O_DIRECT;
>         } else if (!strcmp(arg, "seq")) {
>             fadv_flags = POSIX_FADV_SEQUENTIAL;
>         } else if (!strcmp(arg, "dontneed")) {
>             fadv_flags = POSIX_FADV_DONTNEED;
>         } else if (!strcmp(arg, "final-dontneed")) {
>             fadv_final_flags = POSIX_FADV_DONTNEED;
>         } else if (!strcmp(arg, "fsync")) {
>             do_fsync = true;
>         } else {
>             fprintf(stderr, "unknown: %s\n", arg);
>             return 1;
>         }
>     }
>
>     fd = open(argv[argc - 1], open_flags, 0644);
>     if (fd < 0) {
>         fprintf(stderr, "failed to open\n");
>         return 1;
>     }
>
>     if (fadv_flags) {
>         posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
>     }
>     for (i = 0; i < 100; i++) {
>         write(fd, mb, len);
>     }
>     if (fadv_final_flags) {
>         posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
>     }
>     if (do_fsync) {
>         fsync(fd);
>     }
>
>     close(fd);
> }
>
>
>
> [root@kvm fadvise]# gcc a.c
> [root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
> RES PAGES SIZE FILE
> 100M 25600 100M x
> [root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
> RES PAGES SIZE FILE
> 100M 25600 100M x
> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
> RES PAGES SIZE FILE
> 36M 9216 100M x
> [root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
> RES PAGES SIZE FILE
> 100M 25600 100M x
> [root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
> RES PAGES SIZE FILE
> 100M 25600 100M x
> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
> RES PAGES SIZE FILE
> 36M 9216 100M x
> [root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
> RES PAGES SIZE FILE
> 0B 0 0B x
>
>
>
> Backup-generated pagecache is essentially trash; it will never be
> used. And it's bad that it can displace other, useful pagecache. (The
> up-front dontneed run is presumably a no-op because nothing is cached
> yet, and final-dontneed can only drop pages that have already been
> written back, which would explain the remaining 36M.) So the best
> thing for backup is direct mode + the proposed cache.
>
>
I think that Max's original intention was about POSIX_FADV_DONTNEED:
one should call this fadvise for each cluster as soon as it is fully
written. Though this is a bit tricky and, from a performance point of
view, will be slower. The call itself could be slow, and thus it should
be delegated to a coroutine. We have thought about this, but the amount
of work to implement it, even if it seems lower, is not really lower,
and the result would not be as great.
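
For illustration, a minimal sketch of such a per-cluster drop (the
helper name is made up, not code from the series; sync_file_range is
used because POSIX_FADV_DONTNEED silently skips pages that are still
dirty, which also matches the final-dontneed numbers above):

#define _GNU_SOURCE
#include <fcntl.h>

/* Drop the pagecache of one fully written host cluster.  Both calls
 * may block, so in QEMU this would have to run outside the main loop,
 * e.g. delegated to a coroutine as described above. */
static void drop_cluster_pagecache(int fd, off_t offset, off_t len)
{
    /* Force writeback of the range first: DONTNEED frees only clean
     * pages. */
    sync_file_range(fd, offset, len,
                    SYNC_FILE_RANGE_WAIT_BEFORE |
                    SYNC_FILE_RANGE_WRITE |
                    SYNC_FILE_RANGE_WAIT_AFTER);
    posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}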
Den