From: Max Reitz
Subject: Re: [PATCH v4 00/23] backup performance: block_status + async
Date: Wed, 20 Jan 2021 14:50:29 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.5.0
On 20.01.21 11:39, Max Reitz wrote:
> On 19.01.21 20:22, Vladimir Sementsov-Ogievskiy wrote:
>> 19.01.2021 21:40, Max Reitz wrote:
>>> On 16.01.21 22:46, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi Max!
>>>>
>>>> I applied my series on top of your 129-fixing series and found that 129 fails for backup. And setting a small max-chunk and even max-workers to 1 doesn't help! (Setting speed like in v3 still helps.)
>>>>
>>>> And I found that the problem is really that the whole backup job runs during drain, because in the new architecture we just do job_yield() during the whole background block-copy.
>>>>
>>>> This leads to modifying the existing patch in the series, which does job_enter() from job_user_pause: we just need to call job_enter() from job_pause() to cover not only user pauses but also drained_begin. So now I don't need any additional fixing of 129.
>>>>
>>>> Changes in v4:
>>>> - add a lot of Max's r-b's, thanks!
>>>> 03: fix over-80 line (in comment), add r-b
>>>> 09: was "[PATCH v3 10/25] job: call job_enter from job_user_pause", now changed to finally fix the 129 iotest, drop r-b
>>>> 10: squash in additional wording on max-chunk, fix error message, keep r-b
>>>> 17: drop extra include, assert job_is_cancelled() instead of checking, add r-b
>>>> 18: adjust commit message, add r-b
>>>> 23: add comments and an assertion about the fact that the test doesn't support paths with a colon inside
>>>>     fix s/disable-copy-range/use-copy-range/
>>>
>>> Hmmm, for me, 129 still sometimes fails, because it completes too quickly... (The error then is that 'return[0]' does not exist in query-block-jobs’s result, because the job is already gone.)
>>>
>>> When I insert a print(result) after the query-block-jobs, I can see that the job has always progressed really far, even if it’s still running. (Like, generally the offset is just one MB shy of 1G.)
>>>
>>> I suspect the problem is that block-copy just copies too much from the start (by default); i.e., it starts 64 workers with, hm, well, 1 MB of chunk size? Shouldn’t fill the 128 MB immediately...
>>>
>>> Anyway, limiting the number of workers (to 1) and the chunk size (to 64k) with x-perf does ensure that the backup job’s progress is limited to 1 MB or so, which looks fine to me.
>>>
>>> I suppose we should do that, then (in 129), before patch 17?
>>
>> Yes, that sounds reasonable.
>>
>>> (PS: I can also see a macOS failure in iotest 256. I suspect it’s related to this series, because 256 is a backup test (with iothreads), but I’m not sure yet. The log is here: https://cirrus-ci.com/task/5276331753603072 )
>>
>> qemu received signal 31? Googling for macOS... 31 is SIGUSR2, "terminate process, user defined signal 2".
>
> coroutine-sigaltstack uses SIGUSR2 to set up new coroutines. Perhaps it’s unrelated to backup? Guess I’ll just run the test one more time. O:)
I ran it again and got the same error. There is no error on master, nor before the commit where backup starts to use block_copy.
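(Side note on the quoted signal number, just to confirm the reasoning above: "signal 31" really is SIGUSR2 on macOS/BSD, while Linux numbers it 12. A quick check on the respective host, e.g. in Python:)

```python
import signal

# On macOS/BSD SIGUSR2 is signal number 31; on Linux it is 12.
print(int(signal.SIGUSR2))
```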
I’m trying to run a test directly on the “move to block-copy” commit, but so far Cirrus doesn’t seem to want me to do another test run right now.
(Though I’m pretty sure that if there is no error before the block-copy commit, then using block-copy must be the problem. The remaining patches in my block branch are just disabling copy_range, some clean-up, the simplebench patches, the locking-code error reporting change, and a new iotest.)
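As for the 129 tweak agreed on in the quoted part above (limiting the job via x-perf before patch 17), a rough sketch of what the job start in the test might gain. This is illustration only: the node and job names are made up, `vm` stands for the test’s VM handle, and the exact command (drive-backup vs. blockdev-backup) and x-perf member spelling are whatever 129 and the series actually use; only 'max-workers' and 'max-chunk' are the options discussed in this thread.

```python
# Sketch only: clamp the backup job to a single worker and 64k chunks so
# 129 can still observe the job before it finishes.
result = vm.qmp('blockdev-backup', **{
    'job-id': 'backup0',
    'device': 'drive0',
    'target': 'backup_target',
    'sync': 'full',
    'x-perf': {'max-workers': 1, 'max-chunk': 64 * 1024},
})
```

With the copying capped like that, the offset reported by query-block-jobs should stay around 1 MB, as described in the quoted part.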
Max