[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] dirty page count problem
From: |
Alex Bennée |
Subject: |
Re: [Qemu-devel] dirty page count problem |
Date: |
Mon, 24 Jul 2017 15:17:20 +0100 |
User-agent: |
mu4e 0.9.19; emacs 25.2.50.3 |
Dr. David Alan Gilbert <address@hidden> writes:
> * Dr. David Alan Gilbert (address@hidden) wrote:
>> Hi,
>> Git bisect is pointing to your patch 084140bd49:
>> exec: fix access to ram_list.dirty_memory when sync dirty bitmap
>>
>> trying to diagnose a bug I'm seeing; it looks like the dirty page count
>> is wrong for some reason.
>>
>> Alex Bennée spotted a problem where the postcopy test would occasionally
>> fail under very heavy load; attaching a debugger and it looks like
>> the problem is we have a migration_dirty_page count stuck at 2;
>> in the normal migration tests we don't spot this, because 2 pages is
>> smaller than the threshold to end migration and so an extra 2 pages
>> doesn't block it finishing. However, with a very
>> small downtime setting (like we use in the postcopy test) and with
>> very low bandwidth (as when Alex ran the test on a very heavily loaded
>> machine) we end up never calling the bitmap sync again and never
>> completing the iteration.
>>
>> I'm using the following addition to spot the problem:
>
> I think I have the fix for this, or at least the reason; the alignment
> check also needs fixing up, but the 08414 patch missed it.
>
> Alex: Please test the fix included at the bottom
>
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index e75f1050e4..3ddf884952 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1350,6 +1350,13 @@ static int ram_find_and_save_block(RAMState *rs, bool
>> last_stage)
>> }
>> } while (!pages && again);
>>
>> + if (!pages && !again && pss.complete_round && rs->migration_dirty_pages)
>> + {
>> + /* Should make this fail migration ? */
>> + fprintf(stderr, "%s: no page found, yet dirty_pages=%"PRIu64"\n",
>> + __func__, rs->migration_dirty_pages);
>> + }
>> +
>> rs->last_seen_block = pss.block;
>> rs->last_page = pss.page;
>>
>> (which I might add as a test to fail a migration)
>>
>> That test fails easily even on an unloaded machine:
>> tests/postcopy-test
>> /x86_64/postcopy: ram_find_and_save_block: no page found, yet dirty_pages=2
>> ram_find_and_save_block: no page found, yet dirty_pages=2
>> ram_find_and_save_block: no page found, yet dirty_pages=2
>> OK
>>
>>
>> I'll try and debug where our extra two pages are coming from.
>
> From b2c86818eaaaf790246f21ae21fead903c1d64f0 Mon Sep 17 00:00:00 2001
> From: "Dr. David Alan Gilbert" <address@hidden>
> Date: Fri, 21 Jul 2017 19:59:01 +0100
> Subject: [PATCH] cpu_physical_memory_sync_dirty_bitmap: Fix alignment check
>
> This code has an optimised, word aligned version, and a boring
> unaligned version. Recently 084140bd498909 fixed a missing offset
> addition from the core of both versions. However, the offset isn't
> necessarily aligned and thus the choice between the two versions
> needs fixing up to also include the offset.
>
> DANGER: Friday evening, post-dinner lightly tested patch.
>
> Symptom:
> A few stuck unsent pages during migration; not normally noticed
> unless under very low bandwidth.
>
> Fixes: 084140bd498909
> ---
> include/exec/ram_addr.h | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index c04f4f6..c802f7f 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -378,15 +378,16 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock
> *rb,
> {
> ram_addr_t addr;
> unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
> + unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
> uint64_t num_dirty = 0;
> unsigned long *dest = rb->bmap;
>
> /* start address is aligned at the start of a word? */
> - if (((page * BITS_PER_LONG) << TARGET_PAGE_BITS) == start) {
> + if (((word * BITS_PER_LONG) << TARGET_PAGE_BITS) ==
> + (start + rb->offset)) {
> int k;
> int nr = BITS_TO_LONGS(length >> TARGET_PAGE_BITS);
> unsigned long * const *src;
> - unsigned long word = BIT_WORD((start + rb->offset) >>
> TARGET_PAGE_BITS);
> unsigned long idx = (word * BITS_PER_LONG) / DIRTY_MEMORY_BLOCK_SIZE;
> unsigned long offset = BIT_WORD((word * BITS_PER_LONG) %
> DIRTY_MEMORY_BLOCK_SIZE);
> --
> 1.8.3.1
>
>> Dave
>> --
>> Dr. David Alan Gilbert / address@hidden / Manchester, UK
I saw no hangs, there where 4 times it failed but that might be another
test crashing. I'm investigating that so a cautious:
Tested-by: Alex Bennée <address@hidden>
--
Alex Bennée