qemu-devel

Re: [Qemu-devel] dirty page count problem


From: Alex Bennée
Subject: Re: [Qemu-devel] dirty page count problem
Date: Mon, 24 Jul 2017 15:17:20 +0100
User-agent: mu4e 0.9.19; emacs 25.2.50.3

Dr. David Alan Gilbert <address@hidden> writes:

> * Dr. David Alan Gilbert (address@hidden) wrote:
>> Hi,
>>   Git bisect is pointing to your patch 084140bd49:
>>   exec: fix access to ram_list.dirty_memory when sync dirty bitmap
>>
>> trying to diagnose a bug I'm seeing; it looks like the dirty page count
>> is wrong for some reason.
>>
>> Alex Bennée spotted a problem where the postcopy test would occasionally
>> fail under very heavy load; after attaching a debugger, it looks like
>> the problem is that the migration_dirty_pages count is stuck at 2.
>> In the normal migration tests we don't spot this, because 2 pages is
>> smaller than the threshold to end migration and so an extra 2 pages
>> doesn't block it from finishing. However, with a very
>> small downtime setting (like we use in the postcopy test) and with
>> very low bandwidth (as when Alex ran the test on a very heavily loaded
>> machine) we end up never calling the bitmap sync again and never
>> completing the iteration.
>>
>> I'm using the following addition to spot the problem:
>
> I think I have the fix for this, or at least the reason; the alignment
> check also needs fixing up, but the 08414 patch missed it.
>
>   Alex: Please test the fix included at the bottom
>
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index e75f1050e4..3ddf884952 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1350,6 +1350,13 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>          }
>>      } while (!pages && again);
>>
>> +    if (!pages && !again && pss.complete_round && rs->migration_dirty_pages)
>> +    {
>> +        /* Should make this fail migration ? */
>> +        fprintf(stderr, "%s: no page found, yet dirty_pages=%"PRIu64"\n",
>> +                __func__, rs->migration_dirty_pages);
>> +    }
>> +
>>      rs->last_seen_block = pss.block;
>>      rs->last_page = pss.page;
>>
>> (which I might add as a test to fail a migration)
>>
>> That test fails easily even on an unloaded machine:
>> tests/postcopy-test
>> /x86_64/postcopy: ram_find_and_save_block: no page found, yet dirty_pages=2
>> ram_find_and_save_block: no page found, yet dirty_pages=2
>> ram_find_and_save_block: no page found, yet dirty_pages=2
>> OK
>>
>>
>> I'll try and debug where our extra two pages are coming from.
>
> From b2c86818eaaaf790246f21ae21fead903c1d64f0 Mon Sep 17 00:00:00 2001
> From: "Dr. David Alan Gilbert" <address@hidden>
> Date: Fri, 21 Jul 2017 19:59:01 +0100
> Subject: [PATCH] cpu_physical_memory_sync_dirty_bitmap: Fix alignment check
>
> This code has an optimised, word aligned version, and a boring
> unaligned version.  Recently 084140bd498909 fixed a missing offset
> addition from the core of both versions.  However, the offset isn't
> necessarily aligned and thus the choice between the two versions
> needs fixing up to also include the offset.
>
> DANGER: Friday evening, post-dinner lightly tested patch.
>
> Symptom:
>   A few stuck unsent pages during migration; not normally noticed
> unless under very low bandwidth.
>
> Fixes: 084140bd498909
> ---
>  include/exec/ram_addr.h | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index c04f4f6..c802f7f 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -378,15 +378,16 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>  {
>      ram_addr_t addr;
>      unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
> +    unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
>      uint64_t num_dirty = 0;
>      unsigned long *dest = rb->bmap;
>
>      /* start address is aligned at the start of a word? */
> -    if (((page * BITS_PER_LONG) << TARGET_PAGE_BITS) == start) {
> +    if (((word * BITS_PER_LONG) << TARGET_PAGE_BITS) ==
> +         (start + rb->offset)) {
>          int k;
>          int nr = BITS_TO_LONGS(length >> TARGET_PAGE_BITS);
>          unsigned long * const *src;
> -        unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
>          unsigned long idx = (word * BITS_PER_LONG) / DIRTY_MEMORY_BLOCK_SIZE;
>          unsigned long offset = BIT_WORD((word * BITS_PER_LONG) %
>                                          DIRTY_MEMORY_BLOCK_SIZE);
> --
> 1.8.3.1
>
>> Dave
>> --
>> Dr. David Alan Gilbert / address@hidden / Manchester, UK

I saw no hangs. There were 4 failures, but those might be down to another
test crashing. I'm investigating that, so a cautious:

Tested-by: Alex Bennée <address@hidden>
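
For anyone following along, the alignment issue the patch describes can be
reproduced in isolation. The following is only a minimal sketch, not the
QEMU code: PAGE_BITS, BITS_PER_WORD, bit_word() and both function names are
hypothetical stand-ins for TARGET_PAGE_BITS, BITS_PER_LONG, BIT_WORD() and
the two versions of the check, and the constants are assumed values. It
shows that a block-relative start can look word aligned while
start + offset is not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define PAGE_BITS      12   /* 4 KiB pages */
#define BITS_PER_WORD  64   /* one bitmap word tracks 64 pages */

/* Index of the bitmap word that holds the bit for page number 'page'. */
static uint64_t bit_word(uint64_t page)
{
    return page / BITS_PER_WORD;
}

/* Shape of the check before the fix: only the block-relative start counts. */
static bool aligned_ignoring_offset(uint64_t start, uint64_t block_offset)
{
    (void)block_offset;   /* the old check never looked at the offset */
    uint64_t page = bit_word(start >> PAGE_BITS);
    return ((page * BITS_PER_WORD) << PAGE_BITS) == start;
}

/* Shape of the check after the fix: start + block_offset is what counts. */
static bool aligned_with_offset(uint64_t start, uint64_t block_offset)
{
    uint64_t addr = start + block_offset;
    assert(addr >= start);   /* guard against wrap-around in this sketch */
    uint64_t word = bit_word(addr >> PAGE_BITS);
    return ((word * BITS_PER_WORD) << PAGE_BITS) == addr;
}
```

With start = 0 and a block offset of two pages, the first function reports
"aligned" while the second does not. That is the case in which the old code
would pick the word-aligned path even though the offset address falls in the
middle of a bitmap word.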

--
Alex Bennée


