Re: [Qemu-devel] postcopy migration hangs while loading virtio state
From: Christian Borntraeger
Subject: Re: [Qemu-devel] postcopy migration hangs while loading virtio state
Date: Wed, 26 Apr 2017 13:45:14 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
On 04/26/2017 01:01 PM, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (address@hidden) wrote:
>> On 04/24/2017 09:12 PM, Dr. David Alan Gilbert wrote:
>>> * Christian Borntraeger (address@hidden) wrote:
>>>> On 04/24/2017 04:35 PM, Dr. David Alan Gilbert wrote:
>>>>> * Christian Borntraeger (address@hidden) wrote:
>>>>>> On 04/24/2017 12:53 PM, Dr. David Alan Gilbert wrote:
>>>>>>> * Christian Borntraeger (address@hidden) wrote:
>>>>>>>> David, Juan,
>>>>>>>>
>>>>>>>> I can trigger a hang of postcopy migration (when I do it early) so
>>>>>>>> that both sides are in paused state. Looks like one thread is still
>>>>>>>> loading vmstates for virtio and this load accesses guest memory and
>>>>>>>> triggers a userfault.
>>>>>>>
>>>>>>> It's perfectly legal for the destination to cause a userfault as it
>>>>>>> loads the virtio queue - the virtio queue should be being loaded by
>>>>>>> the main thread from the buffer while the 'listening' thread is
>>>>>>> waiting for the incoming page data.
>>>>>>>
>>>>>>> Can you turn on the following tracing please: destination:
>>>>>>> postcopy_ram_fault_thread_request postcopy_place_page
>>>>>>> postcopy_place_page_zero
>>>>>>>
>>>>>>> source: migrate_handle_rp_req_pages ram_save_queue_pages
>>>>>>>
>>>>>>> You should see: virtio does the access userfault generates a fault
>>>>>>> postcopy_ram_fault_thread_request sends the request to the source
>>>>>>>
>>>>>>> the source sees migrate_handle_rp_req_pages queues it with
>>>>>>> ram_save_queue_pages
>>>>>>>
>>>>>>> the destination sees the page arrive and postcopy_place_page or
>>>>>>> postcopy_place_page_zero
>>>>>>>
>>>>>>> some of that might disappear if the page was already on its way.
>>>>>>
>>>>>>
>>>>>> the last events on the source are
>>>>>> [..]
>>>>>> address@hidden:postcopy_place_page host=0x3ff92246000
>>>>>> address@hidden:postcopy_place_page host=0x3ff92247000
>>>>>
>>>>> How do you see those on the source???
>>>>
>>>> It was from the previous migration (I run migrations in a loop, back and
>>>> forth). The problem happens on the migration back.
>>>
>>> OK, good.
>>>
>>>>>
>>>>>> address@hidden:migrate_handle_rp_req_pages in s390.ram at 41d9000 len
>>>>>> 1000
>>>>>> address@hidden:ram_save_queue_pages s390.ram: start: 41d9000 len: 100
>>>>>>
>>>>>
>>>>> Is that a typo? I'm expecting those two 'len' fields to be the same?
>>>>
>>>> Yes, it's a cut-and-paste error that dropped the last byte.
>>>
>>> Good.
>>>
>>> Ok, before digging further, is this a new bug or does it happen on older
>>> QEMU? Have you got a rune to reproduce it on x86?
>>>
>>>>
>>>>>
>>>>>> On the target I see lots of
>>>>>>
>>>>>> address@hidden:postcopy_place_page_zero host=0x3ff9befa000
>>>>>> address@hidden:postcopy_place_page host=0x3ff9befb000
>>>>>> address@hidden:postcopy_place_page host=0x3ff9befc000
>>>>>> address@hidden:postcopy_place_page host=0x3ff9befd000
>>>>>> address@hidden:postcopy_place_page_zero host=0x3ff9befe000
>>>>>> address@hidden:postcopy_place_page host=0x3ff9beff000
>>>>>>
>>>>>> So we have about 2 seconds of traffic going on after that request;
>>>>>> I assume it's precopy-related.
>>>>>>
>>>>>> Looking back on the target history there was
>>>>>> address@hidden:postcopy_ram_fault_thread_request Request for
>>>>>> HVA=3ff618d9000 rb=s390.ram offset=41d9000
>>>>>>
>>>>>> In fact it seems to be the first and only request:
>>>>>>
>>>>>> # cat /tmp/test0.trace | grep -v postcopy_place_page
>>>>>>
>>>>>> address@hidden:postcopy_ram_fault_thread_request Request for
>>>>>> HVA=3ff618d9000 rb=s390.ram offset=41d9000
>>>>>
>>>>> OK, does the HVA there correspond to the address that your virtio device
>>>>> is blocking on?
>>>>
>>>> yes it is the same page.
>>>>
>>>>
>>>>> (or at least the start of the page)
>>>>> Do you see a postcopy_place_page with a host= the same HVA ?
>>>>
>>>> no
>>>
>>> Hmm, that's bad.
>>> The flow is:
>>> Precopy
>>> (a) Send pages
>>> Switch to postcopy
>>> (b) Send discards for pages that have been changed after
>>> (a)
>>> Postcopy
>>> (c) Keep sending pages until we run out
>>> (d) But send pages we're asked for first
>>>
>>> (d) might be ignored if the source thinks the page was already sent or not
>>> dirty.
>>>
>>> So we need to figure out:
>>> 1) If the source sent the pages during (a)
>>> 2) If the source discarded it during (b)
>>> 3) If it sent it again any time in c/d
>>> 4) If it ignored the request from d
>>>
>>>
>>> So please turn on the traces:
>>> get_queued_page_not_dirty should help with (4)
>>> get_queued_page should help with (4)
>>> ram_discard_range should help with (2)
>>> loadvm_postcopy_ram_handle_discard should help with (2)
>>> qemu_savevm_send_postcopy_ram_discard should help with (2)
>>>
>>> add near the top of ram_save_page in ram.c:
>>> fprintf(stderr, "%s: %s:%zx\n", __func__, block->idstr, (size_t)offset);
>>>
>>> should help with 1, 3
>>
>>
>> source:
>> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 12d
>> [the same qemu_savevm_send_postcopy_ram_discard s390.ram: 12d line
>>  repeats 24 more times]
>> address@hidden:qemu_savevm_send_postcopy_ram_discard s390.ram: 9d
>> address@hidden:migrate_handle_rp_req_pages in s390.ram at 3c6b000 len 1000
>> address@hidden:ram_save_queue_pages s390.ram: start: 3c6b000 len: 1000
>> address@hidden:get_queued_page_not_dirty s390.ram/3c6b000 ram_addr=3c6b000
>> (sent=0)
>
> OK, the source is ignoring the request because it thinks the page has
> already been sent and hasn't been redirtied.
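
The decision the source is making here can be sketched in miniature. The trace-point name mirrors QEMU's migration/ram.c, but the bitmap and helper below are invented for illustration and are not the real implementation:

```c
/* Minimal standalone sketch of the check behind (4): the source only
 * honours a queued postcopy page request if its dirty bitmap still
 * marks the page dirty.  Page indices and the bitmap are hypothetical. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define PAGE_DIRTY 1u

/* One flag per page; only page 5 has been redirtied since precopy. */
static unsigned char dirty_bitmap[16] = { [5] = PAGE_DIRTY };

/* Decide whether a queued request for 'page' should be served. */
static bool get_queued_page(size_t page)
{
    if (!(dirty_bitmap[page] & PAGE_DIRTY)) {
        /* Mirrors the get_queued_page_not_dirty trace point: the request
         * is dropped because the source believes the page was already
         * sent and has not changed since. */
        fprintf(stderr, "get_queued_page_not_dirty page=%zx (sent=0)\n", page);
        return false;
    }
    return true;
}
```

In this model the request for page 3 is dropped exactly as in the trace above, while a redirtied page would still be served.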
>
>> I also get
>> ram_save_page: s390.ram:3c6b000
>
> and you said in your other message that happens before the discards, which
> means it's precopy, and that's why you don't see the 'place_page'.
>
>> target (without postcopy_place)
>> address@hidden:ram_discard_range s390.ram: start: 0 5dc00000
>
> That's start of day I think, so we can ignore that (I think!)
>
>> address@hidden:loadvm_postcopy_ram_handle_discard
>
> ....
>
>> address@hidden:ram_discard_range s390.ram: start: 3bf6000 1000
>> address@hidden:ram_discard_range s390.ram: start: 3bfd000 1000
>> address@hidden:ram_discard_range s390.ram: start: 3c06000 4000
>> address@hidden:loadvm_postcopy_ram_handle_discard
>> address@hidden:ram_discard_range s390.ram: start: 3c79000 3000
>
> Well, it hasn't discarded the 3c6b000 page, so the page we received
> during precopy should still be there, and it shouldn't fault.
>
> In migration/ram.c in ram_load, before the line:
> switch (flags & ~RAM_SAVE_FLAG_CONTINUE) {
>
> can you add:
> fprintf(stderr, "%s: addr=%zx flags=%zx host=%p\n", __func__, addr,
> flags, host);
>
> that should show the incoming blocks during precopy and we should be able
> to see the page coming in.
> Hmm, I have a theory: if the flags field has bit 1 set, i.e.
> RAM_SAVE_FLAG_COMPRESS,
> then try changing ram_handle_compressed to always do the memset.
FWIW, changing ram_handle_compressed to always memset makes the problem go away.
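
The theory and workaround just described can be sketched as follows. This is a simplified standalone model of the idea, not QEMU's actual ram_handle_compressed(); the function names and sizes are illustrative:

```c
/* For a page consisting of one repeated byte (RAM_SAVE_FLAG_COMPRESS),
 * the original logic skips the write entirely when the byte is zero and
 * the page already reads as zero, so an untouched destination page is
 * never actually written.  Always doing the memset forces a write to
 * the page. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Original behaviour (sketch): skip the write for an all-zero page. */
static void handle_compressed_lazy(uint8_t *host, uint8_t ch, size_t size)
{
    bool all_zero = true;
    for (size_t i = 0; i < size; i++) {
        if (host[i] != 0) {
            all_zero = false;
            break;
        }
    }
    if (ch != 0 || !all_zero) {
        memset(host, ch, size);
    }
    /* When ch == 0 and the page already reads as zero, nothing is
     * written, so the destination page may stay unpopulated. */
}

/* Workaround tried in the thread: always write the page. */
static void handle_compressed_always(uint8_t *host, uint8_t ch, size_t size)
{
    memset(host, ch, size);
}
```

Both variants leave the buffer with the same contents; the difference is only whether the write (and hence the page population on the destination) actually happens for a zero page.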