qemu-devel

[Question] An issue when repeat reboot in guest during migration


From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Subject: [Question] An issue when repeat reboot in guest during migration
Date: Mon, 9 Mar 2020 12:25:49 +0800
User-agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1

Hi guys,

We found an issue when the guest reboots repeatedly during migration: it causes
the migration thread to never be woken up again.

<main loop>                        |<migration_thread>
                                   |
main_loop_should_exit [BQL LOCK]   |
 pause_all_vcpus                   |
  1. set all cpus ->stop=true      |
     and then kick                 |
  2. return if all cpus are paused |
     (by '->stopped == true'), else|
  3. qemu_cond_wait [BQL UNLOCK]   |
                                   |LOCK BQL
                                   |...
                                   |do_vm_stop
                                   | pause_all_vcpus
                                   |  (A)set all cpus ->stop=true
                                   |     and then kick
                                   |  (B)return if all cpus are paused
                                   |     (by '->stopped == true'), else
                                   |  (C)qemu_cond_wait [BQL UNLOCK]
  4. woken up and LOCK BQL         |  (D)woken up BUT waits for BQL
  5. goto 2.                       |
 (BQL is still LOCKed)             |
 resume_all_vcpus                  |
  1. set all cpus ->stop=false     |
     and ->stopped=false           |
...                                |
BQL UNLOCK                         |  (E)LOCK BQL
                                   |  (F)goto B. [but stopped is false now!]
                                   |Finally, sleeps at step (C) forever.
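
The interleaving above can be replayed in a small single-threaded model. This
is only a sketch: the toy_* names are hypothetical and are not QEMU code, the
BQL and the condition variable are represented by the order of the calls, and
it assumes a vCPU acknowledges a stop request by clearing 'stop' and setting
'stopped'.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 2

/* Toy stand-in for the two per-vCPU flags named in the diagram. */
struct toy_cpu {
    bool stop;     /* a pause has been requested */
    bool stopped;  /* the vCPU has acknowledged the pause */
};

/* Steps 1 / (A): request a pause on every vCPU (the kick is implied). */
static void toy_request_stop(struct toy_cpu *cpus, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        cpus[i].stop = true;
    }
}

/* Assumed vCPU behaviour: acknowledge a pending stop request. */
static void toy_vcpu_ack(struct toy_cpu *cpu)
{
    if (cpu->stop) {
        cpu->stop = false;
        cpu->stopped = true;
    }
}

/* Steps 2 / (B): are all vCPUs paused? */
static bool toy_all_paused(const struct toy_cpu *cpus, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!cpus[i].stopped) {
            return false;
        }
    }
    return true;
}

/* resume_all_vcpus as drawn in the diagram: clear both flags. */
static void toy_resume(struct toy_cpu *cpus, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        cpus[i].stop = false;
        cpus[i].stopped = false;
    }
}

/* Replays the diagram; returns true if the migration thread would hang. */
bool toy_migration_thread_hangs(void)
{
    struct toy_cpu cpus[NR_CPUS] = {0};

    toy_request_stop(cpus, NR_CPUS);          /* main loop: step 1 */
    toy_request_stop(cpus, NR_CPUS);          /* migration thread: (A) */
    for (size_t i = 0; i < NR_CPUS; i++) {
        toy_vcpu_ack(&cpus[i]);               /* vCPUs stop, waiters are woken */
    }
    bool main_ok = toy_all_paused(cpus, NR_CPUS);  /* main loop: steps 4, 5, 2 */
    toy_resume(cpus, NR_CPUS);                /* main loop, BQL still held */

    /* (E), (F): the migration thread finally gets the BQL and re-checks. */
    bool migration_ok = toy_all_paused(cpus, NR_CPUS);

    /* With no stop request pending, no vCPU will set 'stopped' again. */
    bool stop_pending = false;
    for (size_t i = 0; i < NR_CPUS; i++) {
        stop_pending = stop_pending || cpus[i].stop;
    }
    return main_ok && !migration_ok && !stop_pending;
}
```

In this model the main loop sees every vCPU paused, while the migration thread
re-checks after the resume, finds 'stopped' cleared with no request pending,
and so would go back to waiting with nothing left to wake it.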

I have not tested the latest QEMU yet, but after a quick look at the code, it
seems to have this issue as well.

So far I have found only two potential approaches to fix it:

1. Just retry. Check whether all vcpus are paused before (F); if not, go to (A)
to send the stop request again. It is very simple and safe, but it only reduces
the probability of hitting the problem.

2. Support nested pause. Use a refcount instead of the 'bool stop', and switch
'stopped' to false only when the refcount drops to zero. This might solve the
problem completely, but it seems easy to introduce other bugs if we are not
careful.
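
Both approaches can be sketched in the same kind of single-threaded toy model.
This is an illustration under assumptions, not a patch: every toy_* name and
the 'pause_refs' field are hypothetical, locking and kicking are left out, and
a vCPU is assumed to acknowledge a stop request by clearing 'stop' and setting
'stopped'.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 2

struct toy_cpu {
    bool stop;     /* a pause has been requested */
    bool stopped;  /* the vCPU has acknowledged the pause */
};

struct toy_vm {
    struct toy_cpu cpus[NR_CPUS];
    unsigned pause_refs;  /* hypothetical: outstanding pause requests */
};

static bool toy_all_paused(const struct toy_vm *vm)
{
    for (size_t i = 0; i < NR_CPUS; i++) {
        if (!vm->cpus[i].stopped) {
            return false;
        }
    }
    return true;
}

/* Step (A): set 'stop' on every vCPU. */
static void toy_request_stop(struct toy_vm *vm)
{
    for (size_t i = 0; i < NR_CPUS; i++) {
        vm->cpus[i].stop = true;
    }
}

/* Assumed vCPU behaviour: acknowledge any pending stop request. */
static void toy_vcpus_ack(struct toy_vm *vm)
{
    for (size_t i = 0; i < NR_CPUS; i++) {
        if (vm->cpus[i].stop) {
            vm->cpus[i].stop = false;
            vm->cpus[i].stopped = true;
        }
    }
}

/* Approach 1: if not all paused, re-send the request before waiting again. */
static bool toy_recheck_or_retry(struct toy_vm *vm)
{
    if (toy_all_paused(vm)) {
        return true;
    }
    toy_request_stop(vm);  /* back to (A) */
    return false;          /* caller waits on the condition again */
}

/* Approach 2: nested pause, counted per requester. */
static void toy_pause_get(struct toy_vm *vm)
{
    vm->pause_refs++;
    toy_request_stop(vm);
}

static void toy_pause_put(struct toy_vm *vm)
{
    if (vm->pause_refs == 0 || --vm->pause_refs > 0) {
        return;  /* unbalanced call, or another requester still needs the pause */
    }
    for (size_t i = 0; i < NR_CPUS; i++) {
        vm->cpus[i].stop = false;
        vm->cpus[i].stopped = false;
    }
}

/* Approach 1, starting from the state left behind by the resume. */
bool toy_retry_demo(void)
{
    struct toy_vm vm = {0};
    bool first = toy_recheck_or_retry(&vm);   /* not paused, request re-sent */
    toy_vcpus_ack(&vm);
    bool second = toy_recheck_or_retry(&vm);  /* paused now */
    return !first && second;
}

/* Approach 2: the main loop's resume must not undo the migration pause. */
bool toy_refcount_demo(void)
{
    struct toy_vm vm = {0};
    toy_pause_get(&vm);                       /* main loop */
    toy_pause_get(&vm);                       /* migration thread */
    toy_vcpus_ack(&vm);
    toy_pause_put(&vm);                       /* main loop resumes */
    bool still_paused = toy_all_paused(&vm);
    toy_pause_put(&vm);                       /* migration thread resumes */
    return still_paused && !toy_all_paused(&vm) && vm.pause_refs == 0;
}
```

The retry variant recovers here only because nothing resumes the vCPUs again
between the re-sent request and the second check, which is why it only lowers
the probability. The refcount variant keeps 'stopped' set until the last
requester has released its pause.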

Any suggestions?

Thanks.

---
Regards,
Longpeng(Mike)


