Re: [Qemu-devel] [PATCH COLO-Frame v11 25/39] COLO: implement default fa


From: Hailiang Zhang
Subject: Re: [Qemu-devel] [PATCH COLO-Frame v11 25/39] COLO: implement default failover treatment
Date: Fri, 11 Dec 2015 17:48:35 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0

On 2015/12/11 3:01, Dr. David Alan Gilbert wrote:
* zhanghailiang (address@hidden) wrote:
If we detect an error in COLO, we wait for some time, hoping that users
also detect it. If users don't issue a failover command, we go into the
default failover procedure, in which the PVM takes over the work while
the SVM exits by default.

I'm not sure this is needed; especially on the SVM.  I don't see any harm
in the SVM waiting forever to be told what to do - it could be told to
failover or quit; I don't see any benefit to it automatically exiting.

On the primary, I can understand it if you don't have some automated error
detection system (though I think that's rare); but you really would want
to make that failover delay configurable so that you could turn it off in
a system that does have failure detection, because automatically
restarting the primary after it had caused a failover to the secondary
would be very bad.

Yes, automatically restarting the PVM may cause split-brain. I'll drop
this patch temporarily.

Thanks.
Hailiang
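
A minimal sketch of the configurable delay suggested in the review, assuming
a delay of 0 means "never fail over automatically, wait for an explicit
request". Everything here is hypothetical and stands apart from QEMU: the
names wait_for_failover, WaitOutcome and the Sim* helpers are illustrative,
and the clock is simulated so the loop can be exercised without a running VM.

```c
#include <stdbool.h>
#include <stdint.h>

#define POLL_INTERVAL_MS 100

/* Hypothetical result type, not part of QEMU. */
typedef enum {
    FAILOVER_REQUESTED,  /* a failover request arrived while waiting */
    FAILOVER_DEFAULT     /* the delay expired with no request */
} WaitOutcome;

/* Callbacks so the loop does not depend on a real clock or on QEMU state. */
typedef struct {
    int64_t (*now_ms)(void *opaque);
    bool (*failover_requested)(void *opaque);
    void (*sleep_ms)(void *opaque, int64_t ms);
    void *opaque;
} WaitOps;

/*
 * Poll for a failover request.
 * delay_ms > 0:  give up after delay_ms and report FAILOVER_DEFAULT.
 * delay_ms == 0: automatic failover is disabled; wait until a request arrives.
 */
static WaitOutcome wait_for_failover(int64_t delay_ms, const WaitOps *ops)
{
    int64_t start = ops->now_ms(ops->opaque);

    for (;;) {
        if (ops->failover_requested(ops->opaque)) {
            return FAILOVER_REQUESTED;
        }
        if (delay_ms > 0 && ops->now_ms(ops->opaque) - start >= delay_ms) {
            return FAILOVER_DEFAULT;
        }
        ops->sleep_ms(ops->opaque, POLL_INTERVAL_MS);
    }
}

/* Simulated environment: time only advances when the loop "sleeps". */
typedef struct {
    int64_t now_ms;
    int64_t request_at_ms;  /* negative: the request never arrives */
} SimEnv;

static int64_t sim_now(void *opaque)
{
    return ((SimEnv *)opaque)->now_ms;
}

static bool sim_requested(void *opaque)
{
    SimEnv *env = opaque;
    return env->request_at_ms >= 0 && env->now_ms >= env->request_at_ms;
}

static void sim_sleep(void *opaque, int64_t ms)
{
    ((SimEnv *)opaque)->now_ms += ms;
}

/* Run the wait loop against the simulated clock. */
WaitOutcome sim_wait(int64_t delay_ms, int64_t request_at_ms)
{
    SimEnv env = { 0, request_at_ms };
    WaitOps ops = { sim_now, sim_requested, sim_sleep, &env };
    return wait_for_failover(delay_ms, &ops);
}
```

With a 2000 ms delay, sim_wait(2000, 500) sees the request in time, while
sim_wait(2000, -1) falls through to the default action. With the delay set
to 0, sim_wait(0, 5000) keeps waiting past 2000 ms until the request
arrives, which is the behaviour a setup with its own failure detection
would want.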


Signed-off-by: zhanghailiang <address@hidden>
Signed-off-by: Li Zhijian <address@hidden>
---
  migration/colo.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
  1 file changed, 46 insertions(+)

diff --git a/migration/colo.c b/migration/colo.c
index f31e957..1e6d3dd 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -19,6 +19,14 @@
  #include "qemu/sockets.h"
  #include "migration/failover.h"

+/*
+ * The delay time before qemu begin the procedure of default failover treatment.
+ * Unit: ms
+ * Fix me: This value should be able to change by command
+ * 'migrate-set-parameters'
+ */
+#define DEFAULT_FAILOVER_DELAY 2000
+
  /* colo buffer */
  #define COLO_BUFFER_BASE_SIZE (4 * 1024 * 1024)

@@ -264,6 +272,7 @@ static void colo_process_checkpoint(MigrationState *s)
  {
      QEMUSizedBuffer *buffer = NULL;
      int64_t current_time, checkpoint_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
+    int64_t error_time;
      int ret = 0;
      uint64_t value;

@@ -322,8 +331,25 @@ static void colo_process_checkpoint(MigrationState *s)
      }

  out:
+    current_time = error_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
      if (ret < 0) {
          error_report("%s: %s", __func__, strerror(-ret));
+        /* Give users time to get involved in this verdict */
+        while (current_time - error_time <= DEFAULT_FAILOVER_DELAY) {
+            if (failover_request_is_active()) {
+                error_report("Primary VM will take over work");
+                break;
+            }
+            usleep(100 * 1000);
+            current_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
+        }
+
+        qemu_mutex_lock_iothread();
+        if (!failover_request_is_active()) {
+            error_report("Primary VM will take over work in default");
+            failover_request_active(NULL);
+        }
+        qemu_mutex_unlock_iothread();
      }

      qsb_free(buffer);
@@ -384,6 +410,7 @@ void *colo_process_incoming_thread(void *opaque)
      QEMUFile *fb = NULL;
      QEMUSizedBuffer *buffer = NULL; /* Cache incoming device state */
      uint64_t  total_size;
+    int64_t error_time, current_time;
      int ret = 0;
      uint64_t value;

@@ -499,9 +526,28 @@ void *colo_process_incoming_thread(void *opaque)
      }

  out:
+    current_time = error_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
      if (ret < 0) {
          error_report("colo incoming thread will exit, detect error: %s",
                       strerror(-ret));
+        /* Give users time to get involved in this verdict */
+        while (current_time - error_time <= DEFAULT_FAILOVER_DELAY) {
+            if (failover_request_is_active()) {
+                error_report("Secondary VM will take over work");
+                break;
+            }
+            usleep(100 * 1000);
+            current_time = qemu_clock_get_ms(QEMU_CLOCK_HOST);
+        }
+        /* check flag again */
+        if (!failover_request_is_active()) {
+            /*
+             * We assume that Primary VM is still alive according to
+             * heartbeat, just kill Secondary VM
+             */
+            error_report("SVM is going to exit in default!");
+            exit(1);
+        }
      }

      if (fb) {
--
1.8.3.1


--
Dr. David Alan Gilbert / address@hidden / Manchester, UK
