Re: [PATCH 0/4] colo: Introduce resource agent and high-level test


From: Lukas Straub
Subject: Re: [PATCH 0/4] colo: Introduce resource agent and high-level test
Date: Wed, 27 Nov 2019 22:11:34 +0100

On Fri, 22 Nov 2019 09:46:46 +0000
"Dr. David Alan Gilbert" <address@hidden> wrote:

> * Lukas Straub (address@hidden) wrote:
> > Hello Everyone,
> > These patches introduce a resource agent for use with the Pacemaker CRM and a
> > high-level test utilizing it for testing qemu COLO.
> >
> > The resource agent manages qemu COLO including continuous replication.
> >
> > Currently the second test case (where the peer qemu is frozen) fails on primary
> > failover, because qemu hangs while removing the replication related block nodes.
> > Note that this also happens in real world test when cutting power to the peer
> > host, so this needs to be fixed.
>
> Do you understand why that happens? Is this it's trying to finish a
> read/write to the dead partner?
>
> Dave

I haven't looked into it too closely yet, but it often hangs in bdrv_flush()
while removing the replication blockdev, probably because the nbd client is
waiting for a reply that never arrives. So I tried the workaround below, which
actively kills the TCP connection. With it the test passes, though I haven't
tested it in a real-world setup yet.
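
For illustration, the same workaround could also be written without going
through a shell. This is only a rough sketch and not part of the patch: the
helper names ss_kill_argv() and kill_nbd_connection() are made up here, and it
assumes that "ss" supports -K on the host (as far as I know this needs a kernel
built with CONFIG_INET_DIAG_DESTROY) and that sudo works without a password.

```python
import subprocess


def ss_kill_argv(peer, port, use_sudo=True):
    """Build the argument list for 'ss -K dst PEER dport = PORT'.

    Hypothetical helper: passing a list avoids shell interpretation of
    the peer name and port.
    """
    argv = ["ss", "-K", "dst", str(peer), "dport", "=", str(port)]
    if use_sudo:
        argv = ["sudo"] + argv
    return argv


def kill_nbd_connection(peer, port, timeout=5):
    """Try to forcibly close TCP connections to peer:port.

    Returns True if ss exited with status 0, False if it failed, could
    not be started, or did not finish within 'timeout' seconds.
    """
    try:
        result = subprocess.run(
            ss_kill_argv(peer, port),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

In qemu_colo_notify() this would be called as kill_nbd_connection(peer,
NBD_PORT). The timeout is there so that the resource agent itself cannot get
stuck if ss hangs for some reason.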

A proper solution to this would probably be a "force" parameter for blockdev-del,
which skips all flushing and aborts all in-flight I/O. Or we could add a timeout
to the nbd client.

Regards,
Lukas Straub

diff --git a/scripts/colo-resource-agent/colo b/scripts/colo-resource-agent/colo
index 5fd9cfc0b5..62210af2a1 100755
--- a/scripts/colo-resource-agent/colo
+++ b/scripts/colo-resource-agent/colo
@@ -935,6 +935,7 @@ def qemu_colo_notify():
            and HOSTNAME == str.strip(OCF_RESKEY_CRM_meta_notify_master_uname):
             fd = qmp_open()
             peer = qmp_get_nbd_remote(fd)
+            os.system("sudo ss -K dst %s dport = %s" % (peer, NBD_PORT))
             if peer == str.strip(OCF_RESKEY_CRM_meta_notify_stop_uname):
                 if qmp_check_resync(fd) != None:
                     qmp_cancel_resync(fd)




