qemu-block
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[PATCH for-5.0?] nbd: Attempt reconnect after server error of ESHUTDOWN


From: Eric Blake
Subject: [PATCH for-5.0?] nbd: Attempt reconnect after server error of ESHUTDOWN
Date: Wed, 1 Apr 2020 17:38:41 -0500

I was trying to test qemu's reconnect-delay parameter by using nbdkit
as a server that I could easily make disappear and resume.  A bit of
experimenting shows that when nbdkit is abruptly killed (SIGKILL),
qemu detects EOF on the socket and manages to reconnect just fine; but
when nbdkit is gracefully killed (SIGTERM), it merely fails all
further guest requests with NBD_ESHUTDOWN until the client disconnects
first, and qemu was blindly failing the I/O request with ESHUTDOWN
from the server instead of attempting to reconnect.

While most NBD server failures are unlikely to change by merely
retrying the same transaction, our decision to not start a retry loop
in the common case is correct.  But NBD_ESHUTDOWN is rare enough, and
really is indicative of a transient situation, that it is worth
special-casing.

Here's the test setup I used: in one terminal, kick off a sequence of
nbdkit commands that has a temporary window where the server is
offline; in another terminal (and within the first 5 seconds) kick off
a qemu-img convert with reconnect enabled.  If the qemu-img process
completes successfully, the reconnect worked.

$ #term1
$ MYSIG=    # or MYSIG='-s KILL'
$ timeout $MYSIG 5s ~/nbdkit/nbdkit -fv --filter=delay --filter=noextents \
  null 200M delay-read=1s; sleep 5; ~/nbdkit/nbdkit -fv --filter=exitlast \
  --filter=delay --filter=noextents null 200M delay-read=1s

$ #term2
$ MYCONN=server.type=inet,server.host=localhost,server.port=10809
$ qemu-img convert -p -O raw --image-opts \
  driver=nbd,$MYCONN,,reconnect-delay=60 out.img

See also: https://bugzilla.redhat.com/show_bug.cgi?id=1819240#c8

Signed-off-by: Eric Blake <address@hidden>
---

This is not a regression, per se, as reconnect-delay has been unchanged
since 4.2; but I'd like to consider this as an interoperability bugfix
worth including in the next rc.

 block/nbd.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index 2906484390f9..576b95fb8753 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -863,6 +863,15 @@ static coroutine_fn int nbd_co_receive_one_chunk(
     if (ret < 0) {
         memset(reply, 0, sizeof(*reply));
         nbd_channel_error(s, ret);
+    } else if (s->reconnect_delay && *request_ret == -ESHUTDOWN) {
+        /*
+         * Special case: if we support reconnect and server is warning
+         * us that it wants to shut down, then treat this like an
+         * abrupt connection loss.
+         */
+        memset(reply, 0, sizeof(*reply));
+        *request_ret = 0;
+        nbd_channel_error(s, -EIO);
     } else {
         /* For assert at loop start in nbd_connection_entry */
         *reply = s->reply;
-- 
2.26.0.rc2




reply via email to

[Prev in Thread] Current Thread [Next in Thread]