Re: [PATCH v3 1/1] vhost-user-fs: add migration type property

qemu-devel

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3 1/1] vhost-user-fs: add migration type property

From:	Anton Kuchin
Subject:	Re: [PATCH v3 1/1] vhost-user-fs: add migration type property
Date:	Mon, 27 Feb 2023 12:19:53 +0200
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.7.1


On 24/02/2023 06:14, Anton Kuchin wrote:

On 23/02/2023 23:24, Stefan Hajnoczi wrote:
On Thu, Feb 23, 2023 at 02:36:33AM -0500, Michael S. Tsirkin wrote:
On Wed, Feb 22, 2023 at 03:21:42PM -0500, Michael S. Tsirkin wrote:
On Wed, Feb 22, 2023 at 08:25:19PM +0200, Anton Kuchin wrote:
On 22/02/2023 19:12, Michael S. Tsirkin wrote:
On Wed, Feb 22, 2023 at 07:05:47PM +0200, Anton Kuchin wrote:
On 22/02/2023 18:51, Michael S. Tsirkin wrote:
On Wed, Feb 22, 2023 at 06:49:10PM +0200, Anton Kuchin wrote:
On 22/02/2023 17:14, Vladimir Sementsov-Ogievskiy wrote:
On 22.02.23 17:25, Anton Kuchin wrote:
+static int vhost_user_fs_pre_save(void *opaque)
+{
+    VHostUserFS *fs = opaque;
+ g_autofree char *path =object_get_canonical_path(OBJECT(fs));
+
+    switch (fs->migration_type) {
+    case VHOST_USER_MIGRATION_TYPE_NONE:
+ error_report("Migration is blocked by device%s", path);
+        break;
+    case VHOST_USER_MIGRATION_TYPE_EXTERNAL:
+        return 0;
+    default:
+        error_report("Migration type '%s' is not
supported by device %s",
+ VhostUserMigrationType_str(fs->migration_type), path);
+        break;
+    }
+
+    return -1;
+}
Should we also add this as .pre_load, to force user select
correct migration_type on target too?
In fact, I would claim we only want pre_load.
When qemu is started on destination we know where it'smigrated
from so this flag can be set.
When qemu is started on source we generally do not yet know so
we don't know whether it's safe to set this flag.
But destination is a "source" for next migration, so thereshouldn't be
real difference.
The new property has ".realized_set_allowed = true", so, as Iunderstand
it may be changed at any time, so that's not a problem.
Yes, exactly. So destination's property sets not how it willhandle this
incoming
migration but the future outgoing one.
How do you know where you are going to migrate though?
I think you don't.
Setting it on source is better since we know where we
are migrating from.
Yes, I don't know where I'm going to migrate to. This is whyproperty
affects only how source saves state on outgoing migration.
Um. I don't get the logic.
For this feature to work we need orchestrator to manage themigration. And
we
generally assume that it is responsibility of orchestrator to ensure
matching
properties on source and destination.
As orchestrator manages both sides of migration it can set option(and we
can
check it) on either source or destination. Now it's not importantwhich side
we
select, because now the option is essentially binary allow/deny(but IMHO itis much better to refuse source to migrate than find later thatstate can't
be
loaded by destination, in case of file migration this becomesespecially
painful).
But there are plans to add internal migration option (extract FUSEstate
from
backend and transfer it in QEMU migration stream), and that's where
setting/checking
on source becomes important because it will rely on this propertyto decide
if
extra state form backend needs to be put in the migration streamsubsection.
If we do internal migration that will be a different property
which has to match on source *and* destination.
If you are concerned about orchestrator breaking assumption ofmatching
properties
on source and destination this is not really supported AFAIK but Idon't
think we
need to punish it for this, maybe it has its reasons: I canimagine scenario
where orchestrator could want to migrate from source with
'migration=external'
to destination with 'migration=none' to ensure that destinationcan't be
migrated further.
No. I am concerned about a simple practical matter:
- I decide to restart qemu on the same host - so I need to enable
   migration
- Later I decide to migrate qemu to another host - this should be
   blocked


Property on source does not satisfy both at the same time.
Property on destination does.
Stefan what's your take on this? Should we move this from
save to load hook?
This property can be changed on the source at runtime via qom-set, so
you don't need to predict the future. The device can be started from an
incoming migration with "external" and then set to "stateful" migration
to migrate to another host later on.

Anton, can you share the virtiofsd patches so we have a full picture of
how "external" migration works? I'd like to understand the workflow and
also how it can be extended when "stateful" migration is added.
Unfortunately internal implementation is relying heavily on ourinfrastructure,and rust virtiofsd still lacks dirty tracking so it is not ready yet.But I did
have a PoC for deprecated now C virtiofsd that I didn't bother to prepare
for upstreaming because C version was declared unsupported. Itessentially adds
reconnect and this was the only thing required from virtiofsd to support
migration via file.
If this helps I'll try to find patches or recreate then and will behappy to share.

Found my stash with PoC. It has some TODOs and needs more work to becomesubmission

quality, but I hope this can be helpful to show the idea.

I checked it with file migration on the same host to update qemu binarylike this:

 1. Run tools/virtiofsd/virtiofsd with "--reconnect" flag
 2. Run src qemu VM with vhost-user-fs device

3. Mount fs in guest and run fio with verification on test file inshared directory

 4. Command via QMP to migrate VM to file
 5. Run dst qemu VM with vhost-user-fs device and "-incoming defer" flag
 6. Command via QMP to load migration stream from file and continue VM
 7. Check that fio keeps running in guest with no errors

[PATCH] tools/virtiofsd: reconnect PoC

Keep daemon listening after disconnect so session can be continued
after VM is restored.

Signed-off-by: Anton Kuchin <antonkuchin@yandex-team.ru>
---
 tools/virtiofsd/fuse_lowlevel.h       |  1 +
 tools/virtiofsd/fuse_virtio.c         | 59 +++++++++++++++++----------
 tools/virtiofsd/helper.c              |  1 +
 tools/virtiofsd/passthrough_ll.c      | 49 ++++++++++++++++------
 tools/virtiofsd/passthrough_seccomp.c | 12 +++++-
 tools/virtiofsd/passthrough_seccomp.h |  2 +-
 6 files changed, 87 insertions(+), 37 deletions(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.hb/tools/virtiofsd/fuse_lowlevel.h

index b889dae4de..d5f3bf05ba 100644
--- a/tools/virtiofsd/fuse_lowlevel.h
+++ b/tools/virtiofsd/fuse_lowlevel.h
@@ -1795,6 +1795,7 @@ struct fuse_cmdline_opts {
     int log_level;
     unsigned int max_idle_threads;
     unsigned long rlimit_nofile;
+    int reconnect;
 };

 /**
diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
index 9368e292e4..64782cd78d 100644
--- a/tools/virtiofsd/fuse_virtio.c
+++ b/tools/virtiofsd/fuse_virtio.c
@@ -837,6 +837,7 @@ static const VuDevIface fv_iface = {
 int virtio_loop(struct fuse_session *se)
 {
     fuse_log(FUSE_LOG_INFO, "%s: Entry\n", __func__);
+    int ret = 0;

     while (!fuse_session_exited(se)) {
         struct pollfd pf[1];
@@ -858,7 +859,13 @@ int virtio_loop(struct fuse_session *se)
             break;
         }
         assert(poll_res == 1);
-        if (pf[0].revents & (POLLERR | POLLHUP | POLLNVAL)) {
+        if (pf[0].revents & POLLHUP) {

+ fuse_log(FUSE_LOG_ERR, "%s: Client disconnected: %x\n",__func__,

+                     pf[0].revents);
+            ret = -1;
+            break;
+        }
+        if (pf[0].revents & (POLLERR | POLLNVAL)) {

fuse_log(FUSE_LOG_ERR, "%s: Unexpected poll revents %x\n",__func__,

                      pf[0].revents);
             break;
@@ -885,7 +892,7 @@ int virtio_loop(struct fuse_session *se)
     stop_all_queues(se->virtio_dev);
     fuse_log(FUSE_LOG_INFO, "%s: Exit\n", __func__);

-    return 0;
+    return ret;
 }

 static void strreplace(char *s, char old, char new)
@@ -1015,25 +1022,31 @@ int virtio_session_mount(struct fuse_session *se)
 {
     int ret;

-    /*

- * Test that unshare(CLONE_FS) works. fv_queue_worker() will needit. It's- * an unprivileged system call but some Docker/Moby versions areknown to

-     * reject it via seccomp when CAP_SYS_ADMIN is not given.
-     *
-     * Note that the program is single-threaded here so this syscall has no
-     * visible effect and is safe to make.
-     */
-    ret = unshare(CLONE_FS);
-    if (ret == -1 && errno == EPERM) {
-        fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed with EPERM. If "
-                "running in a container please check that the container "
-                "runtime seccomp policy allows unshare.\n");
-        return -1;
-    }

-    ret = fv_create_listen_socket(se);
-    if (ret < 0) {
-        return ret;
+    if (se->vu_listen_fd == -1) {
+        fuse_log(FUSE_LOG_INFO, "listenfd is closed. set it up\n");
+        /*

+ * Test that unshare(CLONE_FS) works. fv_queue_worker() willneed it. It's+ * an unprivileged system call but some Docker/Moby versionsare known to

+         * reject it via seccomp when CAP_SYS_ADMIN is not given.
+         *

+ * Note that the program is single-threaded here so thissyscall has no

+         * visible effect and is safe to make.
+         */
+        ret = unshare(CLONE_FS);
+        if (ret == -1 && errno == EPERM) {

+ fuse_log(FUSE_LOG_ERR, "unshare(CLONE_FS) failed withEPERM. If "+ "running in a container please check that thecontainer "

+                    "runtime seccomp policy allows unshare.\n");
+            return -1;
+        }
+
+        ret = fv_create_listen_socket(se);
+        if (ret < 0) {
+            return ret;
+        }
+    } else {
+        fuse_log(FUSE_LOG_INFO, "listenfd is open. reconnecting\n");
     }

     se->fd = -1;
@@ -1046,8 +1059,8 @@ int virtio_session_mount(struct fuse_session *se)
         close(se->vu_listen_fd);
         return -1;
     }
-    close(se->vu_listen_fd);
-    se->vu_listen_fd = -1;
+
+    /* TODO: close vu_listen_fd here if not reconnect */
     fuse_log(FUSE_LOG_INFO, "%s: Received vhost-user socket connection\n",
              __func__);

@@ -1068,6 +1081,8 @@ int virtio_session_mount(struct fuse_session *se)

 void virtio_session_close(struct fuse_session *se)
 {
+    /* TODO: vu_listen_fd can be closed in session_mount */
+    close(se->vu_listen_fd);
     close(se->vu_socketfd);

     if (!se->virtio_dev) {
diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index f5f66f292c..236a58f109 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -53,6 +53,7 @@ static const struct fuse_opt fuse_helper_opts[] = {
     FUSE_HELPER_OPT_VALUE("log_level=info", log_level, FUSE_LOG_INFO),
     FUSE_HELPER_OPT_VALUE("log_level=warn", log_level, FUSE_LOG_WARNING),
     FUSE_HELPER_OPT_VALUE("log_level=err", log_level, FUSE_LOG_ERR),
+    FUSE_HELPER_OPT_VALUE("--reconnect", reconnect, 0),
     FUSE_OPT_END
 };

diff --git a/tools/virtiofsd/passthrough_ll.cb/tools/virtiofsd/passthrough_ll.c

index 40ea2ed27f..487d320c70 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -3819,7 +3819,7 @@ static void setup_wait_parent_capabilities(void)
 /*
  * Move to a new mount, net, and pid namespaces to isolate this process.
  */
-static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)

+static void setup_namespaces(struct lo_data *lo, struct fuse_session*se, bool enable_reconnect)

 {
     pid_t child;

@@ -3829,12 +3829,21 @@ static void setup_namespaces(struct lo_data *lo,struct fuse_session *se)

      * is also needed so that we can remount /proc for the new pid
      * namespace.
      *
-     * Our UNIX domain sockets have been created.  Now we can move to
-     * an empty network namespace to prevent TCP/IP and other network
-     * activity in case this process is compromised.

+ * Our UNIX domain sockets have been created. If we don't needreconnect

+     * now we can move to an empty network namespace to prevent TCP/IP and
+     * other network activity in case this process is compromised.
+     *

+ * TODO: Need to double-check if this is necessary. Looks like unixsockets+ * can be shared across network namespaces. Maybe it is better ifsocket can+ * be created in new namespace (but before we forbid listen withseccomp).

      */
-    if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {

- fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS):%m\n");

+    int unshare_flags = CLONE_NEWPID | CLONE_NEWNS;
+    if (!enable_reconnect) {
+        unshare_flags |= CLONE_NEWNET;
+    }
+    if (unshare(unshare_flags) != 0) {

+ fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS%s):%m\n",

+                 enable_reconnect ? " | CLONE_NEWNET" : "");
         exit(1);
     }

@@ -4146,16 +4155,16 @@ static void setup_chroot(struct lo_data *lo)

* source directory. This reduces the impact of arbitrary codeexecution bugs.

  */
 static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
-                          bool enable_syslog)
+                          bool enable_syslog, bool enable_reconnect)
 {
     if (lo->sandbox == SANDBOX_NAMESPACE) {
-        setup_namespaces(lo, se);
+        setup_namespaces(lo, se, enable_reconnect);
         setup_mounts(lo->source);
     } else {
         setup_chroot(lo);
     }

-    setup_seccomp(enable_syslog);
+    setup_seccomp(enable_syslog, enable_reconnect);
     setup_capabilities(g_strdup(lo->modcaps));
 }

@@ -4500,13 +4509,27 @@ int main(int argc, char *argv[])
     /* Must be before sandbox since it wants /proc */
     setup_capng();

-    setup_sandbox(&lo, se, opts.syslog);
+    setup_sandbox(&lo, se, opts.syslog, opts.reconnect);

     setup_root(&lo, &lo.root);
-    /* Block until ctrl+c or fusermount -u */
-    ret = virtio_loop(se);
+    do {
+        /* TODO: this loop should be like {mount-loop-unmount} but setup of

+ * listen descriptor happens in mount and will fail aftersandboxing.

+         * Need to extract setup to virtio_session_new where we can create

+ * socket before sandboxing and in session_mount only acceptclient.*/

+        /* Block until ctrl+c, fusermount -u or client disconnect */
+        ret = virtio_loop(se);
+        fuse_log(FUSE_LOG_ERR, "ret is %d, %s\n", ret,
+                 ret ? "reconnecting" : "terminating");
+        fuse_session_unmount(se);
+        if (ret) {
+            if (fuse_session_mount(se) != 0) {
+                fuse_log(FUSE_LOG_ERR, "failed to remount\n");
+                goto err_out3;
+            }
+        }
+    } while (opts.reconnect && !ret);

-    fuse_session_unmount(se);
     cleanup_capng();
 err_out3:
     fuse_remove_signal_handlers(se);

diff --git a/tools/virtiofsd/passthrough_seccomp.cb/tools/virtiofsd/passthrough_seccomp.c

index 0033dab493..8dd773a062 100644
--- a/tools/virtiofsd/passthrough_seccomp.c
+++ b/tools/virtiofsd/passthrough_seccomp.c
@@ -129,6 +129,11 @@ static const int syscall_allowlist_syslog[] = {
     SCMP_SYS(sendto),
 };

+/* Allow accept to handle client reconnect */
+static const int syscall_allowlist_reconnect[] = {
+    SCMP_SYS(accept),
+};
+

static void add_allowlist(scmp_filter_ctx ctx, const int syscalls[],size_t len)

 {
     size_t i;

@@ -142,7 +147,7 @@ static void add_allowlist(scmp_filter_ctx ctx, constint syscalls[], size_t len)

     }
 }

-void setup_seccomp(bool enable_syslog)
+void setup_seccomp(bool enable_syslog, bool enable_reconnect)
 {
     scmp_filter_ctx ctx;

@@ -166,6 +171,11 @@ void setup_seccomp(bool enable_syslog)
                       G_N_ELEMENTS(syscall_allowlist_syslog));
     }

+    if (enable_reconnect) {
+        add_allowlist(ctx, syscall_allowlist_reconnect,
+ G_N_ELEMENTS(syscall_allowlist_reconnect));
+    }
+

/* libvhost-user calls this for post-copy migration, we don't needit */

     if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(ENOSYS),
                          SCMP_SYS(userfaultfd), 0) != 0) {

diff --git a/tools/virtiofsd/passthrough_seccomp.hb/tools/virtiofsd/passthrough_seccomp.h

index 12674fc050..bec47b114c 100644
--- a/tools/virtiofsd/passthrough_seccomp.h
+++ b/tools/virtiofsd/passthrough_seccomp.h
@@ -9,6 +9,6 @@
 #ifndef VIRTIOFSD_PASSTHROUGH_SECCOMP_H
 #define VIRTIOFSD_PASSTHROUGH_SECCOMP_H

-void setup_seccomp(bool enable_syslog);
+void setup_seccomp(bool enable_syslog, bool enable_reconnect);

 #endif /* VIRTIOFSD_PASSTHROUGH_SECCOMP_H */
--
2.37.2

This property selects if VM can migrate and if it can whatshould
qemu put
to the migration stream. So we select on source what type of
migration is
allowed for this VM, destination can't check anything atload time.
OK, so the new field "migration" regulates only outgoingmigration anddo nothing for incoming. On incoming migration the migrationstream
itself defines the type of device migration.
Worth mentioning in doc?
Good point. I don't think this deserves a respin but if I haveto send v4
I'll include
clarification in it.

[Prev in Thread]

Current Thread

[Next in Thread]

Re: [PATCH v3 1/1] vhost-user-fs: add migration type property, (continued)

Prev by Date: [PATCH v4 3/4] target/i386: KVM: allow fast string operations if host supports them
Next by Date: Re: [PATCH v2 00/20] vfio: Add migration pre-copy support and device dirty tracking
Previous by thread: Re: [PATCH v3 1/1] vhost-user-fs: add migration type property
Next by thread: Re: [PATCH v3 1/1] vhost-user-fs: add migration type property
Index(es):
- Date
- Thread