On 1/24/24 08:47, Hannes Reinecke wrote:
On 1/24/24 07:52, Philippe Mathieu-Daudé wrote:
Hi Hannes,
[+Markus as QOM/QDev rubber duck]
On 23/1/24 13:40, Hannes Reinecke wrote:
On 1/23/24 11:59, Damien Hedde wrote:
Hi all,
We are currently looking into hotplugging nvme devices, and it is not possible today:
when nvme subsystem support was introduced 2 years ago, hotplugging was disabled.
commit cc6fb6bc506e6c47ed604fcb7b7413dff0b7d845
Author: Klaus Jensen
Date: Tue Jul 6 10:48:40 2021 +0200
hw/nvme: mark nvme-subsys non-hotpluggable
We currently lack the infrastructure to handle subsystem hotplugging, so disable it.
Does someone know what's lacking, or does anyone have tips/ideas about what we should develop to add that support?
Problem is that the object model is messed up. In qemu, namespaces are attached to controllers, which in turn are children of the PCI device.
There are subsystems, but these just reference the controller.
So if you hot-unplug the PCI device you detach/destroy the controller and detach the namespaces from the controller.
But if you hotplug the PCI device again the NVMe controller will be attached to the PCI device, but the namespaces are still detached.
Klaus said he was going to fix that, and I dimly remember some patches
floating around. But apparently it never went anywhere.
Fundamental problem is that the NVMe hierarchy as per spec is
incompatible with the qemu object model; qemu requires a strict
tree model where every object has exactly _one_ parent.
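To illustrate the current model: a namespace device plugs into the bus created by a controller, and the controller sits on the PCI bus. On the command line that looks roughly like this (a sketch from memory; the ids are placeholders and the property names should be double-checked):

# sketch: ids and disk file are placeholders; 'bus=' on nvme-ns points at the controller's id
./qemu-system-aarch64 -nographic -M virt \
 -drive file=nvme0.disk,if=none,id=nvme-drive0 \
 -device nvme,serial=nvme0,id=nvme-ctrl-0 \
 -device nvme-ns,drive=nvme-drive0,bus=nvme-ctrl-0,nsid=1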
The modelling problem is not clear to me.
Do you have an example of what the NVMe hierarchy should look like?
Sure.
As per the NVMe spec we have this hierarchy:

   controller ---> subsys ---> namespaces

There can be several controllers, and several namespaces.
The initiator (i.e. the Linux 'nvme' driver) connects to a controller, queries the subsystem for the attached namespaces, and presents each namespace as a block device.
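On the qemu command line this layout is expressed with link-style references (the 'subsys=' property), not with parent/child relations, roughly like this (a sketch from memory; property names may need checking):

# sketch: two controllers reference the same subsystem; with shared=on the
# namespace should be visible through both (property names from memory)
./qemu-system-aarch64 -nographic -M virt \
 -drive file=nvme0.disk,if=none,id=nvme-drive0 \
 -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0 \
 -device nvme,serial=ctrl-a,subsys=nvme-subsys-0,id=nvme-ctrl-a \
 -device nvme,serial=ctrl-b,subsys=nvme-subsys-0,id=nvme-ctrl-b \
 -device nvme-ns,drive=nvme-drive0,nsid=1,bus=nvme-ctrl-a,shared=on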
For Qemu we have the problem that every device _must_ be a direct child of its parent (expressed by the fact that each 'parent' object is embedded in the device object).
So if we were to present an NVMe PCI device, the controller must be derived from the PCI device:
pci -> controller
but now we have to express the NVMe hierarchy, too:
pci -> ctrl1 -> subsys1 -> namespace1
which actually works.
We can easily attach several namespaces:
pci -> ctrl1 -> subsys1 -> namespace2
for a single controller and a single subsystem.
However, as mentioned above, there can be _several_
controllers attached to the same subsystem.
So we can express the second controller:
pci -> ctrl2
but we cannot attach the controller to 'subsys1'
as then 'subsys1' would need to be derived from
'ctrl2', and not (as it is now) from 'ctrl1'.
The most logical step would be to make 'subsystems' their own entity, independent of any controllers.
But then the block devices (which are derived from the namespaces) could not be traced back to the PCI device, and a PCI hot-unplug would not 'automatically' disconnect the nvme block devices.
Plus the subsystem would be independent of the NVMe PCI devices, so you could have a subsystem with no controllers attached. And one would wonder who should be responsible for cleaning that up.
Thanks for the details!
My use case is the simple one with no nvme subsystem/namespaces:
- hotplug a pci nvme device (nvme controller) as in the following CLI
(which automatically puts the drive into a default namespace)
./qemu-system-aarch64 -nographic -M virt \
-drive file=nvme0.disk,if=none,id=nvme-drive0 \
-device nvme,serial=nvme0,id=nvmedev0,drive=nvme-drive0
In the simple tree approach, where subsystems and namespaces are not shared between controllers, could we delete the whole nvme hierarchy under the controller when unplugging it?
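For reference, the monitor flow I would like to end up with is just the usual hotplug sequence (a sketch, assuming the drive is already defined at startup with -drive if=none,... but without the -device; the device ids below are the ones from the CLI above, and today this sequence is what does not work):

(qemu) device_add nvme,serial=nvme0,id=nvmedev0,drive=nvme-drive0
(qemu) device_del nvmedev0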
In your first message, you said
> So if you hot-unplug the PCI device you detach/destroy the controller
> and detach the namespaces from the controller.
> But if you hotplug the PCI device again the NVMe controller will be
> attached to the PCI device, but the namespaces are still detached.
Do you mean that if we unplug the pci device we HAVE to keep some nvme objects around, so that if we plug the device back we can recover them?
Or just that it's hard to unplug nvme objects if they are not real QOM children of the pci device?