Re: [Qemu-devel] Using PCI config space to indicate config location


From: Anthony Liguori
Subject: Re: [Qemu-devel] Using PCI config space to indicate config location
Date: Tue, 09 Oct 2012 08:56:27 -0500
User-agent: Notmuch/0.13.2+93~ged93d79 (http://notmuchmail.org) Emacs/23.3.1 (x86_64-pc-linux-gnu)

Rusty Russell <address@hidden> writes:

> Anthony Liguori <address@hidden> writes:
>> We'll never remove legacy so we shouldn't plan on it.  There are
>> literally hundreds of thousands of VMs out there with the current virtio
>> drivers installed in them.  We'll be supporting them for a very, very
>> long time :-)
>
> You will be supporting this for qemu on x86, sure.

And PPC.

> As I think we're
> still in the growth phase for virtio, I prioritize future spec
> cleanliness pretty high.
>
> But I think you'll be surprised how fast this is deprecated:
> 1) Bigger queues for block devices (guest-specified ringsize)
> 2) Smaller rings for openbios (guest-specified alignment)
> 3) All-mmio mode (powerpc)
> 4) Whatever network features get numbers > 31.

We can do all of these things with incremental changes to the existing
layout.  That's the only way in which what I'm suggesting differs.

You want to reorder all of the fields and have a driver flag day.  But I
strongly suspect we'll decide we need to do the same exercise again in 4
years, when we need to figure out how to take advantage of
transactional memory or some other whiz-bang hardware feature.

There are a finite number of BARs, but each BAR has an almost infinite
size.  So extending existing BARs instead of introducing new ones seems
like the conservative approach going forward.

>> I don't think we gain a lot by moving the ISR into a separate BAR.
>> Splitting up registers like that seems weird to me too.
>
> Confused.  I proposed the same split as you have, just ISR by itself.

I disagree with moving the ISR into a separate BAR.  That's what seems
weird to me.

>> It's very normal to have a mirrored set of registers that are PIO in one
>> bar and MMIO in a different BAR.
>>
>> If we added an additional constraint that BAR1 was mirrored except for
>> the config space and the MSI section was always there, I think the end
>> result would be nice.  IOW:
>
> But it won't be the same, because we want all that extra stuff, like
> more feature bits and queue size alignment.  (Admittedly queues past
> 16TB aren't a killer feature).
>
> To make it concrete:
>
> Current:
> struct {
>         __le32 host_features;   /* read-only */
>         __le32 guest_features;  /* read/write */
>         __le32 queue_pfn;       /* read/write */
>         __le16 queue_size;      /* read-only */
>         __le16 queue_sel;       /* read/write */
>         __le16 queue_notify;    /* read/write */
>         u8 status;              /* read/write */
>         u8 isr;                 /* read-only, clear on read */
>         /* Optional */
>         __le16 msi_config_vector;       /* read/write */
>         __le16 msi_queue_vector;        /* read/write */
>         /* ... device features */
> };
>
> Proposed:
> struct virtio_pci_cfg {
>       /* About the whole device. */
>       __le32 device_feature_select;   /* read-write */
>       __le32 device_feature;          /* read-only */
>       __le32 guest_feature_select;    /* read-write */
>       __le32 guest_feature;           /* read-write */
>       __le16 msix_config;             /* read-write */
>       __u8 device_status;             /* read-write */
>       __u8 unused;
>
>       /* About a specific virtqueue. */
>       __le16 queue_select;    /* read-write */
>       __le16 queue_align;     /* read-write, power of 2. */
>       __le16 queue_size;      /* read-write, power of 2. */
>       __le16 queue_msix_vector;/* read-write */
>       __le64 queue_address;   /* read-write: 0xFFFFFFFFFFFFFFFF == DNE. */
> };
>
> struct virtio_pci_isr {
>         __u8 isr;                 /* read-only, clear on read */
> };

What I'm suggesting is:

> struct {
>         __le32 host_features;   /* read-only */
>         __le32 guest_features;  /* read/write */
>         __le32 queue_pfn;       /* read/write */
>         __le16 queue_size;      /* read-only */
>         __le16 queue_sel;       /* read/write */
>         __le16 queue_notify;    /* read/write */
>         u8 status;              /* read/write */
>         u8 isr;                 /* read-only, clear on read */
>         __le16 msi_config_vector;       /* read/write */
>         __le16 msi_queue_vector;        /* read/write */
>         __le32 host_feature_select;     /* read/write */
>         __le32 guest_feature_select;    /* read/write */
>         __le32 queue_pfn_hi;            /* read/write */
> };
>

With the additional semantic that the virtio-config space is overlaid
on top of the register set in BAR0 unless the
VIRTIO_PCI_F_SEPARATE_CONFIG feature is acknowledged.  This feature
acts as a latch: once set, it removes the config space overlay.

If the config space overlays the registers, its offset in BAR0 depends
on whether MSI is enabled in the PCI device.
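
To make the offset rule concrete, here is a minimal sketch (my names,
nothing from the spec; the sizes fall out of the legacy layout quoted
above, 20 bytes of registers plus 4 bytes of MSI vectors):

#include <stdbool.h>

#define VIRTIO_PCI_REG_BYTES 20 /* host_features through isr */
#define VIRTIO_PCI_MSI_BYTES  4 /* msi_config_vector + msi_queue_vector */

/* BAR0 offset of the virtio-config overlay, or -1 once the
 * VIRTIO_PCI_F_SEPARATE_CONFIG latch has moved config to BAR2. */
static int virtio_config_offset(bool msi_enabled, bool separate_config)
{
    if (separate_config)
        return -1; /* no overlay: config lives only in BAR2 */
    return VIRTIO_PCI_REG_BYTES + (msi_enabled ? VIRTIO_PCI_MSI_BYTES : 0);
}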

BAR1 is an MMIO mirror of BAR0, except that the config space is never
overlaid in BAR1, regardless of VIRTIO_PCI_F_SEPARATE_CONFIG.

BAR2 contains the config space.

A guest can look at BAR1 and BAR2 to determine whether they exist.
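
To illustrate the probe (a sketch only, not real driver code): on
Linux a guest can use the stock pci_resource_len() helper, which
returns 0 for an absent BAR:

#include <linux/pci.h>

/* "Modern" here just means both new BARs are present. */
static bool virtio_pci_is_modern(struct pci_dev *pdev)
{
    return pci_resource_len(pdev, 1) != 0 && /* MMIO register mirror */
           pci_resource_len(pdev, 2) != 0;   /* separate config space */
}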

> We could also enforce LE in the per-device config space in this case,
> another nice cleanup for PCI people.
>
>> BAR0[pio]: virtio-pci registers + optional MSI section + virtio-config
>> BAR1[mmio]: virtio-pci registers + MSI section + future extensions
>> BAR2[mmio]: virtio-config
>>
>> We can continue to do ISR access via BAR0 for performance reasons.
>
> But powerpc explicitly *doesn't* want a pio bar.  So let it be its own
> bar, which can be either.

Does it really care?  It all ends up being MMIO for PPC anyway, so I
don't think there's any real pressing need for it not to be PIO.

>>> As to MMIO vs PIO, the BARs are self-describing, so we should explicitly
>>> endorse that and leave it to the devices.
>>>
>>> The detection is simple: if BAR1 has non-zero length, it's new-style,
>>> otherwise legacy.
>>
>> I agree that this is the best way to extend, but I think we should still
>> use a transport feature bit.  We want to be able to detect within QEMU
>> whether a guest is using these new features because we need to adjust
>> migration state accordingly.
>>
>> Otherwise we would have to detect reads/writes to the new BARs to
>> track whether the extended register state needs to be saved.  This
>> gets nasty when dealing with things like reset.
>
> I don't think it'll be that bad; reset clears the device to unknown,
> bar0 moves it from unknown->legacy mode, bar1/2/3 changes it from
> unknown->modern mode, and anything else is bad (I prefer being strict so
> we catch bad implementations from the beginning).

You mean, "accesses to bar1/2/3" change it to modern mode.  That's a
pretty big side effect of a read.

> But I'm happy to implement it and see what it's like.
>
>> A feature bit simplifies this all pretty well.
>
> I suspect it will be quite ugly, actually.  The guest has to use BAR0 to
> get the features to see if it can use BAR1.  Does it ack the feature
> (via BAR0) before accessing BAR1?  If not, qemu can't rely on the
> feature bit.

See above.  A guest could happily just use BAR1/BAR2 and completely
ignore BAR0 provided that BAR1/BAR2 are present.

It also means the guest drivers don't need separate code paths for
"legacy" vs. "modern".  They can switch between the two by just changing
a single pointer.
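
For example (a sketch with made-up names; pci_iomap() is the standard
Linux helper that hides the PIO/MMIO distinction behind a single
__iomem pointer):

#include <linux/pci.h>

struct virtio_pci_device {
    void __iomem *regs; /* BAR0 (legacy, PIO) or BAR1 (modern, MMIO) */
};

static int virtio_pci_map_regs(struct virtio_pci_device *vp,
                               struct pci_dev *pdev)
{
    /* Prefer the MMIO mirror when present, else the legacy PIO bar.
     * Every register access after this goes through the one pointer. */
    int bar = pci_resource_len(pdev, 1) ? 1 : 0;

    vp->regs = pci_iomap(pdev, bar, 0);
    return vp->regs ? 0 : -ENOMEM;
}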

The implementation ends up being pretty trivial too.  We can use the
same MemoryRegion in QEMU for both PIO and MMIO.  The kernel code stays
the same.  It just takes a couple of if()s to handle whether there's a
config overlay or not.
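
Roughly like this (the handler names, the bar0/bar1 fields, and the
64-byte size are placeholders, not our current code): one
MemoryRegionOps table, and therefore one set of read/write handlers,
backs both the PIO BAR and its MMIO mirror.

static const MemoryRegionOps virtio_pci_ops = {
    .read = virtio_pci_config_read,   /* shared handlers for both BARs */
    .write = virtio_pci_config_write,
    .endianness = DEVICE_LITTLE_ENDIAN,
};

static void virtio_pci_map_bars(VirtIOPCIProxy *proxy)
{
    /* Two regions, one ops table: the device model isn't duplicated. */
    memory_region_init_io(&proxy->bar0, &virtio_pci_ops, proxy,
                          "virtio-pci-pio", 64);
    memory_region_init_io(&proxy->bar1, &virtio_pci_ops, proxy,
                          "virtio-pci-mmio", 64);
    pci_register_bar(&proxy->pci_dev, 0, PCI_BASE_ADDRESS_SPACE_IO,
                     &proxy->bar0);
    pci_register_bar(&proxy->pci_dev, 1, PCI_BASE_ADDRESS_SPACE_MEMORY,
                     &proxy->bar1);
}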

Regards,

Anthony Liguori

>
> Cheers,
> Rusty.



