Re: [Qemu-devel] [PATCH v2 13/13] net: Introduce e1000e device emulation
From: Jason Wang
Subject: Re: [Qemu-devel] [PATCH v2 13/13] net: Introduce e1000e device emulation
Date: Thu, 7 Apr 2016 15:24:08 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0
On 04/06/2016 04:22 PM, Dmitry Fleytman wrote:
> Hi Jason,
>
> Please see my comments below.
>
>> On 8 Mar 2016, at 11:31 AM, Jason Wang <address@hidden> wrote:
>>
>>
>>
>> On 02/23/2016 01:37 AM, Leonid Bloch wrote:
>>> From: Dmitry Fleytman <address@hidden>
>>>
>>> This patch introduces emulation for the Intel 82574 adapter, AKA e1000e.
>>>
>>> This implementation is derived from the e1000 emulation code, and
>>> utilizes the TX/RX packet abstractions that were initially developed for
>>> the vmxnet3 device. Although some parts of the introduced code may be
>>> shared with e1000, the differences are substantial enough so that the
>>> only shared resources for the two devices are the definitions in
>>> hw/net/e1000_regs.h.
>>>
>>> Similarly to vmxnet3, the new device uses virtio headers for task
>>> offloads (for backends that support virtio extensions). Usage of
>>> virtio headers may be forcibly disabled via a boolean device property
>>> "vnet" (which is enabled by default). In that case, task offloads
>>> will be performed in software, in the same way as is done for
>>> backends that do not support virtio headers.
>>>
>>> The device code is split into two parts:
>>>
>>> 1. hw/net/e1000e.c: QEMU-specific code for a network device;
>>> 2. hw/net/e1000e_core.[hc]: Device emulation according to the spec.
>>>
>>> The new device name is e1000e.
>>>
>>> Intel specifications for the 82574 controller are available at:
>>> http://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf
[...]
>>> +Device properties:
>>> +
>>> ++----------------+--------------------------------+--------+---------+
>>> +| Property name  | Description                    | Type   | Default |
>>> ++----------------+--------------------------------+--------+---------+
>>> +| subsys_ven     | PCI device Subsystem Vendor ID | UINT16 | 0x8086  |
>>> +|                |                                |        |         |
>>> +| subsys         | PCI device Subsystem ID        | UINT16 | 0       |
>>> +|                |                                |        |         |
>>> +| vnet           | Use virtio headers or perform  | BOOL   | TRUE    |
>>> +|                | SW offloads emulation          |        |         |
>>> ++----------------+--------------------------------+--------+---------+
>>
>> You may use a PropertyInfo, which has a "description" field, to do this
>> better (e.g. qdev_prop_vlan). Then there's no need for this file at all.
>
> We replaced this file with source code comments.
> As for PropertyInfo description I can’t see any way to pass
> description to DEFINE_PROP_* macros. Could you please elaborate on this?
You could use DEFINE_PROP() which can accept a PropertyInfo parameter.
>
>>
>>> diff --git a/hw/net/Makefile.objs b/hw/net/Makefile.objs
>>> index 527d264..4201115 100644
>>> --- a/hw/net/Makefile.objs
>>> +++ b/hw/net/Makefile.objs
[...]
>>
>>> + MemoryRegion io;
>>> + MemoryRegion msix;
>>> +
>>> + uint32_t ioaddr;
>>> +
>>> + uint16_t subsys_ven;
>>> + uint16_t subsys;
>>> +
>>> + uint16_t subsys_ven_used;
>>> + uint16_t subsys_used;
>>> +
>>> + uint32_t intr_state;
>>> + bool use_vnet;
>>> +
>>> + E1000ECore core;
>>
>> What's the advantages of introducing another indirection level here?
>> Looks like it causes lots of extra code overheads.
>
> This was done intentionally.
> Unlike e1000 and e1000e devices which are really different, e1000e and
> newer Intel devices like ixgb* are rather similar. Introduction of
> e1000e core abstraction will simplify possible code reuse in the future.
Ok, I see.
>
>>
>>> +
>>> +} E1000EState;
>>> +
[...]
>>> +static NetClientInfo net_e1000e_info = {
>>> + .type = NET_CLIENT_OPTIONS_KIND_NIC,
>>> + .size = sizeof(NICState),
>>> + .can_receive = e1000e_nc_can_receive,
>>> + .receive = e1000e_nc_receive,
>>> + .receive_iov = e1000e_nc_receive_iov,
>>
>> All three of the above functions have a NetClientState parameter, so
>> there's probably no need for "nc" in their names.
>
> The issue is that there are other functions with similar names but
> without _nc_ in them.
>
Right.
>>
>>> + .link_status_changed = e1000e_set_link_status,
>>> +};
>>> +
[...]
>>> + for (i = 0; i < s->conf.peers.queues; i++) {
>>> + nc = qemu_get_subqueue(s->nic, i);
>>> + if (!nc->peer || !qemu_has_vnet_hdr(nc->peer)) {
>>> + s->core.has_vnet = false;
>>
>> So in fact we're probing for vnet support instead of doing vnet on our
>> own if the backend does not support it. It seems to me that there's no
>> need for the "vnet" property.
>
> This property is intended to forcibly disable virtio header support.
> Possible use cases are verification of the SW offload logic and
> working around theoretical interoperability problems.
Ok, so maybe we'd better rename it to "disable_vnet".
>
>>
>>> + trace_e1000e_cfg_support_virtio(false);
>>> + return;
>>> + }
>>> + }
>>> +
>>> + VMSTATE_UINT8(core.tx[1].sum_needed, E1000EState),
>>> + VMSTATE_UINT8(core.tx[1].ipcss, E1000EState),
>>> + VMSTATE_UINT8(core.tx[1].ipcso, E1000EState),
>>> + VMSTATE_UINT16(core.tx[1].ipcse, E1000EState),
>>> + VMSTATE_UINT8(core.tx[1].tucss, E1000EState),
>>> + VMSTATE_UINT8(core.tx[1].tucso, E1000EState),
>>> + VMSTATE_UINT16(core.tx[1].tucse, E1000EState),
>>> + VMSTATE_UINT8(core.tx[1].hdr_len, E1000EState),
>>> + VMSTATE_UINT16(core.tx[1].mss, E1000EState),
>>> + VMSTATE_UINT32(core.tx[1].paylen, E1000EState),
>>> + VMSTATE_INT8(core.tx[1].ip, E1000EState),
>>> + VMSTATE_INT8(core.tx[1].tcp, E1000EState),
>>> + VMSTATE_BOOL(core.tx[1].tse, E1000EState),
>>> + VMSTATE_BOOL(core.tx[1].cptse, E1000EState),
>>> + VMSTATE_BOOL(core.tx[1].skip_cp, E1000EState),
>>
>> You can move the tx state into another structure and use VMSTATE_ARRAY()
>> here.
>
> Not sure I got your point. Could you please provide more details?
I mean, e.g., move skip_cp into e1000e_txd_props and move tx_pkt out of
e1000_tx. Then you can use VMSTATE_ARRAY to save and load the
e1000e_txd_props array.
>
>>
>>> + VMSTATE_END_OF_LIST()
>>> + }
>>> +};
>>> +
>>> +static Property e1000e_properties[] = {
>>> + DEFINE_NIC_PROPERTIES(E1000EState, conf),
>>> + DEFINE_PROP_BOOL("vnet", E1000EState, use_vnet, true),
>>> + DEFINE_PROP_UINT16("subsys_ven", E1000EState,
>>> + subsys_ven, PCI_VENDOR_ID_INTEL),
>>> + DEFINE_PROP_UINT16("subsys", E1000EState, subsys, 0),
>>
>> I'm not quite sure why we need subsys_ven and subsys here?
>
> Some vendors (like VMware) may replace these device IDs with their
> own. These parameters provide a way to make the device look exactly
> like those. Useful in various V2V scenarios.
I get it.
Thanks
>
>>
>>> + DEFINE_PROP_END_OF_LIST(),
>>> +};
>>> +
[...]
>>> +
>>> +static void
>>> +e1000e_intrmgr_on_throttling_timer(void *opaque)
>>> +{
>>> + E1000IntrDelayTimer *timer = opaque;
>>> +
>>> + assert(!msix_enabled(timer->core->owner));
>>
>> Can this assert be triggered by the guest? If yes, let's remove this.
>
> No, this timer is armed for legacy interrupts only.
> This assertion would mean there is an internal logic error in the device,
> i.e. the device code did not check that we're working with legacy
> interrupts before arming it.
Ok.
>
>>
>>> +
>>> + timer->running = false;
>>> +
>>> + if (!timer->core->itr_intr_pending) {
>>> + trace_e1000e_irq_throttling_no_pending_interrupts();
>>> + return;
>>> + }
>>> +
>>> + if (msi_enabled(timer->core->owner)) {
>>> + trace_e1000e_irq_msi_notify_postponed();
>>> + e1000e_set_interrupt_cause(timer->core, 0);
>>> + } else {
>>> + trace_e1000e_irq_legacy_notify_postponed();
>>> + e1000e_set_interrupt_cause(timer->core, 0);
>>> + }
>>> +}
>>> +
>>> +static void
>>> +e1000e_intrmgr_on_msix_throttling_timer(void *opaque)
>>> +{
>>> + E1000IntrDelayTimer *timer = opaque;
>>> + int idx = timer - &timer->core->eitr[0];
>>> +
>>> + assert(msix_enabled(timer->core->owner));
>>
>> Same question as above.
>
> The same. This timer is for MSI-X only.
Ok.
>
>>
>>> +
>>> + timer->running = false;
>>> +
>>> + if (!timer->core->eitr_intr_pending[idx]) {
>>> + trace_e1000e_irq_throttling_no_pending_vec(idx);
>>> + return;
>>> + }
>>> +
>>> + trace_e1000e_irq_msix_notify_postponed_vec(idx);
>>> + msix_notify(timer->core->owner, idx);
>>> +}
>>> +
>>> +static void
>>> +e1000e_intrmgr_initialize_all_timers(E1000ECore *core, bool create)
>>> +{
>>> + int i;
>>> +
>>> + core->radv.delay_reg = RADV;
>>> + core->rdtr.delay_reg = RDTR;
>>> + core->raid.delay_reg = RAID;
>>> + core->tadv.delay_reg = TADV;
>>> + core->tidv.delay_reg = TIDV;
>>> +
>>> + core->radv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> + core->rdtr.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> + core->raid.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> + core->tadv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> + core->tidv.delay_resolution_ns = E1000_INTR_DELAY_NS_RES;
>>> +
>>> + core->radv.core = core;
>>> + core->rdtr.core = core;
>>> + core->raid.core = core;
>>> + core->tadv.core = core;
>>> + core->tidv.core = core;
>>> +
>>> + core->itr.core = core;
>>
>> Couldn't we simply get the core pointer through container_of()?
>
> Unfortunately not. The timer abstraction functions are not aware of
> which specific timer they're dealing with.
Yes, it is.
>
>>
>>> + core->itr.delay_reg = ITR;
>>> + core->itr.delay_resolution_ns = E1000_INTR_THROTTLING_NS_RES;
>>> +
>>> + for (i = 0; i < E1000E_MSIX_VEC_NUM; i++) {
>>> + core->eitr[i].core = core;
>>> + core->eitr[i].delay_reg = EITR + i;
>>> + core->eitr[i].delay_resolution_ns =
>>> E1000_INTR_THROTTLING_NS_RES;
>>> + }
>>> +
>>> + if (!create) {
>>> + return;
>>> + }
>>> +
>>> + core->radv.timer =
>>> + timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->radv);
>>> + core->rdtr.timer =
>>> + timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->rdtr);
>>> + core->raid.timer =
>>> + timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->raid);
>>> +
>>> + core->tadv.timer =
>>> + timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->tadv);
>>> + core->tidv.timer =
>>> + timer_new_ns(QEMU_CLOCK_VIRTUAL, e1000e_intrmgr_on_timer,
>>> &core->tidv);
>>> +
>>> + core->itr.timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>> + e1000e_intrmgr_on_throttling_timer,
>>> + &core->itr);
>>
>> Should we stop/start the above timers during vm stop/start through a vm
>> state change handler?
>
> Not sure. Is there any reason for this?
Otherwise we may raise an irq while the vm is stopped?
[...]
>>> +
>>> +static inline void
>>> +e1000e_tx_ring_init(E1000ECore *core, E1000E_TxRing *txr, int idx)
>>> +{
>>> + static const E1000E_RingInfo i[E1000E_NUM_QUEUES] = {
>>> + { TDBAH, TDBAL, TDLEN, TDH, TDT, 0 },
>>> + { TDBAH1, TDBAL1, TDLEN1, TDH1, TDT1, 1 }
>>> + };
>>
>> Instead of using a static inside a function, why not just use a global
>> array? Then there's no need for this function and the caller can access
>> it directly.
>
> Anyway, there should be a function to combine the two assignments below.
> And since this array is used by this function only, it is better kept
> hidden.
I mean you can avoid the assignment before each transmission by just
introducing arrays like:
    int tdt[] = { TDT, TDT1 };
and then accessing them through qidx, e.g. core->mac[tdt[qidx]].
>
>>
>>> +
>>> + assert(idx < ARRAY_SIZE(i));
>>> +
>>> + txr->i = &i[idx];
>>> + txr->tx = &core->tx[idx];
>>> +}
>>> +
>>> +typedef struct E1000E_RxRing_st {
>>> + const E1000E_RingInfo *i;
>>> +} E1000E_RxRing;
>>> +
>>> +static inline void
>>> +e1000e_rx_ring_init(E1000ECore *core, E1000E_RxRing *rxr, int idx)
>>> +{
>>> + static const E1000E_RingInfo i[E1000E_NUM_QUEUES] = {
>>> + { RDBAH0, RDBAL0, RDLEN0, RDH0, RDT0, 0 },
>>> + { RDBAH1, RDBAL1, RDLEN1, RDH1, RDT1, 1 }
>>> + };
>>
>> Similar issue with above.
>>
>>> +
>>> + assert(idx < ARRAY_SIZE(i));
>>> +
>>> + rxr->i = &i[idx];
>>> +}
>>> +
[...]
>>> + trace_e1000e_rx_start_recv();
>>> +
>>> + for (i = 0; i <= core->max_queue_num; i++) {
>>> +
>>> qemu_flush_queued_packets(qemu_get_subqueue(core->owner_nic, i));
>>
>> Is this the hardware behavior? I mean, does setting the RDT of queue 0
>> also flush queue 1? It looks like the correct behavior is to calculate
>> the queue index and flush only that queue.
>
> The issue is that there is no direct correspondence between virtio
> queues and device queues.
> There is an RSS mechanism in the middle that may re-route packets from
> virtio queue 0 to device queue 1 and vice versa,
> so we cannot know where each specific packet goes until it is read and
> processed.
Aha, I see.
>
>>
>>> + }
>>> +}
>>> +
>>> +int
>>> +e1000e_can_receive(E1000ECore *core)
>>> +{
>>> + int i;
>>> +
>>> + bool link_up = core->mac[STATUS] & E1000_STATUS_LU;
>>> + bool rx_enabled = core->mac[RCTL] & E1000_RCTL_EN;
>>> + bool pci_master = core->owner->config[PCI_COMMAND] &
>>> PCI_COMMAND_MASTER;
>>
>> Should we flush the queued packets when the guest enables bus
>> mastering (like what we did in e1000_write_config)?
>
> Yes, fixed.
>
>>
>>> +
>>> + if (!link_up || !rx_enabled || !pci_master) {
>>> + trace_e1000e_rx_can_recv_disabled(link_up, rx_enabled,
>>> pci_master);
>>> + return false;
>>> + }
>>> +
>>> + for (i = 0; i < E1000E_NUM_QUEUES; i++) {
>>> + E1000E_RxRing rxr;
>>> +
>>> + e1000e_rx_ring_init(core, &rxr, i);
>>> + if (e1000e_ring_enabled(core, rxr.i) &&
>>> + e1000e_has_rxbufs(core, rxr.i, 1)) {
>>> + trace_e1000e_rx_can_recv();
>>> + return true;
>>
>> Similar issue: this will return true if any one of the queues has an
>> available buffer. This seems wrong.
>
>
> The same as above. This is done due to the RSS mechanism of the device.
Right, this looks ok now, but there's also a case where, e.g., we want to
queue the packet to queue 0 but only queue 1 has an available buffer. May
need to check this in the future.
[...]
>
> Thanks for review,
> Dmitry
>
You're welcome.