Re: [Qemu-devel] [PATCH v3 3/9] rocker: add register programming guide


From: Paolo Bonzini
Subject: Re: [Qemu-devel] [PATCH v3 3/9] rocker: add register programming guide
Date: Mon, 12 Jan 2015 12:40:57 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0


On 11/01/2015 04:57, address@hidden wrote:
> From: Scott Feldman <address@hidden>
> 
> This is the register programming guide for the Rocker device.  It's intended
> for driver writers and device writers.  It covers the device's PCI space,
> the register set, DMA interface, and interrupts.
> 
> Signed-off-by: Scott Feldman <address@hidden>
> Signed-off-by: Jiri Pirko <address@hidden>
> ---
>  hw/net/rocker/reg_guide.txt |  961 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 961 insertions(+)
>  create mode 100644 hw/net/rocker/reg_guide.txt

This should be docs/specs/rocker.txt

> diff --git a/hw/net/rocker/reg_guide.txt b/hw/net/rocker/reg_guide.txt
> new file mode 100644
> index 0000000..3146708
> --- /dev/null
> +++ b/hw/net/rocker/reg_guide.txt
> @@ -0,0 +1,961 @@
> +Rocker Network Switch Register Programming Guide
> +Copyright (c) Scott Feldman <address@hidden>
> +Copyright (c) Neil Horman <address@hidden>
> +Version 0.11, 12/29/2014
> +
> +LICENSE
> +=======
> +
> +This program is free software; you can redistribute it and/or modify
> +it under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 2 of the License, or
> +(at your option) any later version.
> +
> +This program is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +GNU General Public License for more details.
> +
> +SECTION 1: Introduction
> +=======================
> +
> +Overview
> +--------
> +
> +This document describes the hardware/software interface for the Rocker switch
> +device.  The intended audience is authors of OS drivers and device emulation
> +software.
> +
> +Notations and Conventions
> +-------------------------
> +
> +o In register descriptions, [n:m] indicates a range from bit n to bit m,
> +inclusive.
> +o Use of leading 0x indicates a hexadecimal number.
> +o Use of leading 0b indicates a binary number.
> +o The use of RSVD or Reserved indicates that a bit or field is reserved for
> +future use.
> +o Field width is in bytes, unless otherwise noted.
> +o Registers are (R) read-only, (R/W) read/write, (W) write-only, or (COR) clear
> +on read
> +o TLV values in network-byte-order are designated with (N).
> +
> +
> +SECTION 2: PCI Configuration Registers
> +======================================
> +
> +PCI Configuration Space
> +-----------------------
> +
> +Each switch instance registers as a PCI device with PCI configuration space:
> +
> +     offset  width   description             value
> +     ---------------------------------------------
> +     0x0     2       Vendor ID               0x1b36
> +     0x2     2       Device ID               0x0006
> +     0x4     4       Command/Status
> +     0x8     1       Revision ID             0x01
> +     0x9     3       Class code              0x2800
> +     0xC     1       Cache line size
> +     0xD     1       Latency timer
> +     0xE     1       Header type
> +     0xF     1       Built-in self test
> +     0x10    4       Base address low
> +     0x14    4       Base address high
> +     0x18-28         Reserved
> +     0x2C    2       Subsystem vendor ID     0x0000
> +     0x2E    2       Subsystem ID            0x0000

This should not be guaranteed to 0, should it?

> +     0x30-38         Reserved
> +     0x3C    1       Interrupt line
> +     0x3D    1       Interrupt pin           0x00
> +     0x3E    1       Min grant               0x00
> +     0x3F    1       Max latency             0x00
> +     0x40    1       TRDY timeout
> +     0x41    1       Retry count
> +     0x42    2       Reserved
> +
> +
> +SECTION 3: Memory-Mapped Register Space
> +=======================================
> +
> +There are two memory-mapped BARs.  BAR0 maps device register space and is
> +0x2000 in size.  BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
> +size, allowing for 256 MSI-X vectors.  The host BIOS will assign the base
> +address location.  The host driver/OS will map the base address to host memory,
> +giving the driver mmio access to the device register space.

No need for the bits after "The host BIOS..." since that's just normal PCI.

> +All registers are 4 or 8 bytes long.  It is assumed host software will access 4
> +byte registers with one 4-byte access, and 8 byte registers with either two
> +4-byte accesses or a single 8-byte access.  In the case of two 4-byte accesses,
> +access must be lower and then upper 4-bytes, in that order.

Double 4-byte accesses are not implemented, are they?
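
For reference, the split access the guide describes could look roughly like this on the driver side.  This is only a sketch: rocker_write64_split is a hypothetical helper that is not part of the patch, and it assumes a little-endian register layout in which the lower 4 bytes sit at the lower address.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): write one 8-byte register as
 * two 4-byte accesses, lower half first and upper half second, which is
 * the order the guide requires.
 */
void rocker_write64_split(volatile uint32_t *reg, uint64_t val)
{
    /* Assumption: 8-byte registers are 8-byte aligned. */
    assert(((uintptr_t)reg & 0x7) == 0);

    reg[0] = (uint32_t)(val & 0xffffffffu);  /* lower 4 bytes first */
    reg[1] = (uint32_t)(val >> 32);          /* then upper 4 bytes */
}
```

If the device side does not latch the lower half until the upper half arrives, a helper like this cannot work, so the guide and the implementation should agree on it.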

> +Interrupt credits
> +^^^^^^^^^^^^^^^^^
> +
> +MSI-X vectors used for descriptor ring completions use a credit mechanism for
> +efficient device, PCIe bus, OS and driver operations.  Each descriptor ring has
> +a credit count which represents the number of outstanding descriptors to be
> +processed by the driver.  As the device marks descriptors complete, the credit
> +count is incremented.  As the driver processes those outstanding descriptors,
> +it returns credits back to the device.  This way, the device knows the driver's
> +progress and can make decisions about when to fire the next interrupt or not.
> +When the credit count is zero, and the first descriptors are posted for the
> +driver, a single interrupt is fired.  Once the interrupt is fired, the
> +interrupt is disabled (auto-masked).  In response to the interrupt, the driver
> +will process descriptors and PIO write a returned credit value for that
> +descriptor ring.  If the driver returns all credits (the driver caught up with
> +the device and there is no outstanding work), then the interrupt is unmasked,
> +but not fired.  If only partial credits are returned, the interrupt remains
> +masked but the device generates an interrupt, signaling the driver that more
> +outstanding work is available.

Perhaps mention that this masking is unrelated to the MSI-X interrupt
mask register?
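
To make sure I read the credit scheme correctly, here is a small model of it.  All names (ring_credits, ring_complete, ring_return_credits) are made up for illustration, and the "fired" counter exists only so the behavior can be observed; the real device state may well be organized differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-ring credit state, following the quoted paragraph. */
struct ring_credits {
    uint32_t credits;  /* completed descriptors not yet returned by driver */
    bool masked;       /* credit auto-mask, not the MSI-X mask register */
    uint32_t fired;    /* interrupts fired so far (for illustration only) */
};

/* Device side: one more descriptor has been marked complete. */
void ring_complete(struct ring_credits *r)
{
    r->credits++;
    if (!r->masked) {
        /* First completion since the driver caught up: fire once, mask. */
        r->fired++;
        r->masked = true;
    }
}

/* Driver side: return n credits with a PIO write. */
void ring_return_credits(struct ring_credits *r, uint32_t n)
{
    assert(n <= r->credits);
    r->credits -= n;
    if (r->credits == 0) {
        r->masked = false;  /* all caught up: unmask, but do not fire */
    } else {
        r->fired++;         /* partial return: stay masked, fire again */
    }
}
```

With three completions followed by returns of 2 and then 1 credits, this model fires twice and ends unmasked, which is what I would expect from the text.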

> +SECTION 5: Test Registers
> +=========================
> +
> +Rocker switch has several test registers to support troubleshooting register

s/Rocker switch/Rocker/

> +access, interrupt generation, and DMA operations:
> +
> +     TEST_REG, offset 0x0010, 32-bit (R/W)
> +     TEST_REG64, offset 0x0018, 64-bit (R/W)
> +     TEST_IRQ, offset 0x0020, 32-bit (R/W)
> +     TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W)
> +     TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W)
> +     TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W)
> +
> +Reads to TEST_REG and TEST_REG64 will read a value 2x the last value written to
s/2x/equal to twice/

> +the register.  The 32-bit and 64-bit versions are for testing 32-bit and 64-bit
> +host accesses.

Right now, as mentioned above, 64-bit registers must be accessed with a
single 64-bit host access.

In the case of 32-bit host accesses, should TEST_REG64's value be
latched until the upper half is written?  If so, please mention it and
describe that this behavior is shared with the other 64-bit Rocker
registers.
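
A sketch of the read-back behavior as I understand it from the guide, in case it helps when writing a test.  The struct and function names are hypothetical, and the wrap-around on overflow is my assumption since the guide does not say what happens when doubling overflows the register width.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Offsets are from the guide; everything else here is illustrative. */
#define TEST_REG_OFFSET    0x0010
#define TEST_REG64_OFFSET  0x0018

/* Hypothetical device-side storage for the two test registers. */
struct rocker_test_regs {
    uint32_t test_reg;
    uint64_t test_reg64;
};

/* Reads return twice the last value written (assumed to wrap on overflow). */
uint32_t test_reg_read(const struct rocker_test_regs *t)
{
    return t->test_reg * 2u;
}

uint64_t test_reg64_read(const struct rocker_test_regs *t)
{
    return t->test_reg64 * 2u;
}

/* Driver-style check: write a value, then expect to read back 2 * value. */
bool test_reg_check(struct rocker_test_regs *t, uint32_t v)
{
    assert(t);
    t->test_reg = v;
    return test_reg_read(t) == (uint32_t)(v * 2u);
}
```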

> +Bits written to TEST_IRQ will cause same (unmasked) bits to be written to
> +IRQ_STAT and an interrupt generated.  Use IRQ_MASK to mask and unmask
> +particular bits.

It looks like TEST_IRQ actually generates a single interrupt, not
many of them.  So writing 1 sets bit 1 in the PBA, not bit 0.  Writing
3 sets bit 3, not bits 0 and 1.

Please do not use "IRQ_STAT", call it the PBA instead.  Also remove the
reference to IRQ_MASK, it's uninteresting.
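
In other words, the written value seems to be treated as a vector number rather than a bit mask.  A minimal model of that reading follows; the pba struct and both function names are hypothetical, and the 256-vector size is taken from the BAR1 description in the guide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSIX_VECTORS 256  /* BAR1 allows for 256 MSI-X vectors */

/* Hypothetical PBA model: one pending bit per MSI-X vector. */
struct pba {
    uint64_t bits[MSIX_VECTORS / 64];
};

/* Writing N to TEST_IRQ marks vector N pending (N is a number, not a mask). */
void test_irq_write(struct pba *p, uint32_t vector)
{
    assert(vector < MSIX_VECTORS);
    p->bits[vector / 64] |= UINT64_C(1) << (vector % 64);
}

bool pba_pending(const struct pba *p, uint32_t vector)
{
    assert(vector < MSIX_VECTORS);
    return (p->bits[vector / 64] >> (vector % 64)) & 1;
}
```

Under this model, writing 3 leaves vectors 0 and 1 untouched and marks only vector 3.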

> +SECTION 7: Switch Control
> +=========================
> +
> +This section covers switch-wide register settings.
> +
> +Control
> +-------
> +
> +This register is used for low level control of the switch.
> +
> +     CONTROL: offset 0x0300, 32-bit, (W)
> +
> +     bit     name            description
> +     ------------------------------------------------------------------------
> +     [0]     CONTROL_RESET   If set, device will perform reset (same
> +                             as pci reset)

It's not the same as PCI reset, as it will not reset BARs for example.

> +
> +SECTION 8: CPU Packet Processing
> +================================
> +
> +For packets ingressing on switch ports that are not forwarded by the switch but
> +rather directed to the host CPU for further processing are delivered in the
> +DMA RX ring.  Likewise, for host CPU originating packets destined to egress on
> +switch ports onto the network are scheduled by software using the DMA TX ring.

Ingress packets for ports that are not forwarded by the switch are
directed to the host CPU for further processing, and delivered in the
DMA RX ring.  Likewise, the host CPU can use the DMA TX ring to schedule
packets that will egress onto the network.

> +
> +Tx Packet Processing
> +--------------------
> +
> +Software schedules packets for egress on switch ports using the DMA TX ring.  A
> +TX descriptor buffer describes the packet location and size in host DMA-able
> +memory, the destination port, and any hardware-offload functions (such as L3
> +payload checksum offload).  Software then bumps the descriptor head to signal
> +hardware of new Tx work.  In response, hardware will DMA read Tx descriptors up
> +to head, DMA read descriptor buffer and packet data, perform offloading
> +functions, and finally frame packet on wire (network).  Once packet processing
> +is complete, hardware will writeback status to descriptor(s) to signal to
> +software that Tx is complete and software resources (e.g. skb) backing packet
> +can be released.
> +
> +Figure 2 shows an example 3-fragment packet queued with one Tx descriptor.  A
> +TLV is used for each packet fragment.
> +
> +                                                pkt frag 1
> +                                                +–––––––+  +–+
> +                                            +–––+       |    |
> +                              desc buf      |   |       |    |
> +                             +––––––––+     |   |       |    |
> +             Tx ring     +–––+        +–––––+   |       |    |
> +           +–––––––––+   |   |  TLVs  |         +–––––––+    |
> +           |         +–––+   +––––––––+         pkt frag 2   |
> +           | desc 0  |       |        +–––––+   +–––––––+    |
> +           +–––––––––+       |  TLVs  |     +–––+       |    |
> +     head+–+         |       +––––––––+         |       |    |
> +           | desc 1  |       |        +–––––+   +–––––––+    |pkt
> +           +–––––––––+       |  TLVs  |     |                |
> +           |         |       +––––––––+     |   pkt frag 3   |
> +           |         |                      |   +–––––––+    |
> +           +–––––––––+                      +–––+       |    |
> +           |         |                          |       |    |
> +           |         |                          |       |    |
> +           +–––––––––+                          |       |    |
> +           |         |                          |       |    |
> +           |         |                          |       |    |
> +           +–––––––––+                          |       |    |
> +           |         |                          +–––––––+  +–+
> +           |         |
> +           +–––––––––+
> +
> +                             fig 2.
> +
> +The TLVs for Tx descriptor buffer are:
> +
> +     field                   width   description
> +     ---------------------------------------------------------------------
> +     PPORT                   4       Destination physical port #
> +     TX_OFFLOAD              1       Hardware offload modes:
> +                                       0: no offload
> +                                       1: insert IP csum (ipv4 only)
> +                                       2: insert TCP/UDP csum
> +                                       3: L3 csum calc and insert
> +                                          into csum offset (TX_L3_CSUM_OFF)
> +                                          16-bit 1's complement csum value.
> +                                          IPv4 pseudo-header and IP
> +                                          already calculated by OS
> +                                          and inserted.
> +                                       4: TSO (TCP Segmentation Offload)
> +     TX_L3_CSUM_OFF          2       For L3 csum offload mode, the offset,
> +                                     from the beginning of the packet,
> +                                     of the csum field in the L3 header
> +     TX_TSO_MSS              2       For TSO offload mode, the
> +                                     Maximum Segment Size in bytes
> +     TX_TSO_HDR_LEN          2       For TSO offload mode, the
> +                                     length of ethernet, IP, and
> +                                     TCP/UDP headers, including IP
> +                                     and TCP options.
> +     TX_FRAGS                <array> Packet fragments
> +       TX_FRAG               <nest>  Packet fragment
> +         TX_FRAG_ADDR        8       DMA address of packet fragment
> +         TX_FRAG_LEN         2       Packet fragment length
> +
> +Possible status return codes in descriptor on completion are:
> +
> +     DESC_COMP_ERR   reason
> +     --------------------------------------------------------------------
> +     0               OK
> +     ENXIO           address or data read err on desc buf or packet
> +                     fragment

This is more like EFAULT actually.

> +     EINVAL          bad pport or TSO or csum offloading error
> +     ENOMEM          no memory for internal staging tx fragment

QEMU is portable and these values are not, unfortunately.  So please
hardcode them to be 6/22/12 respectively.

Or even better, to avoid the temptation, make them 1/2/3 and create new
constants ROCKER_OK, ROCKER_ERR_FAULT, ROCKER_ERR_INVAL, ROCKER_ERR_NOMEM.

In any case, since you are at it, sort them in either numeric order or
alphabetic order (apart from OK which can remain first).

> +Rx Packet Processing
> +--------------------
> +
> +For packets ingressing on switch ports that are not forwarded by the switch but
> +rather directed to the host CPU for further processing are delivered in the
> +DMA RX ring.  Rx descriptor buffers are allocated by software and placed on the
> +ring.  Hardware will fill Rx descriptor buffers with packet data, write the
> +completion, and signal to software that a new packet is ready.  Since Rx packet
> +size is not known a-priori, the Rx descriptor buffer must be allocated for
> +worst-case packet size.  A single Rx descriptor will contain the entire Rx
> +packet data in one RX_PACKET TLV.  Other Rx TLVs describe any hardware offloads
> +performed on the packet, such as checksum validation.
> +
> +The TLVs for Rx descriptor buffer are:
> +
> +     field           width   description
> +     ---------------------------------------------------
> +     PPORT           4       Source physical port #
> +     RX_FLAGS        2       Packet parsing flags:
> +                               (1 << 0): IPv4 packet
> +                               (1 << 1): IPv6 packet
> +                               (1 << 2): csum calculated
> +                               (1 << 3): IPv4 csum good
> +                               (1 << 4): IP fragment
> +                               (1 << 5): TCP packet
> +                               (1 << 6): UDP packet
> +                               (1 << 7): TCP/UDP csum good
> +     RX_CSUM         2       IP calculated checksum:
> +                               IPv4: IP payload csum
> +                               IPv6: header and payload csum
> +                             (Only valid if RX_FLAGS:csum calc is set)
> +     RX_PACKET (N)   <var>   Packet data
> +
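
As an aside, this is how I would expect a driver to consume RX_FLAGS.  The bit positions come from the table above, but the macro names and the "trust the checksum only if every applicable good bit is set" policy are my own assumptions, not something the guide states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions are from the RX_FLAGS table; the macro names are made up. */
#define RX_FLAGS_IPV4               (1u << 0)
#define RX_FLAGS_IPV6               (1u << 1)
#define RX_FLAGS_CSUM_CALC          (1u << 2)
#define RX_FLAGS_IPV4_CSUM_GOOD     (1u << 3)
#define RX_FLAGS_IP_FRAG            (1u << 4)
#define RX_FLAGS_TCP                (1u << 5)
#define RX_FLAGS_UDP                (1u << 6)
#define RX_FLAGS_TCP_UDP_CSUM_GOOD  (1u << 7)

static_assert(RX_FLAGS_TCP_UDP_CSUM_GOOD <= UINT16_MAX,
              "RX_FLAGS is a 2-byte field");

/* RX_CSUM is only meaningful when the device says it calculated one. */
bool rx_csum_valid(uint16_t flags)
{
    return (flags & RX_FLAGS_CSUM_CALC) != 0;
}

/* Assumed policy: every checksum that applies to the packet must be good. */
bool rx_csum_verified(uint16_t flags)
{
    if (!(flags & RX_FLAGS_CSUM_CALC)) {
        return false;
    }
    if ((flags & RX_FLAGS_IPV4) && !(flags & RX_FLAGS_IPV4_CSUM_GOOD)) {
        return false;
    }
    if ((flags & (RX_FLAGS_TCP | RX_FLAGS_UDP)) &&
        !(flags & RX_FLAGS_TCP_UDP_CSUM_GOOD)) {
        return false;
    }
    return true;
}
```
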
> +Possible status return codes in descriptor on completion are:
> +
> +     DESC_COMP_ERR   reason
> +     --------------------------------------------------------------------
> +     0               OK
> +     ENXIO           address or data read err on desc buf
> +     ENOMEM          no memory for internal staging desc buf
> +     EMSGSIZE        Rx descriptor buffer wasn't big enough to contain
> +                     packet data TLV and other TLVs.

In fact, EMSGSIZE doesn't even exist on Windows.  So make this
ROCKER_ERR_MSGSIZE==4.
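
Something along these lines is what I have in mind for the device-private codes.  The constant names and values are the ones suggested in this mail; rocker_strerror is a hypothetical helper added only to show the mapping back to the "reason" column, and where the definitions should live is up to you.

```c
#include <assert.h>

/* Device-defined completion codes, independent of the host's errno values. */
enum rocker_err {
    ROCKER_OK          = 0,
    ROCKER_ERR_FAULT   = 1,  /* address or data read error */
    ROCKER_ERR_INVAL   = 2,  /* bad pport, TSO or csum offload error */
    ROCKER_ERR_NOMEM   = 3,  /* no memory for internal staging buffer */
    ROCKER_ERR_MSGSIZE = 4,  /* Rx descriptor buffer too small */
};

static_assert(ROCKER_OK == 0, "drivers test DESC_COMP_ERR against zero");

/* Hypothetical helper mapping a code to a short description. */
const char *rocker_strerror(enum rocker_err e)
{
    switch (e) {
    case ROCKER_OK:          return "OK";
    case ROCKER_ERR_FAULT:   return "address or data read error";
    case ROCKER_ERR_INVAL:   return "invalid parameter";
    case ROCKER_ERR_NOMEM:   return "no memory for staging buffer";
    case ROCKER_ERR_MSGSIZE: return "descriptor buffer too small";
    default:                 return "unknown";
    }
}
```

Because the numbers are fixed by the spec, a guest driver on any OS can decode them without relying on the host's errno.h.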


> +     field                   width   description
> +     ----------------------------------------------------
> +     OF_DPA_CMD              2       CMD_[ADD|MOD]
> +     OF_DPA_TBL              2       Flow table ID
> +                                       0: ingress port
> +                                       10: vlan
> +                                       20: termination mac
> +                                       30: unicast routing
> +                                       40: multicast routing
> +                                       50: bridging
> +                                       60: ACL policy

Decimal, I guess.  Better mention it, if only for completeness.

> +Possible status return codes in descriptor on completion are:
> +
> +     DESC_COMP_ERR   command                 reason
> +     --------------------------------------------------------------------
> +     0               all                     OK
> +     EFAULT          all                     head or tail index outside
> +                                             of ring
> +     ENXIO           all                     address or data read err on
> +                                             desc buf
> +     ENOSPC          GET_STATS               cmd descriptor buffer wasn't
> +                                             big enough to contain write-back
> +                                             TLVs
> +     EINVAL          ADD|MOD                 invalid parameters passed in
> +     EEXIST          ADD                     entry already exists
> +     ENOSPC          ADD                     no space left in flow table
> +     ENOENT          MOD|DEL|GET_STATS       group ID invalid
> +     EBUSY           DEL                     group reference count non-zero
> +     ENODEV          ADD                     next group ID doesn't exist

Same as above, please add decimal values instead of overloading errno.

Paolo


