Re: [Qemu-devel] [PATCH v3 3/9] rocker: add register programming guide


From: Scott Feldman
Subject: Re: [Qemu-devel] [PATCH v3 3/9] rocker: add register programming guide
Date: Fri, 16 Jan 2015 00:14:05 -0800

On Mon, Jan 12, 2015 at 3:40 AM, Paolo Bonzini <address@hidden> wrote:
> On 11/01/2015 04:57, address@hidden wrote:
>> +PCI Configuration Space
>> +-----------------------
>> +
>> +Each switch instance registers as a PCI device with PCI configuration space:
>> +
>> +     offset  width   description             value
>> +     ---------------------------------------------
>> +     0x0     2       Vendor ID               0x1b36
>> +     0x2     2       Device ID               0x0006
>> +     0x4     4       Command/Status
>> +     0x8     1       Revision ID             0x01
>> +     0x9     3       Class code              0x2800
>> +     0xC     1       Cache line size
>> +     0xD     1       Latency timer
>> +     0xE     1       Header type
>> +     0xF     1       Built-in self test
>> +     0x10    4       Base address low
>> +     0x14    4       Base address high
>> +     0x18-28         Reserved
>> +     0x2C    2       Subsystem vendor ID     0x0000
>> +     0x2E    2       Subsystem ID            0x0000
>
> This should not be guaranteed to 0, should it?

You're right.  Added a note that the subsystem implementation will fill this in.

>
>> +     0x30-38         Reserved
>> +     0x3C    1       Interrupt line
>> +     0x3D    1       Interrupt pin           0x00
>> +     0x3E    1       Min grant               0x00
>> +     0x3F    1       Max latency             0x00
>> +     0x40    1       TRDY timeout
>> +     0x41    1       Retry count
>> +     0x42    2       Reserved
>> +
>> +
>> +SECTION 3: Memory-Mapped Register Space
>> +=======================================
>> +
>> +There are two memory-mapped BARs.  BAR0 maps device register space and is
>> +0x2000 in size.  BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
>> +size, allowing for 256 MSI-X vectors.  The host BIOS will assign the base
>> +address location.  The host driver/OS will map the base address to host memory,
>> +giving the driver mmio access to the device register space.
>
> No need for the bits after "The host BIOS..." since that's just normal PCI.

Gone.

>> +All registers are 4 or 8 bytes long.  It is assumed host software will access 4
>> +byte registers with one 4-byte access, and 8 byte registers with either two
>> +4-byte accesses or a single 8-byte access.  In the case of two 4-byte accesses,
>> +access must be lower and then upper 4-bytes, in that order.
>
> Double 4-byte accesses are not implemented, are they?

They are now :)  Tested on i386.  I'll include changes with v4.
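
For readers following along, here is a minimal sketch of how the
lower-then-upper ordering could be modelled.  It assumes the lower half is
latched and the register is only updated when the upper half arrives; the
thread does not confirm that detail, and the struct and function names are
hypothetical, not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical model of one 64-bit register (names are illustrative). */
struct reg64 {
    uint64_t value;        /* committed register value */
    uint32_t lower_latch;  /* lower half, held until the upper half arrives */
};

/* First access: the lower 4 bytes are latched, not yet committed
 * (latching is an assumption, see the lead-in). */
static void reg64_write_lo(struct reg64 *r, uint32_t lo)
{
    r->lower_latch = lo;
}

/* Second access: the upper 4 bytes commit both halves at once. */
static void reg64_write_hi(struct reg64 *r, uint32_t hi)
{
    r->value = ((uint64_t)hi << 32) | r->lower_latch;
}

/* A single 8-byte access commits directly. */
static void reg64_write64(struct reg64 *r, uint64_t v)
{
    r->value = v;
}
```

With this model, software never observes a half-updated 64-bit value, which
is one reason a fixed lower-then-upper order is convenient.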

>> +Interrupt credits
>> +^^^^^^^^^^^^^^^^^
>> +
>> +MSI-X vectors used for descriptor ring completions use a credit mechanism for
>> +efficient device, PCIe bus, OS and driver operations.  Each descriptor ring has
>> +a credit count which represents the number of outstanding descriptors to be
>> +processed by the driver.  As the device marks descriptors complete, the credit
>> +count is incremented.  As the driver processes those outstanding descriptors,
>> +it returns credits back to the device.  This way, the device knows the driver's
>> +progress and can make decisions about when to fire the next interrupt or not.
>> +When the credit count is zero, and the first descriptors are posted for the
>> +driver, a single interrupt is fired.  Once the interrupt is fired, the
>> +interrupt is disabled (auto-masked).  In response to the interrupt, the driver
>> +will process descriptors and PIO write a returned credit value for that
>> +descriptor ring.  If the driver returns all credits (the driver caught up with
>> +the device and there is no outstanding work), then the interrupt is unmasked,
>> +but not fired.  If only partial credits are returned, the interrupt remains
>> +masked but the device generates an interrupt, signaling the driver that more
>> +outstanding work is available.
>
> Perhaps mention that this masking is unrelated to the MSI-X interrupt
> mask register?

Done.
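
As an illustration of the credit rules in the quoted text, a rough
device-side sketch follows.  The struct, the function names, and the clamping
of over-returned credits are all assumptions made for the example; only the
fire/mask/unmask rules come from the quoted paragraph.  The "masked" flag
here is the credit auto-mask, not the MSI-X mask bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-ring interrupt state. */
struct ring_irq {
    uint32_t credits;  /* completed descriptors not yet returned by driver */
    bool masked;       /* credit auto-mask, unrelated to the MSI-X mask */
    unsigned fired;    /* interrupts raised so far, for illustration */
};

/* Device marks one descriptor complete. */
static void ring_complete(struct ring_irq *r)
{
    r->credits++;
    if (!r->masked) {
        r->fired++;        /* single interrupt for the first completion */
        r->masked = true;  /* auto-mask until the driver catches up */
    }
}

/* Driver returns n credits via a PIO write. */
static void ring_return_credits(struct ring_irq *r, uint32_t n)
{
    if (n > r->credits) {
        n = r->credits;    /* clamping is an assumption of this sketch */
    }
    r->credits -= n;
    if (r->credits == 0) {
        r->masked = false; /* all caught up: unmask, do not fire */
    } else {
        r->fired++;        /* partial return: stay masked, fire again */
    }
}
```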

>> +SECTION 5: Test Registers
>> +=========================
>> +
>> +Rocker switch has several test registers to support troubleshooting register
>
> s/Rocker switch/Rocker/

Done.

>> +access, interrupt generation, and DMA operations:
>> +
>> +     TEST_REG, offset 0x0010, 32-bit (R/W)
>> +     TEST_REG64, offset 0x0018, 64-bit (R/W)
>> +     TEST_IRQ, offset 0x0020, 32-bit (R/W)
>> +     TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W)
>> +     TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W)
>> +     TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W)
>> +
>> +Reads to TEST_REG and TEST_REG64 will read a value 2x the last value written to
>
> s/2x/equal to twice/

Done.

>> +the register.  The 32-bit and 64-bit versions are for testing 32-bit and 64-bit
>> +host accesses.
>
> Right now, as mentioned above, 64-bit registers must be accessed with a
> single 32-bit host access.

Fixed in implementation.
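
To make the "read returns twice the last value written" rule concrete, here
is a small model of the two test registers.  The names are hypothetical, and
wrapping on overflow (modulo 2^32 or 2^64) is an assumption; the quoted text
does not say what happens when the doubled value does not fit.

```c
#include <stdint.h>

/* Hypothetical backing store for TEST_REG and TEST_REG64. */
struct test_regs {
    uint32_t reg;
    uint64_t reg64;
};

static void test_reg_write(struct test_regs *t, uint32_t v)
{
    t->reg = v;
}

static void test_reg64_write(struct test_regs *t, uint64_t v)
{
    t->reg64 = v;
}

/* Reads return twice the last value written (wraps on overflow here). */
static uint32_t test_reg_read(const struct test_regs *t)
{
    return (uint32_t)(t->reg * 2u);
}

static uint64_t test_reg64_read(const struct test_regs *t)
{
    return t->reg64 * 2u;
}
```

A driver self-test can then write a known pattern and check that the value
read back is exactly double, which catches byte-swapped or truncated
accesses that a plain read-back of the same value might miss.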

> In the case of 32-bit host accesses, should TEST_REG64's value be
> latched until the upper half is written?  If so, please mention it and
> describe that this behavior is shared with the other 64-bit Rocker
> registers.
>
>> +Bits written to TEST_IRQ will cause same (unmasked) bits to be written to
>> +IRQ_STAT and an interrupt generated.  Use IRQ_MASK to mask and unmask
>> +particular bits.
>
> It looks like actually TEST_IRQ will generate a single interrupt, not
> many of them.  So writing 1 sets bits 1 in the PBA, not bit 0.  Writing
> 3 sets bits 3, not bits 0 and 1.

Good catch...updated doc.
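
A sketch of the behaviour Paolo describes, where the value written to
TEST_IRQ is a vector number rather than a bit mask.  The PBA model and
function names are hypothetical, the 256-vector size comes from the BAR1
description quoted earlier, and ignoring out-of-range values is an
assumption.  Real MSI-X hardware only sets a pending bit while the vector is
masked; that detail is left out here.

```c
#include <stdint.h>

#define NUM_VECTORS 256  /* from the BAR1 description quoted earlier */

/* Hypothetical PBA model: one pending bit per MSI-X vector. */
struct pba {
    uint64_t bits[NUM_VECTORS / 64];
};

/* A TEST_IRQ write names exactly one vector.  Returns -1 for an
 * out-of-range vector (an assumption, not documented behaviour). */
static int test_irq_write(struct pba *p, uint32_t vector)
{
    if (vector >= NUM_VECTORS) {
        return -1;
    }
    p->bits[vector / 64] |= UINT64_C(1) << (vector % 64);
    return 0;
}

/* Returns 1 if the given vector is pending, 0 otherwise. */
static int pba_pending(const struct pba *p, uint32_t vector)
{
    return (int)((p->bits[vector / 64] >> (vector % 64)) & 1);
}
```

So writing 3 marks vector 3 only; vectors 0 and 1 stay clear.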

> Please do not use "IRQ_STAT", call it the PBA instead.  Also remove the
> reference to IRQ_MASK, it's uninteresting.
>
>> +SECTION 7: Switch Control
>> +=========================
>> +
>> +This section covers switch-wide register settings.
>> +
>> +Control
>> +-------
>> +
>> +This register is used for low level control of the switch.
>> +
>> +     CONTROL: offset 0x0300, 32-bit, (W)
>> +
>> +     bit     name            description
>> +     ------------------------------------------------------------------------
>> +     [0]     CONTROL_RESET   If set, device will perform reset (same
>> +                             as pci reset)
>
> It's not the same as PCI reset, as it will not reset BARs for example.

Fixed.

>> +
>> +SECTION 8: CPU Packet Processing
>> +================================
>> +
>> +For packets ingressing on switch ports that are not forwarded by the switch but
>> +rather directed to the host CPU for further processing are delivered in the
>> +DMA RX ring.  Likewise, for host CPU originating packets destined to egress on
>> +switch ports onto the network are scheduled by software using the DMA TX ring.
>
> Ingress packets for ports that are not forwarded by the switch are
> directed to the host CPU for further processing, and delivered in the
> DMA RX ring.  Likewise, the host CPU can use the DMA TX ring to schedule
> packets that will egress onto the network.

Fixed by simplifying.

>> +
>> +Tx Packet Processing
>> +--------------------
>> +
>> +Software schedules packets for egress on switch ports using the DMA TX ring.  A
>> +TX descriptor buffer describes the packet location and size in host DMA-able
>> +memory, the destination port, and any hardware-offload functions (such as L3
>> +payload checksum offload).  Software then bumps the descriptor head to signal
>> +hardware of new Tx work.  In response, hardware will DMA read Tx descriptors up
>> +to head, DMA read descriptor buffer and packet data, perform offloading
>> +functions, and finally frame packet on wire (network).  Once packet processing
>> +is complete, hardware will writeback status to descriptor(s) to signal to
>> +software that Tx is complete and software resources (e.g. skb) backing packet
>> +can be released.
>> +
>> +Figure 2 shows an example 3-fragment packet queued with one Tx descriptor.  A
>> +TLV is used for each packet fragment.
>> +
>> +                                                pkt frag 1
>> +                                                +–––––––+  +–+
>> +                                            +–––+       |    |
>> +                              desc buf      |   |       |    |
>> +                             +––––––––+     |   |       |    |
>> +             Tx ring     +–––+        +–––––+   |       |    |
>> +           +–––––––––+   |   |  TLVs  |         +–––––––+    |
>> +           |         +–––+   +––––––––+         pkt frag 2   |
>> +           | desc 0  |       |        +–––––+   +–––––––+    |
>> +           +–––––––––+       |  TLVs  |     +–––+       |    |
>> +     head+–+         |       +––––––––+         |       |    |
>> +           | desc 1  |       |        +–––––+   +–––––––+    |pkt
>> +           +–––––––––+       |  TLVs  |     |                |
>> +           |         |       +––––––––+     |   pkt frag 3   |
>> +           |         |                      |   +–––––––+    |
>> +           +–––––––––+                      +–––+       |    |
>> +           |         |                          |       |    |
>> +           |         |                          |       |    |
>> +           +–––––––––+                          |       |    |
>> +           |         |                          |       |    |
>> +           |         |                          |       |    |
>> +           +–––––––––+                          |       |    |
>> +           |         |                          +–––––––+  +–+
>> +           |         |
>> +           +–––––––––+
>> +
>> +                             fig 2.
>> +
>> +The TLVs for Tx descriptor buffer are:
>> +
>> +     field                   width   description
>> +     ---------------------------------------------------------------------
>> +     PPORT                   4       Destination physical port #
>> +     TX_OFFLOAD              1       Hardware offload modes:
>> +                                       0: no offload
>> +                                       1: insert IP csum (ipv4 only)
>> +                                       2: insert TCP/UDP csum
>> +                                       3: L3 csum calc and insert
>> +                                          into csum offset (TX_L3_CSUM_OFF)
>> +                                          16-bit 1's complement csum value.
>> +                                          IPv4 pseudo-header and IP
>> +                                          already calculated by OS
>> +                                          and inserted.
>> +                                       4: TSO (TCP Segmentation Offload)
>> +     TX_L3_CSUM_OFF          2       For L3 csum offload mode, the offset,
>> +                                     from the beginning of the packet,
>> +                                     of the csum field in the L3 header
>> +     TX_TSO_MSS              2       For TSO offload mode, the
>> +                                     Maximum Segment Size in bytes
>> +     TX_TSO_HDR_LEN          2       For TSO offload mode, the
>> +                                     length of ethernet, IP, and
>> +                                     TCP/UDP headers, including IP
>> +                                     and TCP options.
>> +     TX_FRAGS                <array> Packet fragments
>> +       TX_FRAG               <nest>  Packet fragment
>> +         TX_FRAG_ADDR        8       DMA address of packet fragment
>> +         TX_FRAG_LEN         2       Packet fragment length
>> +
>> +Possible status return codes in descriptor on completion are:
>> +
>> +     DESC_COMP_ERR   reason
>> +     --------------------------------------------------------------------
>> +     0               OK
>> +     ENXIO           address or data read err on desc buf or packet
>> +                     fragment
>
> This is more like EFAULT actually.
>
>> +     EINVAL          bad pport or TSO or csum offloading error
>> +     ENOMEM          no memory for internal staging tx fragment
>
> QEMU is portable and these values are not, unfortunately.  So please
> hardcode them to be 6/22/12 respectively.
>
> Or even better, to avoid the temptation, make them 1/2/3 and create new
> constants ROCKER_OK, ROCKER_ERR_FAULT, ROCKER_ERR_INVAL, ROCKER_ERR_NOMEM.

Since the Linux driver is already out there in 3.18, we're stuck with the
values defined in errno.h for x86_64.  But no problem: I've hard-coded
those values for ROCKER_EINVAL, ROCKER_ENOMEM, etc.  I'll switch the
Linux driver over to these constants when it's touched again.
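
As a sketch of what such hard-coded constants could look like: the numeric
values below are the Linux x86_64 errno.h values for the codes that appear
in the quoted tables, sorted numerically.  Only ROCKER_EINVAL and
ROCKER_ENOMEM are named in this mail; the other ROCKER_* names, the enum
itself, and the lookup helper are assumptions following the same pattern.

```c
#include <stddef.h>

/* Device-visible status codes, fixed regardless of the host's errno.h.
 * Values mirror Linux x86_64 errno numbers; names other than
 * ROCKER_EINVAL and ROCKER_ENOMEM are assumed for illustration. */
enum rocker_err {
    ROCKER_OK       = 0,
    ROCKER_ENOENT   = 2,
    ROCKER_ENXIO    = 6,
    ROCKER_ENOMEM   = 12,
    ROCKER_EFAULT   = 14,
    ROCKER_EBUSY    = 16,
    ROCKER_EEXIST   = 17,
    ROCKER_ENODEV   = 19,
    ROCKER_EINVAL   = 22,
    ROCKER_ENOSPC   = 28,
    ROCKER_EMSGSIZE = 90,
};

/* Hypothetical helper: map a completion status to its name, or NULL. */
static const char *rocker_err_name(int err)
{
    switch (err) {
    case ROCKER_OK:       return "ROCKER_OK";
    case ROCKER_ENOENT:   return "ROCKER_ENOENT";
    case ROCKER_ENXIO:    return "ROCKER_ENXIO";
    case ROCKER_ENOMEM:   return "ROCKER_ENOMEM";
    case ROCKER_EFAULT:   return "ROCKER_EFAULT";
    case ROCKER_EBUSY:    return "ROCKER_EBUSY";
    case ROCKER_EEXIST:   return "ROCKER_EEXIST";
    case ROCKER_ENODEV:   return "ROCKER_ENODEV";
    case ROCKER_EINVAL:   return "ROCKER_EINVAL";
    case ROCKER_ENOSPC:   return "ROCKER_ENOSPC";
    case ROCKER_EMSGSIZE: return "ROCKER_EMSGSIZE";
    default:              return NULL;
    }
}
```

Because the values are literals, the device model reports the same numbers
on hosts whose errno.h differs or lacks a given code.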

> In any case, since you are at it, sort them in either numeric order or
> alphabetic order (apart from OK which can remain first).
>
>> +Rx Packet Processing
>> +--------------------
>> +
>> +For packets ingressing on switch ports that are not forwarded by the switch but
>> +rather directed to the host CPU for further processing are delivered in the
>> +DMA RX ring.  Rx descriptor buffers are allocated by software and placed on the
>> +ring.  Hardware will fill Rx descriptor buffers with packet data, write the
>> +completion, and signal to software that a new packet is ready.  Since Rx packet
>> +size is not known a-priori, the Rx descriptor buffer must be allocated for
>> +worst-case packet size.  A single Rx descriptor will contain the entire Rx
>> +packet data in one RX_PACKET TLV.  Other Rx TLVs describe any hardware offloads
>> +performed on the packet, such as checksum validation.
>> +
>> +The TLVs for Rx descriptor buffer are:
>> +
>> +     field           width   description
>> +     ---------------------------------------------------
>> +     PPORT           4       Source physical port #
>> +     RX_FLAGS        2       Packet parsing flags:
>> +                               (1 << 0): IPv4 packet
>> +                               (1 << 1): IPv6 packet
>> +                               (1 << 2): csum calculated
>> +                               (1 << 3): IPv4 csum good
>> +                               (1 << 4): IP fragment
>> +                               (1 << 5): TCP packet
>> +                               (1 << 6): UDP packet
>> +                               (1 << 7): TCP/UDP csum good
>> +     RX_CSUM         2       IP calculated checksum:
>> +                               IPv4: IP payload csum
>> +                               IPv6: header and payload csum
>> +                             (Only valid if RX_FLAGS:csum calc is set)
>> +     RX_PACKET (N)   <var>   Packet data
>> +
>> +Possible status return codes in descriptor on completion are:
>> +
>> +     DESC_COMP_ERR   reason
>> +     --------------------------------------------------------------------
>> +     0               OK
>> +     ENXIO           address or data read err on desc buf
>> +     ENOMEM          no memory for internal staging desc buf
>> +     EMSGSIZE        Rx descriptor buffer wasn't big enough to contain
>> +                     packet data TLV and other TLVs.
>
> EMSGSIZE in fact doesn't exist on Windows even.  So make this
> ROCKER_ERR_MSGSIZE==4.
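
For reference, the RX_FLAGS bits from the quoted Rx TLV table can be written
as C constants.  The constant names and the helper are hypothetical, and the
helper's policy (trust the L4 checksum only for non-fragmented TCP or UDP
packets flagged as good) is an assumption about how a driver might use the
flags, not something the quoted text specifies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as listed in the quoted RX_FLAGS table. */
enum {
    RX_FLAG_IPV4             = 1 << 0,
    RX_FLAG_IPV6             = 1 << 1,
    RX_FLAG_CSUM_CALC        = 1 << 2,
    RX_FLAG_IPV4_CSUM_GOOD   = 1 << 3,
    RX_FLAG_IP_FRAG          = 1 << 4,
    RX_FLAG_TCP              = 1 << 5,
    RX_FLAG_UDP              = 1 << 6,
    RX_FLAG_TCP_UDP_CSUM_GOOD = 1 << 7,
};

/* Assumed driver policy: skip software L4 checksum verification only
 * for unfragmented TCP/UDP packets the device flagged as good. */
static bool rx_l4_csum_verified(uint16_t flags)
{
    if (flags & RX_FLAG_IP_FRAG) {
        return false;
    }
    if (!(flags & (RX_FLAG_TCP | RX_FLAG_UDP))) {
        return false;
    }
    return (flags & RX_FLAG_TCP_UDP_CSUM_GOOD) != 0;
}
```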
>
>
>> +     field                   width   description
>> +     ----------------------------------------------------
>> +     OF_DPA_CMD              2       CMD_[ADD|MOD]
>> +     OF_DPA_TBL              2       Flow table ID
>> +                                       0: ingress port
>> +                                       10: vlan
>> +                                       20: termination mac
>> +                                       30: unicast routing
>> +                                       40: multicast routing
>> +                                       50: bridging
>> +                                       60: ACL policy
>
> Decimal, I guess.  Better mention it, if only for completeness.
>
>> +Possible status return codes in descriptor on completion are:
>> +
>> +     DESC_COMP_ERR   command                 reason
>> +     --------------------------------------------------------------------
>> +     0               all                     OK
>> +     EFAULT          all                     head or tail index outside
>> +                                             of ring
>> +     ENXIO           all                     address or data read err on
>> +                                             desc buf
>> +     ENOSPC          GET_STATS               cmd descriptor buffer wasn't
>> +                                             big enough to contain write-back
>> +                                             TLVs
>> +     EINVAL          ADD|MOD                 invalid parameters passed in
>> +     EEXIST          ADD                     entry already exists
>> +     ENOSPC          ADD                     no space left in flow table
>> +     ENOENT          MOD|DEL|GET_STATS       group ID invalid
>> +     EBUSY           DEL                     group reference count non-zero
>> +     ENODEV          ADD                     next group ID doesn't exist
>
> Same as above, please add decimal values instead of overloading errno.

Updated doc with new ROCKER_Exxx return codes.

>
> Paolo


