[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tc

From: Wei Xu
Subject: Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
Date: Fri, 18 Mar 2016 12:17:49 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0

On 2016年03月18日 10:03, Jason Wang wrote:

On 03/18/2016 12:45 AM, Wei Xu wrote:
On 2016年03月17日 16:42, Jason Wang wrote:

On 03/15/2016 05:17 PM, address@hidden wrote:
From: Wei Xu <address@hidden>

All the data packets in a tcp connection will be cached to a big buffer
in every receive interval, and will be sent out via a timer, the
'virtio_net_rsc_timeout' controls the interval, the value will
influent the
performance and response of tcp connection extremely, 50000(50us) is a
experience value to gain a performance improvement, since the whql test
sends packets every 100us, so '300000(300us)' can pass the test case,
this is also the default value, it's gonna to be tunable.

The timer will only be triggered if the packets pool is not empty,
and it'll drain off all the cached packets

'NetRscChain' is used to save the segments of different protocols in a
VirtIONet device.

The main handler of TCP includes TCP window update, duplicated ACK
and the real data coalescing if the new segment passed sanity check
and is identified as an 'wanted' one.

An 'wanted' segment means:
1. Segment is within current window and the sequence is the expected
2. ACK of the segment is in the valid window.
3. If the ACK in the segment is a duplicated one, then it must less
than 2,
     this is to notify upper layer TCP starting retransmission due to
the spec.

Sanity check includes:
1. Incorrect version in IP header
2. IP options & IP fragment
3. Not a TCP packets
4. Sanity size check to prevent buffer overflow attack.

There maybe more cases should be considered such as ip
identification other
flags, while it broke the test because windows set it to the same
even it's
not a fragment.

Normally it includes 2 typical ways to handle a TCP control flag,
and 'finalize', 'bypass' means should be sent out directly, and
means the packets should also be bypassed, and this should be done
after searching for the same connection packets in the pool and sending
all of them out, this is to avoid out of data.

All the 'SYN' packets will be bypassed since this always begin a new'
connection, other flags such 'FIN/RST' will trigger a finalization,
this normally happens upon a connection is going to be closed, an
'URG' packet
also finalize current coalescing unit while there maybe protocol
difference to
different OS.
But URG packet should be sent as quickly as possible regardless of
ordering, no?
Yes, you right, URG will terminate the current 'SCU', i'll amend the
commit log.

Statistics can be used to monitor the basic coalescing status, the
'out of order'
and 'out of window' means how many retransmitting packets, thus
describe the
performance intuitively.

Signed-off-by: Wei Xu <address@hidden>
   hw/net/virtio-net.c            | 486
   include/hw/virtio/virtio-net.h |   1 +
   include/hw/virtio/virtio.h     |  72 ++++++
   3 files changed, 558 insertions(+), 1 deletion(-)

+        } else {
+            /* Coalesce window update */
+            o_tcp->th_win = n_tcp->th_win;
+            chain->stat.win_update++;
+            return RSC_COALESCE;
+        }
+    } else {
+        /* pure ack, update ack */
+        o_tcp->th_ack = n_tcp->th_ack;
+        chain->stat.pure_ack++;
+        return RSC_COALESCE;
Looks like there're something I missed. The spec said:

"In other words, any pure ACK that is not a duplicate ACK or a window
update triggers an exception and must not be coalesced. All such pure
ACKs must be indicated as individual segments."

Does it mean we *should not* coalesce windows update and pure ack?
(Since it can wakeup transmission)?
It's also a little bit inexplicit and flexible due to the spec, please
see the flowchart I on the same page.

Comments about the  flowchart:
The first of the following two flowcharts describes the rules for
coalescing segments and updating the TCP headers.
This flowchart refers to mechanisms for distinguishing valid duplicate
ACKs and window updates. The second flowchart describes these mechanisms.
As show in the flowchart, only status 'C' will break current scu and
get finalized, both 'A' and 'B' can be coalesced afaik.

Interesting, looks like you're right.

+    }
+static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
NetRscSeg *seg,
+                                    const uint8_t *buf, NetRscUnit
+    void *data;
+    uint16_t o_ip_len;
+    uint32_t nseq, oseq;
+    NetRscUnit *o_unit;
+    o_unit = &seg->unit;
+    o_ip_len = htons(*o_unit->ip_plen);
+    nseq = htonl(n_unit->tcp->th_seq);
+    oseq = htonl(o_unit->tcp->th_seq);
+    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
+        /* Log this only for debugging observation */
+        chain->stat.tcp_option++;
+    }
+    /* Ignore packet with more/larger tcp options */
+    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ?
do you mean '<'? that also means some option changed, maybe i should
include it.

+        chain->stat.tcp_larger_option++;
+        return RSC_FINAL;
+    }
+    /* out of order or retransmitted. */
+    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
+        chain->stat.data_out_of_win++;
+        return RSC_FINAL;
+    }
+    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
+    if (nseq == oseq) {
+        if ((0 == o_unit->payload) && n_unit->payload) {
+            /* From no payload to payload, normal case, not a dup
ack or etc */
+            chain->stat.data_after_pure_ack++;
+            goto coalesce;
+        } else {
+            return virtio_net_rsc_handle_ack(chain, seg, buf,
+                                             n_unit->tcp,
+        }
+    } else if ((nseq - oseq) != o_unit->payload) {
+        /* Not a consistent packet, out of order */
+        chain->stat.data_out_of_order++;
+        return RSC_FINAL;
+    } else {
+        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
+            chain->stat.over_size++;
+            return RSC_FINAL;
+        }
+        /* Here comes the right data, the payload lengh in v4/v6 is
+           so use the field value to update and record the new data
len */
+        o_unit->payload += n_unit->payload; /* update new data len */
+        /* update field in ip header */
+        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
+        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be
+           for windows guest, while this may change the behavior
for linux
+           guest. */
This needs more thought, 'can' probably means don't. (Linux GRO won't
merge PUSH packet).
Yes, since it's mainly for win guest, how about take this as this
first and do more test and see how to handle it?
Right, this is not an issue probably but just an optimization for latency.


+        return NULL;
+    }
+    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
+        if (chain->proto == proto) {
+            return chain;
+        }
+    }
+    chain = g_malloc(sizeof(*chain));
+    chain->hdr_size = n->guest_hdr_len;
Why introduce a specified field for instead of just use
this is to reduce invoking times to find VirtIONet by 'nc', there are
a few places will use it.
Okay, then I think you'd better keep a pointer to VirtIONet structure
instead. (Consider you may want to refer more data from it, we don't
want to duplicate fields in two places).
sure, that's a good idea.

+    chain->proto = proto;
+    chain->max_payload = MAX_IP4_PAYLOAD;
+    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
+                                      virtio_net_rsc_purge, chain);
+    memset(&chain->stat, 0, sizeof(chain->stat));
+    QTAILQ_INIT(&chain->buffers);
+    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
+    return chain;
+static ssize_t virtio_net_rsc_receive(NetClientState *nc,
+                                      const uint8_t *buf, size_t size)
+    uint16_t proto;
+    NetRscChain *chain;
+    struct eth_header *eth;
+    VirtIONet *n;
+    n = qemu_get_nic_opaque(nc);
+    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
+        return virtio_net_do_receive(nc, buf, size);
+    }
+    eth = (struct eth_header *)(buf + n->guest_hdr_len);
+    proto = htons(eth->h_proto);
+    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
+    if (!chain) {
+        return virtio_net_do_receive(nc, buf, size);
+    } else {
+        chain->stat.received++;
+        return virtio_net_rsc_receive4(chain, nc, buf, size);
+    }
+static ssize_t virtio_net_receive(NetClientState *nc,
+                                  const uint8_t *buf, size_t size)
+    if (virtio_net_rsc_bypass) {
+        return virtio_net_do_receive(nc, buf, size);
You need a feature bit for this and compat it for older machine types.
And also need some work on virtio spec I think.
yes, not sure which way is good to support this, hmp/qmp/ethtool, this
is gonna to support win guest,
so need a well-compatible interface, any comments?
I think this should be implemented through feature bits/negotiation
instead of something like ethtool.
Looks this feature should be turn on/off dynamically due to the spec, so maybe this should be managed from the guest, is there any reference code for this?


reply via email to

[Prev in Thread] Current Thread [Next in Thread]