qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tc


From: Jason Wang
Subject: Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
Date: Fri, 18 Mar 2016 10:03:04 +0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.1


On 03/18/2016 12:45 AM, Wei Xu wrote:
> On 2016年03月17日 16:42, Jason Wang wrote:
>
>>
>> On 03/15/2016 05:17 PM, address@hidden wrote:
>>> From: Wei Xu <address@hidden>
>>>
>>> All the data packets in a tcp connection will be cached to a big buffer
>>> in every receive interval, and will be sent out via a timer, the
>>> 'virtio_net_rsc_timeout' controls the interval, the value will
>>> influent the
>>> performance and response of tcp connection extremely, 50000(50us) is a
>>> experience value to gain a performance improvement, since the whql test
>>> sends packets every 100us, so '300000(300us)' can pass the test case,
>>> this is also the default value, it's gonna to be tunable.
>>>
>>> The timer will only be triggered if the packets pool is not empty,
>>> and it'll drain off all the cached packets
>>>
>>> 'NetRscChain' is used to save the segments of different protocols in a
>>> VirtIONet device.
>>>
>>> The main handler of TCP includes TCP window update, duplicated ACK
>>> check
>>> and the real data coalescing if the new segment passed sanity check
>>> and is identified as an 'wanted' one.
>>>
>>> An 'wanted' segment means:
>>> 1. Segment is within current window and the sequence is the expected
>>> one.
>>> 2. ACK of the segment is in the valid window.
>>> 3. If the ACK in the segment is a duplicated one, then it must less
>>> than 2,
>>>     this is to notify upper layer TCP starting retransmission due to
>>> the spec.
>>>
>>> Sanity check includes:
>>> 1. Incorrect version in IP header
>>> 2. IP options & IP fragment
>>> 3. Not a TCP packets
>>> 4. Sanity size check to prevent buffer overflow attack.
>>>
>>> There maybe more cases should be considered such as ip
>>> identification other
>>> flags, while it broke the test because windows set it to the same
>>> even it's
>>> not a fragment.
>>>
>>> Normally it includes 2 typical ways to handle a TCP control flag,
>>> 'bypass'
>>> and 'finalize', 'bypass' means should be sent out directly, and
>>> 'finalize'
>>> means the packets should also be bypassed, and this should be done
>>> after searching for the same connection packets in the pool and sending
>>> all of them out, this is to avoid out of data.
>>>
>>> All the 'SYN' packets will be bypassed since this always begin a new'
>>> connection, other flags such 'FIN/RST' will trigger a finalization,
>>> because
>>> this normally happens upon a connection is going to be closed, an
>>> 'URG' packet
>>> also finalize current coalescing unit while there maybe protocol
>>> difference to
>>> different OS.
>> But URG packet should be sent as quickly as possible regardless of
>> ordering, no?
>
> Yes, you right, URG will terminate the current 'SCU', i'll amend the
> commit log.
>
>>
>>> Statistics can be used to monitor the basic coalescing status, the
>>> 'out of order'
>>> and 'out of window' means how many retransmitting packets, thus
>>> describe the
>>> performance intuitively.
>>>
>>> Signed-off-by: Wei Xu <address@hidden>
>>> ---
>>>   hw/net/virtio-net.c            | 486
>>> ++++++++++++++++++++++++++++++++++++++++-
>>>   include/hw/virtio/virtio-net.h |   1 +
>>>   include/hw/virtio/virtio.h     |  72 ++++++
>>>   3 files changed, 558 insertions(+), 1 deletion(-)

[...]

>>> +        } else {
>>> +            /* Coalesce window update */
>>> +            o_tcp->th_win = n_tcp->th_win;
>>> +            chain->stat.win_update++;
>>> +            return RSC_COALESCE;
>>> +        }
>>> +    } else {
>>> +        /* pure ack, update ack */
>>> +        o_tcp->th_ack = n_tcp->th_ack;
>>> +        chain->stat.pure_ack++;
>>> +        return RSC_COALESCE;
>> Looks like there're something I missed. The spec said:
>>
>> "In other words, any pure ACK that is not a duplicate ACK or a window
>> update triggers an exception and must not be coalesced. All such pure
>> ACKs must be indicated as individual segments."
>>
>> Does it mean we *should not* coalesce windows update and pure ack?
>> (Since it can wakeup transmission)?
>
> It's also a little bit inexplicit and flexible due to the spec, please
> see the flowchart I on the same page.
>
> Comments about the  flowchart:
> ------------------------------------------------------------------------
> The first of the following two flowcharts describes the rules for
> coalescing segments and updating the TCP headers.
> This flowchart refers to mechanisms for distinguishing valid duplicate
> ACKs and window updates. The second flowchart describes these mechanisms.
> ------------------------------------------------------------------------
> As show in the flowchart, only status 'C' will break current scu and
> get finalized, both 'A' and 'B' can be coalesced afaik.
>

Interesting, looks like you're right.

>>
>>> +    }
>>> +}
>>> +
>>> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
>>> NetRscSeg *seg,
>>> +                                    const uint8_t *buf, NetRscUnit
>>> *n_unit)
>>> +{
>>> +    void *data;
>>> +    uint16_t o_ip_len;
>>> +    uint32_t nseq, oseq;
>>> +    NetRscUnit *o_unit;
>>> +
>>> +    o_unit = &seg->unit;
>>> +    o_ip_len = htons(*o_unit->ip_plen);
>>> +    nseq = htonl(n_unit->tcp->th_seq);
>>> +    oseq = htonl(o_unit->tcp->th_seq);
>>> +
>>> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
>>> +        /* Log this only for debugging observation */
>>> +        chain->stat.tcp_option++;
>>> +    }
>>> +
>>> +    /* Ignore packet with more/larger tcp options */
>>> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
>> What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ?
> do you mean '<'? that also means some option changed, maybe i should
> include it.

Yes.

>>
>>> +        chain->stat.tcp_larger_option++;
>>> +        return RSC_FINAL;
>>> +    }
>>> +
>>> +    /* out of order or retransmitted. */
>>> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
>>> +        chain->stat.data_out_of_win++;
>>> +        return RSC_FINAL;
>>> +    }
>>> +
>>> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
>>> +    if (nseq == oseq) {
>>> +        if ((0 == o_unit->payload) && n_unit->payload) {
>>> +            /* From no payload to payload, normal case, not a dup
>>> ack or etc */
>>> +            chain->stat.data_after_pure_ack++;
>>> +            goto coalesce;
>>> +        } else {
>>> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
>>> +                                             n_unit->tcp,
>>> o_unit->tcp);
>>> +        }
>>> +    } else if ((nseq - oseq) != o_unit->payload) {
>>> +        /* Not a consistent packet, out of order */
>>> +        chain->stat.data_out_of_order++;
>>> +        return RSC_FINAL;
>>> +    } else {
>>> +coalesce:
>>> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
>>> +            chain->stat.over_size++;
>>> +            return RSC_FINAL;
>>> +        }
>>> +
>>> +        /* Here comes the right data, the payload lengh in v4/v6 is
>>> different,
>>> +           so use the field value to update and record the new data
>>> len */
>>> +        o_unit->payload += n_unit->payload; /* update new data len */
>>> +
>>> +        /* update field in ip header */
>>> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
>>> +
>>> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be
>>> coalesced
>>> +           for windows guest, while this may change the behavior
>>> for linux
>>> +           guest. */
>> This needs more thought, 'can' probably means don't. (Linux GRO won't
>> merge PUSH packet).
> Yes, since it's mainly for win guest, how about take this as this
> first and do more test and see how to handle it?

Right, this is not an issue probably but just an optimization for latency.

[...]

>>
>>> +        return NULL;
>>> +    }
>>> +
>>> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
>>> +        if (chain->proto == proto) {
>>> +            return chain;
>>> +        }
>>> +    }
>>> +
>>> +    chain = g_malloc(sizeof(*chain));
>>> +    chain->hdr_size = n->guest_hdr_len;
>> Why introduce a specified field for instead of just use
>> n->guest_hdr_len?
> this is to reduce invoking times to find VirtIONet by 'nc', there are
> a few places will use it.

Okay, then I think you'd better keep a pointer to VirtIONet structure
instead. (Consider you may want to refer more data from it, we don't
want to duplicate fields in two places).

>>
>>> +    chain->proto = proto;
>>> +    chain->max_payload = MAX_IP4_PAYLOAD;
>>> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>> +                                      virtio_net_rsc_purge, chain);
>>> +    memset(&chain->stat, 0, sizeof(chain->stat));
>>> +
>>> +    QTAILQ_INIT(&chain->buffers);
>>> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
>>> +
>>> +    return chain;
>>> +}
>>> +
>>> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>>> +                                      const uint8_t *buf, size_t size)
>>> +{
>>> +    uint16_t proto;
>>> +    NetRscChain *chain;
>>> +    struct eth_header *eth;
>>> +    VirtIONet *n;
>>> +
>>> +    n = qemu_get_nic_opaque(nc);
>>> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
>>> +        return virtio_net_do_receive(nc, buf, size);
>>> +    }
>>> +
>>> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
>>> +    proto = htons(eth->h_proto);
>>> +
>>> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
>>> +    if (!chain) {
>>> +        return virtio_net_do_receive(nc, buf, size);
>>> +    } else {
>>> +        chain->stat.received++;
>>> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
>>> +    }
>>> +}
>>> +
>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>> +                                  const uint8_t *buf, size_t size)
>>> +{
>>> +    if (virtio_net_rsc_bypass) {
>>> +        return virtio_net_do_receive(nc, buf, size);
>> You need a feature bit for this and compat it for older machine types.
>> And also need some work on virtio spec I think.
> yes, not sure which way is good to support this, hmp/qmp/ethtool, this
> is gonna to support win guest,
> so need a well-compatible interface, any comments?

I think this should be implemented through feature bits/negotiation
instead of something like ethtool.

[...]



reply via email to

[Prev in Thread] Current Thread [Next in Thread]