qemu-devel

Re: [Qemu-devel] An RDMA race?


From: Dr. David Alan Gilbert
Subject: Re: [Qemu-devel] An RDMA race?
Date: Mon, 4 Jan 2016 18:15:56 +0000
User-agent: Mutt/1.5.24 (2015-08-30)

* Michael R. Hines (address@hidden) wrote:
> Adding such a control message would defeat the benefits of RDMA, as there
> shouldn't be any signalling in the actual DMA path, or RDMA latency would
> be too high. If you're sending control messages for individual writes, then
> you need to change up your design. It's OK to design ACKs for groups of
> writes, depending on the requirements.

I started off sending individual messages, and then once I had that working
I changed it to group them, sending one message every 2048 pages.
The performance isn't very good though, and I haven't yet analysed why.
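A minimal sketch of the grouping idea, assuming the sender simply queues page indices and flushes one control message per 2048 pages. `NotifyBatch`, `notify_batch_add()`, `notify_batch_flush()` and the `notify_send_fn` callback are hypothetical names for illustration, not QEMU's RDMA API; the callback stands in for whatever actually posts the control message, and `count_send()` is only a demo sink.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical batch size, matching "one message every 2048 pages". */
#define NOTIFY_BATCH_PAGES 2048

/* Hypothetical callback standing in for the real control-message send. */
typedef void (*notify_send_fn)(const uint64_t *pages, size_t count, void *opaque);

typedef struct {
    uint64_t pages[NOTIFY_BATCH_PAGES]; /* page indices queued since last flush */
    size_t count;                       /* number of valid entries in pages[] */
    notify_send_fn send;
    void *opaque;
} NotifyBatch;

static void notify_batch_init(NotifyBatch *b, notify_send_fn send, void *opaque)
{
    b->count = 0;
    b->send = send;
    b->opaque = opaque;
}

/* Send whatever is queued as a single control message, then reset. */
static void notify_batch_flush(NotifyBatch *b)
{
    if (b->count > 0) {
        b->send(b->pages, b->count, b->opaque);
        b->count = 0;
    }
}

/* Queue one written page; only every NOTIFY_BATCH_PAGES pages triggers a send. */
static void notify_batch_add(NotifyBatch *b, uint64_t page_index)
{
    assert(b->count < NOTIFY_BATCH_PAGES);
    b->pages[b->count++] = page_index;
    if (b->count == NOTIFY_BATCH_PAGES) {
        notify_batch_flush(b);
    }
}

/* Demo sink: counts how many messages and pages would have been sent. */
typedef struct {
    size_t messages;
    size_t pages;
} SendStats;

static void count_send(const uint64_t *pages, size_t count, void *opaque)
{
    SendStats *s = opaque;
    (void)pages;
    s->messages++;
    s->pages += count;
}
```

With 5000 pages this sends two full messages of 2048 pages, and the remaining 904 only go out on an explicit `notify_batch_flush()`, so a real implementation would need to flush at each checkpoint or iteration boundary.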

> So, the out-of-order issue you're seeing is only with your new message, not
> the original messages?

Yes, I believe it only happens on the new messages; however:
  1) I'm sending a lot more control messages, so if there's a race I'm
    a lot more likely to trigger it (I'm not sure I trigger it at all in the
    case where I group those 2048 together). Does this mean it could
    occasionally trigger on the unmodified code?

  2) My reading of the existing code is that it could happen:
    a) the source is ready to send something and is waiting for a CONTROL_READY
    b) the destination sends the CONTROL_READY
        (blocking in the qemu_rdma_post_send_control call to
         qemu_rdma_block_for_wrid(rdma, RDMA_WRID_SEND_CONTROL, NULL))
    c) the source sends its data
    d) that data arrives at the destination
    e) finally the WRID_SEND_CONTROL completion arrives back

   Having (d) and (e) the wrong way round is the race I think I'm seeing,
   and then we lose (d)'s data.
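One way the (d)/(e) reordering could be tolerated, sketched under the assumption that data is lost because the wait loop discards completions it isn't waiting for: count every completion seen, so a RECV that overtakes the SEND signal is still available to the next wait. `CompletionTracker`, `wait_for_wrid()` and the scripted completion source are hypothetical and simulate the queue; they do not call `ibv_poll_cq()` and are not QEMU's `qemu_rdma_block_for_wrid()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work-request IDs, loosely modelled on QEMU's RDMA_WRID_* values. */
enum {
    WRID_SEND_CONTROL = 0,
    WRID_RECV_CONTROL = 1,
    WRID_MAX = 2,
};

/* Hypothetical completion source: returns the wrid of the next completion. */
typedef int (*poll_fn)(void *opaque);

typedef struct {
    unsigned pending[WRID_MAX]; /* completions seen but not yet consumed */
    poll_fn poll;
    void *opaque;
} CompletionTracker;

/*
 * Block until a completion for 'wanted' is available, then consume it.
 * Completions for other wrids that arrive first are counted rather than
 * dropped, so they are still there for a later wait.
 */
static void wait_for_wrid(CompletionTracker *t, int wanted)
{
    assert(wanted >= 0 && wanted < WRID_MAX);
    while (t->pending[wanted] == 0) {
        int wrid = t->poll(t->opaque);
        assert(wrid >= 0 && wrid < WRID_MAX);
        t->pending[wrid]++;
    }
    t->pending[wanted]--;
}

/* Demo completion source that replays a fixed completion order. */
typedef struct {
    const int *order;
    size_t len;
    size_t next;
} ScriptedCq;

static int scripted_poll(void *opaque)
{
    ScriptedCq *cq = opaque;
    assert(cq->next < cq->len);
    return cq->order[cq->next++];
}
```

Replaying the order above (the RECV for (d) before the SEND completion for (e)), the destination's wait for `WRID_SEND_CONTROL` stashes the RECV, and the following wait for `WRID_RECV_CONTROL` returns without polling again.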

> Can you describe/document it in more detail so I can help advise?

There are 2 cases where the destination needs to know which pages it has received:
  i) In COLO or checkpointing, where it's receiving a partial new checkpoint;
    since it only receives part of a checkpoint, it needs to know what it has
    received. This lets the destination avoid copying the whole of its
    received checkpoint and copy only the parts that changed.

 ii) In postcopy, once a page is received by the destination it has to be
    placed atomically; I haven't thought too hard about that yet.
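For case (i), a minimal sketch of how the destination might track what it received: one bit per page, set on arrival, and at commit time only the marked pages are copied from the staging buffer. `RecvBitmap`, `recv_bitmap_mark()`, `commit_received_pages()` and the 4 KiB `CKPT_PAGE_SIZE` are assumptions for illustration, not QEMU or COLO code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CKPT_PAGE_SIZE 4096 /* assumed page size for this sketch */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

typedef struct {
    unsigned long *bits; /* one bit per guest page: set if received this round */
    size_t npages;
} RecvBitmap;

/* Mark a page as received in the current partial checkpoint. */
static void recv_bitmap_mark(RecvBitmap *bm, size_t page)
{
    assert(page < bm->npages);
    bm->bits[page / BITS_PER_WORD] |= 1UL << (page % BITS_PER_WORD);
}

static int recv_bitmap_test(const RecvBitmap *bm, size_t page)
{
    return (int)((bm->bits[page / BITS_PER_WORD] >> (page % BITS_PER_WORD)) & 1UL);
}

/*
 * Copy only the pages received in this round from 'staging' into
 * 'committed', then clear the bitmap for the next checkpoint.
 * Returns the number of pages copied.
 */
static size_t commit_received_pages(RecvBitmap *bm, uint8_t *committed,
                                    const uint8_t *staging)
{
    size_t copied = 0;
    for (size_t page = 0; page < bm->npages; page++) {
        if (recv_bitmap_test(bm, page)) {
            memcpy(committed + page * CKPT_PAGE_SIZE,
                   staging + page * CKPT_PAGE_SIZE, CKPT_PAGE_SIZE);
            copied++;
        }
    }
    memset(bm->bits, 0,
           ((bm->npages + BITS_PER_WORD - 1) / BITS_PER_WORD) * sizeof(unsigned long));
    return copied;
}
```

The cost of a commit then scales with the number of pages received rather than with guest RAM size, apart from the bitmap scan itself.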

Dave

> 
> - Michael
> 
> On Mon, Dec 14, 2015 at 6:53 PM, Dr. David Alan Gilbert <address@hidden> wrote:
> 
> > * Michael R. Hines (address@hidden) wrote:
> > > David,
> > >
> > > Thanks for including my email directly. It helps a lot.
> > >
> > > Below, I'm going to assume that only "dest" is calling
> > > qemu_rdma_exchange_recv() and only src is calling
> > > qemu_rdma_exchange_send(), since you didn't specify who
> > > is sending and who is receiving.
> > >
> > > If that assumption is wrong, please respond again.
> >
> > That's correct.
> >
> > > Comments inline.....
> > >
> > > On Sat, Dec 12, 2015 at 1:48 AM, Dr. David Alan Gilbert <address@hidden> wrote:
> > >
> > > > Hi Michael,
> > > >    I think I've got an RDMA race condition, but I'm being a little
> > > > cautious at the moment and wondered if you agree with the following
> > > > diagnosis.
> > > >
> > > > It's showing up in a world of mine that's sending more control messages
> > > > from the destination->source and I'm seeing the following.
> > > >
> > > > We normally expect:
> > > >
> > > >    src                        dest
> > > >      ----------->control ready->
> > > >
> > >
> > > If src is sending, this is not correct. Dest should send the ready message
> > > if it is receiving, not src, which breaks the above assumption. So, I'll
> > > reverse my previous assumption, continue with your observation and
> > > assume that src is receiving instead of dest, which should instead look
> > > like:
> >
> > Gah! Yes, I got the label the wrong way around; it's dest sending control
> > ready.
> >
> > > src  (receiving)                      dest (sending)
> > >      ----------->control ready->
> > >
> > >
> > >
> > > >    Sees SEND_CONTROL signal to ack that it has been sent
> > > >
> > >
> > > I'll assume here that you meant that dest sees the ready message and
> > > then later sends something.
> > >
> > >
> > > >          <-----control message--
> > > >    Sees RECV_CONTROL message from dest
> > > >
> > > >
> > > Similar assumption for the receiver (src).
> > >
> > >
> > > > but what I'm seeing is:
> > > >    src                        dest
> > > >      ----------->control ready->
> > > >          <-----control message--
> > > >    Sees RECV_CONTROL message from dest
> > > >
> > >
> > > hmmmmm....
> > >
> > >
> > > >    Sees SEND_CONTROL signal to ack that it has been sent
> > > >
> > > >
> > > There's not enough information here....... do you have a multi-threaded
> > > send or receive or something?
> >
> > No, I've been trying to wire RDMA into the COLO fault-tolerant setup,
> > so the change that got me to trigger this bug was that I'd
> > added a new control message, 'notify write', which explicitly
> > told the destination that one of its pages had been written;
> > at the RDMA level that was the only change.
> >
> > > Do the work request IDs match up?
> >
> > Yes, I think so; I also added a sequence number to the 'ready' messages
> > to check I wasn't losing one.
> > I had a chat to one of our RDMA guys (Doug Ledford) and he said
> > it's perfectly legal for RDMA to take longer to return the signal
> > from the send than the round trip of the destination responding;
> > the 'signal' doesn't happen until an ack has been received from the
> > destination card anyway, and that ack can get delayed or retried.
> > So I think we do need to fix this; the question then is how to fix
> > it for all control messages without breaking anything else. Are there
> > any cases that rely on having received the signal from the send before
> > continuing, or could I just do what I'm doing for all control messages?
> >
> > Dave
> >
> > > - Michael
> > --
> > Dr. David Alan Gilbert / address@hidden / Manchester, UK
> >
> 
> 
> 
> -- 
> /*
>  * Michael R. Hines
>  * https://michael.hinespot.com
>  */
--
Dr. David Alan Gilbert / address@hidden / Manchester, UK


