[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-devel] [PATCH 07/18] nbd: Minimal structured read for client
From: |
Paolo Bonzini |
Subject: |
Re: [Qemu-devel] [PATCH 07/18] nbd: Minimal structured read for client |
Date: |
Wed, 15 Feb 2017 17:54:58 +0100 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.7.0 |
On 07/02/2017 21:14, Eric Blake wrote:
> On 02/03/2017 09:47 AM, Vladimir Sementsov-Ogievskiy wrote:
>> Minimal implementation: always send DF flag, to not deal with fragmented
>> replies.
>
> This works well with your minimal server implementation, but I worry
> that it will cause us to fall over when talking to a fully-compliant
> server that chooses to send EOVERFLOW errors for any request larger than
> 64k when DF is set; it also makes it impossible to benefit from sparse
> reads. I guess that means we need to start thinking about followup
> patches to flush out our implementation. But maybe I can live with this
> patch as is, since the goal of your series was not so much the full
> power of structured reads, but getting to a point where we could use
> structured reply for block status, even if it means your client can only
> communicate with qemu-nbd as server for now, as long as we do get to the
> rest of the patches for a full-blown structured read.
Can you post a diff that expresses this as a comment? I'll squash the
comment into this commit.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <address@hidden>
>> ---
>> block/nbd-client.c | 47 +++++++++++----
>> block/nbd-client.h | 2 +
>> include/block/nbd.h | 15 +++--
>> nbd/client.c | 170
>> ++++++++++++++++++++++++++++++++++++++++++++++------
>> qemu-nbd.c | 2 +-
>> 5 files changed, 203 insertions(+), 33 deletions(-)
>
> Hmm - no change to the testsuite. Structured reads seems like the sort
> of thing that it would be nice to test with some canned server replies,
> particularly with server behavior that is permitted by the NBD protocol
> but does not happen by default in qemu-nbd.
This would require implementing some kind of mock server support. That
would be a very good thing but not something we have much infrastructure
for (you could use either a real socket or a mock QIOChannel).
Thanks,
Paolo
>>
>> diff --git a/block/nbd-client.c b/block/nbd-client.c
>> index 3779c6c999..ff96bd1635 100644
>> --- a/block/nbd-client.c
>> +++ b/block/nbd-client.c
>> @@ -180,13 +180,20 @@ static void nbd_co_receive_reply(NBDClientSession *s,
>> *reply = s->reply;
>> if (reply->handle != request->handle ||
>> !s->ioc) {
>> + reply->simple = true;
>> reply->error = EIO;
>
> I don't think this is quite right - by setting reply->simple to true,
> you are forcing the caller to treat this as the final packet related to
> this request->handle, even though that might not be the case.
>
> As it is, I wonder if this code is correct, even before your patch - the
> server is allowed to give responses out-of-order (if we request multiple
> reads without waiting for the first response) - I don't see how setting
> reply->error to EIO if the request->handle indicates that we are
> receiving an out-of-order response to some other packet, but that our
> request is still awaiting traffic.
>
>> } else {
>> - if (qiov && reply->error == 0) {
>> - ret = nbd_wr_syncv(s->ioc, qiov->iov, qiov->niov, request->len,
>> - true);
>> - if (ret != request->len) {
>> - reply->error = EIO;
>> + if (qiov) {
>> + if ((reply->simple ? reply->error == 0 :
>> + reply->type == NBD_REPLY_TYPE_OFFSET_DATA)) {
>> + ret = nbd_wr_syncv(s->ioc, qiov->iov, qiov->niov,
>> request->len,
>> + true);
>
> This works only because you used the DF flag. If we allow fragmenting,
> then you have to be careful to write the reply into the correct offset
> of the iov.
>
>> + if (ret != request->len) {
>> + reply->error = EIO;
>> + }
>> + } else if (!reply->simple &&
>> + reply->type == NBD_REPLY_TYPE_OFFSET_HOLE) {
>> + qemu_iovec_memset(qiov, 0, 0, request->len);
>> }
>
> Up to here, you didn't do any inspection for NBD_REPLY_FLAG_DONE (so you
> don't know if this is the last packet the server is sending for this
> reqeust->handle), and didn't do any special casing for
> NBD_REPLY_TYPE_NONE or for the various error replies. I'm not sure if
> this will always do what you want. In fact, I'm not even sure if
> reply->error is set correctly for all structured packets.
>
>> }
>>
>> @@ -227,6 +234,7 @@ int nbd_client_co_preadv(BlockDriverState *bs, uint64_t
>> offset,
>> .type = NBD_CMD_READ,
>> .from = offset,
>> .len = bytes,
>> + .flags = client->structured_reply ? NBD_CMD_FLAG_DF : 0,
>> };
>> NBDReply reply;
>> ssize_t ret;
>> @@ -237,12 +245,30 @@ int nbd_client_co_preadv(BlockDriverState *bs,
>> uint64_t offset,
>> nbd_coroutine_start(client, &request);
>> ret = nbd_co_send_request(bs, &request, NULL);
>> if (ret < 0) {
>> - reply.error = -ret;
>> - } else {
>> - nbd_co_receive_reply(client, &request, &reply, qiov);
>> + goto out;
>> }
>> +
>> + nbd_co_receive_reply(client, &request, &reply, qiov);
>> + if (reply.error != 0) {
>> + ret = -reply.error;
>> + }
>> +
>> + if (!reply.simple) {
>> + while (!(reply.flags & NBD_REPLY_FLAG_DONE)) {
>> + nbd_co_receive_reply(client, &request, &reply, qiov);
>> + if (reply.error != 0) {
>> + ret = -reply.error;
>> + }
>> + if (reply.simple) {
>
> Hmm. It looks like this part of the loop is only triggered if
> nbd_co_receive_reply() detects a handle mismatch and slams reply.simple
> to true. As long as we use the DF flag, it looks like the server should
> never send an error packet followed by a data packet, and your
> particular server implementation always set the DONE flag on the error
> packet, so it got past your testing. But if we don't rely on the DF
> flag, a server could reasonable send an ERROR_OFFSET packet for half the
> buffer, followed by a data packet for the other half of the buffer,
> which may wipe out reply.error from the error packet.
>
>> + ret = -EIO;
>> + goto out;
>> + }
>> + }
>> + }
>> +
>> +out:
>> nbd_coroutine_end(client, &request);
>> - return -reply.error;
>> + return ret;
>> }
>>
>> int nbd_client_co_pwritev(BlockDriverState *bs, uint64_t offset,
>> @@ -408,7 +434,8 @@ int nbd_client_init(BlockDriverState *bs,
>> &client->nbdflags,
>> tlscreds, hostname,
>> &client->ioc,
>> - &client->size, errp);
>> + &client->size,
>> + &client->structured_reply, errp);
>> if (ret < 0) {
>> logout("Failed to negotiate with the NBD server\n");
>> return ret;
>> diff --git a/block/nbd-client.h b/block/nbd-client.h
>> index f8d6006849..cba1f965bf 100644
>> --- a/block/nbd-client.h
>> +++ b/block/nbd-client.h
>> @@ -32,6 +32,8 @@ typedef struct NBDClientSession {
>> NBDReply reply;
>>
>> bool is_unix;
>> +
>> + bool structured_reply;
>> } NBDClientSession;
>>
>> NBDClientSession *nbd_get_client_session(BlockDriverState *bs);
>> diff --git a/include/block/nbd.h b/include/block/nbd.h
>> index 58b864f145..dae2e4bd03 100644
>> --- a/include/block/nbd.h
>> +++ b/include/block/nbd.h
>
> Can you add the use of an order file to list .h files first in your
> diffs? See
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00288.html for
> tips.
>
>> @@ -57,11 +57,16 @@ struct NBDRequest {
>> };
>> typedef struct NBDRequest NBDRequest;
>>
>> -struct NBDReply {
>> +typedef struct NBDReply {
>> + bool simple;
>> uint64_t handle;
>> uint32_t error;
>> -};
>> -typedef struct NBDReply NBDReply;
>> +
>> + uint16_t flags;
>> + uint16_t type;
>> + uint32_t length;
>> + uint64_t offset;
>> +} NBDReply;
>
> I don't know if this is the best way to represent things; I might have
> used a union type, since not all fields are valid in all reply packets.
>
>>
>> struct NBDSimpleReply {
>> /* uint32_t NBD_SIMPLE_REPLY_MAGIC */
>> @@ -169,10 +174,10 @@ ssize_t nbd_wr_syncv(QIOChannel *ioc,
>> int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t
>> *flags,
>> QCryptoTLSCreds *tlscreds, const char *hostname,
>> QIOChannel **outioc,
>> - off_t *size, Error **errp);
>> + off_t *size, bool *structured_reply, Error
>> **errp);
>> int nbd_init(int fd, QIOChannelSocket *sioc, uint16_t flags, off_t size);
>> ssize_t nbd_send_request(QIOChannel *ioc, NBDRequest *request);
>> -ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply *reply);
>> +int nbd_receive_reply(QIOChannel *ioc, NBDReply *reply);
>> int nbd_client(int fd);
>> int nbd_disconnect(int fd);
>>
>> diff --git a/nbd/client.c b/nbd/client.c
>> index 1c274f3012..9225f7e30d 100644
>> --- a/nbd/client.c
>> +++ b/nbd/client.c
>> @@ -472,11 +472,10 @@ static QIOChannel *nbd_receive_starttls(QIOChannel
>> *ioc,
>> return QIO_CHANNEL(tioc);
>> }
>>
>> -
>> int nbd_receive_negotiate(QIOChannel *ioc, const char *name, uint16_t
>> *flags,
>> QCryptoTLSCreds *tlscreds, const char *hostname,
>> QIOChannel **outioc,
>> - off_t *size, Error **errp)
>> + off_t *size, bool *structured_reply, Error **errp)
>> {
>> char buf[256];
>> uint64_t magic, s;
>> @@ -584,6 +583,12 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char
>> *name, uint16_t *flags,
>> if (nbd_receive_query_exports(ioc, name, errp) < 0) {
>> goto fail;
>> }
>> +
>> + if (structured_reply != NULL) {
>> + *structured_reply =
>> + nbd_receive_simple_option(ioc, NBD_OPT_STRUCTURED_REPLY,
>> + false, NULL) == 0;
>
> Okay, you're allowing the server to reject the option, in which case we
> set structured_reply to false. But re-reading patch 4/18,
> nbd_receive_simple_option() can return -1 for multiple reasons, some of
> them where it is still in sync (for sending more options), but others
> where it is out of sync (such as failure to write, in which case the
> connection MUST be dropped rather than trying to carry on). I don't
> think this handles errors correctly, and therefore I'm not even sure
> that the refactoring in 4/18 is correct.
>
> I think you may be better off with nbd_receive_simple_option() in 4/18
> being tri-state: return -1 if the connection is unrecoverable (such as
> after a write or read error, where we must not send or receive any more
> data), 0 if the server replied with an error but the connection is still
> in sync for trying something else, and 1 if the server replies with
> success. Then this code should check if the return is < 0 (kill
> negotiation), == 0 (*structured_reply = false), or == 1
> (*structured_reply = true).
>
>> + }
>> }
>> /* write the export name request */
>> if (nbd_send_option_request(ioc, NBD_OPT_EXPORT_NAME, -1, name,
>> @@ -603,6 +608,14 @@ int nbd_receive_negotiate(QIOChannel *ioc, const char
>> *name, uint16_t *flags,
>> goto fail;
>> }
>> be16_to_cpus(flags);
>> +
>> + if (!!structured_reply && *structured_reply &&
>
> Why do you need !! to coerce to bool, when && also coerces to bool?
>
>> + !(*flags & NBD_CMD_FLAG_DF))
>> + {
>> + error_setg(errp, "Structured reply is negotiated, "
>> + "but DF flag is not.");
>
> No trailing '.' for error_setg() messages.
>
> Also, I'm not quite sure the NBD protocol allows an implementation that
> supports structured read but does not support the DF flag. Maybe that's
> an NBD spec bug that we should get clarified. (Ideally, the server
> always advertises the DF flag if structured replies are negotiated,
> because the common implementation of user-space handshake followed by
> kernel transmission phase works best when the already-existing
> ioctl(NBD_SET_FLAGS) can then be used to tell the kernel to use/expect
> structure replies.)
>
>> + goto fail;
>> + }
>> } else if (magic == NBD_CLIENT_MAGIC) {
>> uint32_t oldflags;
>>
>> @@ -790,20 +803,33 @@ ssize_t nbd_send_request(QIOChannel *ioc, NBDRequest
>> *request)
>> return 0;
>> }
>>
>> -ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply *reply)
>> +static inline int read_sync_check(QIOChannel *ioc, void *buffer, size_t
>> size)
>> {
>> - uint8_t buf[NBD_REPLY_SIZE];
>> - uint32_t magic;
>> ssize_t ret;
>>
>> - ret = read_sync(ioc, buf, sizeof(buf));
>> + ret = read_sync(ioc, buffer, size);
>> if (ret < 0) {
>> return ret;
>> }
>> -
>> - if (ret != sizeof(buf)) {
>> + if (ret != size) {
>> LOG("read failed");
>> - return -EINVAL;
>> + return -EIO;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +/* nbd_receive_simple_reply
>> + * Read simple reply except magic field (which should be already read)
>> + */
>> +static int nbd_receive_simple_reply(QIOChannel *ioc, NBDReply *reply)
>> +{
>> + uint8_t buf[NBD_REPLY_SIZE - 4];
>> + ssize_t ret;
>> +
>> + ret = read_sync_check(ioc, buf, sizeof(buf));
>
> This patch is fairly long; I might have split the creation and initial
> use of the read_sync_check() function into its own patch.
>
>> + if (ret < 0) {
>> + return ret;
>> }
>>
>> /* Reply
>> @@ -812,9 +838,124 @@ ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply
>> *reply)
>> [ 7 .. 15] handle
>> */
>>
>> - magic = ldl_be_p(buf);
>> - reply->error = ldl_be_p(buf + 4);
>> - reply->handle = ldq_be_p(buf + 8);
>> + reply->error = ldl_be_p(buf);
>> + reply->handle = ldq_be_p(buf + 4);
>> +
>> + return 0;
>> +}
>> +
>> +/* nbd_receive_structured_reply_chunk
>> + * Read structured reply chunk except magic field (which should be already
>> read)
>> + * Data for NBD_REPLY_TYPE_OFFSET_DATA is not read too.
>> + * Length field of reply out parameter corresponds to unread part of reply.
>
> Awkward wording; maybe:
>
> Read the header portion of a structured reply chunk (except for magic,
> which should already be read). On success, the length field of reply
> corresponds to number of bytes in the reply that still need to be read.
>
>> + */
>> +static int nbd_receive_structured_reply_chunk(QIOChannel *ioc, NBDReply
>> *reply)
>> +{
>> + NBDStructuredReplyChunk chunk;
>> + ssize_t ret;
>> + uint16_t message_size;
>> +
>> + ret = read_sync_check(ioc, (uint8_t *)&chunk + sizeof(chunk.magic),
>
> This looks like some ugly pointer math. I wonder if you'd be better
> served by having a separate packed struct without the magic field,
> particularly since you have to read the magic in first to decide whether
> to dispatch to simple/structured for the rest of the structure. In
> other words, note how patch 2/18 omits a uint32_t magic in
> NBDSimpleReply; maybe patch 3/18 could do the same with
> NBDStructuredReplyChunk. Then again, that would impact how you code
> things for subcasses like NBDStructuredRead, so I don't know how much
> churn to request here.
>
>> + sizeof(chunk) - sizeof(chunk.magic));
>> + if (ret < 0) {
>> + return ret;
>> + }
>> +
>> + reply->flags = be16_to_cpu(chunk.flags);
>> + reply->type = be16_to_cpu(chunk.type);
>> + reply->handle = be64_to_cpu(chunk.handle);
>> + reply->length = be32_to_cpu(chunk.length);
>> +
>> + switch (reply->type) {
>> + case NBD_REPLY_TYPE_NONE:
>> + break;
>> + case NBD_REPLY_TYPE_OFFSET_DATA:
>> + case NBD_REPLY_TYPE_OFFSET_HOLE:
>> + ret = read_sync_check(ioc, &reply->offset, sizeof(reply->offset));
>> + if (ret < 0) {
>> + return ret;
>> + }
>> + be64_to_cpus(&reply->offset);
>> + reply->length -= sizeof(reply->offset);
>> + break;
>> + case NBD_REPLY_TYPE_ERROR:
>> + case NBD_REPLY_TYPE_ERROR_OFFSET:
>
> Here, I think that you want:
>
> default:
> if (reply->type & 0x8000) {
>
> rather than specific NBD_REPLY_TYPE_ERROR labels, so that you can
> gracefully handle ALL error packets sent by a server even if you haven't
> explicitly coded for them (a good server should probably not be sending
> an error packet you don't recognize, but the protocol also went to good
> efforts to describe a common behavior of all error packets).
>
> /me reads on...
>
>> + ret = read_sync_check(ioc, &reply->error, sizeof(reply->error));
>> + if (ret < 0) {
>> + return ret;
>> + }
>> + be32_to_cpus(&reply->error);
>> +
>> + ret = read_sync_check(ioc, &message_size, sizeof(message_size));
>> + if (ret < 0) {
>> + return ret;
>> + }
>> + be16_to_cpus(&message_size);
>> +
>> + if (message_size > 0) {
>> + /* TODO: provide error message to user */
>> + ret = drop_sync(ioc, message_size);
>> + if (ret < 0) {
>> + return ret;
>> + }
>> + }
>> +
>> + if (reply->type == NBD_REPLY_TYPE_ERROR_OFFSET) {
>> + /* drop 64bit offset */
>> + ret = drop_sync(ioc, 8);
>> + if (ret < 0) {
>> + return ret;
>> + }
>> + }
>> + break;
>> + default:
>> + if (reply->type & (1 << 15)) {
>
> ...oh, you did, mostly.
>
>> + /* unknown error */
>> + ret = drop_sync(ioc, reply->length);
>
> Still, even if it is an unknown error, you should still be able to
> populate reply->error, rather than ignoring it.
>
>> + if (ret < 0) {
>> + return ret;
>> + }
>> +
>> + reply->error = NBD_EINVAL;
>
> Meanwhile, you should also probably ensure that reply->error is non-zero
> even if the server was buggy and sent an error flag without telling you
> an error value (your choice of EINVAL seems reasonable).
>
>> + reply->length = 0;
>> + } else {
>> + /* unknown non-error reply type */
>> + return -EINVAL;
>> + }
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +int nbd_receive_reply(QIOChannel *ioc, NBDReply *reply)
>> +{
>> + uint32_t magic;
>> + int ret;
>> +
>> + ret = read_sync_check(ioc, &magic, sizeof(magic));
>> + if (ret < 0) {
>> + return ret;
>> + }
>> +
>> + be32_to_cpus(&magic);
>> +
>> + switch (magic) {
>> + case NBD_SIMPLE_REPLY_MAGIC:
>> + reply->simple = true;
>> + ret = nbd_receive_simple_reply(ioc, reply);
>> + break;
>> + case NBD_STRUCTURED_REPLY_MAGIC:
>> + reply->simple = false;
>> + ret = nbd_receive_structured_reply_chunk(ioc, reply);
>> + break;
>> + default:
>> + LOG("invalid magic (got 0x%" PRIx32 ")", magic);
>> + return -EINVAL;
>
> This part looks good.
>
>> + }
>> +
>> + if (ret < 0) {
>> + return ret;
>> + }
>>
>> reply->error = nbd_errno_to_system_errno(reply->error);
>>
>> @@ -827,10 +968,5 @@ ssize_t nbd_receive_reply(QIOChannel *ioc, NBDReply
>> *reply)
>> ", handle = %" PRIu64" }",
>> magic, reply->error, reply->handle);
>>
>> - if (magic != NBD_SIMPLE_REPLY_MAGIC) {
>> - LOG("invalid magic (got 0x%" PRIx32 ")", magic);
>> - return -EINVAL;
>> - }
>> return 0;
>> }
>> -
>> diff --git a/qemu-nbd.c b/qemu-nbd.c
>
> Why are you deleting a blank line from the end of the file? Is that
> rebase churn that should have been avoided in an earlier patch? </me
> goes and reads> - oh, it's been like that since commit 798bfe000 created
> the file. Thanks for cleaning it up (and I'm surprised checkpatch
> doesn't flag it).
>
>> index c734f627b4..de0099e333 100644
>> --- a/qemu-nbd.c
>> +++ b/qemu-nbd.c
>> @@ -272,7 +272,7 @@ static void *nbd_client_thread(void *arg)
>>
>> ret = nbd_receive_negotiate(QIO_CHANNEL(sioc), NULL, &nbdflags,
>> NULL, NULL, NULL,
>> - &size, &local_error);
>> + &size, NULL, &local_error);
>
> We could reasonably allow new-style structured reads when we're handling
> the entire client side; but I agree that when qemu-nbd is used as the
> handshaking phase before handing things over to the kernel that we can't
> request structured reads, at least not until a) the kernel nbd module
> implements structured read supports, and b) we have a way to tell that
> the kernel is willing to accept structured read negotiation (and that's
> where my earlier comments about the DF flag being a witness of
> structured reads comes into play).
>
> Umm, how does this patch compile? You changed the signature of
> nbd_receive_negotiate() to add a parameter, but did not modify the call
> in block/nbd-client.c to pass the new parameter. (So far, I've been
> reviewing based just on email content; I also plan on applying and
> testing your patches before all is said and done, but I sometimes
> surprise myself with my ability to do a quality read-only review even
> without spending time compiling things).
>
>> if (ret < 0) {
>> if (local_error) {
>> error_report_err(local_error);
>>
>
> We'll definitely need a v2, but I'm grateful that you've tackled this work.
>
signature.asc
Description: OpenPGP digital signature
- [Qemu-devel] [PATCH 04/18] nbd/client: refactor nbd_receive_starttls, (continued)
[Qemu-devel] [PATCH 02/18] nbd-server: refactor simple reply sending, Vladimir Sementsov-Ogievskiy, 2017/02/03
[Qemu-devel] [PATCH 07/18] nbd: Minimal structured read for client, Vladimir Sementsov-Ogievskiy, 2017/02/03
[Qemu-devel] [PATCH 12/18] nbd: BLOCK_STATUS for bitmap export: client part, Vladimir Sementsov-Ogievskiy, 2017/02/03
[Qemu-devel] [PATCH 17/18] nbd: BLOCK_STATUS for standard get_block_status function: server part, Vladimir Sementsov-Ogievskiy, 2017/02/03
[Qemu-devel] [PATCH 16/18] iotests: add test for nbd dirty bitmap export, Vladimir Sementsov-Ogievskiy, 2017/02/03
[Qemu-devel] [PATCH 09/18] block/dirty-bitmap: add bdrv_dirty_bitmap_next(), Vladimir Sementsov-Ogievskiy, 2017/02/03
[Qemu-devel] [PATCH 11/18] nbd: BLOCK_STATUS for bitmap export: server part, Vladimir Sementsov-Ogievskiy, 2017/02/03