From: Klaus Jensen
Subject: Re: [PATCH v7 05/11] hw/block/nvme: Support Zoned Namespace Command Set
Date: Mon, 19 Oct 2020 11:50:57 +0200

On Oct 19 11:17, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Define values and structures that are needed to support Zoned
> Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.
> Define trace events where needed in newly introduced code.
> 
> In order to improve scalability, all open, closed and full zones
> are organized in separate linked lists. Consequently, almost all
> zone operations don't require scanning of the entire zone array
> (which potentially can be quite large) - it is only necessary to
> enumerate one or more zone lists.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> Subsequent commits in this series add ZDE support and checks for
> active and open zone limits.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
> Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Signed-off-by: Matias Bjorling <matias.bjorling@wdc.com>
> Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Signed-off-by: Adam Manzanares <adam.manzanares@wdc.com>
> Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  block/nvme.c          |   2 +-
>  hw/block/nvme-ns.c    | 193 +++++++++
>  hw/block/nvme-ns.h    |  54 +++
>  hw/block/nvme.c       | 975 ++++++++++++++++++++++++++++++++++++++++--
>  hw/block/nvme.h       |   9 +
>  hw/block/trace-events |  21 +
>  include/block/nvme.h  | 113 ++++-
>  7 files changed, 1339 insertions(+), 28 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 05485fdd11..7a513c9a17 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> +static void nvme_fill_read_data(NvmeRequest *req, uint64_t offset,
> +                                uint32_t max_len, uint8_t pattern)
> +{
> +    QEMUSGList *qsg = &req->qsg;
> +    QEMUIOVector *iov = &req->iov;
> +    ScatterGatherEntry *entry;
> +    uint32_t len, ent_len;
> +
> +    if (qsg->nsg > 0) {
> +        entry = qsg->sg;
> +        len = qsg->size;
> +        if (max_len) {
> +            len = MIN(len, max_len);
> +        }
> +        for (; len > 0; len -= ent_len) {
> +            ent_len = MIN(len, entry->len);
> +            if (offset > ent_len) {
> +                offset -= ent_len;
> +            } else if (offset != 0) {
> +                dma_memory_set(qsg->as, entry->base + offset,
> +                               pattern, ent_len - offset);
> +                offset = 0;
> +            } else {
> +                dma_memory_set(qsg->as, entry->base, pattern, ent_len);
> +            }
> +            entry++;

dma_memory_set() can fail. In that case, I think we should just fail the
command with NVME_DATA_TRAS_ERROR.

But I think this should be removed in any case; see below.
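
Something along these lines (just a sketch, and it assumes that
nvme_fill_read_data() is changed to return a status instead of void):

    if (dma_memory_set(qsg->as, entry->base + offset, pattern,
                       ent_len - offset)) {
        /* the transfer into host memory failed; bubble up a status */
        return NVME_DATA_TRAS_ERROR;
    }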

> +static uint16_t nvme_check_zone_read(NvmeNamespace *ns, NvmeZone *zone,
> +                                     uint64_t slba, uint32_t nlb,
> +                                     NvmeReadFillCtx *rfc)
> +{
> +    NvmeZone *next_zone;
> +    uint64_t bndry = nvme_zone_rd_boundary(ns, zone);
> +    uint64_t end = slba + nlb, wp1, wp2;
> +    uint16_t status;
> +
> +    rfc->pre_rd_fill_slba = ~0ULL;
> +    rfc->pre_rd_fill_nlb = 0;
> +    rfc->read_slba = slba;
> +    rfc->read_nlb = nlb;
> +    rfc->post_rd_fill_slba = ~0ULL;
> +    rfc->post_rd_fill_nlb = 0;
> +
> +    status = nvme_zone_state_ok_to_read(zone);
> +    if (status != NVME_SUCCESS) {
> +        ;
> +    } else if (likely(end <= bndry)) {
> +        if (end > zone->w_ptr) {
> +            wp1 = zone->w_ptr;
> +            if (slba >= wp1) {
> +                /* No i/o necessary, just fill */
> +                rfc->pre_rd_fill_slba = slba;
> +                rfc->pre_rd_fill_nlb = nlb;
> +                rfc->read_nlb = 0;
> +            } else {
> +                rfc->read_nlb = wp1 - slba;
> +                rfc->post_rd_fill_slba = wp1;
> +                rfc->post_rd_fill_nlb = nlb - rfc->read_nlb;
> +            }
> +        }
> +    } else if (!ns->params.cross_zone_read) {
> +        status = NVME_ZONE_BOUNDARY_ERROR;
> +    } else {
> +        /*
> +         * Read across zone boundary, look at the next zone.
> +         * Earlier bounds checks ensure that the current zone
> +         * is not the last one.
> +         */
> +        next_zone = zone + 1;
> +        status = nvme_zone_state_ok_to_read(next_zone);
> +        if (status != NVME_SUCCESS) {
> +            ;
> +        } else if (end > nvme_zone_rd_boundary(ns, next_zone)) {
> +            /*
> +             * As zone size is much larger than a typical maximum
> +             * i/o size in real hardware, allow the i/o range
> +             * to span no more than one pair of zones.
> +             */
> +            status = NVME_ZONE_BOUNDARY_ERROR;

While it makes an assumption, it seems totally fair. But it is irrelevant
if this is removed, as I propose below.

> +        } else {
> +            wp1 = zone->w_ptr;
> +            wp2 = next_zone->w_ptr;
> +            if (wp2 == bndry) {
> +                if (slba >= wp1) {
> +                    /* Again, no i/o necessary, just fill */
> +                    rfc->pre_rd_fill_slba = slba;
> +                    rfc->pre_rd_fill_nlb = nlb;
> +                    rfc->read_nlb = 0;
> +                } else {
> +                    rfc->read_nlb = wp1 - slba;
> +                    rfc->post_rd_fill_slba = wp1;
> +                    rfc->post_rd_fill_nlb = nlb - rfc->read_nlb;
> +                }
> +            } else if (slba < wp1) {
> +                if (end > wp2) {
> +                    if (wp1 == bndry) {
> +                        rfc->post_rd_fill_slba = wp2;
> +                        rfc->post_rd_fill_nlb = end - wp2;
> +                        rfc->read_nlb = wp2 - slba;
> +                    } else {
> +                        rfc->pre_rd_fill_slba = wp2;
> +                        rfc->pre_rd_fill_nlb = end - wp2;
> +                        rfc->read_nlb = wp2 - slba;
> +                        rfc->post_rd_fill_slba = wp1;
> +                        rfc->post_rd_fill_nlb = bndry - wp1;
> +                    }
> +                } else {
> +                    rfc->post_rd_fill_slba = wp1;
> +                    rfc->post_rd_fill_nlb = bndry - wp1;
> +                }
> +            } else {
> +                if (end > wp2) {
> +                    rfc->pre_rd_fill_slba = slba;
> +                    rfc->pre_rd_fill_nlb = end - slba;
> +                    rfc->read_slba = bndry;
> +                    rfc->read_nlb = wp2 - bndry;
> +                } else {
> +                    rfc->read_slba = bndry;
> +                    rfc->read_nlb = end - bndry;
> +                    rfc->post_rd_fill_slba = slba;
> +                    rfc->post_rd_fill_nlb = bndry - slba;
> +                }
> +            }
> +        }

This seems to use the read boundary (zone size), but the gap between ZCAP
and ZSZE should also be filled with the fill pattern. That brings in one
more gap, which causes this code to have even more edge cases.

This feels like an ad hoc solution, and this pre/post filling strategy is
going to fall short if (when) we implement support for DSM, since that
may introduce deallocated blocks all over the place. Technically, this is
not your problem (but then it is probably going to be my headache soon,
since I'd like to introduce DSM support), so I would prefer that we find
a better solution here. I think my work on DULBE, and relying on the
ability of the block layer to guarantee zeroes in most cases, is useful
here and a better solution than faking it like this.
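
To illustrate (this is not code from either series, just a rough sketch
that glosses over the qsg/iov split): if zone resets discard the zone and
the underlying image guarantees zeroes for unallocated clusters, the read
path needs no fill stages at all and can issue a single read for the
whole range, e.g.

    /* data_offset and data_size derived from slba/nlb as usual */
    block_acct_start(blk_get_stats(blk), &req->acct, data_size,
                     BLOCK_ACCT_READ);
    req->aiocb = blk_aio_preadv(blk, data_offset, &req->iov, 0,
                                nvme_rw_cb, req);
    return NVME_NO_COMPLETE;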

I fear that adding fill_pattern support in a way that only works for
zoned namespaces is going to cause us a lot of headaches if we want to
support it (and DSM) for NVM namespaces as well.

I propose that you drop the fill_pattern feature and just rely on the
block layer for this (possibly by integrating with the DULBE support that
I posted). This would still allow RAZB to be trivially supported (even
across more than one boundary). For reference, my ZNS proposal supports
this (but not RAZB), along with the required discards on zone resets,
though I think it is missing a check that the zone size is a multiple of
the cluster size of the underlying blockdev, which is needed for zeroes
to be guaranteed with non-zero cluster sizes.
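
Something like the following in the namespace setup would cover that
check (zone_size_bytes and blk are illustrative names, not code from my
series):

    BlockDriverInfo bdi;

    if (bdrv_get_info(blk_bs(blk), &bdi) == 0 && bdi.cluster_size > 0 &&
        zone_size_bytes % bdi.cluster_size) {
        /* unwritten blocks are not guaranteed to read back as zeroes */
        error_setg(errp, "zone size must be a multiple of the image "
                   "cluster size");
        return -1;
    }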

> +    }
> +
> +    return status;
> +}
> +
> +static bool nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
> +                                      bool failed)
> +{
> +    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> +    NvmeZone *zone;
> +    uint64_t slba, start_wp = req->cqe.result64;
> +    uint32_t nlb;
> +
> +    if (rw->opcode != NVME_CMD_WRITE &&
> +        rw->opcode != NVME_CMD_ZONE_APPEND &&
> +        rw->opcode != NVME_CMD_WRITE_ZEROES) {
> +        return false;
> +    }
> +
> +    slba = le64_to_cpu(rw->slba);
> +    nlb = le16_to_cpu(rw->nlb) + 1;
> +    zone = nvme_get_zone_by_slba(ns, slba);
> +
> +    if (!failed && zone->w_ptr < start_wp + nlb) {
> +        /*
> +         * A preceding queued write to the zone has failed,
> +         * now this write is not at the WP, fail it too.
> +         */
> +        failed = true;
> +    }
> +
> +    if (failed) {
> +        if (zone->w_ptr > start_wp) {
> +            zone->w_ptr = start_wp;
> +            zone->d.wp = start_wp;
> +        }

This doesn't fix the data corruption. The example from my last review
still applies.

> +        req->cqe.result64 = 0;
> +    } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) {
> +        switch (nvme_get_zone_state(zone)) {
> +        case NVME_ZONE_STATE_IMPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> +        case NVME_ZONE_STATE_CLOSED:
> +        case NVME_ZONE_STATE_EMPTY:
> +            nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_FULL);
> +            /* fall through */
> +        case NVME_ZONE_STATE_FULL:
> +            break;
> +        default:
> +            assert(false);
> +        }
> +        zone->d.wp = zone->w_ptr;
> +    } else {
> +        zone->d.wp += nlb;
> +    }
> +
> +    return failed;
> +}
