From: Klaus Jensen
Subject: Re: [RFC PATCH 00/11] hw/nvme: reimplement all multi-aio commands with custom aiocbs
Date: Mon, 7 Jun 2021 08:17:24 +0200

Hi Vladimir,

Thanks for taking the time to look through this!

I'll try to comment on all your observations below.

On Jun  7 08:14, Vladimir Sementsov-Ogievskiy wrote:
04.06.2021 09:52, Klaus Jensen wrote:
From: Klaus Jensen <k.jensen@samsung.com>

This series reimplements flush, dsm, copy, zone reset and format nvm to
allow cancellation. I posted an RFC back in March ("hw/block/nvme:
convert ad-hoc aio tracking to aiocb") and I've applied some feedback
from Stefan and reimplemented the remaining commands.

The basic idea is to define custom AIOCBs for these commands. The custom
AIOCB takes care of issuing all the "nested" AIOs one by one instead of
blindly sending them off simultaneously without tracking the returned
aiocbs.
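
To make the idea concrete, a minimal sketch of such a custom AIOCB for Flush could look like the following; the names (NvmeFlushAIOCB, nvme_flush_cancel, nvme_flush_aiocb_info) and fields are illustrative assumptions for this sketch, not the actual code from the series:

#include "qemu/osdep.h"
#include "block/aio.h"
#include "sysemu/block-backend.h"

/* Illustrative sketch: one outer AIOCB that issues the nested AIOs one
 * by one and keeps a reference to the currently outstanding one so it
 * can be cancelled. */
typedef struct NvmeFlushAIOCB {
    BlockAIOCB common;   /* embedded generic AIOCB */
    BlockAIOCB *aiocb;   /* currently outstanding nested AIO, if any */
    int nsid;            /* next namespace to flush */
    int ret;
} NvmeFlushAIOCB;

static void nvme_flush_cancel(BlockAIOCB *acb)
{
    NvmeFlushAIOCB *iocb = container_of(acb, NvmeFlushAIOCB, common);

    iocb->ret = -ECANCELED;

    if (iocb->aiocb) {
        /* propagate the cancellation to the nested AIO */
        blk_aio_cancel_async(iocb->aiocb);
    }
}

static const AIOCBInfo nvme_flush_aiocb_info = {
    .aiocb_size   = sizeof(NvmeFlushAIOCB),
    .cancel_async = nvme_flush_cancel,
};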

I'm not familiar with nvme. But intuitively, isn't it less efficient to send multiple commands one by one? Overall, wouldn't it be slower?

No, you are right, it is of course slower overall.

In the block layer we mostly do the opposite transition: instead of doing I/O operations one by one, we run them simultaneously to keep a non-empty queue on the device, even on a single device. This way overall performance is increased.


Of these commands, Copy is the only one that I would consider optimizing like this. But the most obvious use of the Copy command is host-driven garbage collection in the context of zoned namespaces, and I would not consider that operation to be performance critical in terms of latency. All regular I/O commands are "one aiocb" and don't need any of this, and we already "parallelize" those heavily.

If you need to store nested AIOCBs, you may store them in a list, for example, and cancel them in a loop, keeping a simultaneous start for all flushes. If you send two flushes to two different disks, what's the reason to wait for the first flush to finish before issuing the second?


Keeping a list of returned aiocbs was actually my initial approach. But when I looked at hw/ide I got the impression that the AIOCB approach was the right one. My first approach involved adding an aiocblist to the core NvmeRequest structure, but I ended up killing that approach because I didn't want to deal with it on the normal I/O path.

But you are absolutely correct that waiting for the first flush to finish is suboptimal.
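
For illustration, the list-based alternative could look roughly like this; all names here (NvmeFlushCtx, nvme_flush_all, nvme_flush_cancel_all) and the fixed array bound are assumptions for the sketch, not code from the series:

/* Hypothetical sketch: issue all flushes simultaneously, remember the
 * returned aiocbs, and cancel them in a loop if the request is aborted. */
typedef struct NvmeFlushCtx {
    BlockAIOCB *aiocbs[256];    /* assumed upper bound on namespaces */
    int count;
} NvmeFlushCtx;

static void nvme_flush_all(NvmeFlushCtx *ctx, BlockBackend **blks, int n,
                           BlockCompletionFunc *cb, void *opaque)
{
    for (int i = 0; i < n; i++) {
        /* all flushes are in flight at the same time */
        ctx->aiocbs[i] = blk_aio_flush(blks[i], cb, opaque);
    }
    ctx->count = n;
}

static void nvme_flush_cancel_all(NvmeFlushCtx *ctx)
{
    for (int i = 0; i < ctx->count; i++) {
        if (ctx->aiocbs[i]) {
            blk_aio_cancel_async(ctx->aiocbs[i]);
        }
    }
}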


I've kept the RFC since I'm still new to using the block layer like
this. I was hoping that Stefan could find some time to look over this -
this is a huge series, so I don't expect non-nvme folks to spend a large
amount of time on it, but I would really like feedback on my approach in
the reimplementation of flush and format.

If I understand your code correctly, you start the next I/O operation from the callback of the previous one. It works, and it is similar to how the mirror block job operated some time ago (it still maintained several in-flight requests simultaneously, but I'm talking about the use of the _aio_ functions). Still, mirror no longer uses the _aio_ functions like this.

A better approach for calling several block layer I/O functions one by one is to create a coroutine. You may just add a coroutine function that does all your linear logic as you want, without any callbacks, like:

static void coroutine_fn nvme_co_flush(void *opaque)
{
    for (...) {
        blk_co_flush(...);
    }
}

(and you'll need qemu_coroutine_create() and qemu_coroutine_enter() to start a coroutine).
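
A minimal sketch of starting such a coroutine, assuming the nvme_co_flush() entry point above receives the in-flight request as its opaque argument (the req variable is illustrative):

/* Create the coroutine and enter it; the coroutine then runs the
 * linear logic above, yielding inside blk_co_flush() while each
 * flush is in flight. */
Coroutine *co = qemu_coroutine_create(nvme_co_flush, req);
qemu_coroutine_enter(co);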


So, this is definitely a tempting way to implement this. I must admit that I did not consider it like this because I thought it was at the wrong level of abstraction (it looked to me like something that belonged in block/, not hw/). Again, I referred to the Trim implementation in hw/ide as the source of inspiration for the sequential AIOCB approach.

Still, I'm not sure that moving from issuing several I/O commands simultaneously to issuing them sequentially is a good idea. And this way you of course can't use blk_aio_cancel. This leads to my last doubt:

One more thing I don't understand after a quick look at the series: how does cancellation work? It seems to me that you just call cancel on the nested AIOCBs produced by blk_<io_functions>, but none of them implement cancel. I see only four implementations of the .cancel_async callback in the whole QEMU code: in iscsi, in ide/core.c, in dma-helpers.c and in thread-pool.c. None of them seem related to blk_aio_flush() and the other blk_* functions you call in the series. Or what am I missing?


Right now, cancellation is only initiated by the device when a submission queue is deleted. This causes blk_aio_cancel() to be called on each BlockAIOCB (NvmeRequest.aiocb) for outstanding requests. In most cases this BlockAIOCB is a DMAAIOCB from softmmu/dma-helpers.c, which implements .cancel_async. Prior to this patchset, Flush, DSM, Copy and so on, did not have any BlockAIOCB to cancel since we did not keep references to the ongoing AIOs.

The blk_aio_cancel() call is synchronous, but it does call bdrv_aio_cancel_async(), which calls the .cancel_async callback if implemented. This means that we can now cancel ongoing DSM or Copy commands while they are processing their individual LBA ranges. So while blk_aio_cancel() subsequently waits for the AIO to complete, this may cause them to complete earlier than if they had run to full completion (i.e. if they did not implement .cancel_async).
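
As a rough illustration of that path (a sketch only; it assumes each outstanding NvmeRequest keeps its BlockAIOCB in req->aiocb and that the queue tracks outstanding requests in a QTAILQ, which may not match the series exactly):

static void nvme_sq_cancel_outstanding(NvmeSQueue *sq)
{
    NvmeRequest *req;

    QTAILQ_FOREACH(req, &sq->out_req_list, entry) {
        if (req->aiocb) {
            /* blk_aio_cancel() invokes the aiocb's .cancel_async
             * callback (via bdrv_aio_cancel_async()) and then waits
             * synchronously for the AIO to complete. */
            blk_aio_cancel(req->aiocb);
        }
    }
}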

There are two things I'd like to do subsequent to this patch series:

1. Fix the Abort command to actually do something. Currently the command is a no-op (which is allowed by the spec), but I'd like it to actually cancel the command that the host specified.

2. Make submission queue deletion asynchronous.

The infrastructure provided by this refactor should allow this if I am not mistaken.

Overall, I think this "sequentialization" makes it easier to reason about cancellation, but that might just be me ;)


Those commands are special in
that they may issue AIOs to multiple namespaces and thus to multiple block
backends. Since this device does not support iothreads, I've opted for
simply always returning the main loop aio context, but I wonder if this
is acceptable or not. It might be the case that this should contain an
assert of some kind, in case someone starts adding iothread support.
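
A sketch of what that could look like, assuming the AIOCBInfo .get_aio_context callback of that QEMU version (illustrative, not necessarily the code in the series):

/* With no iothread support, the custom AIOCB simply reports the main
 * loop AioContext; an assert could be placed here if iothread support
 * is ever added. */
static AioContext *nvme_get_aio_context(BlockAIOCB *acb)
{
    return qemu_get_aio_context();
}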

Klaus Jensen (11):
  hw/nvme: reimplement flush to allow cancellation
  hw/nvme: add nvme_block_status_all helper
  hw/nvme: reimplement dsm to allow cancellation
  hw/nvme: save reftag when generating pi
  hw/nvme: remove assert from nvme_get_zone_by_slba
  hw/nvme: use prinfo directly in nvme_check_prinfo and nvme_dif_check
  hw/nvme: add dw0/1 to the req completion trace event
  hw/nvme: reimplement the copy command to allow aio cancellation
  hw/nvme: reimplement zone reset to allow cancellation
  hw/nvme: reimplement format nvm to allow cancellation
  Partially revert "hw/block/nvme: drain namespaces on sq deletion"

 hw/nvme/nvme.h       |   10 +-
 include/block/nvme.h |    8 +
 hw/nvme/ctrl.c       | 1861 ++++++++++++++++++++++++------------------
 hw/nvme/dif.c        |   64 +-
 hw/nvme/trace-events |   21 +-
 5 files changed, 1102 insertions(+), 862 deletions(-)


