Re: [Qemu-devel] [libvirt] IO accounting overhaul
From: Benoît Canet
Subject: Re: [Qemu-devel] [libvirt] IO accounting overhaul
Date: Mon, 1 Sep 2014 12:44:38 +0200
User-agent: Mutt/1.5.23 (2014-03-12)
On Monday, 01 Sep 2014 at 11:52:00 (+0200), Markus Armbruster wrote:
> Cc'ing libvirt following Stefan's lead.
>
> Benoît Canet <address@hidden> writes:
>
> > Hi,
> >
> > I collected some items of a cloud provider wishlist regarding I/O accounting.
>
> Feedback from real power-users, lovely!
>
> > In a cloud, I/O accounting can serve three purposes: billing, helping
> > the customers, and doing metrology to help the cloud provider seek out
> > hidden costs.
> >
> > I'll cover the two former topics in this mail because they are the
> > most important business-wise.
> >
> > 1) preferred place to collect billing IO accounting data:
> > --------------------------------------------------------
> > For billing purposes the collected data must be as close as possible
> > to what the customer would see by using iostat in his VM.
>
> Good point.
>
> > The first conclusion we can draw is that the choice of collecting I/O
> > accounting data used for billing in the block device models is right.
>
> Slightly rephrasing: doing I/O accounting in the block device models is
> right for billing.
>
> There may be other uses for I/O accounting, with different preferences.
> For instance, data on how exactly guest I/O gets translated to host I/O
> as it flows through the nodes in the block graph could be useful.
I think this is the third point, which I named metrology.
Basically it boils down to "Where are the hidden I/O costs of the QEMU
block layer?".
>
> Doesn't diminish the need for accurate billing information, of course.
>
> > 2) what to do with occurrences of rare events:
> > ---------------------------------------------
> >
> > Another point is that QEMU developers agree that they don't know which
> > policy to apply to some I/O accounting events.
> > Must QEMU discard invalid write I/Os or account them as done?
> > Must QEMU count a failed read I/O as done?
> >
> > When discussing this with a cloud provider, it appeared that these
> > decisions are really specific to each cloud provider and QEMU should
> > not implement them.
>
> Good point, consistent with the old advice to avoid baking policy into
> inappropriately low levels of the stack.
>
> > The right thing to do is to add accounting counters to collect these
> > events.
> >
> > Moreover these rare events are precious troubleshooting data, so that's
> > an additional reason not to toss them.
>
> Another good point.
>
> > 3) list of block I/O accounting metrics wished for billing and helping
> > the customers
> > -----------------------------------------------------------------------------------
> >
> > Basic I/O accounting data will end up making the customers' bills.
> > Extra I/O accounting information would be a precious help for the
> > cloud provider to implement a monitoring panel like Amazon CloudWatch.
>
> These are the first two from your list of three purposes, i.e. the ones
> you promised to cover here.
>
> > Here is the list of counters and statistics I would like to help
> > implement in QEMU.
> >
> > This is the most important part of the mail and the one I would like
> > the community to review the most.
> >
> > Once this list is settled I would proceed to implement the required
> > infrastructure in QEMU before using it in the device models.
>
> For context, let me recap how I/O accounting works now.
>
> The BlockDriverState abstract data type (short: BDS) can hold the
> following accounting data:
>
> uint64_t nr_bytes[BDRV_MAX_IOTYPE];
> uint64_t nr_ops[BDRV_MAX_IOTYPE];
> uint64_t total_time_ns[BDRV_MAX_IOTYPE];
> uint64_t wr_highest_sector;
>
> where BDRV_MAX_IOTYPE enumerates read, write, flush.
>
> wr_highest_sector is a high watermark updated by the block layer as it
> writes sectors.
>
> The other three are *not* touched by the block layer. Instead, the
> block layer provides a pair of functions for device models to update
> them:
>
> void bdrv_acct_start(BlockDriverState *bs, BlockAcctCookie *cookie,
> int64_t bytes, enum BlockAcctType type);
> void bdrv_acct_done(BlockDriverState *bs, BlockAcctCookie *cookie);
>
> bdrv_acct_start() initializes cookie for a read, write, or flush
> operation of a certain size. The size of a flush is always zero.
>
> bdrv_acct_done() adds the operations to the BDS's accounting data.
> total_time_ns is incremented by the time between _start() and _done().
>
> You may call _start() without calling _done(). That's a feature.
> Device models use it to avoid accounting some requests.
>
> Device models are not supposed to mess with cookie directly, only
> through these two functions.
>
> Some device models implement accounting, some don't. The ones that do
> don't agree on how to count invalid guest requests (the ones not passed
> to block layer) and failed requests (passed to block layer and failed
> there). It's a mess in part caused by us never writing down what
> exactly device models are expected to do.
>
> Accounting data is used by "query-blockstats", and nothing else.
>
> Corollary: even though every BDS holds accounting data, only the ones in
> "top" BDSes ever get used. This is a common block layer blemish, and
> we're working on cleaning it up.
>
> If a device model doesn't implement accounting, query-blockstats lies.
> Fortunately, its lies are pretty transparent (everything's zero) as long
> as you don't do things like connecting a backend to a device model that
> doesn't implement accounting after disconnecting it from a device model
> that does. Still, I'd welcome a more honest QMP interface.
>
> For me, this accounting data belongs to the device model, not the BDS.
> Naturally, the block device models should use common infrastructure. I
> guess they use the block layer only because it's obvious infrastructure
> they share. Clumsy design.
>
> > /* volume of data transferred by the I/Os */
> > read_bytes
> > write_bytes
>
> This is nr_bytes[BDRV_ACCT_READ] and nr_bytes[BDRV_ACCT_WRITE].
>
> nr_bytes[BDRV_ACCT_FLUSH] is always zero.
>
> Should this count only actual I/O, i.e. accumulated size of successful
> operations?
>
> > /* operation count */
> > read_ios
> > write_ios
> > flush_ios
> >
> > /* how many invalid I/Os the guest submitted */
> > invalid_read_ios
> > invalid_write_ios
> > invalid_flush_ios
> >
> > /* how many I/O errors happened */
> > read_ios_error
> > write_ios_error
> > flush_ios_error
>
> This is nr_ops[BDRV_ACCT_READ], nr_ops[BDRV_ACCT_WRITE],
> nr_ops[BDRV_ACCT_FLUSH] split up into successful, invalid and failed.
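One possible layout for that split (a hypothetical sketch, not current QEMU code): keep a separate success/invalid/failure count per I/O type instead of a single nr_ops array.

```c
#include <assert.h>
#include <stdint.h>

enum { ACCT_READ, ACCT_WRITE, ACCT_FLUSH, ACCT_MAX };

typedef struct {
    uint64_t ios[ACCT_MAX];          /* read_ios, write_ios, flush_ios */
    uint64_t invalid_ios[ACCT_MAX];  /* guest requests rejected before I/O */
    uint64_t failed_ios[ACCT_MAX];   /* reached the block layer and failed */
} BlockAcctCounters;
```

Each request then ends up in exactly one of the three buckets, which is what keeps the policy decision out of QEMU.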
>
> > /* account the time passed doing IOs */
> > total_read_time
> > total_write_time
> > total_flush_time
>
> This is total_time_ns[BDRV_ACCT_READ], total_time_ns[BDRV_ACCT_WRITE],
> total_time_ns[BDRV_ACCT_FLUSH].
>
> I guess this should count both successful and failed I/O. Could throw
> in invalid, too, but it's probably too quick to matter.
>
> Please specify the unit clearly. Both total_FOO_time_ns or total_FOO_ns
> would work for me.
Yes _ns is fine for me too.
>
> > /* since when the volume is iddle */
> > qvolume_iddleness_time
>
> "idle"
>
> The obvious way to maintain this information with the current code
> would be saving the value of get_clock() in bdrv_acct_done().
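A minimal sketch of that approach, with hypothetical names: record the completion time of the last request in the _done() path, and report idleness as the distance from now.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int64_t last_access_ns;  /* updated each time a request completes */
} IdleTracker;

/* Call from the accounting completion path. */
void idle_mark_activity(IdleTracker *t, int64_t now_ns)
{
    t->last_access_ns = now_ns;
}

/* How long the volume has been idle as of now_ns. */
int64_t idle_time_ns(const IdleTracker *t, int64_t now_ns)
{
    return now_ns - t->last_access_ns;
}
```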
>
> > /* the following would compute latencies for slices of 1 second, then
> > * toss the result and start a new slice. A weighted summation of the
> > * instant latencies could help to implement this.
> > */
> > 1s_read_average_latency
> > 1s_write_average_latency
> > 1s_flush_average_latency
> >
> > /* the former three numbers could be used to further compute a 1
> > minute slice value */
> > 1m_read_average_latency
> > 1m_write_average_latency
> > 1m_flush_average_latency
> >
> > /* the former three numbers could be used to further compute a 1 hour
> > slice value */
> > 1h_read_average_latency
> > 1h_write_average_latency
> > 1h_flush_average_latency
>
> This is something like "what we added to total_FOO_time in the last
> completed 1s / 1m / 1h time slice divided by the number of additions".
> Just another way to accumulate the same raw data, thus no worries.
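A sketch of one slice under that scheme (hypothetical names, integer math): accumulate latency and op count in the open slice; when the slice elapses, publish sum/count as the average and reset. The 1m and 1h values can then average the published 1s values the same way.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t sum_ns;   /* latency accumulated in the open slice */
    uint64_t count;    /* ops completed in the open slice */
    uint64_t avg_ns;   /* average published for the last closed slice */
} LatencySlice;

/* Called from the accounting completion path. */
void slice_account(LatencySlice *s, uint64_t latency_ns)
{
    s->sum_ns += latency_ns;
    s->count++;
}

/* Called by a periodic timer when the slice elapses. */
void slice_rollover(LatencySlice *s)
{
    s->avg_ns = s->count ? s->sum_ns / s->count : 0;
    s->sum_ns = 0;
    s->count = 0;
}
```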
>
> > /* 1 second average number of requests in flight */
> > 1s_read_queue_depth
> > 1s_write_queue_depth
> >
> > /* 1 minute average number of requests in flight */
> > 1m_read_queue_depth
> > 1m_write_queue_depth
> >
> > /* 1 hour average number of requests in flight */
> > 1h_read_queue_depth
> > 1h_write_queue_depth
>
> I guess this involves counting bdrv_acct_start() and bdrv_acct_done().
> The "you need not call bdrv_acct_done()" feature may get in the way.
> Solvable.
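A time-weighted in-flight counter is one way to get the average depth; a hypothetical sketch: the average over a slice is the integral of the in-flight count over time, divided by the slice length.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t in_flight;      /* requests between _start() and _done() */
    int64_t last_change_ns;  /* when in_flight last changed */
    uint64_t weighted_ns;    /* sum of in_flight * elapsed time */
} QueueDepth;

/* Fold the elapsed interval at the current depth into the integral. */
void qd_update(QueueDepth *q, int64_t now_ns)
{
    q->weighted_ns += q->in_flight * (uint64_t)(now_ns - q->last_change_ns);
    q->last_change_ns = now_ns;
}

void qd_start(QueueDepth *q, int64_t now_ns)
{
    qd_update(q, now_ns);
    q->in_flight++;
}

void qd_done(QueueDepth *q, int64_t now_ns)
{
    qd_update(q, now_ns);
    q->in_flight--;
}

/* Average depth over a slice of slice_ns; caller resets weighted_ns after. */
double qd_average(QueueDepth *q, int64_t now_ns, int64_t slice_ns)
{
    qd_update(q, now_ns);
    return (double)q->weighted_ns / (double)slice_ns;
}
```

A request whose _done() is never called would pin in_flight high forever, which is where the "you need not call bdrv_acct_done()" feature gets in the way.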
>
> Permit me a short detour into the other use for I/O accounting I
> mentioned: data on how exactly guest I/O gets translated to host I/O as
> it flows through the nodes in the block graph. Do you think this would
> be pretty much the same data, just collected at different points?
That's something I would like to take care of in a further sub-project:
optionally collecting the same data for each BDS of the graph.
>
> > 4) Making this happen
> > -------------------------
> >
> > Outscale wants to make these I/O stats happen and gave me the
> > go-ahead to do whatever grunt work is required to do so.
> > That said, we could collaborate on some parts of the work.
>
> Cool!
>
> A quick stab at tasks:
>
> * QMP interface, either a compatible extension of query-blockstats or a
> new one.
I would like to extend query-blockstats at first, but I would also
like to postpone the QMP interface changes and just write the shared
infrastructure and deploy it in the device models.
>
> * Rough idea on how to do the shared infrastructure.
- API-wise I think about adding bdrv_acct_invalid() and
bdrv_acct_failed(), and systematically issuing bdrv_acct_start() as
soon as possible.
- To calculate the averages I think about a global timer firing every
second and iterating over the BDS list to make the computations even
when there is no I/O activity. Is it acceptable to have a qemu_mutex
per statistics structure?
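A hypothetical sketch of what those completion variants could look like (simplified types, not the real QEMU API): every request calls acct_start() as soon as it arrives, then exactly one of acct_done(), acct_invalid(), or acct_failed().

```c
#include <assert.h>
#include <stdint.h>

enum AcctType { ACCT_READ, ACCT_WRITE, ACCT_FLUSH, ACCT_MAX };

typedef struct {
    uint64_t done[ACCT_MAX];
    uint64_t invalid[ACCT_MAX];
    uint64_t failed[ACCT_MAX];
} AcctStats;

typedef struct { enum AcctType type; } AcctCookie;

/* Issued as soon as the request arrives, even if it later proves invalid. */
void acct_start(AcctCookie *c, enum AcctType type) { c->type = type; }

/* Exactly one of the three completion paths per request. */
void acct_done(AcctStats *s, AcctCookie *c)    { s->done[c->type]++; }
void acct_invalid(AcctStats *s, AcctCookie *c) { s->invalid[c->type]++; }
void acct_failed(AcctStats *s, AcctCookie *c)  { s->failed[c->type]++; }
```

This keeps the device models free of policy: they just report what happened, and the cloud provider decides how to bill each bucket.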
>
> * Implement (can be split up into several tasks if desired)
First I would like to write a series implementing a backward-compatible
API and get it merged.
Then the deployment of the new API specifics in the device models can
be split up and parallelized.
Best regards
Benoît
>
> --
> libvir-list mailing list
> address@hidden
> https://www.redhat.com/mailman/listinfo/libvir-list