qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-


From: Vladimir Sementsov-Ogievskiy
Subject: Re: [PATCH v4 for-6.0? 0/3] qcow2: fix parallel rewrite and discard (rw-lock)
Date: Tue, 30 Mar 2021 21:33:17 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.9.0

30.03.2021 19:39, Max Reitz wrote:
On 30.03.21 15:25, Vladimir Sementsov-Ogievskiy wrote:
30.03.2021 15:51, Max Reitz wrote:
On 30.03.21 12:51, Vladimir Sementsov-Ogievskiy wrote:
30.03.2021 12:49, Max Reitz wrote:
On 25.03.21 20:12, Vladimir Sementsov-Ogievskiy wrote:
ping. Do we want it for 6.0?

I’d rather wait.  I think the conclusion was that guests shouldn’t hit this 
because they serialize discards?

I think, that we never had bugs, so we of course can wait.


There’s also something Kevin wrote on IRC a couple of weeks ago, for which I 
had hoped he’d sent an email but I don’t think he did, so I’ll try to remember 
and paraphrase as well as I can...

He basically asked whether it wouldn’t be conceptually simpler to take a 
reference to some cluster in get_cluster_offset() and later release it with a 
to-be-added put_cluster_offset().

He also noted that reading is problematic, too, because if you read a discarded 
and reused cluster, this might result in an information leak (some guest 
application might be able to read data it isn’t allowed to read); that’s why 
making get_cluster_offset() the point of locking clusters against discarding 
would be better.

Yes, I thought about read too, (RFCed in cover letter of [PATCH v5 0/6] qcow2: 
fix parallel rewrite and discard (lockless))


This would probably work with both of your solutions.  For the in-memory 
solutions, you’d take a refcount to an actual cluster; in the CoRwLock 
solution, you’d take that lock.

What do you think?


Hmm. What do you mean? Just rename my qcow2_inflight_writes_inc() and 
qcow2_inflight_writes_dec() to get_cluster_offset()/put_cluster_offset(), to 
make it more native to use for read operations as well?

Hm.  Our discussion wasn’t so detailed.

I interpreted it to mean all qcow2 functions that find an offset to a qcow2 
cluster, namely qcow2_get_host_offset(), qcow2_alloc_host_offset(), and 
qcow2_alloc_compressed_cluster_offset().

What about qcow2_alloc_clusters() ?

Seems like all callers for that but do_alloc_cluster_offset() call it to 
allocate metadata clusters, which cannot be discarded from the guest.

Is it really possible that some metadata cluster is used while qcow2 discards 
it internally at the same time, or isn’t all of this only a problem for data 
clusters?

It's a good question which I think doesn't have fast answer.

I think:

1. On normal code paths we usually handle qcow2 metadata always under s->lock 
mutex. So, we can't decrease refcount to zero in parallel, as update_refcount() 
should be called under s->lock too.

2. [1] is violated at least by code paths that are not in coroutine (for example 
qcow2-snapshots). But seems that on these paths we are guarded by other things 
(like drained section).. For sure we should move the whole qcow2 driver to 
coroutine and do every metadata update under s->lock.

3. Does [1] violated on some coroutine paths I don't know. But I hope that it 
doesn't, otherwise we should have bugs..

In future we'll probably want to work with metadata without s->lock, at least for 
IO. Then we'll need same mechanisms like we are now inventing for data writes.. But 
I'm not sure that for metadata it will be so simple (as for now mutex prevents not 
only parallel write & discard of metadata, but also parallel updates. (And for 
parallel guest writes to same cluster we don't care, it's a problem of the guest)


When those functions return an offset (in)to some cluster, that cluster (or the 
image as a whole) should be locked against discards. Every offset received this 
way would require an accompanying qcow2_put_host_offset().

Or to update any kind of "getting cluster offset" in the whole qcow2 driver to take a 
kind of "dynamic reference count" by get_cluster_offset() and then call corresponding 
put() somewhere? In this case I'm afraid it's a lot more work..

Hm, really?  I would have assumed we need to do some locking in all functions 
that get a cluster offset this way, so it should be less work to take the lock 
in the functions they invoke to get the offset.

It would be also the problem that a lot of paths in qcow2 are not in coroutine and 
don't even take s->lock when they actually should.

I’m not sure what you mean here, because all functions that invoke any of the 
three functions I listed above are coroutine_fns (or, well, I didn’t look it 
up, but they all have *_co_* in their name).

qcow2_alloc_clusters() has a lot more callers..

Let’s hope we don’t need to worry about it then. O:)

This will also mean that we do same job as normal qcow2 refcounts already do: no sense in 
keeping additional "dynamic refcount" for L2 table cluster while reading it, as 
we already have non-zero qcow2 normal refcount for it..

I’m afraid I don’t understand how normal refcounts relate to this. For example, 
qcow2_get_host_offset() doesn’t touch refcounts at all.


I mean the following: remember our discussion about what is free-cluster. If we add 
"dynamic-refcount", or "infligth-write-counter" thing only to count inflight data-writing 
(or, as discussed, we should count data-reads as well) operations, than "full reference count" of 
the cluster is inflight-write-count + qcow2-metadata-refcount.

But if we add a kind of "dynamic refcount" for any use of host cluster, for example reading of L2 table, than 
we duplicate the reference in qcow2-metadata to this L2 table (represented as refcount) by our "dynamic 
refcount", and we don't have a concept of "full reference count" as the sum above.. We still should 
treat a cluster as free when both "dynamic refcount" and qcow2-metadata-refcount are zero, but their sum 
doesn't have a good sense. Not a problem maybe.. But looks like a complication with no benefit.

You’re right, but I don’t think that’s a real problem.  Perhaps the sum was 
even a false equivalency.  There is a difference between the dynamic refcount 
and the on-disk refcount: We must wait with discarding until the the dynamic 
refcount is 0, and discarding will then drop the on-disk refcount to 0, too.  I 
think.  So I’m not sure whether the sum really means anything.

But if metadata isn’t a problem and that means we don’t have ask these 
questions at all, well, that’ll be even better.

==

OK, I think now that you didn't mean qcow2_alloc_clusters(). So, we are saying about only 
functions returning an offset to cluster with "guest data", not to any kind of 
host cluster. Than what you propose looks like this to me:

  - take my v5
  - rename qcow2_inflight_writes_dec() to put_cluster_offset()
  - call qcow2_inflight_writes_inc() from the three functions you mention

Yes, I think so.  Or you take the CoRwLock in those three functions, depending 
on which implementation we want.

Yes.


Sorry if we’ve discussed this before and I just forgot, but: What are the 
performance implications of either solution?  As far as I remember, the 
inflight-write-counter solution had the problem of always doing stuff on every 
I/O access.  You said the impact was small and yes, it is, but it’s still there.

Yes. I think, I can measure it on tmpfs, and if degradation less than 1%, I'd 
just ignore it. Still inflight-counter solution is more complicated.

Hmm, note that CoRwLock solution has one non-trivial thing too: moving the lock 
to task coroutine. With inflight-counters we don't need such thing.

I haven’t looked at the CoRwLock solution but it I would assume it’s basically 
zero cost for common cases, right?  I.e. the case where the guest already 
serializes discards from other accesses, which I thought is what e.g. Linux 
does.  (And even if it doesn’t, I would assume that concurrent I/O and discards 
are rather rare.)

With CoRwLock any kind of host-cluster-discard becomes blocking. We block the 
guest io (remember that we want to protect reads too), wait until all 
operations completed, then we do discard, than unfreeze the whole io.. What 
about removing persistent dirty bitmap? It becomes such an operation.. Hmm. Can 
I imagine another example? Switch-to-snapshot should be in a drained section 
anyway.. Rewriting compressed cluster to normal cluster is rare case. Hmm what 
else?

We don't have it now, but what about a feature like this:

 During block-commit, discard clusters that are already committed to base from 
all intermediate images: thus we avoid increasing of disk-usage during commit 
job.

But, anyway, we don't have such feature yet.

So, I think that in general lockless solution is better.. But currently we 
don't have a significant use-case which would be a show-stopper for CoRwLock 
solution. Or at least I can't imagine it.

Ha, or, I have one in mind: mirror. mirror does discards (it has 
MIRROR_METHOD_DISCARD). Honestly, I always doubt about that fact. I think that 
mirror job should not use discard at all, it should either write or 
write-zeroes, nothing else.. But it's my opinion, probably I don't know the 
use-case. Anyway, with current implementation of mirror-job write-lock on 
discard may reduce performance of mirror job (and of guest as well) in cases 
when mirror does discard operations.. Mirror does discard when block_status 
return neither ZERO nor DATA. And accordingly to my old mail 
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg05136.html , only 
iscsi may reasonably report such block status... Hmm. Any thoughts?



That make sense to me. Still, put_cluster_offset() name doesn't make it obvious that it's 
only for clusters with "guest data", and we shouldn't call it when work with 
metadata clusters.

Yeah, it was meant for symmetry with qcow2_get_host_offset() (i.e. it would be 
named “qcow2_put_host_offset()”).  Now that there are three functions that 
would take a reference, it should get some other name.  I don’t know.  
qcow2_put_data_cluster_offset()?

Max



--
Best regards,
Vladimir



reply via email to

[Prev in Thread] Current Thread [Next in Thread]