From: Liran Alon
Subject: Re: [Qemu-devel] [QEMU PATCH v2 0/2]: KVM: i386: Add support for save and restore nested state
Date: Thu, 8 Nov 2018 02:13:31 +0200

Ping on my last reply.
I would like to reach an agreement on how v3 should look before just
implementing what I think is right.

Thanks,
-Liran

> On 3 Nov 2018, at 4:02, Liran Alon <address@hidden> wrote:
> 
> 
> 
>> On 2 Nov 2018, at 18:39, Jim Mattson <address@hidden> wrote:
>> 
>> On Thu, Nov 1, 2018 at 8:46 PM, Liran Alon <address@hidden> wrote:
>> 
>>> Hmm this makes sense.
>>> 
>>> This means, though, that the patch I have submitted here isn't good enough.
>>> My patch currently assumes that when QEMU attempts to get nested state from
>>> KVM, it should always set nested_state->size to the max size supported by
>>> KVM, as returned by kvm_check_extension(s, KVM_CAP_NESTED_STATE)
>>> (see kvm_get_nested_state() introduced in my patch).
>>> This indeed won't allow migration from a host with a new KVM to a host with
>>> an old KVM if the nested_state size was enlarged between these KVM versions.
>>> Which is obviously an issue.
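>>> 
>>> For reference, the flow in my patch is roughly the following (simplified,
>>> error handling omitted, and the exact field names are from memory):
>>> 
>>>     static int kvm_get_nested_state(X86CPU *cpu)
>>>     {
>>>         CPUX86State *env = &cpu->env;
>>>         /* Max nested_state size supported by this kernel. */
>>>         int max_size = kvm_check_extension(kvm_state, KVM_CAP_NESTED_STATE);
>>> 
>>>         if (!max_size) {
>>>             return 0;
>>>         }
>>> 
>>>         /* Always ask KVM to fill the maximum size it supports. */
>>>         env->nested_state->size = max_size;
>>>         return kvm_vcpu_ioctl(CPU(cpu), KVM_GET_NESTED_STATE,
>>>                               env->nested_state);
>>>     }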
>>> 
>>> Jim, I think my confusion comes from the fact that there is no clear
>>> documentation on how KVM_{GET,SET}_NESTED_STATE should be changed once we
>>> need to add more state to nested_state in future KVM versions. I think it's
>>> worth adding that to the IOCTLs' documentation.
>> 
>> The nested state IOCTLs aren't unique in this respect. Any changes to
>> the state saved by any of this whole family of state-saving ioctls
>> require opt-in from userspace.
>> 
>>> For example, let's assume we have a new KVM_CAP_NESTED_STATE_V2.
>>> In this scenario, does kvm_check_extension(s, KVM_CAP_NESTED_STATE) still
>>> return the size of nested_state v1, and does
>>> kvm_check_extension(s, KVM_CAP_NESTED_STATE_V2) return the size of
>>> nested_state v2?
>> 
>> Hmm...I don't recall kvm_check_extension(s, KVM_CAP_NESTED_STATE)
>> being part of my original design. The way I had envisioned it,
>> the set of capabilities enabled by userspace would be sufficient to
>> infer the maximum data size.
> 
> If the set of capabilities should be sufficient to infer the max size of
> nested_state, why did we code kvm_vm_ioctl_check_extension() such that on
> KVM_CAP_NESTED_STATE it returns the max size of nested_state?
> 
>> 
>> If, for example, we add a field to stash the time remaining for the
>> VMCS12 VMX preemption timer, then presumably, userspace will enable it
>> by enabling KVM_CAP_SAVE_VMX_PREEMPTION_TIMER (or something like
>> that), and then userspace will know that the maximum nested state data
>> is 4 bytes larger.
> 
> In that case, why did we define struct kvm_nested_state to hold a blob of
> data[] instead of separating it into well-defined blobs (e.g. currently one
> blob for vmcs12 and another one for shadow vmcs12)?
> Then, when we add a new component which is opt-in via a new KVM_CAP, we would
> add another well-defined blob to struct kvm_nested_state.
> 
> I think this is important because it allows us to specify in
> nested_state->flags which components are saved, and to create multiple
> VMState subsections with needed() methods for the various saved components.
> 
> This would, for example, still allow easy migration from a new QEMU which
> does stash the time remaining for the VMCS12 VMX preemption timer to an old
> QEMU which doesn't stash it, in case nested_state->flags specifies that this
> component is not saved (because L1 doesn't use the VMX preemption timer, for
> example).
> 
> This seems to fit more nicely with how the QEMU migration mechanism is
> defined and with the purpose of VMState subsections.
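> 
> Concretely, I imagine something like this on the QEMU side (just a sketch;
> the flag bit and the field holding the remaining timer value are invented
> here purely for illustration):
> 
>     static bool vmx_preemption_timer_needed(void *opaque)
>     {
>         X86CPU *cpu = opaque;
> 
>         /* Only send this subsection if KVM actually saved the component. */
>         return cpu->env.nested_state_flags & KVM_STATE_VMX_PREEMPTION_TIMER;
>     }
> 
>     static const VMStateDescription vmstate_vmx_preemption_timer = {
>         .name = "cpu/nested_state/vmx_preemption_timer",
>         .version_id = 1,
>         .minimum_version_id = 1,
>         .needed = vmx_preemption_timer_needed,
>         .fields = (VMStateField[]) {
>             VMSTATE_UINT32(env.nested_vmx_preemption_timer_remaining, X86CPU),
>             VMSTATE_END_OF_LIST()
>         }
>     };
> 
> When the flag isn't set, the subsection simply isn't sent, so an older dest
> QEMU which doesn't know about it can still accept the migration stream.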
> 
> In addition, if we define struct kvm_nested_state like this, we also won't
> need the "size" field, which needs to be carefully handled to avoid buffer
> overflows.
> (We would just define large enough buffers (with padding) for each opaque
> component such as vmcs12 and shadow vmcs12.)
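> 
> Something along these lines (a rough sketch only; the sizes, the header
> fields and the flag semantics are illustrative, not an exact layout
> proposal):
> 
>     struct kvm_nested_state {
>         __u16 flags;    /* which of the components below were saved */
>         __u16 format;
>         /* no "size" field needed anymore */
> 
>         __u64 vmxon_pa;
>         __u64 vmcs12_pa;
> 
>         __u8  vmcs12[0x1000];           /* fixed, padded blob */
>         __u8  shadow_vmcs12[0x1000];    /* fixed, padded blob */
> 
>         /*
>          * A future component (e.g. the remaining VMX preemption timer
>          * value) would be added here as another well-defined field,
>          * gated by a new flags bit and a new KVM_CAP opt-in.
>          */
>     };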
> 
>> 
>>> Also note that the approach suggested by Jim requires the mgmt-layer at the
>>> dest to be able to specify to QEMU which KVM_CAP_NESTED_STATE_V* capabilities
>>> it should enable on kvm_init().
>>> When we know we are migrating from a host which supports v1 to a host which
>>> supports v2, we should make sure that the dest QEMU doesn't enable
>>> KVM_CAP_NESTED_STATE_V2.
>>> However, when we are just launching a new machine on a host which supports
>>> v2, we do want QEMU to enable KVM_CAP_NESTED_STATE_V2 for that VM.
>> 
>> No, no, no. Even when launching a new VM on a host that supports v2,
>> you cannot enable v2 until you have passed the rollback horizon. Should
>> you decide to roll back the kernel with v2 support, you must be able
>> to move that new VM to a host with an old kernel.
> 
> If we use VMState subsections as I described above, QEMU should be able to
> know which components of nested_state are actively saved by KVM and therefore
> are *required* to be restored on the dest host in order to migrate without
> guest issues after it is resumed on the dest.
> This still allows migration from new hosts to old hosts in case the guest
> didn't enter a state which makes the new saved state required for migration
> to succeed.
> 
> If the mechanism works like this, the nested_state KVM_CAPs enabled at QEMU
> launch are only used to inform KVM which struct kvm_nested_state layout is
> used by userspace, not what is actually sent as part of the migration stream.
> 
> What are your thoughts on this?
> 
> -Liran
> 
>> 
>>> But on second thought, I'm not sure that this is the right approach either.
>>> We don't really want the used version of nested_state to be determined at
>>> kvm_init():
>>> * On the source QEMU, we actually want to determine it when preparing for
>>> migration, based on the support given by our destination host. If it's an
>>> old host, we would like to save an old-version nested_state, and if it's a
>>> new host, we would like to save our newest supported nested_state.
>>> * On the dest QEMU, we just want to be able to set the received nested_state
>>> in KVM.
>>> 
>>> Therefore, I don't think that we want this versioning to be based on KVM_CAP
>>> at all.
>>> It seems that we would want the process to behave as follows:
>>> 1) Mgmt-layer at dest queries the dest host's max supported nested_state size
>>>  (which should be returned from kvm_check_extension(KVM_CAP_NESTED_STATE)).
>>> 2) Mgmt-layer at source initiates migration to dest, requesting QEMU to send
>>>  a nested_state matching the dest's max supported nested_state size.
>>>  When saving nested state using the KVM_GET_NESTED_STATE IOCTL, QEMU will
>>>  specify in nested_state->size the *requested* size to be saved, and KVM
>>>  should save only the information which matches the version that worked with
>>>  that size (see the sketch right after this list).
>>> 3) After some sanity checks on the received migration stream, the dest host
>>>  uses the KVM_SET_NESTED_STATE IOCTL.
>>>  This IOCTL should deduce which information it should deploy based on the
>>>  given nested_state->size.
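>>> 
>>> Roughly, step 2 on the source would then look something like this (a sketch
>>> only; requested_size is whatever the mgmt-layer asked for, i.e. the dest
>>> host's max supported size, not this host's max):
>>> 
>>>     struct kvm_nested_state *state = g_malloc0(requested_size);
>>> 
>>>     state->size = requested_size;
>>>     int ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_NESTED_STATE, state);
>>>     /* KVM fills only the state matching the version that fits this size. */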
>>> 
>>> This also makes me wonder if it's not just nicer to use nested_state->flags
>>> to specify which information is actually present in nested_state, instead of
>>> managing versioning with nested_state->size.
>> 
>> Yes, you can use nested_state->flags to determine what the data
>> payload is, but you cannot enable new flags unless userspace opts in.
>> This is just like KVM_CAP_EXCEPTION_PAYLOAD for kvm_vcpu_events. The
>> flag, KVM_VCPUEVENT_VALID_PAYLOAD, can only be set on the saved vcpu
>> events if userspace has opted in with KVM_CAP_EXCEPTION_PAYLOAD. This
>> is because older kernels will reject kvm_vcpu_events that have the
>> KVM_VCPUEVENT_VALID_PAYLOAD flag set.
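>> 
>> (The opt-in itself is the usual KVM_ENABLE_CAP flow on the VM fd; roughly,
>> assuming a raw-ioctl userspace:
>> 
>>     struct kvm_enable_cap cap = {
>>         .cap = KVM_CAP_EXCEPTION_PAYLOAD,
>>         .args[0] = 1,
>>     };
>> 
>>     ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
>> 
>> Only after this opt-in may the kernel report KVM_VCPUEVENT_VALID_PAYLOAD in
>> kvm_vcpu_events, and only then should userspace ever set that flag when
>> calling KVM_SET_VCPU_EVENTS.)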
>> 
>> You don't need a new KVM_CAP_NESTED_STATE_V2 ioctl. You just need
>> buy-in from userspace for any new data payload. Explicitly enumerating
>> the payload components in the flags field makes perfect sense.



