From: alvise rigo
Subject: Re: [Qemu-devel] [RFC v4 1/9] exec.c: Add new exclusive bitmap to ram_list
Date: Thu, 10 Sep 2015 19:36:01 +0200

Hi Alex,

On Thu, Sep 10, 2015 at 6:19 PM, Alex Bennée <address@hidden> wrote:
>
> alvise rigo <address@hidden> writes:
>
>> Hi Paolo,
>>
>> A brief update on this. I have a first implementation of the idea you
>> proposed, though it's not working really well. The failure rate of SCs
>> for some reason is very high.
>
> Due to high memory contention on the EXCL page?

I guess it's also due to the fact that we were using the bitmap itself
to decide whether a SC operation should fail.
The SC should rather fail when another vCPU has an intersecting
exclusive range set, not based on the state of the bitmap.
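
Reading that literally, the check would look something like the sketch
below (the vcpu struct, its excl_start/excl_end fields and the helpers
are made up for illustration, not the RFC's actual data structures):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU record of the protected range;
 * excl_end == 0 means no exclusive range is currently set. */
struct vcpu {
    uint64_t excl_start;
    uint64_t excl_end;
};

static bool ranges_intersect(uint64_t s1, uint64_t e1,
                             uint64_t s2, uint64_t e2)
{
    return s1 < e2 && s2 < e1;
}

/* SC fails only if some *other* vCPU has an exclusive range that
 * overlaps the address being stored to. */
static bool sc_may_succeed(const struct vcpu *cpus, int n_cpus,
                           int this_cpu, uint64_t addr, unsigned size)
{
    for (int i = 0; i < n_cpus; i++) {
        if (i == this_cpu || cpus[i].excl_end == 0) {
            continue;
        }
        if (ranges_intersect(addr, addr + size,
                             cpus[i].excl_start, cpus[i].excl_end)) {
            return false;
        }
    }
    return true;
}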

>
>> Instead of trying to fix it, I came up with this alternative design:
>> we still use 8 bits per page and we group the smp_cpus in clusters of
>> maximum size ceiling(smp_cpus / 8).
>> Given a byte describing a certain page, its first bit will represent
>> the vCPUs of index 0, 9, 17, ... and so on for the other bits.
>>
>> Note that systems with eight vCPUs or less will not be affected by
>> this simplified model.
>>
>> Now, unless I'm missing some advantage of your proposal other than its
>> lower memory consumption, I would go with this design.
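
One way to picture that byte layout is sketched below; the modulo-8
assignment is only a guess at the mapping (the example indices 0, 9,
17 above hint at a slightly different stride), and the helper names
are made up:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical vCPU-index-to-bit mapping: with eight or fewer vCPUs
 * every bit covers exactly one vCPU, so the model stays precise. */
static inline unsigned excl_bit_for_cpu(unsigned cpu_index)
{
    return cpu_index % 8;
}

static inline void excl_set_cpu(uint8_t *page_byte, unsigned cpu_index)
{
    *page_byte |= 1u << excl_bit_for_cpu(cpu_index);
}

static inline bool excl_test_cpu(uint8_t page_byte, unsigned cpu_index)
{
    return page_byte & (1u << excl_bit_for_cpu(cpu_index));
}
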
>
> BTW have you had a look at Emilio's series? I'm about half way through
> reviewing it but I'm looking forward to seeing how his implementation
> differs from yours. Maybe there are some ideas in there that will
> inspire?

I still have to take an in-depth look at them.
However, unless I'm missing something, the problem of the fast path
not being interceptable is still there, isn't it?

Regards,
alvise

>
>>
>> Regards,
>> alvise
>>
>> On Wed, Aug 12, 2015 at 2:36 PM, Paolo Bonzini <address@hidden> wrote:
>>>
>>>
>>> On 12/08/2015 09:31, alvise rigo wrote:
>>>> I think that tlb_flush_entry is not enough, since in theory another
>>>> vCPU could have a different TLB address referring to the same phys
>>>> address.
>>>
>>> You're right, this is a TLB so it's virtually-indexed. :(  I'm not sure
>>> what happens on ARM, since it has a virtually indexed (VIVT or VIPT)
>>> cache, but indeed it would be a problem when implementing e.g. CMPXCHG
>>> using the TCG ll/sc ops.
>>>
>>> I'm a bit worried about adding such a big bitmap.  It's only used on
>>> TCG, but it is also allocated on KVM and on KVM you can have hundreds
>>> of VCPUs.  Wasting 200 bits per guest memory page (i.e. ~0.6% of guest
>>> memory) is obviously not a great idea. :(
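
(As a sanity check, assuming 4 KiB guest pages: 200 bits is 25 bytes
per page, and 25 / 4096 is roughly 0.6%, matching the figure above.)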
>>>
>>> Perhaps we can use a bytemap instead:
>>>
>>> - 0..253 = TLB_EXCL must be set in all VCPUs except CPU n.  A VCPU that
>>> loads the TLB for this vaddr does not have to set it.
>>>
>>> - 254 = TLB_EXCL must be set in all VCPUs.  A VCPU that
>>> loads the TLB for this vaddr has to set it.
>>>
>>> - 255 = TLB_EXCL not set in at least two VCPUs
>>>
>>> Transitions:
>>>
>>> - ll transitions: anything -> 254
>>>
>>> - sc transitions: 254 -> current CPU_ID
>>>
>>> - TLB_EXCL store transitions: 254 -> current CPU_ID
>>>
>>> - tlb_set_page transitions: CPU_ID other than current -> 255
>>>
>>> The initial value is 255 on SMP guests, 0 on UP guests.
>>>
>>> The algorithms are very similar to yours, just using this approximate
>>> representation.
>>>
>>> ll algorithm:
>>>   llsc_value = bytemap[vaddr]
>>>   if llsc_value == CPU_ID
>>>      do nothing
>>>   elseif llsc_value < 254
>>>      flush TLB of CPU llsc_value
>>>   elseif llsc_value == 255
>>>      flush all TLBs
>>>   set TLB_EXCL
>>>   bytemap[vaddr] = 254
>>>   load
>>>
>>> tlb_set_page algorithm:
>>>   llsc_value = bytemap[vaddr]
>>>   if llsc_value == CPU_ID or llsc_value == 255
>>>      do nothing
>>>   else if llsc_value == 254
>>>      set TLB_EXCL
>>>   else
>>>      # two CPUs without TLB_EXCL
>>>      bytemap[vaddr] = 255
>>>
>>> TLB_EXCL slow path algorithm:
>>>    if bytemap[vaddr] == 254
>>>       bytemap[vaddr] = CPU_ID
>>>    else
>>>       # two CPUs without TLB_EXCL
>>>       bytemap[vaddr] = 255
>>>    clear TLB_EXCL in this CPU
>>>    store
>>>
>>> sc algorithm:
>>>    if bytemap[vaddr] == CPU_ID or bytemap[vaddr] == 254
>>>       bytemap[vaddr] = CPU_ID
>>>       clear TLB_EXCL in this CPU
>>>       store
>>>       succeed
>>>    else
>>>       fail
>>>
>>> clear algorithm:
>>>    if bytemap[vaddr] == 254
>>>       bytemap[vaddr] = CPU_ID
>>>
>>> The UP case is optimized because bytemap[vaddr] will always be 0 or 254.
>>>
>>> The algorithm requires the LL to be cleared e.g. on exceptions.
>>> Paolo
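
For what it's worth, here is a minimal C sketch of the bytemap state
machine described above. The bytemap array, the page indexing and the
flush/TLB_EXCL helpers are placeholders rather than existing QEMU
interfaces, and locking/atomicity is ignored:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LLSC_ALL_EXCL  254   /* TLB_EXCL must be set on every vCPU */
#define LLSC_SHARED    255   /* TLB_EXCL missing on at least two vCPUs */

/* One byte per guest page; initialised to 255 on SMP, 0 on UP. */
static uint8_t bytemap[1 << 20];

/* Placeholder hooks standing in for the real TLB operations. */
static void flush_tlb_of_cpu(int cpu)            { (void)cpu; }
static void flush_all_tlbs(void)                 { }
static void set_tlb_excl(int cpu, size_t page)   { (void)cpu; (void)page; }
static void clear_tlb_excl(int cpu, size_t page) { (void)cpu; (void)page; }

/* ll: make sure every other vCPU will trap on stores to this page. */
static void llsc_ll(int cpu_id, size_t page)
{
    uint8_t v = bytemap[page];

    if (v == cpu_id) {
        /* only this vCPU lacks TLB_EXCL: nothing to flush */
    } else if (v < LLSC_ALL_EXCL) {
        flush_tlb_of_cpu(v);          /* exactly one other vCPU lacks it */
    } else if (v == LLSC_SHARED) {
        flush_all_tlbs();             /* unknown set of vCPUs lacks it */
    }
    set_tlb_excl(cpu_id, page);
    bytemap[page] = LLSC_ALL_EXCL;
    /* ... perform the load ... */
}

/* sc: succeed only from states where no other vCPU can have stored. */
static bool llsc_sc(int cpu_id, size_t page)
{
    uint8_t v = bytemap[page];

    if (v == cpu_id || v == LLSC_ALL_EXCL) {
        bytemap[page] = cpu_id;
        clear_tlb_excl(cpu_id, page);
        /* ... perform the store ... */
        return true;
    }
    return false;
}

/* TLB_EXCL slow path: an ordinary store hit a TLB_EXCL entry. */
static void llsc_excl_store(int cpu_id, size_t page)
{
    bytemap[page] = (bytemap[page] == LLSC_ALL_EXCL) ? cpu_id : LLSC_SHARED;
    clear_tlb_excl(cpu_id, page);
    /* ... perform the store ... */
}

/* tlb_set_page: decide whether a freshly loaded entry needs TLB_EXCL. */
static void llsc_tlb_set_page(int cpu_id, size_t page)
{
    uint8_t v = bytemap[page];

    if (v == LLSC_ALL_EXCL) {
        set_tlb_excl(cpu_id, page);
    } else if (v != cpu_id && v != LLSC_SHARED) {
        bytemap[page] = LLSC_SHARED;  /* two vCPUs now lack TLB_EXCL */
    }
}

/* clear (e.g. on exception): drop this vCPU's pending LL. */
static void llsc_clear(int cpu_id, size_t page)
{
    if (bytemap[page] == LLSC_ALL_EXCL) {
        bytemap[page] = cpu_id;
    }
}
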
>>>
>>>> alvise
>>>>
>>>> On Tue, Aug 11, 2015 at 6:32 PM, Paolo Bonzini <address@hidden> wrote:
>>>>>
>>>>>
>>>>> On 11/08/2015 18:11, alvise rigo wrote:
>>>>>>>> Why flush the entire cache (I understand you mean TLB)?
>>>>>> Sorry, I meant the TLB.
>>>>>> If for each removal of an exclusive entry we also set the bit to 1, we
>>>>>> force the following LL to make a tlb_flush() on every vCPU.
>>>>>
>>>>> What if you only flush one entry with tlb_flush_entry (on every vCPU)?
>>>>>
>>>>> Paolo
>
> --
> Alex Bennée


