qemu-arm

Re: [Qemu-arm] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1


From: Richard Henderson
Subject: Re: [Qemu-arm] [RFC PATCH v2 1/2] utils: Add helper to read arm MIDR_EL1 register
Date: Fri, 19 Aug 2016 07:57:23 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0

On 08/19/2016 02:05 AM, Vijay Kilari wrote:
On Thu, Aug 18, 2016 at 8:26 PM, Peter Maydell <address@hidden> wrote:
On 18 August 2016 at 15:46, Richard Henderson <address@hidden> wrote:
On 08/18/2016 07:14 AM, Peter Maydell wrote:
While we're on the subject, can somebody explain to me why we
use ifuncs at all? I couldn't work out why it would be better than
just using a straightforward function pointer -- when I tried single
stepping through things the ifunc approach still seemed to indirect
through some table or other so it wasn't actually resolving to
a direct function call anyway.

No reason, I suppose.

It's particularly helpful for libraries, where we don't really want the
overhead of the initialization when it's not used.

Ah, I see.

But (1) we don't have many of these and (2) we really don't care *that* much
about startup time.

So a simple function pointer initialized by a constructor has the same
effect.


The cutils code does not have any initialization function that could set up
a function pointer (or constructor) for the zero_check function.

static void __attribute__((constructor)) init_buffer_find_nonzero(void)
{
   ...
}
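
To make the constructor idea concrete, here is a minimal sketch of a function pointer that a constructor fills in before main() runs. Every name in it (zero_check_fn, zero_check_generic, zero_check_impl, init_zero_check, buffer_is_zero_sketch) is hypothetical and chosen only for illustration; it is not the cutils code, and it assumes GCC or Clang for the constructor attribute.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is an illustration, not QEMU code. */
typedef bool (*zero_check_fn)(const void *buf, size_t len);

/* Portable byte-wise fallback. */
static bool zero_check_generic(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    for (size_t i = 0; i < len; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* Filled in once at load time by the constructor below. */
static zero_check_fn zero_check_impl;

/* GCC/Clang run constructors before main(), so callers need no explicit
 * init call. A real version would test CPU features here and pick a
 * specialised routine; this sketch always picks the generic one. */
static void __attribute__((constructor)) init_zero_check(void)
{
    zero_check_impl = zero_check_generic;
}

/* Public entry point: one indirect call, no per-call feature test. */
bool buffer_is_zero_sketch(const void *buf, size_t len)
{
    return zero_check_impl(buf, len);
}
```

The cost at the call site is a single indirect call, which is about what the ifunc ended up doing anyway according to the single-stepping described earlier in the thread.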

Also, creating a separate function that mostly repeats the same code just
for prefetch does not look good.

Why do you say that?

So I suggest putting the check for prefetch outside the for loop and
coding the loop both with and without prefetch.

You're duplicating the inner loop either way, so that can't be your objection to creating a separate function.

I profiled and found that a single check inside the loop adds a 100 ms delay
to an 8 GB RAM migration.

That's about what I expected.
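
For illustration, this is roughly what hoisting the check looks like: the prefetch decision is made once, and each variant keeps its own loop with no test inside it. This is only a sketch under assumptions: the function names and the CACHE_LINE constant are hypothetical, the buffer is treated as an array of 64-bit words, and __builtin_prefetch is the GCC/Clang builtin.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size for this sketch (the ThunderX figure from the thread). */
#define CACHE_LINE 128

/* Variant without prefetch: OR everything together, test once at the end. */
static bool is_zero_plain(const uint64_t *p, size_t n)
{
    uint64_t acc = 0;

    for (size_t i = 0; i < n; i++) {
        acc |= p[i];
    }
    return acc == 0;
}

/* Variant with prefetch: walk one cache line per iteration and hint the
 * next line. The inner loop is duplicated on purpose so that it contains
 * no "do we prefetch?" test. */
static bool is_zero_prefetch(const uint64_t *p, size_t n)
{
    const size_t step = CACHE_LINE / sizeof(uint64_t);
    size_t i = 0;

    for (; i + step <= n; i += step) {
        /* A hint only; prefetching an unmapped address does not fault. */
        __builtin_prefetch(p + i + step);

        uint64_t acc = 0;
        for (size_t j = 0; j < step; j++) {
            acc |= p[i + j];
        }
        if (acc) {
            return false;
        }
    }
    /* Tail shorter than one cache line. */
    for (; i < n; i++) {
        if (p[i]) {
            return false;
        }
    }
    return true;
}

/* The only place the prefetch decision is evaluated. */
bool is_zero_sketch(const uint64_t *p, size_t n, bool want_prefetch)
{
    return want_prefetch ? is_zero_prefetch(p, n) : is_zero_plain(p, n);
}
```

In practice the selection would more likely be made once at init time through a function pointer than on every call.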

Also, if you want to make prefetch common to all arm64 platforms:
the ThunderX cache line is 128 bytes, so the prefetch is performed
at a 128-byte index. If the platform has a 64-byte cache line, this
prefetch will fill only one 64-byte line instead of the 128 bytes the loop requires.

Yes, I had thought of that.

It would make sense to create two versions that prefetch for, and iterate over, cacheline sizes of 64 and 128 (I don't know of any other common sizes).

Preferably, we should then use sysconf(_SC_LEVEL1_DCACHE_LINESIZE) within the init function above to choose the appropriate version.

But I see that glibc doesn't currently implement that for aarch64, so we do want to have a fallback. I know that the "official" cache line data isn't (easily) available to userspace, but a close proxy is the size described by dczid_el0. That seems much better than groveling through a file under /sys.
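
A rough sketch of that lookup, under stated assumptions: dcache_line_size_sketch is a hypothetical name, _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension (hence the #ifdef), and the final default of 64 on non-aarch64 hosts is an arbitrary choice made only for this example. On aarch64 the fallback reads DCZID_EL0, whose low four bits give log2 of the DC ZVA block size in 4-byte words; as noted above, that is a proxy for the cache line size rather than the official value.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: best-effort guess at the L1 data cache line size. */
size_t dcache_line_size_sketch(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    /* glibc extension; may return 0 or -1 when the value is unknown. */
    long sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (sz > 0) {
        return (size_t)sz;
    }
#endif
#ifdef __aarch64__
    uint64_t dczid;

    /* DCZID_EL0 is readable from userspace. */
    __asm__("mrs %0, dczid_el0" : "=r"(dczid));
    /* BS, bits [3:0]: log2 of the block size in words, so bytes = 4 << BS. */
    return (size_t)4 << (dczid & 0xf);
#else
    /* Assumed default for this sketch only. */
    return 64;
#endif
}
```

The init function could then pick the 128-byte variant when this returns 128 or more, and the 64-byte variant otherwise.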


r~


