tinycc-devel

Re: [Tinycc-devel] Huge swings in cache performance


From: KHMan
Subject: Re: [Tinycc-devel] Huge swings in cache performance
Date: Wed, 11 Jan 2017 11:54:32 +0800
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0

On 1/10/2017 12:34 PM, David Mertens wrote:
64-byte alignment adds about 10% to the size of the allocated code
block for the longer example program I linked earlier (which
actually prints that size). 10% isn't bad! For this reason, and
because lots of architectures use 64-byte cache lines, I suggest
that we just use 64-byte alignment, independent of architecture.

A quick search gives this discussion on cache line sizes:
https://groups.google.com/forum/#!topic/comp.arch/9DcDSzs28Ow

Thoughts? Or, can anybody provide other memory consumption
benchmarks that paint a more complete picture?

I don't think memory consumption is a huge issue. If there is a pathological case with thousands of JIT'ed scripts, a custom allocator for the scripts may save some memory.

I had always assumed that your program ran with data and memory separated enough not to mess with the L1 caches. So, is 64-byte alignment enough for your CPU? No effects beyond 64 bytes?

For BeagleBone Black, according to the AM335x Sitara Processors TRM, both L1 and L2 have cache lines of 16 words == 64 bytes, so that's another fella going with 64 bytes...
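
To check a given machine rather than a reference manual, something like the following might do. It is only a minimal sketch: `_SC_LEVEL1_DCACHE_LINESIZE` is a glibc extension to `sysconf()` and may be missing, or may report 0 or -1, on other systems. The names `pick_line_size`, `l1d_line_size` and `DEFAULT_CACHE_LINE` are made up for illustration, and the 64-byte fallback is an assumption taken from this thread.

```c
#include <assert.h>
#include <unistd.h>

/* Assumed fallback when the OS reports nothing usable (64 bytes, per this thread). */
#define DEFAULT_CACHE_LINE 64L

/* Accept a reported line size only if it is a positive power of two. */
static long pick_line_size(long reported)
{
    if (reported > 0 && (reported & (reported - 1)) == 0)
        return reported;
    return DEFAULT_CACHE_LINE;
}

/* Ask the C library for the L1 data cache line size, with a fallback. */
static long l1d_line_size(void)
{
    long n = DEFAULT_CACHE_LINE;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    /* glibc extension; the value may be 0 or -1 when unknown. */
    n = pick_line_size(sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
#endif
    assert(n > 0);
    return n;
}
```

Calling `l1d_line_size()` and printing the result with `%ld` is enough to compare against the numbers quoted in this thread.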

David

On Sun, Jan 8, 2017 at 7:40 AM, David Mertens wrote:

    OK, done! And you were right, we only need to align on 64 bytes!

    Follow-up question: since the alignment is only 64 bytes,
    would it be sensible to have all architectures align to this,
    including ARM?

    David

    On Sun, Jan 8, 2017 at 7:19 AM, David Mertens wrote:

        Thanks for the feedback, grischka.

        On Sat, Jan 7, 2017 at 6:15 AM, grischka wrote:

            David Mertens wrote:

                I just pushed a commit that sets up 512-byte
                alignment for x86-64
                architectures. It only uses 512 bytes for x86-64;
                for all others it sticks
                with the default of 16 bytes.


            L1/L2 cache line size is 64 bytes on x86-like
            processors, no matter
            whether run in 32 or 64 bit mode.


        Yes, theoretically we should not need to align on anything
        more than 64 bytes. I chose 512 because I still got
        slowdowns for smaller alignments, including 256. But you
        mention...

            However to make it work reliably the memory from
            malloc needs to be
            aligned as well, like so:

                 offset = 0, mem = (addr_t)ptr;
            +    mem += -(int)mem & SECTION_ALIGNMENT;

            and the possibly additional amount needs to be
            requested in advance:

                 if (0 == mem)
            -        return offset;
            +        return offset + SECTION_ALIGNMENT;
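
The arithmetic in the two hunks above can be tried in isolation. The sketch below is not TCC's code. It assumes, as the `&` in the patch suggests, that the alignment constant is a mask (alignment minus one, so 63 for 64-byte lines). `RUN_ALIGN_MASK`, `align_up` and `alloc_aligned` are hypothetical names. It rounds an address up to the next boundary and requests the extra slack from malloc in advance.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mask: alignment minus one, here for 64-byte cache lines. */
#define RUN_ALIGN_MASK ((uintptr_t)63)

/* Round addr up to the next multiple of (mask + 1).
   mask + 1 must be a power of two. */
static uintptr_t align_up(uintptr_t addr, uintptr_t mask)
{
    assert(((mask + 1) & mask) == 0);
    /* -addr & mask is the distance to the next boundary (0 if already aligned). */
    return addr + (-addr & mask);
}

/* Request size + mask bytes so the aligned start still leaves room for size bytes.
   The raw malloc pointer is stored in *raw and is the one to pass to free(). */
static void *alloc_aligned(size_t size, uintptr_t mask, void **raw)
{
    *raw = malloc(size + mask);
    if (*raw == NULL)
        return NULL;
    return (void *)align_up((uintptr_t)*raw, mask);
}
```

The adjustment is never more than `mask` bytes, which is why requesting `mask` additional bytes up front is sufficient.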


        If I put this in place, then maybe the section alignment
        can be lessened. I'll have to check. FWIW, I've been doing
        this with my own TCC-calling code already and I've seen
        performance benefits. I don't see how the math would work
        to let me reduce SECTION_ALIGNMENT to 64 bytes, but I'll
        experiment and see what happens.

        All of this is a black box to me. From what I've read, I
        don't think we'd need to worry about anything beyond 64
        bytes, but I don't understand the underlying CPU behavior
        well enough to predict. The numbers I actually use will be
        based on real timing from testing on my machine or from
        feedback from others.

                I ran the tests on my BeagleBone Black with
                the original alignment and saw no performance issues,


            Obviously ARM doesn't automatically clear the
            instruction cache, which is
            why we have the explicit __clear_cache() call for ARM
            further down in
            set_pages_executable().
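
For readers unfamiliar with that step, here is a rough sketch of the general pattern on a POSIX system with GCC or Clang. It is not the body of set_pages_executable(): `publish_code` is a hypothetical helper, and it uses the compiler builtin `__builtin___clear_cache` rather than whatever TCC calls internally. It copies bytes into a fresh mapping, flushes the instruction cache for that range, and only then marks the pages executable.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy len bytes of generated code into new pages and make them executable.
   Returns the mapping (or NULL on failure); *mapped_size is what to pass to munmap(). */
static void *publish_code(const void *code, size_t len, size_t *mapped_size)
{
    assert(len > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t size = (len + page - 1) / page * page; /* round up to whole pages */

    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;

    memcpy(mem, code, len);

    /* Required on ARM and similar targets with separate I/D caches;
       typically compiles to nothing on x86. */
    __builtin___clear_cache((char *)mem, (char *)mem + len);

    if (mprotect(mem, size, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, size);
        return NULL;
    }
    *mapped_size = size;
    return mem;
}
```

The sketch deliberately stops short of jumping into the buffer, since the bytes to execute are architecture-specific.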

                I am not sure if this quite follows the project
                practices. I define
                SECTION_ALIGNMENT just prior to the function
                tcc_relocate_ex. If anybody
                can think of a better place to put it, to keep
                useful things in one place,
                please move it.


            SECTION_ALIGNMENT seems too general as a name.
            tccelf.c is full of
            section_alignments of various kinds.  I'd suggest
            something prefixed
            with RUN_xxxx  to indicate that it's used only in that
            specific place.


        Can do! I may not have time today, but I should be able to
        push a revised commit in the next couple of days.

        David

        --
          "Debugging is twice as hard as writing the code in the
        first place.
           Therefore, if you write the code as cleverly as
        possible, you are,
           by definition, not smart enough to debug it." -- Brian
        Kernighan






_______________________________________________
Tinycc-devel mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/tinycc-devel


--
Cheers,
Kein-Hong Man (esq.)
Selangor, Malaysia



