Slowness with multi-thread TCG?

qemu-devel

From:	Frederic Barrat
Subject:	Slowness with multi-thread TCG?
Date:	Mon, 27 Jun 2022 18:25:59 +0200
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.10.0

Hello,

I've been looking at why our qemu powernv model is so slow when booting a compressed linux kernel, using multiple vcpus and multi-thread tcg. With only one vcpu, the decompression time of the kernel is what it is, but when using multiple vcpus, the decompression is actually slower. And worse: it degrades very fast with the number of vcpus!

Rough measurements of the decompression time on an x86 laptop with multi-thread tcg and using the qemu powernv10 machine:

1 vcpu => 15 seconds
2 vcpus => 45 seconds
4 vcpus => 1 min 30 seconds

Looking in detail, when the firmware (skiboot) hands over execution to the linux kernel, there's one main thread entering some bootstrap code and running the kernel decompression algorithm. All the other secondary threads are left spinning in skiboot (1 thread per vcpu). So on paper, with multi-thread tcg and assuming the system has enough available physical cpus, I would expect the decompression to hog one physical cpu and the time needed to be constant, no matter the number of vcpus.


All the secondary threads are left spinning in code like this:

        for (;;) {
                if (cpu_check_jobs(cpu))  // reading cpu-local data
                        break;
                if (reconfigure_idle)     // global variable
                        break;
                barrier();
        }

The barrier is to force reading the memory with each iteration. It's defined as:


  asm volatile("" : : : "memory");

Some time later, the main thread in the linux kernel will get the secondary threads out of that loop by posting a job.
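
The wait-then-release pattern described here can be sketched as a small host-side C program. This is not the skiboot code: it swaps the plain loads plus barrier() for C11 atomics, which is the portable way to get a fresh read on every iteration, and job_posted, secondary_spin() and spin_until_job_demo() are hypothetical names used only for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-cpu job state that
 * cpu_check_jobs() would look at. */
static atomic_int job_posted;

/* Secondary thread: spin until the main thread posts a job. */
static void *secondary_spin(void *arg)
{
    (void)arg;
    /* The atomic load is re-done on every iteration, which is the
     * effect barrier() has on the plain loads in the original loop. */
    while (!atomic_load_explicit(&job_posted, memory_order_acquire))
        ;
    return NULL;
}

/* Start one spinner, post a job, and wait for the spinner to leave
 * its loop. Returns 0 on success, -1 if the thread could not be
 * created or joined. */
int spin_until_job_demo(void)
{
    pthread_t tid;

    atomic_store(&job_posted, 0);
    if (pthread_create(&tid, NULL, secondary_spin, NULL) != 0)
        return -1;

    /* "Posting a job": the store that releases the spinning thread. */
    atomic_store_explicit(&job_posted, 1, memory_order_release);

    return pthread_join(tid, NULL) == 0 ? 0 : -1;
}
```

Calling spin_until_job_demo() returns once the spinner has seen the posted job. On older C libraries this may need to be built with -pthread.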

My first thought was that the translation of that code through tcg was somehow causing some abnormally slow behavior, maybe due to some non-obvious contention between the threads. However, if I send the threads spinning forever with simply:


    for (;;) ;

supposedly removing any contention, then the decompression time is the same.

Ironically, the behavior seen with single thread tcg is what I would expect: 1 thread decompressing in 15 seconds, all the other threads spinning for that same amount of time, all sharing the same physical cpu, so it all adds up nicely: I see 60 seconds decompression time with 4 vcpus (4x15). Which means multi-thread tcg is slower by quite a bit. And single thread tcg hogs one physical cpu of the laptop vs. 4 physical cpus for the slower multi-thread tcg.
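
The expectation behind these numbers can be written down as a small model. This is only a sketch: the function names are made up for illustration, BASE_SECONDS is the 1-vcpu measurement quoted earlier, and the model assumes the secondary vcpus do nothing but spin and that the host has at least as many free physical cpus as there are vcpus.

```c
#include <assert.h>

/* Decompression time measured with 1 vcpu (from the figures above). */
#define BASE_SECONDS 15

/* Single-thread tcg: all vcpus take turns on one host thread, so the
 * decompressing vcpu gets roughly 1/N of that physical cpu. */
int expected_single_thread_tcg(int nr_vcpus)
{
    assert(nr_vcpus > 0);
    return BASE_SECONDS * nr_vcpus;
}

/* Multi-thread tcg, on paper: each vcpu has its own host thread, so
 * the spinning secondaries should not slow the decompressing vcpu. */
int expected_multi_thread_tcg(int nr_vcpus)
{
    assert(nr_vcpus > 0);
    return BASE_SECONDS;
}
```

For 4 vcpus this model gives 60 seconds for single-thread tcg, which matches what is observed, and 15 seconds for multi-thread tcg, against the 1 min 30 seconds actually measured.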

Does anybody have an idea of what might be happening, or suggestions for how to keep investigating?

Thanks for your help!

  Fred
