Re: [Bug-gnubg] Benchmarks on server class machines and resulting change requests

bug-gnubg

From:	Michael Petch
Subject:	Re: [Bug-gnubg] Benchmarks on server class machines and resulting change requests
Date:	Thu, 10 Sep 2009 19:36:11 -0600
User-agent:	Microsoft-Entourage/12.20.0.090605

Ingo, regarding your original cache test, would it be possible to send me the matches you analyzed for it? Those 5 matches must have been pretty short. I am wondering whether the way you use gnubg from the command line, with the types of matches you are running, results in cache performance that differs significantly from mine.

My normal usage from the command line is to load a match-play match, usually between 5 and 15 points, run an analysis, and save the SGF file along with HTML. This is done in batch or individually. Having the cache on makes a considerable difference. If I can get your matches and run them through your experiment, I’d like to see what I get.

My system runs Debian Lenny stable, 32-bit (yours is 64-bit), with 10GB of RAM. It’s an Intel server board (1333 FSB) with DDR2-5300 memory and two Xeon E5405 CPUs with 12MB of L2 cache each. For my tests I turned swap space off and ran gnubg with priority -19 (critical responsiveness).

Overall my tests indicate that a cache with 2^20 cache entries (524288 entries, and 21MB on the GUI slider) is where things level off for my type of usage. Results are generally the same up to 2^25 cache entries (above 2^25 it doesn’t work on 32-bit, which I think is expected given the size allowed for a given process). My guess is that the shorter your matches, the fewer positions are cached, and thus the smaller the benefit from the cache. In my test (an 815-move match, which I consider typical for me) the speed increases steadily between 0 and 2^20.
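
As a quick check on the entry counts mentioned above, the powers of two can be tabulated in a few lines of Python. This is pure arithmetic and assumes nothing about how large a single gnubg cache entry is, so it says nothing about megabytes. Note that 524288 is 2^19 while 2^20 is 1048576, so one of the two figures quoted above may be off by a factor of two.

```python
# Entry counts for the cache-size exponents discussed above (pure arithmetic).
def cache_entries(exponent: int) -> int:
    """Return 2**exponent, the number of cache entries for that exponent."""
    return 1 << exponent


for exp in (19, 20, 25):
    print(f"2^{exp} = {cache_entries(exp)} entries")
```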

I think that for most users who do rollouts and match-play analysis on longer matches, performance is significantly improved with the cache on (if it works), and I generally suggest 21MB (bar in the middle). I expect anyone who uses plies >= 3 (and a newer release of the code with the cache fixes) will also gain from caching.

I’ll probably install 64-bit Lenny this weekend. I am curious whether performance changes that way as well.

On 06/08/09 8:23 PM, "Ingo Macherius" <address@hidden> wrote:

Jon,

find attached the cleaned up benchmark data for both the 2xXeons 5130 and 2xNocona machines.

I've also done new research, which now includes the impact of cache size, single-threaded vs. multithreaded binary, and number of threads. The main result graph is attached; the data is in the same spreadsheet as the two other benchmarks (format OpenOffice 3.1) in 3 worksheet tabs.

The basis of the experiment was the same 5 different seven-point FIBS matches used for the previous benchmarks. Two binaries were compiled, one with multithreading (GNUBGMT) and one without (GNUBGST). Both were compiled with gcc 4.3.2.1 on Debian 5.0.2, heavily optimized for core2 CPUs. SSE and SSE2 are used; the code base is gnubg.org CVS as of 2 August 2009. The hardware is a Supermicro 2xXeon 5130 machine with 6GB DDR2-5300 memory. The machine was completely idle during testing.

The 5 matches were analyzed 4 times each, resulting in a total of 20 match evaluations at 2ply/no pruning/cubeful. All caches were cleaned before each analysis. Cache size was varied from 2^1 to 2^27 bytes, resulting in 27 runs for each graph.
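
The run matrix described above can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: the match file names are hypothetical, the timing uses Python's `time.perf_counter` rather than the Unix `time` command, and the actual gnubg invocation is deliberately left as a placeholder command.

```python
import subprocess
import sys
import time
from itertools import product

MATCHES = [f"match{i}.sgf" for i in range(1, 6)]  # hypothetical file names
REPEATS = 4                                       # each match analyzed 4 times
CACHE_EXPONENTS = range(1, 28)                    # cache sizes 2^1 .. 2^27


def build_jobs(matches, repeats):
    """One job per (match, repetition): 5 matches x 4 repeats = 20 analyses."""
    return list(product(matches, range(repeats)))


def time_command(argv):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    jobs = build_jobs(MATCHES, REPEATS)
    print(len(jobs), "analyses per cache size,", len(CACHE_EXPONENTS), "cache sizes")
    # Placeholder workload; substitute the real gnubg command line here.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"placeholder run took {elapsed:.3f}s")
```

Keeping the timed command as a plain argument list makes it easy to swap in either binary without touching the loop over cache sizes.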

* Graphs "Threads=1,2,3,4,5" were done with the MT binary and the respective settings for cache and threads, 20 matches
* Graph "No Threading" was done with GNUBGST, 20 matches
* Graph "4xNo threading á 1/4 work" was done by running 4 instances of GNUBGST in parallel, with 5 matches to analyze each
* Graph "4xThreads=1 á 1/4 work" was done by running 4 instances of GNUBGMT set to use one thread in parallel, with 5 matches to analyze each
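
The "4x ... 1/4 work" setups above amount to dividing the 20 analyses evenly among 4 independent processes. A minimal sketch of such a split, assuming a simple round-robin assignment (how the matches were actually distributed among the instances is not stated here):

```python
def split_work(jobs, instances):
    """Deal jobs round-robin so each instance gets an equal share."""
    return [jobs[i::instances] for i in range(instances)]


# 5 matches x 4 repetitions = 20 analyses, identified by (match, repetition).
jobs = [(match, rep) for match in range(1, 6) for rep in range(4)]
shares = split_work(jobs, 4)
for n, share in enumerate(shares, start=1):
    print(f"instance {n}: {len(share)} analyses")
```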

Additional remarks:
- The "spontaneous speedup" spikes seen especially for Threads=2 are odd. I did several runs and they didn't disappear, but showed up at different frequencies and cache size positions. I consider them bugs in the Unix time command.
- Data for Threads=6,7,8 was also collected but is not plotted because, as expected, performance decreased with a growing number of threads. The graph for Threads=5 shows that sufficiently, so there is no need to clutter the diagram with more.
- The "4xThreads..." and "4x No Threading" runs aborted with out of memory for cachesize=2^26 and 2^27 (no surprise), thus there is no data for them.

I would very much like to hear some comments from you, Jonathan (the author of the threading code). Happy with what you see? Well, I think you did a good job :)

Ingo

-----Original Message-----
From: Jonathan Kinsey [mailto:address@hidden]
Sent: Tuesday, August 04, 2009 9:41 AM
To: address@hidden
Cc: address@hidden; address@hidden
Subject: Re: [Bug-gnubg] Benchmarks on server class machines and resulting change requests

It's not clear if you were using the hyper-threaded machine, as this might
explain the jump from 1 to 2 cores and the smaller jump to 3 and 4 cores.

If you were using machine "B", try running the test again for 1,2,3,4 threads on
machine "A". Make sure the cache size is set to maximum.
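
When comparing the 1, 2, 3 and 4 thread runs, it can help to express the wall-clock times as speedup and per-thread efficiency relative to the single-threaded run. A minimal sketch; the times in the example are made up for illustration only and are not measurements from either machine:

```python
def speedup_table(times_by_threads):
    """Map thread count -> (speedup, efficiency) relative to the 1-thread time."""
    base = times_by_threads[1]
    return {
        n: (base / t, base / t / n)
        for n, t in sorted(times_by_threads.items())
    }


# Made-up wall-clock times in seconds, for illustration only.
example = {1: 400.0, 2: 200.0, 4: 125.0}
for n, (speedup, efficiency) in speedup_table(example).items():
    print(f"{n} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency well below 100% at 3 or 4 threads on machine "A" would point at something other than hyper-threading.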

Jon

Ingo Macherius wrote:
> Christian, I've conducted your suggested experiment (batch eval of saved matches) and can confirm your answer. Calibrate is not a suitable metric to evaluate threading behaviour for gnubg.
>
> The batch experiment analyzed five 7pt matches 4 times each, with full cache cleaning. The time was taken with the Unix "time" command. The results are much more like what one would expect:
> - Speed peaked when the number of threads equaled the number of cores
> - Adding more threads than cores slowed down the evaluation (albeit by only a tiny bit)
> - The slowdown grew with the number of threads
>
> The odd finding is that there are still some anomalies:
> - Going from 1 to 2 threads more than doubles the evaluation speed
> - Adding more threads has very little effect, i.e. the gain is not linear in the number of cores
> - 2, 3 and 4 threads result in speeds very close to each other, much closer than expected
>
> I've attached a ZIP which contains the original OpenOffice 3.1 spreadsheet and a PDF version of the graphs with the experiment details.
>
> Thx a lot for your guidance!
>
> Ingo
>
>> -----Original Message-----
>> From: Christian Anthon [mailto:address@hidden]
>> Sent: Monday, August 03, 2009 12:29 PM
>> To: Ingo Macherius
>> Cc: address@hidden
>> Subject: Re: [Bug-gnubg] Benchmarks on server class machines and resulting change requests
>>
>>
>> The calibrate function sucks big time. The threaded calibrate
>> function sucks even more. I'm tempted to call it useless. I
>> believe that you are observing the following: There is some
>> overhead involved in displaying and updating the calibration,
>> and as you are increasing the number of threads more and more
>> time is allocated to evaluation and less and less to
>> overhead. If you really want to test the speed of the
>> threading then you should analyse a match or perform a rollout.
>>
>> The original calibration was meant to calibrate certain
>> timing functions against the speed of your computer, so
>> overhead didn't really matter. That is, the function measures
>> the speed of your computer, not the speed of gnubg.
>>
>> Christian.
>>
>> On Sun, Aug 2, 2009 at 5:06 PM, Ingo Macherius wrote:
>>> I have benchmarked gnubg on two server machines, with particular focus
>>> on multithreading. Both machines are headless and run Debian 5.x
>>> Lenny, Kernel 2.6.26-2-amd64 #1 SMP x86_64 GNU/Linux. The hardware is:
>>> box_A: 2xXeon 5130 @ 2GHz (4 physical cores in 2 chips)
>>> box_B: 2xXeon Nocona @ 3GHz (2 physical cores plus 2 HT "cores" in 2
>>> chips)
>>>
>>> I found two issues with current gnubg (latest CVS version as of August
>>> 1st 2009, compiled with gcc 4.3.2.1 with -march=native and sse2
>>> support):
>>>
>>> 1) The "calibrate" command output is off by a factor of 1000, i.e. it
>>> reports eval/s values 1000 times too high. This also holds for the
>>> figure reported by the official Debian binary installed via apt-get.
>>>
>>> 2) The limit of 16 threads is too low. I found that to utilize the CPU
>>> power to 100%, 8 threads per core are needed. Interestingly, this holds
>>> for the virtual HT cores as well.
>>>
>>> @1: Please check the timer code, the problem seems to be in timer.c.
>>> Obviously the #ifdef part for Windows is fine, but all other machines
>>> use a faulty version of the timer. I can't really suggest a solution,
>>> but here is some background info from Wikipedia:
>>> http://en.wikipedia.org/wiki/Rdtsc
>>> I would help to fix this one by testing on the aforementioned machines
>>> under 64 bit Linux.
>>> @2: I've tested with a custom gnubg binary with the bug at @1 fixed
>>> the hard way, by hardcoding a division by 1000, and with the thread
>>> limit raised to 256. While calibrate was running I monitored CPU
>>> utilization using the mpstat command.
>>>
>>> box_A peaks at about 202K eval/s with 8 threads per core (32 total),
>>> from where on the number is static until it starts decreasing again
>>> when you use hundreds of threads. Between 1 and 3 threads I see the
>>> expected gain of almost 100% per thread added. Using 4 threads
>>> lowers the throughput compared to 3 threads. Between 5 and 32
>>> threads I see rising throughput, which is linear at first and becomes
>>> asymptotic as we get closer to 32 threads. Below 32 threads, mpstat
>>> reports significant idle times for each CPU; at 32 I see each is using
>>> 100% of the cycles.
>>>
>>> A very similar behavior is visible on box_B, despite the fact that 2
>>> of its "cores" are virtual HT cores.
>>>
>>> Extrapolating the results suggests gnubg should increase the limit for
>>> the maximum number of threads to 64, maybe even 128 or 256. Rationale:
>>> recent server hardware with dual quad-cores has 8 cores, which should
>>> be fully utilizable only with 64 threads. The suggested 128
>>> anticipates future improvements. As there seems to be little to no
>>> cost with higher values for max. threads, this seems like a cheap way
>>> to speed up gnubg on server class machines and quad cores.
>>>
>>> Cheers,
>>> Ingo
>>>
>>>
>>>
>>> _______________________________________________
>>> Bug-gnubg mailing list
>>> address@hidden http://lists.gnu.org/mailman/listinfo/bug-gnubg
>>>
