bug-gnubg

Re: [Bug-gnubg] Benchmarks on server class machines and resulting change


From: Michael Petch
Subject: Re: [Bug-gnubg] Benchmarks on server class machines and resulting change requests
Date: Thu, 10 Sep 2009 19:36:11 -0600
User-agent: Microsoft-Entourage/12.20.0.090605


Ingo, regarding your original cache test, would it be possible to send me the matches that you analyzed for this test? These 5 matches must have been pretty short. I am wondering whether the way you use gnubg from the command line, with the types of matches you are running, results in cache performance that differs significantly from mine.

My normal usage from the command line is to load a match play match, usually between 5 and 15 points, run an analysis, and save the SGF file along with HTML. This is done in batch or individually. Having the cache on makes a considerable difference. If I can get your matches and run them through your experiment, I’d like to see what I get.
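A minimal sketch of how such a batch could be scripted, assuming the gnubg commands `load match`, `analyse match`, `save match` and `export match html` behave as their names suggest (check them against gnubg's own `help` before relying on this). The file names, the output directory and the helper `batch_commands` are hypothetical, and paths are assumed to contain no spaces.

```python
from pathlib import PurePosixPath


def batch_commands(match_files, out_dir):
    """Build a gnubg command script: load, analyse, save SGF, export HTML.

    The command names are assumptions about the gnubg command set;
    the paths are purely illustrative.
    """
    lines = []
    for match in map(PurePosixPath, match_files):
        stem = PurePosixPath(out_dir) / match.stem
        lines += [
            f"load match {match}",
            "analyse match",
            f"save match {stem}.sgf",
            f"export match html {stem}.html",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical input files; write the result to a file and feed it to gnubg.
    print(batch_commands(["matches/m1.sgf", "matches/m2.sgf"], "analysed"), end="")
```

The generated text can be saved to a file and passed to gnubg as a command script, one block of four commands per match.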

My system runs Debian Lenny Stable, 32-bit (yours is 64-bit), with 10GB RAM. It’s an Intel server board (1333 FSB) with DDR2-5300 memory and two Xeon E5405s with 12MB of L2 cache each. For my tests I turned swap space off and ran gnubg with priority -19 (critical responsiveness).

Overall, my tests indicate that a cache with 2^20 cache entries (524288 entries, and 21MB on the GUI slider) is where things level off for my type of usage. Results are generally the same up to 2^25 cache entries (above 2^25 it doesn’t work on 32-bit, which I think is expected given the memory allowed for a single process). My guess is that the shorter your matches, the fewer positions are cached, and thus the smaller the benefit from the cache. In my test (an 815-move match, which I consider typical for me), the speed increases steadily between 0 and 2^20.
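One way to make "where things level off" concrete is to look for the smallest cache size after which no larger size is meaningfully faster. The sketch below assumes a table of wall-clock timings keyed by cache exponent; the function name, the 2% tolerance and the sample timings are all invented for illustration and are not measurements from this thread.

```python
def level_off_exponent(timings, tolerance=0.02):
    """Return the smallest exponent k such that no larger cache in
    `timings` is faster by more than `tolerance` (as a fraction).

    timings: dict mapping cache exponent k (cache of 2**k entries)
             to analysis time in seconds.
    """
    if not timings:
        raise ValueError("no timings given")
    exps = sorted(timings)
    for i, k in enumerate(exps):
        later = [timings[j] for j in exps[i + 1:]]
        best_later = min(later) if later else timings[k]
        if timings[k] - best_later <= tolerance * timings[k]:
            return k
    return exps[-1]  # not reached: the last exponent always qualifies


if __name__ == "__main__":
    # Invented numbers, only to show the shape of the calculation.
    sample = {4: 100.0, 8: 90.0, 12: 75.0, 16: 61.0, 20: 60.5, 24: 60.4}
    print(level_off_exponent(sample))
```

A looser tolerance moves the reported knee toward smaller caches, so the threshold should be chosen with the run-to-run noise of the timings in mind.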

I think that for most users who do rollouts and match play analysis on longer matches, performance is significantly improved with the cache on (if it works), and I generally suggest 21MB (bar in the middle). I expect anyone who uses plies >= 3 (and a newer release of the code with the cache fixes) will also gain from caching.

I’ll probably install 64-bit Lenny this weekend. I am curious whether that changes performance as well.

On 06/08/09 8:23 PM, "Ingo Macherius" <address@hidden> wrote:

Jon,

find attached the cleaned up benchmark data for both the 2xXeons 5130 and 2xNocona machines.

I've also done new research which now includes the impact of cache size, single-threaded vs. multithreaded binary, and number of threads. The main result graph is attached; the data is in the same spreadsheet as the two other benchmarks (format OpenOffice 3.1), in 3 worksheet tabs.

The basis of the experiment was the same 5 different seven-point FIBS matches used for the previous benchmarks. Two binaries were compiled, one with multithreading (GNUBGMT) and one without (GNUBGST). Both were compiled with gcc 4.3.2.1 on Debian 5.0.2, heavily optimized for core2 CPUs. SSE and SSE2 are used; the code base is gnubg.org CVS as of August 2, 2009. The hardware is a Supermicro 2xXeon 5130 machine with 6GB DDR2-5300 memory. The machine was completely idle during testing.

The 5 matches were analyzed 4 times each, resulting in a total of 20 match evaluations at 2ply/no pruning/cubeful. All caches were cleaned before each analysis. Cache size was varied from 2^1 to 2^27 bytes, resulting in 27 runs for each graph.
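The shape of this experiment can be written down as a small run plan: 27 cache sizes, each covering 5 matches analyzed 4 times. The following is only a sketch of that enumeration; the match names and the `run_plan` helper are placeholders, not the files or scripts actually used.

```python
from itertools import product

CACHE_EXPONENTS = range(1, 28)                # 2^1 .. 2^27
MATCHES = [f"match{i}" for i in range(1, 6)]  # placeholder names for the 5 matches
REPEATS = 4                                   # each match is analyzed 4 times


def run_plan():
    """Yield (cache_size, jobs) pairs, one per point on a graph.

    Each job is a (match, repeat) tuple; a real driver would clear the
    cache before every job, as described above.
    """
    for exp in CACHE_EXPONENTS:
        jobs = list(product(MATCHES, range(REPEATS)))
        yield 2 ** exp, jobs


if __name__ == "__main__":
    plan = list(run_plan())
    print(len(plan), "runs per graph,", len(plan[0][1]), "analyses per run")
```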

* Graphs "Threads=1,2,3,4,5" were done with the MT binary and the respective settings for cache and threads, 20 matches
* Graph "No Threading" was done with GNUBGST, 20 matches
* Graph "4xNo threading á 1/4 work" was done by running 4 instances of GNUBGST in parallel, with 5 matches to analyze each
* Graph "4xThreads=1 á 1/4 work" was done by running 4 instances of GNUBGMT in parallel, each set to use one thread, with 5 matches to analyze each
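The two "4x" setups amount to splitting the 20 analyses into four equal shares and timing how long it takes until all four processes have finished. A rough sketch of that idea follows; `split_jobs` and `time_parallel` are hypothetical helpers, and the commands launched in the example are trivial Python placeholders rather than real gnubg invocations.

```python
import subprocess
import sys
import time


def split_jobs(jobs, instances):
    """Deal jobs out round-robin so every instance gets an equal share."""
    return [jobs[i::instances] for i in range(instances)]


def time_parallel(commands):
    """Start every command at once; return wall-clock seconds until all exit."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for cmd in commands]
    for proc in procs:
        proc.wait()
    return time.perf_counter() - start


if __name__ == "__main__":
    shares = split_jobs(list(range(20)), 4)   # 4 instances, 5 analyses each
    print([len(share) for share in shares])
    # Placeholder workload: four no-op Python processes instead of gnubg.
    elapsed = time_parallel([[sys.executable, "-c", "pass"]] * 4)
    print(f"{elapsed:.3f}s until the slowest instance finished")
```

Measuring until the last process exits matters here, because the total is bounded by the slowest of the four instances rather than by their average.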

Additional remarks:
- The "spontaneous speedup" spikes seen especially for Threads=2 are odd. I did several runs and they didn't disappear, but showed up at different frequencies and cache size positions. I consider them bugs in the Unix time command.
- Data for Threads=6,7,8 was also collected but is not plotted because, as expected, performance decreased with a growing number of threads. The graph for Threads=5 shows that sufficiently; no need to clutter the diagram with more.
- The "4xThreads..." and "4x No Threading" runs aborted with out of memory for cachesize=2^26 and 2^27 (no surprise), thus no data for them.
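When comparing the thread graphs, it can help to convert raw times into speedup and per-core efficiency relative to the single-thread run. The sketch below assumes 4 physical cores, as on the 2xXeon 5130 box; the function name and the timings in the example are invented for illustration and are not taken from the spreadsheet.

```python
def speedup_table(seconds_by_threads, cores):
    """Return (threads, speedup, efficiency) rows relative to 1 thread.

    Efficiency divides the speedup by min(threads, cores), since
    threads beyond the core count cannot add real parallelism.
    """
    base = seconds_by_threads[1]
    rows = []
    for n in sorted(seconds_by_threads):
        speedup = base / seconds_by_threads[n]
        rows.append((n, speedup, speedup / min(n, cores)))
    return rows


if __name__ == "__main__":
    # Invented timings in seconds, only to show the calculation.
    sample = {1: 400.0, 2: 200.0, 4: 100.0, 5: 125.0}
    for threads, speedup, eff in speedup_table(sample, cores=4):
        print(f"threads={threads}  speedup={speedup:.2f}  efficiency={eff:.2f}")
```

An efficiency that drops once the thread count exceeds the core count would match the expected slowdown described above for Threads=5 and higher.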

I would very much like to hear some comments from you, Jonathan (the author of the threading code). Happy with what you see? Well, I think you did a good job :)

Ingo

 
-----Original Message-----
From: Jonathan Kinsey [mailto:address@hidden]
Sent: Tuesday, August 04, 2009 9:41 AM
To: address@hidden
Cc: address@hidden; address@hidden
Subject: Re: [Bug-gnubg] Benchmarks on server class machines and resulting change requests

It's not clear if you were using the hyper-threaded machine, as this might
explain the jump from 1 to 2 cores and the smaller jump to 3 and 4 cores.

If you were using machine "B", try running the test again for 1,2,3,4 threads on
machine "A". Make sure the cache size is set to maximum.

Jon

Ingo Macherius wrote:
> Christian, I've conducted your suggested experiment (batch eval of saved matches) and can confirm your answer. Calibrate is not a suitable metric to evaluate threading behaviour for gnubg.
>
> The batch experiment analyzed five 7pt matches 4 times each, with full cache cleaning. The time was taken with the Unix "time" command. The results are much more like what one would expect:
> - Speed peaked when the number of threads equaled the number of cores
> - Adding more threads than cores slowed down the evaluation (albeit only by a tiny bit)
> - The slowdown grew with the number of threads
>
> The odd finding is that there are still some anomalies, which are:
> - Going from 1 to 2 threads more than doubles the evaluation speed
> - Adding more threads has very little effect, i.e. the gain is not linear in # cores
> - 2, 3 and 4 threads result in speeds very close to each other, much closer than expected
>
> I've attached a ZIP which contains the original OpenOffice 3.1 spreadsheet and a PDF version of the graphs with the experiment details.
>
> Thx a lot for your guidance!
>
> Ingo
>
>> -----Original Message-----
>> From: Christian Anthon [mailto:address@hidden]
>> Sent: Monday, August 03, 2009 12:29 PM
>> To: Ingo Macherius
>> Cc: address@hidden
>> Subject: Re: [Bug-gnubg] Benchmarks on server class machines
>> and resulting change requests
>>
>>
>> The calibrate function sucks big time. The threaded calibrate
>> function sucks even more. I'm tempted to call it useless. I
>> believe that you are observing the following: There is some
>> overhead involved in displaying and updating the calibration,
>> and as you increase the number of threads, more and more
>> time is allocated to evaluation and less and less to
>> overhead. If you really want to test the speed of the
>> threading, then you should analyse a match or perform a rollout.
>>
>> The original calibration was meant to calibrate certain
>> timing functions against the speed of your computer, so
>> overhead didn't really matter. That is, the function measures
>> the speed of your computer, not the speed of gnubg.
>>
>> Christian.
>>
>> On Sun, Aug 2, 2009 at 5:06 PM, Ingo Macherius wrote:
>>> I have benchmarked gnubg on two server machines, with particular focus
>>> on multithreading. Both machines are headless and run Debian 5.x
>>> Lenny, Kernel 2.6.26-2-amd64 #1 SMP x86_64 GNU/Linux. The hardware is:
>>> box_A: 2xXeon 5130 @ 2GHz (4 physical cores in 2 chips)
>>> box_B: 2xXeon Nocona @ 3GHz (2 physical cores plus 2 HT "cores" in 2
>>> chips)
>>>
>>> I found two issues with current gnubg (latest CVS version as of August
>>> 1st 2009, compiled with gcc 4.3.2.1 with -march=native and sse2
>>> support):
>>>
>>> 1) The "calibrate" command output is off by a factor of 1000, i.e. it
>>> reports eval/s values 1000 times too high. This holds for the figure
>>> reported in the official Debian binary installed via apt-get.
>>>
>>> 2) The limit of 16 threads is too low; I found that to utilize the CPU
>>> power to 100%, 8 threads per core are needed. Interestingly, this holds
>>> for the virtual HT cores as well.
>>>
>>> @1: Please check the timer code; the problem seems to be in timer.c.
>>> Obviously the #ifdef part for Windows is fine, but all other machines
>>> use a faulty version of the timer. I can't really suggest a solution,
>>> but here is some background info from Wikipedia:
>>> http://en.wikipedia.org/wiki/Rdtsc
>>> I would help to fix this one by testing on the aforementioned machines
>>> under 64-bit Linux.
>>> @2: I've tested with a custom gnubg binary with the bug at @1 fixed
>>> the hard way, by hard-coding a division by 1000, and with the thread
>>> limit raised to 256. While calibrate was running, I monitored CPU
>>> utilization using the mpstat command.
>>>
>>> box_A peaks at about 202K eval/s with 8 threads per core (32 total),
>>> from where on the number is static until it starts decreasing again
>>> when you use hundreds of threads. Between 1 and 3 threads I see the
>>> expected gain of almost 100% per thread added. Using 4 threads
>>> lowers the throughput compared to 3 threads. Between 5 and 32
>>> threads I see rising throughput, which is first linear and becomes
>>> asymptotic as we get closer to 32 threads. Below 32 threads, mpstat
>>> reports significant idle times for each CPU; at 32 I see each is using
>>> 100% of the cycles.
>>>
>>> A very similar behavior is visible on box_B, despite the fact that 2 of
>>> its "cores" are virtual HT cores.
>>>
>>> Extrapolating the results suggests gnubg should increase the limit for
>>> the maximum number of threads to 64, maybe even 128 or 256. Rationale:
>>> recent server hardware with dual quad-cores has 8 cores, which should
>>> be fully utilizable only with 64 threads. The suggested 128
>>> anticipates future improvements. As there seems to be little to no
>>> cost with higher values for max. threads, this seems like a cheap way
>>> to speed up gnubg on server class machines and quad cores.
>>>
>>> Cheers,
>>> Ingo
>>>
>>>
>>>
>>>  _______________________________________________
>>> Bug-gnubg  mailing list
>>> address@hidden  http://lists.gnu.org/mailman/listinfo/bug-gnubg
>>>
>>




 
