help-cfengine

Load problem with cfservd


From: Baker, Darryl
Subject: Load problem with cfservd
Date: Mon, 14 Mar 2005 16:08:01 -0500


My master machine is Solaris 9 and all systems are running Solaris 8
or 9 and cfengine 2.1.13.

The problem we have with cfservd manifests as a periodic clog that takes
about a minute to resolve. These clog periods are characterized by the
following symptoms:

1. Load average spike from ~3 (on a 4-processor system) to the 6-8
range. Occasionally the spike breaks into double digits. 
2. Increase in concurrent port 5308 (cfengine) sessions from a base
level of 0-4 to peaks in the 12-30 range, with the number of LWPs in
the cfservd processes tracking the number of connections linearly.
(Client systems are set to connect twice an hour with a 25-minute
splay time.)
3. Running lockstat shows severe contention for a single adaptive
mutex:

root@sysadm05:proc# lockstat sleep 5

Adaptive mutex spin: 157416 events in 5.040 seconds (31233 events/sec)

Count indv cuml rcnt     spin Lock                   Caller
-------------------------------------------------------------------------------
136805  87%  87% 1.00      75 0x152ec90              sfmmu_mlist_enter+0x84
[...]

Adaptive mutex block: 648 events in 5.040 seconds (129 events/sec)

Count indv cuml rcnt     nsec Lock                   Caller
-------------------------------------------------------------------------------
  547  84%  84% 1.00  391652 0x152ec90              sfmmu_mlist_enter+0x84

Both of those lock event types run about two orders of magnitude lower
in total, and this specific lock as much as three orders of magnitude
lower (i.e. ~100 spins and no blocks), when the system is in its
'calm' state.

4. The cfservd process becomes by far the top CPU user, eating 10-25%
of total CPU on a 4-processor system.
5. The system retains some idle time (5-30%), but the time spent in
the kernel jumps to the 40-70% range.

The history of troubleshooting this leads me to believe that the
heavy ssh usage on this host is a significant compounding factor,
i.e. that we hit some common bottleneck when cfservd is accepting
connections while we are spawning batches of 30-100 outbound ssh
connections at once. Reducing the herds of outbound ssh sessions has
reduced the frequency and severity of these clog periods, but every
time we change much of anything on the system, we end up back in a
state where the clogs become common.
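
One simple way to keep those outbound batches bounded is to push the
host list through xargs with a fixed parallelism cap, so no more than
N ssh sessions are in flight at once. A sketch (echo stands in for the
real ssh invocation, and the host names and task path are made up):

```shell
# Cap outbound fan-out at 8 concurrent sessions instead of launching
# 30-100 at once; xargs -P limits the number of simultaneous children.
printf '%s\n' host1 host2 host3 host4 |
    xargs -P 8 -I{} echo "ssh {} /usr/local/sbin/some-task"
```

With GNU xargs, -P 8 means at most 8 children run concurrently and a
new one starts as each finishes, which smooths the burst rather than
serializing it completely.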



_____________________________________________________________________
Darryl Baker
gedas USA, Inc.
Operational Services Business Unit
3800 Hamlin Road
Auburn Hills, MI 48326
US
phone   +1-248-754-5341
fax     +1-248-754-6399
Darryl.Baker@gedas.com
http://www.gedasusa.com
_____________________________________________________________________






