From: Peter Olson via RT
Subject: [Savannah-hackers-public] [gnu.org #649274] colonialone ailing
Date: Mon, 20 Dec 2010 13:21:12 -0500

This is a timeline of the Savannah disk problem.

2010-12-18T04:25 /dev/sdd was added to /dev/md0,1,2,3

The first three arrays were small and resynced very quickly.

2010-12-18T04:31 /dev/md3 starts to resync /dev/sdd6 (955 million blocks).
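(For reference, the disk would have been added with commands along these lines; this is only a sketch, and since the logs above only document the sdd6 -> md3 pairing, the partition-table copy and the other partition numbers are assumptions:)

# copy the partition table from a healthy member first (assumed step)
sfdisk -d /dev/sda | sfdisk /dev/sdd
# then add each new partition to its array; only sdd6 -> md3 is documented
mdadm /dev/md3 --add /dev/sdd6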

Everything is fine for a while.

Dec 18 06:52:45 colonialone mdadm[5026]: Rebuild20 event detected on md device /dev/md3

This means the resync is 20% complete.
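(Those RebuildNN lines come from the mdadm monitor daemon; the same progress can be checked at any time with something like:)

cat /proc/mdstat
mdadm --detail /dev/md3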

I'm not sure what this is all about:

Dec 18 07:00:30 colonialone kernel: [1106742.561695] EXT3-fs: INFO: recovery required on readonly filesystem.
Dec 18 07:00:30 colonialone kernel: [1106742.568646] EXT3-fs: write access will be enabled during recovery.
Dec 18 07:00:53 colonialone kernel: [1106765.725031] kjournald starting.  Commit interval 5 seconds
Dec 18 07:00:53 colonialone kernel: [1106765.732416] EXT3-fs: recovery complete.
Dec 18 07:00:53 colonialone kernel: [1106765.897655] EXT3-fs: mounted filesystem with ordered data mode.
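My guess, unconfirmed, is that a DomU filesystem was being mounted and journal-recovered around 07:00. A couple of ways to check (the log path assumes xend's default location and timestamp format, and the device name below is only a placeholder):

# did a DomU get (re)started around then?
grep '2010-12-18 0[67]:' /var/log/xen/xend.log
# compare a guest filesystem's last-mount time against the message above
tune2fs -l /dev/mapper/SOMEVG-SOMELV | grep 'Last mount time'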

I had erroneously thought that the monolith backup starts at 07:20, but it actually starts at 20:07 and typically runs until midnight or 1 AM.  Sometimes it runs longer; for example, it ran until 1 PM on Monday, Dec 13.  Perhaps this is caused by log rotation or some other massive change in the filesystem.  We are considering standardizing on datestamped logs for our other machines, so the log filename doesn't change at every rotation (see the sketch after the listing below).  This morning the backup ran until after 8 AM:

rsync of savannah-backup.gnu.org started at Sun Dec 12 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Mon Dec 13 13:00:55 EST 2010
rsync of savannah-backup.gnu.org started at Mon Dec 13 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Mon Dec 13 22:07:03 EST 2010
rsync of savannah-backup.gnu.org started at Tue Dec 14 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Wed Dec 15 00:36:57 EST 2010
rsync of savannah-backup.gnu.org started at Wed Dec 15 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Thu Dec 16 00:59:14 EST 2010
rsync of savannah-backup.gnu.org started at Thu Dec 16 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Fri Dec 17 00:36:40 EST 2010
rsync of savannah-backup.gnu.org started at Fri Dec 17 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Sat Dec 18 00:04:56 EST 2010
rsync of savannah-backup.gnu.org started at Sat Dec 18 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Sun Dec 19 00:01:14 EST 2010
rsync of savannah-backup.gnu.org started at Sun Dec 19 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Mon Dec 20 08:17:39 EST 2010
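If we do standardize on datestamped logs, the relevant logrotate directive is dateext; a minimal sketch, assuming the other machines use logrotate:

# append to /etc/logrotate.conf (or a per-service file): rotate to
# name-YYYYMMDD instead of shuffling name.1 -> name.2, so each rotation
# creates one new file rather than renaming every old one
cat >> /etc/logrotate.conf <<'EOF'
dateext
EOF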

On Saturday the hung-task notifications started around 07:20 AM, but my theory of why this happened turned out to be faulty (it's not because of load from the monolith backup).

Dec 18 07:21:50 colonialone kernel: [1108026.400926] INFO: task blkback.5.sda2:5001 blocked for more than 120 seconds.
Dec 18 07:21:51 colonialone kernel: [1108026.406840] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 18 07:21:51 colonialone kernel: [1108026.418761] blkback.5.sda D ffffffff8044af00     0  5001      2
Dec 18 07:21:51 colonialone kernel: [1108026.424213]  ffff8802bbc439c0 0000000000000246 0000000000000000 ffff8805df47a740
Dec 18 07:21:51 colonialone kernel: [1108026.431214]  ffff88053cfad440 ffffffff804ff460 ffff88053cfad6c0 0000000080266ba3
Dec 18 07:21:51 colonialone kernel: [1108026.441373]  00000000ffffffff 0000000000000000 0000000000000010 ffff8805e25d4c40
Dec 18 07:21:51 colonialone kernel: [1108026.447733] Call Trace:
Dec 18 07:21:51 colonialone kernel: [1108026.450819]  [<ffffffffa00ef395>] :raid1:wait_barrier+0xb8/0x12c
Dec 18 07:21:51 colonialone kernel: [1108026.457985]  [<ffffffff80224c8a>] default_wake_function+0x0/0xe
Dec 18 07:21:51 colonialone kernel: [1108026.464714]  [<ffffffffa00f83f8>] :dm_mod:__split_bio+0x363/0x374
Dec 18 07:21:51 colonialone kernel: [1108026.471151]  [<ffffffffa00f0ff9>] :raid1:make_request+0x73/0x5b6
Dec 18 07:21:51 colonialone kernel: [1108026.477958]  [<ffffffff802fdcc1>] elv_merged_request+0x30/0x39
Dec 18 07:21:51 colonialone kernel: [1108026.484758]  [<ffffffff80311b13>] __up_read+0x13/0x8a
Dec 18 07:21:51 colonialone kernel: [1108026.490338]  [<ffffffff802ffb1e>] generic_make_request+0x2fe/0x339
Dec 18 07:21:51 colonialone kernel: [1108026.497154]  [<ffffffff80266ba3>] mempool_alloc+0x24/0xda
Dec 18 07:21:51 colonialone kernel: [1108026.503219]  [<ffffffff80300df8>] submit_bio+0xdb/0xe2
Dec 18 07:21:51 colonialone kernel: [1108026.509646]  [<ffffffff80387254>] dispatch_rw_block_io+0x5b9/0x65d
Dec 18 07:21:51 colonialone kernel: [1108026.515434]  [<ffffffff802ffbb3>] blk_unplug+0x5a/0x60
Dec 18 07:21:51 colonialone kernel: [1108026.522210]  [<ffffffffa00f9b19>] :dm_mod:dm_table_unplug_all+0x2a/0x3d
Dec 18 07:21:51 colonialone kernel: [1108026.529606]  [<ffffffff80387747>] blkif_schedule+0x3c7/0x4dc
Dec 18 07:21:51 colonialone kernel: [1108026.535638]  [<ffffffff80387380>] blkif_schedule+0x0/0x4dc
Dec 18 07:21:51 colonialone kernel: [1108026.541776]  [<ffffffff8023f57f>] kthread+0x47/0x74
Dec 18 07:21:51 colonialone kernel: [1108026.547366]  [<ffffffff802282ec>] schedule_tail+0x27/0x5c
Dec 18 07:21:51 colonialone kernel: [1108026.553793]  [<ffffffff8020be28>] child_rip+0xa/0x12
Dec 18 07:21:51 colonialone kernel: [1108026.559343]  [<ffffffff8023f538>] kthread+0x0/0x74
Dec 18 07:21:51 colonialone kernel: [1108026.564794]  [<ffffffff8020be1e>] child_rip+0x0/0x12

Note that this may not be a fatal problem; it may just be that the machine is very overloaded.

Practically speaking it is a showstopper, because the resynchronization process makes very little progress once this happens.
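One knob worth considering before the next attempt is capping the resync rate so it leaves more I/O headroom for the guests; this is only a sketch, and the numbers are examples, not tested values:

# system-wide md resync ceiling and floor, in KiB/s
echo 20000 > /proc/sys/dev/raid/speed_limit_max
echo 1000  > /proc/sys/dev/raid/speed_limit_min
# or per array
echo 20000 > /sys/block/md3/md/sync_speed_max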

Dec 18 07:21:51 colonialone kernel: [1108026.572483] INFO: task md3_resync:713 blocked for more than 120 seconds.
Dec 18 07:21:51 colonialone kernel: [1108026.580017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 18 07:21:51 colonialone kernel: [1108026.588702] md3_resync    D ffffffff8044af00     0   713      2
Dec 18 07:21:51 colonialone kernel: [1108026.595326]  ffff8802bbfffce0 0000000000000246 0000000000000000 ffff8805df963ee0
Dec 18 07:21:51 colonialone kernel: [1108026.603863]  ffff8805e25e7040 ffff8805e245e040 ffff8805e25e72c0 0000000200000000
Dec 18 07:21:51 colonialone kernel: [1108026.612292]  00000000ffffffff ffffffff8023f6b6 ffff8805dfa705c0 ffffffff80221555

For the DomUs this causes functional problems, such as sshd being unable to complete the protocol negotiation for a successful login to a VM.

This may have failed on the Dom0 as well, since I wound up using servent to contact colonialone (perhaps because I could not ssh to it).

At 9 AM Bernie rebooted colonialone:

Dec 18 09:03:54 colonialone kernel: [1114157.641661] md: cannot remove active disk sdd6 from md3 ...
Dec 18 09:04:00 colonialone kernel: [1114163.556083] raid1: Disk failure on sdd6, disabling device.
Dec 18 09:04:00 colonialone kernel: [1114163.556086] raid1: Operation continuing on 2 devices.
Dec 18 09:04:00 colonialone mdadm[5026]: FailSpare event detected on md device /dev/md3, component device /dev/sdd6
Dec 18 09:04:08 colonialone shutdown[8193]: shutting down for system reboot
Dec 18 09:04:08 colonialone init: Switching to runlevel: 6
Dec 18 09:04:08 colonialone rpc.statd[2395]: Caught signal 15, un-registering and exiting.
Dec 18 09:04:08 colonialone snmpd[2839]: Received TERM or STOP signal...  shutting down...


Dec 18 09:10:10 colonialone kernel: [1114537.764182]  [<ffffffff8020b528>] system_call+0x68/0x6d
Dec 18 09:10:10 colonialone kernel: [1114537.770695]  [<ffffffff8020b4c0>] system_call+0x0/0x6d
Dec 18 09:10:10 colonialone kernel: [1114537.776042]

Dec 18 09:12:00 colonialone kernel: imklog 3.18.6, log source = /proc/kmsg started.
Dec 18 09:12:00 colonialone rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="2509" x-info="http://www.rsyslog.com"] restart
Dec 18 09:12:00 colonialone kernel: [    0.000000] Initializing cgroup subsys cpuset
Dec 18 09:12:00 colonialone kernel: [    0.000000] Initializing cgroup subsys cpu
Dec 18 09:12:00 colonialone kernel: [    0.000000] Linux version 2.6.26-2-xen-amd64 (Debian 2.6.26-26lenny1) (address@hidden) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 SMP Thu Nov 25 06:39:26 UTC 2010
Dec 18 09:12:00 colonialone kernel: [    0.000000] Command line: root=/dev/md2 ro console=ttyS0,115200n8 console=tty0 acpi=off

Here's where /dev/sda6 and /dev/sdd6 are rejected:

Dec 18 09:12:00 colonialone kernel: [   10.978740] md: md3 stopped.
Dec 18 09:12:00 colonialone kernel: [   11.050788] md: bind<sda6>
Dec 18 09:12:00 colonialone kernel: [   11.053846] md: bind<sdb6>
Dec 18 09:12:00 colonialone kernel: [   11.057885] md: bind<sdd6>
Dec 18 09:12:00 colonialone kernel: [   11.061587] md: bind<sdc6>
Dec 18 09:12:00 colonialone kernel: [   11.064449] md: kicking non-fresh sdd6 from array!
Dec 18 09:12:00 colonialone kernel: [   11.069346] md: unbind<sdd6>
Dec 18 09:12:00 colonialone kernel: [   11.072410] md: export_rdev(sdd6)
Dec 18 09:12:00 colonialone kernel: [   11.075384] md: kicking non-fresh sda6 from array!
Dec 18 09:12:00 colonialone kernel: [   11.080268] md: unbind<sda6>
Dec 18 09:12:00 colonialone kernel: [   11.083244] md: export_rdev(sda6)
Dec 18 09:12:00 colonialone kernel: [   11.086655] md: md3: raid array is not clean -- starting background reconstruction
Dec 18 09:12:00 colonialone kernel: [   11.106510] raid1: raid set md3 active with 2 out of 3 mirrors

Here is some information from SMART (Bernie started smartd on Dec 18 at 19:46, so we have no earlier data to compare against; see the sketch after the list for capturing a fuller baseline):

 SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 112  sdb
 SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 112  sdc
 SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 112 to 114  sdc
 SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 112 to 115  sdb
 SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 114 to 115  sdc
 SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 116  sdb
 SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 116  sdc
 SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 117  sdb
 SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 117  sdc
 SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 118  sdb
 SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 119 to 111  sdc
 SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 71 sdd
 SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 70 sdd
 SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 72 sdc
 SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 71 sdc
 SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 73 sda
 SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 73 to 72 sda
 SMART Usage Attribute: 194 Temperature_Celsius changed from 27 to 28 sda
 SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 27 sda
 SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 29 sdc
 SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 28 sdc
 SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 30 sdd
 SMART Usage Attribute: 194 Temperature_Celsius changed from 30 to 29 sdd
 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 41 to 48  sdc
 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 43 to 44  sdc
 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 44 to 43  sdc
 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 45 to 44  sdc
 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 45  sdc
 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 49  sdb
 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 49 to 50  sdb
 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 50 to 49  sdb
 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 50 to 51  sdb
 SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 51 to 50  sdb
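Since we only have smartd deltas from Dec 18 onward, it may be worth capturing a full baseline per drive now so future changes have something to be compared against; a sketch:

# dump full SMART data and kick off a long self-test on each drive
for d in /dev/sd[abcd]; do
    smartctl -a "$d" > "/root/smart-$(basename "$d")-$(date +%Y%m%d).txt"
    smartctl -t long "$d"    # read results later with: smartctl -l selftest
done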

Saturday evening we power-cycled the machine at 18:28:

Dec 18 18:28:05 colonialone kernel: [26770.798179]  [<ffffffff8023f538>] kthread+0x0/0x74
Dec 18 18:28:05 colonialone kernel: [26770.803143]  [<ffffffff8020be1e>] child_rip+0x0/0x12
Dec 18 18:28:05 colonialone kernel: [26770.811011]
Dec 18 18:38:00 colonialone kernel: imklog 3.18.6, log source = /proc/kmsg started.
Dec 18 18:38:00 colonialone rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="2562" x-info="http://www.rsyslog.com"] restart
Dec 18 18:38:00 colonialone kernel: [    0.000000] Initializing cgroup subsys cpuset
Dec 18 18:38:00 colonialone kernel: [    0.000000] Initializing cgroup subsys cpu

Sunday morning, the resynchronization reached 81% before things went sour.

Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 102 to 113
Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdb, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 51 to 52
Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdc, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 120
Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdc, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 48

Dec 19 07:17:01 colonialone /USR/SBIN/CRON[3261]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Dec 19 07:17:19 colonialone kernel: [45888.504962] INFO: task md3_resync:3307 blocked for more than 120 seconds.
Dec 19 07:17:19 colonialone kernel: [45888.510214] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 19 07:17:19 colonialone kernel: [45888.519215] md3_resync    D ffff8805bef5be9c     0  3307      2
Dec 19 07:17:19 colonialone kernel: [45888.527748]  ffff8805bef5bce0 0000000000000246 0000000000000001 ffff8805df67dee0
Dec 19 07:17:19 colonialone kernel: [45888.535827]  ffff8805e25a2480 ffff8805e2704e80 ffff8805e25a2700 0000000300000000
Dec 19 07:17:19 colonialone kernel: [45888.543885]  ffff8805bef5bca0 ffffffff8023f6b6 ffff8805df50a140 ffffffff80221555

(It was this, starting at 07:17, that made me realize my theory about the monolith backups was wrong.)

Perhaps something else happens shortly after 7 AM each day?  But if so, what explains the timing of the other failures?
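One way to test the "shortly after 7 AM" idea is to list what cron fires around then (a sketch; on a stock Debian /etc/crontab, cron.daily runs at 06:25):

# what is scheduled system-wide and for root
cat /etc/crontab
ls /etc/cron.d /etc/cron.hourly /etc/cron.daily
crontab -l -u root
# what actually ran around the failure windows
grep CRON /var/log/syslog | grep ' 07:'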

For planning the resynchronization of two more drives into the arrays, here is the timing of events so far:

Dec 18 06:52:45 colonialone mdadm[5026]: Rebuild20 event detected on md device /dev/md3
hung tasks at 07:21
rebooted at 9 AM due to overload
hung tasks at 09:41
Bernie takes down the VMs at 10:21
rebooted at 11 AM

Dec 18 13:34:39 colonialone mdadm[4980]: Rebuild20 event detected on md device /dev/md3
Dec 18 16:15:38 colonialone mdadm[4980]: Rebuild40 event detected on md device /dev/md3
hung tasks at 18:13
rebooted at 18:38

Dec 18 23:29:03 colonialone mdadm[4987]: Rebuild20 event detected on md device /dev/md3
Dec 19 01:59:03 colonialone mdadm[4987]: Rebuild40 event detected on md device /dev/md3
Dec 19 04:08:03 colonialone mdadm[4987]: Rebuild60 event detected on md device /dev/md3
Dec 19 06:48:04 colonialone mdadm[4987]: Rebuild80 event detected on md device /dev/md3
hung tasks at 07:17
rebooted at 07:30

Dec 19 07:36:26 colonialone mdadm[5098]: RebuildStarted event detected on md device /dev/md3
Dec 19 08:14:26 colonialone mdadm[5098]: Rebuild20 event detected on md device /dev/md3
Dec 19 08:55:26 colonialone mdadm[5098]: Rebuild40 event detected on md device /dev/md3
Dec 19 09:38:26 colonialone mdadm[5098]: Rebuild60 event detected on md device /dev/md3
Dec 19 10:27:26 colonialone mdadm[5098]: Rebuild80 event detected on md device /dev/md3
Dec 19 11:19:24 colonialone mdadm[5098]: RebuildFinished event detected on md device /dev/md3
rebooted at 11:22 as a countermeasure in case some corruption had accumulated (no theory about this), and also to prove that the RAID would still be synchronized after the boot
no more hung tasks since then

So it appears that if we accept about 4 hours of downtime (no VMs running) we can add one disk to the RAID.  Adding two disks would take longer, but probably not a lot longer.  Trying to add disks without an explanation of why the overloads take place seems risky.
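If we do go ahead, the procedure I have in mind for each disk is roughly the following; this is only a sketch, and the guest config path and partition number are placeholders:

# stop the guests so the array has the machine to itself
xm list
xm shutdown --all --wait
# add the new partition and watch the rebuild
mdadm /dev/md3 --add /dev/sdX6
watch -n 60 cat /proc/mdstat
# once /proc/mdstat shows the array clean again, bring the guests back
xm create /etc/xen/GUEST.cfg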

Peter Olson
FSF Senior Systems Administrator




