[Savannah-hackers-public] [gnu.org #649274] colonialone ailing
From: Peter Olson via RT
Subject: [Savannah-hackers-public] [gnu.org #649274] colonialone ailing
Date: Mon, 20 Dec 2010 13:21:12 -0500
This is a timeline of the Savannah disk problem.
2010-12-18T04:25 /dev/sdd was added to /dev/md0,1,2,3
The first three arrays are small and resynched very quickly.
2010-12-18T04:31 /dev/md3 starts to resynch /dev/sdd6 (955 million blocks).
Everything is fine for a while.
Dec 18 06:52:45 colonialone mdadm[5026]: Rebuild20 event detected on md device /dev/md3
This means resynch is 20% complete.
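For reference, the resynch progress can be watched with the standard md interfaces; a minimal sketch, not specific to colonialone:

  # overall resync progress, speed, and estimated finish time
  cat /proc/mdstat
  # per-array detail, including rebuild percentage and member state
  mdadm --detail /dev/md3
  # refresh the summary every 30 seconds
  watch -n 30 cat /proc/mdstat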
I'm not sure what this is all about:
Dec 18 07:00:30 colonialone kernel: [1106742.561695] EXT3-fs: INFO: recovery required on readonly filesystem.
Dec 18 07:00:30 colonialone kernel: [1106742.568646] EXT3-fs: write access will be enabled during recovery.
Dec 18 07:00:53 colonialone kernel: [1106765.725031] kjournald starting. Commit interval 5 seconds
Dec 18 07:00:53 colonialone kernel: [1106765.732416] EXT3-fs: recovery complete.
Dec 18 07:00:53 colonialone kernel: [1106765.897655] EXT3-fs: mounted filesystem with ordered data mode.
I had erroneously thought that the monolith backup starts at 07:20, but it starts at 20:07 and typically runs until midnight or 1 AM. Sometimes it runs longer; for example, it ran until 1 in the afternoon on Monday, Dec 13. Perhaps this is caused by log rotation or some other massive change in the file system. We are considering standardizing on datestamped logs for our other machines, so the log filename doesn't change at every rotation. This morning the backup ran until 8 AM:
rsync of savannah-backup.gnu.org started at Sun Dec 12 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Mon Dec 13 13:00:55 EST 2010
rsync of savannah-backup.gnu.org started at Mon Dec 13 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Mon Dec 13 22:07:03 EST 2010
rsync of savannah-backup.gnu.org started at Tue Dec 14 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Wed Dec 15 00:36:57 EST 2010
rsync of savannah-backup.gnu.org started at Wed Dec 15 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Thu Dec 16 00:59:14 EST 2010
rsync of savannah-backup.gnu.org started at Thu Dec 16 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Fri Dec 17 00:36:40 EST 2010
rsync of savannah-backup.gnu.org started at Fri Dec 17 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Sat Dec 18 00:04:56 EST 2010
rsync of savannah-backup.gnu.org started at Sat Dec 18 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Sun Dec 19 00:01:14 EST 2010
rsync of savannah-backup.gnu.org started at Sun Dec 19 20:07:01 EST 2010
rsync of savannah-backup.gnu.org finished at Mon Dec 20 08:17:39 EST 2010
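As an illustration of the datestamped-log idea mentioned above, a minimal sketch of what a wrapper might look like; the paths and rsync arguments are hypothetical, not our actual monolith script:

  #!/bin/sh
  # Hypothetical wrapper: one log per run, named by date, so the
  # filename never changes underneath a rotation.
  LOG=/var/log/backup/rsync-$(date +%Y%m%d).log
  echo "rsync of savannah-backup.gnu.org started at $(date)" >> "$LOG"
  rsync -aH --delete /srv/ backup-host:/backup/savannah/ >> "$LOG" 2>&1
  echo "rsync of savannah-backup.gnu.org finished at $(date)" >> "$LOG"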
On Saturday the hung-task notifications start around 07:20 AM, but my theory of why this happened is faulty (it's not because of load from the monolith backup).
Dec 18 07:21:50 colonialone kernel: [1108026.400926] INFO: task blkback.5.sda2:5001 blocked for more than 120 seconds.
Dec 18 07:21:51 colonialone kernel: [1108026.406840] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 18 07:21:51 colonialone kernel: [1108026.418761] blkback.5.sda D ffffffff8044af00 0 5001 2
Dec 18 07:21:51 colonialone kernel: [1108026.424213] ffff8802bbc439c0 0000000000000246 0000000000000000 ffff8805df47a740
Dec 18 07:21:51 colonialone kernel: [1108026.431214] ffff88053cfad440 ffffffff804ff460 ffff88053cfad6c0 0000000080266ba3
Dec 18 07:21:51 colonialone kernel: [1108026.441373] 00000000ffffffff 0000000000000000 0000000000000010 ffff8805e25d4c40
Dec 18 07:21:51 colonialone kernel: [1108026.447733] Call Trace:
Dec 18 07:21:51 colonialone kernel: [1108026.450819] [<ffffffffa00ef395>] :raid1:wait_barrier+0xb8/0x12c
Dec 18 07:21:51 colonialone kernel: [1108026.457985] [<ffffffff80224c8a>] default_wake_function+0x0/0xe
Dec 18 07:21:51 colonialone kernel: [1108026.464714] [<ffffffffa00f83f8>] :dm_mod:__split_bio+0x363/0x374
Dec 18 07:21:51 colonialone kernel: [1108026.471151] [<ffffffffa00f0ff9>] :raid1:make_request+0x73/0x5b6
Dec 18 07:21:51 colonialone kernel: [1108026.477958] [<ffffffff802fdcc1>] elv_merged_request+0x30/0x39
Dec 18 07:21:51 colonialone kernel: [1108026.484758] [<ffffffff80311b13>] __up_read+0x13/0x8a
Dec 18 07:21:51 colonialone kernel: [1108026.490338] [<ffffffff802ffb1e>] generic_make_request+0x2fe/0x339
Dec 18 07:21:51 colonialone kernel: [1108026.497154] [<ffffffff80266ba3>] mempool_alloc+0x24/0xda
Dec 18 07:21:51 colonialone kernel: [1108026.503219] [<ffffffff80300df8>] submit_bio+0xdb/0xe2
Dec 18 07:21:51 colonialone kernel: [1108026.509646] [<ffffffff80387254>] dispatch_rw_block_io+0x5b9/0x65d
Dec 18 07:21:51 colonialone kernel: [1108026.515434] [<ffffffff802ffbb3>] blk_unplug+0x5a/0x60
Dec 18 07:21:51 colonialone kernel: [1108026.522210] [<ffffffffa00f9b19>] :dm_mod:dm_table_unplug_all+0x2a/0x3d
Dec 18 07:21:51 colonialone kernel: [1108026.529606] [<ffffffff80387747>] blkif_schedule+0x3c7/0x4dc
Dec 18 07:21:51 colonialone kernel: [1108026.535638] [<ffffffff80387380>] blkif_schedule+0x0/0x4dc
Dec 18 07:21:51 colonialone kernel: [1108026.541776] [<ffffffff8023f57f>] kthread+0x47/0x74
Dec 18 07:21:51 colonialone kernel: [1108026.547366] [<ffffffff802282ec>] schedule_tail+0x27/0x5c
Dec 18 07:21:51 colonialone kernel: [1108026.553793] [<ffffffff8020be28>] child_rip+0xa/0x12
Dec 18 07:21:51 colonialone kernel: [1108026.559343] [<ffffffff8023f538>] kthread+0x0/0x74
Dec 18 07:21:51 colonialone kernel: [1108026.564794] [<ffffffff8020be1e>] child_rip+0x0/0x12
Note that this may not be a fatal problem; it may just be that the machine is very overloaded. Practically speaking it is a show-stopper, because the resynchronization process makes very little progress once this happens.
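One mitigation worth considering (untested here, just the standard md and hung-task knobs) is to throttle the resync so the DomUs keep some I/O headroom, or at least to raise the warning threshold:

  # cap the background resync rate, in KB/s per device
  echo 50000 > /proc/sys/dev/raid/speed_limit_max
  echo 1000  > /proc/sys/dev/raid/speed_limit_min
  # raise the hung-task warning threshold from 120 to 600 seconds
  # (echoing 0 disables the message entirely, as the log above says)
  echo 600 > /proc/sys/kernel/hung_task_timeout_secs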
Dec 18 07:21:51 colonialone kernel: [1108026.572483] INFO: task md3_resync:713 blocked for more than 120 seconds.
Dec 18 07:21:51 colonialone kernel: [1108026.580017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 18 07:21:51 colonialone kernel: [1108026.588702] md3_resync D ffffffff8044af00 0 713 2
Dec 18 07:21:51 colonialone kernel: [1108026.595326] ffff8802bbfffce0 0000000000000246 0000000000000000 ffff8805df963ee0
Dec 18 07:21:51 colonialone kernel: [1108026.603863] ffff8805e25e7040 ffff8805e245e040 ffff8805e25e72c0 0000000200000000
Dec 18 07:21:51 colonialone kernel: [1108026.612292] 00000000ffffffff ffffffff8023f6b6 ffff8805dfa705c0 ffffffff80221555
For the DomUs it causes functional problems, such as sshd not being able to complete the protocol negotiation for a successful login to a VM. This may have failed on the Dom0 as well, since I wound up using servent to contact colonialone (perhaps because I could not ssh to it).
At 9 AM Bernie rebooted colonialone:
Dec 18 09:03:54 colonialone kernel: [1114157.641661] md: cannot remove active disk sdd6 from md3 ...
Dec 18 09:04:00 colonialone kernel: [1114163.556083] raid1: Disk failure on sdd6, disabling device.
Dec 18 09:04:00 colonialone kernel: [1114163.556086] raid1: Operation continuing on 2 devices.
Dec 18 09:04:00 colonialone mdadm[5026]: FailSpare event detected on md device /dev/md3, component device /dev/sdd6
Dec 18 09:04:08 colonialone shutdown[8193]: shutting down for system reboot
Dec 18 09:04:08 colonialone init: Switching to runlevel: 6
Dec 18 09:04:08 colonialone rpc.statd[2395]: Caught signal 15, un-registering and exiting.
Dec 18 09:04:08 colonialone snmpd[2839]: Received TERM or STOP signal... shutting down...
Dec 18 09:10:10 colonialone kernel: [1114537.764182] [<ffffffff8020b528>] system_call+0x68/0x6d
Dec 18 09:10:10 colonialone kernel: [1114537.770695] [<ffffffff8020b4c0>] system_call+0x0/0x6d
Dec 18 09:10:10 colonialone kernel: [1114537.776042]
Dec 18 09:12:00 colonialone kernel: imklog 3.18.6, log source = /proc/kmsg started.
Dec 18 09:12:00 colonialone rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="2509" x-info="http://www.rsyslog.com"] restart
Dec 18 09:12:00 colonialone kernel: [ 0.000000] Initializing cgroup subsys cpuset
Dec 18 09:12:00 colonialone kernel: [ 0.000000] Initializing cgroup subsys cpu
Dec 18 09:12:00 colonialone kernel: [ 0.000000] Linux version 2.6.26-2-xen-amd64 (Debian 2.6.26-26lenny1) (address@hidden) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 SMP Thu Nov 25 06:39:26 UTC 2010
Dec 18 09:12:00 colonialone kernel: [ 0.000000] Command line: root=/dev/md2 ro console=ttyS0,115200n8 console=tty0 acpi=off
Here's where /dev/sda6 and /dev/sdd6 are rejected:
Dec 18 09:12:00 colonialone kernel: [ 10.978740] md: md3 stopped.
Dec 18 09:12:00 colonialone kernel: [ 11.050788] md: bind<sda6>
Dec 18 09:12:00 colonialone kernel: [ 11.053846] md: bind<sdb6>
Dec 18 09:12:00 colonialone kernel: [ 11.057885] md: bind<sdd6>
Dec 18 09:12:00 colonialone kernel: [ 11.061587] md: bind<sdc6>
Dec 18 09:12:00 colonialone kernel: [ 11.064449] md: kicking non-fresh sdd6 from array!
Dec 18 09:12:00 colonialone kernel: [ 11.069346] md: unbind<sdd6>
Dec 18 09:12:00 colonialone kernel: [ 11.072410] md: export_rdev(sdd6)
Dec 18 09:12:00 colonialone kernel: [ 11.075384] md: kicking non-fresh sda6 from array!
Dec 18 09:12:00 colonialone kernel: [ 11.080268] md: unbind<sda6>
Dec 18 09:12:00 colonialone kernel: [ 11.083244] md: export_rdev(sda6)
Dec 18 09:12:00 colonialone kernel: [ 11.086655] md: md3: raid array is not clean -- starting background reconstruction
Dec 18 09:12:00 colonialone kernel: [ 11.106510] raid1: raid set md3 active with 2 out of 3 mirrors
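For completeness, putting the kicked members back in is the usual mdadm add; a sketch, assuming the partitions themselves are healthy:

  # confirm current membership and state of the array
  mdadm --detail /dev/md3
  # re-add the kicked partitions; md resyncs them from the good mirrors
  mdadm /dev/md3 --add /dev/sda6
  mdadm /dev/md3 --add /dev/sdd6
  # progress then shows up in /proc/mdstat
  cat /proc/mdstat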
Here is some information from SMART (Bernie started smartd Dec 18 19:46, so we have no data prior to that to compare):
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 112 sdb
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 112 sdc
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 112 to 114 sdc
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 112 to 115 sdb
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 114 to 115 sdc
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 116 sdb
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 116 sdc
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 117 sdb
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 117 sdc
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 118 sdb
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 119 to 111 sdc
SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 71 sdd
SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 70 sdd
SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 72 sdc
SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 71 sdc
SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 73 sda
SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 73 to 72 sda
SMART Usage Attribute: 194 Temperature_Celsius changed from 27 to 28 sda
SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 27 sda
SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 29 sdc
SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 28 sdc
SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 30 sdd
SMART Usage Attribute: 194 Temperature_Celsius changed from 30 to 29 sdd
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 41 to 48 sdc
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 43 to 44 sdc
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 44 to 43 sdc
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 45 to 44 sdc
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 45 sdc
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 49 sdb
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 49 to 50 sdb
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 50 to 49 sdb
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 50 to 51 sdb
SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 51 to 50 sdb
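The smartd lines above only show deltas; for the absolute values and the drives' own error logs, smartctl (from the same smartmontools package) can be queried directly. A sketch:

  # full attribute table for one drive
  smartctl -A /dev/sdc
  # overall health verdict and the drive's error log
  smartctl -H /dev/sdc
  smartctl -l error /dev/sdc
  # queue a long self-test (runs inside the drive; results via -l selftest)
  smartctl -t long /dev/sdc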
Saturday evening we power cycled at 18:28:
Dec 18 18:28:05 colonialone kernel: [26770.798179] [<ffffffff8023f538>] kthread+0x0/0x74
Dec 18 18:28:05 colonialone kernel: [26770.803143] [<ffffffff8020be1e>] child_rip+0x0/0x12
Dec 18 18:28:05 colonialone kernel: [26770.811011]
Dec 18 18:38:00 colonialone kernel: imklog 3.18.6, log source = /proc/kmsg started.
Dec 18 18:38:00 colonialone rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="2562" x-info="http://www.rsyslog.com"] restart
Dec 18 18:38:00 colonialone kernel: [ 0.000000] Initializing cgroup subsys cpuset
Dec 18 18:38:00 colonialone kernel: [ 0.000000] Initializing cgroup subsys cpu
Sunday morning, the resynchronization reached 81% before things went sour.
Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 102 to 113
Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdb, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 51 to 52
Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdc, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 120
Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdc, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 48
Dec 19 07:17:01 colonialone /USR/SBIN/CRON[3261]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 19 07:17:19 colonialone kernel: [45888.504962] INFO: task md3_resync:3307 blocked for more than 120 seconds.
Dec 19 07:17:19 colonialone kernel: [45888.510214] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 19 07:17:19 colonialone kernel: [45888.519215] md3_resync D ffff8805bef5be9c 0 3307 2
Dec 19 07:17:19 colonialone kernel: [45888.527748] ffff8805bef5bce0 0000000000000246 0000000000000001 ffff8805df67dee0
Dec 19 07:17:19 colonialone kernel: [45888.535827] ffff8805e25a2480 ffff8805e2704e80 ffff8805e25a2700 0000000300000000
Dec 19 07:17:19 colonialone kernel: [45888.543885] ffff8805bef5bca0 ffffffff8023f6b6 ffff8805df50a140 ffffffff80221555
(It is this event, starting at 07:17, which made me realize my theory about the monolith backups was wrong.) Perhaps something else happens shortly after 7 AM each day? But what about the timing of the other failures?
For planning the resynchronization of two more drives into the arrays, here is the timing of events:
Dec 18 06:52:45 colonialone mdadm[5026]: Rebuild20 event detected on md device /dev/md3
hung tasks at 07:21
rebooted at 9 AM due to overload
hung tasks at 09:41
Bernie takes down the VMs at 10:21
rebooted at 11 AM
Dec 18 13:34:39 colonialone mdadm[4980]: Rebuild20 event detected on md device /dev/md3
Dec 18 16:15:38 colonialone mdadm[4980]: Rebuild40 event detected on md device /dev/md3
hung tasks at 18:13
rebooted at 18:38
Dec 18 23:29:03 colonialone mdadm[4987]: Rebuild20 event detected on md device /dev/md3
Dec 19 01:59:03 colonialone mdadm[4987]: Rebuild40 event detected on md device /dev/md3
Dec 19 04:08:03 colonialone mdadm[4987]: Rebuild60 event detected on md device /dev/md3
Dec 19 06:48:04 colonialone mdadm[4987]: Rebuild80 event detected on md device /dev/md3
hung tasks at 07:17
rebooted at 07:30
Dec 19 07:36:26 colonialone mdadm[5098]: RebuildStarted event detected on md device /dev/md3
Dec 19 08:14:26 colonialone mdadm[5098]: Rebuild20 event detected on md device /dev/md3
Dec 19 08:55:26 colonialone mdadm[5098]: Rebuild40 event detected on md device /dev/md3
Dec 19 09:38:26 colonialone mdadm[5098]: Rebuild60 event detected on md device /dev/md3
Dec 19 10:27:26 colonialone mdadm[5098]: Rebuild80 event detected on md device /dev/md3
Dec 19 11:19:24 colonialone mdadm[5098]: RebuildFinished event detected on md device /dev/md3
rebooted at 11:22 as a countermeasure in case some corruption had accumulated (no theory about this) and also to prove that the RAID would be synchronized after the boot
no more hung tasks since
So it appears that if we accept 4 hours of downtime (no VMs running) we can add one disk to the RAID. Adding two disks would take longer, but probably not a lot longer. Trying to add disks without an explanation of why the overloads take place seems risky.
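If we do schedule it, the window would roughly look like the following; a sketch only, with a placeholder for the drive letter, and assuming the xm tools we already use for the DomUs (the config filename is hypothetical):

  # stop all DomUs so the resync has the disks to itself
  xm shutdown -a -w
  # add the next partition into the degraded array (sdX6 is a placeholder)
  mdadm /dev/md3 --add /dev/sdX6
  # watch until the rebuild finishes (roughly 4 hours at current rates)
  watch -n 60 cat /proc/mdstat
  # restart the DomUs afterwards, one per config file
  xm create /etc/xen/example-vm.cfg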
Peter Olson
FSF Senior Systems Administrator