[Savannah-hackers-public] savannah issues


From: Ward Vandewege
Subject: [Savannah-hackers-public] savannah issues
Date: Sun, 7 Feb 2010 09:24:53 -0500
User-agent: Mutt/1.5.18 (2008-05-17)

Hi Sylvain,

Savannah's having problems.

It seems to have been triggered by a combination of the monthly mdadm array
check (enabled in /etc/default/mdadm) and the rsync backup script,
/root/remote_backup.sh, which kicked off around 7am this morning.

All the domUs are so starved for CPU and/or IO that they have lots of these
on console:

[2423148.713493]  =======================
[2423152.912986] INFO: task kjournald:506 blocked for more than 120 seconds.
[2423152.913002] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2423152.913009] kjournald     D ed753dec     0   506      2
[2423152.913018]        ecc13280 00000246 ed6a1f0c ed753dec c010621f ecc13408 c115ab40 00000000
[2423152.913033]        ec1dce40 8abe4250 240b36fe 8abe34a7 0013398d ed753dec 0ed26a5a ed6a1f0c
[2423152.913047]        ed753dec c0133204 0ed26a5a ed753dec 0ed26a5a ed6a1f0c c115ab40 00d9b000
[2423152.913065] Call Trace:
[2423152.913073]  [<c010621f>] xen_clocksource_read+0xc/0x164
[2423152.913085]  [<c0133204>] getnstimeofday+0x37/0xbc
[2423152.913096]  [<c02caa7c>] io_schedule+0x49/0x80
[2423152.913104]  [<c018e286>] sync_buffer+0x30/0x33
[2423152.913114]  [<c02cac6a>] __wait_on_bit+0x33/0x58
[2423152.913121]  [<c018e256>] sync_buffer+0x0/0x33
[2423152.913130]  [<c018e256>] sync_buffer+0x0/0x33
[2423152.913136]  [<c02cacee>] out_of_line_wait_on_bit+0x5f/0x67
[2423152.913146]  [<c012ecc5>] wake_bit_function+0x0/0x3c
[2423152.913155]  [<c018e222>] __wait_on_buffer+0x16/0x18
[2423152.913162]  [<ee04134a>] journal_commit_transaction+0x7dc/0xcae [jbd]
[2423152.913180]  [<c0126799>] lock_timer_base+0x19/0x35
[2423152.913191]  [<ee044054>] kjournald+0xbc/0x225 [jbd]
[2423152.913204]  [<c012ec98>] autoremove_wake_function+0x0/0x2d
[2423152.913211]  [<ee043f98>] kjournald+0x0/0x225 [jbd]
[2423152.913224]  [<c012ebd5>] kthread+0x38/0x5f
[2423152.913231]  [<c012eb9d>] kthread+0x0/0x5f
[2423152.913238]  [<c010425f>] kernel_thread_helper+0x7/0x10
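
For anyone who wants to count these warnings across the domU consoles, here is a minimal sketch. It assumes the console output has been saved as text; the function name find_hung_tasks is hypothetical, and the pattern is modelled only on the message format shown above.

```python
import re

# Pattern modelled on the hung-task warning quoted above.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

def find_hung_tasks(console_text):
    """Return (task name, pid, seconds) for each hung-task warning found."""
    return [
        (m.group("name"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(console_text)
    ]

if __name__ == "__main__":
    sample = ("[2423152.912986] INFO: task kjournald:506 "
              "blocked for more than 120 seconds.")
    print(find_hung_tasks(sample))
```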

Meanwhile the resync on md3 is pretty much stuck:

md3 : active raid1 sda6[0] sdb6[3] sdc6[2] sdd6[1]
      955128384 blocks [4/4] [UUUU]
      [============>........]  check = 60.0% (573359936/955128384) finish=1258443.5min speed=5K/sec
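
As a rough sanity check on that finish estimate, the arithmetic can be sketched as follows. This assumes the block counts are in 1 KiB units and that speed is in KiB per second, which is how I read /proc/mdstat; the helper name parse_progress is hypothetical, and the regular expression is modelled only on the line above.

```python
import re

# Matches the progress line of an md check/resync as quoted above.
# \s* between fields lets it cope with the line being wrapped.
PROGRESS_RE = re.compile(
    r"(?P<op>check|resync|recovery)\s*=\s*(?P<pct>[\d.]+)%\s*"
    r"\((?P<done>\d+)/(?P<total>\d+)\)\s*"
    r"finish=(?P<finish>[\d.]+)min\s+speed=(?P<speed>\d+)K/sec"
)

def parse_progress(mdstat_text):
    """Return progress figures plus a recomputed time estimate, or None."""
    m = PROGRESS_RE.search(mdstat_text)
    if m is None:
        return None
    done = int(m.group("done"))
    total = int(m.group("total"))
    speed = int(m.group("speed"))          # assumed KiB/sec
    remaining = total - done               # assumed 1 KiB blocks
    # Remaining work divided by current speed, converted to minutes.
    est_min = remaining / speed / 60 if speed else float("inf")
    return {
        "operation": m.group("op"),
        "remaining_kib": remaining,
        "reported_finish_min": float(m.group("finish")),
        "estimated_finish_min": est_min,
    }

if __name__ == "__main__":
    line = ("[============>........]  check = 60.0% (573359936/955128384)\n"
            "finish=1258443.5min speed=5K/sec")
    print(parse_progress(line))
```

With the figures above, the remaining 381768448 blocks at 5K/sec work out to roughly 1.27 million minutes, the same order as the reported finish time. The small difference is expected, since the displayed speed is rounded.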

I tried killing the backup rsync, but without success so far (it ignores even
kill -9).
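
A process that ignores SIGKILL is usually stuck in uninterruptible sleep (state D, as kjournald is in the trace above), waiting on IO. A minimal sketch for checking that on dom0, assuming a Linux /proc filesystem; both function names are hypothetical.

```python
def state_from_stat(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' rather than on whitespace.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]

def is_uninterruptible(pid):
    """True if the process is in state D (uninterruptible sleep)."""
    with open(f"/proc/{pid}/stat") as f:
        return state_from_stat(f.read()) == "D"

if __name__ == "__main__":
    import os
    # Demonstrate on the current process, which should not be in state D.
    print(is_uninterruptible(os.getpid()))
```

If the rsync turns out to be in state D, no signal will be delivered until the IO it is waiting on completes.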

It feels like something is deadlocked: the lvs command does not return either
(or perhaps it takes more than 10 minutes).

I tried to bring down the vcs-noshell and builder domUs gracefully with xm
shutdown, but they are too locked up to respond to that.

Unless you have a better idea, I think the best course of action would be to
reboot colonialone - we may have to xm destroy the running domUs first. I'm
a little worried about the lvm snapshot and potential filesystem corruption
from shutting down the domUs uncleanly.

Restarting would also bring colonialone and its domUs up to the -21 kernel
packages, which fix a nasty dom0 kernel panic under heavy IO (we suffered from
that on another server).

What do you think?

Thanks,
Ward.

-- 
Ward Vandewege <address@hidden>
Free Software Foundation - Senior Systems Administrator



