From: Bob Proulx
Subject: [Savannah-hackers-public] git, svn, cvs, Outage Postmortem 2019-09-09
Date: Mon, 9 Sep 2019 21:19:33 -0600
User-agent: NeoMutt/20170113 (1.7.2)

On Thursday Sept 5 at about 11am the Ansible configuration management tool installed the latest Trisquel Linux kernel security upgrade on nfs1, our NFS server for the main storage array. Installed but not booted and therefore not yet active. On Friday I received the normal notification that there was a new kernel installed on it and therefore it would eventually need to be rebooted. But Friday I was mostly offline and had no time for it. And of course over the weekend there is no FSF admin support in case there are problems. Therefore the reboot slid until today.

Today around 12:30 US/Mountain time I rebooted nfs1 for the new kernel and the new systemd packages. The reboot initially appeared to complete successfully. But then download0, rebooted for the same upgrades, failed to NFS mount one of the two partitions mounted from nfs1. And vcs0 also started reporting stale NFS mounts.

We started debugging the problem immediately. It took a while before I started to realize that the problem was the kernel, because initially it looked like a network connectivity problem. Looked like IPv4 failing but IPv6 working. Looked like a firewall blocking the mount handshake. Looked like other things. Very strange was that one of the two mount points on download0 would usually mount okay but the other would time out. Very bizarre!

I chased down those dead ends before I decided that it must be the kernel and that I should reboot back to the previously installed and previously working one. I had already rebooted with the new kernel multiple times yet it still had these weird failures. I became very happy when booting back to the old kernel returned things to sanity.

The problem appears to be in the new kernel.

  ii  linux-image-unsigned-4.4.0-161-generic  4.4.0-161.189+8.0trisquel2  amd64  Linux-libre kernel image for version 4.4.0
  ii  linux-modules-4.4.0-161-generic         4.4.0-161.189+8.0trisquel2  amd64  Linux-libre kernel extra modules for version 4.4.0

I have marked the working previous kernel as held so as to prevent it being removed in a future upgrade.

  hi  linux-image-unsigned-4.4.0-159-generic  4.4.0-159.187+8.0trisquel2  amd64  Linux-libre kernel image for version 4.4.0
  hi  linux-modules-4.4.0-159-generic         4.4.0-159.187+8.0trisquel2  amd64  Linux-libre kernel extra modules for version 4.4.0

And that is all I know. Things are back working as before using the previously installed and running kernel. I filed a ticket with the FSF RT system about the issue as it concerns our systems.

Bob