Re: Storage Array Problems
Mon, 1 Mar 2021 13:12:52 -0700
Protesilaos Stavrou wrote:
> "Basil L. Contovounesios" wrote:
> > Bob Proulx writes:
> >> Are all of those messages yours? They all have the same unique string
> >> pattern.
> > This pattern is generated by an Emacs MUA. The @tcd.ie ones are mine,
> > and the @protesilaos.com ones are Prot's (CCed). I think I received the
> > messages locally, but they're clearly missing from
> > https://bugs.gnu.org/45068 and possibly other places too. Should I just
> > resend the missing messages?
> Hello! I noticed that they were missing, but assumed that the sync
> takes some time.
> Please re-send them or tell me how I can do it from here.
When I was provided with a message-id by Lars for one of his missing
messages I was able to grep around and find that message and the
others in the logs. The logs said those message-ids had been
discarded. That's all I know. Sorry.
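For anyone wanting to repeat that kind of search, here is a minimal
sketch in Python. The log lines, their format, and the example.org
message-ids below are all made up for illustration; the real log
format on the list servers is not shown in this thread, and the helper
name find_message_id is hypothetical.

```python
def find_message_id(log_lines, message_id):
    """Return (line_number, line) pairs for lines that mention message_id."""
    # Accept the id with or without the surrounding angle brackets.
    needle = message_id.strip().strip("<>")
    hits = []
    for number, line in enumerate(log_lines, start=1):
        if needle in line:
            hits.append((number, line.rstrip("\n")))
    return hits


# Hypothetical log excerpt, NOT the real format of any Mailman or MTA log.
sample_log = [
    "Feb 28 10:00:01 post: <aaa.111@example.org> accepted\n",
    "Feb 28 10:00:02 discard: <bbb.222@example.org> discarded\n",
    "Feb 28 10:00:03 post: <ccc.333@example.org> accepted\n",
]

for number, line in find_message_id(sample_log, "<bbb.222@example.org>"):
    print(number, line)
```

With a real log one would pass an open file object instead of the
sample list, since iterating over a file also yields lines.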
That group of message-ids all together just stood out as unusual to
my eye and so I mentioned it. I don't know if there is a systematic
failure that needs to be fixed or if it was simply human error due to
the system problems and the large spam backlog.
One of the contributing factors may have been related to the storage
array problems yesterday. When a system can't read or write files the
process trying to do so gets "blocked waiting for I/O" and pauses in
an uninterruptible wait state. (On Linux a ps listing shows this
uninterruptible state as the "D" state.) Since most OS functions get
cached in the file system buffer cache in RAM the OS on most systems
was still able to function at some level. As far as I know none of
the systems outright crashed. But these processes blocked waiting for
I/O from the networked storage server did pile up. I saw that
fencepost had a system load of more than 1100!
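On Linux the load average counts processes in uninterruptible sleep
as well as runnable ones, which is how blocked I/O alone can push the
number that high. As a rough sketch, assuming a Linux /proc file
system, the following Python counts processes per state by reading the
state field of /proc/<pid>/stat. The function names are made up for
this example and it is not what was actually run on fencepost.

```python
import os
from collections import Counter


def parse_state(stat_text):
    """Return the one-letter state field from the text of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so take the first field after the LAST ")".
    rest = stat_text[stat_text.rindex(")") + 1:]
    return rest.split()[0]


def count_states(stat_texts):
    """Count how many processes are in each state, e.g. {"S": 180, "D": 3}."""
    return Counter(parse_state(text) for text in stat_texts)


def read_stat_texts(proc_root="/proc"):
    """Yield the contents of every /proc/<pid>/stat that can be read."""
    if not os.path.isdir(proc_root):
        return  # not a Linux-style /proc; yield nothing
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            path = os.path.join(proc_root, name, "stat")
            with open(path, errors="replace") as handle:
                yield handle.read()
        except OSError:
            continue  # the process exited while we were scanning


if __name__ == "__main__":
    states = count_states(read_stat_texts())
    print("processes in D state:", states.get("D", 0))
```

On a healthy system the D count is usually zero or close to it. A
count in the hundreds would match the kind of pile-up described above.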
The FSF admins worked almost all day long Sunday morning through late
afternoon to restore the storage array. As you can imagine it was a
high stress situation for them. Meanwhile after the initial couple of
hours the rest of the systems were mostly restored to normal operation
and they were able to drain down their high load averages. Those
uninterruptible processes completed the I/O reads and writes on which
they were blocked and were able to exit. However some processes have
timeouts, and after being blocked for a long time those would have
timed out and been killed for taking too long to complete.
The large mail backlog that occurred yesterday meant that humans
looking at the Mailman web page hold queue were looking at dozens and
dozens of messages, most of which were spam because the anti-spam
"cancel bot" was also backlogged. That's almost the worst case for a
human looking at mail messages and trying to pick out the non-spam
messages from the sea of spam. But I really have no idea about any
particular message and am just guessing.
I also don't know the deep details of the storage array problems.
Perhaps the FSF admins will write up a blog note about it. That would
be interesting to me. From what I could tell there was a coupled
failure of multiple controller nodes causing the array to lose
redundancy. At least one of the arrays went offline completely. They
had to carefully reset and restore the redundancy quorum of the disk
storage and the controller nodes. Other than the initial hour when
things were completely offline the subsequent restoration was all done
online, with the system running in a degraded RAID mode. Which is
pretty cool when you think of it!
> [ I am using Emacs+Gnus and this setup has been stable for a fairly long time
Emacs+Gnus worked great. No problems there at all. The only reason
that Emacs+Gnus got mentioned was that it created a message-id format
that I did not recognize and therefore asked if those were all from
Lars. Basil told me those were from Emacs. Which is great. No
problems there at all.