From: Maxim Cournoyer
Subject: branch master updated: doc: Add a Problems/solutions knowledge base section.
Date: Thu, 10 Nov 2022 21:09:27 -0500
This is an automated email from the git hooks/post-receive script.
apteryx pushed a commit to branch master
in repository maintenance.
The following commit(s) were added to refs/heads/master by this push:
new eee43c5 doc: Add a Problems/solutions knowledge base section.
eee43c5 is described below
commit eee43c569c1a87ee3cf9991c649cec1d2522c04f
Author: Maxim Cournoyer <maxim.cournoyer@gmail.com>
AuthorDate: Thu Nov 10 21:01:23 2022 -0500
doc: Add a Problems/solutions knowledge base section.
* doc/infra-handbook.org (Specifications): Mention the QLogic
adapters.
(Btrfs compression and mount options): Use 'compress' instead of
'compress-force', as the latter can cause too many file extents, which
in turn translate into a slow mount for a very large file system.
(Problems/solutions knowledge base): New section.
---
doc/infra-handbook.org | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 47 insertions(+), 1 deletion(-)
diff --git a/doc/infra-handbook.org b/doc/infra-handbook.org
index 956173d..0a67037 100644
--- a/doc/infra-handbook.org
+++ b/doc/infra-handbook.org
@@ -22,6 +22,7 @@ Dell PowerEdge R7425 server with the following specifications:
- 2x AMD EPYC 7451 24-Core processors
- Storage Area Network (SAN) of 100 TiB
+- SAN connected to two QLogic QLE2692 16G Fibre Channel adapters (qla2xxx)
- 188 GiB of memory
The machine can be remotely administered via iDRAC, the Dell server
@@ -162,7 +163,7 @@ file:../hydra/deploy-node-129.scm for a build machine when high
availability is preferred over data safety (degraded):
#+begin_src scheme
-(define %common-btrfs-options '(("compress-force" . "zstd")
+(define %common-btrfs-options '(("compress" . "zstd")
("space_cache" . "v2")
"degraded"))
#+end_src
@@ -191,3 +192,48 @@ file:../hydra/deploy-node-129.scm machine configuration:
"balance" "start" "-dusage=5" "/"))
"btrfs-balance"))
#+end_src
+
+* Problems/solutions knowledge base
+** The boot fails with a kernel panic on qla2xxx-related errors
+Here's an example:
+#+begin_example
+[ 51.266790] Call Trace:
+[ 51.266792] <TASK>
+[ 51.266794] _raw_spin_lock_irqsave+0x46/0x60
+[ 51.266799] qla2xxx_dif_start_scsi_mq+0x2b7/0xe60 [qla2xxx 124f4fec4ef588623af420625c6af8b5bcce53fd]
+[ 51.266823] qla2xxx_mqueuecommand+0x222/0x2d0 [qla2xxx 124f4fec4ef588623af420625c6af8b5bcce53fd]
+[ 51.266838] qla2xxx_queuecommand+0x1a1/0x3d0 [qla2xxx 124f4fec4ef588623af420625c6af8b5bcce53fd]
+[ 51.266852] scsi_queue_rq+0x390/0xc00
+[ 51.266857] __blk_mq_try_issue_directly+0x176/0x1e0
+[ 51.266861] blk_mq_plug_issue_direct.constprop.0+0x93/0x180
+[ 51.266865] blk_mq_flush_plug_list+0x23d/0x2a0
+[ 51.266868] __blk_flush_plug+0xed/0x130
+[ 51.266872] blk_finish_plug+0x31/0x50
+[ 51.266874] read_pages+0x1f5/0x300
+[ 51.266879] page_cache_ra_unbounded+0x131/0x180
+[ 51.266882] force_page_cache_ra+0xc7/0x100
+[ 51.266885] page_cache_sync_ra+0x34/0x90
+[ 51.266887] filemap_get_pages+0x127/0x700
+[ 51.266893] filemap_read+0xde/0x420
+[ 51.266898] blkdev_read_iter+0xbd/0x1e0
+[ 51.266901] new_sync_read+0x13e/0x1c0
+[ 51.266905] vfs_read+0x151/0x1a0
+[ 51.266908] ksys_read+0x73/0xf0
+[ 51.266911] __x64_sys_read+0x1e/0x30
+[ 51.266913] do_syscall_64+0x60/0xc0
+[ 51.266919] entry_SYSCALL_64_after_hwframe+0x63/0xcd
+[ 51.266922] RIP: 0033:0x4e73de
+[ 51.266924] Code: 0f 1f 40 00 48 c7 c2 bc ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb ba 0f 1f 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
+[ 51.266926] RSP: 002b:00007ffc403f39e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
+[ 51.266928] RAX: ffffffffffffffda RBX: 0000000001a98738 RCX: 00000000004e73de
+[ 51.266929] RDX: 0000000000000100 RSI: 0000000001a98748 RDI: 0000000000000006
+[ 51.266930] RBP: 0000000001a51bc0 R08: 0000000001a98720 R09: 0000000001a3ef10
+[ 51.266932] R10: 0000000000000007 R11: 0000000000000246 R12: 000009ffffffe000
+[ 51.266933] R13: 0000000000000100 R14: 0000000001a98720 R15: 0000000001a51c10
+[ 51.266936] </TASK>
+[ 54.246148] NMI watchdog: Watchdog detected hard LOCKUP on cpu 64
+#+end_example
+Solution: Stop the server, update the firmware of the QLogic cards,
+then start the server.  The exact cause of the failure is unknown, but
+the QLogic cards' firmware may become incompatible with that of the
+SAN, which is always kept up to date.
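
To check whether a captured boot log shows the failure described in the new knowledge base entry, a small script along these lines may help. This is only a sketch and not part of the commit: the function name summarize_trace and its return format are hypothetical, and it assumes call-trace frames look like the ones in the example above (symbol+offset/size, optionally followed by the module name in brackets).

```python
import re

# Matches a call-trace frame such as
#   [ 51.266799] qla2xxx_dif_start_scsi_mq+0x2b7/0xe60 [qla2xxx 124f...]
# The module part is optional: built-in symbols have no "[module ...]" suffix.
FRAME_RE = re.compile(
    r"\]\s+(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+))?"
)

# Matches the hard-lockup message printed by the NMI watchdog.
LOCKUP_RE = re.compile(
    r"NMI watchdog: Watchdog detected hard LOCKUP on cpu (\d+)"
)


def summarize_trace(log_text, module="qla2xxx"):
    """Return the frames that belong to MODULE and the CPUs that locked up.

    Hypothetical helper: the name and return format are illustrative only.
    """
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match and match.group("module") == module:
            frames.append(match.group("symbol"))
    cpus = [int(cpu) for cpu in LOCKUP_RE.findall(log_text)]
    return {"module_frames": frames, "lockup_cpus": cpus}
```

A log that yields both a non-empty "module_frames" list and at least one entry in "lockup_cpus" resembles the example above. That only suggests the firmware update is worth trying; it does not confirm the root cause, which the handbook itself describes as unknown.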