Reboots Applied After Network Outage

tech-volunteer-meeting

From:	Bob Proulx
Subject:	Reboots Applied After Network Outage
Date:	Tue, 1 Feb 2022 17:11:11 -0700

The network outage this morning separated the VMs from their root file
systems on the Ceph partition for a little over an hour.  Some systems
seemed to recover better than others.  Thanks to Amin Bandali for
rebooting vcs2, which was really broken this morning.

The details look like the kernel log snippet included at the bottom of
this message.  Failures of this type appeared repeatedly from the
start of the network event until it ended around 9:38am US/Eastern.
If any VM shows this type of problem then it's not good.  It's hard to
predict the state of the running processes afterward.  Also, some of
the systems didn't log anything because the log directory is on the
same root file system and couldn't be written.
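
As a rough way to check a VM for this type of problem, here is a
minimal Python sketch that counts the kernel's hung-task warnings in
saved log lines.  It assumes syslog-style lines like the snippet at the
bottom of this message.  The function name find_hung_tasks is
hypothetical, and the sketch can only report what actually reached the
log, so a clean result on a system that couldn't write its log proves
nothing.

```python
import re
from collections import Counter

# Matches the start of the kernel's hung-task warning, for example
# "INFO: task monit:664 blocked for more than 120 seconds."
# Only the "INFO: task NAME:PID" part is required, so the match still
# works when mail wrapping has pushed "blocked for ..." to the next line.
HUNG_TASK = re.compile(r"INFO: task (?P<name>\S+):(?P<pid>\d+)\b")


def find_hung_tasks(lines):
    """Count hung-task warnings per task name in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts
```

It could be fed an open log file, for example
find_hung_tasks(open(path, errors="replace")), where path is wherever
the kernel log is kept on that system.  Any non-empty result for the
outage window suggests the VM was affected and is a candidate for a
reboot.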

We saw something almost exactly like this on January 5th.  Some of the
daemons active at the time were killed by it, which isn't surprising.
Things weren't happy.  All of the systems were rebooted then for the
same reason: to reset everything to a known state.

I suggest that all VM owners reboot their VMs in order to ensure that
they have no problems.  I have rebooted all of the systems I can, 16
in total, and they are all back in a good happy state.

Bob


Feb  1 08:16:57 vcs2 kernel: [3513312.220439] INFO: task xfsaild/dm-0:341 blocked for more than 120 seconds.
Feb  1 08:16:57 vcs2 kernel: [3513312.220443]       Not tainted 4.15.0-161-generic #169+9.0trisquel9
Feb  1 08:16:57 vcs2 kernel: [3513312.220444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  1 08:16:57 vcs2 kernel: [3513312.220446] xfsaild/dm-0    D    0   341      2 0x80000000
Feb  1 08:16:57 vcs2 kernel: [3513312.220450] Call Trace:
Feb  1 08:16:57 vcs2 kernel: [3513312.220509]  __schedule+0x24e/0x890
Feb  1 08:16:57 vcs2 kernel: [3513312.220536]  ? lock_timer_base+0x6b/0x90
Feb  1 08:16:57 vcs2 kernel: [3513312.220538]  schedule+0x2c/0x80
Feb  1 08:16:57 vcs2 kernel: [3513312.220754]  _xfs_log_force+0x159/0x2a0 [xfs]
Feb  1 08:16:57 vcs2 kernel: [3513312.220772]  ? wake_up_q+0x80/0x80
Feb  1 08:16:57 vcs2 kernel: [3513312.220807]  ? xfsaild+0x1b6/0x7e0 [xfs]
Feb  1 08:16:57 vcs2 kernel: [3513312.220841]  xfs_log_force+0x2c/0x80 [xfs]
Feb  1 08:16:57 vcs2 kernel: [3513312.220876]  xfsaild+0x1b6/0x7e0 [xfs]
Feb  1 08:16:57 vcs2 kernel: [3513312.220884]  kthread+0x121/0x140
Feb  1 08:16:57 vcs2 kernel: [3513312.220887]  ? kthread+0x121/0x140
Feb  1 08:16:57 vcs2 kernel: [3513312.220921]  ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Feb  1 08:16:57 vcs2 kernel: [3513312.220923]  ? kthread_create_worker_on_cpu+0x70/0x70
Feb  1 08:16:57 vcs2 kernel: [3513312.220926]  ret_from_fork+0x22/0x40
Feb  1 08:16:57 vcs2 kernel: [3513312.220938] INFO: task monit:664 blocked for more than 120 seconds.
Feb  1 08:16:57 vcs2 kernel: [3513312.220939]       Not tainted 4.15.0-161-generic #169+9.0trisquel9
Feb  1 08:16:57 vcs2 kernel: [3513312.220940] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb  1 08:16:57 vcs2 kernel: [3513312.220941] monit           D    0   664      1 0x00000000
Feb  1 08:16:57 vcs2 kernel: [3513312.220943] Call Trace:
Feb  1 08:16:57 vcs2 kernel: [3513312.220947]  __schedule+0x24e/0x890
Feb  1 08:16:57 vcs2 kernel: [3513312.220961]  ? __filemap_fdatawrite_range+0xcf/0x100
Feb  1 08:16:57 vcs2 kernel: [3513312.220964]  schedule+0x2c/0x80
Feb  1 08:16:57 vcs2 kernel: [3513312.220998]  _xfs_log_force_lsn+0x2cf/0x350 [xfs]
Feb  1 08:16:57 vcs2 kernel: [3513312.221000]  ? wake_up_q+0x80/0x80
Feb  1 08:16:57 vcs2 kernel: [3513312.221033]  xfs_file_fsync+0xfd/0x230 [xfs]
Feb  1 08:16:57 vcs2 kernel: [3513312.221055]  vfs_fsync_range+0x51/0xb0
Feb  1 08:16:57 vcs2 kernel: [3513312.221057]  do_fsync+0x3d/0x70
Feb  1 08:16:57 vcs2 kernel: [3513312.221060]  SyS_fsync+0x10/0x20
Feb  1 08:16:57 vcs2 kernel: [3513312.221075]  do_syscall_64+0x73/0x130
Feb  1 08:16:57 vcs2 kernel: [3513312.221077]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
Feb  1 08:16:57 vcs2 kernel: [3513312.221080] RIP: 0033:0x7f662e5a9bf7
Feb  1 08:16:57 vcs2 kernel: [3513312.221081] RSP: 002b:00007ffc255d2e00 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Feb  1 08:16:57 vcs2 kernel: [3513312.221082] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f662e5a9bf7
Feb  1 08:16:57 vcs2 kernel: [3513312.221084] RDX: 0000000000000000 RSI: 00007ffc255d2e30 RDI: 0000000000000004
Feb  1 08:16:57 vcs2 kernel: [3513312.221085] RBP: 00007ffc255d2e30 R08: 0000000000000000 R09: 0000000000000004
Feb  1 08:16:57 vcs2 kernel: [3513312.221086] R10: 00000000fffffffc R11: 0000000000000293 R12: 00005632626b7b68
Feb  1 08:16:57 vcs2 kernel: [3513312.221087] R13: 0000000000000000 R14: 00007ffc255d2f74 R15: 0000000000000000
Feb  1 08:18:00 vcs2 kernel: [3513375.708341] nfs: server nfs1 not responding, still trying
Feb  1 08:18:09 vcs2 kernel: [3513384.156205] nfs: server nfs1 not responding, still trying
