I believe I have located the major source of instability, but unfortunately the workaround comes at a cost to performance.
I have an Nvidia 210 video card in this machine for the console. It’s a very low-end card but adequate for that purpose. In 2019, however, Nvidia discontinued driver support, so I had to switch to the Linux nouveau driver, which, given the relatively low performance of the card, was not a big deal.
Well, recent Linux kernels have a bug in the driver for this card that results in the card DMA’ing into memory it has not allocated, and when that memory happens to be in use by something else, the machine crashes.
But as it happens, Nvidia has again decided to support that card. However, the drivers, now at 340.108, are not compatible with newer kernels, so I was forced to go back to 5.4.0, which is considerably less efficient than 5.7.
The 5.7.7 kernel was still unstable, and so was 5.8rc3, but at least the latter logged some information showing that some memory allocations failed with the contiguous memory allocator, a new feature recently introduced into the Linux kernel.
I am building a new kernel with that disabled. It really isn’t required, since there are no huge streaming I/O devices like video that might need it, and almost everything can DMA through the MMU on this particular machine (which can map disparate memory regions into contiguous memory). If the machine does not spontaneously boot into the new kernel, I will boot it this evening.
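Before booting a rebuilt kernel, it can be worth confirming that the option really is off in the generated config. The following is a minimal sketch, not the procedure actually used here. It assumes the usual kernel config file format and that the allocator is controlled by the `CONFIG_CMA` symbol. The function name `kconfig_state` and the sample text are made up for illustration.

```python
import re


def kconfig_state(config_text, option):
    """Return the value assigned to a kernel config option ('y', 'm', or a
    literal value), or None if the option is unset or absent.

    Hypothetical helper: it only understands plain OPTION=value lines.
    """
    # Require '=' directly after the name so CONFIG_CMA does not match
    # longer symbols such as CONFIG_CMA_AREAS.
    assignment = re.compile(r"^" + re.escape(option) + r"=(.*)$")
    for line in config_text.splitlines():
        match = assignment.match(line.strip())
        if match:
            return match.group(1)
    # Lines like "# CONFIG_CMA is not set" are comments, so they end up here.
    return None


# Illustrative config excerpt (an assumption, not taken from the real machine).
sample = """\
CONFIG_DMA_ENGINE=y
# CONFIG_CMA is not set
CONFIG_HUGETLBFS=y
"""

print(kconfig_state(sample, "CONFIG_CMA"))  # None means disabled or absent
```

On a real system the text would come from the `.config` in the build tree, or from wherever the distribution installs the config for the running kernel, if it does so at all.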
There is also the possibility of hardware errors, but so far the machine has not logged any.
Iglulik spontaneously rebooted again tonight. This time, on 5.7.7, it made it four days between spontaneous boots, but I discovered what triggered it, so I’ve filed a bug report with bugzilla.kernel.org, and I’m going to give 5.8pre4 a try if it proves semi-stable on my workstation. I normally avoid pre-release kernels, but 5.7 has been buggy and so far 5.8pre3 has been totally stable on my workstation.
One of our servers has been unstable on 5.7.6 and rebooted spontaneously twice in the last few days. Oddly, only this server seems to be affected, but it has a newer CPU than the others, so it may be a kernel problem specific to this CPU.
I am going to reboot into 5.7.7 tonight if it hasn’t spontaneously booted into it on its own between now and then. This will happen just after midnight.
This machine serves the web, /home directories, and several shell servers. Because basically everything relies on /home, everything will be briefly interrupted shortly after midnight, except the virtual private servers, which will not be affected.
If you are not on Mint, Debian, or Ubuntu, you should just see things lock up briefly. If you are on one of these servers, you will be disconnected and will need to re-establish your connection after the boot completes.
I am discontinuing the opensuse.eskimo.com shell server because users have been unable to authenticate for several months, owing to a broken library in OpenSuse that incorrectly attempts to originate connections to ypserv from an unprivileged port.
I filed a bug report with Suse several months ago, but nothing has come of it in the way of a fix.
I attempted to log in to the bug reporting system but was informed that authentication methods had changed and I would have to convert my password. I’ve tried to do that many times, but their system keeps telling me the system module is down.
Obviously maintenance is not happening there anymore, so I’m going to abandon OpenSuse and take suggestions for a new Linux distro to replace it. Given the choice, I prefer distros based on .deb packages over RPMs; the former has much less tendency to scramble its database and have problems with dependencies.
Iglulik spontaneously booted today.
NFS partitions did not properly remount on one mail server. This may result in some spam not being properly filtered, and in mail that should have gone to spam and/or other folders being placed in your INBOX instead. I apologize for this inconvenience.
I have modified the systemd unit files for postfix to make these mounts a requirement for postfix to start, but unfortunately there is no provision to kill postfix if they go away. I may be able to script something if I can find a way to check a mount point without the check itself hanging.
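One way to probe a mount point without risking a hang in the checking script is to do the filesystem call in a throwaway child process and give up on it after a deadline. The following is a minimal sketch under that assumption, not the script actually deployed here. The name `mount_responds`, the 5 second default, and the polling interval are all made up for illustration.

```python
import subprocess
import sys
import time

# Child program: one statvfs() call, which is the step that may block
# indefinitely if the path sits on an unresponsive NFS mount.
_PROBE = "import os, sys; os.statvfs(sys.argv[1])"


def mount_responds(path, timeout=5.0):
    """Return True if a statvfs() on `path` succeeds within `timeout` seconds.

    Hypothetical helper: the blocking call happens in a child process, so the
    caller only ever polls and sleeps.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", _PROBE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        returncode = proc.poll()  # non-blocking check on the child
        if returncode is not None:
            return returncode == 0
        time.sleep(0.05)
    # Best effort only. There is deliberately no wait() here, because a child
    # stuck in the kernel might not exit promptly and wait() would then block.
    proc.kill()
    return False


if __name__ == "__main__":
    print(mount_responds("/"))
```

Note that this only shows that the path answers in time. It does not prove that the NFS filesystem, rather than the empty directory underneath it, is what is mounted there, so a real watchdog would need an additional check for that before deciding to stop postfix.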
5.7.6 appears to have resolved the NFS issues that appeared in 5.7.0 through 5.7.4 and is now the active kernel on our server and Debian-based systems.
I am going to take the mail subsystem, as well as the vps1-7 virtual private servers, down for about 1/2 hour to troubleshoot a kernel problem. I had planned this for yesterday, but problems building kernels delayed it.
This evening I will be booting ice into a new kernel to test NFS.
I am working with a Linux kernel developer to debug some kernel NFS server issues that are new to the 5.7.x kernels. I plan to boot into the new kernel around 11pm. If NFS does not work, things may be interrupted for some time while I gather information that may be helpful to the kernel developers before reverting to a known good kernel.
This will affect mail and also vps1 through vps7, but not higher numbered virtual machines. The virtual machines will only be down for minutes during the boot. Mail will be unavailable for a longer period, as it is what is affected by the NFS problems, and it may take some time to gather all of the information needed to allow a fix to be developed.
Between 11PM Sunday June 21st and 2AM Monday June 22nd:
vps2, vps3, vps4, vps5, and vps7 about 1/2 hour each for imaging.
scientific7, uucp, and mx1 about 1/2 hour for scientific7 and about 15 minutes for the other two. mx1 will not be service affecting, as mx2 will handle the traffic while it is down. You can use centos7 as an alternative to scientific7 during its downtime; they are essentially the same code base.
mint, ubuntu about an hour each, ftp/web about 1/2 hour.
Mint and Ubuntu have a similar code base to Debian, Julinux, and Zorin; I suggest using one of those as an alternative during this work.