I’ve been experiencing some perplexing and frustrating issues with my server, and need some advice from those more knowledgeable than me.
Recently I decided to upgrade my Raspberry Pi server, and I found a good deal on an HP Elite Mini 600 G9 on eBay, so I took the plunge. It’s got an Intel Core i5-12500T and came with 8 GB of RAM and a 256 GB SSD. I bumped it up to 32 GB of RAM and added a 4 TB SSD. It came with Windows installed, but I installed Debian on it.
With the basics taken care of, I got set up with my couple of Docker containers (if it matters: Caddy, Actual Budget, Immich, Prometheus, Grafana). But ever since then, any time a CPU-heavy process runs, the whole machine freezes and stays frozen (I’ve tried letting it go to see if it recovers, but it stays frozen for days), and I am forced to physically power it down. I tried to isolate it, thinking it was one of the Docker containers, but it happened with Immich, Prometheus, and Grafana individually, as well as with a Borg backup running directly on the machine. When I power it back on after one of these freezes, there are no system logs at all from the period of the freeze, so I can’t learn anything from them about the cause.
Anyone have any ideas what the issue could be or even where to look? I’m starting to think it’s a hardware problem but I’m not sure and I don’t know what my next step should be.
I had a similar issue on an HP EliteDesk 800 Mini G3. It turned out to be a faulty RAM stick. Memtest revealed the problem, and after unplugging the defective stick the issue was resolved.
Thank you!!! I think this is the issue. With one of the RAM sticks in, memtest runs just fine. With the other, it never finishes and the machine just reboots.
Definitely suspect.
You should be able to let memtest run for days with no problems, so a reboot points to either a faulty stick or possibly a faulty motherboard slot.
Swap the RAM between slots to isolate the root cause: if the failure follows the stick, it’s the stick; if it stays with the slot, it’s the board.
Memtest is the easiest and best place to start.
CPU-heavy process
Sounds to me like a hardware issue: you’re overheating. Find a way to monitor your temps. I’m not sure how to do this on Linux, so I’m open to suggestions too.
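One option on Linux, as a rough sketch: the kernel usually exposes temperature sensors under /sys/class/thermal, with values in millidegrees Celsius. The Python below just reads whatever zones are there. The function name is made up for illustration, and which zones show up (for example x86_pkg_temp on Intel) depends on the machine, so treat the labels as an assumption. The `sensors` command from the lm-sensors package gives similar readings without any code.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"thermal_zoneN:type": temp_celsius} for every readable zone."""
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            name = (zone / "type").read_text().strip()
            # The kernel reports millidegrees Celsius, e.g. 45000 for 45.0 C.
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            # Some zones exist but cannot be read; skip them.
            continue
        readings[f"{zone.name}:{name}"] = millideg / 1000.0
    return readings


if __name__ == "__main__":
    for label, temp in read_thermal_zones().items():
        print(f"{label}: {temp:.1f} C")
```

Run it once at idle and again while something heavy is going, and compare the numbers.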
This was my first thought too.
Leave the console visible on an attached monitor. I don’t recall if Debian out-of-box has Ctrl-Alt-F1 disabled, but if not, that’ll put you on the first console. If the kernel panics, it’ll display something there.
If you can’t do that — no spare monitor — you can set up a serial port console to another machine. I don’t know off the top of my head how to have the kernel emit errors there by default if it’s not the default, but I’m quite sure that it’s possible; I’ve debugged machines with kernel stack traces on serial port consoles. Sending a BREAK was equivalent to Magic Sysrq, as I recall.
In addition to the other suggestions about checking the RAM sticks, do you have resource limits on your containers? They’re generally a good thing to have anyway, but I’d set them up after checking the RAM and cooling situation. Check your CPU temps as well.
RAM issue or CPU overheating. Monitor the CPU temperature over time, and run an extensive memtest, like for an entire night…
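For the resource limits, a minimal sketch, assuming the containers are plain Docker containers: `docker update` accepts `--cpus`, `--memory`, and `--memory-swap`, so a small helper can build the command. The helper name, the container name, and the numbers are all hypothetical; pick values that fit the box.

```python
import shlex


def docker_update_cmd(container, cpus, memory):
    """Build a `docker update` command that caps CPU and RAM for one container."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return [
        "docker", "update",
        "--cpus", str(cpus),
        "--memory", memory,
        # Same value as --memory, so the container gets no extra swap on top.
        "--memory-swap", memory,
        container,
    ]


if __name__ == "__main__":
    # "immich_server" is a placeholder name, not taken from the original post.
    print(shlex.join(docker_update_cmd("immich_server", 2.0, "4g")))
```

This only prints the command so it can be checked before running it. Limits set this way apply to the existing container; a recreated container will not keep them, so a permanent setup belongs in whatever defines the container.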
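Since the logs come up empty after a freeze, one way to monitor over time is to write each reading to disk and force it out with fsync, so the last line before the hang has a better chance of surviving. This is only a sketch under assumptions: it reads the generic /sys/class/thermal zones, the function names and the log file name are made up, and fsync improves the odds but does not guarantee the final line lands.

```python
import os
import time
from pathlib import Path


def hottest_zone_celsius(base="/sys/class/thermal"):
    """Return the highest thermal zone reading in Celsius, or None if none are readable."""
    temps = []
    for temp_file in Path(base).glob("thermal_zone*/temp"):
        try:
            temps.append(int(temp_file.read_text().strip()) / 1000.0)
        except (OSError, ValueError):
            continue
    return max(temps) if temps else None


def append_durably(log_path, line):
    """Append one line and ask the OS to push it to the drive right away."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def log_temps(log_path="temps.log", interval_s=5.0, iterations=None,
              base="/sys/class/thermal"):
    """Log timestamp, hottest temperature, and 1-minute load; loops forever if iterations is None."""
    done = 0
    while iterations is None or done < iterations:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        temp = hottest_zone_celsius(base)
        load1 = os.getloadavg()[0]
        append_durably(log_path, f"{stamp} temp_c={temp} load1={load1:.2f}")
        done += 1
        time.sleep(interval_s)
```

Start `log_temps()` in the background, trigger a heavy job, and after the forced power cycle look at the last few lines. A temperature climbing right before the log stops points at cooling; a flat temperature points back at RAM or something else.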
As others have said, it’s probably overheating.
That’s a mini, and likely doesn’t have any fans at all (or something perfunctory), so probably won’t handle being run at high cpu for more than a few minutes.
I currently have a small-form-factor PC with the same issue: drive and general box temps were high (the drive sat at 110 °F continuously, within range but on the edge). It would randomly reboot.
Replacing the paste on the cpu cooler helped a lot (no more random reboots), but adding a compressor-type fan dropped box temps (and more importantly drive temps), down to room temp.
I think the best you may be able to do is add an external compressor fan with some duct tape.
Memtest? Boot a live image and stress test each component?
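For the CPU part of a stress test, a dedicated tool such as stress-ng is the usual choice, but a rough stand-in is easy to sketch in Python: one worker process per core, each hashing in a tight loop for a fixed time. The function names and durations here are illustrative assumptions, and this only loads the CPU; it does not test RAM the way memtest does.

```python
import hashlib
import multiprocessing as mp
import os
import time


def burn(seconds):
    """Hash in a tight loop for `seconds`; return how many iterations completed."""
    deadline = time.monotonic() + seconds
    data = b"x" * 4096
    count = 0
    while time.monotonic() < deadline:
        # A SHA-256 digest is 32 bytes; repeat it to keep a 4096-byte buffer.
        data = hashlib.sha256(data).digest() * 128
        count += 1
    return count


def stress_cpu(seconds=60.0, workers=None):
    """Run `burn` in `workers` processes at once (default: one per CPU)."""
    workers = workers or os.cpu_count() or 1
    with mp.Pool(workers) as pool:
        return pool.map(burn, [seconds] * workers)
```

Call `stress_cpu(seconds=600)` from a script under an `if __name__ == "__main__":` guard while watching temperatures. If the machine freezes within that window on a live image too, the installed system and the containers are probably not the cause.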
I don’t think it’s overheating; that usually presents as throttling followed by a thermal-protection power-off.