Help diagnosing server freeze issue

swipernoswipey@piefed.social · 21 hours ago

basic_user@lemmy.world · 20 hours ago

I had a similar issue on an HP EliteDesk 800 Mini G3. It turned out to be a faulty RAM stick. Memtest revealed the problem, and after unplugging the defective stick the issue was resolved.

swipernoswipey@piefed.social · 14 hours ago

Thank you!!! I think this is the issue. With one of the RAM sticks in, memtest runs just fine. With the other, it never finishes and just reboots.

SayCyberOnceMore@feddit.uk · 12 hours ago

Definitely suspect.

You should be able to let memtest run for days with no problems, so a reboot would either be a faulty stick or possibly a faulty motherboard slot.

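Swap the RAM between slots to isolate the root cause: if the fault follows the stick, the stick is bad; if it stays with the slot, suspect the motherboard.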

azron@lemmy.ml · 17 hours ago

Memtest is the easiest and best place to start.

AMillionMonkeys@lemmy.world · 20 hours ago

CPU-heavy process

Sounds to me like a hardware issue: you’re overheating. Find a way to monitor your temps. I’m not sure how to do this on Linux, so I’m open to suggestions too.
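One way to check temps on Linux, as a minimal sketch: the kernel exposes thermal zones under `/sys/class/thermal`, with each `temp` file holding millidegrees Celsius. The function name `read_thermal_zones` is made up for illustration, and which zones exist (and which one is the CPU) depends on the hardware, so treat the output as a starting point. The `sensors` command from the lm-sensors package is the more common tool if you can install it.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"zoneN:type": temp_celsius} for each thermal zone under base.

    Hypothetical helper; assumes the standard sysfs layout where each
    thermal_zone*/temp file holds an integer in millidegrees Celsius.
    """
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            raw = (zone / "temp").read_text().strip()
            kind = (zone / "type").read_text().strip()
            celsius = int(raw) / 1000.0
        except (OSError, ValueError):
            # Some zones cannot be read or report nothing useful; skip them.
            continue
        readings[f"{zone.name}:{kind}"] = celsius
    return readings


def format_readings(readings):
    """Render the readings as one 'name: value' line per zone."""
    return "\n".join(f"{name}: {temp:.1f} C" for name, temp in readings.items())
```

Calling `print(format_readings(read_thermal_zones()))` while the CPU-heavy process runs should show whether any zone climbs toward its limit before the freeze.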

yaroto98@lemmy.world · 20 hours ago

This was my first thought too.

tal@lemmy.today · 13 hours ago

Leave the console visible on an attached monitor. I don’t recall if Debian out-of-box has Ctrl-Alt-F1 disabled, but if not, that’ll put you on the first console. If the kernel panics, it’ll display something there.

If you can’t do that (no spare monitor), you can set up a serial port console to another machine. I don’t know off the top of my head how to have the kernel emit errors there by default if it’s not the default, but I’m quite sure that it’s possible; I’ve debugged machines with kernel stack traces on serial port consoles. Sending a BREAK was equivalent to Magic SysRq, as I recall.

ragingHungryPanda@piefed.keyboardvagabond.com · 14 hours ago

In addition to the other suggestions of checking the RAM sticks, do you have resource limits on your containers? They’re generally a good thing to have anyway, but I’d set them up after checking the RAM and cooling situation. Check your CPU temps as well.

Shimitar@downonthestreet.eu · 20 hours ago

RAM issue or CPU overheating. Monitor CPU temperature over time, and run an extensive memtest, like for an entire night…
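For the "over time" part, a rough sketch of a logger that appends the hottest reading to a CSV file and forces each line to disk, so the last sample before a freeze survives the reboot. It assumes the Linux sysfs thermal interface (`/sys/class/thermal/thermal_zone*/temp`, in millidegrees Celsius); the names `max_temp_celsius` and `log_temps` and the log path are hypothetical.

```python
import os
import time
from datetime import datetime
from pathlib import Path


def max_temp_celsius(base="/sys/class/thermal"):
    """Return the hottest thermal zone reading in Celsius, or None if none."""
    temps = []
    for temp_file in Path(base).glob("thermal_zone*/temp"):
        try:
            temps.append(int(temp_file.read_text().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # unreadable or malformed zone; ignore it
    return max(temps) if temps else None


def log_temps(log_path, samples, interval_s=5.0, base="/sys/class/thermal"):
    """Append `samples` lines of 'ISO timestamp,max temp' to log_path."""
    with open(log_path, "a") as log:
        for i in range(samples):
            temp = max_temp_celsius(base)
            value = "" if temp is None else f"{temp:.1f}"
            stamp = datetime.now().isoformat(timespec="seconds")
            log.write(f"{stamp},{value}\n")
            # Flush and fsync so the line is on disk even if the box
            # freezes right after this sample.
            log.flush()
            os.fsync(log.fileno())
            if i < samples - 1:
                time.sleep(interval_s)
```

Something like `log_temps("/var/tmp/temps.csv", samples=10000)` left running overnight alongside the workload gives a record to check after the next freeze: a final line near the CPU's thermal limit points at cooling, while a normal temperature points back at RAM or something else.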

Onomatopoeia@lemmy.cafe · 20 hours ago

As others have said, it’s probably overheating.

That’s a mini, and likely doesn’t have any fans at all (or something perfunctory), so probably won’t handle being run at high cpu for more than a few minutes.

I currently have a small-form-factor PC with the same issue: drive and general box temps were high (the drive sat at 110°F continuously, within range but on the edge). It would randomly reboot.

Replacing the paste on the CPU cooler helped a lot (no more random reboots), but adding a compressor-type fan dropped box temps (and, more importantly, drive temps) down to room temp.

I think the best you may be able to do is add an external compressor fan with some duct tape.

frongt@lemmy.zip · edit-2 18 hours ago

Memtest? Boot a live image and stress test each component?

I don’t think it’s overheating; that usually presents as throttling followed by a thermal protection power-off.