CompFlu Sun logbook

aka: Taking notes when things break.

An issue ever since the mid-2021 cluster upgrade. Most of the new nodes handle full load just fine, but some would just shut themselves down. Sensor, syslog, and IPMI monitoring is inconclusive, but it is probably an overheating issue due to the high power density in the new rack. Better spacing of the nodes in the rack reduced the rate of spontaneous outages, but it still happens. Relocating nodes into other chassis shows that it really is a per-node issue and not related to the chassis.

RMA: sun-[33,35] (2023-05-11)

Nodes that are currently known to be unstable under high load: sun-36

Nodes that used to be unstable, but currently show no suspicious behaviour: sun-[39,53,60]

Self-defence script (automated power-on for nodes that should be running): hn-1:/apps/local/setup/nodes/ + cron job
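
For reference, a minimal sketch of what such a watchdog can look like; this is not the actual script under /apps/local/setup/nodes/, and the BMC naming, credentials file, and down-node threshold are assumptions. It would be run from root's crontab on hn-1 every few minutes.

  #!/bin/bash
  # Sketch of a reboot-if-offline watchdog, NOT the actual script on hn-1.
  # Assumptions: each node's BMC answers at <node>-bmc, IPMI credentials
  # live in /root/.ipmi-cred as "user:password", and the threshold is made up.

  MAX_DOWN=5    # thermal DDoS protection: if many nodes die at once,
                # assume a cooling problem and leave everything off

  IPMI_USER=$(cut -d: -f1 /root/.ipmi-cred)
  IPMI_PASS=$(cut -d: -f2 /root/.ipmi-cred)

  # Nodes that Slurm considers down but that should be running
  mapfile -t down < <(sinfo --noheader --Node --states=down --format='%N' | sort -u)

  if (( ${#down[@]} > MAX_DOWN )); then
      logger -t node-watchdog "${#down[@]} nodes down, refusing to power on"
      exit 0
  fi

  for node in "${down[@]}"; do
      status=$(ipmitool -I lanplus -H "${node}-bmc" \
               -U "$IPMI_USER" -P "$IPMI_PASS" chassis power status)
      if [[ "$status" == *"is off"* ]]; then
          logger -t node-watchdog "powering on ${node}"
          ipmitool -I lanplus -H "${node}-bmc" \
              -U "$IPMI_USER" -P "$IPMI_PASS" chassis power on
      fi
  done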

https://support.sysgen.de/en/support/tickets/213

  • In the night from 2022-05-10 to 2022-05-11, the cooling system of the building broke, cutting the LCP off from the cold water supply. The rack doors were configured to open automatically in this case, which they did; still, a good part of the servers did not survive the heat in the server room and quit service via auto-shutdown (sun-43 was the first at 23:20 CEST; 24 of the 28 epyc nodes were offline by 8 am).
  • 2022-05-11, 14:30: Provisionally fixed cooling supply, holds up a few hours…
  • In the following days, we (Peter Burzler, Matthias Fuchs/WISAG, me) installed a mobile air conditioner plus a cardboard screen that keeps the door open but blocks the airstream except for the A/C exhaust.
  • The cold water supply started working again on 2022-05-18, only to break again on 2022-06-17.
  • The final all-clear from building services (Haustechnik) came on 2022-07-11, and we went back online.

Loss of (potential) compute time: 31 days of emergency operation ≈ 3 million core-hours

  • Driven by institute/admin team, Olivier, Johannes
  • 2022-12-07: Switched off the machines of the haswell12, haswell, and gold partitions. With these queues very rarely used, the loss for scientific work is negligible, but it saves about 2.3 kW of power that we neither have to pay for nor have to carry away as heat via the cooling water.
  • 2023-03-23: Switched sun-[08-24] of the haswell queue back on, as an experiment on the projected usage of this compute resource extension.
  • 2023-05-04: Usage of haswell was 2.4%, so I switched the queue off again.

Loss of (potential) compute time (2022-12-07 through 2023-05-04): 2.84 million core-hours (savings still ongoing)
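
Quick sanity check of the two figures above; this is nothing more than core-hours divided by wall-clock hours, the actual partition sizes are not restated here:

  echo '3000000 / (31*24)'  | bc -l   # 2022 cooling outage: ~4030 cores' worth of lost capacity
  echo '2840000 / (148*24)' | bc -l   # 2022-12-07..2023-05-04 (148 days): ~800 cores switched off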

  • 2023-05-24: Release upgrade of sn-1 and hn-1 to Qlustar 13, accumulated fixes for Qlustar 12 on all end-user machines (login, sun-[01-60], workstations) and dam (which only came online again on the second reboot cycle).

Old: focal (Qlustar 12)

New: jammy (Qlustar 13)

chroot update workflow

Getting components to a working state

UPDATE 2023-09-18: Proper deb repo and PPA handling

  • 2023-07-04, 1 am, max water temp 32.6°C: cold water supply breaks down. According to Peter (11 am), it is not physical damage but a software defect.
  • 2023-07-05 in the morning: cold water supply appears stable again
  • 2023-07-15, 15:00-20:00, max water temp 24.9°C: emergency shutdowns of several machines (but probably linked to the very hot weather on that afternoon)
  • 2023-07-19, 10:00-13:00, max water temp 22.2°C: emergency shutdowns of 10 machines sun-[37,38,39,40,42,50,53,54,56,60]. (This triggers the thermal DDoS protection of the reboot-if-offline cron job on hn-1, so that the machines stay offline for the time being.) Water temps appear stable and comfortable after 13:00.

sbatch: error: Batch job submission failed: I/O error writing script/environment to file when users tried to submit jobs to the queue. On the clients, systemd-timesyncd was failing as well.

Root cause: On 2023-10-22, 23:53, the InfiniBand card of hn-1 started misbehaving (syslog: mlx5_core 0000:01:00.0: poll_health:723:(pid 0): Fatal error 1 detected and so on); the opensm log quickly filled up the whole /var partition, so syslog and the Slurm state could not be written any more.

Resolution:

  • Increase the size of the hn-1:/var/log LVM slice, save the logs, stop the opensm daemon (see the sketch after this list)
  • Replace the InfiniBand cable with a new one (I had swapped the cables between hn-1 and sn-1 before)
  • Reboot hn-1
  • Modify /usr/lib/opensm/opensm-monitor (which comes from Qlustar's own opensm package) to write logs to the new /var/log/opensm/ volume (rather than /var)
  • Stable solution? I don't know.
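
For the record, roughly what the storage side of the fix looks like as commands. This is a sketch only; the volume group and LV names as well as the sizes are placeholders, not the actual layout on hn-1.

  # Diagnose: /var ran full because of the runaway opensm log
  df -h /var
  du -sh /var/log/opensm.log

  # Grow the existing /var/log LV (assumes ext4 and free extents in the VG)
  lvextend --size +20G /dev/vg0/var_log
  resize2fs /dev/vg0/var_log

  # Stop the subnet manager so the log stops growing
  # (service name may differ with Qlustar's own opensm package)
  systemctl stop opensm

  # Dedicated volume for the opensm logs, mounted at /var/log/opensm
  lvcreate --size 20G --name opensm_log vg0
  mkfs.ext4 /dev/vg0/opensm_log
  mkdir -p /var/log/opensm
  mount /dev/vg0/opensm_log /var/log/opensm   # plus an /etc/fstab entry

  # opensm itself accepts a log file option, which is what the edited
  # opensm-monitor wrapper now passes
  opensm --log_file /var/log/opensm/opensm.log --daemon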

A similar issue had happened in September 2022. Back then we concluded that the IB card was in fact broken and RMA'd it to SysGen. (Until the replacement arrived in early October, hn-1 provisionally ran with an IB card from a GPU node.)

Above-average rates of sun-gpu-e stalling (not crashing, but unresponsive) without apparent reason. At the same time, the InfiniBand connection has become dysfunctional (NFS traffic is transparently redirected over Ethernet).

2024-02-16: Narrowing it down to a mainboard issue (broken PCIe slot?). Moved the IB card to another slot; appears to work fine for now.

Microsoft core fonts, installed via mamba install -c conda-forge mscorefonts. Added symlinks from /apps/local/software/Anaconda4/2023.03/fonts to /apps/chroots/*/usr/share/fonts/Anaconda4.
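
The symlinking boils down to something like this (a sketch: a link named Anaconda4 inside each chroot's font directory, pointing at the conda-packaged fonts; the fc-cache step is my assumption and may not be needed, depending on how fontconfig is set up in the chroots):

  FONTS=/apps/local/software/Anaconda4/2023.03/fonts
  for c in /apps/chroots/*/; do
      ln -sfn "$FONTS" "${c}usr/share/fonts/Anaconda4"
  done
  # possibly followed by an fc-cache -f run inside each chroot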

This node (sun-36) has been notorious for thermal shutdowns, just not annoying enough to RMA it back then. With the rome nodes out of warranty for a few months now, I am willing to give some hands-on caretaking a try. The only user-serviceable parts in the nodes are the IB card (dual PCIe), RAM, and the CPU heatsinks. I unscrewed the heatsinks and inspected the thermal paste. It was applied okayish (even, generous, with few air bubbles) and is still pasty, though it appears somewhat flaky and dried out in some places.

  1. Unscrew heatsink (Torx)
  2. Thorough removal of old thermal paste
  3. Apply new thermal paste Arctic MX-6. Since the gap between CPU IHS and the heatsink is substantial (about 1 mm), we need quite a lot of thermal paste
  4. Trial screw-down of the heatsink; check whether the paste spreads across the whole IHS, apply more when in doubt
  5. Final screw-down.

Afterwards, sun-36 passed a full-load burn-in test over 17 hours, with CPU temperatures k10temp-pci-00cb Tccd{1..8} = 56.5(26)°C and k10temp-pci-00c3 Tccd{1..8} = 70.7(15)°C. Compared to another smoothly running node (54.4(25)°C, 71.8(27)°C), these values, particularly those of CPU 2, look very promising.

The next compute node that shows the slightest tendency for misbehaviour will be subject to the following measures:

  • Factory state of the cooling system: take a full-load measurement (multi-hour synthetic CPU benchmark, if possible; see the sketch after this list)
  • Expose, inspect, and clean CPUs. Apply new thermal paste
  • Old CPU order, new paste: take another full-load measurement (ideally, under similar conditions of overall cluster load) → Impact of thermal paste
  • Take out the node again, swap CPU1 ↔ CPU2, repeat measurements → estimate inherent SKU spread, and even out overall wear better.
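
For the full-load measurements, something along these lines should do, assuming stress-ng and lm-sensors are available on the node; the tooling behind the 17-hour burn-in above is not spelled out here, so this is a sketch, not the exact procedure.

  # Load all logical CPUs for 12 hours and log the k10temp readings once a minute
  LOG=/tmp/burnin-$(hostname)-$(date +%F).log
  stress-ng --cpu 0 --timeout 12h --metrics-brief &
  STRESS_PID=$!

  while kill -0 "$STRESS_PID" 2>/dev/null; do
      { date '+%F %T'; sensors 'k10temp-*'; } >> "$LOG"
      sleep 60
  done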

2024-03-12 ...and the next node was – sun-36, again :-/

Debug/stress-test sun-36
