CompFlu sun cluster: Operation and maintenance
Compute Nodes (sun-[01-60])
A node behaves weird (OS/hardware level)
- Check status → IPMI:
ipmi-power -h ${NODE}-ipmi -u $IPMI_USER -p $IPMI_PW --stat
- Offline → IPMI boot:
ipmi-power -h ${NODE}-ipmi -u $IPMI_USER -p $IPMI_PW --on
- Online but unresponsive → IPMI reboot
- Online but fallen off the queue:
ssh $NODE systemctl restart slurmd
- Unknown → open the IPMI virtual console (web browser to https://192.168.64.${NODE}) and investigate step by step
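The recovery ladder above can be sketched as a small shell helper. This is a sketch under assumptions: it presumes freeipmi's ipmi-power (as used above) and the NODE/IPMI_USER/IPMI_PW variables from this page; the reboot case uses ipmi-power's --cycle flag. DRY_RUN=1 prints each command instead of executing it, so the helper can be exercised safely off-cluster.

```shell
# Sketch of the recovery ladder as one helper function. Assumptions: freeipmi's
# ipmi-power is installed; NODE, IPMI_USER, IPMI_PW are set as elsewhere on
# this page. Set DRY_RUN=1 to print each command instead of running it.
node_recover() {
  run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$@"; else "$@"; fi; }
  case "$1" in
    status) run ipmi-power -h "${NODE}-ipmi" -u "$IPMI_USER" -p "$IPMI_PW" --stat ;;
    boot)   run ipmi-power -h "${NODE}-ipmi" -u "$IPMI_USER" -p "$IPMI_PW" --on ;;
    reboot) run ipmi-power -h "${NODE}-ipmi" -u "$IPMI_USER" -p "$IPMI_PW" --cycle ;;
    slurmd) run ssh "$NODE" systemctl restart slurmd ;;
    *)      echo "usage: node_recover {status|boot|reboot|slurmd}" >&2; return 1 ;;
  esac
}
```

Typical use on the head node: `NODE=sun-07 node_recover status`, then escalate to `boot` or `reboot` as needed.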
A node behaves weird (Slurm queue level)
Blocking node state
Empty nodes are idle, occupied nodes are mix or alloc. Sometimes nodes fall into the drain state for various reasons.
To find out why: scontrol show node=$NODE
- Send a node into drain state (for reboot/update/hardware maintenance/repurposing etc.):
scontrol update NodeName=$NODE state=DRAIN reason="You should (and must) give a reason"
- If it is in drain state but shouldn't be:
scontrol update NodeName=$NODE state=UNDRAIN
The asterisk suffix in sinfo (e.g. down*) denotes that the node is unreachable by Slurm.
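To see at a glance which nodes are drained or down and why, sinfo can list the recorded reasons. In the snippet below, DRY defaults to echo so it is a dry run by default; run it with DRY= (empty) on the head node to actually query Slurm.

```shell
# List all down/drained nodes together with the recorded reason
# ('sinfo --list-reasons' is the long form of -R).
# DRY defaults to 'echo' (dry run); call with DRY= (empty) to execute.
${DRY:-echo} sinfo -R
```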
Blocking job queues
Symptom: A node appears to behave properly, but executes fewer jobs than expected. A common cause is greedy job scripts that request most or all of the host's CPU or RAM. (For GPU nodes, an improper GRES setup is another possible cause.)
Display details about a job:
scontrol show jobid <job id>
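To check whether a node is fully booked, compare CPUAlloc against CPUTot in scontrol's output. A minimal sketch: the awk filter parses KEY=VALUE pairs, and the sample line at the end (hypothetical node sun-07) only mimics 'scontrol show node' output; in practice pipe the real command into the filter.

```shell
# Spot a fully booked node: compare CPUAlloc vs CPUTot in 'scontrol show node'
# output. Real use:  scontrol show node $NODE | greedy_check
greedy_check() {
  awk '{
    for (i = 1; i <= NF; i++) { split($i, kv, "="); v[kv[1]] = kv[2] }
    if (v["CPUAlloc"] != "" && v["CPUAlloc"] == v["CPUTot"])
      print v["NodeName"] " is fully CPU-allocated"
  }'
}
# Sample line mimicking scontrol output (hypothetical values):
echo 'NodeName=sun-07 CPUAlloc=64 CPUTot=64 RealMemory=192000' | greedy_check
```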
Head Node
hn-1
- also hosts /home-er via NFS (InfiniBand within the cluster rack, Ethernet to the offices)
- also hosts the login VM
- also hosts active (dnsmasq, LDAP) and passive (squashfs images, rootfs overlays, chroots, /srv/ql-common) Qlustar infrastructure
Supervision
ganglia
https://monitor.hi-ern.de/?c=sun&m=load_one&r=hour&s=by%20name&hc=4&mc=2
If $NODE is online but invisible in Ganglia (e.g. after restarting the host gather process gmetad on hn-1), restart the Ganglia client daemon:
ssh $NODE systemctl restart ganglia-monitor
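If many nodes dropped out of Ganglia at once (typical after a gmetad restart), the client restart can be looped over all compute nodes. A sketch, assuming the sun-01…sun-60 naming from the top of this page; ECHO defaults to echo, so by default the loop only prints the ssh commands.

```shell
# Restart the Ganglia client on every compute node (sun-01 … sun-60).
# ECHO defaults to 'echo' (dry run); call with ECHO= (empty) on a host
# with ssh access to the nodes to execute for real.
ganglia_restart_all() {
  for i in $(seq -w 1 60); do
    ${ECHO:-echo} ssh "sun-$i" systemctl restart ganglia-monitor
  done
}
ganglia_restart_all
```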
mailing list
server.nbg@fz-juelich.de receives:
- Nagios alerts
- Rittal warnings
- system mail from the servers
Slurm
The HPC queue is managed by Slurm. Slurm keeps a verbose record of its usage over time.
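That record can be queried, for example, with sacct (per-job records) and sreport (aggregated utilization). A sketch, not an exhaustive reference, and it assumes Slurm accounting (slurmdbd) is configured; DRY defaults to echo so the snippet is a dry run until called with DRY= (empty).

```shell
# Example accounting queries against Slurm's usage record.
# DRY defaults to 'echo' (dry run); call with DRY= (empty) to execute.
${DRY:-echo} sacct --starttime "$(date +%Y-%m-01)" --format=JobID,User,Partition,Elapsed,State
${DRY:-echo} sreport cluster Utilization start="$(date +%Y-%m-01)"
```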