CompFlu sun cluster: Operation and maintenance

  • check status → IPMI: ipmi-power -h ${NODE}-ipmi -u $IPMI_USER -p $IPMI_PW --stat
  • offline → IPMI boot: ipmi-power -h ${NODE}-ipmi -u $IPMI_USER -p $IPMI_PW --on
  • online but unresponsive → IPMI reboot
  • online but dropped out of the queue → ssh $NODE systemctl restart slurmd
  • unknown → open the IPMI virtual console (web browser to https://192.168.64.${NODE}) and work through the problem step by step
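
The decision list above can be sketched as a small shell helper. This is only an illustration, not part of the cluster tooling: the function names are hypothetical, and it assumes that ipmi-power --stat prints one line of the form "<host>: on" or "<host>: off" and that IPMI_USER and IPMI_PW are set in the environment.

```shell
# Hypothetical helpers; not part of the cluster tooling.

# Reduce the output of "ipmi-power --stat" (assumed format: "<host>: on|off")
# to a single word: on, off, or unknown.
parse_power_state() {
    case "${1##*: }" in
        on)  echo on ;;
        off) echo off ;;
        *)   echo unknown ;;
    esac
}

# Query the BMC of node $1 and print its power state.
power_state() {
    local out
    out=$(ipmi-power -h "${1}-ipmi" -u "$IPMI_USER" -p "$IPMI_PW" --stat 2>/dev/null)
    parse_power_state "$out"
}

# Print the suggested next step for node $1, following the list above.
triage() {
    case "$(power_state "$1")" in
        off) echo "offline: boot via IPMI (--on)" ;;
        on)
            # BatchMode avoids hanging on a password prompt.
            if ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null; then
                echo "online: if missing from the queue, restart slurmd"
            else
                echo "online but unresponsive: reboot via IPMI"
            fi ;;
        *)   echo "unknown: open the IPMI virtual console" ;;
    esac
}
```

Calling triage $NODE prints a single suggestion; it deliberately does not power-cycle anything by itself.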

Blocking node state

Empty nodes are idle, occupied nodes are mix or alloc. Nodes can also end up in the drain state, for a variety of reasons.

To find out why, check the Reason field in the output of: scontrol show node=$NODE

  • Send a node into drain state (for reboot/update/hardware maintenance/repurposing etc.): scontrol update NodeName=$NODE state=DRAIN reason="You should (and must) give a reason"
  • If it is in drain state but shouldn't be: scontrol update NodeName=$NODE state=UNDRAIN
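
The two scontrol calls above could be wrapped so that a drain without a reason is rejected before it reaches Slurm. A minimal sketch; the function names drain_node and undrain_node are hypothetical.

```shell
# Hypothetical wrappers around the scontrol calls above.

# Drain node $1; refuses to run without a non-empty reason in $2.
drain_node() {
    if [ -z "$2" ]; then
        echo "usage: drain_node NODE REASON (reason is mandatory)" >&2
        return 2
    fi
    scontrol update NodeName="$1" State=DRAIN Reason="$2"
}

# Return node $1 to service.
undrain_node() {
    scontrol update NodeName="$1" State=UNDRAIN
}
```

For an overview of all nodes that are currently down or drained together with their reasons, sinfo -R can be used.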

The asterisk suffix in sinfo output (e.g. down*) denotes that the node is not responding to Slurm.
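
To list only the nodes carrying the asterisk suffix, the sinfo output can be filtered. A minimal sketch, assuming one "NODE STATE" pair per line as produced by the format string shown in the comment; the function name is hypothetical.

```shell
# Hypothetical filter: read "NODE STATE" lines on stdin and print each node
# whose state carries the "*" suffix (duplicates from multiple partitions
# are collapsed).
unreachable_nodes() {
    awk '$2 ~ /\*$/ { print $1 }' | sort -u
}

# Intended use (per-node listing, no header, node name and compact state):
#   sinfo -N -h -o '%N %t' | unreachable_nodes
```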

Blocking job queues

Symptom: A node appears to behave properly, but executes fewer jobs than expected. A common cause is greedy job scripts that request most or all of the host's CPU or RAM resources. (For GPU nodes, improper GRES setup is another possible reason.)

Display details about a job:

scontrol show jobid <job id>
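
To check whether a node is already fully booked by a few greedy jobs, its allocated and total resources can be compared. A minimal sketch, assuming that the scontrol show node output contains the fields CPUAlloc, CPUTot, AllocMem and RealMemory; the function names are hypothetical.

```shell
# Hypothetical helper: read "scontrol show node" output on stdin and print
# the value of the field named in $1 (first match only).
node_field() {
    tr -s ' ' '\n' | awk -F= -v key="$1" '$1 == key && !seen { print $2; seen = 1 }'
}

# Print allocated/total CPUs and memory for node $1.
node_usage() {
    local info
    info=$(scontrol show node="$1")
    echo "CPUs: $(echo "$info" | node_field CPUAlloc)/$(echo "$info" | node_field CPUTot)"
    echo "Mem (MB): $(echo "$info" | node_field AllocMem)/$(echo "$info" | node_field RealMemory)"
}
```

If the allocation is close to the total while only a handful of jobs run, the per-job requests on that node can be listed with squeue -w $NODE -o '%i %u %C %m' (job id, user, CPUs, minimum memory).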

hn-1

  • also hosts /home-er via NFS (InfiniBand within the cluster rack, Ethernet to the offices)
  • also hosts the login VM
  • also hosts active (dnsmasq, LDAP) and passive (squashfs images, rootfs overlays, chroots, /srv/ql-common) Qlustar infrastructure

https://monitor.hi-ern.de/?c=sun&m=load_one&r=hour&s=by%20name&hc=4&mc=2

If $NODE is online but invisible in Ganglia (e.g. after restarting the collector daemon gmetad on hn-1), restart the Ganglia client daemon on the node: ssh $NODE systemctl restart ganglia-monitor
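
When several nodes disappear from Ganglia at once, the restart can be looped over them. A minimal sketch; the function name is hypothetical and the node names in the usage line are placeholders.

```shell
# Hypothetical helper: restart the Ganglia client daemon on every node given
# as an argument and report nodes where the restart failed.
restart_ganglia_clients() {
    local node
    for node in "$@"; do
        ssh "$node" systemctl restart ganglia-monitor \
            || echo "restart failed on $node" >&2
    done
}

# Intended use (placeholder node names):
#   restart_ganglia_clients node-a node-b
```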

FIXME server.nbg@fz-juelich.de

  • nagios
  • Rittal warnings
  • system mailer from servers

The HPC queue is managed by Slurm, which keeps a detailed record of cluster usage over time.

Examples
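
One possible query against the usage record, assuming Slurm accounting is enabled and sacct is available on the host: summing CPU time per user. The function name and the start date in the usage line are illustrative only.

```shell
# Hypothetical helper: read "User|CPUTimeRAW" lines (CPU time in seconds)
# on stdin and print CPU hours per user, sorted by user name.
cpu_hours_by_user() {
    LC_ALL=C awk -F'|' '$1 != "" { s[$1] += $2 }
        END { for (u in s) printf "%s %.1f\n", u, s[u] / 3600 }' | LC_ALL=C sort
}

# Intended use (all users, one line per job allocation, since a start date):
#   sacct -a -X -n -P -S 2024-04-01 -o User,CPUTimeRAW | cpu_hours_by_user
```

The -X flag restricts the output to one line per job allocation, so job steps are not counted twice; -n and -P drop the header and produce "|"-separated fields that are easy to parse.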

  • compflu/backstage/cluster-operation.txt
  • Last modified: 2024-04-16 13:05
  • by j.hielscher