HPC Logs and Log Management

What Logs Should I Check?

The logs have been configured so that all important information can be found on the headnode. In most cases there is no need to log in to individual nodes to check log files. The following is general guidance on where to look for problems.

General Issues

Always start with the /var/log/messages file on the headnode. SSH login issues are recorded in /var/log/secure.

Node Boot-up Issues

On the headnode, check /var/log/messages or /var/log/nodes/n*.log. The /var/log/messages file also contains host DHCP information that can be important in debugging node booting issues.
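As a minimal sketch of the DHCP check described above, the helper below pulls DHCP-related lines out of a log file and optionally narrows them to one host. The function name dhcp_lines is hypothetical, and the sketch assumes the DHCP server writes messages containing the string "DHCP" (as ISC dhcpd does with DHCPDISCOVER, DHCPOFFER, and so on); other DHCP servers may log differently.

```shell
# dhcp_lines LOGFILE [FILTER]
# Print DHCP-related lines from LOGFILE. If FILTER is given (for example a
# node name or MAC address), print only the DHCP lines that also contain it.
# Hypothetical helper; assumes DHCP messages contain the string "dhcp".
dhcp_lines() {
    logfile="$1"
    filter="${2:-}"
    if [ -n "$filter" ]; then
        # -i ignores case; -F treats FILTER as a fixed string, not a regex.
        grep -i 'dhcp' "$logfile" | grep -iF -- "$filter"
    else
        grep -i 'dhcp' "$logfile"
    fi
}

# Example (on the headnode):
#   dhcp_lines /var/log/messages n2
```

A node that never appears in the DHCP lines is probably not reaching the headnode over the network at boot time, which narrows the problem before looking at the node's own log.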

Specific Node Issues

Check the log file for the node in /var/log/nodes/ on the headnode. This file contains the log messages forwarded from that node. If rsyslog itself is the problem and messages are not arriving on the headnode, log in to the node and examine its /var/log/messages file directly.

Slurm Issues

Start with /var/log/slurmctld.log. If the issue is specific to one node, check /var/log/slurmd.log and grep for the node name (e.g. grep n2 /var/log/slurmd.log). If the headnode slurmd.log does not contain any useful information, log in to the node and examine /var/log/slurmd.log there.
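A plain grep for a short node name such as n2 also matches n20, n21, and so on. The sketch below shows one way to avoid that by matching the name as a whole word. The function name node_log_lines is hypothetical and is not part of Slurm.

```shell
# node_log_lines NODE LOGFILE
# Print the lines of LOGFILE that mention NODE as a whole word.
# Hypothetical helper, not part of Slurm.
node_log_lines() {
    node="$1"
    logfile="$2"
    # -w matches the name as a whole word, so "n2" does not also match "n20".
    # -F treats the name as a fixed string rather than a regular expression.
    grep -wF -- "$node" "$logfile"
}

# Example (on the headnode), showing only the most recent matches:
#   node_log_lines n2 /var/log/slurmd.log | tail -n 50
```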

Node Log Management

On HPC systems, the nodes are booted and run as RAM disks (see Warewulf Worker Node Images). Log management on the worker nodes is important so that logs do not fill the RAM disk or take up too much space during normal operation. The logs on the nodes are managed as follows.

The rsyslog and logrotate packages are used to manage logs.

Log Routing and Message Suppression on Worker Nodes

The rsyslog configuration is defined in /etc/rsyslog.conf on the worker nodes. There are also configuration files in /etc/rsyslog.d to remove repetitive sshd/login messages that fill up the logs. On the worker nodes, any configuration changes to these files will not survive a reboot unless they are made in the VNFS. (See Warewulf Worker Node Images.)
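To illustrate what a suppression file in /etc/rsyslog.d might look like, the sketch below writes a small drop-in that discards one repetitive sshd message. Everything specific here is an assumption for illustration: the function name write_sshd_filter, the file name 10-quiet-sshd.conf, and the matched message text are hypothetical and are not taken from the actual configuration. The stop action is the form used by newer rsyslog releases; older releases used ~ to discard a message.

```shell
# write_sshd_filter DIR
# Write a hypothetical rsyslog drop-in file into DIR.
write_sshd_filter() {
    conf_dir="$1"
    # The quoted 'EOF' keeps the shell from expanding $programname and $msg,
    # which are rsyslog properties, not shell variables.
    cat > "$conf_dir/10-quiet-sshd.conf" <<'EOF'
# Discard one repetitive sshd message.
# The matched text is an example only; adjust it to the messages seen on site.
if $programname == 'sshd' and $msg contains 'Connection closed by' then stop
EOF
}

# Example, with the VNFS chroot location left as a placeholder:
#   write_sshd_filter /path/to/vnfs/chroot/etc/rsyslog.d
```

Writing the file into the VNFS rather than onto a running node is what makes the change persist across reboots.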

Log Rotation

General log rotation is determined by /etc/logrotate.conf and any package configuration files in /etc/logrotate.d. On the worker nodes, any configuration changes to these files will not survive a reboot unless they are made in the VNFS. (See Warewulf Worker Node Images.)
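As a rough example of a package-style file in /etc/logrotate.d, the sketch below writes a size-based rotation rule that keeps a log small on a RAM disk. The function name write_slurmd_rotation, the file name slurmd-node, and the chosen size and rotation count are assumptions for illustration only and do not reflect the actual settings on the nodes.

```shell
# write_slurmd_rotation DIR
# Write a hypothetical logrotate drop-in file into DIR.
write_slurmd_rotation() {
    conf_dir="$1"
    cat > "$conf_dir/slurmd-node" <<'EOF'
# Example values only: rotate once the log exceeds 10 MB, keep two old copies.
/var/log/slurmd.log {
    size 10M
    rotate 2
    compress
    missingok
    notifempty
    copytruncate
}
EOF
}

# Example, with the VNFS chroot location left as a placeholder:
#   write_slurmd_rotation /path/to/vnfs/chroot/etc/logrotate.d
```

Running logrotate -d against the resulting file performs a dry run that reports what would be rotated without changing anything, which is a safe way to check a rule before rebuilding the node image.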