The logs have been configured so that all important information can be found on the headnode. There should not be a need to log in to nodes and check log files. The following is some general guidance when looking for problems.
Always start with the headnode /var/log/messages file. Also, ssh login issues can be found in /var/log/secure.
On the headnode, check /var/log/messages or /var/log/nodes/n*.log. The /var/log/messages file also contains host DHCP information that can be important in debugging node booting issues.
Check the node log file in /var/log/nodes/. This file contains the logs from the node. In the event of an rsyslog issue, check the node directly by logging in to the node and examining the /var/log/messages file.
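The node log check above can be sketched as a small shell helper. This is a minimal sketch, not part of the system: the function name node_log and the NODE_LOG_DIR variable are hypothetical, and it assumes that node n2 logs to n2.log, following the /var/log/nodes/n*.log pattern mentioned earlier.

```shell
# Directory holding the per-node logs on the headnode (path taken from the text).
# NODE_LOG_DIR is a hypothetical override, useful for trying the helper elsewhere.
NODE_LOG_DIR="${NODE_LOG_DIR:-/var/log/nodes}"

# Print the last lines of one node's log, or a hint if the log is missing or empty.
# Usage: node_log NODE [LINES]
# Assumption: node "n2" is logged to "$NODE_LOG_DIR/n2.log".
node_log() {
    nl_file="$NODE_LOG_DIR/$1.log"
    if [ -s "$nl_file" ]; then
        tail -n "${2:-20}" "$nl_file"
    else
        # An empty or missing file may point to an rsyslog issue.
        echo "no log data in $nl_file; check /var/log/messages on node $1 directly" >&2
        return 1
    fi
}
```

For example, node_log n2 50 would print the last 50 lines logged for node n2. A non-zero return suggests logging in to the node and reading its /var/log/messages file.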
Start with /var/log/slurmctld.log. If there is a specific node issue, check the /var/log/slurmd.log file and grep for the node name (e.g. grep n2 /var/log/slurmd.log). If there does not seem to be any information in the headnode slurmd.log, try logging in to the node and examining /var/log/slurmd.log directly.
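One detail to keep in mind when searching by node name: a plain grep n2 also matches n20, n21, and so on. The sketch below uses the standard grep -w option to match the node name as a whole word only. The function name slurmd_lines is hypothetical; the default log path is the one given above.

```shell
# Print slurmd log lines that mention NODE as a whole word,
# so that searching for n2 does not also return lines for n20 or n21.
# Usage: slurmd_lines NODE [LOGFILE]
slurmd_lines() {
    grep -w -- "$1" "${2:-/var/log/slurmd.log}"
}
```

For example, slurmd_lines n2 searches the default file, and a different log file can be passed as the second argument.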
On HPC systems, the nodes are booted and run as RAM disks (see Warewulf Worker Node Images). Log management on the worker nodes is important so that the RAM disk does not fill up or take too much space during the normal course of operation. The logs on the nodes are managed as follows.
The rsyslog and logrotate packages are used to manage logs.
The rsyslog configuration is defined in /etc/rsyslog.conf on the worker nodes. There are also configuration files in /etc/rsyslog.d to remove repetitive sshd/login messages that fill up the logs. On the worker nodes, any configuration changes to these files will not survive a reboot unless they are made in the VNFS. (See Warewulf Worker Node Images.)
The /var/log/boot.log information goes to the headnode /var/log/messages file and to the /var/log/nodes/n* file for the specific node. This setting allows all node boot-up information to be watched in the headnode system log (tail -f /var/log/messages).
All other node logs (except the slurmd and cron logs) go to the appropriate /var/log/nodes/n* file for each node.
The /var/log/slurmd.log files are written locally and a copy is routed and aggregated to /var/log/cluster-slurmd.log on the headnode. The routing includes the headnode slurmd.log file if it exists. The node /var/log/slurmd.log files are purged each week. This setting allows all slurmd messages to be isolated to a single file on the headnode; grep can be used to search on worker node names.
/var/log/cron is only written to the node (and purged each night).
The files in /etc/rsyslog.d are used to remove repetitive sshd/login messages on the nodes. These messages occur when the headnode is using ssh to check on conditions on the worker nodes. Each of these *.conf files contains a simple rule for detecting and ignoring these messages.
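Because slurmd messages from all nodes end up in one aggregated file, it can be useful to see which node is producing the most lines. The following is a rough sketch under one stated assumption: the file uses the traditional syslog line layout (month, day, time, hostname, then the message), so the hostname is the fourth whitespace-separated field. The function name lines_per_host is hypothetical, and the field number may need adjusting if the rsyslog template on the system differs.

```shell
# Count log lines per host in an aggregated log, busiest host first.
# Usage: lines_per_host [LOGFILE]
# Assumption: traditional syslog layout "Mon DD HH:MM:SS host tag: message",
# which makes the hostname field 4.
lines_per_host() {
    awk '{ count[$4]++ } END { for (h in count) print count[h], h }' \
        "${1:-/var/log/cluster-slurmd.log}" | sort -rn
}
```

Each output line is a count followed by a hostname. A node with an unusually high count is a reasonable place to start looking.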
General log rotation is determined by /etc/logrotate.conf and any package configuration files in /etc/logrotate.d. On the worker nodes, any configuration changes to these files will not survive a reboot unless they are made in the VNFS. (See Warewulf Worker Node Images.)
The node logs on the headnode (/var/log/nodes/n*) are rotated each week, and the past four weeks are kept (and possibly compressed, depending on the configuration). These settings can be changed in /etc/logrotate.conf and the files in /etc/logrotate.d.
Log files on the worker nodes (boot.log information also goes to /var/log/nodes/n*, and a copy of slurmd.log goes to the headnode):
/var/log/nhc/nhc.log
/var/log/warewulf/messages
/var/log/warewulf/wwgetfiles.log
/var/log/slurmd.log
/var/log/boot.log
/var/log/warewulf/provision/*.log
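Since the worker nodes run from a RAM disk, it may help to check how much space the log files actually occupy. The sketch below lists files under a log directory by size in bytes, largest first. The function name log_sizes is hypothetical, and it uses only standard find, wc, and sort.

```shell
# List regular files under a directory with their size in bytes, largest first.
# Usage: log_sizes [DIR]   (DIR defaults to /var/log)
log_sizes() {
    find "${1:-/var/log}" -type f 2>/dev/null | while IFS= read -r ls_file; do
        # wc -c reports the file size in bytes.
        printf '%s %s\n' "$(wc -c < "$ls_file")" "$ls_file"
    done | sort -rn
}
```

Running log_sizes /var/log | head on a worker node (as a user who can read the logs) shows the largest files first, which indicates whether a rotation or purge setting needs tightening.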