hpc_logs_and_log_management

Differences

This shows you the differences between two versions of the page.

hpc_logs_and_log_management [2020/07/03 16:03]
deadline updated log policy
hpc_logs_and_log_management [2021/05/20 16:53] (current)
brandonm [Node Log Management] Case, word, formatting, punctuation, and whitespace fixes
Line 3:
 ====What Logs Should I Check?====
  
-The logs have been configures so that all important information can be found on the headnode. There should not a need to log  in to nodes and check log files. The following is some general guidance when looking for problems.
+The logs have been configured so that all important information can be found on the headnode. There should not be a need to log  in to nodes and check log files. The following is some general guidance when looking for problems.
  
 ===General Issues===
 
-Always start with the headnode ''/var/log/messages'' file. Also ''ssh'' login issue can be found in ''/var/log/secure''
+Always start with the headnode ''/var/log/messages'' file. Also, ''ssh'' login issues can be found in ''/var/log/secure''.
  
-===Node Boot-up Issue===
+===Node Boot-up Issues===
 
-On the headnode, check ''/var/log/messages'' or ''/var/log/nodes/n*.log'' The ''/var/log/messages'' file also contains host DHCP information that can be important in debugging node booting issues.
+On the headnode, check ''/var/log/messages'' or ''/var/log/nodes/n*.log''. The ''/var/log/messages'' file also contains host DHCP information that can be important in debugging node booting issues.
  
-===Specific Node Issue===
+===Specific Node Issues===
 
-Check the node log file in ''/var/log/nodes/'' This file contains the logs from the node. In the event of an ''rsyslog'' issue check the node directly by logging in to the node and examining the ''/var/log/messages'' file.
+Check the node log file in ''/var/log/nodes/''. This file contains the logs from the node. In the event of an ''rsyslog'' issue, check the node directly by logging in to the node and examining the ''/var/log/messages'' file.
  
 ===Slurm Issues===
 
-Start with the ''/var/log/slurmctld.log''. If there is a specific node issue, check the ''/var/log/slurmd.log'' file and ''grep'' for the node name (e.g. ''grep n2 /var/log/slurmd.log'') If there does not seem to be any information in the headnode ''slurmd.log'' try logging directly on to the node and examining ''/var/logs/slurmd.log'' directly.
+Start with the ''/var/log/slurmctld.log''. If there is a specific node issue, check the ''/var/log/slurmd.log'' file and ''grep'' for the node name (e.g. ''grep n2 /var/log/slurmd.log''). If there does not seem to be any information in the headnode ''slurmd.log'', try logging directly on to the node and examining ''/var/logs/slurmd.log'' directly.
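
A note on the ''grep'' example in the Slurm paragraph: a plain ''grep n2'' also matches lines for nodes such as ''n20'' or ''n21''. The following is a minimal sketch of a whole-word search, assuming GNU ''grep''. The function name ''node_log_lines'' is hypothetical and is not part of the page or of Slurm.

```shell
#!/usr/bin/env bash
# Sketch only: node_log_lines is a hypothetical helper, not part of the wiki page.
# Usage: node_log_lines NODE LOGFILE
node_log_lines() {
    local node="$1" logfile="$2"
    # -w matches whole words only, so "n2" does not also match "n20" or "n21".
    # -F treats the node name as a fixed string rather than a regular expression.
    grep -wF -- "$node" "$logfile"
}
```

For example, ''node_log_lines n2 /var/log/slurmd.log'' should print only the lines that mention ''n2'' as a separate word.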
  
 ====Node Log Management====
 
-ON HPC systems, the nodes are booted and run as RAM disk (see [[warewulf_worker_node_images|Warewulf Worker Node Images]]). Log management on the worker nodes is important so that the RAM disk does not fill up or take too much space during the normal course of operation. The logs on the nodes are managed as follows.
+On HPC systems, the nodes are booted and run as RAM disks (see [[warewulf_worker_node_images|Warewulf Worker Node Images]]). Log management on the worker nodes is important so that the RAM disk does not fill up or take too much space during the normal course of operation. The logs on the nodes are managed as follows.
 
 The ''rsyslog'' and ''logrotate'' packages are used to manage logs.
Line 29:
 ===Log Routing and Message Suppression on Worker Nodes===
 
-The ''rsyslog'' configuration is defined in ''/etc/rsyslog.conf '' on the worker nodes. There are also configuration files in ''/etc/rsyslog.d'' to remove repetitive sshd/login messages that fill up the logs. On the worker nodes, any configuration changes to these files will not survive a reboot unless they are made in the VNFS. (See [[warewulf_worker_node_images|Warewulf Worker Node Images]])
+The ''rsyslog'' configuration is defined in ''/etc/rsyslog.conf'' on the worker nodes. There are also configuration files in ''/etc/rsyslog.d'' to remove repetitive sshd/login messages that fill up the logs. On the worker nodes, any configuration changes to these files will not survive a reboot unless they are made in the VNFS. (See [[warewulf_worker_node_images|Warewulf Worker Node Images]].)
  
-  * Each Node's ''/var/log/boot.log'' information goes to the headnode ''/var/log/messages'' file and to the ''/var/log/nodes/n*'' for the specific node. This setting allows viewing the system log (''tail -f var/log/messages'' to watch all node boot-up information)
+  * Each node's ''/var/log/boot.log'' information goes to the headnode ''/var/log/messages'' file and to the ''/var/log/nodes/n*'' file for the specific node. This setting allows viewing the system log (''tail -f var/log/messages'' to watch all node boot-up information)
-  * All other node log information (with exception of the node ''slurmd'' and ''cron'' logs) goes to the appropriate ''/var/log/nodes/n*'' for each node.
+  * All other node log information (with the exception of the node ''slurmd'' and ''cron'' logs) goes to the appropriate ''/var/log/nodes/n*'' file for each node.
-  * All node ''/var/log/slurmd.log'' files are written locally and a copy is routed and aggregated to ''/var/log/cluster-slurmd.log'' on the headnode. The routing includes the headnode slurmd.log file if it exists. The node ''/var/log/slurmd.log'' files are purged each week. This setting allows all ''slurmd'' messages to be isolated to a single file on the headnode ''grep'' can be used to search on worker node names.
+  * All node ''/var/log/slurmd.log'' files are written locally and a copy is routed and aggregated to ''/var/log/cluster-slurmd.log'' on the headnode. The routing includes the headnode ''slurmd.log'' file if it exists. The node ''/var/log/slurmd.log'' files are purged each week. This setting allows all ''slurmd'' messages to be isolated to a single file on the headnode; ''grep'' can be used to search on worker node names.
-
-  * Each node ''/var/log/cron'' is only written to the node (and purged each night).
+  * Each node's ''/var/log/cron'' is only written to the node (and purged each night).
-  * The files in ''/etc/rsyslog.d'' are used to remove repetitive sshd/login messages on the nodes. These messages occur when the headnode is using ssh to check on conditions on the worker nodes. Each of these ''*.conf'' files contain a simple rule for detecting and ignoring these messages.
+  * The files in ''/etc/rsyslog.d'' are used to remove repetitive sshd/login messages on the nodes. These messages occur when the headnode is using ssh to check on conditions on the worker nodes. Each of these ''*.conf'' files contains a simple rule for detecting and ignoring these messages.
  
 ===Log Rotation===
 
-General Log rotation is determined by ''/etc/logrotate.conf '' and any package configuration files in ''/etc/logrotate.d''. On the worker nodes, any configuration changes to these files will not survive a reboot unless they are made in the VNFS. (See [[warewulf_worker_node_images|Warewulf Worker Node Images]]).
+General log rotation is determined by ''/etc/logrotate.conf'' and any package configuration files in ''/etc/logrotate.d''. On the worker nodes, any configuration changes to these files will not survive a reboot unless they are made in the VNFS. (See [[warewulf_worker_node_images|Warewulf Worker Node Images]].)
  
-  * On the **headnode** all logs (including ''var/log/nodes/n*'') are rotated each week and the past four weeks are kept (and possibly compressed depending on the configuration). These setting can be changed by consulting the ''/etc/{logrotate.conf,logrotate.d}'' files.
+  * On the **headnode**, all logs (including ''var/log/nodes/n*'') are rotated each week, and the past four weeks are kept (and possibly compressed depending on the configuration). These settings can be changed by consulting the ''/etc/{logrotate.conf,logrotate.d}'' files.
-  * On the **worker nodes** all main logs are purged each day because the logs have been sent to the headnode and placed in ''/var/log/nodes/n*'')
+  * On the **worker nodes**, all main logs are purged each day because the logs have been sent to the headnode and placed in ''/var/log/nodes/n*''.
-  * On **worker nodes** the following logs are removed each **week** (''slurmd.log'' goes to the headnode):
+  * On **worker nodes**, the following logs are removed each **week** (''slurmd.log'' goes to the headnode):
     * ''/var/log/nhc/nhc.log''
     * ''/var/log/warewulf/messages''
     * ''/var/log/warewulf/wwgetfiles.log''
     * ''/var/log/slurmd.log''
-  * On **worker nodes** the following logs are **not** removed or rotated (they contain static boot-up information):
+  * On **worker nodes**, the following logs are **not** removed or rotated (they contain static boot-up information):
     * ''/var/log/boot.log''
     * ''/var/log/warewulf/provision/*.log''
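
Because the worker nodes run from a RAM disk, it can be useful to check how much space the logs actually occupy between purges. The following is a minimal sketch, assuming the standard ''du'', ''find'', ''sort'', and ''head'' utilities are present in the node image. The function names ''log_usage_kb'' and ''largest_logs'' are hypothetical and are not part of the page, Warewulf, or ''logrotate''.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and are not taken from the wiki page.

# Print the total size of a log directory in KiB (defaults to /var/log).
log_usage_kb() {
    local dir="${1:-/var/log}"
    du -sk -- "$dir" | awk '{print $1}'
}

# List the N largest files under a log directory, largest first (defaults: /var/log, 5).
largest_logs() {
    local dir="${1:-/var/log}" count="${2:-5}"
    find "$dir" -type f -exec du -k {} + | sort -rn | head -n "$count"
}
```

Running ''log_usage_kb /var/log'' on a worker node gives a single number to compare against the RAM disk size, and ''largest_logs /var/log 3'' shows which files would benefit most from rotation or purging.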
  
hpc_logs_and_log_management.1593792191.txt.gz · Last modified: 2020/07/03 16:03 by deadline