HPC Cluster Monitoring

Ganglia

On HPC systems the popular Ganglia monitoring tool is available. To bring up the Ganglia interface, simply enter the following URL in Firefox (or any other browser installed on the system):

http://localhost/ganglia/

The default screen is shown below.

Clicking on the Limulus OHPC Cluster in the Choose a Source drop-down menu will show the individual nodes in the cluster. The load_one metric (one-minute load) is displayed for the cluster as a whole and for the individual nodes (shown below).

Note that, in addition to a myriad of other metrics, it is possible to observe the CPU temperatures by selecting cpu_temp (as shown below).

More information on using and configuring Ganglia can be found at the Ganglia web site.

Warewulf Top (wwtop)

Warewulf Top (wwtop) is a command line tool for monitoring the state of the cluster, similar to the top command. wwtop is part of the Warewulf cluster provisioning and management system used on Limulus HPC systems. It has been augmented to work directly with Limulus systems: real-time CPU temperatures and frequencies are now reported. To run Warewulf Top, enter:

$ wwtop

The following screen will update in real time for nodes that are active (booted).

Operation of the wwtop interface is described by the command help option shown below.

USAGE: /usr/bin/wwtop [options]
  About:
    wwtop is the Warewulf 'top' like monitor. It shows the nodes ordered by
    the highest utilization, and important statics about each node and
    general summary type data. This is an interactive curses based tool.

  Options:
   -h, --help       Show this banner
   -o, --one_pass   Perform one pass and halt

  Runtime Options:
    Filters (can also be used as command line options):
       i   Display only idle nodes
       d   Display only dead and non 'Ready' nodes
       f   Flush any current filters
    Commands:
       s   Sort by: nodename, CPU, memory, network utilization
       r   Reverse the sort order
       p   Pause the display
       q   Quit
    Views:
       You can use the page up, page down, home and end keys to scroll through
       multiple pages.

  This tool is part of the Warewulf cluster distribution
     http://warewulf.lbl.gov/

 

In addition to the temperature updates, wwtop now offers a new “one pass” option (-o) that writes a single report for all nodes to the screen and then exits. This output is useful for grabbing snapshots of cluster activity. To produce clean text output (no escape sequences), use the following command:

$ wwtop -o|sed "s,\x1B\[[0-9;]*[a-zA-Z],\n,g" |grep .
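
For repeated snapshots it can be convenient to wrap the filter above in a small shell function. The following is a minimal sketch, not part of wwtop: the function name strip_escapes is hypothetical, and the sed expression assumes GNU sed (for the \x1B and \n escapes), as in the command above.

```shell
# Hypothetical helper (not shipped with wwtop): reads text on stdin,
# replaces each ANSI escape sequence with a newline, and drops the
# empty lines that result. Assumes GNU sed for \x1B and \n.
strip_escapes() {
    sed 's,\x1B\[[0-9;]*[a-zA-Z],\n,g' | grep .
}
```

With this function defined, a snapshot could be saved to a time-stamped file with, for example, wwtop -o | strip_escapes > "wwtop-$(date +%Y%m%d-%H%M%S).txt" (the file name pattern is only an illustration).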

A report similar to the following will be written to the screen (or file as directed):

Cluster totals: 4 nodes, 24 cpus, 33 GHz, 77.84 GB mem
Avg:    0% cputil, 923.00 MB memutil, load 0.04, 251 procs, uptime   4 day(s)
High:   0% cputil, 2856.00 MB memutil, load 0.08, 523 procs, uptime  19 day(s)
Low:    0% cputil, 251.00 MB memutil, load 0.00, 148 procs, uptime   0 day(s)
Node status:    4 ready,    0 unavailable,    0 down,    0 unknown
Node name
  CPU  MEM SWAP    Up GHz  Temp    Arch Proc  Load  Net:KB/s Stats/Util
headnode    0%  17%   0%  19.1  12  38.0  x86_64  523  0.08        22 |        |
n0          0%   0%   0%   0.2   4  30.0  x86_64  158  0.04         0 |        |
n1          0%   1%   0%   0.0  13  33.8  x86_64  175  0.04         0 |  IDLE  |
n2          0%   1%   0%   0.0   4  28.0  x86_64  148  0.00         1 |        |
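
Because the cleaned report is plain text, it can be post-processed with standard tools. As a rough sketch, the function below lists nodes whose reported temperature exceeds a given limit. The name hot_nodes is hypothetical, and the field positions (node name in column 1, Temp in column 7) are an assumption based on the sample report above; they should be checked against the output of the installed wwtop version.

```shell
# Hypothetical helper: reads a cleaned wwtop one-pass report on stdin
# and prints "<node> <temp>" for nodes hotter than the limit in $1.
# Assumes the per-node layout of the sample report: columns 2-4 are
# percentages (CPU, MEM, SWAP) and column 7 is the temperature.
hot_nodes() {
    awk -v limit="$1" '
        # Per-node lines are the only ones whose fields 2-4 all end in "%"
        $2 ~ /%$/ && $3 ~ /%$/ && $4 ~ /%$/ && ($7 + 0) > limit {
            print $1, $7
        }'
}
```

For example, piping a saved snapshot through hot_nodes 33 would print only the nodes reporting more than 33 degrees.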

Data Analytics Cluster Monitoring

monitoring_system_resources.1604957979.txt.gz · Last modified: 2020/11/09 21:39 by deadline