The popular Ganglia monitoring tool is available on HPC systems. To bring up the Ganglia interface, simply enter:
http://localhost/ganglia/
in Firefox (or any other browser installed on the system). The default screen is shown below.
Clicking on the Limulus OHPC Cluster entry in the Choose a Source drop-down menu will show the individual nodes in the cluster. The load_one metric (one-minute load) is displayed in total and for the individual nodes (shown below).
Note that, in addition to a myriad of other metrics, it is possible to observe the CPU temperatures by selecting cpu_temp (as shown below).
More information on using and configuring Ganglia can be found at the Ganglia web site.
Warewulf Top (wwtop) is a command-line tool for monitoring the state of the cluster. It is similar to the top command and is part of the Warewulf cluster provisioning and management system used on Limulus HPC systems. wwtop has been augmented to work directly with Limulus systems: real-time CPU temperatures and frequencies are now reported. To run Warewulf Top, enter:
$ wwtop
The following screen will update in real time for nodes that are active (booted).
Operation of the wwtop interface is described by the command help option shown below.
USAGE: /usr/bin/wwtop [options]

About:
    wwtop is the Warewulf 'top' like monitor. It shows the nodes ordered by the
    highest utilization, and important statics about each node and general
    summary type data. This is an interactive curses based tool.

Options:
    -h, --help       Show this banner
    -o, --one_pass   Perform one pass and halt

Runtime Options:
  Filters (can also be used as command line options):
    i   Display only idle nodes
    d   Display only dead and non 'Ready' nodes
    f   Flush any current filters
  Commands:
    s   Sort by: nodename, CPU, memory, network utilization
    r   Reverse the sort order
    p   Pause the display
    q   Quit
  Views:
    You can use the page up, page down, home and end keys to scroll through
    multiple pages.

This tool is part of the Warewulf cluster distribution
    http://warewulf.lbl.gov/
In addition to the temperature updates, wwtop now offers a new “one pass” option (-o) that sends a single report for all nodes to the screen and then exits. This output is useful for grabbing snapshots of cluster activity. To produce clean text output (no escape sequences), use the following command:
$ wwtop -o|sed "s,\x1B\[[0-9;]*[a-zA-Z],\n,g" |grep .
A report similar to the following will be written to the screen (or file as directed):
Cluster totals: 4 nodes, 24 cpus, 33 GHz, 77.84 GB mem
Avg:  0% cputil, 923.00 MB memutil, load 0.04, 251 procs, uptime 4 day(s)
High: 0% cputil, 2856.00 MB memutil, load 0.08, 523 procs, uptime 19 day(s)
Low:  0% cputil, 251.00 MB memutil, load 0.00, 148 procs, uptime 0 day(s)
Node status: 4 ready, 0 unavailable, 0 down, 0 unknown

Node name    CPU  MEM SWAP    Up  GHz  Temp    Arch  Proc  Load  Net:KB/s  Stats/Util
headnode      0%  17%   0%  19.1   12  38.0  x86_64   523  0.08        22  |      |
n0            0%   0%   0%   0.2    4  30.0  x86_64   158  0.04         0  |      |
n1            0%   1%   0%   0.0   13  33.8  x86_64   175  0.04         0  | IDLE |
n2            0%   1%   0%   0.0    4  28.0  x86_64   148  0.00         1  |      |
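Because the one-pass report is plain text once the escape sequences are removed, it can be post-processed with standard tools. The following is a minimal sketch, not part of wwtop: the function names strip_ansi and hot_nodes are hypothetical, and the filter assumes the cleaned report has one row per node after the "Node name" header, with the Temp value in the seventh field as in the sample above. The sed expression is the same one used in the command above and relies on GNU sed extensions (\x1B and \n in the replacement).

```shell
#!/bin/sh
# Sketch only: strip_ansi and hot_nodes are hypothetical helpers, not wwtop features.

# Remove ANSI escape sequences from stdin and drop the blank lines left behind.
# Same sed/grep filter as the one-pass command shown earlier (requires GNU sed).
strip_ansi() {
    sed 's,\x1B\[[0-9;]*[a-zA-Z],\n,g' | grep .
}

# Print "node temp" for each node row whose Temp column is >= the given limit.
# Assumption: rows follow the "Node name" header and Temp is the 7th field.
hot_nodes() {
    awk -v limit="${1:-0}" '
        /^[[:space:]]*Node name/ { in_table = 1; next }
        in_table && NF >= 7 && ($7 + 0) >= (limit + 0) { print $1, $7 }
    '
}
```

With these functions defined, a command such as wwtop -o | strip_ansi | hot_nodes 35 would list only the nodes reporting 35 degrees or more, assuming the report layout matches the sample. Adjust the field number if the column order on your system differs.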