This shows you the differences between two versions of the page.
monitoring_system_resources [2020/08/24 16:09] deadline added text for ganglia section |
monitoring_system_resources [2021/04/30 15:33] (current) brandonm [Data Analytics Cluster Monitoring] Punctuation and word fixes; also fix 404 link |
||
---|---|---|---|
Line 4: | Line 4: | ||
====Ganglia==== | ====Ganglia==== | ||
- | On HPC systems the popular [[http:// | + | On HPC systems the popular [[http:// |
< | < | ||
http:// | http:// | ||
</ | </ | ||
- | In Firefox (or any other browser that is installed on the system) | + | In Firefox (or any other browser that is installed on the system). The default screen is shown below. |
{{ : | {{ : | ||
Line 17: | Line 17: | ||
{{ : | {{ : | ||
- | Note that in addition to a myriad of other metrics it is possible to observe the CPU temperatures by selecting '' | + | Note that in addition to a myriad of other metrics it is possible to observe the CPU temperatures by selecting '' |
{{ : | {{ : | ||
- | More information on using and configuration can be found at the [[http:// | + | More information on usage and configuration can be found at the [[http:// |
====Warewulf Top (wwtop)==== | ====Warewulf Top (wwtop)==== | ||
- | Warewulf Top is a command line tool for monitoring the state of the cluster. Similar to the '' | + | Warewulf Top is a command line tool for monitoring the state of the cluster. Similar to the '' |
- | wwtop | + | |
| | ||
- | The following screen will update in real time. | + | The following screen will update in real time for nodes that are active (booted). |
- | + | {{ :wiki:wwtop-with-temps.png?600 |}} | |
- | {{ : | + | |
Operation of the '' | Operation of the '' | ||
Line 45: | Line 44: | ||
Options: | Options: | ||
-h, --help | -h, --help | ||
+ | -o, --one_pass | ||
Runtime Options: | Runtime Options: | ||
Line 62: | Line 62: | ||
This tool is part of the Warewulf cluster distribution | This tool is part of the Warewulf cluster distribution | ||
| | ||
- | </ | + | </ |
+ | |||
+ | In addition to the temperature updates, '' | ||
+ | |||
+ | < | ||
+ | $ wwtop -o|sed " | ||
+ | </ | ||
+ | |||
+ | A report similar to the following will be written to the screen (or file as directed): | ||
+ | |||
+ | < | ||
+ | Cluster totals: 4 nodes, 24 cpus, 33 GHz, 77.84 GB mem | ||
+ | Avg: 0% cputil, 923.00 MB memutil, load 0.04, 251 procs, uptime | ||
+ | High: 0% cputil, 2856.00 MB memutil, load 0.08, 523 procs, uptime | ||
+ | Low: 0% cputil, 251.00 MB memutil, load 0.00, 148 procs, uptime | ||
+ | Node status: | ||
+ | Node name | ||
+ | CPU MEM SWAP Up GHz Temp Arch Proc Load Net:KB/s Stats/ | ||
+ | headnode | ||
+ | n0 0% | ||
+ | n1 0% | ||
+ | n2 0% | ||
+ | |||
+ | </ | ||
+ | |||
+ | | ====Slurm Top (slop)==== | ||
+ | |||
+ | A real-time text-based Slurm " | ||
+ | |||
+ | {{ : | ||
+ | |||
+ | | The above example shows the Slurm batch queue in the top pane with job-ID, partition, user, etc. The bottom pane displays the cluster nodes' metrics, similar to '' | ||
+ | |||
+ | < | ||
+ | Slop (SLurm tOP) displays node statistics and the batch queue on a cluster. | ||
+ | |||
+ | The top window is the batch queue and the bottom window are the hosts. The | ||
+ | windows update automatically and are scrollable with the arrow keys. A " | ||
+ | indicates that the list will scroll further. Available options: | ||
+ | q - to quit userstat | ||
+ | h - to get this help | ||
+ | b - to make the batch window active | ||
+ | n - to make the nodes window active | ||
+ | spacebar - update windows (automatic update after 20 seconds) | ||
+ | up_arrow - to move though the jobs or nodes window | ||
+ | down_arrow - to move though the jobs or nodes window | ||
+ | Pg Up/Down - move a whole page in the jobs or nodes window | ||
+ | |||
+ | Queue Window Commands: | ||
+ | j - sort on job-ID | ||
+ | u - sort on user name a - redisplay all hosts | ||
+ | p - sort on program name | ||
+ | a - redisplay all jobs | ||
+ | d - delete a job from the queue | ||
+ | return - display only the nodes for that job | ||
+ | (When sorting on multiple parameters all matches are displayed.) | ||
+ | |||
+ | Press ' | ||
+ | </ | ||
+ | |||
+ | A useful feature of '' | ||
+ | |||
+ | {{ : | ||
+ | |||
+ | If more specific node information is needed, a standard '' | ||
+ | on a node in the lower Host pane (switch to the host pane by entering " | ||
+ | |||
+ | {{ : | ||
+ | |||
+ | For more information on '' | ||
======Data Analytics Cluster Monitoring====== | ======Data Analytics Cluster Monitoring====== | ||
+ | |||
+ | | Data analytics systems (e.g. Hadoop/ | ||
+ | |||
+ | {{ : |