This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
|
monitoring_system_resources [2020/06/09 17:50] deadline created |
monitoring_system_resources [2021/04/30 15:33] (current) brandonm [Data Analytics Cluster Monitoring] Punctuation and word fixes; also fix 404 link |
||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | =====HPC Cluster Monitoring===== | + | ======HPC Cluster Monitoring====== |
| - | =====Data Analytics Cluster Monitoring===== | + | |
| + | ====Ganglia==== | ||
| + | |||
| + | On HPC systems the popular [[http:// | ||
| + | < | ||
| + | http:// | ||
| + | </ | ||
| + | |||
| + | Open this address in Firefox (or any other browser installed on the system). The default screen is shown below. | ||
| + | |||
| + | {{ : | ||
| + | |||
| + | Clicking on the '' | ||
| + | |||
| + | {{ : | ||
| + | |||
| + | Note that, in addition to a myriad of other metrics, it is possible to observe the CPU temperatures by selecting '' | ||
| + | |||
| + | {{ : | ||
| + | |||
| + | More information on usage and configuration can be found at the [[http:// | ||
| + | |||
| + | ====Warewulf Top (wwtop)==== | ||
| + | |||
| + | Warewulf Top is a command-line tool for monitoring the state of the cluster. Similar to the '' | ||
| + | |||
| + | $ wwtop | ||
| + | |||
| + | The following screen will update in real time for nodes that are active (booted). | ||
| + | |||
| + | {{ : | ||
| + | |||
| + | Operation of the '' | ||
| + | |||
| + | < | ||
| + | USAGE: / | ||
| + | About: | ||
| + | wwtop is the Warewulf ' | ||
| + | the highest utilization, | ||
| + | general summary type data. This is an interactive curses based tool. | ||
| + | |||
| + | Options: | ||
| + | -h, --help | ||
| + | -o, --one_pass | ||
| + | |||
| + | Runtime Options: | ||
| + | Filters (can also be used as command line options): | ||
| + | | ||
| + | | ||
| + | | ||
| + | Commands: | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | Views: | ||
| + | You can use the page up, page down, home and end keys to scroll through | ||
| + | | ||
| + | |||
| + | This tool is part of the Warewulf cluster distribution | ||
| + | | ||
| + | </ | ||
| + | |||
| + | In addition to the temperature updates, '' | ||
| + | |||
| + | < | ||
| + | $ wwtop -o|sed " | ||
| + | </ | ||
| + | |||
| + | A report similar to the following will be written to the screen (or to a file, if the output is redirected): | ||
| + | |||
| + | < | ||
| + | Cluster totals: 4 nodes, 24 cpus, 33 GHz, 77.84 GB mem | ||
| + | Avg: 0% cputil, 923.00 MB memutil, load 0.04, 251 procs, uptime | ||
| + | High: 0% cputil, 2856.00 MB memutil, load 0.08, 523 procs, uptime | ||
| + | Low: 0% cputil, 251.00 MB memutil, load 0.00, 148 procs, uptime | ||
| + | Node status: | ||
| + | Node name | ||
| + | CPU MEM SWAP Up GHz Temp Arch Proc Load Net:KB/s Stats/ | ||
| + | headnode | ||
| + | n0 0% | ||
| + | n1 0% | ||
| + | n2 0% | ||
| + | |||
| + | </ | ||
| + | |||
| + | ====Slurm Top (slop)==== | ||
| + | |||
| + | A real-time text-based Slurm " | ||
| + | |||
| + | {{ : | ||
| + | |||
| + | The above example shows the Slurm batch queue in the top pane with job-ID, partition, user, etc. The bottom pane displays the cluster nodes' metrics, similar to '' | ||
| + | |||
| + | < | ||
| + | Slop (SLurm tOP) displays node statistics and the batch queue on a cluster. | ||
| + | |||
| + | The top window is the batch queue and the bottom window are the hosts. The | ||
| + | windows update automatically and are scrollable with the arrow keys. A " | ||
| + | indicates that the list will scroll further. Available options: | ||
| + | q - to quit userstat | ||
| + | h - to get this help | ||
| + | b - to make the batch window active | ||
| + | n - to make the nodes window active | ||
| + | spacebar - update windows (automatic update after 20 seconds) | ||
| + | up_arrow - to move though the jobs or nodes window | ||
| + | down_arrow - to move though the jobs or nodes window | ||
| + | Pg Up/Down - move a whole page in the jobs or nodes window | ||
| + | |||
| + | Queue Window Commands: | ||
| + | j - sort on job-ID | ||
| + | u - sort on user name a - redisplay all hosts | ||
| + | p - sort on program name | ||
| + | a - redisplay all jobs | ||
| + | d - delete a job from the queue | ||
| + | return - display only the nodes for that job | ||
| + | (When sorting on multiple parameters all matches are displayed.) | ||
| + | |||
| + | Press ' | ||
| + | </ | ||
| + | |||
| + | A useful feature of '' | ||
| + | |||
| + | {{ : | ||
| + | |||
| + | If more specific node information is needed, a standard '' | ||
| + | on a node in the lower Host pane (switch to the host pane by entering " | ||
| + | |||
| + | {{ : | ||
| + | |||
| + | For more information on '' | ||
| + | |||
| + | ======Data Analytics Cluster Monitoring====== | ||
| + | |||
| + | Data analytics systems (i.e. Hadoop/ | ||
| + | |||
| + | {{ : | ||
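
Returning to the one-pass ''wwtop -o'' report shown in the Warewulf Top section: because that report is plain text, its summary lines can be picked apart with standard tools. The following is only a minimal sketch. It assumes the report has the same layout as the sample above, and it uses the sample text itself (stored in a hypothetical ''report'' variable) rather than live output, which may carry additional formatting. The variable names ''nodes'', ''cpus'', and ''high_load'' are illustrative and not part of ''wwtop''.

```shell
#!/bin/sh
# Sample summary lines copied from the one-pass report shown earlier.
# In practice this text would come from saved `wwtop -o` output; the
# variable name "report" is a hypothetical placeholder.
report='Cluster totals: 4 nodes, 24 cpus, 33 GHz, 77.84 GB mem
Avg: 0% cputil, 923.00 MB memutil, load 0.04, 251 procs, uptime
High: 0% cputil, 2856.00 MB memutil, load 0.08, 523 procs, uptime
Low: 0% cputil, 251.00 MB memutil, load 0.00, 148 procs, uptime'

# Node and CPU counts are the 3rd and 5th fields of the "Cluster totals:" line.
totals=$(printf '%s\n' "$report" | awk '/^Cluster totals:/ {print $3, $5}')
nodes=${totals% *}   # text before the space
cpus=${totals#* }    # text after the space

# The highest load is the field following the word "load" on the "High:" line;
# the trailing comma is stripped before printing.
high_load=$(printf '%s\n' "$report" | awk '/^High:/ {
    for (i = 1; i < NF; i++)
        if ($i == "load") { sub(/,$/, "", $(i + 1)); print $(i + 1) }
}')

echo "nodes=$nodes cpus=$cpus high_load=$high_load"
```

Run against the sample text, this prints ''nodes=4 cpus=24 high_load=0.08''. Matching on the line labels (''Cluster totals:'', ''High:'') rather than on line numbers keeps the sketch tolerant of extra lines before or after the summary.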