monitoring_system_resources [2021/01/02 20:55] deadline [Data Analytics Cluster Monitoring] added Ambari
monitoring_system_resources [2021/04/30 15:33] (current) brandonm [Data Analytics Cluster Monitoring] Punctuation and word fixes; also fix 404 link
====Ganglia====

On HPC systems the popular [[http://ganglia.sourceforge.net/|Ganglia monitoring tool]] is available. To bring up the Ganglia interface, simply enter the following URL in Firefox (or any other browser installed on the system):
<code>
http://localhost/ganglia/
</code>

The default screen is shown below.

{{ :wiki:ganglia-main.png?600 |}}
{{ :wiki:ganglia-load1.png?600 |}}

Note that, in addition to a myriad of other metrics, CPU temperatures can be observed by selecting ''cpu_temp'' (as shown below).

{{ :wiki:ganglia-temps.png?600 |}}

More information on usage and configuration can be found at the [[http://ganglia.sourceforge.net/|Ganglia web site]].
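The values plotted by the Ganglia web interface are collected by the ''gmond'' daemon, which by default answers a TCP connection with an XML dump of the current cluster state. The Python sketch below reads that dump and prints the ''cpu_temp'' value for each node. It assumes ''gmond'' is reachable on ''localhost'' at its default port 8649 and that the temperature metric is reported under the name ''cpu_temp''; both may differ on a given system, and the function names are hypothetical.

<code python>
"""Sketch: read per-node cpu_temp values from a Ganglia gmond XML dump.

Assumptions (not from this page): gmond listens on localhost:8649 (its
default TCP port) and the temperature metric is named "cpu_temp".
"""
import socket
import xml.etree.ElementTree as ET

GMOND_HOST = "localhost"  # assumption: gmond runs on the head node
GMOND_PORT = 8649         # assumption: default gmond tcp_accept_channel port


def read_gmond_xml(host: str = GMOND_HOST, port: int = GMOND_PORT) -> bytes:
    """Connect to gmond and read until it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=5) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the socket after sending the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def metric_by_host(xml_data: bytes, metric: str = "cpu_temp") -> dict:
    """Return {hostname: value} for one metric found in a gmond XML dump."""
    root = ET.fromstring(xml_data)
    values = {}
    for host in root.iter("HOST"):
        for m in host.iter("METRIC"):
            if m.get("NAME") == metric:
                values[host.get("NAME")] = m.get("VAL")
    return values


def main() -> None:
    try:
        xml_data = read_gmond_xml()
        temps = metric_by_host(xml_data)
    except (OSError, ET.ParseError) as exc:  # gmond not reachable or bad reply
        print(f"could not read gmond at {GMOND_HOST}:{GMOND_PORT}: {exc}")
        return
    for node, temp in sorted(temps.items()):
        print(f"{node}: {temp}")


if __name__ == "__main__":
    main()
</code>

Passing a different metric name, for example ''metric_by_host(xml_data, "load_one")'', reuses the same parser for any other metric shown in the web interface.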
====Warewulf Top (wwtop)====

Warewulf Top (''wwtop'') is a command line tool, similar to ''top'', for monitoring the state of the cluster. It is part of the Warewulf cluster provisioning and management system used on Limulus HPC systems and has been augmented to work directly with Limulus systems: real-time CPU temperatures and frequencies are now reported. To run Warewulf Top, enter:

<code>
$ wwtop
This tool is part of the Warewulf cluster distribution
http://warewulf.lbl.gov/
</code>

In addition to the temperature updates, ''wwtop'' now offers a new "one pass" option where a single report for all nodes is sent to the screen. This output is useful for grabbing snapshots of cluster activity. To produce clean text output (no escape sequences), use the following command:
====Slurm Top (slop)====

A real-time, text-based Slurm "top-like" tool called ''slop'' (SLurm tOP) is provided on all Limulus HPC systems. ''slop'' displays the batch queue and node status (similar to ''wwtop'' above) in a text terminal. The screen updates every 20 seconds, but it can be refreshed at any time by pressing the space bar. An example of the ''slop'' interface is shown below.

{{ :wiki:slop-main.png?600 |}}

The above example shows the Slurm batch queue in the top pane with job ID, partition, user, etc. The bottom pane displays the cluster nodes' metrics, similar to ''wwtop''. Pressing the ''h'' key brings up the help menu (as shown below). Note: additional help with Slurm job states is also accessible by pressing the ''S'' key in the help menu.

<code>
</code>

A useful feature of ''slop'' is the ability to "drill down" into job resources. For instance, if the cursor is placed on a job and "Return" is pressed, only the nodes used by that particular job are displayed. In the following image, the node used for job 41622 is displayed in the lower pane.

{{ :wiki:slop-node.png?600 |}}
{{ :wiki:slop-top.png?600 |}}

For more information on ''slop'', consult the man page (''man slop'').
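The queue listing and per-job "drill down" that ''slop'' provides can also be approximated in a script with the standard Slurm commands ''squeue'' and ''scontrol show hostnames''. The Python sketch below is not how ''slop'' itself is implemented; it simply lists running jobs and expands the node list of each one. The function names are hypothetical, and it assumes the Slurm client commands are on the ''PATH''.

<code python>
"""Sketch: list running Slurm jobs and the nodes each one uses.

Uses only standard Slurm client commands (squeue, scontrol). The function
names are hypothetical; this is not the slop implementation.
"""
import shutil
import subprocess

# squeue format: %i=job ID, %P=partition, %u=user, %N=node list.
# "|" separates fields because node lists may contain commas.
SQUEUE_FORMAT = "%i|%P|%u|%N"


def parse_squeue(text: str) -> list:
    """Turn 'id|partition|user|nodelist' lines into dictionaries."""
    jobs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, partition, user, nodelist = line.split("|", 3)
        jobs.append({"id": job_id, "partition": partition,
                     "user": user, "nodelist": nodelist})
    return jobs


def running_jobs() -> list:
    """Ask squeue (without a header line) for jobs in the RUNNING state."""
    out = subprocess.run(
        ["squeue", "-h", "-t", "RUNNING", "-o", SQUEUE_FORMAT],
        capture_output=True, text=True, check=True).stdout
    return parse_squeue(out)


def expand_nodelist(nodelist: str) -> list:
    """Expand a compressed node list such as n[0-2] into host names."""
    out = subprocess.run(["scontrol", "show", "hostnames", nodelist],
                         capture_output=True, text=True, check=True).stdout
    return out.split()


def main() -> None:
    if shutil.which("squeue") is None:
        print("squeue not found; run this on a Slurm head node")
        return
    for job in running_jobs():
        nodes = expand_nodelist(job["nodelist"])
        print(f"{job['id']} {job['partition']} {job['user']}: {' '.join(nodes)}")


if __name__ == "__main__":
    main()
</code>

Letting ''scontrol'' expand the node list avoids re-implementing Slurm's bracket notation (for example ''n[1-2]'') by hand.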
======Data Analytics Cluster Monitoring======

Data analytics systems (e.g., Hadoop, Spark, and Kafka) are managed by [[https://ambari.apache.org/|Apache Ambari]]. Ambari is a web-based management tool designed to make managing Hadoop clusters simpler, covering provisioning, management, and monitoring of Apache Hadoop clusters (which often include tools such as Spark and Kafka). Ambari provides an intuitive, easy-to-use Hadoop management web UI. Basic Ambari usage is described in the [[Using the Apache Ambari Cluster Manager|Using the Apache Ambari Cluster Manager]] section. An example of the Ambari dashboard is shown below.

{{ :wiki:ambari-control-panel.png?600 |}}
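Besides the web UI, Ambari exposes the same monitoring information through a REST API under ''/api/v1''. The Python sketch below lists each service in a cluster together with its state. It assumes the Ambari server is reachable at ''http://localhost:8080'' (the usual default port); the cluster name ''mycluster'', the environment variables ''AMBARI_USER'' and ''AMBARI_PASSWORD'', and the function names are hypothetical placeholders to adapt to the local installation.

<code python>
"""Sketch: query service states from the Ambari REST API.

Assumptions: Ambari server at localhost:8080; the cluster name and the
AMBARI_USER / AMBARI_PASSWORD environment variables are placeholders.
"""
import base64
import json
import os
import urllib.request

AMBARI_URL = "http://localhost:8080"  # assumption: default Ambari port
CLUSTER = "mycluster"                 # hypothetical cluster name


def service_states(payload: dict) -> dict:
    """Map service name -> state from a services?fields=ServiceInfo/state reply."""
    return {item["ServiceInfo"]["service_name"]: item["ServiceInfo"]["state"]
            for item in payload.get("items", [])}


def fetch_services(user: str, password: str) -> dict:
    """GET the service list of CLUSTER using HTTP basic authentication."""
    url = f"{AMBARI_URL}/api/v1/clusters/{CLUSTER}/services?fields=ServiceInfo/state"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def main() -> None:
    user = os.environ.get("AMBARI_USER")          # hypothetical variable name
    password = os.environ.get("AMBARI_PASSWORD")  # hypothetical variable name
    if not user or not password:
        print("set AMBARI_USER and AMBARI_PASSWORD first")
        return
    try:
        payload = fetch_services(user, password)
    except OSError as exc:  # URLError/HTTPError are OSError subclasses
        print(f"could not query Ambari: {exc}")
        return
    for name, state in sorted(service_states(payload).items()):
        print(f"{name:20s} {state}")


if __name__ == "__main__":
    main()
</code>

After setting the two variables and editing ''CLUSTER'', the script prints one line per service (for example HDFS or YARN) with a state such as ''STARTED'' or ''INSTALLED'' (installed but stopped).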