Limulus HPC Systems Quick Start

The following is basic information about your Limulus system. Please consult the other sections in this manual for further details. A general description of the systems can be found by consulting the Functional Diagram.

Adding Users:

There are two scripts for adding and deleting users:

AddUser
DelUser  

There is also a GUI tool for adding and deleting users. The GUI can be found in the root Applications/System Tools menu or started from the command line using:

$ AddUser -g
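
For example, a minimal command-line sketch (the username shown, and the assumption that these scripts take a username as their argument, are for illustration only; consult Adding Users for the exact usage):

$ AddUser jdoe
$ DelUser jdoe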

See Adding Users.

The Case:

The case has two doors (one on each side) and a removable front bezel. The front bezel can be “popped” off by opening one (or both) of the doors and pulling the bezel forward. Behind the bezel you will find the three node blades. The blades are placed behind the bezel to provide protection and allow limited access (under normal conditions you will not need to access the blades). There are also six removable (not hot-swap) 3.5-inch drive bays across the top on the inside. Depending on the model, there may also be user-accessible removable SSD drives (also not hot-swap). The local LAN and video ports are labeled on the back. (Note: On AMD systems, the onboard video port is non-functional.)

Login Node (Main Node):

The login node is named limulus (alias headnode). A unique root password will be provided. The login node is the “outward-facing node.” It occupies the traditional motherboard location in the case. Motherboard ports (video, LAN, USB, etc.) are available on the back and front of the case. See Logging In.
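
For example, to log in from a workstation on the same LAN (the hostname and username below are placeholders; use the LAN address assigned to your login node and your own account):

$ ssh myuser@limulus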

Worker Nodes:

The worker nodes are on blades that sit behind the bezel. Blades are removed by turning off the blade power (see below), removing the cables, unscrewing the captive bolts (top and bottom), and pulling the blade out using the captive bolts. The blade position is unique (the same blade must always be placed in the same slot, otherwise the nodes will need to be re-registered). The blades have a number on the small power supply on the back of the board. Node names begin with “n0” and continue up to the total number of workers in the system; typically this is n0, n1, n2. Looking at the front of the system with the bezel off, node n0 is in the rightmost position, node n1 is in the middle, and node n2 is in the leftmost position and can be seen through the door.

Networking:

All systems have a 1 GbE network. All administrative daemons and shared file systems (NFS) are assigned to this network. The login node acts as a gateway node (i.e. it has an external LAN connection and an internal cluster network connection). The LAN port (as labeled on the case) is set to use DHCP. The internal cluster 1 GbE network uses 10.0.0.0/24 addresses (the login node is 10.0.0.1; the nodes, such as n0, start at 10.0.0.10 and are sequential). The login node is connected to this network by the short blue Ethernet cable on the back of the unit. This cable connects the login node to the internal 1 GbE switch.

The internal Ethernet switch (as seen inside the case above the power supply) connects to the internal cluster network. In addition to the blue cable connected to the login node, one additional external cluster Ethernet port is available on the back of the unit for expansion (i.e. a cluster NAS or more nodes).

If you have the 10/25 GbE option, nodes can be addressed using the names “limulusg, n0g, n1g, n2g” (i.e. addressing nodes with these names will use the 10/25 GbE network instead of the 1 GbE network). The IP addresses mirror the 1 GbE addresses but use the 10.1.0.0 network. Typically, these names are used by MPI applications. If installed, Hadoop systems are configured to use the 10/25 GbE network.
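
As an illustration of the addressing scheme described above (a sketch for a three-worker system; the limulusg address is assumed to mirror the login node, and the actual entries are generated during installation):

  limulus  10.0.0.1    limulusg  10.1.0.1
  n0       10.0.0.10   n0g       10.1.0.10
  n1       10.0.0.11   n1g       10.1.0.11
  n2       10.0.0.12   n2g       10.1.0.12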

The 10/25 GbE network is “switched” by using a 4-port 10/25 GbE card plugged into the login node. Like the 1 GbE network, this card has an open port for expansion. Note that the three ports currently in use on the 10/25 GbE card must remain dedicated to the cluster network; the cables do not need to be inserted in a particular order, but the same three ports must be used. On the large eight-motherboard systems, two of these 4-port cards are bridged to provide the full 10/25 GbE network throughout the cluster. See Internal Networks.

Power Control:

Power to the three worker nodes is controlled by the login node. There is a set of relays underneath the 1 GbE switch. These relays can be controlled directly by using the relayset program (n0 is connected to relay-2, n1 is connected to relay-3, and n2 is connected to relay-4). However, controlling the nodes with relayset directly is not recommended. We highly recommend using the node-poweron and node-poweroff utilities to control the node power. (Using these commands, as root, with no arguments will power on/off all nodes. An individual node can be supplied as an argument.) There is also a GUI tool for controlling node power. See Powering Up/Down Nodes.

!!On HPC Limulus systems: All nodes are in the powered-down state when the system boots!!

To turn the nodes on, enter node-poweron. To turn the nodes off, enter node-poweroff (the OS on the nodes is shut down gracefully). Use the “-h” option to get a full explanation.
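
For example, as root (the node name shown is only an illustration):

# node-poweron           (power on all worker nodes)
# node-poweroff n1       (gracefully shut down and power off node n1 only)
# node-poweron -h        (display the full usage explanation)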

SSH Access:

The root user has ssh access to all nodes, e.g. ssh n0. Users do not have ssh access to the worker nodes; these nodes are used through the Slurm Workload Manager. See Submitting Jobs to the Slurm Workload Manager.
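
For example, root can run a command directly on a worker node, while regular users launch work through Slurm (the srun options below are standard Slurm usage, shown only as a sketch):

# ssh n0 uptime
$ srun -N 3 hostname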

File System:

The file system configuration can vary depending on installation options. In general, there is a solid-state NVMe drive on the head node that stores the OS. The /opt and /home partitions are NFS-mounted on all nodes. If a RAID option has been included, the RAID drives will be installed in the top drive area in the case and mounted on the head node. Worker nodes use the diskless Warewulf VNFS filesystem.
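
A quick way to confirm the shared file systems are mounted on a worker node is to check from the login node (a sketch using standard Linux tools; run as root because regular users cannot ssh to the nodes):

# ssh n0 df -h /home /opt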

See RAID Storage Management.

Software:

The system has been installed with CentOS 7. Specific Limulus and OpenHPC software has been installed as RPM packages. The yum utility can be used to install additional software on the login node. See OpenHPC Components.
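
For example, to search for and install an additional package on the login node (the package name is a placeholder):

# yum search <keyword>
# yum install <package-name>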

The worker nodes are provisioned by the Warewulf Cluster Toolkit. Adding and removing software from Warewulf provisioned nodes requires some additional steps. See Warewulf Worker Node Images.

Adaptive Cooling and Filters:

The nodes have adaptive cooling. There are two intake fans underneath the worker nodes pushing air up into the nodes from underneath the case. The speed of the intake fans will increase or decrease depending on the node processor temperatures. Note that all intake locations on the case have magnetic dust filters (including the bottom of the case). In dusty environments, these filters will become dirty and can adversely affect the ability of the fans to move air through the case. It is recommended that you check and clean these filters regularly. They can be easily cleaned with water. See Adaptive Cooling.

Monitoring:

There are two basic ways to monitor the cluster as a whole. The first is the web-based Ganglia interface. (Open a browser and point to http://localhost/ganglia.) Please note, when the nodes start it may take 5-10 minutes before they are fully reporting to the Ganglia host interface. See Monitoring System Resources.

There is also a command line utility called wwtop. This utility provides a real-time “top”-like interface for the entire cluster.
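
To start the command-line monitor, run wwtop from the login node:

$ wwtop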

Additional Information:

See the rest of the manual for additional topics and information.
