pdsh Parallel Shell

2117

For HPC, one of the fundamentals is being able to run a command across a number of nodes in a cluster. A parallel shell is a simple but powerful tool that allows you to do so on designated (or all) nodes in the cluster, so you do not have to log in to each node and run the same command. This single tool has an infinite number of ways to be useful, but I like to use it when performing administrative tasks, such as:

  • discovering the status of the nodes in the cluster quickly,
  • checking the versions of particular software packages on each node,
  • checking the OS version on all nodes,
  • checking the kernel version on all nodes,
  • searching the system logs on each node (if you do not store them centrally),
  • examining the CPU usage on each node,
  • examining local I/O (if the nodes do any local I/O),
  • checking whether any nodes are swapping,
  • spot-monitoring the compute nodes, and
  • debugging.

This list is just the short version; the real list is extensive. Anything you want to do on a single node can be done on a large number of nodes using a parallel shell tool. However, for those that might be asking if they can use parallel shells on their 50,000-node clusters, the answer is that you can, but the time skew in the results will be large enough that the results might not be useful (which is a completely different subject). Parallel shells are more practical when used on a smaller number of nodes, on specific nodes (e.g., those associated with a specific job in a resource manager), or for gathering information that varies somewhat slowly. However, some techniques will allow you to run parallel commands on a large number of nodes.

Read more at ADMIN