Moab / Torque Primer Part 2: Resources
The main purpose of Torque and Moab is to manage, monitor, and schedule resources. At a very high level, the most obvious resources are compute nodes. Each compute node, however, contributes finer-grained resources to the resource pool, and resource monitors such as Torque typically track processors, main memory, storage, and so on. In addition to these common resource types, nodes may also be associated with more specific ones, such as network bandwidth, software licenses, or power consumption. For a general understanding, however, it is sufficient to consider compute nodes and processors as resources.
Torque is responsible for monitoring the resources provided by compute
nodes. Each compute node runs a small daemon called pbs_mom, which collects
all relevant resource information on the compute node it is running on and
sends this information to Torque’s pbs_server, which typically runs on the
head node of the cluster. The pbs_server then stores this information in an
internal database, which schedulers such as Moab can query to get real-time
information about the availability of resources.
Torque provides access to its list of available resources through the
pbsnodes command. Without any command-line arguments, the command lists all
nodes that Torque knows about and the resources it monitors for each node:
$ pbsnodes
compute-0-0.local
state = job-exclusive
np = 8
ntype = cluster
jobs = 0/45.cluster.local, 1/46.cluster.local, 2/47.cluster.local, 3/48.cluster.local, 4/49.cluster.local, 5/50.cluster.local, 6/51.cluster.local, 7/52.cluster.local
status = opsys=linux,uname=Linux compute-0-0.local 2.6.9-42.ELsmp #1 SMP Tue Aug 15 10:35:26 BST 2006 x86_64,sessions=9878 9881 9882 9883 9889 9918 9924 9927,nsessions=8,nusers=1,idletime=16682405,totmem=12258160kb,availmem=10771408kb,physmem=8161596kb,ncpus=8,loadave=9.43,netload=1074777221096,state=free,jobs=31.cluster.local 46.cluster.local 45.cluster.local 47.cluster.local 49.cluster.local 48.cluster.local 50.cluster.local 51.cluster.local 52.cluster.local,varattr=,rectime=1250282323
...
compute-0-3.local
state = free
np = 8
ntype = cluster
status = opsys=linux,uname=Linux compute-0-3.local 2.6.9-42.ELsmp #1 SMP Tue Aug 15 10:35:26 BST 2006 x86_64,sessions=? 0,nsessions=? 0,nusers=0,idletime=27314517,totmem=12258140kb,availmem=11926516kb,physmem=8161576kb,ncpus=8,loadave=0.00,netload=4289316359815,state=free,jobs=972217.cluster.local,varattr=,rectime=1250282329
For each node, pbsnodes lists the node’s state, the number of processors
provided by the node, and other attributes such as available memory or the
jobs currently executing on the node. In most cases the default pbsnodes
output is too verbose, and a number of command-line switches help filter
the information:
pbsnodes -l -n lists all nodes marked with the states offline, down, or unknown, together with any user-defined comments associated with these nodes:
$ pbsnodes -l -n
compute-0-0.local offline reimaging
compute-0-5.local offline blinking HDD
compute-0-7.local down,offline replacing CPU
pbsnodes -l free lists all nodes that are free, i.e. that can potentially run jobs:
$ pbsnodes -l free
compute-0-1.local free
compute-0-2.local free
compute-0-3.local free
compute-0-4.local free
compute-0-6.local free
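When the node list needs to feed a script rather than a terminal, the listing can be split into its columns. The following is a minimal Python sketch, assuming the output looks like the listing above: a node name, a comma-separated state list, and an optional free-text note, separated by whitespace. The function name parse_node_list is hypothetical, and the sample text is copied from the example rather than read from a live cluster.

```python
def parse_node_list(text):
    """Parse `pbsnodes -l -n` style lines into (node, states, note) tuples."""
    entries = []
    for line in text.splitlines():
        # At most three columns: node name, state list, free-text note.
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        node = parts[0]
        states = parts[1].split(",")
        note = parts[2].strip() if len(parts) == 3 else ""
        entries.append((node, states, note))
    return entries


# Sample copied from the listing above (assumed format, not live output).
sample = """\
compute-0-0.local offline reimaging
compute-0-5.local offline blinking HDD
compute-0-7.local down,offline replacing CPU
"""

for node, states, note in parse_node_list(sample):
    print(f"{node}: {'+'.join(states)} ({note})")
```

On a real cluster the text would come from running the command itself, and the column layout should be checked against the installed Torque version before relying on it.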
In addition to reporting node states, pbsnodes may be used to set a node
state manually. This is useful for maintenance or for debugging hardware
or software problems. pbsnodes -o [nodename] marks a node as offline. Jobs
that are currently running on this node are allowed to complete, but no
new jobs will be allocated to the node. In Moab terminology, a node that is
marked as offline is considered to be drained. When putting a node into the
offline state, a comment or note should always be provided so that the
reason for taking the node offline can be easily determined later. The
user-defined comment can be set with the -N command-line option:
$ pbsnodes -o -N "installing new hard-drive" compute-0-5.local
$ pbsnodes -l -n
compute-0-5.local offline installing new hard-drive
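Because the note is easy to forget, a site might wrap the offline step in a small helper that refuses to run without one. The sketch below is only an illustration: the function name offline_command is hypothetical, and it merely builds the argument list for the pbsnodes -o -N invocation shown above without executing anything.

```python
import shlex


def offline_command(node, note):
    """Build the argv for marking a node offline with a mandatory note."""
    if not note.strip():
        raise ValueError("a note explaining the reason is required")
    # Same flags as in the example above: -o marks offline, -N sets the note.
    return ["pbsnodes", "-o", "-N", note, node]


cmd = offline_command("compute-0-5.local", "installing new hard-drive")
print(shlex.join(cmd))  # shell-quoted form, for logging or review
```

On a host where pbsnodes is installed, the list could be passed to subprocess.run(cmd, check=True). Keeping the command as a list avoids quoting problems with notes that contain spaces.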
Once a node can be placed back into the pool of available resources its
offline state needs to be cleared. This is done by using the
-c option to pbsnodes. At the same time the
comment can be removed from the node using -N "":
$ pbsnodes -c -N "" compute-0-5.local
$ pbsnodes -l -n free
...
compute-0-5.local free
...
$ pbsnodes compute-0-5.local
compute-0-5.local
state = free
np = 8
ntype = cluster
status = opsys=linux,uname=Linux compute-0-5.local 2.6.9-42.ELsmp #1 SMP Tue Aug 15 10:35:26 BST 2006 x86_64,sessions=18901,nsessions=1,nusers=1,idletime=36154599,totmem=12258140kb,availmem=11888652kb,physmem=8161576kb,ncpus=8,loadave=0.01,netload=3655128653370,state=free,varattr=,rectime=1250283674
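The status attribute in listings like the one above is a comma-separated series of key=value pairs, which makes it straightforward to pick out individual values in a script. The following Python sketch assumes that no value contains a comma, which holds for the sample but is not guaranteed in general. The helper names parse_status and kb_to_gib are hypothetical.

```python
def parse_status(status):
    """Split a pbsnodes status string into a dict of key -> value strings."""
    fields = {}
    for item in status.split(","):
        key, sep, value = item.partition("=")
        if sep:  # ignore fragments that contain no '='
            fields[key.strip()] = value
    return fields


def kb_to_gib(value):
    """Convert a value such as '8161576kb' to GiB."""
    if not value.endswith("kb"):
        raise ValueError(f"unexpected unit in {value!r}")
    return int(value[:-2]) / (1024 * 1024)


# Status string copied from the compute-0-5.local listing above.
status = (
    "opsys=linux,uname=Linux compute-0-5.local 2.6.9-42.ELsmp #1 SMP "
    "Tue Aug 15 10:35:26 BST 2006 x86_64,sessions=18901,nsessions=1,nusers=1,"
    "idletime=36154599,totmem=12258140kb,availmem=11888652kb,physmem=8161576kb,"
    "ncpus=8,loadave=0.01,netload=3655128653370,state=free,varattr=,"
    "rectime=1250283674"
)

fields = parse_status(status)
print("cpus:", fields["ncpus"])
print("load:", fields["loadave"])
print("physical memory (GiB):", round(kb_to_gib(fields["physmem"]), 2))
```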
In addition to Torque’s tools for managing nodes, Moab provides its own
tools that show how Moab views the nodes and the resources they provide.
The main one is the checknode command.
The checknode command is only used to report information about
a node and cannot be used to modify any of the node’s parameters. A typical
command output for a busy node looks like:
$ checknode compute-0-3.local
node compute-0-3.local
State: Busy (in current state for 00:32:26)
Configured Resources: PROCS: 8 MEM: 7970M SWAP: 11G DISK: 1M
Utilized Resources: PROCS: 8 SWAP: 1189M
Dedicated Resources: PROCS: 8
MTBF(longterm): INFINITY MTBF(24h): INFINITY
Vars: RACK,SLOT,Rack
Opsys: linux Arch: ---
Speed: 1.00 CPULoad: 8.080
Network Load: 10.35 kB/s
Flags: rmdetected
Network: DEFAULT
Classes: [queue1 8:8][queue2 0:8][queue3 8:8]
RM[cluster] TYPE=PBS STATE=Busy
EffNodeAccessPolicy: SHARED
Total Time: 78:12:29:09 Up: 78:12:04:28 (99.98%) Active: 13:20:15:23 (17.63%)
Reservations:
  res1.609x8 User 9:51:12 -> 11:51:12 (2:00:00)
    Blocked Resources@9:51:12 Procs: 8/8 (100.00%) Mem: 0/7970 (0.00%) Swap: 0/11970 (0.00%) Disk: 0/1 (0.00%)
  66x8 Job:Running -00:32:26 -> 1:27:34 (2:00:00)
  res1.614x8 User 1:09:51:12 -> 1:11:51:12 (2:00:00)
    Blocked Resources@1:09:51:12 Procs: 8/8 (100.00%) Mem: 0/7970 (0.00%) Swap: 0/11970 (0.00%) Disk: 0/1 (0.00%)
Jobs: 66
The output provides a number of details about the compute node, including which resources are available on the node (Configured), which of those are currently in use (Utilized), and which have been dedicated to a job (Dedicated). Note that a node may report utilization of resources that are not explicitly dedicated. In the case above, 8 processors are dedicated to a job, but the node also reports 1189 MB of swap space in use. In most cases this is acceptable until the utilized resources reach the limit of the physically available resources or the system load dramatically exceeds the number of available processors. Ideally, a job should never utilize more than the resources dedicated to it. If it does, the job’s owner should be asked to adjust the resources requested by subsequent similar jobs.
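The comparison between utilized and dedicated resources can be automated. The following Python sketch is one possible approach, assuming the "NAME: value" layout of the Utilized and Dedicated lines in the checknode output above. The function names and the overload factor of 1.5 are hypothetical choices for illustration, not Moab defaults.

```python
import re


def parse_resources(text):
    """Parse 'PROCS: 8 SWAP: 1189M' into {'PROCS': '8', 'SWAP': '1189M'}."""
    return dict(re.findall(r"(\w+):\s*(\S+)", text))


def undedicated_usage(utilized, dedicated):
    """Return names of resources that are in use but not dedicated to a job."""
    return sorted(name for name in utilized if name not in dedicated)


def overloaded(cpu_load, procs, factor=1.5):
    """True if the load exceeds the processor count by the given factor.

    The factor is an arbitrary illustrative threshold.
    """
    return cpu_load > procs * factor


# Values copied from the checknode example above.
utilized = parse_resources("PROCS: 8 SWAP: 1189M")
dedicated = parse_resources("PROCS: 8")

print("used but not dedicated:", undedicated_usage(utilized, dedicated))
print("overloaded:", overloaded(8.080, int(dedicated["PROCS"])))
```

A report like this only flags candidates for a closer look. As noted above, swap usage without a matching dedication is usually harmless until physical limits are approached.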
The checknode output additionally provides an overview of current and
future resource reservations. The example above shows two future instances
of the standing reservation res1 and one current reservation for a job
(66). A current reservation can be identified by its negative start time,
which indicates that the reservation began in the past.
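The negative-start-time rule is easy to apply programmatically. The Python sketch below assumes reservation lines of the form "NAMExPROCS TYPE START -> END (DURATION)" as in the checknode example. The regular expression and the function name parse_reservations are hypothetical and have only been checked against that sample.

```python
import re

# Matches lines such as "66x8 Job:Running -00:32:26 -> 1:27:34 (2:00:00)".
RESERVATION = re.compile(
    r"^\s*(?P<name>\S+)x(?P<procs>\d+)\s+(?P<kind>\S+)\s+"
    r"(?P<start>-?[\d:]+)\s+->\s+(?P<end>-?[\d:]+)\s+\((?P<duration>[\d:]+)\)"
)


def parse_reservations(text):
    """Return one dict per reservation line; other lines are ignored."""
    result = []
    for line in text.splitlines():
        match = RESERVATION.match(line)
        if match:
            entry = match.groupdict()
            # A negative start time means the reservation has already begun.
            entry["current"] = entry["start"].startswith("-")
            result.append(entry)
    return result


# Reservation section copied from the checknode example above.
sample = """\
res1.609x8 User 9:51:12 -> 11:51:12 (2:00:00)
Blocked Resources@9:51:12 Procs: 8/8 (100.00%) Mem: 0/7970 (0.00%)
66x8 Job:Running -00:32:26 -> 1:27:34 (2:00:00)
res1.614x8 User 1:09:51:12 -> 1:11:51:12 (2:00:00)
"""

for res in parse_reservations(sample):
    label = "current" if res["current"] else "future"
    print(f"{res['name']}: {label}, starts at {res['start']}")
```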
Aside from typical computational resources such as CPU, main memory, and
disk, Torque and Moab allow user-defined resources to be associated with
nodes. Such resources can be, for example, special hardware devices, a
limited number of software licenses, or application-specific resource
measures (e.g. Dreamhost’s Conueries). These kinds of resources are
typically defined through the NODECFG directive in Moab’s configuration
file.
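As an illustration only, a node-locked generic resource might be declared in Moab's configuration file roughly as follows. The node name is taken from the examples above. The resource name matlab and the count of 2 are hypothetical, and the attribute syntax is an assumption that should be checked against the Moab documentation for the installed version.

```
# moab.cfg (sketch; resource name and count are hypothetical)
NODECFG[compute-0-3.local] GRES=matlab:2
```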