DISTRIBUTED QUEUEING SYSTEM - 3.1.3
SUPERCOMPUTER COMPUTATIONS RESEARCH INSTITUTE
Table of Contents
Introduction
A DQS job
submitting a job
querying job status
modifying a job request
holding, deleting jobs
requesting resources
hard and soft resources
consumable resources
forming resource requests
"potential" resources
moving jobs
cells and queues
Qmove
suspending queues and jobs
parallel jobs
job execution environment
DQS scheduling strategies
Problem Solving
DQS is actually a simple system which provides a multitude of options to accommodate the requirements of a wide variety of sites and users. As the number of options increases, as it does with each succeeding generation of DQS, a user might mistakenly come to view the system as quite complex. This user guide is intended to introduce DQS to the new user and to explain those features most often used by the experienced user. In particular, the concept of "resources" is explored, with attention focused on the new DQS 3.1.3 feature, "consumable resources".
Any job which a user needs to execute on one or more computers can be a "DQS job". For those whose sole contact with computers has been through personal UNIX workstations, the concept of running their jobs in "batch" mode may be somewhat disconcerting. Users accustomed to submitting their jobs to mainframe computers will be more familiar with the attributes of DQS. Unlike the mainframe system, however, the DQS batch environment customarily includes multiple autonomous UNIX-based computational platforms that are heterogeneous in hardware architecture and operating system variant.
In its most fundamental form, a DQS job is an extension of a UNIX script used to run an application, as one might even on their own personal workstation. Let us use the "traditional" example of a FORTRAN compilation and execution of a simple application:
FTN test.f -o test
./test
where test simply produces the classical "Hello World" output, which is sent to standard output.
If we then wish to run this same application within a UNIX script we would create a file called "test.run" with the following lines:
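A minimal sketch of such a file, assuming a Bourne shell and that the compiled program is started as ./test so that the shell's built-in test command is not run in its place. The here-document is only a convenient way to create the file from the command line; the two script lines follow the compile-and-run commands shown earlier.

```shell
# Create test.run in the current directory (sketch; the file contents are an
# assumption based on the compile and run commands shown earlier).
cat > test.run <<'EOF'
#!/bin/sh
# Compile the example program.
FTN test.f -o test
# Run it, sending stdout to test.out and stderr to test.errors.
./test > test.out 2> test.errors
EOF
chmod +x test.run
```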
Note that we redirect the stdout and stderr files to "test.out" and "test.errors" respectively.
This script would then be executed by the user on a machine of their choice, most likely their own workstation.
What then is needed to turn this script into a DQS job? Nothing, as long as one doesn't care what machine it will be executed on. All that is needed is to "submit" this job to the DQS batch queuing system.
The simple example becomes a "DQS job" by submitting it to the DQS system with the "qsub" utility:
qsub test.run
The qsub utility will contact the qmaster and request that the job be "validated" for execution within the system. This "validation" is the process of determining whether the job requires something which does not exist in the current system. Since our test script makes no obvious requests for resources (the FTN command is not recognized as a request for a compiler resource known by DQS), all that is needed is for any host in this hypothetical cell to be idle and available to execute the job.
Let us now take advantage of some basic DQS facilities. First we would like to have an email message sent to us upon job termination. We must instruct DQS to perform this task by inserting a "DQS directive" into the test.run script. By default DQS interprets any line of the script as a "DQS directive" if the first two characters of the line are the string "#$". This can be changed by the user (see the qsub -C option in the Reference Manual).
Thus we add one line to our script:
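A minimal sketch of the resulting script, assuming a Bourne shell script that compiles and runs the example; only the #$ -me line is the addition discussed here, and the rest of the file contents are an assumption. Because the directive begins with "#", the shell itself treats it as a comment.

```shell
# Recreate test.run with the DQS directive added (sketch; file layout is assumed).
cat > test.run <<'EOF'
#!/bin/sh
#$ -me
FTN test.f -o test
./test > test.out 2> test.errors
EOF
chmod +x test.run
```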
The DQS directive #$ -me tells the system that a mail message should be sent to the person submitting the job at the end of the job. We could also have directed that a mail message be sent at the beginning of the job and if the job aborts, with the directive #$ -meab. The order of the symbols 'e', 'a' and 'b' in this list is not significant.
Note that the directive could also be communicated with DQS on the qsub command line. Instead of inserting the directive in the script, we could perform the submission with:
qsub -me test.run
In cases where only a few directives are needed this approach might be used, but, as the user will see, many job submissions will benefit from more complex sets of DQS directives which are better "captured" in the job script.
Once a user has relinquished their job to the "welcoming arms" of a queuing system they need a means for monitoring and controlling its destiny. A first step is to query the system to establish the status and "DQS identity" of the job. The "qstat" utility is used to display the state of queues and jobs. There are three forms of this display:
default (no options)             Displays the state of user jobs in summary form
full listing (-f option)         Displays summary queue and job status
extended listing (-ext option)   Displays the full queue and job descriptions
The simplest way to get in touch with our job is to execute the command:
qstat
and scan through the output looking for jobs we have submitted. Instead of being deluged with information about every other job in the system one can execute:
qstat -u <my user name>
where <my user name> is the login name of the user who submitted the job.
The output of this variant might look like:
---Pending Jobs------------------------------------------------------------------------------------------
<my user name> my-job-name dqs-job-number 0:0 QUEUED 03/25 20:40
This would indicate, to the dismay of <my user name>, that the job is not RUNNING on any machine in the system. It is, however, queued with a priority of zero (the leftmost digit of "0:0"). The sub-priority is also zero (the rightmost digit), indicating that there are no prior jobs for this user.
Or, more optimistically, the display might offer:
Queue Name Queue Type Quan Load State
queue1 batch 1/1 0.14 er UP
<my user name> 2183 0:1 r
RUNNING 02/12/96 19:25:56
This would hearten us in our endeavors, because our job is (apparently) executing. The symbols on the output lines may be a bit confusing because the first line shows the status of the queue while the second describes "our job".
Let us examine the queue description first:
Queue Name   queue1   Each queue is given a unique name by the administrator.
Queue Type   batch    The default mode of all DQS queues.
Quan         1/1      One resource ("1/") of one available ("/1") is in use.
Load         0.14     The load average measured by the queue1 CPU is 0.14.
             er       All of the queue states are displayed as single-character symbols. The most important of these are presented under the heading "State". The "e" shows that the queue is ENABLED. The "r" shows that the queue is running.
State        UP       The normal mode of operation will be shown as "UP".
The job description is a bit less cryptic. The entry begins with <my user name>, followed by the DQS-assigned job number (2183). The values 0:1 give the submission priority of the job, defaulted to zero, and the sub-priority :1, which indicates that this is the first job running for this user. The submission priority is assigned by the user with the QSUB option flag "-p", while the sub-priority is an internal parameter computed during each scheduling pass for all the queues.
The command "qstat -ext" produces a comprehensive
display of queue and job parameters as well as the status obtained
with the "-f" option. Discussion of relevant portions
of these extended displays will appear in later sections.
Often a user will find one or more of their jobs in the pending queue awaiting assignment to an execution queue. After reviewing their pending jobs, this user may decide to change a job's submission parameters to affect its future scheduling. One method for this would be to delete the job and resubmit it. A more convenient technique is to use the QALTER utility to modify one or more of the parameters which the user assigned at the time of QSUB, or which DQS set by default when the user did not designate them explicitly.
In the simple example given here, the user provided no parameters to the QSUB command and hence the submission priority has been set to the default value of zero. If the user wishes to increase that priority the QALTER utility would be invoked with:
qalter -jid <job number> -p <new priority>
The <job number> is that which DQS assigned to the job in the pending queue, and the <new priority> value must be in the range -1024 to +1023.
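The range check can be done before calling qalter. The following is a minimal sketch, assuming a POSIX shell; the function name valid_priority and the job number and priority used are hypothetical, and the qalter command is only printed, not run. A leading "+" is not accepted in this sketch.

```shell
# valid_priority VALUE: succeed only if VALUE is an integer in -1024..1023.
valid_priority() {
    case "$1" in
        # Reject empty input, a lone "-", non-digit characters, and a "-"
        # anywhere except the first position.
        ''|-|*[!0-9-]*|?*-*) return 1 ;;
    esac
    [ "$1" -ge -1024 ] && [ "$1" -le 1023 ]
}

job=2183      # hypothetical job number
prio=10       # hypothetical new priority
if valid_priority "$prio"; then
    echo "qalter -jid $job -p $prio"
else
    echo "priority $prio is outside -1024..1023" >&2
fi
```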
Except for the job number, any parameter which can be employed with the QSUB command can be used with the QALTER command, including replacing the script file which originally accompanied the QSUB command. The QALTER command may not be used for jobs already in the RUNNING state, with the exception of the return of "consumable resources" (see below).
The user has a number of tools available to work with their jobs once those jobs are in the queuing system. For example, they may decide to place a "hold" on one of their jobs in the pending queue so that another job may progress ahead of it, or to delay scheduling until some other event or job has occurred. First, the user may choose to submit a job to the system with a "hold" placed on the job at the time of the submission. This step involves the use of the "-h" option in the QSUB command. Once a job is submitted the user can use the "QHOLD" utility to place a hold on a job if it is still in the PENDING queue. QHOLD uses the same "-h" option.
The "-h" option is used for system administration tasks as well as user access. Thus the DQS 3.1.3 Reference Manual describes four alternatives. The user is permitted only the "u" (or user hold) or the "n" (no hold) variants. Thus at job submission the user might place a hold:
qsub . . -h u .. test.run
Or if the job is in the pending queue:
qhold -jid <job number> -h u
Once a "hold" has been placed on a job in the pending queue it will not be considered eligible for scheduling until it has either been "released" from the hold or it is deleted from the queue entirely. A job can be released from a user invoked "hold" with the QRLS utility:
qrls -jid <job number> -h u
or the user may modify the "hold" state by using the QALTER command:
qalter -jid <job number> .. .. -h n
Which will set the user accessible hold to "none".
A user may delete one or more of their own jobs from the queuing system if the jobs are in either the pending queue or the executing queue:
qdel <job number>
or:
qdel <job number>,<job number>, ..
Note that the job numbers are separated by commas and NOT spaces.
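When the job numbers are held as separate words in a script, they can be joined with commas before being handed to qdel. A minimal sketch, assuming a POSIX shell; join_jobs is a hypothetical helper and the job numbers are made up. The qdel command is printed rather than executed.

```shell
# join_jobs N1 N2 ...: print the job numbers as one comma-separated word.
join_jobs() {
    # "$*" joins the arguments with the first character of IFS; the
    # subshell keeps the IFS change from leaking into the caller.
    ( IFS=,; printf '%s\n' "$*" )
}

echo "qdel $(join_jobs 2183 2184 2190)"    # prints: qdel 2183,2184,2190
```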
The simple example we have been using so far (test.run) has made no unusual demands for system resources. It presumes that all queues in the system have a FORTRAN compiler and that the FORTRAN dialect in our test program is consistent with all the compilers. Further, memory, disk space and database locality are also not consequential in this example. These are unrealistic assumptions in most cases. Most sites using DQS contain heterogeneous collections of hardware and software and often subdivide these collections into types of use (long-term jobs, short-term jobs, etc.).
The DQS administrator is supplied with tools for organizing the system and defining resources to be accessible by the user. Typical resources are CPU memory sizes, hardware architecture and operating system versions.
Most jobs will have one or more imperative requirements. One of the most common is the need for a particular hardware/software system (e.g. AIX-3.2.5). By default requested resources are considered essential (or "hard") unless the user precedes the request in the QSUB command with the option "-soft".
Requirements for multiples of various resources in parallel jobs, such as 2 or more CPUs, can be either "hard" or "soft". Many users choose to request at least 2 CPUs to run their parallel job and then request more processors following the "-soft" option flag in the QSUB command line or job script. While a non-parallel user might expect to use the "-soft" option for a request of the form "I need at least 32 MB of memory but would be much happier with 64 MB", most site resource allocations will not make effective use of such a request. The most common use of the "-soft" option for non-parallel jobs is to state a preference for a queue without making it a "hard" demand.
Site resources are by and large static over periods of time like days or weeks. CPU memory sizes and CPU computing power are not subject to moment-by-moment changes. When they are modified the DQS site manager can adjust the resource descriptions to match the new configurations.
There is a class of resource which does vary within short periods of time. A very common commercial practice, these days, is to manage software licenses for Compilers, Data Base Managers, etc. dynamically at a given site. Many sites do not purchase licenses for all of their extant platforms. A job submitted to DQS must not be scheduled for execution if that job needs one or more software licenses in order to complete but those licenses are already in use by another job.
Another common form of a time-varying resource would be the amount of shared memory available to a processor in a shared-memory multi-processor system. Shared local disk space might be another resource which is depleted and restored as jobs startup and terminate. Resources of this type are called, by DQS, "consumable resources".
A user specifies the resources they require in the QSUB command line or in the DQS script file. The most direct method is to identify a specific queue as the place for the submitted job to execute:
qsub -q <my queue>
That request will require <my queue> for execution. If the user would prefer that queue, but not insist on it, they might make the command line request:
qsub -soft -q <my queue>
Note that DQS scans the command line and script commands from left to right. During that process any resource requests to the right of a "-hard" or "-soft" option flag will be interpreted as requiring that type of resource. Hence one could mix hard and soft resources thus:
qsub -soft -q <my queue> .. .. -hard <some other resource>..
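The left-to-right rule can be illustrated with a small model. This is a sketch under stated assumptions, not the DQS parser: classify_requests is a hypothetical function that looks only at the "-hard", "-soft", "-q" and "-l" flags named in this guide and labels each request with the mode in force when it is reached. The queue and resource names are made up.

```shell
# classify_requests ARGS...: print "hard" or "soft" for each -q / -l request,
# scanning the arguments from left to right.
classify_requests() {
    mode=hard                          # requests are "hard" until -soft is seen
    while [ $# -gt 0 ]; do
        case "$1" in
            -hard) mode=hard ;;
            -soft) mode=soft ;;
            -q|-l)
                [ $# -ge 2 ] || return 1   # the flag needs a value
                echo "$mode $1 $2"
                shift ;;                   # consume the flag's value
        esac
        shift
    done
}

classify_requests -soft -q myqueue -hard -l AIX325
```

With the arguments shown, the queue request is reported as soft and the AIX325 request as hard.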
The typical job request will not demand a specific queue. Instead the user will request one or more classes of resources which have been established by the DQS administrator. Let us presume a site with three different hardware platform architectures, each with several CPUs available. The site administrator has named the resources with their operating system tags: AIX325, IRIX53 and SOLARIS24. In addition this example site owns one FORTRAN license for each of the operating systems. The administrator names these XLF, SGIFTN and FORTRAN.
To further complicate our example, each brand has three separate CPUs, each with a different amount of memory: 32 Megabytes, 64 Megabytes and 128 Megabytes.
The example we have been using (test.run) will now be submitted in a more realistic manner:
qsub -me -l AIX325.and.(mem.gt.32).and.(XLF.eq.1) test.run
The command line now has the resource request appended to it. Requests for resources other than specific queue names begin with the "-l" flag and consist of a string of resource names, interspersed with logical and relational operators. Since the string must have NO embedded blanks, parentheses may be used to aid readability.
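Because a single blank breaks the request string, it can help to assemble it from separate terms. A minimal sketch, assuming a POSIX shell; and_terms is a hypothetical helper, and the single quotes around the printed request are an assumption added so that an interactive shell would not interpret the parentheses. The qsub command is printed, not run.

```shell
# and_terms TERM...: join the terms with ".and." and refuse any term
# that contains a blank.
and_terms() {
    request=
    for term in "$@"; do
        case "$term" in
            *' '*) echo "blank in term: $term" >&2; return 1 ;;
        esac
        # Prefix ".and." only when something is already in the request.
        request="${request:+$request.and.}$term"
    done
    printf '%s\n' "$request"
}

req=$(and_terms AIX325 '(mem.gt.32)' '(XLF.eq.1)')
echo "qsub -me -l '$req' test.run"
```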
The resource request is interpreted by DQS as follows:
A command line or DQS script may contain one or more request strings beginning with the "-l" option flag. Each one of these strings will request at least one queue to meet the requirement. Thus:
qsub -l AIX325 -l AIX325
Would request that two queues/CPUs be allocated to this job. This same request can be restated more simply:
qsub -l (qty.eq.2).and.AIX325
Depending upon the topology of the DQS site and the requirements of a given job, resource requests can contain a number of elements. Obviously parallel jobs will require more complex resource requests than simple single-processor jobs.
Note: Relational operators can be given in FORTRAN or "C" syntax (.eq. ==, .ne. !=, .lt. <, .gt. >, .le. <=, .ge. >=). Logical operators can also be given in either language syntax (.and. &&, .or. ||, .not. !). For compatibility with DQS 3.2.4 the comma (,) may be used in place of the logical ".and." operator.
The consumable resource "XLF" requested by the job can be returned to the license pool by a RUNNING job by executing the DQS command QALTER with the "-rc" option:
qalter -rc XLF=1
This command would return one XLF license to the system.
DQS 3.1.3 performs a pre-validation of jobs before accepting them into the queuing system. This pre-validation consists of searching all queue definitions to see if the "hard" resources requested for the job actually exist, even if they may be in use by some other job at the time this job was submitted. If any of the "hard" resources does not exist, the job is rejected, and an error message with the reason for the rejection is returned to the QSUB utility and displayed for the user.
In some cases a user may be aware that a resource (such as a new queue) will be added or returned to the DQS at some point in the future. They may wish to submit their job and place it into the pending queue to await the appearance of the new resource. This can be done by adding the "FORCE REQUEST" flag ("-F") to the QSUB command line or DQS script:
qsub -F -l (wild_eyed_scheme).and.mem.gt.1000000
The "-F" flag should be used with care as no pre-validation is performed and a job may have an erroneous resource request which will leave it "orphaned" in the pending queue until someone deletes it at a later time.
Once a job has been placed into the RUNNING state and is executing in one or more queues its parameters cannot be modified nor can it be moved to another location in the system. Pending jobs can be moved from one target queue to another by one of the following methods:
What is a cell? It is the collection of computer hosts and DQS
software which make up a single entity managed by a daemon called
the "qmaster". In the following diagram are displayed
four CPUs. One of these is executing the qmaster daemon. Two processors
are executing the dqs_execd daemons. These two processors are
related to the queues shown here and would execute any job assigned
to those queues. The computer labeled "dqs host" is
not running any of the DQS daemons. It is known to the qmaster
because the site administrator has added that name to the qmaster's
host list. This action makes this host a "trusted DQS host"
as are any hosts running the daemon.
A DQS site may have more than one "cell". The site administrator may choose to keep each cell independent and separate from the others. On the other hand they may organize the system so that one or more cells will have authorized communications with others.
A user logged into a host in one cell can submit jobs to the other cells, or they can perform the QSTAT function for the other cells.
The user can move one of their jobs in a pending queue in one cell to the pending queue in another cell. The qmove utility is provided for this inter-cell transfer purpose only. The usual command would be:
qmove <job number>@CELL_C2
Which would move the numbered job from CELL_C2 to the cell in which the qmove utility is being executed. Where a user in CELL_C3 wishes to move a job from CELL_C2 to CELL_C1 the command would be:
qmove -cell CELL_C1 <job number>@CELL_C2
The effects of this move process can be somewhat surprising:
The user will note that from time to time one or more queues may display the SUSPENDED status. When this occurs any job executing on that queue is suspended also, but NOT terminated. When the queue is un-suspended the job continues from the point where it was suspended. During the period of its suspension all of its files remain open and all memory and paging space allocated to the job remain in that state.
When does a queue get suspended? The DQS administrator and anyone designated as the queue's owner can suspend that queue using the QMOD command. There is one additional method which may appear in some site configurations. If a queue is assigned to a host which is also serving as the personal workstation for some user of the system, they may choose to use the QIDLE command at that workstation. This utility is an X Windows facility which monitors the keyboard and mouse on a workstation. If these devices are being used the QIDLE facility will suspend the queues on that workstation.
One additional means by which a queue may be suspended is when it is designated as a subordinate queue to another queue by the DQS administrator. The usual application of this facility is when a host serves both as a parallel and as a single-processor resource. The single-processor queue is made subordinate to the parallel queue. When a parallel job is started the subordinate queue and any job being executed there will be suspended.
A major feature of DQS is its support for the scheduling and management of parallel jobs to be run on two or more of the hosts in a system. There are three components in submitting parallel jobs:
qsub -me -l (qty.eq.4).and.(exec.eq.mpirun).and.AIX325
This will request four AIX325 hosts to run a parallel job. After the job is put into execution, but before the user's job script is executed, the function "mpirun" will be executed in the working directory of that user.
The simple "test.run" example we have been using so far will have operated with the following characteristics
For detailed instructions on changing the job's environment please see QSUB in the DQS reference manual.
Once a job has disappeared into the maw of DQS it is subjected to a variety of manipulations which are intended to use the entire system's resources as effectively as possible while ensuring that each user is given "fair" access to those resources. The default operation of the scheduler is often adapted by each site to its own requirements. The basic process consists of:
After a job has been validated as requesting "real" resources, it is tested against the site's queues to determine which ones it would be eligible for. Of the eligible queues, the values of the "maximum user jobs" for each queue are extracted and the smallest one selected. At the same time the number of jobs in RUNNING state for this user is computed. If the minimum queue-maximum-user-jobs is not greater than the number of that user's jobs RUNNING, the job is rejected at QSUB time and an error message is returned to the user.
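The comparison described in this paragraph can be written out as a short model. This is a sketch of the stated rule only, assuming a POSIX shell; submission_allowed is a hypothetical function and the numbers are made up. It does not reproduce how DQS obtains the queue limits or the count of running jobs.

```shell
# submission_allowed RUNNING MAX1 [MAX2 ...]
#   RUNNING: number of the user's jobs now in RUNNING state
#   MAXn:    "maximum user jobs" value of each eligible queue
# Succeeds when the smallest MAXn is greater than RUNNING.
submission_allowed() {
    running=$1
    shift
    [ $# -gt 0 ] || return 1            # no eligible queue at all
    min=$1
    for max in "$@"; do
        if [ "$max" -lt "$min" ]; then
            min=$max
        fi
    done
    [ "$min" -gt "$running" ]
}

# Eligible queues allow 4, 2 and 8 jobs per user; the user has 2 jobs running.
if submission_allowed 2 4 2 8; then
    echo "job accepted"
else
    echo "job rejected at submission time"
fi
```

With these numbers the smallest limit (2) is not greater than the running count (2), so the sketch reports a rejection.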
This last scheduling pre-validation most certainly may confuse the reader but it is the core of the "fair play" method developed at SCRI and needs to be used for a while to demonstrate its behavior and value.
Even when one starts with the simple test case with which we began this User Guide, it is possible to run into one or more dead ends on one's first, second, or later attempts at using DQS. We will proceed through a number of typical problems which a user may encounter along the way:
The DQS error file (err_file) and accounting file (acc_file) contain valuable information which can give the knowledgeable user the means for analyzing and correcting their problems with the system. Refer to Appendix A - DQS 3.1.3 Error Messages for further information.