Overview

QED is a batch job queue manager for GNU/Linux in the vein of NQS or GNU Queue.
It allows programs to execute in the background as batch jobs. Jobs may be scheduled to run at a later time, as with at(1). They may also run in a resource-controlled environment which includes, among other things, the enforcement of I/O bandwidth limits. Optional job completion notification via email is also available.

QED is network aware and has limited support for clustering: it automatically reroutes requests to the least loaded node, thus achieving a simple form of load balancing.

Communication between clients and QED follows a simple, buzzword-compatible XML-based protocol. Besides the obvious primitive for submitting jobs, there is provision for job cancellation, pending queue inspection and status reporting for completed jobs. Support for job suspension and resumption may be on the way.

QED can operate as a privileged (zero uid) process or an unprivileged one. In the former case a submission request must bear valid user credentials in order to authenticate in the system. Jobs will subsequently run with the uid and gid of the calling user. In unprivileged mode all jobs are accepted and will run with the same user id QED is running under.

For logging and bookkeeping purposes, QED produces a completion certificate for each completed job. A job is considered to have terminated successfully if it returned an exit status of zero, following the convention pervasive in Unix systems. The completion certificate includes, among other fields, the exit time, the return status and, if available, a dump of the job's stdout and stderr.

Please note that QED is currently alpha software and may have bugs; use it at your own risk.
You can (and should) report bugs to the author's email address, which is given in the AUTHORS file included in the distribution.

Networking

QED must be set up to listen on at least one interface. It can optionally send load average information to a multicast group, thereby giving each member a snapshot of the load distribution across the cluster. Balancing is thus achieved by rerouting requests to the least busy node.
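
The rerouting decision described above can be sketched in a few lines of Python. This is only an illustration, not QED's actual code: the snapshot layout (a map from node address to load average), the node names and the rule of rerouting only to a strictly less busy peer are all assumptions.

```python
from typing import Dict, Optional


def least_loaded(loads: Dict[str, float], local: str) -> Optional[str]:
    """Return the peer a request should be rerouted to, or None to keep it local.

    `loads` is a hypothetical snapshot built from the multicast load reports.
    """
    if not loads:
        return None
    best = min(loads, key=loads.get)
    # Reroute only when some peer is strictly less busy than the local node.
    if best != local and loads[best] < loads.get(local, float("inf")):
        return best
    return None


# Hypothetical snapshot as seen from node-a.
snapshot = {"node-a:3345": 1.80, "node-b:3345": 0.25, "node-c:3345": 0.90}
target = least_loaded(snapshot, local="node-a:3345")
```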

There's no support for SSL. I'm not a big fan of it: it doesn't work for datagrams and defeats the use of zero-copy optimisations present in the kernel (most notably sendfile(2)). Anyway, the lack of built-in support may be regarded as a good thing: it lets you make a choice between using stunnel or IPsec ESP.

Resource usage control

QED supports the specification of certain system resource limits either globally or per job. The latter may be conveyed in the submission request itself. Global values take precedence over any others whenever they are more restrictive. There's currently no support for more fine-grained control (e.g. specifying limits on a per-user basis).
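
As a rough illustration of the precedence rule, here is a Python sketch that merges global and per-job limits. It assumes that every limit is a number where lower means more restrictive; fsize is a name taken from this document, while cpu is a made-up one used only for the example.

```python
from typing import Dict


def effective_limits(global_limits: Dict[str, int],
                     job_limits: Dict[str, int]) -> Dict[str, int]:
    """Combine limits so that a global value wins whenever it is more restrictive.

    Assumption: limits are plain numbers and a lower value is more restrictive.
    """
    merged = dict(job_limits)
    for name, cap in global_limits.items():
        # A job may tighten a global limit but never loosen it.
        merged[name] = min(cap, merged[name]) if name in merged else cap
    return merged
```

For instance, a job asking for fsize 500 under a global fsize of 100 would end up with 100, while a job asking for 50 would keep its own, tighter value.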

The following resources are supported:

Starting with fsize, the limits are enforced via setrlimit(2) and reaching those values will trigger the semantics documented in the man pages.
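
To get a feel for the setrlimit(2) semantics, the following Python sketch applies a file size limit to a child process before it execs. It is not how QED (which is written in C++) does it; the helper name is hypothetical, and it assumes that fsize corresponds to RLIMIT_FSIZE.

```python
import resource
import subprocess
from typing import List


def run_with_fsize_limit(argv: List[str], max_bytes: int) -> subprocess.CompletedProcess:
    """Run argv with RLIMIT_FSIZE set to max_bytes (soft and hard limit alike)."""

    def apply_limit() -> None:
        # Runs in the child between fork and exec, so only the job is constrained.
        resource.setrlimit(resource.RLIMIT_FSIZE, (max_bytes, max_bytes))

    return subprocess.run(argv, preexec_fn=apply_limit)
```

A child that tries to grow a file past the limit receives SIGXFSZ or, if it ignores that signal, gets EFBIG back from write(2), as documented in setrlimit(2).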

Bandwidth enforcement is only available on the x86 architecture.
It is implemented using ptrace(2) tricks and code injection, hijacking all I/O-related system calls (read(2), write(2) et al.) and introducing a suitable delay whenever the instantaneous byte rate exceeds the allowed one. Asynchronous I/O system calls are currently not trapped.
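
The arithmetic behind such a delay can be sketched as follows. This shows only the rate calculation, under the assumption that the delay is chosen so that the bytes transferred in a measurement window fit the allowed rate; the function name is made up and the ptrace(2) machinery is left out entirely.

```python
def throttle_delay(nbytes: int, elapsed: float, limit_bps: float) -> float:
    """Seconds to sleep so that nbytes spread over (elapsed + delay) stays within limit_bps."""
    if limit_bps <= 0:
        raise ValueError("limit_bps must be positive")
    # Time the transfer should have taken at exactly the allowed rate.
    required = nbytes / limit_bps
    return max(0.0, required - elapsed)
```

Moving 1000 bytes in half a second under a 1000 bytes/s limit calls for another half second of delay; the same transfer spread over two seconds needs none.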

Byte-rate limits may be specified as a whole, thus constraining all I/O, or alternatively broken down into the following families:

Each category may be further subdivided into inbound I/O, outbound I/O or joint I/O (a limit regardless of "direction").

The remaining resources are controlled by periodically polling the relevant entries in /proc.
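
This document does not say which /proc entries QED reads. As an illustration only, the sketch below assumes /proc/<pid>/stat and extracts the CPU time consumed by a process, following the field layout documented in proc(5).

```python
def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one line of /proc/<pid>/stat."""
    # The comm field is parenthesised and may contain spaces, so split only
    # what follows the last ')'. The first remaining field is field 3 (state).
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15
    return utime + stime


# Made-up sample line, truncated after the fields of interest.
sample = "1234 (my job) S 1 1234 1234 0 -1 4194304 100 0 0 0 250 50 0 0 20 0 1 0"
ticks = cpu_ticks(sample)
```

Sampling this value at two points in time and dividing the difference by the interval gives a CPU usage estimate.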

pcpu_params is a more elaborate directive. It comprises the following sub-parameters:

Jobs killed by QED for exceeding their limits will have the reason for the forced exit recorded in their completion certificate.

Security

If QED is launched as root it will turn on mandatory authentication for all requests. QED uses the Linux-PAM API to authenticate the user. Upon successful authentication, jobs will run under that user id after a chdir(2) to the user's home directory. The environment will comprise the following variables:

  • HOME equates to the user's home directory.
  • PATH equates to the path defined in the configuration appended with the component $HOME/bin.

    Cancelling and listing jobs will be limited to those submitted by the authenticated user.
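
The HOME and PATH rules above amount to something like the following Python sketch. The function name and the use of a colon as the separator are assumptions; QED itself may build the environment differently.

```python
from typing import Dict


def job_environment(home: str, configured_path: str) -> Dict[str, str]:
    """Build the job environment: HOME, plus the configured PATH with $HOME/bin appended."""
    return {
        "HOME": home,
        "PATH": f"{configured_path}:{home}/bin",
    }
```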

    Note that scheduled jobs cannot be rerouted to another host. Since authentication is deferred until the job is about to start, the user credentials would have to be stored persistently in the meantime, possibly in plain text, which is a well-known security risk.

    Please set up stunnel or IPsec ESP when running QED as root, unless confidentiality, especially of user credentials, is of no concern to you.

    When running as a non-privileged user, QED will accept every request from anyone allowed to connect to it. Jobs will run under the same user id as the QED process.

    Download

    Download QED from sourceforge.net.

    Installing

    QED is written in C++. It has been built with gcc 3.3 and 4.1 and tested under Linux kernel 2.6.15 and newer, with libc 2.3 and 2.4.
    It should work with earlier kernel releases in the 2.6 series. Older kernel series may or may not work.

    The distribution also includes a simple QED client written in Perl. Make sure you have the following Perl modules installed:

    POSIX
    IO::Socket
    Term::ReadKey
    HTML::Entities
    Getopt::Long
    
    Alternatively, you can write your own dedicated client with such prosaic tools as a shell script and netcat(1). It's not difficult since QED uses a simple XML-based protocol.
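
A home-grown client does not need much more than a socket. The Python sketch below sends one request and reads the reply, much as netcat(1) would. The framing is an assumption, since this document does not specify it: the client closes its write side to mark the end of the request and reads until the server closes the connection. The function name is hypothetical.

```python
import socket


def exchange(host: str, port: int, request: str, timeout: float = 10.0) -> str:
    """Send one XML request to a QED host and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request.encode("utf-8"))
        # Assumption: half-closing the connection tells the server the request is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")
```

If the daemon turns out to detect the end of a request by parsing the closing tag instead, the shutdown call can simply be dropped.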

    To install, start by unpacking the QED release and changing into the extracted top directory:

    $ tar xvjf qed-x.x.x.tar.bz2
    $ cd qed-x.x.x
    

    Then run:

    $ ./configure [--prefix=<dir>] [--enable-debug] [--disable-optimization] [--disable-pam]
    
    The available options are:

    After this, just follow the usual mantra:

    $ make && make install
    

    Running

    Simply put, after making sure you've set up your PATH correctly, start QED with the command:
    $ qed [-p <pid-file>] [-c <config-file>] [-d] [-l]
    

    Options are as follows:

    Now, for something more detailed. If you intend to run QED as root so you can authenticate different users and run jobs under their respective ids, you'll have to add a file named qed to the PAM configuration directory of your installation, which is typically /etc/pam.d.

    Most probably your system uses Unix-based authentication, i.e. /etc/passwd and /etc/shadow. In this case, the contents of the qed file should be:
    auth    required        pam_env.so
    auth    required        pam_unix.so nodelay
    
    If you use LDAP authentication, replace the above lines with:
    auth    required        pam_env.so
    auth    required        pam_ldap.so
    

    You get the idea. If you don't want to mess with PAM or just don't trust this code, then don't run QED as root, period.

    As mentioned above, a simple QED client is included in the distribution. It is, unsurprisingly, named qed-client. You can use it to submit jobs, query their state or cancel them. The usage is:

    qed-client [-a] [-l <limits-file>] [qed-host:port] {submit|queue-stat|job-stat|cancel} ...
    
    qed-host:port specifies the address and port the QED server is bound to. It defaults to localhost:3345.
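
For anyone writing their own client, the address handling could look like the Python sketch below. Only the localhost:3345 default comes from this document; accepting a bare host name and falling back to the default port is an assumption, and the function name is made up.

```python
from typing import Optional, Tuple


def parse_endpoint(spec: Optional[str],
                   default: Tuple[str, int] = ("localhost", 3345)) -> Tuple[str, int]:
    """Split a qed-host:port argument, falling back to localhost:3345."""
    if not spec:
        return default
    host, sep, port = spec.rpartition(":")
    if not sep:
        # Assumption: a bare host name means "use the default port".
        return (spec, default[1])
    return (host or default[0], int(port))
```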

    Options are as follows:

    The available commands are:

    Protocol

    This section describes the protocol between clients and QED. It's fundamentally a synchronous request-response XML-based protocol. Requests are made by clients and the corresponding responses are returned by the QED daemon. Primitives are presented in informal XML with interspersed comments.

    submit_request

    Submits a job to a QED host.
    
    <submit-request>
    	<auth-info> <!-- authentication descriptor: optional -->
    		<user>string</user>
    		<cred>string</cred> <!-- credential, typically a password -->
    	</auth-info>
    		
    	<!-- executable file to be submitted ie argv[0] -->
    	<command>string</command> 
    	<arg>string</arg> <!-- further optional argument argv[1] -->
    	<!-- ... -->
    	<arg>string</arg> <!-- further optional argument argv[n] -->
    
    	<!-- alternatively the program can be specified as it would be invoked 
    		in a shell although no meta characters are supported -->
    	<command-line>string</command-line> 
    		
    	<!-- if present, the submitted job is to be deferred for execution at the specified timestamp. 
    	Use the ISO 8601 format, e.g. yyyy-mm-dd [hh:mm:ss] [{+|-}hh]. Default is to submit right away. -->
    	<submit-time>string</submit-time> 
    
    	<!-- number of times this job will be resubmitted if not successful, i.e.
    		whenever its exit code is non-zero. default: 0 -->
    	<retries>int</retries> 
    	
    	<!-- number of seconds the job will be postponed before being retried. It's only meaningful if <retries> 
    	has been specified. default: 0 -->
    	<retry-period>int</retry-period> 
    
    	<!-- if true allows this job to be submitted on another qed host if it is
    	deemed convenient according to the load balancing algorithm. default: false -->
    	<redirect>bool</redirect>	
    
    	<!-- email address used for mail notification -->
    	<notify-rctp>string</notify-rctp> 
    	<!-- ... -->
    	<notify-rctp>string</notify-rctp> <!-- further optional address -->
    	
    	<!-- job priority. queued jobs are maintained in descending priority order
    		so misuse of this parameter could lead to resource starvation of the lower
    		priority jobs. default: 0 -->
    	<priority>int</priority>	
    	
    	<!-- resource limits description -->
    	<limits>see here</limits>
    	
    </submit-request>
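
As an illustration, a submit-request like the one above can be assembled with Python's standard xml.etree.ElementTree module. The helper name and its parameters are hypothetical, only a handful of the elements are covered, and writing booleans as 1 and 0 is an assumption based on how the global flag is described elsewhere in this document.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Sequence


def build_submit_request(command: str,
                         args: Sequence[str] = (),
                         user: Optional[str] = None,
                         cred: Optional[str] = None,
                         retries: Optional[int] = None,
                         redirect: Optional[bool] = None) -> str:
    """Return a submit-request document as a string."""
    root = ET.Element("submit-request")
    if user is not None:
        auth = ET.SubElement(root, "auth-info")
        ET.SubElement(auth, "user").text = user
        ET.SubElement(auth, "cred").text = cred or ""
    ET.SubElement(root, "command").text = command
    for arg in args:
        ET.SubElement(root, "arg").text = arg
    if retries is not None:
        ET.SubElement(root, "retries").text = str(retries)
    if redirect is not None:
        # Assumption: booleans travel as 1/0.
        ET.SubElement(root, "redirect").text = "1" if redirect else "0"
    # ElementTree escapes &, < and > in text for us.
    return ET.tostring(root, encoding="unicode")
```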
    
    

    submit_response

    
    <submit-response>
    	<!-- error code. 0 if successful -->
    	<status>int</status>
    	
    	<!-- an error message, if that is the case -->
    	<reason>string</reason>
    
    	<!-- job id -->
    	<jid>int</jid>
    	
    	<!-- uri of the qed host the request has been redirected to, if applicable -->
    	<redir-uri>string</redir-uri> 
    
    </submit-response>
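
On the client side, a submit-response can be picked apart along these lines. This is a sketch with a hypothetical helper name; it assumes that status is always present and that jid and redir-uri may be missing. A client that intends to query the job later would want to keep redir_uri around.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict


def parse_submit_response(xml_text: str) -> Dict[str, Any]:
    """Extract status, reason, jid and redir-uri from a submit-response."""
    root = ET.fromstring(xml_text)
    if root.tag != "submit-response":
        raise ValueError(f"unexpected reply <{root.tag}>")
    status = root.findtext("status")
    if status is None:
        raise ValueError("submit-response without <status>")
    jid = root.findtext("jid")
    return {
        "status": int(status),
        "reason": root.findtext("reason"),
        "jid": int(jid) if jid else None,
        # None when the job was accepted by the host we talked to.
        "redir_uri": root.findtext("redir-uri"),
    }
```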
    
    

    cancel_request

    This primitive allows a client to cancel a running or pending job submitted on the QED host the client is connected to.
    
    <cancel-request>
    	<auth-info> <!-- authentication descriptor: optional -->
    		<user>string</user>
    		<cred>string</cred> <!-- credential, typically a password -->
    	</auth-info>
    
    	<!-- job id to cancel -->
    	<jid>int</jid>
    
    </cancel-request>
    
    

    cancel_response

    
    <cancel-response>
    	<!-- error code. 0 if successful -->
    	<status>int</status>
    	
    	<!-- an error message, if that is the case -->
    	<reason>string</reason>
    	
    </cancel-response>
    
    

    queue_stat_request

    This primitive allows a client to dump the internal QED queues in order to gather information about the status of running or pending jobs. Completed job information must be obtained locally on each QED host in the spool directories, or alternatively, by issuing the job-stat-request primitive.
    
    <queue-stat-request>
    	<auth-info> <!-- authentication descriptor: optional -->
    		<user>string</user>
    		<cred>string</cred> <!-- credential, typically a password -->
    	</auth-info>
    
    	<!-- if true this request is multicast to every qed host in order to retrieve
    		job status from any job in the farm submitted by this very user
    		default: false -->
    	<global>bool</global>
    
    </queue-stat-request>
    
    

    queue_stat_response

    This response has a different layout depending on the global tag in the corresponding request. If it is omitted or has a 0 (false) value, only information about jobs submitted to the specified QED host will be returned. The format is as follows:
    
    <queue-stat-response>
    	<!-- error code. 0 if successful -->
    	<status>int</status>
    	
    	<!-- an error message, if that is the case -->
    	<reason>string</reason>
    
    	<!-- job id -->
    	<jid>int</jid>
    
    	<uri>localhost</uri> <!-- literally "localhost" but it really doesn't matter -->
    	
    	<entry> <!-- one entry per job -->
    		<!-- job's corresponding executable, argv[0] -->
    		<command>string</command> 
    		<arg>string</arg> <!-- argv[1] -->
    		<!-- ... -->
    		<arg>string</arg> <!-- argv[n] -->
    	
    		<submit-time>string</submit-time> <!-- submission timestamp -->
    		<running>bool</running> <!-- whether job is running or still pending -->
    	
    		<start-time>string</start-time> <!-- timestamp of process creation if running -->
    		<pid>int</pid> <!-- corresponding process id if job is running -->
    
    		<retries>int</retries> <!-- current retry count if specified at submission -->
    	</entry>
    	
    </queue-stat-response>
    
    
    If global has a true (non-zero) value, the response will look like this:
    
    <queue-stat-response>
    	<!-- error code. 0 if successful -->
    	<status>int</status>
    	
    	<!-- an error message, if that is the case -->
    	<reason>string</reason>
    
    	<!-- job id -->
    	<jid>int</jid>
    	
    	<peer> <!-- one entry per QED host that currently has jobs submitted by the
    		user -->
    
    		<uri>string</uri> <!-- QED host address and port -->
    		
    		<entry> <!-- one entry per job -->
    			<!-- job's corresponding executable, argv[0] -->
    			<command>string</command> 
    			<arg>string</arg> <!-- argv[1] -->
    			<!-- ... -->
    			<arg>string</arg> <!-- argv[n] -->
    	
    			<submit-time>string</submit-time> <!-- submission timestamp -->
    			<running>bool</running> <!-- whether job is running or still pending -->
    	
    			<start-time>string</start-time> <!-- timestamp of process creation if running -->
    			<pid>int</pid> <!-- corresponding process id if job is running -->
    
    			<retries>int</retries> <!-- current retry count if specified at submission -->
    		</entry>
    	</peer>
    </queue-stat-response>
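
A client might flatten the global form of this response into a per-host job list, as in the Python sketch below. The helper name is hypothetical; the running flag is kept as raw text because its exact encoding is not spelled out here.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict, List


def jobs_by_peer(xml_text: str) -> Dict[str, List[Dict[str, Any]]]:
    """Map each QED host uri to the list of jobs it reports."""
    root = ET.fromstring(xml_text)
    result: Dict[str, List[Dict[str, Any]]] = {}
    for peer in root.findall("peer"):
        uri = peer.findtext("uri", default="")
        jobs = []
        for entry in peer.findall("entry"):
            argv = [entry.findtext("command", default="")]
            argv += [arg.text or "" for arg in entry.findall("arg")]
            pid = entry.findtext("pid")
            jobs.append({
                "argv": argv,
                "running": entry.findtext("running"),
                "pid": int(pid) if pid else None,  # only present for running jobs
            })
        result[uri] = jobs
    return result
```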
    
    

    job_stat_request

    Asks QED for the completion certificate of the given job id. Note that this request must be made to the QED host that has actually run the job since the job id by itself does not convey that information. Therefore, if applicable, a compliant client will take note of the redir-uri field in the submit_response for future use when making this request.
    
    <job-stat-request>
    	<auth-info> <!-- authentication descriptor: optional -->
    		<user>string</user>
    		<cred>string</cred> <!-- credential, typically a password -->
    	</auth-info>
    
    	<!-- job id whose completion certificate is requested -->
    	<jid>int</jid>
    
    </job-stat-request>
    
    

    job_stat_response

    Conveys the completion certificate of the specified job, provided it exists and is accessible by the user.
    
    <job-stat-response>
    	<!-- error code. 0 if successful -->
    	<status>int</status>
    	
    	<!-- an error message, if that is the case -->
    	<reason>string</reason>
    
    	<!-- if status is 0, the following fields are the ones found in the completion certificate. -->
    	
    </job-stat-response>
    
    

    error

    This primitive is returned by the QED server whenever a fatal error occurs, typically in response to a malformed request or a timeout while reading input from the client.
    
    <error>
    	<msg>string</msg> <!-- an error message -->
    </error>
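
Since any request may be answered with this primitive, a client should check the root element before looking for the fields it expects. A minimal Python sketch follows; the exception class and function names are made up.

```python
import xml.etree.ElementTree as ET


class QedError(Exception):
    """Raised when the server answers with <error> or an unexpected primitive."""


def check_reply(xml_text: str, expected_tag: str) -> ET.Element:
    """Return the parsed reply, or raise QedError if it is not the expected primitive."""
    root = ET.fromstring(xml_text)
    if root.tag == "error":
        raise QedError(root.findtext("msg", default=""))
    if root.tag != expected_tag:
        raise QedError(f"unexpected reply <{root.tag}>")
    return root
```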
    
    

    Completion certificates

    Completion certificates are the files produced by QED for each terminated job. There will typically be two distinct directories, holding successful and failed jobs respectively, whose paths are derived from the spool-dir configuration directive.

    Furthermore, when QED runs as root, one additional subdirectory per user will be created. It has the appropriate ownership and a permission mode of 0700, which is to say, only accessible by the owner. This subdirectory will contain all certificates pertaining to jobs submitted by that user.

    Occasional purging of the older files residing in these directories is recommended.
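
A purge of this kind could be run from cron with something like the Python sketch below. The directory to pass in depends on your spool-dir setting, and the retention period is entirely up to you; the function name is made up. Only regular files directly inside the given directory are considered.

```python
import os
import time
from typing import List, Optional


def purge_older_than(directory: str, max_age_days: float,
                     now: Optional[float] = None) -> List[str]:
    """Delete regular files in directory last modified more than max_age_days ago.

    Returns the sorted names of the files that were removed.
    """
    cutoff = (time.time() if now is None else now) - max_age_days * 86400
    removed = []
    for entry in os.scandir(directory):
        if not entry.is_file(follow_symlinks=False):
            continue  # leave subdirectories and symlinks alone
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```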

    An exit status of zero counts as success; anything else counts as failure. A job can abort for many reasons, including crashes and, most notably, forced exits when resource usage control is in force and a process hits a limit. A process may also be forced into a premature exit when QED receives a TERM signal.

    The format of the completion record is as follows:

    
    <job>
    	<!-- job id -->
    	<jid>int</jid>
    
    	<!-- program file:  argv[0] -->
    	<command>string</command> 
    	<arg>string</arg> <!-- argv[1] -->
    	<!-- ... -->
    	<arg>string</arg> <!-- argv[n] -->
    
    	<!-- user id -->
    	<uid>int</uid>
    	<!-- group id -->
    	<gid>int</gid>
    
    	<start-time>string</start-time> <!-- timestamp of process creation -->
    
    	<pid>int</pid> <!-- process id -->
    
    	<exit-time>string</exit-time> <!-- timestamp of process exit -->
    		
    	<!-- exit status -->
    	<status>int</status>
    		
    	<!-- captured dump of process stdout up to a configurable limit -->
    	<stdout>string</stdout>
    
    	<!-- captured dump of process stderr up to a configurable limit -->
    	<stderr>string</stderr>
    
    	<!-- message explaining the reason for a forced exit, if applicable -->
    	<coerced-exit>string</coerced-exit>
    	
    </job>
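
For bookkeeping scripts, a certificate in the format above can be summarised as in the following Python sketch. The helper name is hypothetical, and it assumes that jid and status are always present while coerced-exit is optional.

```python
import xml.etree.ElementTree as ET
from typing import Any, Dict


def summarize_certificate(xml_text: str) -> Dict[str, Any]:
    """Return the job id, whether the job succeeded and any forced-exit reason."""
    job = ET.fromstring(xml_text)
    if job.tag != "job":
        raise ValueError(f"not a completion certificate: <{job.tag}>")
    status = int(job.findtext("status", default=""))
    return {
        "jid": int(job.findtext("jid", default="")),
        "succeeded": status == 0,  # zero exit status means success
        "coerced_exit": job.findtext("coerced-exit"),
    }
```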
    
    

    Configuration directives

    The QED configuration file is a collection of directives, most of which fall into the following categories: