Top Tips for the UoM Condor Pool

This latest instalment of our Top Tips blog posts is sure to be of interest to users of our popular Condor Pool.  Ian Cottam, Research Software Engineer (RSE) in Research IT, has put together some handy tips for you to get the most out of this great computational resource.

Tip 1. Use macros and command line arguments to make your submit scripts more versatile and save on having to edit them between runs.

Here are two examples. First, if you are using the new feature of cloud bursting from our local Condor Pool to Amazon Web Services (AWS), an extra line is needed in your submit script. Rather than editing every script for the jobs that you want to allow to cloud burst, you can pass that line as an argument when you run condor_submit, like so:

    condor_submit -a "+MayUseAWS=True" submit.txt

The -a argument has the same effect as adding the given line just before the Queue line in the submit script.  For more information see our webpage on cloud bursting.

Relatedly, many Condor users edit their scripts to set the number of jobs to queue up, e.g. Queue 1000. Typically you need a small number for testing, say 1 to 5, and then, when all looks good, the full batch. To save edits, put this in your scripts:

    Queue $(qnum)

If qnum is not defined, it defaults to the empty string, and “Queue” on its own is the same as “Queue 1”. Again, use the -a argument to condor_submit, e.g.:

    condor_submit -a "qnum=3" submit.txt

    condor_submit -a "qnum=1000" submit.txt

Multiple -a arguments are allowed.
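
As a rough illustration of combining both tricks, here is a minimal shell sketch. The helper name submit_cmd and its optional third argument are hypothetical and not part of Condor. It only prints the condor_submit command line it would use, so you can check it before running anything, and it assumes your submit script ends with Queue $(qnum) as described above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Condor): prints, but does not run,
# a condor_submit command line built from several -a arguments.
submit_cmd() {
    qnum=$1            # number of jobs to queue
    script=$2          # submit script, e.g. submit.txt
    use_aws=${3:-no}   # pass "yes" to permit cloud bursting

    cmd="condor_submit -a \"qnum=$qnum\""
    if [ "$use_aws" = "yes" ]; then
        # Second -a argument: the extra line needed for cloud bursting.
        cmd="$cmd -a \"+MayUseAWS=True\""
    fi
    printf '%s %s\n' "$cmd" "$script"
}

# A small test batch first, then the full batch with cloud bursting allowed.
submit_cmd 3 submit.txt
submit_cmd 1000 submit.txt yes
```

Once the printed lines look right, copy them to the command line to submit for real.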

Tip 2. Some Condor Pool users are unsure what the H (Held) state means. Here is what it means and a tip for fixing things.

The Held state means something went wrong after your job was started, preventing Condor from completing it. At the time of writing some 1200 jobs are in this state. You, not the system, have to find and fix the problem. The simplest way is to use the status page, which has recently been restored to health.

If your job or jobs are Held, click on the link to your username and then keep clicking down until you reach the job’s details. There you will see the reason for the job being Held. A common one is getting the name or letter case of a filename wrong. Remove the Held jobs with condor_rm, fix the problem and simply resubmit.
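
If you prefer the command line to the status page, the same check-and-remove routine might look like the sketch below. It relies on the standard HTCondor tools condor_q (with its -hold option, which lists held jobs and their hold reasons) and condor_rm (with -constraint). The run and held_workflow helpers and the DRY_RUN switch are hypothetical names used for illustration only. By default the script just prints the commands; it is an assumption that both tools behave the same way on your submit machine.

```shell
#!/bin/sh
# Sketch: inspect and clear Held jobs from the command line.
# Prints the commands by default; set DRY_RUN=0 to actually run them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: echo the command in dry-run mode, else execute it.
# Note that the echoed form does not show the shell quoting.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

held_workflow() {
    owner=$1
    # List the owner's held jobs together with the reason for the hold.
    run condor_q -hold "$owner"
    # Remove that owner's held jobs (JobStatus 5 means Held).
    run condor_rm -constraint "Owner == \"$owner\" && JobStatus == 5"
}

held_workflow "${USER:-someuser}"
```

Read the hold reasons before removing anything, since the reason tells you what to fix before you resubmit.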

Although the above tips are written in Condor syntax, it is likely that something similar can be done with other HPC/HTC systems here and externally.

If you have any questions about using our Condor Pool then please get in touch.
