Setting up a Condor cluster


Author: M. Shuaib Khan

Have you ever had to run multiple instances of the same application in sequence, with different input data each time, because the job was too computation-intensive and your machine was not powerful enough to run all the instances simultaneously? One solution is to harness the machines that are already connected to your local network and apply their unused CPU cycles to your projects. Condor, a specialized batch system for managing compute-intensive jobs, may be your answer.

Condor lets you queue multiple jobs, searches for free machines on the network (those with no keyboard activity, no load average, and no active Telnet users), submits jobs to them, and then returns the results to the machine from which they were submitted. Condor is a batch system: once a job is submitted, there is no interaction between the job and the user. Any input to the job must be in a file that is submitted along with the executable to the Condor pool, and all output produced during execution is written to a file that is sent back to the submitting machine as the result.

Setting up Condor

Before you set up a Condor pool, you need to know the four different roles a machine can play in your pool:

  • Central manager: The central manager collects information about the resources available to the pool, and negotiates between a machine that is submitting a job and the machine that will execute the job. Only one machine in a pool can play this role.
  • Execute machine: Any machine (including the central manager) configured to execute jobs submitted to the pool.
  • Submit machine: Any machine (including the central manager) configured to submit jobs to the pool.
  • Checkpoint server: Any one machine in the pool can act as a backup machine for the jobs running on the pool. Setting one up is optional, and for our basic pool, we are going to ignore it.

Start by deciding which machine will play the central manager role, and which of the remaining machines are going to be submit machines, execute machines, or both. For the simplest case, we’ll set up a pool of two machines. One will be the central manager and also a submit and execute machine; the other will be only a submit/execute machine. You can use the same procedure to set up Condor on as many machines as you want.

Before you set up Condor on a machine, create a Condor user on that machine whose home directory will hold Condor-related files, such as logs.

#groupadd condor
#useradd -m -g condor condor

Now copy the downloaded Condor tar archive into /home/condor and unpack it. Change into the unpacked directory, which I’ll refer to as the release directory, and run the condor_configure script to install Condor on the machine:


#./condor_configure --install --type=execute,submit,manager --local-dir=/home/condor --verbose

The command above configures the central manager. To configure a submit/execute machine, use slightly different syntax:


#./condor_configure --install --type=execute,submit --local-dir=/home/condor --central-manager=<hostname of central manager> --verbose

If you ever want to change the configuration of Condor on a machine, you can run the script again.

Open the etc/condor_config file in the release directory. Set the LOCAL_DIR variable to /home/condor, and set the HOSTALLOW_WRITE variable to an appropriate value (e.g. ‘*’, which grants write access to every host). Make sure /dev/mouse points to your mouse device and /var/run/utmp points to the utmp file on your machine. Next, edit the /home/condor/condor_config.local file and set CONDOR_IDS to ‘0.0’. This tells Condor to run its daemons as root.
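
After editing, it can help to read the values back and confirm that they are what you intended. The sketch below is one way to do that with standard tools. show_setting is a hypothetical helper, not part of Condor, and it assumes the simple "NAME = value" line format, an exact-case match on the name, and that the last assignment in a file is the one that takes effect.

```shell
#!/bin/sh
# show_setting FILE NAME
# Hypothetical helper (not a Condor command): print the value assigned to
# NAME in a file made of "NAME = value" lines.
show_setting() {
    file=$1
    name=$2
    # Strip everything up to and including the "=" on matching lines.
    # tail keeps only the last match, assuming later assignments win.
    sed -n "s/^[[:space:]]*$name[[:space:]]*=[[:space:]]*//p" "$file" | tail -n 1
}
```

For example, with RELEASE_DIR standing in for wherever you unpacked the archive, show_setting "$RELEASE_DIR/etc/condor_config" LOCAL_DIR should print /home/condor.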

Copy the files in the bin subdirectory of the release directory to a well-known location (such as /local/bin) so that Condor users have access to them, and copy the files in the sbin subdirectory to a location that only the administrator has in his path.

Now you’re ready to run condor_master on each machine to start the daemons:

#condor_master

On the central manager, running ps aux | egrep condor_ should show the following daemons:

  • condor_master
  • condor_collector
  • condor_negotiator
  • condor_startd
  • condor_schedd

On other machines, the following daemons should be running:

  • condor_master
  • condor_startd
  • condor_schedd

If you don’t see these daemons running, there is a problem with your configuration. Look at /home/condor/log/MasterLog to try to figure out what might be wrong.
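
Scanning ps output by eye gets tedious once the pool has more than a couple of machines. The following sketch automates the comparison against the daemon lists above. missing_daemons is a hypothetical helper, not part of Condor: it reads a process listing on standard input and prints each expected name that does not appear in it. It assumes a grep that supports the -w option.

```shell
#!/bin/sh
# missing_daemons NAME...
# Hypothetical helper (not a Condor command): read a process listing on
# stdin and print every expected daemon name that is absent from it.
missing_daemons() {
    listing=$(cat)
    for daemon in "$@"; do
        # -w matches whole words only, so a longer name that merely
        # contains the daemon name does not count as a match.
        if ! printf '%s\n' "$listing" | grep -qw "$daemon"; then
            echo "$daemon"
        fi
    done
}
```

On a submit/execute machine you might run ps aux | missing_daemons condor_master condor_startd condor_schedd; no output means all three were found.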

You can run condor_status on any machine to list the machines that are currently in the pool, and their status (claimed, unclaimed, etc.).

Once Condor is running, it’s time to put it to use. To submit a job to Condor, you need to write a description file for it. Writing one is easy, as the example below shows.


#Example description file foo.cmd for job foo
Executable = foo
Universe = vanilla
input = test.data
output = foo.out
error = foo.error
Log = foo.log

Queue

The Executable variable points to the program to run (it’s a good idea to specify the absolute path to the executable). Universe = vanilla selects Condor’s vanilla universe, which runs ordinary programs without relinking them against Condor’s libraries. input names the file from which foo takes its standard input, output names the file that receives foo’s standard output, and error names the file that receives anything foo writes to standard error. Condor writes a log of what happened during the job to the file named by Log.
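
Because there is no way to interact with a job once it is in the pool, it can be worth checking first that the program behaves when its input and output are wired to files in the same way. The sketch below does that locally with plain shell redirection. run_locally is a hypothetical helper, not a Condor command, and it only imitates the input, output, and error entries; it does not involve Condor at all.

```shell
#!/bin/sh
# run_locally EXECUTABLE INPUT OUTPUT ERROR
# Hypothetical helper (not a Condor command): run a program with stdin
# taken from INPUT, stdout written to OUTPUT, and stderr written to ERROR,
# mirroring the input/output/error entries of a description file.
run_locally() {
    exe=$1
    input=$2
    output=$3
    error=$4
    "$exe" < "$input" > "$output" 2> "$error"
}
```

For the example job, run_locally ./foo test.data foo.out foo.error should leave the same kind of foo.out and foo.error files behind that you would expect back from the pool.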

Now you can submit the description file as a Condor job:

$condor_submit foo.cmd

If you would like to run multiple instances of the same job with a different input file for each instance, here is how to write the description file:


#Example 2:
Executable = foo
Error = error.$(Process)
Input = input.$(Process)
Output = output.$(Process)
Log = foo.log

Queue 100

Note the entry Queue 100. It tells Condor to run 100 instances of the job. $(Process) expands to the instance number, which runs from 0 to 99, so the first instance reads input.0 and writes output.0 and error.0, and the last reads input.99 and writes output.99 and error.99.
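
Every instance needs its input file to exist before you submit, and an off-by-one in the numbering is easy to make. As a rough sketch, the loop below creates one numbered input file per instance, starting at 0. make_inputs is a hypothetical helper, and the placeholder line it writes stands in for whatever data your job actually reads.

```shell
#!/bin/sh
# make_inputs DIRECTORY COUNT
# Hypothetical helper (not a Condor command): create input.0 through
# input.(COUNT-1) in DIRECTORY, one file per queued instance.
make_inputs() {
    dir=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # Numbering starts at 0 to match $(Process); the contents here
        # are placeholders for real job data.
        printf 'placeholder data for instance %d\n' "$i" > "$dir/input.$i"
        i=$((i + 1))
    done
}
```

Calling make_inputs . 100 in the submit directory produces input.0 through input.99, matching Queue 100.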

To check the Condor queue and have a look at the status of the jobs being submitted, run:

$condor_q

To remove a job from the queue, use the job ID that condor_q returns:

$condor_rm <job_id>

Conclusion

Condor is a powerful yet easy-to-use software system for managing a cluster of workstations. You can configure it in various ways, for example to run jobs only at night, only on particular machines, or only on machines with particular resources. The owner of any machine in the Condor pool can adjust its configuration to his liking, so that the jobs executed on his machine are of a particular type or run only at a particular time. Turn to the official documentation for ways to tune Condor for your needs.