Adding support for execution on slurm grids

- Adding support for SlurmTaskExecutor in `daggyd` if `DAGGY_ENABLE_SLURM` is defined.
- Renaming some test cases
- Enabling compile-time slurm support
- Adding slurm documentation
Ian Roddis
2021-09-10 10:53:58 -03:00
parent d15580f47f
commit d731f9f5b1
19 changed files with 460 additions and 31 deletions


@@ -321,14 +321,52 @@ jobs on slurm with a specific set of restrictions, or allow for local execution
| Field | Description |
|----------------|-------------|
| pool | Names the executor the DAG should run on |
| poolParameters | Any parameters the executor accepts that might modify how a task is run |

Default Job Values
------------------

A DAG can be submitted with an extra `jobDefaults` section. These values are used as defaults for any task fields that aren't overridden by the task itself. This can be useful for cases like Slurm execution, where tasks will often share default memory and runtime requirements.
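As a sketch of how this might look (assuming a JSON representation of the DAG; the serialization format, the `tasks` wrapper, and the task name aren't shown in this excerpt and are illustrative only):

```json
{
  "jobDefaults": {
    "minMemoryMB": "2048",
    "timeLimitSeconds": "3600"
  },
  "tasks": {
    "buildReport": {
      "job": {
        "command": [ "/usr/bin/echo", "param1" ],
        "timeLimitSeconds": "600"
      }
    }
  }
}
```

Here the hypothetical `buildReport` task inherits `minMemoryMB` from `jobDefaults` but overrides `timeLimitSeconds` with its own value.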

Executors
=========

Different executors require different structures for the `job` task member.

Local Executor (ForkingTaskExecutor)
------------------------------------

The ForkingTaskExecutor runs tasks on the local machine, forking to run each task and using threads to monitor completion and capture output.

| Field | Sample | Description |
|---------|--------|--------------|
| command | `[ "/usr/bin/echo", "param1" ]` | The command to run on the local host |
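A minimal `job` section for the local executor might therefore look like this (JSON shown for illustration; only `command` appears in the table above):

```json
{
  "job": {
    "command": [ "/usr/bin/echo", "param1" ]
  }
}
```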

Slurm Executor (SlurmTaskExecutor)
----------------------------------

The slurm executor requires that the daggy server be running on a node capable of submitting jobs.
To enable slurm support, configure the project with `cmake -DDAGGY_ENABLE_SLURM=ON ..`.

Required `job` config values:

| Field | Sample | Description |
|---------|--------|--------------|
| command | `[ "/usr/bin/echo", "param1" ]` | The command to run on a slurm host |
| minCPUs | `"1"` | Minimum number of CPUs required |
| minMemoryMB | `"1"` | Minimum memory required, in MB |
| minTmpDiskMB | `"1"` | Minimum temporary disk required, in MB |
| priority | `"100"` | Slurm priority |
| timeLimitSeconds | `"100"` | Number of seconds to allow the job to run for |
| userID | `"1002"` | Numeric UID that the job should run as |
| workDir | `"/tmp/"` | Directory to use for work |
| tmpDir | `"/tmp/"` | Directory to use for temporary files, as well as stdout/stderr capture |
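Putting the table together, a complete slurm `job` section might look like the following sketch (JSON shown for illustration; the command path is hypothetical, and all values are strings as in the samples above):

```json
{
  "job": {
    "command": [ "/usr/bin/my_analysis", "--input", "in.txt" ],
    "minCPUs": "1",
    "minMemoryMB": "2048",
    "minTmpDiskMB": "100",
    "priority": "100",
    "timeLimitSeconds": "3600",
    "userID": "1002",
    "workDir": "/tmp/",
    "tmpDir": "/tmp/"
  }
}
```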

Daggy will submit the `command` to run, capturing the output in `${tmpDir}/${taskName}_{RANDOM}.{stderr,stdout}`. Those files are read after the task has completed and stored in the AttemptRecord for later retrieval.
For this reason, it's important that the `tmpDir` directory **be readable by the daggy engine**; in a distributed environment, it should be on a shared filesystem. If it isn't, the job output will not be captured by daggy, although it will still be available wherever slurm wrote it.