Daggy: Ya like dags?

Description
==

Daggy is a work orchestration framework for running workflows modeled as directed acyclic graphs (DAGs). These are
quite useful when modeling data ingestion / processing pipelines.

Below is an example workflow where data is pulled from three sources (A, B, C), some work is done on them, and a report
is generated.

Each step depends on the success of its upstream dependencies, e.g. `Derive_Data_AB` can't run until `Transform_A` and
`Transform_B` have completed successfully.

```mermaid
graph LR
Pull_A-->Transform_A;
Pull_B-->Transform_B;
Pull_C-->Transform_C;

Transform_A-->Derive_Data_AB;
Transform_B-->Derive_Data_AB;
Derive_Data_AB-->Derive_Data_ABC;
Transform_C-->Derive_Data_ABC;

Derive_Data_ABC-->Report;
```

Individual tasks (vertices) are run via a task executor. Daggy supports multiple executors, from a local executor (via
fork), to distributed work managers like [slurm](https://slurm.schedmd.com/overview.html)
or [kubernetes](https://kubernetes.io/) (planned), to daggy's own executor.

State is maintained via state loggers. Currently daggy supports an in-memory state manager (OStreamLogger) and
[RedisJSON](https://oss.redis.com/redisjson/).

Future plans include supporting [postgres](https://postgresql.org).

Building
==

**Requirements:**

- git
- cmake >= 3.14
- gcc >= 8
- libslurm (if building with slurm support)

```bash
git clone https://gitlab.com/iroddis/daggy
cd daggy
mkdir build
cd build
cmake [-DDAGGY_ENABLE_SLURM=ON] ..
make

tests/tests # for unit tests
```

DAG Run Definition
===

daggy works as a standalone library, but generally runs as a service with a REST interface. This documentation is
specifically for submitting DAGs to the REST server for execution (a DAG run).

DAGs are defined in JSON as a set of `tasks`, along with optional `parameters` and `executionParameters` (future) sections.

Basic Definition
--

A DAG Run definition consists of a dictionary that defines a set of tasks. Each task has the following attributes:

| Attribute            | Required | Description                                                    |
|----------------------|----------|----------------------------------------------------------------|
| name                 | Yes      | Name of this task. Must be unique.                             |
| command              | Yes      | The command to execute                                         |
| maxRetries           | No       | If a task fails, how many times to retry (default: 0)          |
| retryIntervalSeconds | No       | How many seconds to wait between retries                       |
| children             | No       | List of names of tasks that depend on this task                |
| parents              | No       | List of names of tasks that this task depends on               |
| isGenerator          | No       | The output of this task generates additional task definitions  |

Defining both `parents` and `children` is not required; one or the other is sufficient. Both are supported so that you
can define your task dependencies in whichever direction feels most natural.

Below is an example DAG Run submission:

```json
{
  "tasks": {
    "task_one": {
      "job": {
        "command": [
          "/bin/touch",
          "/tmp/somefile"
        ]
      },
      "maxRetries": 3,
      "retryIntervalSeconds": 30
    },
    "task_two": {
      "job": {
        "command": [
          "/bin/touch",
          "/tmp/someotherfile"
        ]
      },
      "maxRetries": 3,
      "retryIntervalSeconds": 30,
      "parents": [
        "task_one"
      ]
    }
  }
}
```

Task Parameters
--

Task commands can be parameterized by passing in an optional `parameters` member. Each parameter consists of a name and
either a string value or an array of string values. Tasks will be regenerated based on the values of the parameters.

For instance:

```json
{
  "parameters": {
    "DIRECTORY": "/var/tmp",
    "FILE": "somefile"
  },
  "tasks": {
    "task_one": {
      "job": {
        "command": [
          "/bin/touch",
          "{{DIRECTORY}}/{{FILE}}"
        ]
      },
      "maxRetries": 3,
      "retryIntervalSeconds": 30
    }
  }
}
```

`task_one`'s command, when run, will touch `/var/tmp/somefile`, since the values of `DIRECTORY` and `FILE` will be
populated from the `parameters` values.

In the case where a parameter has an array of values, any task referencing that parameter will be duplicated over the
Cartesian product of all relevant values.

Example:

```json
{
  "parameters": {
    "DIRECTORY": "/var/tmp",
    "FILE": "somefile",
    "DATE": [
      "2021-01-01",
      "2021-02-01",
      "2021-03-01"
    ]
  },
  "tasks": {
    "populate_inputs": {
      "job": {
        "command": [
          "/bin/touch",
          "{{DIRECTORY}}/{{FILE}}"
        ]
      }
    },
    "calc_date": {
      "job": {
        "command": [
          "/path/to/calculator",
          "{{DIRECTORY}}/{{FILE}}",
          "{{DATE}}"
        ]
      }
    },
    "generate_report": {
      "job": {
        "command": [
          "/path/to/generator"
        ]
      }
    }
  }
}
```

Conceptually, this DAG looks like this:

```mermaid
graph LR
populate_inputs-->calc_date
calc_date-->generate_report
```

Once the parameters have been populated, the new DAG will look like this:

```mermaid
graph LR
populate_inputs-->calc_date_1
populate_inputs-->calc_date_2
populate_inputs-->calc_date_3
calc_date_1-->generate_report
calc_date_2-->generate_report
calc_date_3-->generate_report
```

- `calc_date_1` will have the command `/path/to/calculator /var/tmp/somefile 2021-01-01`
- `calc_date_2` will have the command `/path/to/calculator /var/tmp/somefile 2021-02-01`
- `calc_date_3` will have the command `/path/to/calculator /var/tmp/somefile 2021-03-01`

**NB**: When a task template resolves to multiple task instances, all of those new instances are still referred to by
the original name for the purposes of creating dependencies. e.g. to add a dependency dynamically (see next section),
you must refer to `"children": [ "calc_date" ]`, not to the individual `calc_date_1`.

Tasks Generating Tasks
----------------------

Some DAG structures can only be fully known at runtime. For instance, if a job pulls multiple files from a source, each
of which can be processed independently, it would be nice if the DAG could modify itself on the fly to accommodate that
request.

Enter the `generator` task. If a task is defined with `"isGenerator": true`, the output of the task is assumed to be a
JSON dictionary containing new tasks to run. The new tasks will go through parameter expansion as described above, using
the same parameter list as the original DAG. New tasks can define their own dependencies.

**NB:** Generated tasks won't have any child dependencies unless you define them. If there are parameterized
dependencies, you must use the name of the original task (e.g. use `calc_date`, not `calc_date_1`) to add a dependency.

**NB:** If you add a child dependency to a task that has already completed, that task won't restart. Best practice is to
create a dependency from the generator task to the task that the newly generated tasks will depend on.

```json
{
  "tasks": {
    "pull_files": {
      "job": {
        "command": [
          "/path/to/puller/script",
          "{{DATE}}"
        ]
      },
      "isGenerator": true,
      "children": [
        "generate_report"
      ]
    },
    "generate_report": {
      "job": {
        "command": [
          "/path/to/generator"
        ]
      }
    }
  }
}
```

```mermaid
graph LR
pull_files-->generate_report
```

The output of the puller task might be:

```json
{
  "calc_date_a": {
    "job": {
      "command": [
        "/path/to/calculator",
        "/path/to/data/file/a"
      ]
    },
    "children": [
      "generate_report"
    ]
  },
  "calc_date_b": {
    "job": {
      "command": [
        "/path/to/calculator",
        "/path/to/data/file/b"
      ]
    },
    "children": [
      "generate_report"
    ]
  }
}
```

Once the first task runs, its output is parsed as additional tasks to run. The new DAG will look like this:

```mermaid
graph LR
pull_files-->generate_report
pull_files-->calc_date_a
pull_files-->calc_date_b
calc_date_a-->generate_report
calc_date_b-->generate_report
```

Note that it was important that `generate_report` depend on `pull_files`; otherwise the two tasks would run concurrently,
and `generate_report` wouldn't have any files to report on.

Execution Parameters
--
(future work)

The REST server can be configured with multiple pools of executors. For instance, it might be helpful to run certain
jobs on slurm with a specific set of restrictions, or to allow for local execution as well as execution on a slurm cluster.

`executionParameters` is a member passed in with the DAG submission that alters how the DAG is executed.

| Attribute      | Description                                                              |
|----------------|--------------------------------------------------------------------------|
| pool           | Names the executor pool the DAG should run on                            |
| poolParameters | Any parameters the executor accepts that might modify how a task is run  |

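Since this is future work, the exact shape of the payload isn't fixed; as a rough sketch, a submission using these parameters might look like the following. The pool name `slurm_batch` and the `partition` pool parameter are purely illustrative, and the pools and parameters actually available would depend on how the REST server is configured.

```json
{
  "executionParameters": {
    "pool": "slurm_batch",
    "poolParameters": {
      "partition": "short"
    }
  },
  "tasks": {
    "task_one": {
      "job": {
        "command": [ "/bin/echo", "hello" ]
      }
    }
  }
}
```
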
Default Job Values
------------------

A DAG can be submitted with the extra section `jobDefaults`. These values will be used to fill in default values for all
tasks if they aren't overridden. This can be useful for cases like Slurm execution, where tasks will share default
memory and runtime requirements.

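For example, here is a minimal sketch of a submission with defaults, assuming the `jobDefaults` entries use the same field names as the per-task `job` members (the task names, paths, and values are illustrative):

```json
{
  "jobDefaults": {
    "minMemoryMB": "1024",
    "timeLimitSeconds": "3600"
  },
  "tasks": {
    "quick_task": {
      "job": {
        "command": [ "/path/to/quick_step" ]
      }
    },
    "long_task": {
      "job": {
        "command": [ "/path/to/long_step" ],
        "timeLimitSeconds": "14400"
      }
    }
  }
}
```

Here `long_task` overrides the default time limit while inheriting the default memory requirement.
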
Executors
=========

Different executors require different structures for the `job` task member.

Local Executor (ForkingTaskExecutor)
------------------------------------

The ForkingTaskExecutor runs tasks on the local box, forking to run each task and using threads to monitor completion
and capture output.

| Field         | Sample                      | Description                                                       |
|---------------|-----------------------------|-------------------------------------------------------------------|
| command       | `[ "/bin/echo", "param1" ]` | The command to run                                                |
| commandString | `"/bin/echo param1"`        | The command to run as a string. Quoted args are properly handled. |
| environment   | `[ "DATE=2021-05-03" ]`     | Environment variables to set for script                           |

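A minimal sketch of a task whose `job` targets the local executor, using only the fields listed above (the task name and command are illustrative):

```json
{
  "tasks": {
    "say_hello": {
      "job": {
        "command": [ "/bin/echo", "hello" ],
        "environment": [ "DATE=2021-05-03" ]
      }
    }
  }
}
```
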
Slurm Executor (SlurmTaskExecutor)
----------------------------------

The slurm executor requires that the daggy server be running on a node capable of submitting jobs.

To enable slurm support, use `cmake -DDAGGY_ENABLE_SLURM=ON ..` when configuring the project.

Required `job` config values:

| Field            | Sample                      | Description                                                             |
|------------------|-----------------------------|-------------------------------------------------------------------------|
| command          | `[ "/bin/echo", "param1" ]` | The command to run on a slurm host                                      |
| commandString    | `"/bin/echo param1"`        | The command to run as a string. Quoted args are properly handled.       |
| environment      | `[ "DATE=2021-05-03" ]`     | Environment variables to set for script                                 |
| minCPUs          | `"1"`                       | Minimum number of CPUs required                                         |
| minMemoryMB      | `"1"`                       | Minimum memory required, in MB                                          |
| minTmpDiskMB     | `"1"`                       | Minimum temporary disk required, in MB                                  |
| priority         | `"100"`                     | Slurm priority                                                          |
| timeLimitSeconds | `"100"`                     | Number of seconds to allow the job to run for                           |
| userID           | `"1002"`                    | Numeric UID that the job should run as                                  |
| workDir          | `"/tmp/"`                   | Directory to use for work                                               |
| tmpDir           | `"/tmp/"`                   | Directory to use for temporary files, as well as stdout/stderr capture  |

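A sketch of a slurm-targeted `job`, using the fields from the table above (the task name, paths, and values are illustrative):

```json
{
  "tasks": {
    "calc_step": {
      "job": {
        "command": [ "/path/to/calculator", "2021-01-01" ],
        "minCPUs": "2",
        "minMemoryMB": "2048",
        "minTmpDiskMB": "100",
        "priority": "100",
        "timeLimitSeconds": "3600",
        "userID": "1002",
        "workDir": "/shared/work/",
        "tmpDir": "/shared/tmp/"
      }
    }
  }
}
```

Pointing `tmpDir` at a shared filesystem matters for output capture, as described below.
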
Daggy will submit the `command` to run, capturing the output in `${tmpDir}/${taskName}_{RANDOM}.{stderr,stdout}`. Those
files will then be read after the task has completed, and stored in the AttemptRecord for later retrieval.

For this reason, it's important that the `tmpDir` directory **be readable by the daggy engine**, i.e. in a distributed
environment it should be a shared filesystem. If this isn't the case, the job output will not be captured by daggy,
although it will still be available wherever it was written by slurm.

DaggyRunnerTaskExecutor
-----------------------

Daggy Runners (`daggyr` in this project) are daemons that can be run on remote hosts and then allocated work.

Tasks submitted to this type of runner require `cores` and `memoryMB` attributes. Remote runners have a specific
capacity that is consumed when tasks run on them. Right now those capacities are merely advisory; it's possible
to oversubscribe a runner, and the constraints are not enforced.

Enforcement via cgroups is planned.

| Field         | Sample                      | Description                                                        |
|---------------|-----------------------------|--------------------------------------------------------------------|
| command       | `[ "/bin/echo", "param1" ]` | The command to run                                                 |
| commandString | `"/bin/echo param1"`        | The command to run as a string. Quoted args are properly handled.  |
| environment   | `[ "DATE=2021-05-03" ]`     | Environment variables to set for script                            |
| cores         | `"1"`                       | Number of cores required by the task                               |
| memoryMB      | `"100"`                     | Amount of memory (RSS) required by the task, in MB                 |

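A sketch of a `job` for a Daggy Runner, again with an illustrative task name, command, and values, using only the fields from the table above:

```json
{
  "tasks": {
    "crunch_numbers": {
      "job": {
        "command": [ "/path/to/cruncher" ],
        "environment": [ "DATE=2021-05-03" ],
        "cores": "2",
        "memoryMB": "512"
      }
    }
  }
}
```
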
Loggers
=======

RedisJSON
---------