Daggy: Ya like dags?
==

Description
==

Daggy is a work orchestration framework for running workflows modeled as directed acyclic graphs (DAGs). These are
quite useful when modeling data ingestion / processing pipelines.

Below is an example workflow where data is pulled from three sources (A, B, C), some work is done on them, and a report
is generated.
Each step depends on the success of its upstream dependencies, e.g. `Derive_Data_AB` can't run until `Transform_A` and
`Transform_B` have completed successfully.
```mermaid
graph LR
Pull_A-->Transform_A;
Pull_B-->Transform_B;
Pull_C-->Transform_C;
Transform_A-->Derive_Data_AB;
Transform_B-->Derive_Data_AB;
Derive_Data_AB-->Derive_Data_ABC;
Transform_C-->Derive_Data_ABC;
Derive_Data_ABC-->Report;
```
Individual tasks (vertices) are run via a task executor. Daggy supports multiple executors, from a local executor (via
fork) to distributed work managers like [slurm](https://slurm.schedmd.com/overview.html)
or [kubernetes](https://kubernetes.io/) (planned).

State is maintained via state loggers. Currently daggy supports an in-memory state manager (OStreamLogger).
Future plans include supporting [redis](https://redis.io) and [postgres](https://postgresql.org).

Building
==
**Requirements:**

- git
- cmake >= 3.14
- gcc >= 8
- libslurm (if building with Slurm support)

```bash
git clone https://gitlab.com/iroddis/daggy
cd daggy
mkdir build
cd build
cmake [-DDAGGY_ENABLE_SLURM=ON] ..
make
tests/tests # for unit tests
```

DAG Run Definition
===

daggy works as a standalone library, but generally runs as a service with a REST interface. This documentation is
specifically for submitting DAGs to the REST server for execution (a DAG run).

DAGs are defined in JSON as a set of `tasks`, along with optional `parameters`, `jobDefaults`, and
`executionParameters` (future) sections.

Basic Definition
--

A DAG Run definition consists of a dictionary that defines a set of tasks. Each task has the following attributes:

| Attribute            | Required | Description                                                             |
|----------------------|----------|-------------------------------------------------------------------------|
| name                 | Yes      | Name of this task (the key in the `tasks` dictionary). Must be unique.  |
| job                  | Yes      | Executor-specific job definition, including the `command` to execute    |
| maxRetries           | No       | If a task fails, how many times to retry (default: 0)                   |
| retryIntervalSeconds | No       | How many seconds to wait between retries                                |
| children             | No       | List of names of tasks that depend on this task                         |
| parents              | No       | List of names of tasks that this task depends on                        |
| isGenerator          | No       | If true, the output of this task generates additional task definitions  |

Defining both `parents` and `children` is not required; one or the other is sufficient. Both are supported so that you
can define your task dependencies in whichever direction is most natural to how you think.

Below is an example DAG Run submission:
```json
{
  "tasks": {
    "task_one": {
      "job": {
        "command": [
          "/usr/bin/touch",
          "/tmp/somefile"
        ]
      },
      "maxRetries": 3,
      "retryIntervalSeconds": 30
    },
    "task_two": {
      "job": {
        "command": [
          "/usr/bin/touch",
          "/tmp/someotherfile"
        ]
      },
      "maxRetries": 3,
      "retryIntervalSeconds": 30,
      "parents": [
        "task_one"
      ]
    }
  }
}
```
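
The same dependency can equivalently be declared from the parent's side. A minimal sketch of the same DAG using
`children` instead of `parents` (retry settings omitted for brevity):

```json
{
  "tasks": {
    "task_one": {
      "job": {
        "command": [
          "/usr/bin/touch",
          "/tmp/somefile"
        ]
      },
      "children": [
        "task_two"
      ]
    },
    "task_two": {
      "job": {
        "command": [
          "/usr/bin/touch",
          "/tmp/someotherfile"
        ]
      }
    }
  }
}
```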

Task Parameters
--

Task commands can be parameterized by passing in an optional `parameters` member. Each parameter consists of a name and
either a single string value or an array of string values. Wherever a task's command references `{{NAME}}`, the value
of parameter `NAME` is substituted when the task is generated.

For instance:
```json
{
  "parameters": {
    "DIRECTORY": "/var/tmp",
    "FILE": "somefile"
  },
  "tasks": {
    "task_one": {
      "job": {
        "command": [
          "/usr/bin/touch",
          "{{DIRECTORY}}/{{FILE}}"
        ]
      },
      "maxRetries": 3,
      "retryIntervalSeconds": 30
    }
  }
}
```

`task_one`'s command, when run, will touch `/var/tmp/somefile`, since `{{DIRECTORY}}` and `{{FILE}}` in the `job`
command are replaced with the corresponding `parameters` values.

When a parameter has an array of values, any task referencing that parameter is duplicated, once for each entry in the
cartesian product of all the array-valued parameters it references.

Example:
```json
{
  "parameters": {
    "DIRECTORY": "/var/tmp",
    "FILE": "somefile",
    "DATE": [
      "2021-01-01",
      "2021-02-01",
      "2021-03-01"
    ]
  },
  "tasks": {
    "populate_inputs": {
      "job": {
        "command": [
          "/usr/bin/touch",
          "{{DIRECTORY}}/{{FILE}}"
        ]
      }
    },
    "calc_date": {
      "job": {
        "command": [
          "/path/to/calculator",
          "{{DIRECTORY}}/{{FILE}}",
          "{{DATE}}"
        ]
      },
      "parents": [
        "populate_inputs"
      ]
    },
    "generate_report": {
      "job": {
        "command": [
          "/path/to/generator"
        ]
      },
      "parents": [
        "calc_date"
      ]
    }
  }
}
```
Conceptually, this DAG looks like this:
```mermaid
graph LR
populate_inputs-->calc_date
calc_date-->generate_report
```
Once the parameters have been populated, the new DAG will look like this:
```mermaid
graph LR
populate_inputs-->calc_date_1
populate_inputs-->calc_date_2
populate_inputs-->calc_date_3
calc_date_1-->generate_report
calc_date_2-->generate_report
calc_date_3-->generate_report
```
- `calc_date_1` will have the command `/path/to/calculator /var/tmp/somefile 2021-01-01`
- `calc_date_2` will have the command `/path/to/calculator /var/tmp/somefile 2021-02-01`
- `calc_date_3` will have the command `/path/to/calculator /var/tmp/somefile 2021-03-01`
**NB**: When a task template resolves to multiple task instances, all of those instances are still referred to by the
original name for the purposes of creating dependencies. For example, to add a dependency dynamically (see the next
section), you must refer to `"children": [ "calc_date" ]`, not to the individual `calc_date_1`.
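
For illustration, after expansion the three `calc_date` instances would look roughly like the sketch below. The `_1`,
`_2`, `_3` suffixes follow the naming used above; the exact internal representation is an assumption, not documented
behaviour.

```json
{
  "calc_date_1": {
    "job": { "command": [ "/path/to/calculator", "/var/tmp/somefile", "2021-01-01" ] },
    "parents": [ "populate_inputs" ],
    "children": [ "generate_report" ]
  },
  "calc_date_2": {
    "job": { "command": [ "/path/to/calculator", "/var/tmp/somefile", "2021-02-01" ] },
    "parents": [ "populate_inputs" ],
    "children": [ "generate_report" ]
  },
  "calc_date_3": {
    "job": { "command": [ "/path/to/calculator", "/var/tmp/somefile", "2021-03-01" ] },
    "parents": [ "populate_inputs" ],
    "children": [ "generate_report" ]
  }
}
```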

Tasks Generating Tasks
----------------------

Some DAG structures can only be fully known at runtime. For instance, if a job pulls multiple files from a source, each
of which can be processed independently, it would be nice if the DAG could modify itself on the fly to accommodate
them.

Enter the `generator` task. If a task is defined with `"isGenerator": true`, the output of the task is assumed to be a
JSON dictionary containing new tasks to run. The new tasks go through parameter expansion as described above, using the
same parameter list as the original DAG. New tasks can define their own dependencies.

**NB:** Generated tasks won't have any child dependencies unless you define them. If a dependency refers to a
parameterized task, you must use the name of the original task (e.g. use `calc_date`, not `calc_date_1`).

**NB:** If you add a child dependency to a task that has already completed, that task won't restart. Best practice is to
create a dependency from the generator task to any task that will depend on the generated tasks, as in the example
below, where `generate_report` is a child of `pull_files`.
```json
{
  "tasks": {
    "pull_files": {
      "job": {
        "command": [
          "/path/to/puller/script",
          "{{DATE}}"
        ]
      },
      "isGenerator": true,
      "children": [
        "generate_report"
      ]
    },
    "generate_report": {
      "job": {
        "command": [
          "/path/to/generator"
        ]
      }
    }
  }
}
```
```mermaid
graph LR
pull_files-->generate_report
```
The output of the puller task might be:
```json
{
  "calc_file_a": {
    "job": {
      "command": [
        "/path/to/calculator",
        "/path/to/data/file/a"
      ]
    },
    "children": [
      "generate_report"
    ]
  },
  "calc_file_b": {
    "job": {
      "command": [
        "/path/to/calculator",
        "/path/to/data/file/b"
      ]
    },
    "children": [
      "generate_report"
    ]
  }
}
```
Once the generator task runs, its output is parsed as additional tasks to run. The new DAG will look like this:
```mermaid
graph LR
pull_files-->generate_report
pull_files-->calc_file_a
pull_files-->calc_file_b
calc_file_a-->generate_report
calc_file_b-->generate_report
```
Note that it is important that `generate_report` depend on `pull_files`; otherwise the two tasks would run concurrently,
and `generate_report` wouldn't have any files to report on.

Execution Parameters
--

(future work)

The REST server can be configured with multiple pools of executors. For instance, it might be helpful to run certain
jobs on slurm with a specific set of restrictions, or to allow for local execution as well as execution on a slurm
cluster.

`executionParameters` is an optional top-level member that alters how the DAG is executed.

| Attribute      | Description                                                              |
|----------------|--------------------------------------------------------------------------|
| pool           | Names the executor pool the DAG should run on                            |
| poolParameters | Any parameters the executor accepts that might modify how a task is run  |
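
As a sketch of how this could look in a submission once implemented (the pool name and the `poolParameters` keys below
are placeholders, not real configuration):

```json
{
  "executionParameters": {
    "pool": "slurm-default",
    "poolParameters": {
      "partition": "batch"
    }
  },
  "tasks": {
    "task_one": {
      "job": {
        "command": [ "/usr/bin/echo", "hello" ]
      }
    }
  }
}
```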

Default Job Values
------------------

A DAG can be submitted with an extra `jobDefaults` section. Its values are used as defaults for every task's `job`,
unless a task overrides them. This is useful for cases like Slurm execution, where most tasks share the same default
memory and runtime requirements.
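
A sketch of what this might look like for a Slurm run; the field names are taken from the Slurm executor table below,
and the key-by-key merge into each task's `job` shown here is an assumption:

```json
{
  "jobDefaults": {
    "minCPUs": "1",
    "minMemoryMB": "1024",
    "timeLimitSeconds": "3600"
  },
  "tasks": {
    "small_task": {
      "job": {
        "command": [ "/usr/bin/echo", "uses the defaults" ]
      }
    },
    "big_task": {
      "job": {
        "command": [ "/path/to/heavy/job" ],
        "minMemoryMB": "8192"
      }
    }
  }
}
```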

Executors
=========

Different executors require different structures for the `job` task member.

Local Executor (ForkingTaskExecutor)
------------------------------------

The ForkingTaskExecutor runs tasks on the local box, forking to run each task and using threads to monitor completion
and capture output.

| Field   | Sample                          | Description                           |
|---------|---------------------------------|---------------------------------------|
| command | `[ "/usr/bin/echo", "param1" ]` | The command to run on the local host  |

Slurm Executor (SlurmTaskExecutor)
----------------------------------

The slurm executor requires that the daggy server be running on a node capable of submitting slurm jobs.
To enable slurm support, use `cmake -DDAGGY_ENABLE_SLURM=ON ..` when configuring the project.

Required `job` config values:

| Field | Sample | Description |
|---------|--------|--------------|
| command | `[ "/usr/bin/echo", "param1" ]` | The command to run on a slurm host |
| minCPUs | `"1"` | Minimum number of CPUs required |
| minMemoryMB | `"1"` | Minimum memory required, in MB |
| minTmpDiskMB | `"1"` | Minimum temporary disk required, in MB |
| priority | `"100"` | Slurm priority |
| timeLimitSeconds | `"100"` | Number of seconds to allow the job to run for |
| userID | `"1002"` | Numeric UID that the job should run as |
| workDir | `"/tmp/"` | Directory to use for work |
| tmpDir | `"/tmp/"` | Directory to use for temporary files, as well as stdout/stderr capture |
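
Putting these fields together, a Slurm-executed task might look like the following sketch (the task name and values are
illustrative):

```json
{
  "tasks": {
    "convert_data": {
      "job": {
        "command": [ "/usr/bin/echo", "param1" ],
        "minCPUs": "1",
        "minMemoryMB": "1024",
        "minTmpDiskMB": "512",
        "priority": "100",
        "timeLimitSeconds": "3600",
        "userID": "1002",
        "workDir": "/tmp/",
        "tmpDir": "/tmp/"
      }
    }
  }
}
```
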
Daggy will submit the `command` to run, capturing the output in `${tmpDir}/${taskName}_{RANDOM}.{stderr,stdout}`. Those
files are read after the task has completed and stored in the AttemptRecord for later retrieval.

For this reason, it's important that the `tmpDir` directory **be readable by the daggy engine**, i.e. in a distributed
environment it should be a shared filesystem. If this isn't the case, the job output will not be captured by daggy,
although it will still be available wherever it was written by slurm.