Massive re-org to allow per-executor job specification formats and executor-specific task validation and expansion.

A few different renames to try and keep things more consistent.
This commit is contained in:
Ian Roddis
2021-09-03 09:10:38 -03:00
parent e746f8c163
commit d15580f47f
22 changed files with 509 additions and 300 deletions

README.md

@@ -58,7 +58,7 @@ DAG Run Definition
daggy works as a standalone library, but generally runs as a service with a REST interface. This documentation is
specifically for submitting DAGs to the REST server for execution (a DAG run).
DAGs are defined in JSON as a set of `tasks`, along with optional `taskParameters` and `executionParameters` (future).
DAGs are defined in JSON as a set of `tasks`, along with optional `job` and `executionParameters` (future).
Basic Definition
--
@@ -84,18 +84,22 @@ Below is an example DAG Run submission:
{
"tasks": {
"task_one": {
"command": [
"/usr/bin/touch",
"/tmp/somefile"
],
"job": {
"command": [
"/usr/bin/touch",
"/tmp/somefile"
]
},
"maxRetries": 3,
"retryIntervalSeconds": 30
},
"task_two": {
"command": [
"/usr/bin/touch",
"/tmp/someotherfile"
],
"job": {
"command": [
"/usr/bin/touch",
"/tmp/someotherfile"
]
},
"maxRetries": 3,
"retryIntervalSeconds": 30,
"parents": [
@@ -109,24 +113,25 @@ Below is an example DAG Run submission:
Task Parameters
--
Task commands can be parameterized by passing in an optional `taskParameters` member. Each parameter consists of a name
and either a string value, or an array of string values. Task commands will be regenerated based on the values of the
parameters.
Task commands can be parameterized by passing in an optional `parameters` member. Each parameter consists of a name and
either a string value, or an array of string values. Tasks will be regenerated based on the values of the parameters.
For instance:
```json
{
"taskParameters": {
"parameters": {
"DIRECTORY": "/var/tmp",
"FILE": "somefile"
},
"tasks": {
"task_one": {
"command": [
"/usr/bin/touch",
"{{DIRECTORY}}/{{FILE}}"
],
"job": {
"command": [
"/usr/bin/touch",
"{{DIRECTORY}}/{{FILE}}"
]
},
"maxRetries": 3,
"retryIntervalSeconds": 30
}
@@ -135,7 +140,7 @@ For instance:
```
`task_one`'s command, when run, will touch `/var/tmp/somefile`, since the values of `DIRECTORY` and `FILE` will be
populated from the `taskParameters` values.
populated from the `parameters` values.
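The substitution itself is plain string templating over the command's arguments. A minimal sketch of the idea (not daggy's actual implementation):

```python
import re

def substitute(command, parameters):
    """Replace each {{NAME}} placeholder in a command with its parameter value."""
    return [re.sub(r"\{\{(\w+)\}\}", lambda m: parameters[m.group(1)], arg)
            for arg in command]

substitute(["/usr/bin/touch", "{{DIRECTORY}}/{{FILE}}"],
           {"DIRECTORY": "/var/tmp", "FILE": "somefile"})
# → ["/usr/bin/touch", "/var/tmp/somefile"]
```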
In the case where a parameter has an array of values, any tasks referencing that value will be duplicated with the
cartesian product of all relevant values.
@@ -144,7 +149,7 @@ Example:
```json
{
"taskParameters": {
"parameters": {
"DIRECTORY": "/var/tmp",
"FILE": "somefile",
"DATE": [
@@ -155,22 +160,28 @@ Example:
},
"tasks": {
"populate_inputs": {
"command": [
"/usr/bin/touch",
"{{DIRECTORY}}/{{FILE}}"
]
"job": {
"command": [
"/usr/bin/touch",
"{{DIRECTORY}}/{{FILE}}"
]
}
},
"calc_date": {
"command": [
"/path/to/calculator",
"{{DIRECTORY}}/{{FILE}}",
"{{DATE}}"
]
"job": {
"command": [
"/path/to/calculator",
"{{DIRECTORY}}/{{FILE}}",
"{{DATE}}"
]
}
},
"generate_report": {
"command": [
"/path/to/generator"
]
"job": {
"command": [
"/path/to/generator"
]
}
}
}
}
```
@@ -200,37 +211,48 @@ graph LR
- `calc_date_2` will have the command `/path/to/calculator /var/tmp/somefile 2021-02-01`
- `calc_date_3` will have the command `/path/to/calculator /var/tmp/somefile 2021-03-01`
**NB**: When a task template resolves to multiple task instances, all of those new instances are still referred to by
the original name for the purposes of creating dependencies. For example, to add a dependency dynamically (see next
section), you must refer to `"children": [ "calc_date" ]`, not to the individual `calc_date_1`.
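The duplication rule described above can be sketched with `itertools.product`. This is not daggy's actual code; the `_1`, `_2` suffix scheme is assumed from the bullets above:

```python
import itertools
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def expand_task(name, command, parameters):
    """Duplicate one task over the cartesian product of every
    list-valued parameter its command references."""
    used = set(PLACEHOLDER.findall(" ".join(command)))
    list_params = sorted(k for k in used if isinstance(parameters.get(k), list))
    if not list_params:
        # Only scalar parameters: a single task, substituted directly.
        cmd = [PLACEHOLDER.sub(lambda m: str(parameters[m.group(1)]), arg)
               for arg in command]
        return {name: cmd}
    tasks = {}
    combos = itertools.product(*(parameters[k] for k in list_params))
    for i, combo in enumerate(combos, start=1):
        subs = {**parameters, **dict(zip(list_params, combo))}
        cmd = [PLACEHOLDER.sub(lambda m: str(subs[m.group(1)]), arg)
               for arg in command]
        tasks[f"{name}_{i}"] = cmd
    return tasks

expanded = expand_task(
    "calc_date",
    ["/path/to/calculator", "{{DIRECTORY}}/{{FILE}}", "{{DATE}}"],
    {"DIRECTORY": "/var/tmp", "FILE": "somefile",
     "DATE": ["2021-01-01", "2021-02-01", "2021-03-01"]},
)
# expanded has calc_date_1 .. calc_date_3, one per DATE value.
```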
Tasks Generating Tasks
----------------------
Some DAG structures cannot be known ahead of time, but only at runtime. For instance, if a job pulls multiple files
from a source, each of which can be processed independently, it would be nice if the DAG could modify itself on the fly
to accommodate that request.
Some DAG structures can only be fully known at runtime. For instance, if a job pulls multiple files from a source, each
of which can be processed independently, it would be nice if the DAG could modify itself on the fly to accommodate that
request.
Enter the `generator` task. If a task is defined with `"isGenerator": true`, the output of the task is assumed to be
a JSON dictionary containing new tasks to run. The new tasks will go through parameter expansion as described above,
and can freely define their dependencies the same way.
Enter the `generator` task. If a task is defined with `"isGenerator": true`, the output of the task is assumed to be a
JSON dictionary containing new tasks to run. The new tasks will go through parameter expansion as described above, using
the same parameter list as the original DAG. New tasks can define their own dependencies.
**NB:** Generated tasks won't have any child dependencies unless you define them. If there are parameterized
dependencies, you must use the name of the original task (e.g. use `calc_date`, not `calc_date_1`) to add a dependency.
**NB:** If you add a child dependency to a task that has already completed, weird things will happen. Don't do it.
**NB:** If you add a child dependency to a task that has already completed, that task won't restart. Best practice is to
create a dependency from the generator task to the task the new tasks will depend on.
```json
{
"tasks": {
"pull_files": {
"command": [
"/path/to/puller/script",
"{{DATE}}"
],
"job": {
"command": [
"/path/to/puller/script",
"{{DATE}}"
]
},
"isGenerator": true,
"children": [ "generate_report" ]
"children": [
"generate_report"
]
},
"generate_report": {
"command": [
"/path/to/generator"
]
"job": {
"command": [
"/path/to/generator"
]
}
}
}
}
```
@@ -245,20 +267,29 @@ The output of the puller task might be:
```json
{
"calc_date_a": {
"command": [
"/path/to/calculator",
"/path/to/data/file/a"
],
"children": ["generate_report"]
},
"calc_date_a": {
"job": {
"command": [
"/path/to/calculator",
"/path/to/data/file/a"
]
},
"children": [
"generate_report"
]
},
"calc_date_b": {
"command": [
"/path/to/calculator",
"/path/to/data/file/b"
],
"children": ["generate_report"]
}
"calc_date_b": {
"job": {
"command": [
"/path/to/calculator",
"/path/to/data/file/b"
]
},
"children": [
"generate_report"
]
}
}
```
@@ -272,8 +303,9 @@ graph LR
calc_file_a-->generate_report
calc_file_b-->generate_report
```
Note that it was important that `generate_report` depend on `pull_files`, otherwise the two tasks would
run concurrently, and `generate_report` wouldn't have any files to report on.
Note that it was important that `generate_report` depend on `pull_files`, otherwise the two tasks would run concurrently,
and `generate_report` wouldn't have any files to report on.
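From the generator's side, all that is required is printing the JSON dictionary of new tasks to stdout. A hypothetical puller sketch, where the file paths and `calc_date_` names follow the example output above and the discovery step is invented for illustration:

```python
#!/usr/bin/env python3
import json

# In a real puller these paths would be discovered at runtime,
# e.g. by listing the files that were just downloaded.
pulled_files = ["/path/to/data/file/a", "/path/to/data/file/b"]

tasks = {}
for i, path in enumerate(pulled_files):
    # calc_date_a, calc_date_b, ... one task per pulled file.
    tasks[f"calc_date_{chr(ord('a') + i)}"] = {
        "job": {"command": ["/path/to/calculator", path]},
        "children": ["generate_report"],
    }

print(json.dumps(tasks, indent=2))
```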
Execution Parameters
--
@@ -288,3 +320,15 @@ jobs on slurm with a specific set of restrictions, or allow for local execution
| Parameter | Description |
|-----------|-------------|
| pool | Names the executor the DAG should run on |
| poolParameters | Any parameters the executor accepts that might modify how a task is run |
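Since this feature is marked as future work, the exact shape is not fixed; a sketch of what a submission targeting a Slurm executor might look like, with the pool name and `poolParameters` contents invented for illustration:

```json
{
"executionParameters": {
"pool": "slurm",
"poolParameters": {
"partition": "batch"
}
}
}
```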
Executors
=========
Different executors require different structures for the `job` task member.
Default Job Values
------------------
A DAG can be submitted with the extra section `jobDefaults`. These values will be used to fill in default values for all
tasks if they aren't overridden. This can be useful for cases like Slurm execution, where tasks will share default
memory and runtime requirements.
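A sketch of a submission using `jobDefaults`; the field names under it are invented here to illustrate the Slurm-style memory and runtime defaults mentioned above:

```json
{
"jobDefaults": {
"memoryMB": 1024,
"timeLimitMinutes": 60
},
"tasks": {
"task_one": {
"job": {
"command": [
"/usr/bin/touch",
"/tmp/somefile"
]
}
}
}
}
```

Any task that sets one of these fields in its own `job` would override the default; unset fields fall back to the `jobDefaults` value.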