Massive re-org to allow per-executor job specification formats and executor-specific task validation and expansion.
A few different renames to try to keep things more consistent.

README.md (166 changed lines)
@@ -58,7 +58,7 @@ DAG Run Definition
 daggy works as a standalone library, but generally runs as a service with a REST interface. This documentation is
 specifically for submitting DAGs to the REST server for execution (a DAG run).
 
-DAGs are defined in JSON as a set of `tasks`, along with optional `taskParameters` and `executionParameters` (future).
+DAGs are defined in JSON as a set of `tasks`, along with optional `job` and `executionParameters` (future).
 
 Basic Definition
 --
@@ -84,18 +84,22 @@ Below is an example DAG Run submission:
     {
         "tasks": {
             "task_one": {
-                "command": [
-                    "/usr/bin/touch",
-                    "/tmp/somefile"
-                ],
+                "job": {
+                    "command": [
+                        "/usr/bin/touch",
+                        "/tmp/somefile"
+                    ]
+                },
                 "maxRetries": 3,
                 "retryIntervalSeconds": 30
             },
             "task_two": {
-                "command": [
-                    "/usr/bin/touch",
-                    "/tmp/someotherfile"
-                ],
+                "job": {
+                    "command": [
+                        "/usr/bin/touch",
+                        "/tmp/someotherfile"
+                    ]
+                },
                 "maxRetries": 3,
                 "retryIntervalSeconds": 30,
                 "parents": [
@@ -109,24 +113,25 @@ Below is an example DAG Run submission:
 Task Parameters
 --
 
-Task commands can be parameterized by passing in an optional `taskParameters` member. Each parameter consists of a name
-and either a string value, or an array of string values. Task commands will be regenerated based on the values of the
-parameters.
+Task commands can be parameterized by passing in an optional `parameters` member. Each parameter consists of a name and
+either a string value, or an array of string values. Tasks will be regenerated based on the values of the parameters.
 
 For instance:
 
 ```json
 {
-    "taskParameters": {
+    "parameters": {
         "DIRECTORY": "/var/tmp",
         "FILE": "somefile"
     },
     "tasks": {
         "task_one": {
-            "command": [
-                "/usr/bin/touch",
-                "{{DIRECTORY}}/{{FILE}}"
-            ],
+            "job": {
+                "command": [
+                    "/usr/bin/touch",
+                    "{{DIRECTORY}}/{{FILE}}"
+                ]
+            },
             "maxRetries": 3,
             "retryIntervalSeconds": 30
         }
@@ -135,7 +140,7 @@ For instance:
 ```
 
 `task_one`'s command, when run, will touch `/var/tmp/somefile`, since the values of `DIRECTORY` and `FILE` will be
-populated from the `taskParameters` values.
+populated from the `job` values.
 
 In the case where a parameter has an array of values, any tasks referencing that value will be duplicated with the
 cartesian product of all relevant values.
@@ -144,7 +149,7 @@ Example:
 
 ```json
 {
-    "taskParameters": {
+    "job": {
         "DIRECTORY": "/var/tmp",
         "FILE": "somefile",
         "DATE": [
@@ -155,22 +160,28 @@ Example:
     },
     "tasks": {
         "populate_inputs": {
-            "command": [
-                "/usr/bin/touch",
-                "{{DIRECTORY}}/{{FILE}}"
-            ]
+            "job": {
+                "command": [
+                    "/usr/bin/touch",
+                    "{{DIRECTORY}}/{{FILE}}"
+                ]
+            }
        },
        "calc_date": {
-            "command": [
-                "/path/to/calculator",
-                "{{DIRECTORY}}/{{FILE}}",
-                "{{DATE}}"
-            ]
+            "job": {
+                "command": [
+                    "/path/to/calculator",
+                    "{{DIRECTORY}}/{{FILE}}",
+                    "{{DATE}}"
+                ]
+            }
        },
        "generate_report": {
-            "command": [
-                "/path/to/generator"
-            ]
+            "job": {
+                "command": [
+                    "/path/to/generator"
+                ]
+            }
        }
    }
}
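The cartesian-product expansion described in the Task Parameters section can be sketched roughly as follows. This is a hypothetical helper, not daggy's actual implementation; it assumes the new `job`/`parameters` layout and the `{{NAME}}` placeholder syntax from the examples, and the numeric `_1`, `_2` name suffixes described below:

```python
import itertools
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def expand_tasks(parameters, tasks):
    """Expand {{NAME}} placeholders in each task's job command.

    String-valued parameters substitute in place.  If a task's command
    references any array-valued parameters, the task is duplicated once per
    element of the cartesian product of those arrays, with a numeric suffix
    appended to the task name.
    """
    strings = {k: v for k, v in parameters.items() if isinstance(v, str)}
    expanded = {}
    for name, task in tasks.items():
        command = task["job"]["command"]
        # Array-valued parameters this command actually references.
        arrays = sorted({
            p for arg in command for p in PLACEHOLDER.findall(arg)
            if isinstance(parameters.get(p), list)
        })

        def render(values):
            merged = {**strings, **values}
            return [
                PLACEHOLDER.sub(lambda m: merged.get(m.group(1), m.group(0)), arg)
                for arg in command
            ]

        if not arrays:
            expanded[name] = {**task, "job": {**task["job"], "command": render({})}}
            continue
        combos = itertools.product(*(parameters[p] for p in arrays))
        for i, combo in enumerate(combos, start=1):
            job = {**task["job"], "command": render(dict(zip(arrays, combo)))}
            expanded[f"{name}_{i}"] = {**task, "job": job}
    return expanded
```

Run against the `calc_date` example above, this sketch yields `calc_date_1` through `calc_date_3`, one per `DATE` value, with `{{DIRECTORY}}/{{FILE}}` resolved to `/var/tmp/somefile` in each.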
@@ -200,37 +211,48 @@ graph LR
 - `calc_date_2` will have the command `/path/to/calculator /var/tmp/somefile 2021-02-01`
 - `calc_date_3` will have the command `/path/to/calculator /var/tmp/somefile 2021-03-01`
 
+**NB**: When a task template resolves to multiple task instances, all of those new instances are still referred to by
+the original name for the purposes of creating dependencies. e.g. to add a dependency dynamically (see next section),
+you must refer to `"children": [ "calc_date" ]`, not to the individual `calc_date_1`.
+
 Tasks Generating Tasks
 ----------------------
 
-Some DAG structures cannot be known ahead of time, but only at runtime. For instance, if a job pulls multiple files
-from a source, each of which can be processed independently, it would be nice if the DAG could modify itself on the fly
-to accomodate that request.
+Some DAG structures can only be fully known at runtime. For instance, if a job pulls multiple files from a source, each
+of which can be processed independently, it would be nice if the DAG could modify itself on the fly to accommodate that
+request.
 
-Enter the `generator` task. If a task is defined with `"isGenerator": true`, the output of the task is assumed to be
-a JSON dictionary containing new tasks to run. The new tasks will go through parameter expansion as described above,
-and can freely define their dependencies the same way.
+Enter the `generator` task. If a task is defined with `"isGenerator": true`, the output of the task is assumed to be a
+JSON dictionary containing new tasks to run. The new tasks will go through parameter expansion as described above, using
+the same parameter list as the original DAG. New tasks can define their own dependencies.
 
 **NB:** Generated tasks won't have any children dependencies unless you define them. If there are parameterized
 dependencies, you must use the name of the original task (e.g. use `calc_date`, not `calc_date_1`) to add a dependency.
 
-**NB:** If you add a child dependency to a task that has already completed, weird things will happen. Don't do it.
+**NB:** If you add a child dependency to a task that has already completed, that task won't restart. Best practice is to
+create a dependency from the generator task to the task the new tasks will depend on.
 
 ```json
 {
     "tasks": {
         "pull_files": {
-            "command": [
-                "/path/to/puller/script",
-                "{{DATE}}"
-            ],
+            "job": {
+                "command": [
+                    "/path/to/puller/script",
+                    "{{DATE}}"
+                ]
+            },
             "isGenerator": true,
-            "children": [ "generate_report" ]
+            "children": [
+                "generate_report"
+            ]
        },
        "generate_report": {
-            "command": [
-                "/path/to/generator"
-            ]
+            "job": {
+                "command": [
+                    "/path/to/generator"
+                ]
+            }
        }
    }
}
@@ -245,20 +267,29 @@ The output of the puller task might be:
 
 ```json
 {
-    "calc_date_a": {
-        "command": [
-            "/path/to/calculator",
-            "/path/to/data/file/a"
-        ],
-        "children": ["generate_report"]
-    },
-    "calc_date_b": {
-        "command": [
-            "/path/to/calculator",
-            "/path/to/data/file/b"
-        ],
-        "children": ["generate_report"]
-    }
+    "calc_date_a": {
+        "job": {
+            "command": [
+                "/path/to/calculator",
+                "/path/to/data/file/a"
+            ]
+        },
+        "children": [
+            "generate_report"
+        ]
+    },
+    "calc_date_b": {
+        "job": {
+            "command": [
+                "/path/to/calculator",
+                "/path/to/data/file/b"
+            ]
+        },
+        "children": [
+            "generate_report"
+        ]
+    }
 }
 ```
@@ -272,8 +303,9 @@ graph LR
     calc_file_a-->generate_report
     calc_file_b-->generate_report
 ```
-Note that it was important that `generate_report` depend on `pull_files`, otherwise the two task would
-run concurrently, and the `generate_report` wouldn't have any files to report on.
+
+Note that it was important that `generate_report` depend on `pull_files`, otherwise the two tasks would run
+concurrently, and `generate_report` wouldn't have any files to report on.
 
 Execution Parameters
 --
@@ -288,3 +320,15 @@ jobs on slurm with a specific set of restrictions, or allow for local execution
 |-----------|-------------|
 | pool | Names the executor the DAG should run on |
 | poolParameters | Any parameters the executor accepts that might modify how a task is run |
+
+Executors
+=========
+
+Different executors require different structures for the `job` task member.
+
+Default Job Values
+------------------
+
+A DAG can be submitted with the extra section `jobDefaults`. These values will be used to fill in default values for all
+tasks if they aren't overridden. This can be useful for cases like Slurm execution, where tasks will share default
+memory and runtime requirements.
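The `jobDefaults` behavior added here amounts to a shallow merge of default values under each task's own `job` section. A minimal sketch, assuming plain JSON-style dicts; the `memoryMb` and `runtimeMinutes` keys in the example are invented for illustration, not daggy's actual job schema:

```python
def apply_job_defaults(job_defaults, tasks):
    """Fill in default job values for every task.

    A task's own job settings win; defaults only supply missing keys.
    """
    return {
        name: {**task, "job": {**job_defaults, **task.get("job", {})}}
        for name, task in tasks.items()
    }
```

Because the merge is shallow, a task that sets any one key keeps its own value while still inheriting the remaining defaults.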