Adding support for isGenerator tasks
- Changing how the DAG is represented, both in code and in how DAGs are defined in JSON.
- Removing the std::vector<Task> representation in favour of a map that will enforce unique task names.
- Tasks now have a name (generated) and a definedName.
- Adding support to loggers to add tasks after a DAGRun has been initialized.
132
README.md
@@ -17,12 +17,12 @@ graph LR
 Pull_A-->Transform_A;
 Pull_B-->Transform_B;
 Pull_C-->Transform_C;
-
+
 Transform_A-->Derive_Data_AB;
 Transform_B-->Derive_Data_AB;
 Derive_Data_AB-->Derive_Data_ABC;
 Transform_C-->Derive_Data_ABC;
-
+
 Derive_Data_ABC-->Report;
 ```
 
@@ -65,14 +65,15 @@ Basic Definition
 
 A DAG Run definition consists of a dictionary that defines a set of tasks. Each task has the following attributes:
 
-| Attribute  | Required   | Description                                            |
-|------------|------------|--------------------------------------------------------|
-| name       | Yes        | Name of this task. Must be unique.                     |
-| command    | Yes        | The command to execute                                 |
-| maxRetries | No         | If a task fails, how many times to retry (default: 0)  |
-| retry      | No         | How many seconds to wait between retries.              |
-| children   | No         | List of names of tasks that depend on this task        |
-| parents    | No         | List of names of tasks that this task depends on       |
+| Attribute    | Required     | Description                                                   |
+|--------------|--------------|---------------------------------------------------------------|
+| name         | Yes          | Name of this task. Must be unique.                            |
+| command      | Yes          | The command to execute                                        |
+| maxRetries   | No           | If a task fails, how many times to retry (default: 0)         |
+| retry        | No           | How many seconds to wait between retries.                     |
+| children     | No           | List of names of tasks that depend on this task               |
+| parents      | No           | List of names of tasks that this task depends on              |
+| isGenerator  | No           | The output of this task generates additional task definitions |
 
 Defining both `parents` and `children` is not required; one or the other is sufficient. Both are supported to allow you
 to define your task dependencies in the way that is most natural to how you think.
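
As a rough illustration of why either attribute is sufficient, the sketch below folds `parents` and `children` into a single edge map. The function name `build_edges` and the in-memory dictionary layout are assumptions made for this example; they are not taken from the project's code.

```python
def build_edges(tasks):
    """Return {parent: set_of_children} for a task dictionary.

    An edge may be declared from either side: `children` on the parent
    or `parents` on the child. Declaring it on both sides is harmless.
    """
    edges = {name: set() for name in tasks}
    for name, spec in tasks.items():
        for child in spec.get("children", []):
            if child not in tasks:
                raise KeyError(f"unknown task: {child}")
            edges[name].add(child)
        for parent in spec.get("parents", []):
            if parent not in tasks:
                raise KeyError(f"unknown task: {parent}")
            edges[parent].add(name)
    return edges


# Hypothetical in-memory form of a two-task submission.
tasks = {
    "task_one": {"command": ["/usr/bin/touch", "/tmp/somefile"]},
    "task_two": {
        "command": ["/usr/bin/touch", "/tmp/someotherfile"],
        "parents": ["task_one"],
    },
}
edges = build_edges(tasks)
print(edges)
```

Declaring `"children": ["task_two"]` on `task_one` instead would produce the same edge map.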
@@ -81,9 +82,8 @@ Below is an example DAG Run submission:
 
 ```json
 {
-  "tasks": [
-    {
-      "name": "task_one",
+  "tasks": {
+    "task_one": {
       "command": [
         "/usr/bin/touch",
         "/tmp/somefile"
@@ -91,8 +91,7 @@ Below is an example DAG Run submission:
       "maxRetries": 3,
       "retryIntervalSeconds": 30
     },
-    {
-      "name": "task_two",
+    "task_two": {
       "command": [
         "/usr/bin/touch",
         "/tmp/someotherfile"
@@ -103,7 +102,7 @@ Below is an example DAG Run submission:
         "task_one"
       ]
     }
-  ]
+  }
 }
 ```
 
@@ -122,9 +121,8 @@ For instance:
     "DIRECTORY": "/var/tmp",
     "FILE": "somefile"
   },
-  "tasks": [
-    {
-      "name": "task_one",
+  "tasks": {
+    "task_one": {
       "command": [
         "/usr/bin/touch",
         "{{DIRECTORY}}/{{FILE}}"
@@ -132,9 +130,9 @@ For instance:
       "maxRetries": 3,
       "retryIntervalSeconds": 30
     }
-  ]
+  }
 }
-```
+```
 
 `task_one`'s command, when run, will touch `/var/tmp/somefile`, since the values of `DIRECTORY` and `FILE` will be
 populated from the `taskParameters` values.
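
The substitution described above can be sketched in a few lines of Python. This is only an illustration of the `{{NAME}}` replacement idea: the helper `expand` and the regular expression are assumptions for this example, not the project's implementation.

```python
import re

# Matches {{NAME}} and captures NAME.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand(command, params):
    """Replace each {{NAME}} in every command argument with params[NAME]."""
    return [
        PLACEHOLDER.sub(lambda match: params[match.group(1)], arg)
        for arg in command
    ]


task_parameters = {"DIRECTORY": "/var/tmp", "FILE": "somefile"}
command = expand(["/usr/bin/touch", "{{DIRECTORY}}/{{FILE}}"], task_parameters)
print(command)  # ['/usr/bin/touch', '/var/tmp/somefile']
```

In this sketch a placeholder with no matching parameter raises `KeyError`; how the real parser treats that case is not stated here.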
@@ -155,31 +153,28 @@ Example:
       "2021-03-01"
     ]
   },
-  "tasks": [
-    {
-      "name": "populate_inputs",
+  "tasks": {
+    "populate_inputs": {
       "command": [
         "/usr/bin/touch",
         "{{DIRECTORY}}/{{FILE}}"
       ]
     },
-    {
-      "name": "calc_date",
+    "calc_date": {
       "command": [
         "/path/to/calculator",
         "{{DIRECTORY}}/{{FILE}}",
         "{{DATE}}"
       ]
     },
-    {
-      "name": "generate_report",
+    "generate_report": {
       "command": [
         "/path/to/generator"
       ]
     }
-  ]
+  }
 }
-```
+```
 
 Conceptually, this DAG looks like this:
 
@@ -205,6 +200,81 @@ graph LR
 - `calc_date_2` will have the command `/path/to/calculator /var/tmp/somefile 2021-02-01`
 - `calc_date_3` will have the command `/path/to/calculator /var/tmp/somefile 2021-03-01`
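
One way to picture the expansion listed above is a cartesian product over the parameters a command references, with one generated task per combination. The sketch below is an assumption-heavy illustration: the helper `expand_task`, the 1-based `_N` suffix (inferred from the bullets above), and the first date `2021-01-01` are all hypothetical and not confirmed by the project.

```python
import itertools
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand_task(name, command, params):
    """Return {generated_name: command}, one entry per parameter combination."""
    # Only parameters the command actually references take part.
    used = sorted({var for arg in command for var in PLACEHOLDER.findall(arg)})
    # Treat a scalar value as a one-element list.
    choices = [params[v] if isinstance(params[v], list) else [params[v]] for v in used]
    combos = list(itertools.product(*choices))
    expanded = {}
    for index, combo in enumerate(combos, start=1):
        values = dict(zip(used, combo))
        # Assumed naming: keep the defined name unless the task multiplies.
        generated = name if len(combos) == 1 else f"{name}_{index}"
        expanded[generated] = [
            PLACEHOLDER.sub(lambda m: values[m.group(1)], arg) for arg in command
        ]
    return expanded


params = {
    "DIRECTORY": "/var/tmp",
    "FILE": "somefile",
    "DATE": ["2021-01-01", "2021-02-01", "2021-03-01"],  # first value is assumed
}
expanded = expand_task(
    "calc_date",
    ["/path/to/calculator", "{{DIRECTORY}}/{{FILE}}", "{{DATE}}"],
    params,
)
for generated, command in expanded.items():
    print(generated, " ".join(command))
```

A task whose command references no list-valued parameter, such as `generate_report`, keeps its defined name in this sketch.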
 
+Tasks Generating Tasks
+----------------------
+
+Some DAG structures cannot be known ahead of time, but only at runtime. For instance, if a job pulls multiple files
+from a source, each of which can be processed independently, it would be nice if the DAG could modify itself on the fly
+to accommodate them.
+
+Enter the `generator` task. If a task is defined with `"isGenerator": true`, the output of the task is assumed to be
+a JSON dictionary containing new tasks to run. The new tasks will go through parameter expansion as described above,
+and can freely define their dependencies the same way.
+
+**NB:** Generated tasks won't have any children dependencies unless you define them. If there are parameterized
+dependencies, you must use the name of the original task (e.g. use `calc_date`, not `calc_date_1`) to add a dependency.
+
+**NB:** If you add a child dependency to a task that has already completed, weird things will happen. Don't do it.
+
+```json
+{
+  "tasks": {
+    "pull_files": {
+      "command": [
+        "/path/to/puller/script",
+        "{{DATE}}"
+      ],
+      "isGenerator": true,
+      "children": [ "generate_report" ]
+    },
+    "generate_report": {
+      "command": [
+        "/path/to/generator"
+      ]
+    }
+  }
+}
+```
+
+```mermaid
+graph LR
+  pull_files-->generate_report
+```
+
+The output of the puller task might be:
+
+```json
+{
+  "calc_date_a": {
+    "command": [
+      "/path/to/calculator",
+      "/path/to/data/file/a"
+    ],
+    "children": ["generate_report"]
+  },
+  "calc_date_b": {
+    "command": [
+      "/path/to/calculator",
+      "/path/to/data/file/b"
+    ],
+    "children": ["generate_report"]
+  }
+}
+```
+
+Once the first task runs, its output is parsed as additional tasks to run. The new DAG will look like this:
+
+```mermaid
+graph LR
+  pull_files-->generate_report
+  pull_files-->calc_date_a
+  pull_files-->calc_date_b
+  calc_date_a-->generate_report
+  calc_date_b-->generate_report
+```
+Note that it was important that `generate_report` depend on `pull_files`; otherwise the two tasks would
+run concurrently, and `generate_report` wouldn't have any files to report on.
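
To make the merge step concrete, here is a minimal sketch of folding a generator's output into an existing task map. The helper names `add_generated_tasks` and `edges` are hypothetical. The sketch also assumes that every generated task becomes a child of the generator that produced it, which is what the diagram above suggests, and that duplicate names are rejected because task names must be unique.

```python
import json


def add_generated_tasks(tasks, generator, output):
    """Merge the JSON printed by a generator task into `tasks`, in place."""
    new_tasks = json.loads(output)
    clashes = sorted(set(new_tasks) & set(tasks))
    if clashes:
        raise ValueError(f"task names must be unique, duplicates: {clashes}")
    for name, spec in new_tasks.items():
        # Assumption: a generated task always runs after its generator.
        parents = spec.setdefault("parents", [])
        if generator not in parents:
            parents.append(generator)
        tasks[name] = spec


def edges(tasks):
    """Return sorted (parent, child) pairs declared via children or parents."""
    pairs = set()
    for name, spec in tasks.items():
        pairs.update((name, child) for child in spec.get("children", []))
        pairs.update((parent, name) for parent in spec.get("parents", []))
    return sorted(pairs)


tasks = {
    "pull_files": {
        "command": ["/path/to/puller/script", "{{DATE}}"],
        "isGenerator": True,
        "children": ["generate_report"],
    },
    "generate_report": {"command": ["/path/to/generator"]},
}
# Stand-in for what the generator task wrote to its output.
output = json.dumps({
    "calc_date_a": {
        "command": ["/path/to/calculator", "/path/to/data/file/a"],
        "children": ["generate_report"],
    },
    "calc_date_b": {
        "command": ["/path/to/calculator", "/path/to/data/file/b"],
        "children": ["generate_report"],
    },
})
add_generated_tasks(tasks, "pull_files", output)
for parent, child in edges(tasks):
    print(f"{parent}-->{child}")
```

The printed edges correspond to the five arrows in the expanded diagram, only in sorted order.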
+
 Execution Parameters
 --
 (future work)
@@ -217,4 +287,4 @@ jobs on slurm with a specific set of restrictions, or allow for local execution
 | Attribute | Description |
 |-----------|-------------|
 | pool | Names the executor the DAG should run on |
-| poolParameters | Any parameters the executor accepts that might modify how a task is run |
+| poolParameters | Any parameters the executor accepts that might modify how a task is run |