Adding support for isGenerator tasks

- Changing how the DAG is represented, both in code and in how DAGs are
  defined in JSON (see the example below).
- Removing the std::vector<Task> representation in favour of a map that
  enforces unique task names.
- Tasks now have a name (generated) and a definedName.
- Adding support for loggers to add tasks after a DAGRun has been
  initialized.
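In JSON terms, the change to how tasks are defined looks roughly like this (condensed from the README diff below):

Before, `tasks` was a list of objects, each carrying its own `name`:

```json
{
  "tasks": [
    { "name": "task_one", "command": [ "/usr/bin/touch", "/tmp/somefile" ] }
  ]
}
```

After, `tasks` is a dictionary keyed by the (unique) task name:

```json
{
  "tasks": {
    "task_one": { "command": [ "/usr/bin/touch", "/tmp/somefile" ] }
  }
}
```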
Ian Roddis
2021-08-30 22:05:37 -03:00
parent dd6159dda8
commit 2c00001e0b
22 changed files with 672 additions and 396 deletions

README.md

@@ -17,12 +17,12 @@ graph LR
Pull_A-->Transform_A;
Pull_B-->Transform_B;
Pull_C-->Transform_C;
Transform_A-->Derive_Data_AB;
Transform_B-->Derive_Data_AB;
Derive_Data_AB-->Derive_Data_ABC;
Transform_C-->Derive_Data_ABC;
Derive_Data_ABC-->Report;
```
@@ -65,14 +65,15 @@ Basic Definition
A DAG Run definition consists of a dictionary that defines a set of tasks. Each task has the following attributes:
| Attribute | Required | Description |
|------------|------------|--------------------------------------------------------|
| name | Yes | Name of this task. Must be unique. |
| command | Yes | The command to execute |
| maxRetries | No | If a task fails, how many times to retry (default: 0) |
| retry | No | How many seconds to wait between retries. |
| children | No | List of names of tasks that depend on this task |
| parents | No | List of names of tasks that this task depends on |
| Attribute | Required | Description |
|--------------|--------------|---------------------------------------------------------------|
| name | Yes | Name of this task. Must be unique. |
| command | Yes | The command to execute |
| maxRetries | No | If a task fails, how many times to retry (default: 0) |
| retryIntervalSeconds | No | How many seconds to wait between retries. |
| children | No | List of names of tasks that depend on this task |
| parents | No | List of names of tasks that this task depends on |
| isGenerator | No | The output of this task generates additional task definitions |
Defining both `parents` and `children` is not required; one or the other is sufficient. Both are supported so that you
can define your task dependencies in whichever way is most natural to how you think.
@@ -81,9 +82,8 @@ Below is an example DAG Run submission:
```json
{
"tasks": [
{
"name": "task_one",
"tasks": {
"task_one": {
"command": [
"/usr/bin/touch",
"/tmp/somefile"
@@ -91,8 +91,7 @@ Below is an example DAG Run submission:
"maxRetries": 3,
"retryIntervalSeconds": 30
},
{
"name": "task_two",
"task_two": {
"command": [
"/usr/bin/touch",
"/tmp/someotherfile"
@@ -103,7 +102,7 @@ Below is an example DAG Run submission:
"task_one"
]
}
]
}
}
```
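Since `parents` and `children` describe the same edge from opposite ends, the example above could equivalently be
written by giving `task_one` a `children` list instead; here is a minimal sketch (retry settings omitted for brevity):

```json
{
  "tasks": {
    "task_one": {
      "command": [
        "/usr/bin/touch",
        "/tmp/somefile"
      ],
      "children": [ "task_two" ]
    },
    "task_two": {
      "command": [
        "/usr/bin/touch",
        "/tmp/someotherfile"
      ]
    }
  }
}
```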
@@ -122,9 +121,8 @@ For instance:
"DIRECTORY": "/var/tmp",
"FILE": "somefile"
},
"tasks": [
{
"name": "task_one",
"tasks": {
"task_one": {
"command": [
"/usr/bin/touch",
"{{DIRECTORY}}/{{FILE}}"
@@ -132,9 +130,9 @@ For instance:
"maxRetries": 3,
"retryIntervalSeconds": 30
}
]
}
}
```
`task_one`'s command, when run, will touch `/var/tmp/somefile`, since the values of `DIRECTORY` and `FILE` will be
populated from the `taskParameters` values.
@@ -155,31 +153,28 @@ Example:
"2021-03-01"
]
},
"tasks": [
{
"name": "populate_inputs",
"tasks": {
"populate_inputs": {
"command": [
"/usr/bin/touch",
"{{DIRECTORY}}/{{FILE}}"
]
},
{
"name": "calc_date",
"calc_date": {
"command": [
"/path/to/calculator",
"{{DIRECTORY}}/{{FILE}}",
"{{DATE}}"
]
},
{
"name": "generate_report",
"generate_report": {
"command": [
"/path/to/generator"
]
}
]
}
}
```
Conceptually, this DAG looks like this:
@@ -205,6 +200,81 @@ graph LR
- `calc_date_2` will have the command `/path/to/calculator /var/tmp/somefile 2021-02-01`
- `calc_date_3` will have the command `/path/to/calculator /var/tmp/somefile 2021-03-01`
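To make the expansion concrete, here is a rough sketch of what `calc_date_2` and `calc_date_3` from the list above
effectively look like after parameter substitution (illustrative only; the runner's internal representation may differ):

```json
{
  "calc_date_2": {
    "command": [
      "/path/to/calculator",
      "/var/tmp/somefile",
      "2021-02-01"
    ]
  },
  "calc_date_3": {
    "command": [
      "/path/to/calculator",
      "/var/tmp/somefile",
      "2021-03-01"
    ]
  }
}
```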
Tasks Generating Tasks
----------------------
Some DAG structures cannot be known ahead of time, only at runtime. For instance, if a job pulls multiple files
from a source, each of which can be processed independently, it is useful for the DAG to be able to modify itself on
the fly to accommodate that work.
Enter the `generator` task. If a task is defined with `"isGenerator": true`, the output of the task is assumed to be
a JSON dictionary containing new tasks to run. The new tasks will go through parameter expansion as described above,
and can freely define their dependencies the same way.
**NB:** Generated tasks won't have any child dependencies unless you define them. If a dependency refers to a
parameterized task, use the name of the original task (e.g. `calc_date`, not `calc_date_1`) to add the dependency.
**NB:** If you add a child dependency to a task that has already completed, weird things will happen. Don't do it.
```json
{
"tasks": {
"pull_files": {
"command": [
"/path/to/puller/script",
"{{DATE}}"
],
"isGenerator": true,
"children": [ "generate_report" ]
},
"generate_report": {
"command": [
"/path/to/generator"
]
}
}
}
```
```mermaid
graph LR
pull_files-->generate_report
```
The output of the puller task might be:
```json
{
"calc_date_a": {
"command": [
"/path/to/calculator",
"/path/to/data/file/a"
],
"children": ["generate_report"]
},
"calc_date_b": {
"command": [
"/path/to/calculator",
"/path/to/data/file/b"
],
"children": ["generate_report"]
}
}
```
Once `pull_files` runs, its output is parsed as additional tasks to run. The new DAG will look like this:
```mermaid
graph LR
pull_files-->generate_report
pull_files-->calc_date_a
pull_files-->calc_date_b
calc_date_a-->generate_report
calc_date_b-->generate_report
```
Note that it is important that `generate_report` depends on `pull_files`; otherwise the two tasks would run
concurrently, and `generate_report` wouldn't have any files to report on.
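As noted above, if a generated task needs to depend on a parameterized task, it must reference the original task name.
Here is a hypothetical sketch of generator output for a DAG Run that also contains the parameterized `calc_date` task
from the earlier example (the `summarize_dates` name and its command are made up for illustration):

```json
{
  "summarize_dates": {
    "command": [
      "/path/to/summarizer"
    ],
    "parents": [ "calc_date" ]
  }
}
```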
Execution Parameters
--------------------
(future work)
@@ -217,4 +287,4 @@ jobs on slurm with a specific set of restrictions, or allow for local execution
| Attribute | Description |
|-----------|-------------|
| pool | Names the executor the DAG should run on |
| poolParameters | Any parameters the executor accepts that might modify how a task is run |
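Purely as an illustration of how this might eventually look in a DAG Run definition (aside from the `pool` and
`poolParameters` names taken from the table above, every key and value below is an assumption, since this feature is
not implemented yet):

```json
{
  "executionParameters": {
    "pool": "slurm_default",
    "poolParameters": {
      "partition": "short",
      "maxConcurrentTasks": 4
    }
  },
  "tasks": {
    "task_one": {
      "command": [
        "/usr/bin/touch",
        "/tmp/somefile"
      ]
    }
  }
}
```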