Massive re-org to allow per-executor job specification formats and executor-specific task validation and expansion.

A few different renames to try and keep things more consistent.
This commit is contained in:
Ian Roddis
2021-09-03 09:10:38 -03:00
parent e746f8c163
commit d15580f47f
22 changed files with 509 additions and 300 deletions

README.md

@@ -58,7 +58,7 @@ DAG Run Definition
daggy works as a standalone library, but generally runs as a service with a REST interface. This documentation is
specifically for submitting DAGs to the REST server for execution (a DAG run).
DAGs are defined in JSON as a set of `tasks`, along with optional `taskParameters` and `executionParameters` (future).
DAGs are defined in JSON as a set of `tasks`, along with optional `job` and `executionParameters` (future).
Basic Definition
--
@@ -84,18 +84,22 @@ Below is an example DAG Run submission:
{
"tasks": {
"task_one": {
"command": [
"/usr/bin/touch",
"/tmp/somefile"
],
"job": {
"command": [
"/usr/bin/touch",
"/tmp/somefile"
]
},
"maxRetries": 3,
"retryIntervalSeconds": 30
},
"task_two": {
"command": [
"/usr/bin/touch",
"/tmp/someotherfile"
],
"job": {
"command": [
"/usr/bin/touch",
"/tmp/someotherfile"
]
},
"maxRetries": 3,
"retryIntervalSeconds": 30,
"parents": [
@@ -109,24 +113,25 @@ Below is an example DAG Run submission:
Task Parameters
--
Task commands can be parameterized by passing in an optional `taskParameters` member. Each parameter consists of a name
and either a string value, or an array of string values. Task commands will be regenerated based on the values of the
parameters.
Task commands can be parameterized by passing in an optional `parameters` member. Each parameter consists of a name and
either a string value, or an array of string values. Tasks will be regenerated based on the values of the parameters.
For instance:
```json
{
"taskParameters": {
"parameters": {
"DIRECTORY": "/var/tmp",
"FILE": "somefile"
},
"tasks": {
"task_one": {
"command": [
"/usr/bin/touch",
"{{DIRECTORY}}/{{FILE}}"
],
"job": {
"command": [
"/usr/bin/touch",
"{{DIRECTORY}}/{{FILE}}"
]
},
"maxRetries": 3,
"retryIntervalSeconds": 30
}
@@ -135,7 +140,7 @@ For instance:
```
`task_one`'s command, when run, will touch `/var/tmp/somefile`, since the values of `DIRECTORY` and `FILE` will be
populated from the `taskParameters` values.
populated from the `parameters` values.
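The substitution itself is plain string templating over the command's arguments. A minimal sketch of the idea (not daggy's actual implementation):

```python
import re

def substitute(command, parameters):
    """Replace each {{NAME}} placeholder in a command with its parameter value."""
    return [re.sub(r"\{\{(\w+)\}\}", lambda m: parameters[m.group(1)], arg)
            for arg in command]

substitute(["/usr/bin/touch", "{{DIRECTORY}}/{{FILE}}"],
           {"DIRECTORY": "/var/tmp", "FILE": "somefile"})
# → ["/usr/bin/touch", "/var/tmp/somefile"]
```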
In the case where a parameter has an array of values, any tasks referencing that value will be duplicated with the
cartesian product of all relevant values.
@@ -144,7 +149,7 @@ Example:
```json
{
"taskParameters": {
"parameters": {
"DIRECTORY": "/var/tmp",
"FILE": "somefile",
"DATE": [
@@ -155,22 +160,28 @@ Example:
},
"tasks": {
"populate_inputs": {
"command": [
"/usr/bin/touch",
"{{DIRECTORY}}/{{FILE}}"
]
"job": {
"command": [
"/usr/bin/touch",
"{{DIRECTORY}}/{{FILE}}"
]
}
},
"calc_date": {
"command": [
"/path/to/calculator",
"{{DIRECTORY}}/{{FILE}}",
"{{DATE}}"
]
"job": {
"command": [
"/path/to/calculator",
"{{DIRECTORY}}/{{FILE}}",
"{{DATE}}"
]
}
},
"generate_report": {
"command": [
"/path/to/generator"
]
"job": {
"command": [
"/path/to/generator"
]
}
}
}
}
```
@@ -200,37 +211,48 @@ graph LR
- `calc_date_2` will have the command `/path/to/calculator /var/tmp/somefile 2021-02-01`
- `calc_date_3` will have the command `/path/to/calculator /var/tmp/somefile 2021-03-01`
**NB**: When a task template resolves to multiple task instances, all of those new instances are still referred to by
the original name for the purposes of creating dependencies. For example, to add a dependency dynamically (see next
section), you must refer to `"children": [ "calc_date" ]`, not to the individual `calc_date_1`.
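The duplication rule described above can be sketched with `itertools.product`. This is not daggy's actual code; the `_1`, `_2` suffix scheme is assumed from the bullets above:

```python
import itertools
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

def expand_task(name, command, parameters):
    """Duplicate one task over the cartesian product of every
    list-valued parameter its command references."""
    used = set(PLACEHOLDER.findall(" ".join(command)))
    list_params = sorted(k for k in used if isinstance(parameters.get(k), list))
    if not list_params:
        # Only scalar parameters: a single task, substituted directly.
        cmd = [PLACEHOLDER.sub(lambda m: str(parameters[m.group(1)]), arg)
               for arg in command]
        return {name: cmd}
    tasks = {}
    combos = itertools.product(*(parameters[k] for k in list_params))
    for i, combo in enumerate(combos, start=1):
        subs = {**parameters, **dict(zip(list_params, combo))}
        cmd = [PLACEHOLDER.sub(lambda m: str(subs[m.group(1)]), arg)
               for arg in command]
        tasks[f"{name}_{i}"] = cmd
    return tasks

expanded = expand_task(
    "calc_date",
    ["/path/to/calculator", "{{DIRECTORY}}/{{FILE}}", "{{DATE}}"],
    {"DIRECTORY": "/var/tmp", "FILE": "somefile",
     "DATE": ["2021-01-01", "2021-02-01", "2021-03-01"]},
)
# expanded has calc_date_1 .. calc_date_3, one per DATE value.
```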
Tasks Generating Tasks
----------------------
Some DAG structures cannot be known ahead of time, but only at runtime. For instance, if a job pulls multiple files
from a source, each of which can be processed independently, it would be nice if the DAG could modify itself on the fly
to accommodate that request.
Some DAG structures can only be fully known at runtime. For instance, if a job pulls multiple files from a source, each
of which can be processed independently, it would be nice if the DAG could modify itself on the fly to accommodate that
request.
Enter the `generator` task. If a task is defined with `"isGenerator": true`, the output of the task is assumed to be
a JSON dictionary containing new tasks to run. The new tasks will go through parameter expansion as described above,
and can freely define their dependencies the same way.
Enter the `generator` task. If a task is defined with `"isGenerator": true`, the output of the task is assumed to be a
JSON dictionary containing new tasks to run. The new tasks will go through parameter expansion as described above, using
the same parameter list as the original DAG. New tasks can define their own dependencies.
**NB:** Generated tasks won't have any child dependencies unless you define them. If there are parameterized
dependencies, you must use the name of the original task (e.g. use `calc_date`, not `calc_date_1`) to add a dependency.
**NB:** If you add a child dependency to a task that has already completed, weird things will happen. Don't do it.
**NB:** If you add a child dependency to a task that has already completed, that task won't restart. Best practice is to
create a dependency from the generator task to the task the new tasks will depend on.
```json
{
"tasks": {
"pull_files": {
"command": [
"/path/to/puller/script",
"{{DATE}}"
],
"job": {
"command": [
"/path/to/puller/script",
"{{DATE}}"
]
},
"isGenerator": true,
"children": [ "generate_report" ]
"children": [
"generate_report"
]
},
"generate_report": {
"command": [
"/path/to/generator"
]
"job": {
"command": [
"/path/to/generator"
]
}
}
}
}
```
@@ -245,20 +267,29 @@ The output of the puller task might be:
```json
{
"calc_date_a": {
"command": [
"/path/to/calculator",
"/path/to/data/file/a"
],
"children": ["generate_report"]
},
"calc_date_a": {
"job": {
"command": [
"/path/to/calculator",
"/path/to/data/file/a"
]
},
"children": [
"generate_report"
]
},
"calc_date_b": {
"command": [
"/path/to/calculator",
"/path/to/data/file/b"
],
"children": ["generate_report"]
}
"calc_date_b": {
"job": {
"command": [
"/path/to/calculator",
"/path/to/data/file/b"
]
},
"children": [
"generate_report"
]
}
}
```
@@ -272,8 +303,9 @@ graph LR
calc_file_a-->generate_report
calc_file_b-->generate_report
```
Note that it was important that `generate_report` depend on `pull_files`, otherwise the two tasks would
run concurrently, and `generate_report` wouldn't have any files to report on.
Note that it was important that `generate_report` depend on `pull_files`, otherwise the two tasks would run concurrently,
and `generate_report` wouldn't have any files to report on.
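From the generator's side, all that is required is printing the JSON dictionary of new tasks to stdout. A hypothetical puller sketch, where the file paths and `calc_date_` names follow the example output above and the discovery step is invented for illustration:

```python
#!/usr/bin/env python3
import json

# In a real puller these paths would be discovered at runtime,
# e.g. by listing the files that were just downloaded.
pulled_files = ["/path/to/data/file/a", "/path/to/data/file/b"]

tasks = {}
for i, path in enumerate(pulled_files):
    # calc_date_a, calc_date_b, ... one task per pulled file.
    tasks[f"calc_date_{chr(ord('a') + i)}"] = {
        "job": {"command": ["/path/to/calculator", path]},
        "children": ["generate_report"],
    }

print(json.dumps(tasks, indent=2))
```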
Execution Parameters
--
@@ -288,3 +320,15 @@ jobs on slurm with a specific set of restrictions, or allow for local execution
| Parameter | Description |
|-----------|-------------|
| pool | Names the executor the DAG should run on |
| poolParameters | Any parameters the executor accepts that might modify how a task is run |
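Since this feature is marked as future work, the exact shape is not fixed; a sketch of what a submission targeting a Slurm executor might look like, with the pool name and `poolParameters` contents invented for illustration:

```json
{
"executionParameters": {
"pool": "slurm",
"poolParameters": {
"partition": "batch"
}
}
}
```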
Executors
=========
Different executors require different structures for the `job` task member.
Default Job Values
------------------
A DAG can be submitted with the extra section `jobDefaults`. These values will be used to fill in default values for all
tasks if they aren't overridden. This can be useful for cases like Slurm execution, where tasks will share default
memory and runtime requirements.
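A sketch of a submission using `jobDefaults`; the field names under it are invented here to illustrate the Slurm-style memory and runtime defaults mentioned above:

```json
{
"jobDefaults": {
"memoryMB": 1024,
"timeLimitMinutes": 60
},
"tasks": {
"task_one": {
"job": {
"command": [
"/usr/bin/touch",
"/tmp/somefile"
]
}
}
}
}
```

Any task that sets one of these fields in its own `job` would override the default; unset fields fall back to the `jobDefaults` value.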