Ian Roddis 212bd80df2 Squashed commit of the following:

commit 73994327de890590eede353c8131f3f7c1e8aaa3
Author: Ian Roddis <gitlab@ie2r.com>
Date:   Wed Aug 25 13:38:29 2021 -0300

    - Fixing up checks for individual dag runs

commit f20e3a3dec8c063111cf60f2bec2b8f84c8a4100
Author: Ian Roddis <gitlab@ie2r.com>
Date:   Wed Aug 25 10:49:43 2021 -0300

    - Finishing serialization of DAGRun
    - Checkpointing work.

commit b490abadf93e3085e4204003de7eaa8183b4e1d5
Author: Ian Roddis <gitlab@ie2r.com>
Date:   Wed Aug 25 10:34:08 2021 -0300

    - Consolidating struct definitions into Defines.hpp
    - Renaming DAGRunRecord member runStates to taskRunStates

commit 050346ec1fd10d1091f261905c6175ffe0bcf001
Author: Ian Roddis <gitlab@ie2r.com>
Date:   Wed Aug 25 09:27:05 2021 -0300

    - Adding additional tests for server endpoints

2021-08-25 13:40:05 -03:00

Daggy: Ya like dags?

Description

Daggy is a work orchestration framework for running workflows modeled as directed acyclic graphs (DAGs). DAGs are a natural fit for data ingestion and processing pipelines.

Below is an example workflow where data is pulled from three sources (A, B, C), some work is done on them, and a report is generated.

Each step depends on the success of its upstream dependencies, e.g. Derive_Data_AB can't run until Transform_A and Transform_B have completed successfully.

graph LR
  Pull_A-->Transform_A;
  Pull_B-->Transform_B;
  Pull_C-->Transform_C;
  
  Transform_A-->Derive_Data_AB;
  Transform_B-->Derive_Data_AB;
  Derive_Data_AB-->Derive_Data_ABC;
  Transform_C-->Derive_Data_ABC;
  
  Derive_Data_ABC-->Report;
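
The dependency rule above can be sketched with Python's standard `graphlib` module (Python 3.9+). This only illustrates the scheduling idea and is not Daggy's implementation; the `DEPENDENCIES` mapping and the `run_in_waves` helper are hypothetical names, with the mapping transcribed from the graph above.

```python
from graphlib import TopologicalSorter

# Map each task to the set of upstream tasks it depends on
# (transcribed from the example graph above).
DEPENDENCIES = {
    "Transform_A": {"Pull_A"},
    "Transform_B": {"Pull_B"},
    "Transform_C": {"Pull_C"},
    "Derive_Data_AB": {"Transform_A", "Transform_B"},
    "Derive_Data_ABC": {"Derive_Data_AB", "Transform_C"},
    "Report": {"Derive_Data_ABC"},
}


def run_in_waves(dependencies):
    """Return lists of tasks that could run concurrently, in order."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose parents are all done
        waves.append(ready)
        sorter.done(*ready)  # pretend every task in the wave succeeded
    return waves


WAVES = run_in_waves(DEPENDENCIES)
```

Here the three `Pull_*` tasks form the first wave and `Report` the last. Grouping into waves is a simplification: a real scheduler could start `Derive_Data_AB` as soon as `Transform_A` and `Transform_B` finish, without waiting for `Transform_C`.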

Individual tasks (vertices) are run via a task executor. Daggy supports multiple executors, from a local executor (via fork) to distributed work managers such as Slurm or Kubernetes (both planned).

State is maintained via state loggers. Daggy currently supports an in-memory state manager (OStreamLogger) and a filesystem logger (FileSystemLogger). Support for Redis and PostgreSQL is planned.

Building

Requirements:

git
cmake >= 3.14
gcc >= 9

git clone https://gitlab.com/iroddis/daggy
cd daggy
mkdir build
cd build
cmake ..
make

DAG Run Definition

Daggy works as a standalone library, but generally runs as a service with a REST interface. This documentation covers submitting DAGs to the REST server for execution (a DAG run).

DAGs are defined in JSON as a set of tasks, along with optional taskParameters and executionParameters (future).

Basic Definition

A DAG Run definition consists of a dictionary that defines a set of tasks. Each task has the following attributes:

Attribute	Required	Description
name	Yes	Name of this task. Must be unique.
command	Yes	The command to execute
maxRetries	No	If a task fails, how many times to retry (default: 0)
retryIntervalSeconds	No	How many seconds to wait between retries
children	No	List of names of tasks that depend on this task
parents	No	List of names of tasks that this task depends on

Defining both parents and children is not required; either one is sufficient. Both are supported so that you can express task dependencies in whichever direction feels most natural.
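
As a rough illustration of why either attribute is sufficient, the sketch below folds `parents` and `children` declarations into a single parent map. The `build_parent_map` helper is hypothetical and is not part of Daggy's API; raising `KeyError` for an unknown child name is an assumption of this sketch.

```python
def build_parent_map(tasks):
    """Merge `parents` and `children` declarations into one parent map."""
    parents = {task["name"]: set() for task in tasks}
    for task in tasks:
        name = task["name"]
        for parent in task.get("parents", []):
            parents[name].add(parent)
        for child in task.get("children", []):
            # Raises KeyError if the child name is not a defined task.
            parents[child].add(name)
    return parents


# Two equivalent ways of declaring that task_two depends on task_one.
via_parents = [
    {"name": "task_one", "command": ["/bin/true"]},
    {"name": "task_two", "command": ["/bin/true"], "parents": ["task_one"]},
]
via_children = [
    {"name": "task_one", "command": ["/bin/true"], "children": ["task_two"]},
    {"name": "task_two", "command": ["/bin/true"]},
]
```

Both task lists produce the same map, so the resulting graph is identical whichever direction the dependency was written in.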

Below is an example DAG Run submission:

{
  "tasks": [
    {
      "name": "task_one",
      "command": [
        "/usr/bin/touch",
        "/tmp/somefile"
      ],
      "maxRetries": 3,
      "retryIntervalSeconds": 30
    },
    {
      "name": "task_two",
      "command": [
        "/usr/bin/touch",
        "/tmp/someotherfile"
      ],
      "maxRetries": 3,
      "retryIntervalSeconds": 30,
      "parents": [
        "task_one"
      ]
    }
  ]
}
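
The retry attributes in the submission above can be read as "one initial attempt plus up to maxRetries further attempts, pausing retryIntervalSeconds between them". The sketch below follows that reading. It is not Daggy's executor: `run_with_retries` is a hypothetical helper, the default interval of 0 is an assumption, and the `run` and `sleep` arguments exist only so the logic can be exercised without launching processes.

```python
import subprocess
import time


def run_with_retries(task, run=None, sleep=time.sleep):
    """Run a task's command, retrying on failure.

    Returns the attempt number that succeeded, or None if all attempts failed.
    """
    if run is None:
        # Default runner: execute the command and report its exit code.
        run = lambda command: subprocess.run(command).returncode
    max_retries = task.get("maxRetries", 0)
    interval = task.get("retryIntervalSeconds", 0)  # assumed default
    for attempt in range(1, max_retries + 2):  # first try + retries
        if run(task["command"]) == 0:
            return attempt
        if attempt <= max_retries:
            sleep(interval)  # wait before the next retry
    return None
```

With `maxRetries` set to 3, a command that fails twice and then succeeds returns 3 after two pauses; with the default of 0, a single failure ends the task.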

Task Parameters

Task commands can be parameterized by passing in an optional taskParameters member. Each parameter consists of a name and either a string value or an array of string values. A command references a parameter as {{NAME}}, and each reference is replaced with the parameter's value when the task commands are generated.

For instance:

{
  "taskParameters": {
    "DIRECTORY": "/var/tmp",
    "FILE": "somefile"
  },
  "tasks": [
    {
      "name": "task_one",
      "command": [
        "/usr/bin/touch",
        "{{DIRECTORY}}/{{FILE}}"
      ],
      "maxRetries": 3,
      "retryIntervalSeconds": 30
    }
  ]
}

task_one's command, when run, will touch /var/tmp/somefile, since the values of DIRECTORY and FILE will be populated from the taskParameters values.
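
A minimal sketch of this substitution, assuming placeholders always have the form `{{NAME}}` with a word-character name. The `render_command` helper and the choice to raise `KeyError` on an undefined parameter are assumptions of this sketch, not documented Daggy behaviour.

```python
import re

# Matches {{NAME}} and captures NAME.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_command(command, parameters):
    """Replace each {{NAME}} in every argument with parameters[NAME]."""
    def lookup(match):
        return parameters[match.group(1)]  # KeyError if NAME is undefined
    return [PLACEHOLDER.sub(lookup, argument) for argument in command]


rendered = render_command(
    ["/usr/bin/touch", "{{DIRECTORY}}/{{FILE}}"],
    {"DIRECTORY": "/var/tmp", "FILE": "somefile"},
)
# rendered == ["/usr/bin/touch", "/var/tmp/somefile"]
```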

When a parameter has an array of values, any task referencing that parameter is duplicated, once for each combination in the Cartesian product of all relevant values.

Example:

{
  "taskParameters": {
    "DIRECTORY": "/var/tmp",
    "FILE": "somefile",
    "DATE": [
      "2021-01-01",
      "2021-02-01",
      "2021-03-01"
    ]
  },
  "tasks": [
    {
      "name": "populate_inputs",
      "command": [
        "/usr/bin/touch",
        "{{DIRECTORY}}/{{FILE}}"
      ]
    },
    {
      "name": "calc_date",
      "command": [
        "/path/to/calculator",
        "{{DIRECTORY}}/{{FILE}}",
        "{{DATE}}"
      ],
      "parents": [
        "populate_inputs"
      ]
    },
    {
      "name": "generate_report",
      "command": [
        "/path/to/generator"
      ],
      "parents": [
        "calc_date"
      ]
    }
  ]
}

Conceptually, this DAG looks like this:

graph LR
    populate_inputs-->calc_date
    calc_date-->generate_report

Once the parameters have been populated, the new DAG will look like this:

graph LR
    populate_inputs-->calc_date_1
    populate_inputs-->calc_date_2
    populate_inputs-->calc_date_3
    calc_date_1-->generate_report
    calc_date_2-->generate_report
    calc_date_3-->generate_report

calc_date_1 will have the command /path/to/calculator /var/tmp/somefile 2021-01-01
calc_date_2 will have the command /path/to/calculator /var/tmp/somefile 2021-02-01
calc_date_3 will have the command /path/to/calculator /var/tmp/somefile 2021-03-01
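
The expansion can be sketched with `itertools.product`. This is an illustration under assumptions rather than Daggy's own code: `expand_task` is a hypothetical helper, the `_1`, `_2`, ... suffixes are inferred from the example above, and rewiring the edges of duplicated tasks is left out.

```python
import itertools
import re

# Matches {{NAME}} and captures NAME.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand_task(task, parameters):
    """Return one concrete task per combination of array-valued parameters."""
    # Collect the parameters this task references, in first-seen order.
    used = []
    for argument in task["command"]:
        for name in PLACEHOLDER.findall(argument):
            if name not in used:
                used.append(name)
    # Treat scalar values as one-element lists so product() handles both.
    choices = [
        parameters[name] if isinstance(parameters[name], list) else [parameters[name]]
        for name in used
    ]
    combos = list(itertools.product(*choices))
    expanded = []
    for index, combo in enumerate(combos, start=1):
        values = dict(zip(used, combo))
        command = [
            PLACEHOLDER.sub(lambda m: values[m.group(1)], argument)
            for argument in task["command"]
        ]
        # Only add a numeric suffix when the task was actually duplicated.
        name = task["name"] if len(combos) == 1 else f"{task['name']}_{index}"
        expanded.append({**task, "name": name, "command": command})
    return expanded
```

Applied to `calc_date` with the parameters above, this yields three tasks, one per `DATE` value, while `generate_report`, which references no parameters, comes back as a single task with its name unchanged.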

Execution Parameters

(future work)

The REST server can be configured with multiple pools of executors. For instance, it might be helpful to run certain jobs on Slurm with a specific set of restrictions, or to allow local execution alongside execution on a Slurm cluster.

executionParameters is an optional member of the DAG run definition that alters how the DAG is executed.

Attribute	Description
pool	Names the executor the DAG should run on
poolParameters	Any parameters the executor accepts that might modify how a task is run
