Ian Roddis 57e93b5045 Simplifying daggyr server, and returning to a
task submit / task poll model.

Squashed commit of the following:

commit 0ef57f095d15f0402915de54f83c1671120bd228
Author: Ian Roddis <tech@kinesin.ca>
Date:   Wed Feb 2 08:18:03 2022 -0400

    Simplifying task polling and reducing lock scopes

commit d77ef02021cc728849c7d1fb0185dd1a861b4a3d
Author: Ian Roddis <tech@kinesin.ca>
Date:   Wed Feb 2 08:02:47 2022 -0400

    Simplifying check

commit c1acf34440162abb890a959f3685c2d184242ed5
Author: Ian Roddis <tech@kinesin.ca>
Date:   Wed Feb 2 08:01:13 2022 -0400

    Removing capacity tracking from runner, since it is maintained in daggyd

commit 9401246f92113ab140143c1895978b9de8bd9972
Author: Ian Roddis <tech@kinesin.ca>
Date:   Wed Feb 2 07:47:28 2022 -0400

    Adding retry for submission

commit 398aa04a320347bb35f23f3f101d91ab4df25652
Author: Ian Roddis <tech@kinesin.ca>
Date:   Tue Feb 1 14:54:20 2022 -0400

    Adding in execution note, as well as requeuing the result if the peer disconnects

commit 637b14af6d5b53f25b9c38d4c8a7ed8532af5599
Author: Ian Roddis <tech@kinesin.ca>
Date:   Tue Feb 1 14:13:59 2022 -0400

    Fixing locking issues

commit 4d6716dfda8aa7f51e0abbdab833aff618915ba0
Author: Ian Roddis <tech@kinesin.ca>
Date:   Tue Feb 1 13:33:14 2022 -0400

    Single task daggyr working

commit bd48a5452a92817faf25ee44a6115aaa2f6c30d1
Author: Ian Roddis <tech@kinesin.ca>
Date:   Tue Feb 1 12:22:04 2022 -0400

    Checkpointing work

Daggy: Ya like dags?

Description

Daggy is a work orchestration framework for running workflows modeled as directed acyclic graphs (DAGs). DAGs are quite useful for modeling data ingestion / processing pipelines.

Below is an example workflow where data is pulled from three sources (A, B, C), some work is done on them, and a report is generated.

Each step depends on the success of its upstream dependencies, e.g. Derive_Data_AB can't run until Transform_A and Transform_B have completed successfully.

graph LR
  Pull_A-->Transform_A;
  Pull_B-->Transform_B;
  Pull_C-->Transform_C;

  Transform_A-->Derive_Data_AB;
  Transform_B-->Derive_Data_AB;
  Derive_Data_AB-->Derive_Data_ABC;
  Transform_C-->Derive_Data_ABC;

  Derive_Data_ABC-->Report;

Individual tasks (vertices) are run via a task executor. Daggy supports multiple executors, from a local executor (via fork) to distributed work managers like slurm or kubernetes (planned), as well as daggy's own executor.

State is maintained via state loggers. Currently daggy supports an in-memory state manager (OStreamLogger) and RedisJSON.

Future plans include supporting PostgreSQL.

Building

Requirements:

  • git

  • cmake >= 3.14

  • gcc >= 8

  • libslurm (only if building with Slurm support)

git clone https://gitlab.com/iroddis/daggy
cd daggy
mkdir build
cd build
cmake [-DDAGGY_ENABLE_SLURM=ON] ..
make

tests/tests # for unit tests

DAG Run Definition

daggy works as a standalone library, but generally runs as a service with a REST interface. This documentation is specifically for submitting DAGs to the REST server for execution (a DAG run).

DAGs are defined in JSON as a set of tasks, along with optional jobDefaults and executionParameters (future) sections.

Basic Definition

A DAG Run definition consists of a dictionary that defines a set of tasks. Each task has the following attributes:

| Attribute | Required | Description |
| --- | --- | --- |
| name | Yes | Name of this task (the key in the tasks dictionary). Must be unique. |
| job | Yes | The executor-specific job definition, including the command to execute |
| maxRetries | No | If a task fails, how many times to retry (default: 0) |
| retryIntervalSeconds | No | How many seconds to wait between retries |
| children | No | List of names of tasks that depend on this task |
| parents | No | List of names of tasks that this task depends on |
| isGenerator | No | The output of this task generates additional task definitions |

Defining both parents and children is not required; one or the other is sufficient. Both are supported so you can declare task dependencies in whichever direction feels most natural (the example below uses parents; an equivalent children-based form is shown after it).

Below is an example DAG Run submission:

{
  "tasks": {
    "task_one": {
      "job": {
        "command": [
          "/bin/touch",
          "/tmp/somefile"
        ]
      },
      "maxRetries": 3,
      "retryIntervalSeconds": 30
    },
    "task_two": {
      "job": {
        "command": [
          "/bin/touch",
          "/tmp/someotherfile"
        ]
      },
      "maxRetries": 3,
      "retryIntervalSeconds": 30,
      "parents": [
        "task_one"
      ]
    }
  }
}
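
For illustration, the same two-task DAG can be expressed from the other direction by declaring children on task_one instead of parents on task_two:

{
  "tasks": {
    "task_one": {
      "job": {
        "command": [
          "/bin/touch",
          "/tmp/somefile"
        ]
      },
      "children": [
        "task_two"
      ]
    },
    "task_two": {
      "job": {
        "command": [
          "/bin/touch",
          "/tmp/someotherfile"
        ]
      }
    }
  }
}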

Task Parameters

Task commands can be parameterized by passing in an optional parameters member. Each parameter consists of a name and either a single string value or an array of string values. Tasks that reference a parameter are expanded based on its values.

For instance:

{
  "parameters": {
    "DIRECTORY": "/var/tmp",
    "FILE": "somefile"
  },
  "tasks": {
    "task_one": {
      "job": {
        "command": [
          "/bin/touch",
          "{{DIRECTORY}}/{{FILE}}"
        ]
      },
      "maxRetries": 3,
      "retryIntervalSeconds": 30
    }
  }
}

task_one's command, when run, will touch /var/tmp/somefile, since the values of DIRECTORY and FILE are filled in from the parameter values.

In the case where a parameter has an array of values, any tasks referencing that parameter are duplicated, once for each element of the Cartesian product of all relevant values.

Example:

{
  "job": {
    "DIRECTORY": "/var/tmp",
    "FILE": "somefile",
    "DATE": [
      "2021-01-01",
      "2021-02-01",
      "2021-03-01"
    ]
  },
  "tasks": {
    "populate_inputs": {
      "job": {
        "command": [
          "/bin/touch",
          "{{DIRECTORY}}/{{FILE}}"
        ]
      }
    },
    "calc_date": {
      "job": {
        "command": [
          "/path/to/calculator",
          "{{DIRECTORY}}/{{FILE}}",
          "{{DATE}}"
        ]
      }
    },
    "generate_report": {
      "job": {
        "command": [
          "/path/to/generator"
        ]
      }
    }
  }
}

Conceptually, this DAG looks like this:

graph LR
    populate_inputs-->calc_date
    calc_date-->generate_report

Once the parameters have been populated, the new DAG will look like this:

graph LR
    populate_inputs-->calc_date_1
    populate_inputs-->calc_date_2
    populate_inputs-->calc_date_3
    calc_date_1-->generate_report
    calc_date_2-->generate_report
    calc_date_3-->generate_report
  • calc_date_1 will have the command /path/to/calculator /var/tmp/somefile 2021-01-01
  • calc_date_2 will have the command /path/to/calculator /var/tmp/somefile 2021-02-01
  • calc_date_3 will have the command /path/to/calculator /var/tmp/somefile 2021-03-01

NB: When a task template resolves to multiple task instances, all of those new instances are still referred to by the original name for the purposes of creating dependencies. e.g. to add a dependency dynamically (see next section), you must refer to "children": [ "calc_date" ], not to the individual calc_date_1.

Tasks Generating Tasks

Some DAG structures can only be fully known at runtime. For instance, if a job pulls multiple files from a source, each of which can be processed independently, it would be nice if the DAG could modify itself on the fly to accommodate them.

Enter the generator task. If a task is defined with "isGenerator": true, the output of the task is assumed to be a JSON dictionary containing new tasks to run. The new tasks will go through parameter expansion as described above, using the same parameter list as the original DAG. New tasks can define their own dependencies.

NB: Generated tasks won't have any children dependencies unless you define them. If there are parameterized dependencies, you must use the name of the original task (e.g. use calc_date, not calc_date_1) to add a dependency.

NB: If you add a child dependency to a task that has already completed, that task won't restart. Best practice is to create a dependency from the generator task to the task the new tasks will depend on.

{
  "tasks": {
    "pull_files": {
      "job": {
        "command": [
          "/path/to/puller/script",
          "{{DATE}}"
        ]
      },
      "isGenerator": true,
      "children": [
        "generate_report"
      ]
    },
    "generate_report": {
      "job": {
        "command": [
          "/path/to/generator"
        ]
      }
    }
  }
}
graph LR
   pull_files-->generate_report

The output of the puller task might be:

{
  "calc_date_a": {
    "job": {
      command
      ": [
      "/path/to/calculator",
      "/path/to/data/file/a"
    ]
    },
    "children": [
      "generate_report"
    ]
  },
  "calc_date_b": {
    "job": {
      "command": [
        "/path/to/calculator",
        "/path/to/data/file/b"
      ]
    },
    "children": [
      "generate_report"
    ]
  }
}

Once the first task runs, its output is parsed as additional tasks to run. The new DAG will look like this:

graph LR
   pull_files-->generate_report
   pull_files-->calc_date_a
   pull_files-->calc_date_b
   calc_date_a-->generate_report
   calc_date_b-->generate_report

Note that it is important that generate_report depend on pull_files; otherwise the two tasks would run concurrently, and generate_report wouldn't have any files to report on.

Execution Parameters

(future work)

The REST server can be configured with multiple pools of executors. For instance, it might be helpful to run certain jobs on slurm with a specific set of restrictions, or allow for local execution as well as execution on a slurm cluster.

executionParameters is an optional top-level member that alters how the DAG is executed.

| Attribute | Description |
| --- | --- |
| pool | Names the executor pool the DAG should run on |
| poolParameters | Any parameters the executor accepts that might modify how a task is run |
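
Since this is future work, the schema isn't final. The sketch below only illustrates how a pool selection might be attached to a submission; the pool name slurm_batch and the partition key are illustrative assumptions, not a documented interface:

{
  "executionParameters": {
    "pool": "slurm_batch",
    "poolParameters": {
      "partition": "long"
    }
  },
  "tasks": {
    "task_one": {
      "job": {
        "command": [ "/bin/echo", "hello" ]
      }
    }
  }
}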

Default Job Values

A DAG can be submitted with the extra section jobDefaults. These values will be used to fill in default values for all tasks if they aren't overridden. This can be useful for cases like Slurm execution, where tasks will share default memory and runtime requirements.
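
A minimal sketch, assuming jobDefaults mirrors the per-task job structure and is merged into any task job that doesn't set the same keys (the field names are taken from the Slurm executor table below; the task names are illustrative):

{
  "jobDefaults": {
    "minMemoryMB": "1024",
    "timeLimitSeconds": "3600"
  },
  "tasks": {
    "small_task": {
      "job": {
        "command": [ "/bin/echo", "uses the defaults above" ]
      }
    },
    "big_task": {
      "job": {
        "command": [ "/path/to/heavy/job" ],
        "minMemoryMB": "16384"
      }
    }
  }
}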

Executors

Different executors require different structures for a task's job member.

Local Executor (ForkingTaskExecutor)

The ForkingTaskExecutor runs tasks on the local box, forking to run the task, and using threads to monitor completion and capture output.

| Field | Sample | Description |
| --- | --- | --- |
| command | [ "/bin/echo", "param1" ] | The command to run |
| commandString | "/bin/echo param1" | The command to run as a string. Quoted args are properly handled. |
| environment | [ "DATE=2021-05-03" ] | Environment variables to set for script |

Slurm Executor (SlurmTaskExecutor)

The slurm executor requires that the daggy server be running on a node capable of submitting jobs.

To enable slurm support use cmake -DDAGGY_ENABLE_SLURM=ON .. when configuring the project.

Required job config values:

| Field | Sample | Description |
| --- | --- | --- |
| command | [ "/bin/echo", "param1" ] | The command to run on a slurm host |
| commandString | "/bin/echo param1" | The command to run as a string. Quoted args are properly handled. |
| environment | [ "DATE=2021-05-03" ] | Environment variables to set for script |
| minCPUs | "1" | Minimum number of CPUs required |
| minMemoryMB | "1" | Minimum memory required, in MB |
| minTmpDiskMB | "1" | Minimum temporary disk required, in MB |
| priority | "100" | Slurm priority |
| timeLimitSeconds | "100" | Number of seconds to allow the job to run for |
| userID | "1002" | Numeric UID that the job should run as |
| workDir | "/tmp/" | Directory to use for work |
| tmpDir | "/tmp/" | Directory to use for temporary files, as well as stdout/stderr capture |

Daggy will submit the command to run, capturing the output in ${tmpDir}/${taskName}_{RANDOM}.{stderr,stdout}. Those files are read after the task completes and stored in the AttemptRecord for later retrieval.

For this reason, it's important that the tmpDir directory be readable by the daggy engine, i.e. in a distributed environment it should be on a shared filesystem. If it isn't, the job output will not be captured by daggy, although it will still be available wherever slurm wrote it.

DaggyRunnerTaskExecutor

Daggy Runners (daggyr in this project) are daemons that can be run on remote hosts, then allocated work.

Tasks submitted to this type of runner require cores and memoryMB attributes. Remote runners advertise a specific capacity that is consumed as tasks run on them. Right now those capacities are merely advisory; it's possible to oversubscribe a runner, and the constraints are not enforced.

Enforcement via cgroups is planned.

| Field | Sample | Description |
| --- | --- | --- |
| command | [ "/bin/echo", "param1" ] | The command to run |
| commandString | "/bin/echo param1" | The command to run as a string. Quoted args are properly handled. |
| environment | [ "DATE=2021-05-03" ] | Environment variables to set for script |
| cores | "1" | Number of cores required by the task |
| memoryMB | "100" | Amount of memory (RSS) required by the task, in MB |

Loggers

RedisJSON
