Download and get started
Choose one of the following options to download Steep:
If you downloaded the binary package of Steep, extract the ZIP file and run the start script:
cd steep-5.7.0
bin/steep
Or, start the Docker image as follows:
docker run --name steep -d --rm -p 8080:8080 \
-e STEEP_HTTP_HOST=0.0.0.0 steep/steep:5.7.0
After a few seconds, you can access Steep’s web interface on http://localhost:8080/.
We will now submit a simple workflow to test if Steep is running correctly. The workflow consists of a single execute action that sleeps for 10 seconds and then quits. Execute the following command:
curl -X POST http://localhost:8080/workflows -d 'api: 4.2.0
vars:
- id: sleep_seconds
value: 10
actions:
- type: execute
service: sleep
inputs:
- id: seconds
var: sleep_seconds'
The command will return the ID of the submitted workflow. You can monitor the execution in the web interface or by issuing the following command:
curl http://localhost:8080/workflows/<workflow-id>
Replace <workflow-id> with the returned ID.
Congratulations! You successfully installed Steep and ran your first workflow.
Documentation
In this section, we describe the individual features of Steep. The documentation always applies to the latest software version.
Table of contents
1 How does Steep work?
In order to answer this question, we will first describe how Steep transforms scientific workflow graphs into executable units. After that, we will have a look at Steep’s software architecture and what kind of processing services it can execute.
(This section is based on the following publication: Krämer, M. (2020). Capability-Based Scheduling of Scientific Workflows in the Cloud. Proceedings of the 9th International Conference on Data Science, Technology, and Applications DATA, 43–54. https://doi.org/10.5220/0009805400430054)
1.1 Workflow scheduling
Steep is a scientific workflow management system that can be used to control the processing of very large data sets in a distributed environment.
A scientific workflow is typically represented by a directed acyclic graph that describes how an input data set is processed by certain tasks in a given order to produce a desired outcome. Such workflows can become very large with hundreds up to several thousands of tasks processing data volumes ranging from gigabytes to terabytes. The following figure shows a simple example of such a workflow in an extended Petri Net notation proposed by van der Aalst and van Hee (2004).
In this example, an input file is first processed by a task A. This task produces two results. The first one is processed by task B whose result is in turn sent to C. The second result of A is processed by D. The outcomes of C and D are finally processed by task E.
In order to be able to schedule such a workflow in a distributed environment, the graph has to be transformed to individual executable units. Steep follows a hybrid scheduling approach that applies heuristics on the level of the workflow graph and later on the level of individual executable units. We assume that tasks that access the same data should be executed on the same machine to reduce the communication overhead and to improve file reuse. We therefore group tasks into so-called process chains, which are linear sequential lists (without branches and loops).
Steep transforms workflows to process chains in an iterative manner. In each iteration, it finds the longest linear sequences of tasks and groups them to process chains. The following animation shows how this works for our example workflow:
Task A will be put into a process chain in iteration 1. Steep then schedules the execution of this process chain. After the execution has finished, Steep uses the results to produce a process chain containing B and C and another one containing D. These process chains are then scheduled to be executed in parallel. The results are finally used to generate the fourth process chain containing task E, which is also scheduled for execution.
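To make this grouping step concrete, the following Node.js sketch illustrates one iteration of the heuristic: among the tasks whose predecessors have all finished, it builds the longest linear sequences (a task is appended to a chain only if it is the sole successor of the current task and the current task is its sole unfinished predecessor). This is an illustration only, not Steep's actual implementation; the task names and the deps map are invented for this example.

```javascript
// Illustrative sketch (not Steep's actual code): group the ready tasks
// of a workflow graph into linear process chains for one iteration.
// tasks: array of task IDs; deps: map from task ID to its predecessors;
// finished: Set of task IDs whose results are already available.
function buildProcessChains(tasks, deps, finished) {
  const chains = []
  const assigned = new Set(finished)
  // tasks whose predecessors have all finished are ready to run
  const ready = tasks.filter(t => !assigned.has(t) &&
    (deps[t] || []).every(d => finished.has(d)))
  for (const start of ready) {
    if (assigned.has(start)) continue
    const chain = [start]
    assigned.add(start)
    let current = start
    for (;;) {
      // extend the chain while the current task has exactly one
      // successor and that successor has no other unfinished predecessor
      const succs = tasks.filter(t => (deps[t] || []).includes(current))
      if (succs.length !== 1) break
      const next = succs[0]
      if (assigned.has(next)) break
      const unfinishedPreds = (deps[next] || []).filter(d => !finished.has(d))
      if (unfinishedPreds.length !== 1) break
      chain.push(next)
      assigned.add(next)
      current = next
    }
    chains.push(chain)
  }
  return chains
}
```

Applied to the example workflow (A feeds B and D, B feeds C, and C and D feed E), three successive calls reproduce the four process chains described above.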
1.2 Software architecture
The following figure shows the main components of Steep: the HTTP server, the controller, the scheduler, the agent, and the cloud manager.
Together, these components form an instance of Steep. In practice, a single instance typically runs on a separate virtual machine, but multiple instances can also be started on the same machine. Each component can be enabled or disabled in a given instance (see the configuration options for more information). That means, in a cluster, there can be instances that have all five components enabled, and others that have only an agent, for example.
All components of all instances communicate with each other through messages sent over an event bus. Further, the HTTP server, the controller, and the scheduler are able to connect to a shared database.
The HTTP server provides information about scheduled, running, and finished workflows to clients. Clients can also upload a new workflow. In this case, the HTTP server puts the workflow into the database and sends a message to one of the instances of the controller.
The controller receives this message, loads the workflow from the database, and starts transforming it iteratively to process chains as described above. Whenever it has generated new process chains, it puts them into the database and sends a message to all instances of the scheduler.
The schedulers then select agents to execute the process chains. They load the process chains from the database, send them via the event bus to the selected agents for execution, and finally write the results into the database. The schedulers also send a message back to the controller so it can continue with the next iteration and generate more process chains until the workflow has been completely transformed.
In case a scheduler does not find an agent suitable for the execution of a process chain, it sends a message to the cloud manager (a component that interacts with the API of the cloud infrastructure) and asks it to create a new agent.
1.3 Processing services
Steep is very flexible and allows a wide range of processing services (or microservices) to be integrated. A typical processing service is a program that reads one or more input files and writes one or more output files. The program may also accept generic parameters. The service can be implemented in any programming language (as long as the binary or script is executable on the machine the Steep agent runs on) or can be wrapped in a Docker container.
For a seamless integration, a processing service should adhere to the following guidelines:
- Every processing service should be a microservice. It should run in its own process and serve one specific purpose.
- As Steep needs to call the service in a distributed environment, it should not have a graphical user interface or require any human interaction at runtime. Suitable services are command-line applications that accept arguments to specify input files, output files, and parameters.
- The service should read from input files, process the data, write results to output files, and then exit. It should not run continuously like a web service. If you need to integrate a web service in your workflow, we recommend using the curl command or something similar.
- Steep does not require the processing services to implement a specific interface. Instead, the service’s input and output parameters should be described in a special data model called service metadata.
- According to common conventions for exit codes, a processing service should return 0 (zero) upon successful execution and a non-zero number in case an error has occurred (e.g. 1, 2, 128, 255, etc.).
- In order to ensure deterministic workflow executions, services should be stateless and idempotent. This means that every execution of a service with the same input data and the same set of parameters should produce the same result.
2 Example workflows
In this section, we describe example workflows covering patterns we regularly see in real-world use cases. For each workflow, we also provide the required service metadata. For more information about the workflow model and service metadata, please read the section on data models.
2.1 Running two services in parallel
This example workflow consists of two actions that each copy a file. Since neither action depends on the other (i.e. they do not share any variable), Steep converts them to two independent process chains and executes them in parallel (as long as there are at least two agents available).
The workflow defines four variables. inputFile1 and inputFile2 point to the two files to be copied. outputFile1 and outputFile2 have no value. Steep will create unique values (output file names) for them during the workflow execution.
The workflow then specifies two execute actions for the copy service. The service metadata of copy defines that this processing service has an input parameter input_file and an output parameter output_file, both of which must be specified exactly one time (cardinality equals 1..1).
For each execute action, Steep assigns the input variables to the input parameters, generates file names for the output variables, and then executes the processing services.
Workflow:
api: 4.2.0
vars:
- id: inputFile1
value: example1.txt
- id: outputFile1
- id: inputFile2
value: example2.txt
- id: outputFile2
actions:
- type: execute
service: copy
inputs:
- id: input_file
var: inputFile1
outputs:
- id: output_file
var: outputFile1
- type: execute
service: copy
inputs:
- id: input_file
var: inputFile2
outputs:
- id: output_file
var: outputFile2
{
"api": "4.2.0",
"vars": [{
"id": "inputFile1",
"value": "example1.txt"
}, {
"id": "outputFile1"
}, {
"id": "inputFile2",
"value": "example2.txt"
}, {
"id": "outputFile2"
}],
"actions": [{
"type": "execute",
"service": "copy",
"inputs": [{
"id": "input_file",
"var": "inputFile1"
}],
"outputs": [{
"id": "output_file",
"var": "outputFile1"
}]
}, {
"type": "execute",
"service": "copy",
"inputs": [{
"id": "input_file",
"var": "inputFile2"
}],
"outputs": [{
"id": "output_file",
"var": "outputFile2"
}]
}]
}
Service metadata:
- id: copy
name: Copy
description: Copy files
path: cp
runtime: other
parameters:
- id: input_file
name: Input file name
description: Input file name
type: input
cardinality: 1..1
data_type: file
- id: output_file
name: Output file name
description: Output file name
type: output
cardinality: 1..1
data_type: file
[{
"id": "copy",
"name": "Copy",
"description": "Copy files",
"path": "cp",
"runtime": "other",
"parameters": [{
"id": "input_file",
"name": "Input file name",
"description": "Input file name",
"type": "input",
"cardinality": "1..1",
"data_type": "file"
}, {
"id": "output_file",
"name": "Output file name",
"description": "Output file name",
"type": "output",
"cardinality": "1..1",
"data_type": "file"
}]
}]
2.2 Chaining two services
The following example workflow makes a copy of a file and then a copy of the copy (i.e. the file is copied and the result is copied again). The workflow contains two actions that share the same variable: outputFile1 is used as the output of the first action and as the input of the second action. Steep executes them in sequence.
The service metadata for this workflow is the same as for the previous one.
Workflow:
api: 4.2.0
vars:
- id: inputFile
value: example.txt
- id: outputFile1
- id: outputFile2
actions:
- type: execute
service: copy
inputs:
- id: input_file
var: inputFile
outputs:
- id: output_file
var: outputFile1
- type: execute
service: copy
inputs:
- id: input_file
var: outputFile1
outputs:
- id: output_file
var: outputFile2
{
"api": "4.2.0",
"vars": [{
"id": "inputFile",
"value": "example.txt"
}, {
"id": "outputFile1"
}, {
"id": "outputFile2"
}],
"actions": [{
"type": "execute",
"service": "copy",
"inputs": [{
"id": "input_file",
"var": "inputFile"
}],
"outputs": [{
"id": "output_file",
"var": "outputFile1"
}]
}, {
"type": "execute",
"service": "copy",
"inputs": [{
"id": "input_file",
"var": "outputFile1"
}],
"outputs": [{
"id": "output_file",
"var": "outputFile2"
}]
}]
}
2.3 Splitting and joining results
This example starts with an action that copies a file. Two other actions then run in parallel and make copies of the result of the first action. A final action then joins these copies into a single file. The workflow has a split-and-join pattern: the graph splits into two branches after the first action, and these branches are then joined into a single one by the final action.
Workflow:
api: 4.2.0
vars:
- id: inputFile
value: example.txt
- id: outputFile1
- id: outputFile2
- id: outputFile3
- id: outputFile4
actions:
- type: execute
service: copy
inputs:
- id: input_file
var: inputFile
outputs:
- id: output_file
var: outputFile1
- type: execute
service: copy
inputs:
- id: input_file
var: outputFile1
outputs:
- id: output_file
var: outputFile2
- type: execute
service: copy
inputs:
- id: input_file
var: outputFile1
outputs:
- id: output_file
var: outputFile3
- type: execute
service: join
inputs:
- id: i
var: outputFile2
- id: i
var: outputFile3
outputs:
- id: o
var: outputFile4
{
"api": "4.2.0",
"vars": [{
"id": "inputFile",
"value": "example.txt"
}, {
"id": "outputFile1"
}, {
"id": "outputFile2"
}, {
"id": "outputFile3"
}, {
"id": "outputFile4"
}],
"actions": [{
"type": "execute",
"service": "copy",
"inputs": [{
"id": "input_file",
"var": "inputFile"
}],
"outputs": [{
"id": "output_file",
"var": "outputFile1"
}]
}, {
"type": "execute",
"service": "copy",
"inputs": [{
"id": "input_file",
"var": "outputFile1"
}],
"outputs": [{
"id": "output_file",
"var": "outputFile2"
}]
}, {
"type": "execute",
"service": "copy",
"inputs": [{
"id": "input_file",
"var": "outputFile1"
}],
"outputs": [{
"id": "output_file",
"var": "outputFile3"
}]
}, {
"type": "execute",
"service": "join",
"inputs": [{
"id": "i",
"var": "outputFile2"
}, {
"id": "i",
"var": "outputFile3"
}],
"outputs": [{
"id": "o",
"var": "outputFile4"
}]
}]
}
Service metadata:
- id: copy
name: Copy
description: Copy files
path: cp
runtime: other
parameters:
- id: input_file
name: Input file name
description: Input file name
type: input
cardinality: 1..1
data_type: file
- id: output_file
name: Output file name
description: Output file name
type: output
cardinality: 1..1
data_type: file
- id: join
name: Join
description: Merge one or more files into one
path: join.sh
runtime: other
parameters:
- id: i
name: Input files
description: One or more input files to merge
type: input
cardinality: 1..n
data_type: file
- id: o
name: Output file
description: The output file
type: output
cardinality: 1..1
data_type: file
[{
"id": "copy",
"name": "Copy",
"description": "Copy files",
"path": "cp",
"runtime": "other",
"parameters": [{
"id": "input_file",
"name": "Input file name",
"description": "Input file name",
"type": "input",
"cardinality": "1..1",
"data_type": "file"
}, {
"id": "output_file",
"name": "Output file name",
"description": "Output file name",
"type": "output",
"cardinality": "1..1",
"data_type": "file"
}]
}, {
"id": "join",
"name": "Join",
"description": "Merge one or more files into one",
"path": "join.sh",
"runtime": "other",
"parameters": [{
"id": "i",
"name": "Input files",
"description": "One or more input files to merge",
"type": "input",
"cardinality": "1..n",
"data_type": "file"
}, {
"id": "o",
"name": "Output file",
"description": "The output file",
"type": "output",
"cardinality": "1..1",
"data_type": "file"
}]
}]
2.4 Processing a dynamic number of results in parallel
This example demonstrates how to process the results of an action in parallel even if the number of result files is unknown during the design of the workflow.
The workflow starts with an action that splits an input file inputFile into multiple files (e.g. one file per line) stored in a directory outputDirectory. A for-each action then iterates over these files and creates copies. The for-each action has an enumerator i that serves as the input for the individual instances of the copy service. The output files (outputFile1) of this service are collected via the yieldToOutput property in a variable called copies. The final join service merges these copies into a single file outputFile2.
Workflow:
api: 4.2.0
vars:
- id: inputFile
value: example.txt
- id: lines
value: 1
- id: outputDirectory
- id: i
- id: outputFile1
- id: copies
- id: outputFile2
actions:
- type: execute
service: split
inputs:
- id: file
var: inputFile
- id: lines
var: lines
outputs:
- id: output_directory
var: outputDirectory
- type: for
input: outputDirectory
enumerator: i
output: copies
actions:
- type: execute
service: copy
inputs:
- id: input_file
var: i
outputs:
- id: output_file
var: outputFile1
yieldToOutput: outputFile1
- type: execute
service: join
inputs:
- id: i
var: copies
outputs:
- id: o
var: outputFile2
{
"api": "4.2.0",
"vars": [{
"id": "inputFile",
"value": "example.txt"
}, {
"id": "lines",
"value": 1
}, {
"id": "outputDirectory"
}, {
"id": "i"
}, {
"id": "outputFile1"
}, {
"id": "copies"
}, {
"id": "outputFile2"
}],
"actions": [{
"type": "execute",
"service": "split",
"inputs": [{
"id": "file",
"var": "inputFile"
}, {
"id": "lines",
"var": "lines"
}],
"outputs": [{
"id": "output_directory",
"var": "outputDirectory"
}]
}, {
"type": "for",
"input": "outputDirectory",
"enumerator": "i",
"output": "copies",
"actions": [{
"type": "execute",
"service": "copy",
"inputs": [{
"id": "input_file",
"var": "i"
}],
"outputs": [{
"id": "output_file",
"var": "outputFile1"
}]
}],
"yieldToOutput": "outputFile1"
}, {
"type": "execute",
"service": "join",
"inputs": [{
"id": "i",
"var": "copies"
}],
"outputs": [{
"id": "o",
"var": "outputFile2"
}]
}]
}
Service metadata:
- id: split
name: Split
description: Split a file into pieces
path: split
runtime: other
parameters:
- id: lines
name: Number of lines per file
description: Create smaller files n lines in length
type: input
cardinality: 0..1
data_type: integer
label: '-l'
- id: file
name: Input file
description: The input file to split
type: input
cardinality: 1..1
data_type: file
- id: output_directory
name: Output directory
description: The output directory
type: output
cardinality: 1..1
data_type: directory
file_suffix: /
- id: copy
name: Copy
description: Copy files
path: cp
runtime: other
parameters:
- id: input_file
name: Input file name
description: Input file name
type: input
cardinality: 1..1
data_type: file
- id: output_file
name: Output file name
description: Output file name
type: output
cardinality: 1..1
data_type: file
- id: join
name: Join
description: Merge one or more files into one
path: join.sh
runtime: other
parameters:
- id: i
name: Input files
description: One or more input files to merge
type: input
cardinality: 1..n
data_type: file
- id: o
name: Output file
description: The output file
type: output
cardinality: 1..1
data_type: file
[{
"id": "split",
"name": "Split",
"description": "Split a file into pieces",
"path": "split",
"runtime": "other",
"parameters": [{
"id": "lines",
"name": "Number of lines per file",
"description": "Create smaller files n lines in length",
"type": "input",
"cardinality": "0..1",
"data_type": "integer",
"label": "-l"
}, {
"id": "file",
"name": "Input file",
"description": "The input file to split",
"type": "input",
"cardinality": "1..1",
"data_type": "file"
}, {
"id": "output_directory",
"name": "Output directory",
"description": "The output directory",
"type": "output",
"cardinality": "1..1",
"data_type": "directory",
"file_suffix": "/"
}]
}, {
"id": "copy",
"name": "Copy",
"description": "Copy files",
"path": "cp",
"runtime": "other",
"parameters": [{
"id": "input_file",
"name": "Input file name",
"description": "Input file name",
"type": "input",
"cardinality": "1..1",
"data_type": "file"
}, {
"id": "output_file",
"name": "Output file name",
"description": "Output file name",
"type": "output",
"cardinality": "1..1",
"data_type": "file"
}]
}, {
"id": "join",
"name": "Join",
"description": "Merge one or more files into one",
"path": "join.sh",
"runtime": "other",
"parameters": [{
"id": "i",
"name": "Input files",
"description": "One or more input files to merge",
"type": "input",
"cardinality": "1..n",
"data_type": "file"
}, {
"id": "o",
"name": "Output file",
"description": "The output file",
"type": "output",
"cardinality": "1..1",
"data_type": "file"
}]
}]
2.5 Feeding results back into the workflow (cycles/loops)
The following example shows how to create loops with a dynamic number of iterations. Suppose there is a processing service called countdown.js that reads a number from an input file, decreases this number by 1, and then writes the result to an output file. The service could be implemented in Node.js as follows:
#!/usr/bin/env node
const fs = require("fs").promises

// Read a number from the input file, decrease it by 1, and write the
// result to the output file. If the decreased value reaches 0, no
// output file is written at all.
async function countDown(input, output) {
  let value = parseInt(await fs.readFile(input, "utf-8"))
  console.log(`Old value: ${value}`)
  value--
  if (value > 0) {
    console.log(`New value: ${value}`)
    await fs.writeFile(output, "" + value, "utf-8")
  } else {
    console.log("No new value")
  }
}

countDown(process.argv[2], process.argv[3])
The following workflow uses this service in a for-each action to continuously reprocess a file and decrease the number in it until it reaches 0.
In the first iteration of the for-each action, the service reads from a file called input.txt and writes to an output file with a name generated during runtime. The path of this output file is routed back into the for-each action via yieldToInput. In the second iteration, the service reads from the output file and produces another one. This process continues until the number equals 0. In this case, the service does not write an output file anymore and the workflow finishes.
Note that we use the data type fileOrEmptyList in the service metadata for the output parameter of the countdown service. This is a special data type that either returns the generated file or an empty list if the file does not exist. In the latter case, the for-each action does not have any more input values to process. Think of the input of a for-each action as a queue. If nothing is pushed into the queue and all elements have already been processed, the for-each action can finish.
Workflow:
api: 4.2.0
vars:
- id: input_file
value: input.txt
- id: i
- id: output_file
actions:
- type: for
input: input_file
enumerator: i
yieldToInput: output_file
actions:
- type: execute
service: countdown
inputs:
- id: input
var: i
outputs:
- id: output
var: output_file
{
"api": "4.2.0",
"vars": [{
"id": "input_file",
"value": "input.txt"
}, {
"id": "i"
}, {
"id": "output_file"
}],
"actions": [{
"type": "for",
"input": "input_file",
"enumerator": "i",
"yieldToInput": "output_file",
"actions": [{
"type": "execute",
"service": "countdown",
"inputs": [{
"id": "input",
"var": "i"
}],
"outputs": [{
"id": "output",
"var": "output_file"
}]
}]
}]
}
Service metadata:
- id: countdown
name: Count Down
description: Read a number, subtract 1, and write the result
path: ./countdown.js
runtime: other
parameters:
- id: input
name: Input file
description: The input file containing the number to decrease
type: input
cardinality: 1..1
data_type: file
- id: output
name: Output file
description: The path to the output file
type: output
cardinality: 1..1
data_type: fileOrEmptyList
[{
"id": "countdown",
"name": "Count Down",
"description": "Read a number, subtract 1, and write the result",
"path": "./countdown.js",
"runtime": "other",
"parameters": [{
"id": "input",
"name": "Input file",
"description": "The input file containing the number to decrease",
"type": "input",
"cardinality": "1..1",
"data_type": "file"
}, {
"id": "output",
"name": "Output file",
"description": "The path to the output file",
"type": "output",
"cardinality": "1..1",
"data_type": "fileOrEmptyList"
}]
}]
3 Data models
This section contains a description of all data models used by Steep.
3.1 Workflows
The main components of the workflow model are variables and actions. Use variables to specify input files and parameters for your processing services. Variables for output files must also be declared but must not have a value. The names of output files will be generated by Steep during the runtime of the workflow.
Property | Type | Description |
---|---|---|
api (required) | string | The API (or data model) version. Should be 4.2.0 . |
name (optional) | string | An optional human-readable workflow name |
vars (required) | array | An array of variables |
actions (required) | array | An array of actions that make up the workflow |
Example:
See the section on example workflows.
3.2 Variables
A variable holds a value for inputs and outputs of processing services. It can be defined (inputs) or undefined (outputs). Defined values are immutable. Undefined variables will be assigned a value by Steep during the runtime of a workflow.
Variables are also used to link two services together and to define the data flow in the workflow graph. For example, if the output parameter of a service A refers to a variable V, and the input parameter of service B refers to the same variable, Steep will first execute A to determine the value of V and then execute B.
Property | Type | Description |
---|---|---|
id (required) | string | A unique variable identifier |
value (optional) | any | The variable’s value or null if the variable is undefined |
Example:
id: input_file
value: /data/input.txt
{
"id": "input_file",
"value": "/data/input.txt"
}
3.3 Actions
There are two types of actions in a workflow: execute actions and for-each actions. They are differentiated by their type attribute.
3.3.1 Execute actions
An execute action instructs Steep to execute a certain service with given inputs and outputs.
Property | Type | Description |
---|---|---|
type (required) | string | The type of the action. Must be "execute" . |
service (required) | string | The ID of the service to execute |
inputs (optional) | array | An array of input parameters |
outputs (optional) | array | An array of output parameters |
retries (optional) | object | An optional retry policy specifying how often this action should be retried in case of an error. Overrides any default retry policy defined in the service metadata. |
Example:
type: execute
service: my_service
inputs:
- id: verbose
var: is_verbose
- id: resolution
var: resolution_pixels
- id: input_file
var: my_input_file
outputs:
- id: output_file
var: my_output_file
store: true
{
"type": "execute",
"service": "my_service",
"inputs": [{
"id": "verbose",
"var": "is_verbose"
}, {
"id": "resolution",
"var": "resolution_pixels"
}, {
"id": "input_file",
"var": "my_input_file"
}],
"outputs": [{
"id": "output_file",
"var": "my_output_file",
"store": true
}]
}
3.3.2 For-each actions
A for-each action has an input, a list of sub-actions, and an output. It clones the sub-actions as many times as there are items in its input, executes the actions, and then collects the results in the output.
Although the action is called ‘for-each’, the execution order of the sub-actions is undefined (i.e. the execution is non-sequential and non-deterministic). Instead, Steep always tries to execute as many sub-actions as possible in parallel.
For-each actions may contain execute actions but also nested for-each actions.
Property | Type | Description |
---|---|---|
type (required) | string | The type of the action. Must be "for" . |
input (required) | string | The ID of a variable containing the items to which to apply the sub-actions |
enumerator (required) | string | The ID of a variable that holds the current value from input for each iteration |
output (optional) | string | The ID of a variable that will collect output values from all iterations (see yieldToOutput ) |
actions (optional) | array | An array of sub-actions to execute in each iteration |
yieldToOutput (optional) | string | The ID of a sub-action’s output variable whose value should be appended to the for-each action’s output |
yieldToInput (optional) | string | The ID of a sub-action’s output variable whose value should be appended to the for-each action’s input to generate further iterations |
Example:
type: for
input: all_input_files
output: all_output_files
enumerator: i
yieldToOutput: output_file
actions:
- type: execute
service: copy
inputs:
- id: input
var: i
outputs:
- id: output
var: output_file
{
"type": "for",
"input": "all_input_files",
"output": "all_output_files",
"enumerator": "i",
"yieldToOutput": "output_file",
"actions": [{
"type": "execute",
"service": "copy",
"inputs": [{
"id": "input",
"var": "i"
}],
"outputs": [{
"id": "output",
"var": "output_file"
}]
}]
}
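The cloning behavior described above can be sketched as follows. This is an illustration only, not Steep's actual implementation, and the helper expandForEach is invented for this example: it clones the sub-actions once per input item and binds the enumerator input of each clone to the concrete item.

```javascript
// Illustrative sketch: expand a for-each action by cloning its
// sub-actions once per input item. Each clone's input that refers to
// the enumerator variable is bound to the current item's value.
function expandForEach(forEach, items) {
  return items.flatMap(item =>
    forEach.actions.map(action => ({
      ...action,
      inputs: action.inputs.map(input =>
        input.var === forEach.enumerator
          ? { ...input, value: item } // bind enumerator to current item
          : input)
    })))
}
```

Since the clones are independent of each other, they can be scheduled in parallel, which is why the execution order is undefined.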
3.3.3 Parameters
This data model represents inputs and generic parameters of execute actions.
Property | Type | Description |
---|---|---|
id (required) | string | The ID of the parameter as defined in the service metadata |
var (required) | string | The ID of a variable that holds the value for this parameter |
Example:
id: input
var: i
{
"id": "input",
"var": "i"
}
3.3.4 Output parameters
Output parameters of execute actions have additional properties compared to inputs.
Property | Type | Description |
---|---|---|
id (required) | string | The ID of the parameter as defined in the service metadata |
var (required) | string | The ID of a variable to which Steep will assign the generated name of the output file. This variable can then be used, for example, as an input parameter of a subsequent action. |
prefix (optional) | string | An optional string to prepend to the generated name of the output file. For example, if Steep generates the name "name123abc" and the prefix is "my/dir/" , the output filename will be "my/dir/name123abc" . Note that the prefix must end with a slash if you want to create a directory. The output filename will be relative to the configured temporary directory or output directory (depending on the store property). You may even specify an absolute path: if the generated name is "name456fgh" and the prefix is "/absolute/dir/" , the output filename will be "/absolute/dir/name456fgh" . |
store (optional) | boolean | If this property is true , Steep will generate an output filename that is relative to the configured output directory instead of the temporary directory. The default value is false . |
Example:
id: output
var: o
prefix: some_directory/
store: false
{
"id": "output",
"var": "o",
"prefix": "some_directory/",
"store": false
}
3.4 Process chains
As described above, Steep transforms a workflow to one or more process chains. A process chain is a sequential list of instructions that will be sent to Steep’s remote agents to execute processing services in a distributed environment.
Property | Type | Description |
---|---|---|
id (required) | string | Unique process chain identifier |
executables (required) | array | A list of executable objects that describe what processing services should be called and with which arguments |
submissionId (required) | string | The ID of the submission to which this process chain belongs |
startTime (optional) | string | An ISO 8601 timestamp denoting the date and time when the process chain execution was started. May be null if the execution has not started yet. |
endTime (optional) | string | An ISO 8601 timestamp denoting the date and time when the process chain execution finished. May be null if the execution has not finished yet. |
status (required) | string | The current status of the process chain |
requiredCapabilities (optional) | array | A set of strings specifying capabilities a host system must provide to be able to execute this process chain. See also setups. |
results (optional) | object | If status is SUCCESS , this property contains the list of process chain result files grouped by their output variable ID. Otherwise, it is null . |
estimatedProgress (optional) | number | A floating point number between 0.0 (0%) and 1.0 (100%) indicating the current execution progress of this process chain. This property will only be provided if the process chain is currently being executed (i.e. if its status equals RUNNING ) and if a progress could actually be estimated. Note that the value is an estimation based on various factors and does not have to represent the real progress. More precise values can be calculated with a progress estimator plugin. Sometimes, progress cannot be estimated at all. In this case, the value will be null . |
errorMessage (optional) | string | If status is ERROR , this property contains a human-readable error message. Otherwise, it is null . |
Example:
id: akpm646jjigral4cdyyq
submissionId: akpm6yojjigral4cdxgq
startTime: '2020-05-18T08:44:19.221456Z'
endTime: '2020-05-18T08:44:19.446437Z'
status: SUCCESS
requiredCapabilities:
- nodejs
executables:
- id: Count Down
path: ./countdown.js
runtime: other
arguments:
- id: input
type: input
dataType: file
variable:
id: input_file
value: input.txt
- id: output
type: output
dataType: fileOrEmptyList
variable:
id: output_file
value: output.txt
runtimeArgs: []
results:
output_file:
- output.txt
{
"id": "akpm646jjigral4cdyyq",
"submissionId": "akpm6yojjigral4cdxgq",
"startTime": "2020-05-18T08:44:19.221456Z",
"endTime": "2020-05-18T08:44:19.446437Z",
"status": "SUCCESS",
"requiredCapabilities": ["nodejs"],
"executables": [{
"id": "Count Down",
"path": "./countdown.js",
"runtime": "other",
"arguments": [{
"id": "input",
"type": "input",
"dataType": "file",
"variable": {
"id": "input_file",
"value": "input.txt"
}
}, {
"id": "output",
"type": "output",
"dataType": "fileOrEmptyList",
"variable": {
"id": "output_file",
"value": "output.txt"
}
}],
"runtimeArgs": []
}],
"results": {
"output_file": ["output.txt"]
}
}
3.4.1 Process chain status
The following table shows the statuses a process chain can have:
Status | Description |
---|---|
REGISTERED | The process chain has been created but execution has not started yet |
RUNNING | The process chain is currently being executed |
CANCELLED | The execution of the process chain was cancelled |
SUCCESS | The process chain was executed successfully |
ERROR | The execution of the process chain failed |
3.5 Executables
An executable is part of a process chain. It describes how a processing service should be executed and with which parameters.
Property | Type | Description |
---|---|---|
id (required) | string | An identifier (does not have to be unique; typically refers to the name of the service to be executed) |
path (required) | string | The path to the binary of the service to be executed. This property is specific to the runtime . For example, for the docker runtime, this property refers to the Docker image. |
arguments (required) | array | A list of arguments to pass to the service. May be empty. |
runtime (required) | string | The name of the runtime that will execute the service. Built-in runtimes are currently other (for any service that is executable on the target system) and docker for Docker containers. More runtimes can be added through plugins |
runtimeArgs (optional) | array | A list of arguments to pass to the runtime. May be empty. |
serviceId (optional) | string | The ID of the processing service to be executed. May be null if the executable does not refer to a service. |
retries (optional) | object | An optional retry policy specifying how often this executable should be restarted in case of an error. |
Example:
id: Count Down
path: my_docker_image:latest
runtime: docker
arguments:
- id: input
type: input
dataType: file
variable:
id: input_file
value: /data/input.txt
- id: output
type: output
dataType: directory
variable:
id: output_file
value: /data/output
- id: arg1
type: input
dataType: boolean
label: '--foobar'
variable:
id: akqcqqoedcsaoescyhga
value: 'true'
runtimeArgs:
- id: akqcqqoedcsaoescyhgq
type: input
dataType: string
label: '-v'
variable:
id: data_mount
value: /data:/data
{
"id": "Count Down",
"path": "my_docker_image:latest",
"runtime": "docker",
"arguments": [{
"id": "input",
"type": "input",
"dataType": "file",
"variable": {
"id": "input_file",
"value": "/data/input.txt"
}
}, {
"id": "output",
"type": "output",
"dataType": "directory",
"variable": {
"id": "output_file",
"value": "/data/output"
}
}, {
"id": "arg1",
"type": "input",
"dataType": "boolean",
"label": "--foobar",
"variable": {
"id": "akqcqqoedcsaoescyhga",
"value": "true"
}
}],
"runtimeArgs": [{
"id": "akqcqqoedcsaoescyhgq",
"type": "input",
"dataType": "string",
"label": "-v",
"variable": {
"id": "data_mount",
"value": "/data:/data"
}
}]
}
3.5.1 Arguments
An argument is part of an executable.
Property | Type | Description |
---|---|---|
id (required) | string | An argument identifier |
label (optional) | string | An optional label to use when the argument is passed to the service (e.g. --input ). |
variable (required) | object | A variable that holds the value of this argument. |
type (required) | string | The type of this argument. Valid values: input , output |
dataType (required) | string | The type of the argument value. If this property is directory , Steep will create a new directory for the service’s output and recursively search it for result files after the service has been executed. Otherwise, this property can be an arbitrary string. New data types with special handling can be added through output adapter plugins. |
Example:
id: akqcqqoedcsaoescyhgq
type: input
dataType: string
label: '-v'
variable:
id: data_mount
value: /data:/data
{
"id": "akqcqqoedcsaoescyhgq",
"type": "input",
"dataType": "string",
"label": "-v",
"variable": {
"id": "data_mount",
"value": "/data:/data"
}
}
3.5.2 Argument variables
An argument variable holds the value of an argument.
Property | Type | Description |
---|---|---|
id (required) | string | The variable’s unique identifier |
value (required) | string | The variable’s value |
Example:
id: data_mount
value: /data:/data
{
"id": "data_mount",
"value": "/data:/data"
}
3.6 Submissions
A submission is created when you submit a workflow through the /workflows endpoint. It contains information about the workflow execution such as the start and end time as well as the current status.
Property | Type | Description |
---|---|---|
id (required) | string | Unique submission identifier |
workflow (required) | object | The submitted workflow |
startTime (optional) | string | An ISO 8601 timestamp denoting the date and time when the workflow execution was started. May be null if the execution has not started yet. |
endTime (optional) | string | An ISO 8601 timestamp denoting the date and time when the workflow execution finished. May be null if the execution has not finished yet. |
status (required) | string | The current status of the submission |
requiredCapabilities | array | A set of strings specifying capabilities a host system must provide to be able to execute this workflow. See also setups. |
runningProcessChains (required) | number | The number of process chains currently being executed |
cancelledProcessChains (required) | number | The number of process chains that have been cancelled |
succeededProcessChains (required) | number | The number of process chains that have finished successfully |
failedProcessChains (required) | number | The number of process chains whose execution has failed |
totalProcessChains (required) | number | The current total number of process chains in this submission. May increase during execution when new process chains are generated. |
results (optional) | object | If status is SUCCESS or PARTIAL_SUCCESS , this property contains the list of workflow result files grouped by their output variable ID. Otherwise, it is null . |
errorMessage (optional) | string | If status is ERROR , this property contains a human-readable error message. Otherwise, it is null . |
Example:
id: aiq7eios7ubxglkcqx5a
workflow:
api: 4.2.0
vars:
- id: myInputFile
value: /data/input.txt
- id: myOutputFile
actions:
- type: execute
service: cp
inputs:
- id: input_file
var: myInputFile
outputs:
- id: output_file
var: myOutputFile
store: true
startTime: '2020-02-13T15:38:58.719382Z'
endTime: '2020-02-13T15:39:00.807715Z'
status: SUCCESS
runningProcessChains: 0
cancelledProcessChains: 0
succeededProcessChains: 1
failedProcessChains: 0
totalProcessChains: 1
results:
myOutputFile:
- /data/out/aiq7eios7ubxglkcqx5a/aiq7hygs7ubxglkcrf5a
{
"id": "aiq7eios7ubxglkcqx5a",
"workflow": {
"api": "4.2.0",
"vars": [{
"id": "myInputFile",
"value": "/data/input.txt"
}, {
"id": "myOutputFile"
}],
"actions": [{
"type": "execute",
"service": "cp",
"inputs": [{
"id": "input_file",
"var": "myInputFile"
}],
"outputs": [{
"id": "output_file",
"var": "myOutputFile",
"store": true
}]
}]
},
"startTime": "2020-02-13T15:38:58.719382Z",
"endTime": "2020-02-13T15:39:00.807715Z",
"status": "SUCCESS",
"runningProcessChains": 0,
"cancelledProcessChains": 0,
"succeededProcessChains": 1,
"failedProcessChains": 0,
"totalProcessChains": 1,
"results": {
"myOutputFile": [
"/data/out/aiq7eios7ubxglkcqx5a/aiq7hygs7ubxglkcrf5a"
]
}
}
3.6.1 Submission status
The following table shows the statuses a submission can have:
Status | Description |
---|---|
ACCEPTED | The submission has been accepted by Steep but execution has not started yet |
RUNNING | The submission is currently being executed |
CANCELLED | The submission was cancelled |
SUCCESS | The execution of the submission finished successfully |
PARTIAL_SUCCESS | The submission was executed completely but one or more process chains failed |
ERROR | The execution of the submission failed |
3.8 Retry policies
A retry policy specifies how often the execution of a workflow action should be retried in case of an error. Retry policies can be specified per service in the service metadata or per executable action in the workflow.
Property | Type | Description |
---|---|---|
maxAttempts (optional) | number | The maximum number of attempts to perform. This includes the initial attempt. For example, a value of 3 means 1 initial attempt and 2 retries. The default value is 1 . A value of -1 means an unlimited (infinite) number of attempts. 0 means there will be no attempt at all (the service or action will be skipped). |
delay (optional) | duration | The amount of time that should pass between two attempts. The default is 0 , which means the operation will be retried immediately. |
exponentialBackoff (optional) | number | A factor for an exponential backoff (see description below) |
maxDelay (optional) | duration | The maximum amount of time that should pass between two attempts. Only applies if exponentialBackoff is larger than 1 . By default, there is no upper limit. |
Exponential backoff:
The exponential backoff factor can be used to gradually increase the delay. The actual delay between two attempts will be calculated as follows:
actualDelay = min(delay * pow(exponentialBackoff, nAttempt - 1), maxDelay)
For example, if delay equals 1s, exponentialBackoff equals 2, and maxDelay equals 10s, the following actual delays will apply:
Delay after attempt 1:
min(1s * pow(2, 0), 10s) = 1s
Delay after attempt 2:
min(1s * pow(2, 1), 10s) = 2s
Delay after attempt 3:
min(1s * pow(2, 2), 10s) = 4s
Delay after attempt 4:
min(1s * pow(2, 3), 10s) = 8s
Delay after attempt 5:
min(1s * pow(2, 4), 10s) = 10s
Delay after attempt 6:
min(1s * pow(2, 4), 10s) = 10s
The default value is 1, which means there is no backoff and the actual delay always equals the specified one.
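The formula above can be expressed directly in code. The following is a minimal Python sketch (not part of Steep; the function name is hypothetical) that treats durations as plain seconds:

```python
def actual_delay(delay: float, exponential_backoff: float,
                 max_delay: float, n_attempt: int) -> float:
    """Actual delay after attempt `n_attempt`, per the formula above:
    min(delay * pow(exponentialBackoff, nAttempt - 1), maxDelay)"""
    return min(delay * exponential_backoff ** (n_attempt - 1), max_delay)

# delay = 1s, exponentialBackoff = 2, maxDelay = 10s:
print([actual_delay(1, 2, 10, n) for n in range(1, 7)])
# [1, 2, 4, 8, 10, 10]
```

With exponentialBackoff left at its default of 1, the same call always returns the configured delay, matching the behavior described above.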
Example:
maxAttempts: 5
delay: 1s
exponentialBackoff: 2
maxDelay: 10s
{
"maxAttempts": 5,
"delay": "1s",
"exponentialBackoff": 2,
"maxDelay": "10s"
}
3.9 Durations
A duration consists of one or more number/unit pairs, possibly separated by whitespace characters. Supported units are:
milliseconds, millisecond, millis, milli, ms
seconds, second, secs, sec, s
minutes, minute, mins, min, m
hours, hour, hrs, hr, h
days, day, d
Numbers must be positive integers. The default unit is milliseconds.
Examples:
1000ms
3 secs
5m
20mins
10h 30 minutes
1 hour 10minutes 5s
1d 5h
10 days 1hrs 30m 15 secs
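This grammar can be parsed with a short helper. The following is an illustrative Python sketch, not Steep's actual parser; the function name is hypothetical:

```python
import re

# Unit suffixes and their length in milliseconds, per the list above
_UNITS = {
    "milliseconds": 1, "millisecond": 1, "millis": 1, "milli": 1, "ms": 1,
    "seconds": 1000, "second": 1000, "secs": 1000, "sec": 1000, "s": 1000,
    "minutes": 60_000, "minute": 60_000, "mins": 60_000, "min": 60_000,
    "m": 60_000,
    "hours": 3_600_000, "hour": 3_600_000, "hrs": 3_600_000, "hr": 3_600_000,
    "h": 3_600_000,
    "days": 86_400_000, "day": 86_400_000, "d": 86_400_000,
}

# One number/unit pair; the unit may be missing and whitespace may
# appear before the number and between number and unit
_TOKEN = re.compile(r"\s*(\d+)\s*([a-zA-Z]*)")

def parse_duration_ms(text: str) -> int:
    """Parse a duration string such as '1d 5h' into milliseconds."""
    text = text.strip()
    total, pos = 0, 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        unit = match.group(2).lower() or "ms"  # default unit: milliseconds
        if unit not in _UNITS:
            raise ValueError(f"unknown unit: {unit!r}")
        total += int(match.group(1)) * _UNITS[unit]
        pos = match.end()
    return total

print(parse_duration_ms("10h 30 minutes"))  # 37800000
```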
3.10 Agents
An agent represents an instance of Steep that can execute process chains.
Property | Type | Description |
---|---|---|
id (required) | string | A unique agent identifier |
available (required) | boolean | true if the agent is currently idle and new process chains can be assigned to it, false if it is busy executing a process chain |
capabilities (required) | array | A set of strings specifying capabilities the agent provides. See also setups. |
startTime (required) | string | An ISO 8601 timestamp denoting the date and time when the agent has started. |
stateChangedTime (required) | string | An ISO 8601 timestamp denoting the date and time when the value of the available property changed from true to false (i.e. when the agent became busy) or from false to true (when it became available) |
Example:
id: akuxryerojbw7mnvovaa
available: true
capabilities:
- docker
startTime: '2020-05-26T10:50:02.001998Z'
stateChangedTime: '2020-05-26T11:06:52.367121Z'
{
"id": "akuxryerojbw7mnvovaa",
"available": true,
"capabilities": ["docker"],
"startTime": "2020-05-26T10:50:02.001998Z",
"stateChangedTime": "2020-05-26T11:06:52.367121Z"
}
3.11 VMs
This data model describes virtual machines created by Steep’s cloud manager.
Property | Type | Description |
---|---|---|
id (required) | string | A unique VM identifier |
externalId (optional) | string | An identifier generated by the Cloud platform |
ipAddress (optional) | string | The VM’s IP address |
setup (required) | object | The setup used to create this VM |
creationTime (optional) | string | An ISO 8601 timestamp denoting the date and time when the VM was created. This property is null if the VM has not been created yet. |
agentJoinTime (optional) | string | An ISO 8601 timestamp denoting the date and time when a Steep agent has been deployed to the VM and has joined the cluster. This property is null if the agent has not joined the cluster yet. |
destructionTime (optional) | string | An ISO 8601 timestamp denoting the date and time when the VM was destroyed. This property is null if the VM has not been destroyed yet. |
status (required) | string | The status of the VM |
reason (optional) | string | The reason why the VM has the current status (e.g. an error message if it has the ERROR status or a simple message indicating why it has been DESTROYED ) |
3.11.1 VM status
The following table shows the statuses a VM can have:
Status | Description |
---|---|
CREATING | The VM is currently being created |
PROVISIONING | The VM has been created and is currently being provisioned (i.e. provisioning scripts defined in the VM’s setup are being executed and the Steep agent is being deployed) |
RUNNING | The VM has been created and provisioned successfully. It is currently running and registered as a remote agent. |
LEFT | The remote agent on this VM has left. It will be destroyed eventually. |
DESTROYING | The VM is currently being destroyed |
DESTROYED | The VM has been destroyed |
ERROR | The VM could not be created, provisioned, or failed otherwise. See the VM’s reason property for more information. |
3.12 Setups
A setup describes how a virtual machine (VM) should be created by Steep’s cloud manager.
Property | Type | Description |
---|---|---|
id (required) | string | A unique setup identifier |
flavor (required) | string | The flavor of the new VM |
imageName (required) | string | The name of the VM image to deploy |
availabilityZone (required) | string | The availability zone in which to create the VM |
blockDeviceSizeGb (required) | number | The size of the VM’s main block device in gigabytes |
blockDeviceVolumeType (optional) | string | An optional type of the VM’s main block device. By default, the type will be selected automatically. |
minVMs (optional) | number | An optional minimum number of VMs to create with this setup. The default value is 0 . |
maxVMs (required) | number | The maximum number of VMs to create with this setup |
maxCreateConcurrent (optional) | number | The maximum number of VMs to create and provision concurrently. The default value is 1 . |
provisioningScripts (optional) | array | An optional list of paths to scripts that should be executed on the VM after it has been created |
providedCapabilities (optional) | array | An optional set of strings specifying capabilities that VMs with this setup will have |
sshUsername (optional) | string | An optional username for the SSH connection to the created VM. Overrides the main configuration item steep.cloud.ssh.username if it is defined. |
additionalVolumes (optional) | array | An optional list of volumes that will be attached to the VM |
Example:
id: default
flavor: 7d217779-4d7b-4689-8a40-c12a377b946d
imageName: Ubuntu 18.04
availabilityZone: nova
blockDeviceSizeGb: 50
minVMs: 0
maxVMs: 4
provisioningScripts:
- conf/setups/default/01_docker.sh
- conf/setups/default/02_steep.sh
providedCapabilities:
- docker
additionalVolumes:
- sizeGb: 50
type: SSD
{
"id": "default",
"flavor": "7d217779-4d7b-4689-8a40-c12a377b946d",
"imageName": "Ubuntu 18.04",
"availabilityZone": "nova",
"blockDeviceSizeGb": 50,
"minVMs": 0,
"maxVMs": 4,
"provisioningScripts": [
"conf/setups/default/01_docker.sh",
"conf/setups/default/02_steep.sh"
],
"providedCapabilities": ["docker"],
"additionalVolumes": [{
"sizeGb": 50,
"type": "SSD"
}]
}
3.12.1 Volumes
This data model describes an additional volume that can be attached to a virtual machine specified by a setup.
Property | Type | Description |
---|---|---|
sizeGb (required) | number | The volume’s size in gigabytes |
type (optional) | string | The volume’s type. By default, the type will be selected automatically. |
availabilityZone (optional) | string | The availability zone in which to create the volume. By default, it will be created in the same availability zone as the VM to which it will be attached. |
Example:
sizeGb: 50
type: SSD
availabilityZone: nova
{
"sizeGb": 50,
"type": "SSD",
"availabilityZone": "nova"
}
3.13 Pool agent parameters
Steep’s cloud manager component is able to create virtual machines and to deploy remote agent instances to them. The cloud manager keeps every remote agent it creates in a pool. Use pool agent parameters to define a minimum and maximum number of instances per provided capability set.
Property | Type | Description |
---|---|---|
capabilities (required) | array | A set of strings specifying capabilities that a remote agent must provide so these parameters apply to it |
min (optional) | number | An optional minimum number of remote agents that the cloud manager should create with the given capabilities |
max (optional) | number | An optional maximum number of remote agents that the cloud manager is allowed to create with the given capabilities |
Example:
capabilities:
- docker
- python
min: 1
max: 5
{
"capabilities": ["docker", "python"],
"min": 1,
"max": 5
}
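The table above says the parameters apply to an agent that provides all of the listed capabilities. One way to read this matching rule (an illustrative interpretation, not Steep's actual implementation; the helper name is hypothetical) is a subset test:

```python
def params_apply(params: dict, agent_capabilities: set) -> bool:
    """True if a pool agent parameter set applies to an agent, i.e. the
    agent provides every capability listed in the parameters."""
    return set(params["capabilities"]) <= agent_capabilities

params = {"capabilities": ["docker", "python"], "min": 1, "max": 5}
print(params_apply(params, {"docker", "python", "gpu"}))  # True
print(params_apply(params, {"docker"}))                   # False
```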
4 HTTP endpoints
The main way to communicate with Steep (i.e. to submit workflows, to monitor progress, fetch metadata, etc.) is through its HTTP interface. In this section, we describe all HTTP endpoints. By default, Steep listens to incoming connections on port 8080.
4.1 GET information
Get information about Steep. This includes:
- Steep’s version number
- A build ID
- A SHA of the Git commit for which the build was created
- A timestamp of the moment when the build was created
Resource URL |
---|
/
Parameters | |
---|---|
None |
Status codes | |
---|---|
200 | The operation was successful |
Example request |
---|
GET / HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 131
content-type: application/json
{
"build": "173",
"commit": "288a2a02c7d1ffd9775b4388176e07b60286880f",
"timestamp": 1605344824636,
"version": "5.7.0"
}
4.2 GET submissions
Get information about all submissions in the database. The response is a JSON
array consisting of submission objects
without the properties workflow
, results
, and errorMessage
. In order to
get the complete details of a submission, use
the GET submission by ID endpoint.
The submissions are returned in the order in which they were added to the database with the newest ones at the top.
Resource URL |
---|
/workflows
Parameters | |
---|---|
size (optional) | The maximum number of submissions to return. The default value is 10. |
offset (optional) | The offset of the first submission to return. The default value is 0. |
status (optional) | If this parameter is defined, Steep will only return submissions with the given status. Otherwise, it will return all submissions from the database. See the list of submission statuses for valid values. |
Response headers | |
---|---|
x-page-size | The size of the current page (i.e. the maximum number of submission objects returned). See size request parameter. |
x-page-offset | The offset of the first submission returned. See offset request parameter |
x-page-total | The total number of submissions in the database matching the given request parameters. |
Status codes | |
---|---|
200 | The operation was successful |
400 | One of the parameters was invalid. See response body for error message. |
Example request |
---|
GET /workflows HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 662
content-type: application/json
x-page-offset: 0
x-page-size: 10
x-page-total: 2
[
{
"id": "akpm6yojjigral4cdxgq",
"startTime": "2020-05-18T08:44:01.045710Z",
"endTime": "2020-05-18T08:44:21.218425Z",
"status": "SUCCESS",
"requiredCapabilities": [],
"runningProcessChains": 0,
"cancelledProcessChains": 0,
"succeededProcessChains": 10,
"failedProcessChains": 0,
"totalProcessChains": 10
},
{
"id": "akttc5kv575splk3ameq",
"startTime": "2020-05-24T17:20:37.343072Z",
"status": "RUNNING",
"requiredCapabilities": [],
"runningProcessChains": 1,
"cancelledProcessChains": 0,
"succeededProcessChains": 391,
"failedProcessChains": 0,
"totalProcessChains": 1000
}
]
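The x-page-size, x-page-offset, and x-page-total response headers enable client-side pagination. A minimal sketch (hypothetical helper, not part of Steep) computing the offset parameter of every request needed to fetch all matching items:

```python
def page_offsets(total: int, size: int) -> list:
    """Offsets of the requests needed to fetch `total` items
    in pages of at most `size` items each."""
    if size <= 0:
        raise ValueError("size must be positive")
    return list(range(0, total, size))

# With x-page-total: 25 and the default size of 10, three requests
# (offset=0, offset=10, offset=20) cover all submissions:
print(page_offsets(25, 10))  # [0, 10, 20]
```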
4.3 GET submission by ID
Get details about a single submission from the database.
Resource URL |
---|
/workflows/:id
Parameters | |
---|---|
id | The ID of the submission to return |
Status codes | |
---|---|
200 | The operation was successful |
404 | The submission was not found |
Example request |
---|
GET /workflows/akpm6yojjigral4cdxgq HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 348
content-type: application/json
{
"id": "akpm6yojjigral4cdxgq",
"startTime": "2020-05-18T08:44:01.045710Z",
"endTime": "2020-05-18T08:44:21.218425Z",
"status": "SUCCESS",
"requiredCapabilities": [],
"runningProcessChains": 0,
"cancelledProcessChains": 0,
"succeededProcessChains": 10,
"failedProcessChains": 0,
"totalProcessChains": 10,
"workflow": {
"api": "4.2.0",
"vars": [
…
],
"actions": [
…
]
}
}
4.4 PUT submission
Update a submission. The request body is a JSON object with the submission properties to update. At the moment, only the status property can be updated.
If the operation was successful, the response body contains the updated submission without the properties workflow, results, and errorMessage.
Note: You can use this endpoint to cancel the execution of a submission (see example below).
Resource URL |
---|
/workflows/:id
Parameters | |
---|---|
id | The ID of the submission to update |
Status codes | |
---|---|
200 | The operation was successful |
400 | The request body was invalid |
404 | The submission was not found |
Example request |
---|
PUT /workflows/akujvtkv575splk3saqa HTTP/1.1
Content-Length: 28
Content-Type: application/json
{
"status": "CANCELLED"
}
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 168
content-type: application/json
{
"id": "akujvtkv575splk3saqa",
"startTime": "2020-05-25T19:02:21.610396Z",
"endTime": "2020-05-25T19:02:33.414032Z",
"status": "CANCELLED",
"runningProcessChains": 0,
"cancelledProcessChains": 314,
"succeededProcessChains": 686,
"failedProcessChains": 0,
"totalProcessChains": 1000
}
4.5 POST workflow
Create a new submission. The request body contains the workflow to execute.
If the operation was successful, the response body contains the created submission.
Resource URL |
---|
/workflows
Status codes | |
---|---|
202 | The workflow has been accepted (i.e. stored in the database) and is scheduled for execution. |
400 | The posted workflow was invalid. See response body for more information. |
Example request |
---|
POST /workflows HTTP/1.1
Content-Length: 231
Content-Type: application/json
{
"api": "3.0.0",
"vars": [{
"id": "sleep_seconds",
"value": 3
}],
"actions": [{
"type": "execute",
"service": "sleep",
"inputs": [{
"id": "seconds",
"var": "sleep_seconds"
}]
}]
}
Example response |
---|
HTTP/1.1 202 Accepted
content-encoding: gzip
content-length: 374
content-type: application/json
{
"id": "akukkcsv575splk3v2ma",
"status": "ACCEPTED",
"workflow": {
"api": "3.0.0",
"vars": [{
"id": "sleep_seconds",
"value": 3
}],
"actions": [{
"type": "execute",
"service": "sleep",
"outputs": [],
"inputs": [{
"id": "seconds",
"var": "sleep_seconds"
}]
}]
}
}
4.6 GET process chains
Get information about all process chains in the database. The response is a JSON array of process chain objects without the properties executables and results. In order to get the complete details of a process chain, use the GET process chain by ID endpoint.
The process chains are returned in the order in which they were added to the database, with the newest ones at the top.
Resource URL |
---|
/processchains
Parameters | |
---|---|
size (optional) | The maximum number of process chains to return. The default value is 10. |
offset (optional) | The offset of the first process chain to return. The default value is 0. |
submissionId (optional) | If this parameter is defined, Steep will only return process chains from the submission with the given ID. Otherwise, it will return process chains from all submissions. If there is no submission with the given ID, the result will be an empty array. |
status (optional) | If this parameter is defined, Steep will only return process chains with the given status. Otherwise, it will return all process chains from the database. See the list of process chain statuses for valid values. |
Response headers | |
---|---|
x-page-size | The size of the current page (i.e. the maximum number of process chain objects returned). See size request parameter. |
x-page-offset | The offset of the first process chain returned. See offset request parameter |
x-page-total | The total number of process chains in the database matching the given request parameters. |
Status codes | |
---|---|
200 | The operation was successful |
400 | One of the parameters was invalid. See response body for error message. |
Example request |
---|
GET /processchains HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 262
content-type: application/json
x-page-offset: 0
x-page-size: 10
x-page-total: 7026
[
{
"id": "akukkcsv575splk3v2na",
"submissionId": "akukkcsv575splk3v2ma",
"startTime": "2020-05-25T19:46:02.532829Z",
"endTime": "2020-05-25T19:46:05.546807Z",
"status": "SUCCESS",
"requiredCapabilities": []
},
{
"id": "akujvtkv575splk3tppq",
"submissionId": "akujvtkv575splk3saqa",
"status": "CANCELLED",
"requiredCapabilities": []
},
…
]
4.7 GET process chain by ID
Get details about a single process chain from the database.
Resource URL |
---|
/processchains/:id
Parameters | |
---|---|
id | The ID of the process chain to return |
Status codes | |
---|---|
200 | The operation was successful |
404 | The process chain was not found |
Example request |
---|
GET /processchains/akukkcsv575splk3v2na HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 535
content-type: application/json
{
"id": "akukkcsv575splk3v2na",
"submissionId": "akukkcsv575splk3v2ma",
"startTime": "2020-05-25T19:46:02.532829Z",
"endTime": "2020-05-25T19:46:05.546807Z",
"status": "SUCCESS",
"requiredCapabilities": [],
"results": {},
"executables": [{
"id": "sleep",
"path": "sleep",
"runtime": "other",
"runtimeArgs": [],
"arguments": [{
"id": "seconds",
"type": "input",
"dataType": "integer",
"variable": {
"id": "sleep_seconds",
"value": "3"
}
}]
}]
}
4.8 GET process chain logs
Get contents of the log file of a process chain.
This endpoint will only return something if process chain logging is enabled in the log configuration. Otherwise, the endpoint will always return HTTP status code 404.
Also note that process chain logs are stored on the machine where the Steep agent that has executed the process chain is running. The log files will only be available as long as the machine exists and the agent is still available. If you want to persist log files, use the log configuration and define a location on a shared file system.
This endpoint supports HTTP range requests, which allows you to only fetch a portion of a process chain’s log file.
Resource URL |
---|
/logs/processchains/:id
Parameters | |
---|---|
id | The ID of the process chain whose log file to return |
forceDownload (optional) | true if the Content-Disposition header should be set in the response. This is useful if you link to a log file from a web page and want the browser to download the file instead of displaying its contents. The default value is false . |
Request headers | |
---|---|
Range (optional) | The part of the log file that should be returned (see HTTP Range header) |
Response headers | |
---|---|
Accept-Ranges | A marker to advertise support of range requests (see HTTP Accept-Ranges header) |
Content-Range (optional) | Indicates what part of the log file is delivered (see HTTP Content-Range header) |
Status codes | |
---|---|
200 | The operation was successful |
206 | Partial content will be delivered in response to a range request |
404 | Process chain log file could not be found. Possible reasons: (1) the process chain has not produced any output (yet), (2) the agent that has executed the process chain is not available anymore, (3) process chain logging is disabled in Steep’s configuration |
416 | Range not satisfiable |
Example request |
---|
GET /logs/processchains/akukkcsv575splk3v2na HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-type: text/plain
accept-ranges: bytes
content-encoding: gzip
transfer-encoding: chunked
2021-03-04 15:34:17 - Hello world!
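Because the endpoint supports HTTP range requests, a client can retrieve a large log file in pieces. A minimal sketch (hypothetical helper, not part of Steep) that builds the Range header values for chunked retrieval:

```python
def range_headers(total_size: int, chunk_size: int) -> list:
    """HTTP Range header values that fetch a file of `total_size`
    bytes in consecutive chunks of at most `chunk_size` bytes."""
    return [
        f"bytes={start}-{min(start + chunk_size, total_size) - 1}"
        for start in range(0, total_size, chunk_size)
    ]

# A 1291-byte log file fetched in 500-byte chunks:
print(range_headers(1291, 500))
# ['bytes=0-499', 'bytes=500-999', 'bytes=1000-1290']
```

The total size can be obtained beforehand via the HEAD process chain logs endpoint, and each chunk is requested by sending one of these values in the Range header; the server answers with status 206 and a matching Content-Range header.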
4.9 HEAD process chain logs
This endpoint can be used to check if a process chain log file exists and how large it is. For more information, please refer to the GET process chain logs endpoint.
Resource URL |
---|
/logs/processchains/:id
Parameters | |
---|---|
id | The ID of the process chain whose log file to check |
Response headers | |
---|---|
Content-Length | The total size of the log file |
Accept-Ranges | A marker to advertise support of range requests (see HTTP Accept-Ranges header) |
Status codes | |
---|---|
200 | The operation was successful |
404 | Process chain log file could not be found. Possible reasons: (1) the process chain has not produced any output (yet), (2) the agent that has executed the process chain is not available anymore, (3) process chain logging is disabled in Steep’s configuration |
Example request |
---|
HEAD /logs/processchains/akukkcsv575splk3v2na HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-type: text/plain
content-length: 1291
accept-ranges: bytes
4.10 PUT process chain
Update a process chain. The request body is a JSON object with the process chain properties to update. At the moment, only the status property can be updated.
If the operation was successful, the response body contains the updated process chain without the properties executables and results.
Note: You can use this endpoint to cancel the execution of a process chain (see example below).
Resource URL |
---|
/processchains/:id
Parameters | |
---|---|
id | The ID of the process chain to update |
Status codes | |
---|---|
200 | The operation was successful |
400 | The request body was invalid |
404 | The process chain was not found |
Example request |
---|
PUT /processchains/akuxzp4rojbw7mnvovcq HTTP/1.1
Content-Length: 28
Content-Type: application/json
{
"status": "CANCELLED"
}
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 222
content-type: application/json
{
"id": "akuxzp4rojbw7mnvovcq",
"submissionId": "akuxzp4rojbw7mnvovbq",
"startTime": "2020-05-26T11:06:24.055225Z",
"endTime": "2020-05-26T11:06:52.367194Z",
"status": "CANCELLED",
"requiredCapabilities": []
}
4.11 GET agents
Get information about all agents currently connected to the cluster. In order to get details about a single agent, use the GET agent by ID endpoint.
Resource URL |
---|
/agents
Parameters | |
---|---|
None |
Status codes | |
---|---|
200 | The operation was successful |
Example request |
---|
GET /agents HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 195
content-type: application/json
[
{
"id": "akuxryerojbw7mnvovaa",
"available": false,
"capabilities": [],
"startTime": "2020-05-26T10:50:02.001998Z",
"stateChangedTime": "2020-05-26T11:06:52.367121Z"
},
{
"id": "akvn7r3szw5wiztrnotq",
"available": true,
"capabilities": [],
"startTime": "2020-05-27T12:21:24.548640Z",
"stateChangedTime": "2020-05-27T12:21:24.548640Z"
}
]
4.12 GET agent by ID
Get details about a single agent.
Resource URL |
---|
/agents/:id
Parameters | |
---|---|
id | The ID of the agent to return |
Status codes | |
---|---|
200 | The operation was successful |
404 | The agent was not found |
Example request |
---|
GET /agents/akuxryerojbw7mnvovaa HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 177
content-type: application/json
{
"id": "akuxryerojbw7mnvovaa",
"available": false,
"capabilities": [],
"startTime": "2020-05-26T10:50:02.001998Z",
"stateChangedTime": "2020-05-26T11:06:52.367121Z"
}
4.13 GET VMs
Get information about all VMs in the database. To get details about a single VM, use the GET VM by ID endpoint.
The VMs are returned in the order in which they were added to the database with the newest ones at the top.
Resource URL |
---|
/vms
Parameters | |
---|---|
size (optional) | The maximum number of VMs to return. The default value is 10. |
offset (optional) | The offset of the first VM to return. The default value is 0. |
status (optional) | If this parameter is defined, Steep will only return VMs with the given status. Otherwise, it will return all VMs from the database. See the list of VM statuses for valid values. |
Response headers | |
---|---|
x-page-size | The size of the current page (i.e. the maximum number of VM objects returned). See size request parameter. |
x-page-offset | The offset of the first VM returned. See offset request parameter |
x-page-total | The total number of VMs in the database matching the given request parameters. |
Status codes | |
---|---|
200 | The operation was successful |
400 | One of the parameters was invalid. See response body for error message. |
Example request |
---|
GET /vms HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 2402
content-type: application/json
x-page-offset: 0
x-page-size: 10
x-page-total: 614
[
{
"id": "akvn5rmvrozqzj5k3n7a",
"externalId": "cc6bb115-5852-4646-87c0-d61a9e275722",
"ipAddress": "10.90.5.10",
"creationTime": "2020-05-27T12:17:01.861596Z",
"agentJoinTime": "2020-05-27T12:18:27.957192Z",
"status": "LEFT",
"setup": {
"id": "default",
"flavor": "7d217779-4d7b-4689-8a40-c12a377b946d",
"imageName": "Ubuntu 18.04",
"availabilityZone": "nova",
"blockDeviceSizeGb": 50,
"minVMs": 0,
"maxVMs": 4,
"provisioningScripts": [
"conf/setups/default/01_docker.sh",
"conf/setups/default/02_steep.sh"
],
"providedCapabilities": ["docker"]
}
},
{
"id": "akvnmkuvrozqzj5k3mza",
"externalId": "f9ecfb9c-d0a1-45c9-87fc-3595bebc85c6",
"ipAddress": "10.90.5.24",
"creationTime": "2020-05-27T11:40:19.142991Z",
"agentJoinTime": "2020-05-27T11:41:42.349354Z",
"destructionTime": "2020-05-27T11:50:58.961455Z",
"status": "DESTROYED",
"reason": "Agent has left the cluster",
"setup": {
"id": "default",
"flavor": "7d217779-4d7b-4689-8a40-c12a377b946d",
"imageName": "Ubuntu 18.04",
"availabilityZone": "nova",
"blockDeviceSizeGb": 50,
"minVMs": 0,
"maxVMs": 4,
"provisioningScripts": [
"conf/setups/default/01_docker.sh",
"conf/setups/default/02_steep.sh"
],
"providedCapabilities": ["docker"]
}
},
…
]
4.14 GET VM by ID
Get details about a single VM from the database.
Resource URL |
---|
/vms/:id
Parameters | |
---|---|
id | The ID of the VM to return |
Status codes | |
---|---|
200 | The operation was successful |
404 | The VM was not found |
Example request |
---|
GET /vms/akvn5rmvrozqzj5k3n7a HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 617
content-type: application/json
{
"id": "akvn5rmvrozqzj5k3n7a",
"externalId": "cc6bb115-5852-4646-87c0-d61a9e275722",
"ipAddress": "10.90.5.10",
"creationTime": "2020-05-27T12:17:01.861596Z",
"agentJoinTime": "2020-05-27T12:18:27.957192Z",
"status": "LEFT",
"setup": {
"id": "default",
"flavor": "7d217779-4d7b-4689-8a40-c12a377b946d",
"imageName": "Ubuntu 18.04",
"availabilityZone": "nova",
"blockDeviceSizeGb": 50,
"minVMs": 0,
"maxVMs": 4,
"provisioningScripts": [
"conf/setups/default/01_docker.sh",
"conf/setups/default/02_steep.sh"
],
"providedCapabilities": ["docker"]
}
}
4.15 GET services
Get information about all configured service metadata. To get metadata of a single service, use the GET service by ID endpoint.
Resource URL |
---|
/services
Parameters | |
---|---|
None |
Status codes | |
---|---|
200 | The operation was successful |
Example request |
---|
GET /services HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 2167
content-type: application/json
[
{
"id": "cp",
"name": "cp",
"description": "Copies files",
"path": "cp",
"runtime": "other",
"parameters": [{
"id": "no_overwrite",
"name": "No overwrite",
"description": "Do not overwrite existing file",
"type": "input",
"cardinality": "1..1",
"data_type": "boolean",
"default": false,
"label": "-n"
}, {
"id": "input_file",
"name": "Input file name",
"description": "Input file name",
"type": "input",
"cardinality": "1..1",
"data_type": "file"
}, {
"id": "output_file",
"name": "Output file name",
"description": "Output file name",
"type": "output",
"cardinality": "1..1",
"data_type": "file"
}],
"runtime_args": [],
"required_capabilities": []
}, {
"id": "sleep",
"name": "sleep",
"description": "sleeps for the given amount of seconds",
"path": "sleep",
"runtime": "other",
"parameters": [{
"id": "seconds",
"name": "seconds to sleep",
"description": "The number of seconds to sleep",
"type": "input",
"cardinality": "1..1",
"data_type": "integer"
}],
"runtime_args": [],
"required_capabilities": []
},
…
]
4.16 GET service by ID
Get configured metadata of a single service.
Resource URL |
---|
/services/:id
Parameters | |
---|---|
id | The ID of the service to return |
Status codes | |
---|---|
200 | The operation was successful |
404 | The service metadata was not found |
Example request |
---|
GET /services/sleep HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 401
content-type: application/json
{
"id": "sleep",
"name": "sleep",
"description": "sleeps for the given amount of seconds",
"path": "sleep",
"runtime": "other",
"parameters": [{
"id": "seconds",
"name": "seconds to sleep",
"description": "The number of seconds to sleep",
"type": "input",
"cardinality": "1..1",
"data_type": "integer"
}],
"runtime_args": [],
"required_capabilities": []
}
4.17 GET Prometheus metrics
Steep can provide metrics to Prometheus. Besides statistics about the Java Virtual Machine that Steep is running in, the following metrics are included:
Metric | Description |
---|---|
steep_remote_agents | The number of registered remote agents |
steep_controller_process_chains | The number of process chains the controller is currently waiting for |
steep_scheduler_process_chains | The number of process chains with a given status (indicated by the label status) |
steep_local_agent_retries | The total number of times an executable with a given service ID (indicated by the label serviceId) had to be retried |
Resource URL |
---|
/metrics
Parameters | |
---|---|
None |
Status codes | |
---|---|
200 | The operation was successful |
Example request |
---|
GET /metrics HTTP/1.1
Example response |
---|
HTTP/1.1 200 OK
content-type: text/plain
content-encoding: gzip
content-length: 1674
# HELP jvm_memory_bytes_used Used bytes of a given JVM memory area.
# TYPE jvm_memory_bytes_used gauge
jvm_memory_bytes_used{area="heap",} 2.1695392E8
jvm_memory_bytes_used{area="nonheap",} 1.46509968E8
…
# HELP steep_remote_agents Number of registered remote agents
# TYPE steep_remote_agents gauge
steep_remote_agents 1.0
…
# HELP steep_controller_process_chains Number of process chains the controller is waiting for
# TYPE steep_controller_process_chains gauge
steep_controller_process_chains 0.0
…
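Since the exposition format is plain text, individual values are easy to extract outside of Prometheus as well. A minimal sketch (the sample payload below is hard-coded from the example response; in practice it would come from fetching the /metrics endpoint):

```shell
# Extract the value of a single metric from a scraped metrics payload.
# The payload here is a hard-coded sample copied from the example above.
metrics='steep_remote_agents 1.0
steep_controller_process_chains 0.0'

# Print the value of the steep_remote_agents gauge
value="$(printf '%s\n' "$metrics" | awk '$1 == "steep_remote_agents" { print $2 }')"
echo "$value"
```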
5 Web-based user interface
Steep has a web-based user interface that allows you to monitor running workflows, process chains, agents, and VMs, as well as to browse the database contents.
Start Steep and visit any of the above-mentioned HTTP endpoints with your web browser to open the user interface.
6 Configuration
After you have downloaded and extracted
Steep, you can find its configuration files under the conf
directory. The
following sections describe each of these files in detail.
6.1 steep.yaml
The file steep.yaml
contains the main configuration of Steep. In this section,
we describe all configuration keys and values you can set.
Note that keys are specified using the dot notation. You can use them as they are given here or use YAML notation instead. For example, the following configuration item
steep.cluster.eventBus.publicPort: 41187
is identical to:
steep:
cluster:
eventBus:
publicPort: 41187
You may override items in your configuration file with environment variables.
This is particularly useful if you are using Steep inside a Docker container.
The environment variables use a slightly different naming scheme. All variables
are in capital letters and dots are replaced by underscores. For example,
the configuration key steep.http.host
becomes STEEP_HTTP_HOST
and
steep.cluster.eventBus.publicPort
becomes STEEP_CLUSTER_EVENTBUS_PUBLICPORT
.
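The mapping from configuration keys to environment variable names is mechanical, so it can be sketched as a small shell function (the helper name to_env_var is ours, for illustration only):

```shell
# Sketch: derive the environment variable name for a Steep configuration
# key by replacing dots with underscores and converting to upper case.
# The function name to_env_var is ours and not part of Steep.
to_env_var() {
  printf '%s\n' "$1" | tr '.' '_' | tr '[:lower:]' '[:upper:]'
}

to_env_var "steep.http.host"                    # STEEP_HTTP_HOST
to_env_var "steep.cluster.eventBus.publicPort"  # STEEP_CLUSTER_EVENTBUS_PUBLICPORT
```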
You may use YAML syntax to specify environment variable values. For example,
the array steep.agent.capabilities
can be specified as follows:
STEEP_AGENT_CAPABILITIES=["docker", "python"]
6.1.1 General configuration
steep.tmpPath
The path to a directory where temporary files should be stored during processing. Steep generates names for the outputs of execute actions in a workflow. If the store flag of an output parameter is false (which is the default), the generated filename will be relative to this temporary directory.
steep.outPath
The path to a directory where output files should be stored. This path will be used instead of steep.tmpPath to generate a filename for an output parameter if its store flag is true.
steep.overrideConfigFile
The path to a file that keeps additional configuration. The values of the overrideConfigFile will be merged into the main configuration file, so they override the default values. Note that configuration items in this file can still be overridden with environment variables. This configuration item is useful if you don’t want to change the main configuration file (or if you cannot do so) but still want to set different configuration values. Use it if you run Steep in a Docker container and bind mount the overrideConfigFile as a volume.
steep.services
The path to the configuration files containing service metadata. Either a string pointing to a single file, a glob pattern (e.g. **/*.yaml), or an array of files or glob patterns.
steep.plugins
The path to the configuration file(s) containing plugin descriptors. Either a string pointing to a single file, a glob pattern (e.g. **/*.yaml), or an array of files or glob patterns.
6.1.2 Cluster settings
Use these configuration items to build up a cluster of Steep instances. Under
the hood, Steep uses Vert.x and Hazelcast,
so these configuration items are very similar to the ones found in these two
frameworks. To build up a cluster, you need to configure an event bus connection
and a cluster connection. They should use different ports. host
typically
refers to the machine your instance is running on and publicHost
or
publicAddress
specify the hostname or IP address that your Steep instance will
use in your network to advertise itself so that other instances can connect to
it.
For more information, please read the documentation of Vert.x and Hazelcast.
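As an illustration, the cluster connection of one instance in a two-node setup could be configured through environment variables like this (all hostnames, IP addresses, and ports are made up for the example):

```shell
# Hypothetical cluster settings for the first of two Steep instances.
# Note that the event bus and Hazelcast use different ports, and that
# array values may be given in YAML syntax.
export STEEP_CLUSTER_EVENTBUS_PORT=41187
export STEEP_CLUSTER_EVENTBUS_PUBLICHOST=10.0.0.5
export STEEP_CLUSTER_EVENTBUS_PUBLICPORT=41187
export STEEP_CLUSTER_HAZELCAST_PORT=5701
export STEEP_CLUSTER_HAZELCAST_TCPENABLED=true
export STEEP_CLUSTER_HAZELCAST_MEMBERS='["10.0.0.5", "10.0.0.6"]'
```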
steep.cluster.eventBus.host
The IP address (or hostname) to bind the clustered eventbus to
Default: Automatically detected local network interface
steep.cluster.eventBus.port
The port the clustered eventbus should listen on
Default: A random port
steep.cluster.eventBus.publicHost
The IP address (or hostname) the eventbus uses to announce itself within the cluster
Default: Same as
steep.cluster.eventBus.host
steep.cluster.eventBus.publicPort
The port that the eventbus uses to announce itself within the cluster
Default: Same as
steep.cluster.eventBus.port
steep.cluster.hazelcast.publicAddress
The IP address (or hostname) and port Hazelcast uses to announce itself within the cluster
steep.cluster.hazelcast.port
The port that Hazelcast should listen on
steep.cluster.hazelcast.interfaces
A list of IP address patterns specifying valid interfaces Hazelcast should bind to
steep.cluster.hazelcast.members
A list of IP addresses (or hostnames) of Hazelcast cluster members
steep.cluster.hazelcast.tcpEnabled
true
if Hazelcast should use TCP to connect to other instances, false
if it should use multicast
Default:
false
6.1.3 HTTP configuration
steep.http.enabled
true
if the HTTP interface should be enabled
Default:
true
steep.http.host
The host to bind the HTTP server to
Default:
localhost
steep.http.port
The port the HTTP server should listen on
Default: 8080
steep.http.postMaxSize
The maximum size of HTTP POST bodies in bytes
Default: 1048576 (1 MB)
steep.http.basePath
The path where the HTTP endpoints and the web-based user interface should be mounted
Default:
""
(empty string, i.e. no base path)
steep.http.cors.enable
true
if Cross-Origin Resource Sharing (CORS) should be enabled
Default:
false
steep.http.cors.allowOrigin
A regular expression specifying allowed CORS origins. Use
*
to allow all origins.
Default:
"$."
(match nothing by default)
steep.http.cors.allowCredentials
true
if the Access-Control-Allow-Credentials
response header should be returned.
Default:
false
steep.http.cors.allowHeaders
A string or an array indicating which header field names can be used in a request.
steep.http.cors.allowMethods
A string or an array indicating which HTTP methods can be used in a request.
steep.http.cors.exposeHeaders
A string or an array indicating which headers are safe to expose to the API of a CORS API specification.
steep.http.cors.maxAge
The number of seconds the results of a preflight request can be cached in a preflight result cache.
6.1.4 Controller configuration
steep.controller.enabled
true
if the controller should be enabled. Set this value to false
if your Steep instance does not have access to the shared database.
Default:
true
steep.controller.lookupIntervalMilliseconds
The interval at which the controller looks for accepted submissions
Default: 2000 (2 seconds)
steep.controller.lookupOrphansIntervalMilliseconds
The interval at which the controller looks for orphaned running submissions (i.e. submissions that are in the status
RUNNING
but that are currently not being processed by any Controller instance). If Steep finds such a submission, it will try to resume it.
Default: 300000 (5 minutes)
6.1.5 Scheduler configuration
steep.scheduler.enabled
true
if the scheduler should be enabled. Set this value to false
if your Steep instance does not have access to the shared database.
Default:
true
steep.scheduler.lookupIntervalMilliseconds
The interval at which the scheduler looks for registered process chains
Default: 20000 (20 seconds)
6.1.6 Agent configuration
steep.agent.enabled
true
if this Steep instance should be able to execute process chains (i.e. if one or more agents should be deployed)
Default:
true
steep.agent.instances
The number of agents that should be deployed within this Steep instance (i.e. how many executables the Steep instance can run in parallel)
Default: 1
steep.agent.id
Unique identifier for the first agent instance deployed. Subsequent agent instances will have a consecutive number appended to their IDs.
Default: (an automatically generated unique ID)
steep.agent.capabilities
List of capabilities that the agents provide
Default:
[]
(empty list)
steep.agent.autoShutdownTimeoutMinutes
The number of minutes any agent instance can remain idle until Steep shuts itself down gracefully. By default, this value is 0, which means Steep never shuts itself down.
Default: 0
steep.agent.busyTimeoutSeconds
The number of seconds that should pass before an idle agent decides that it is not busy anymore. Normally, the scheduler allocates an agent, sends it a process chain, and then deallocates it after the process chain execution has finished. This value is important if the scheduler crashes while the process chain is being executed and therefore never deallocates the agent. In this case, the agent deallocates itself after the configured time has passed.
Default: 60
steep.agent.outputLinesToCollect
The number of output lines to collect at most from each executed service (also applies to error output)
Default: 100
6.1.7 Runtime settings
steep.runtimes.docker.env
Additional environment variables that will be passed to containers created by the Docker runtime
Example:
["key=value", "foo=bar"]
Default:
[]
(empty list)
steep.runtimes.docker.volumes
Additional volume mounts to be passed to the Docker runtime
Example:
["/data:/data"]
Default:
[]
(empty list)
6.1.8 Database connection
steep.db.driver
The database driver
Valid values:
inmemory, postgresql, mongodb
Default:
inmemory
steep.db.url
The database URL
steep.db.username
The database username (only used by the
postgresql
driver)
steep.db.password
The database password (only used by the
postgresql
driver)
6.1.9 Cloud connection
steep.cloud.enabled
true
if Steep should connect to a cloud to acquire remote agents on demand
Default:
false
steep.cloud.driver
Defines which cloud driver to use
Valid values:
openstack
(see the OpenStack cloud driver for more information)
steep.cloud.createdByTag
A metadata tag that should be attached to virtual machines to indicate that they have been created by Steep
steep.cloud.setupsFile
The path to the file that describes all available setups. See setups.yaml.
steep.cloud.syncIntervalSeconds
The number of seconds that should pass before the cloud manager synchronizes its internal state with the cloud again
Default: 120 (2 minutes)
steep.cloud.keepAliveIntervalSeconds
The number of seconds that should pass before the cloud manager sends keep-alive messages to the minimum number of remote agents again (so that they do not shut themselves down). See the minVMs property of the setups data model.
Default: 30
steep.cloud.agentPool
An array of agent pool parameters describing how many remote agents the cloud manager should keep in its pool and how many it is allowed to create for each given set of capabilities.
Default:
[]
(empty list)
6.1.10 OpenStack cloud driver
steep.cloud.openstack.endpoint
OpenStack authentication endpoint
steep.cloud.openstack.username
OpenStack username used for authentication
steep.cloud.openstack.password
OpenStack password used for authentication
steep.cloud.openstack.domainName
OpenStack domain name used for authentication
steep.cloud.openstack.projectId
The ID of the OpenStack project to which to connect. Either this configuration item or
steep.cloud.openstack.projectName
must be set but not both at the same time.
steep.cloud.openstack.projectName
The name of the OpenStack project to which to connect. This configuration item will be used in combination with
steep.cloud.openstack.domainName
if steep.cloud.openstack.projectId
is not set.
steep.cloud.openstack.networkId
The ID of the OpenStack network to attach new VMs to
steep.cloud.openstack.usePublicIp
true
if new VMs should have a public IP address
Default:
false
steep.cloud.openstack.securityGroups
The OpenStack security groups that should be attached to new VMs.
Default:
[]
(empty list)
steep.cloud.openstack.keypairName
The name of the keypair to deploy to new VMs. The keypair must already exist in OpenStack.
6.1.11 SSH connection to VMs
steep.cloud.ssh.username
Username for SSH access to VMs. Can be overridden by the sshUsername property in each setup. May even be null if all setups define their own username.
steep.cloud.ssh.privateKeyLocation
Location of a private key to use for SSH
6.1.12 Log configuration
steep.logs.level
The default log level for all loggers (console as well as file-based)
Valid values:
TRACE, DEBUG, INFO, WARN, ERROR, OFF
Default:
DEBUG
steep.logs.main.enabled
true
if logging to the main log file should be enabled
Default: false
steep.logs.main.logFile
The name of the main log file
Default:
logs/steep.log
steep.logs.main.dailyRollover.enabled
true
if main log files should be renamed every day. The file name will be based on steep.logs.main.logFile and the file’s date in the form YYYY-MM-DD (e.g. steep.2020-11-19.log)
Default:
true
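The naming scheme can be sketched as follows (the path computation here is done by hand for illustration; Steep performs the rollover internally):

```shell
# Build the rolled-over file name from the configured main log file and
# a date in the form YYYY-MM-DD. Both values are illustrative.
logFile="logs/steep.log"
day="2020-11-19"
rolled="${logFile%.log}.${day}.log"
echo "$rolled"   # logs/steep.2020-11-19.log
```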
steep.logs.main.dailyRollover.maxDays
The maximum number of days’ worth of main log files to keep
Default: 7
steep.logs.main.dailyRollover.maxSize
The total maximum size of all main log files in bytes. The oldest log files will be deleted when this size is reached.
Default: 104857600 (100 MB)
steep.logs.processChains.enabled
true
if the output of process chains should be logged separately to disk. The output will still also appear on the console and in the main log file (if enabled), but there, it’s not separated by process chain. This feature is useful if you want to record the output of individual process chains and make it available through the process chain logs endpoint.
Default:
false
steep.logs.processChains.path
The path where process chain logs will be stored. Individual files will be named after the ID of the corresponding process chain (e.g. aprsqz6d5f4aiwsdzbsq.log).
Default:
logs/processchains
steep.logs.processChains.groupByPrefix
Set this configuration item to a value greater than 0 to group process chain log files by prefix in subdirectories under the directory configured through steep.logs.processChains.path. For example, if this configuration item is set to 3, Steep will create a separate subdirectory for all process chains whose ID starts with the same three characters. The name of this subdirectory will be these three characters. The process chains apomaokjbk3dmqovemwa and apomaokjbk3dmqovemsq will be put into a subdirectory called apo, and the process chain ao344a53oyoqwhdelmna will be put into ao3. Note that in practice, 3 is a reasonable value, which will create a new directory about every day. A value of 0 disables grouping.
Default: 0
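With grouping enabled, the resulting log file location can be sketched like this (the ID and base path are taken from the examples in this section; Steep computes the path internally):

```shell
# With groupByPrefix set to 3, a process chain log file ends up in a
# subdirectory named after the first three characters of its ID.
id="apomaokjbk3dmqovemwa"
prefix="$(printf '%s' "$id" | cut -c1-3)"
logPath="logs/processchains/${prefix}/${id}.log"
echo "$logPath"   # logs/processchains/apo/apomaokjbk3dmqovemwa.log
```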
6.2 setups.yaml
The configuration file setups.yaml contains an array of setup objects that Steep’s cloud manager component uses to create new virtual machines and to deploy remote agents to them.
6.3 services/services.yaml
The file services/services.yaml
contains an array of service metadata objects
describing the interfaces of all processing services Steep can execute.
6.4 plugins/common.yaml
The configuration file plugins/common.yaml
describes all plugins so Steep can
compile and use them during runtime. The file contains an array of descriptor
objects with the properties specified in the section
on extending Steep through plugins.
7 Extending Steep through plugins
Steep can be extended through plugins. In this section, we will describe the interfaces of all plugins and how they need to be configured so Steep can compile and execute them at runtime.
Each plugin is a Kotlin script with the file
extension .kts
. Inside this script, there should be a single function with
the same name as the plugin and a signature that depends on the plugin type.
Function interfaces are described in the sub-sections below.
All plugins must be referenced in the plugins/common.yaml file. This file is an array of descriptor objects with at least the following properties:
Property | Type | Description |
---|---|---|
name (required) | string | A unique name of the plugin (the function inside the plugin’s script file must have the same name) |
type (required) | string | The plugin type. Valid values are: initializer , outputAdapter , processChainAdapter , and runtime . |
scriptFile (required) | string | The path to the plugin’s Kotlin script file. The file should have the extension .kts . The path is relative to Steep’s application directory, so a valid example is conf/plugins/fileOrEmptyList.kts . |
Specific plugin types may require additional properties described in the sub-sections below.
7.1 Initializers
An initializer plugin is a function that will be called during the
initialization phase of Steep just before components such as the controller or
the scheduler are deployed. If required, the function can be a suspend
function.
Type |
---|
initializer
Additional properties |
---|
None |
Function interface |
---|
suspend fun myInitializer(vertx: io.vertx.core.Vertx)
Example descriptor |
---|
- name: myInitializer
type: initializer
scriptFile: conf/plugins/myInitializer.kts
Example plugin script |
---|
suspend fun myInitializer(vertx: io.vertx.core.Vertx) {
println("Hello from my initializer plugin")
}
7.2 Output adapters
An output adapter plugin is a function that can manipulate the output of
services depending on their produced data type (see the data_type
property
of the service parameter
data model, as well as the dataType
property of the process chain argument data model).
In other words, if an output parameter
of a processing service has a specific data_type
defined in the service’s metadata
and this data type matches the one given in the output adapter’s descriptor,
then the plugin’s function will be called after the service has been executed.
Steep will pass the output argument and the whole process chain (for reference)
to the plugin. The output argument’s value will be the current result (i.e. the
output file or directory). The plugin can modify this file or directory (if
necessary) and return a new list of files that will then be used by Steep for
further processing.
Steep has a built-in output adapter for the data type directory. Whenever you
specify this data type in the service metadata, Steep will pass the output
directory to an internal function that recursively collects all files from this
directory and returns them as a list.
The example output adapter fileOrEmptyList
described below is also included
in Steep. It checks if an output file exists (i.e. if the processing service
has actually created it) and either returns a list with a single element (the
file) or an empty list. This is useful if a processing service has an optional
output that you want to use as input of a subsequent for-each action
or of the current for-each action via yieldToInput
.
If required, the function can be a suspend
function.
Type |
---|
outputAdapter
Additional properties | |
---|---|
supportedDataType (required) | The data_type that this plugin can handle |
Function interface |
---|
suspend fun myOutputAdapter(output: model.processchain.Argument,
processChain: model.processchain.ProcessChain,
vertx: io.vertx.core.Vertx): List<Any>
Example descriptor |
---|
- name: fileOrEmptyList
type: outputAdapter
scriptFile: conf/plugins/fileOrEmptyList.kts
supportedDataType: fileOrEmptyList
Example plugin script |
---|
import io.vertx.core.Vertx
import io.vertx.kotlin.core.file.existsAwait
import model.processchain.Argument
import model.processchain.ProcessChain
suspend fun fileOrEmptyList(output: Argument, processChain: ProcessChain,
vertx: Vertx): List<String> {
return if (!vertx.fileSystem().existsAwait(output.variable.value)) {
emptyList()
} else {
listOf(output.variable.value)
}
}
7.3 Process chain adapters
A process chain adapter plugin is a function that can manipulate generated process chains before they are executed.
It takes a list of generated process chains and returns a new list of process chains to execute or the given list if no modification was made. If required, the function can be a suspend function.
Type |
---|
processChainAdapter
Additional properties |
---|
None |
Function interface |
---|
suspend fun myProcessChainAdapter(processChains: List<model.processchain.ProcessChain>,
vertx: io.vertx.core.Vertx): List<model.processchain.ProcessChain>
Example descriptor |
---|
- name: myProcessChainAdapter
type: processChainAdapter
scriptFile: conf/plugins/myProcessChainAdapter.kts
Example plugin script |
---|
import model.processchain.ProcessChain
import io.vertx.core.Vertx
suspend fun myProcessChainAdapter(processChains: List<ProcessChain>,
vertx: Vertx): List<ProcessChain> {
val result = mutableListOf<ProcessChain>()
for (pc in processChains) {
// never execute the 'sleep' service
val executables = pc.executables.filter { e -> e.id != "sleep" }
result.add(pc.copy(executables = executables))
}
return result
}
7.4 Custom runtime environments
A runtime plugin is a function that can run process chain executables
inside a certain runtime environment. See the runtime
property
of service metadata.
The plugin’s function takes an executable to run and an output collector. The output collector is a simple interface, to which the executable’s standard output should be forwarded. If required, the function can be a suspend function.
Use this plugin if you want to implement a special way to execute processing services. For example, you can implement a remote web service call, or you can use one of the existing runtimes and run a certain service in a special way (like in the example plugin below).
Type |
---|
runtime
Additional properties | |
---|---|
supportedRuntime (required) | The name of the runtime this plugin provides. Use this value in your service metadata. |
Function interface |
---|
suspend fun myRuntime(executable: model.processchain.Executable,
outputCollector: helper.OutputCollector, vertx: io.vertx.core.Vertx)
Example descriptor (Source) |
---|
- name: ignoreError
type: runtime
scriptFile: conf/plugins/ignoreError.kts
supportedRuntime: ignoreError
Example plugin script (Source) |
---|
import helper.OutputCollector
import helper.Shell.ExecutionException
import io.vertx.core.Vertx
import model.processchain.Executable
import runtime.OtherRuntime
fun ignoreError(executable: Executable, outputCollector: OutputCollector, vertx: Vertx) {
try {
OtherRuntime().execute(executable, outputCollector)
} catch (e: ExecutionException) {
if (e.exitCode != 1) {
throw e
}
}
}
7.5 Progress estimators
A progress estimator plugin is a function that analyses the log output of a
running processing service to estimate its current progress. For example, the
plugin can look for percentages or number of bytes processed. The returned
value contributes to the execution progress of a process chain (see the
estimatedProgress
property of the process chain
data model).
The function takes the executable that is currently being run and a list of
recently collected output lines. It returns an estimated progress between
0.0 (0%) and 1.0 (100%) or null
if the progress could not be determined. The
function will be called for each output line collected and the newest line is
always at the end of the given list. If required, the function can be a
suspend
function.
Type |
---|
progressEstimator
Additional properties | |
---|---|
supportedServiceId (required) | The ID of the service this estimator plugin supports |
Function interface |
---|
suspend fun myProgressEstimator(executable: model.processchain.Executable,
recentLines: List<String>, vertx: io.vertx.core.Vertx): Double?
Example descriptor |
---|
- name: extractArchiveProgressEstimator
type: progressEstimator
scriptFile: conf/plugins/extractArchiveProgressEstimator.kts
supportedServiceId: extract-archive
Example plugin script |
---|
import model.processchain.Executable
import io.vertx.core.Vertx
suspend fun extractArchiveProgressEstimator(executable: Executable,
recentLines: List<String>, vertx: Vertx): Double? {
val lastLine = recentLines.last()
val percentSign = lastLine.indexOf('%')
if (percentSign > 0) {
val percentStr = lastLine.substring(0, percentSign)
val percent = percentStr.trim().toIntOrNull()
if (percent != null) {
return percent / 100.0
}
}
return null
}
About
Steep’s development is led by the competence center for Spatial Information Management of the Fraunhofer Institute for Computer Graphics Research IGD in Darmstadt, Germany. Fraunhofer IGD is an internationally leading research institution for applied visual computing. The competence center for Spatial Information Management offers expertise and innovative technologies that enable successful communication and efficient cooperation with the help of geographic information.
Steep was initially developed within the research project “IQmulus” (A High-volume Fusion and Analysis Platform for Geospatial Point Clouds, Coverages and Volumetric Data Sets), funded under the 7th Framework Programme of the European Commission (call identifier FP7-ICT-2011-8, grant agreement no. 318787) from 2012 to 2016. It was previously called the ‘IQmulus JobManager’ or just the ‘JobManager’.
Presentations
Michel Krämer presented his paper “Capability-based scheduling of scientific workflows in the cloud” at the DATA conference 2020. He talked about Steep’s software architecture and its scheduling algorithm that assigns process chains to virtual machines.
Publications
Steep and its predecessor JobManager have appeared in a number of scientific publications.