How does Steep work?
To answer this question, we will first describe how Steep transforms workflow graphs into executable units. After that, we will have a look at Steep’s software architecture and what kind of processing services it can execute. This guide is based on the following publication: Krämer, M. (2020). Capability-Based Scheduling of Scientific Workflows in the Cloud. Proceedings of the 9th International Conference on Data Science, Technology, and Applications DATA, 43–54. https://doi.org/10.5220/0009805400430054
In the literature, a workflow is typically represented by a directed graph that describes how an input data set is processed by certain tasks in a given order to produce a desired outcome. The following figure shows a simple example in the extended Petri Net notation proposed by van der Aalst and van Hee (2004).
The workflow starts with an input file that is read by a task A. This task produces two results. The first one is processed by task B whose result is in turn sent to C. The second result of A is processed by D. The outcomes of C and D are finally processed by task E. This is a very simple example. In practice, workflows can become very large with hundreds up to several thousands of tasks processing large numbers of input files.
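For illustration, such a graph can be captured in a plain adjacency list. The following sketch models the example workflow in Python; the structure and names are purely illustrative and not Steep’s internal data model:

```python
# Adjacency-list representation of the example workflow graph.
# Each task maps to the tasks that consume its results.
workflow = {
    "A": ["B", "D"],  # A produces two results: one for B, one for D
    "B": ["C"],       # B's result is sent to C
    "C": ["E"],
    "D": ["E"],       # the outcomes of C and D are both processed by E
    "E": [],          # E produces the final outcome
}
```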
In order to be able to schedule such a workflow in a distributed environment, the graph has to be transformed into individual executable units. Steep follows a hybrid scheduling approach that applies heuristics on the level of the workflow graph and later on the level of individual executable units. It assumes that tasks that access the same data should be executed on the same machine to reduce the communication overhead and to improve file reuse. Steep therefore groups tasks into so-called process chains, which are linear sequential lists (without branches and loops).
Transforming workflows into process chains is an iterative process. In each iteration, Steep finds the longest linear sequences of tasks and groups them into process chains. The following animation shows how this works for our example workflow:
Task A will be put into a process chain in iteration 1. Steep then schedules the execution of this process chain. After the execution has finished, Steep uses the results to produce a process chain containing B and C and another one containing D. These process chains are then scheduled to be executed in parallel. The results are finally used to generate the fourth process chain containing task E, which is also scheduled for execution.
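The core idea of one such iteration can be sketched in a few lines of Python. The function below is a simplified illustration (not Steep’s actual implementation): given the tasks whose inputs are currently available, it follows each one forward as long as the path stays strictly linear, i.e. each task has exactly one successor and that successor has exactly one predecessor.

```python
# Hedged sketch of one iteration of the process-chain generation
# described above. All names are illustrative; this is not Steep's API.
def find_process_chains(graph, ready):
    """graph: task -> list of successor tasks;
    ready: tasks whose input data is currently available."""
    # derive the predecessors of each task from the graph
    preds = {t: [] for t in graph}
    for t, succs in graph.items():
        for s in succs:
            preds[s].append(t)

    chains = []
    for start in ready:
        chain = [start]
        task = start
        # extend the chain while the sequence stays strictly linear
        while len(graph[task]) == 1 and len(preds[graph[task][0]]) == 1:
            task = graph[task][0]
            chain.append(task)
        chains.append(chain)
    return chains
```

Applied to the example graph, the first iteration yields a chain containing only A (it has two successors, so the chain cannot be extended), the second yields [B, C] and [D], and the third yields [E], matching the steps above.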
The following figure shows the main components of Steep: the HTTP server, the controller, the scheduler, the agent, and the cloud manager.
Together, these components form an instance of Steep. In practice, each instance typically runs on a separate virtual machine, but multiple instances can also be started on the same machine. Each component can be enabled or disabled in a given instance (see configuration options for more information). This means, in a cluster, there can be instances that have all five components enabled, and others that only have an agent, for example.
All components of all instances communicate with each other through messages sent over an event bus. Further, the HTTP server, the controller, and the scheduler are able to connect to a shared database.
The HTTP server provides information about scheduled, running, and finished workflows to clients. Clients can also upload a new workflow. In such a case, the HTTP server puts the workflow into the database and sends a message to one of the instances of the controller.
The controller receives this message, loads the workflow from the database, and starts transforming it iteratively to process chains as described above. Whenever it has generated new process chains, it puts them into the database and sends a message to all instances of the scheduler.
The schedulers then select agents to execute the process chains. They load the process chains from the database, send them via the event bus to the selected agents for execution, and finally write the results into the database. The schedulers also send a message back to the controller so it can continue with the next iteration and generate more process chains until the workflow has been completely transformed.
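The interplay between controller, scheduler, and agents can be summarized as a loop. The following sketch uses plain Python function calls as stand-ins for event-bus messages and database access; none of these names are part of Steep’s actual API:

```python
# Simplified control loop illustrating the interaction described above.
# Function calls stand in for event-bus messages and database access.
def run_workflow(generate_chains, select_agent):
    results = {}
    while True:
        # controller: transform the next part of the workflow
        chains = generate_chains(results)
        if not chains:
            break  # the workflow has been completely transformed
        for chain in chains:
            # scheduler: pick a suitable agent and execute the chain
            agent = select_agent(chain)
            results[tuple(chain)] = agent.execute(chain)
    return results
```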
In case a scheduler does not find an agent suitable for the execution of a process chain, it sends a message to the cloud manager (a component that interacts with the API of the cloud infrastructure) and asks it to create a new agent.
Steep is very flexible and allows a wide range of processing services (or microservices) to be integrated. A typical processing service is a program that reads one or more input files and writes one or more output files. The program may also accept generic parameters. The service can be implemented in any programming language (as long as the binary or script is executable on the machine on which the Steep agent is running) or can be wrapped in a Docker container.
For a seamless integration, a processing service should adhere to the following guidelines:
- Every processing service should be a microservice. It should run in its own process and serve one specific purpose.
- As Steep needs to call the service in a distributed environment, it should not have a graphical user interface or require any human interaction at runtime. Suitable services are command-line applications that accept arguments to specify input files, output files, and parameters.
- The service should read from input files, process the data, write results to output files, and then exit. It should not run continuously like a web service. If you need to integrate a web service in your workflow, we recommend using the curl command or something similar.
- Steep does not require the processing services to implement a specific interface. Instead, the service’s input and output parameters should be described in a special data model called service metadata.
- Following common conventions for exit codes, a processing service should return 0 (zero) upon successful execution and a non-zero number (e.g. 1, 2, 128, 255) if an error has occurred.
- In order to ensure deterministic workflow executions, services should be stateless and idempotent. This means that every execution of a service with the same input data and the same set of parameters should produce the same result.
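A minimal processing service following these guidelines might look as follows. This is a hypothetical example (not part of Steep): a stateless command-line tool that reads one input file, writes one output file, and signals success or failure through its exit code.

```python
# Hypothetical processing service following the guidelines above:
# a stateless, idempotent command-line tool that reads an input file,
# writes an output file, and exits with 0 on success or 1 on error.
import argparse
import sys

def main(argv=None):
    parser = argparse.ArgumentParser(description="Convert text to upper case")
    parser.add_argument("-i", "--input", required=True, help="input file")
    parser.add_argument("-o", "--output", required=True, help="output file")
    args = parser.parse_args(argv)
    try:
        with open(args.input) as f:
            data = f.read()
        with open(args.output, "w") as f:
            # deterministic: the same input always produces the same output
            f.write(data.upper())
    except OSError as e:
        print(f"error: {e}", file=sys.stderr)
        return 1  # a non-zero exit code signals failure
    return 0

# Entry point of the script:
# if __name__ == "__main__":
#     sys.exit(main())
```

Such a service has no graphical user interface, does not run continuously, and produces the same result for the same input and parameters, so Steep can safely execute it on any agent in the cluster.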