Parallelization
In the previous tutorial, you’ve learned how to create a workflow that downloads a single .zip
file and extracts it to Steep’s output directory. We will now extend this workflow to download and extract multiple files.
In certain situations, it can be beneficial to perform operations in parallel. For example, running multiple downloads at the same time can improve bandwidth usage. Since a .zip
file can only be extracted with a single thread, running multiple unzip
instances in parallel can improve CPU utilization. Also, through horizontal scaling, CPU-intensive tasks can be distributed to multiple machines.
This tutorial will show you how to write workflows where the individual actions are executed in parallel. Note that you don’t have to model parallelization yourself. Steep is able to automatically detect independent branches of the workflow graph (so-called process chains) and execute them in parallel. Read more about the approach to parallelization and process chain generation in How does Steep work?
Step 1: Configure Steep for parallelization
In order to enable parallelization, you either have to configure multiple agents or launch more than one Steep instance. The first approach allows you to run multiple process chains in parallel on the same machine (vertical scaling), while the second one enables horizontal scaling across several machines in a distributed environment. Both approaches can be combined.
Spawn multiple agents (vertical scaling)
By default, every Steep instance spawns only one agent. You can configure the number of agents in the general configuration file conf/steep.yaml
. For example, the following configuration will spawn 12 agents:
Restart Steep for the changes to take effect. Then, open the ‘Agents’ tab in Steep’s UI to see the list of agents: http://localhost:8080/agents
You may also access Steep’s HTTP API:
Launch multiple Steep instances (horizontal scaling)
Steep comes with a built-in clustering solution based on Eclipse Vert.x and Hazelcast.
If you run multiple instances of Steep on your local machine, they will automatically form a cluster. For this to work, you have to change the ports for the event bus and the HTTP server of each instance. The default configuration can be overridden through environment variables.
After downloading the Steep distribution, open your terminal and change into the directory where you’ve extracted it. Run the following command:
Run this command as many times as you want but make sure to change the values of STEEP_CLUSTER_EVENTBUS_PORT
, STEEP_CLUSTER_EVENTBUS_PUBLICPORT
, and STEEP_HTTP_PORT
for each instance. For example:
The default configuration will also allow you to run Steep on multiple machines in the same network. The instances will automatically detect each other (through multicasting) and form a cluster. If this does not work (because your firewall does not allow multicasting) or if you need a more advanced distributed deployment across different networks or with specific IP addresses, read the section on cluster settings.
Step 2: Execute parallel workflows
In the following, we present two ways to extend the workflow from the previous tutorial to download and extract multiple files at the same time.
This section assumes that you’ve read and followed the previous tutorial and that you’ve already configured the metadata for the download and extract services.
Option A: Independent actions
While processing the workflow graph, Steep looks for independent sequences of actions and converts them to process chains. The following workflow contains four actions (two download actions and two extract actions). There are two output variables o1
and o2
that connect the services with each other.
As described in the previous tutorial, Steep automatically detects dependencies between the actions. In this case, the first extract action depends on the output of the first download action (o1
), and the second extract action depends on the second download action (o2
).
Both pairs of actions are independent. Steep converts them to different process chains, which can then be executed in parallel.
Option B: Using a for-each action
While option A does exactly what we want, it is kind of verbose: The download and extract actions have to be repeated for every file. But what if we wanted to apply this to, say, a hundred or a thousand files?
For this purpose, Steep offers so-called for-each actions that iterate over a list of inputs and apply a set of actions to each of them. During workflow execution, Steep is able to automatically convert the individual iterations of a for-each action to independent process chains and run them in parallel.
for-each actions are not loops! Although, you may be used to general purpose programming languages that sometimes also have a for-each construct, which is executed with a single thread and, hence, sequentially, for-each actually just means that something should be applied to each element of a set. In our case, a set of actions is applied to each item in the for-each action’s input. This can be done in parallel and in any order.
The following workflow works exactly like the one from option A but uses a variable and a for-each action instead to reduce boilerplate code.
The graphical representation of the workflow is as follows:
We are using a variable (urls
) with a constant value in this workflow, but it is also possible to apply a for-each action to the output of a service. This is useful if a service produces a directory and you want to process each file in this directory. Specify the data type directory
for the output parameter in the service’s metadata and, after the service has finished, Steep will automatically traverse the directory recursively to collect all files in a list. We will use this approach in the advanced tutorial on aerial image segmentation.
Step 3: Submit parallel workflows
Submit the workflows from above to Steep and familiarize yourself with how it converts them to process chains and what directories and files it generates in the output directory.
Visit the workflow page in Steep’s web UI (http://localhost:8080/workflows) and click on the ID of your submitted workflow. It should display two process chains, one for each downloaded file.
List the contents of Steep’s output directory for this workflow:
You should find two sub-directories, one for each extract action output.