Parallelization

In the previous tutorial, you learned how to create a workflow that downloads a single .zip file and extracts it to Steep’s output directory. We will now extend this workflow to download and extract multiple files.

In certain situations, it can be beneficial to perform operations in parallel. For example, running multiple downloads at the same time can improve bandwidth usage. Since a .zip file can only be extracted with a single thread, running multiple unzip instances in parallel can improve CPU utilization. Also, through horizontal scaling, CPU-intensive tasks can be distributed to multiple machines.

This tutorial will show you how to write workflows where the individual actions are executed in parallel. Note that you don’t have to model parallelization yourself. Steep automatically detects independent branches of the workflow graph (so-called process chains) and executes them in parallel. Read more about the approach to parallelization and process chain generation in the section “How does Steep work?”.

Step 1: Configure Steep for parallelization

In order to enable parallelization, you either have to configure multiple agents or launch more than one Steep instance. The first approach allows you to run multiple process chains in parallel on the same machine (vertical scaling), while the second one enables horizontal scaling across several machines in a distributed environment. Both approaches can be combined.

Spawn multiple agents (vertical scaling)

By default, every Steep instance spawns only one agent. You can configure the number of agents in the general configuration file conf/steep.yaml. For example, the following configuration will spawn 12 agents:

conf/steep.yaml
yaml
steep:
  agents:
    instances: 12

Restart Steep for the changes to take effect. Then, open the ‘Agents’ tab in Steep’s UI to see the list of agents: http://localhost:8080/agents

You may also access Steep’s HTTP API:

Terminal
shell
curl http://localhost:8080/agents

Launch multiple Steep instances (horizontal scaling)

Steep comes with a built-in clustering solution based on Eclipse Vert.x and Hazelcast.

If you run multiple instances of Steep on your local machine, they will automatically form a cluster. For this to work, you have to change the ports for the event bus and the HTTP server of each instance. The default configuration can be overridden through environment variables.

After downloading the Steep distribution, open your terminal and change into the directory where you’ve extracted it. Run the following command:

Terminal
shell
STEEP_CLUSTER_EVENTBUS_PORT=41187 STEEP_CLUSTER_EVENTBUS_PUBLICPORT=41187 \
  STEEP_HTTP_PORT=8080 bin/steep

Run this command as many times as you want, but make sure to use different values for STEEP_CLUSTER_EVENTBUS_PORT, STEEP_CLUSTER_EVENTBUS_PUBLICPORT, and STEEP_HTTP_PORT for each instance. For example:

Terminal
shell
STEEP_CLUSTER_EVENTBUS_PORT=41188 STEEP_CLUSTER_EVENTBUS_PUBLICPORT=41188 \
  STEEP_HTTP_PORT=8081 bin/steep
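
Typing these commands by hand becomes tedious for more than a few instances. The following is a minimal sketch that prints the launch command for the n-th local instance. The helper name steep_launch_cmd is hypothetical and not part of Steep, and the port scheme (base port plus n) is an assumption that simply continues the two examples above.

```shell
#!/bin/sh
# Print the launch command for the n-th local Steep instance (n starts at 0).
# Assumption: base ports 41187 (event bus) and 8080 (HTTP) as in the examples
# above; every further instance uses base + n.
steep_launch_cmd() {
  n="$1"
  eventbus_port=$((41187 + n))
  http_port=$((8080 + n))
  printf 'STEEP_CLUSTER_EVENTBUS_PORT=%s STEEP_CLUSTER_EVENTBUS_PUBLICPORT=%s STEEP_HTTP_PORT=%s bin/steep\n' \
    "$eventbus_port" "$eventbus_port" "$http_port"
}

# Dry run: print (but do not execute) the commands for three instances.
for n in 0 1 2; do
  steep_launch_cmd "$n"
done
```

The script only prints the commands, so you can review them and start each one in its own terminal.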

The default configuration will also allow you to run Steep on multiple machines in the same network. The instances will automatically detect each other (through multicasting) and form a cluster. If this does not work (because your firewall does not allow multicasting) or if you need a more advanced distributed deployment across different networks or with specific IP addresses, read the section on cluster settings.

Step 2: Execute parallel workflows

In the following, we present two ways to extend the workflow from the previous tutorial to download and extract multiple files at the same time.

This section assumes that you’ve read and followed the previous tutorial and that you’ve already configured the metadata for the download and extract services.

Option A: Independent actions

While processing the workflow graph, Steep looks for independent sequences of actions and converts them to process chains. The following workflow contains four actions (two download actions and two extract actions). Two variables, o1 and o2, connect each download action to its extract action.

download-independent-actions.yaml
api: 4.7.0
actions:
  # Download source code of the Steep website
  - type: execute
    service: download
    inputs:
      - id: url
        value: https://github.com/steep-wms/steep-wms.github.io/archive/refs/heads/master.zip
    outputs:
      - id: output_file
        var: o1
 
  # Download source code of Steep
  - type: execute
    service: download
    inputs:
      - id: url
        value: https://github.com/steep-wms/steep/archive/refs/heads/master.zip
    outputs:
      - id: output_file
        var: o2
 
  # Extract source code of the Steep website
  - type: execute
    service: extract
    inputs:
      - id: input_file
        var: o1   # Output variable of the first download action
    outputs:
      - id: destination_directory
        var: website_output_directory
        store: true
 
  # Extract source code of Steep
  - type: execute
    service: extract
    inputs:
      - id: input_file
        var: o2   # Output variable of the second download action
    outputs:
      - id: destination_directory
        var: steep_output_directory
        store: true

As described in the previous tutorial, Steep automatically detects dependencies between the actions. In this case, the first extract action depends on the output of the first download action (o1), and the second extract action depends on the second download action (o2).

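[Figure: workflow graph with two independent branches. In the first branch, the URL https://.../steep-wms.github.io/... is the input of download, whose output o1 is the input of extract, which produces website_output_directory. In the second branch, the URL https://.../steep/... is the input of download, whose output o2 is the input of extract, which produces steep_output_directory.]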

Both pairs of actions are independent. Steep converts them to different process chains, which can then be executed in parallel.

Option B: Using a for-each action

While option A does exactly what we want, it is rather verbose: the download and extract actions have to be repeated for every file. But what if we wanted to apply this to, say, a hundred or a thousand files?

For this purpose, Steep offers so-called for-each actions that iterate over a list of inputs and apply a set of actions to each of them. During workflow execution, Steep is able to automatically convert the individual iterations of a for-each action to independent process chains and run them in parallel.

for-each actions are not loops! You may be used to general-purpose programming languages whose for-each construct is executed with a single thread and, hence, sequentially. Here, for-each just means that something should be applied to each element of a set. In our case, a set of actions is applied to each item in the for-each action’s input. This can be done in parallel and in any order.

The following workflow works exactly like the one from option A but uses a variable and a for-each action instead to reduce boilerplate code.

download-for-each.yaml
api: 4.7.0
vars:
  # Download the source code of the Steep website and Steep
  - id: urls
    value:
      - https://github.com/steep-wms/steep-wms.github.io/archive/refs/heads/master.zip
      - https://github.com/steep-wms/steep/archive/refs/heads/master.zip
 
actions:
  - type: for
    input: urls     # The list of files to download
    enumerator: i   # The variable name for the current item
    actions:
      # Download the current item
      - type: execute
        service: download
        inputs:
          - id: url
            var: i
        outputs:
          - id: output_file
            var: o
 
      # Extract the current item
      - type: execute
        service: extract
        inputs:
          - id: input_file
            var: o
        outputs:
          - id: destination_directory
            var: output_directory
            store: true

The graphical representation of the workflow is as follows:

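[Figure: workflow graph of the for-each action. The variable urls, which holds the two URLs https://.../steep-wms.github.io/... and https://.../steep/..., is the input of the for-each action. For each item i, download produces o, and extract takes o and produces output_directory.]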

We are using a variable (urls) with a constant value in this workflow, but it is also possible to apply a for-each action to the output of a service. This is useful if a service produces a directory and you want to process each file in this directory. Specify the data type directory for the output parameter in the service’s metadata and, after the service has finished, Steep will automatically traverse the directory recursively to collect all files in a list. We will use this approach in the advanced tutorial on aerial image segmentation.
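
If the list of URLs is long or comes from another source, the vars section of download-for-each.yaml could be generated instead of typed. The following is a minimal sketch under that assumption. The function name urls_to_vars is hypothetical and not part of Steep. It only prints YAML text in the same layout as the workflow above and does no quoting, so URLs that contain characters with a special meaning in YAML (such as " #" or ": ") would need extra care.

```shell
#!/bin/sh
# Print a `vars` section with a single variable `urls` whose value is a
# list containing one entry per argument.
# Assumption: the arguments are plain URLs that need no YAML quoting.
urls_to_vars() {
  printf 'vars:\n  - id: urls\n    value:\n'
  for url in "$@"; do
    printf '      - %s\n' "$url"
  done
}

# Example: the two URLs used in this tutorial.
urls_to_vars \
  https://github.com/steep-wms/steep-wms.github.io/archive/refs/heads/master.zip \
  https://github.com/steep-wms/steep/archive/refs/heads/master.zip
```

The printed section can be pasted in place of the vars block of the workflow. The actions block stays unchanged, regardless of how many URLs the list contains.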

Step 3: Submit parallel workflows

Submit the workflows from above to Steep and familiarize yourself with how it converts them to process chains and what directories and files it generates in the output directory.
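
The previous tutorial covers how to submit a workflow. As a reminder, here is a minimal sketch. It assumes that workflows are submitted by sending the workflow file in a POST request to the /workflows endpoint of Steep’s HTTP API and that Steep listens on the default port 8080. The helper name submit_workflow is hypothetical.

```shell
#!/bin/sh
# Submit a workflow file to a Steep instance.
# Assumptions: the HTTP API accepts the workflow as the body of a POST
# request to /workflows, and the instance runs at http://localhost:8080
# unless another base URL is given as the second argument.
submit_workflow() {
  file="$1"
  base_url="${2:-http://localhost:8080}"
  # Fail early instead of sending an empty request.
  if [ ! -f "$file" ]; then
    echo "workflow file not found: $file" >&2
    return 1
  fi
  curl -sS -X POST "$base_url/workflows" --data-binary "@$file"
}

# Usage (requires a running Steep instance):
#   submit_workflow download-independent-actions.yaml
#   submit_workflow download-for-each.yaml http://localhost:8081
```

Because all instances belong to the same cluster, it should not matter which instance receives the workflow. The second usage line only illustrates how to address the instance started on port 8081.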

Visit the workflow page in Steep’s web UI (http://localhost:8080/workflows) and click on the ID of your submitted workflow. It should display two process chains, one for each downloaded file.

List the contents of Steep’s output directory for this workflow:

Terminal
shell
ls /tmp/steep/out/[WORKFLOW ID]

You should find two sub-directories, one for each extract action output.
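
To check this without counting by eye, you could count the immediate sub-directories. This is a small sketch; the helper name count_subdirs is hypothetical, and it relies on the -mindepth and -maxdepth options that GNU and BSD find provide.

```shell
#!/bin/sh
# Print the number of immediate sub-directories of the given directory.
# Regular files and nested directories are not counted.
count_subdirs() {
  find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Usage (replace [WORKFLOW ID] with the ID of your submitted workflow):
#   count_subdirs "/tmp/steep/out/[WORKFLOW ID]"
```

For both workflows from this tutorial, the result should match the number of downloaded files.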