When you use Dataflow to run your pipeline, the Dataflow runner uploads your pipeline code and dependencies to a Cloud Storage bucket and creates a Dataflow job. This Dataflow job runs your pipeline on managed resources in Google Cloud.
- For batch pipelines that use the Apache Beam Java SDK versions 2.54.0 or later, Runner v2 is enabled by default.
- For pipelines that use the Apache Beam Java SDK, Runner v2 is required when running multi-language pipelines, using custom containers, or using Spanner or Bigtable change stream pipelines. In other cases, use the default runner.
- For pipelines that use the Apache Beam Python SDK versions 2.21.0 or later, Runner v2 is enabled by default. For pipelines that use the Apache Beam Python SDK versions 2.45.0 and later, Dataflow Runner v2 is the only Dataflow runner available.
- For the Apache Beam SDK for Go, Dataflow Runner v2 is the only Dataflow runner available.
Runner v2 uses a services-based architecture that benefits some pipelines:
- Dataflow Runner v2 lets you pre-build your Python container, which can improve VM startup times and Horizontal Autoscaling performance. For more information, see Pre-build Python dependencies.
- Dataflow Runner v2 supports multi-language pipelines, a feature that enables your Apache Beam pipeline to use transforms defined in other Apache Beam SDKs. Dataflow Runner v2 supports using Java transforms from a Python SDK pipeline and using Python transforms from a Java SDK pipeline. When you run Apache Beam pipelines without Runner v2, the Dataflow runner uses language-specific workers.
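For example, the Kafka connector in the Apache Beam Python SDK is a cross-language transform backed by a Java implementation, so on Dataflow it requires Runner v2. The following is a minimal sketch rather than a complete job: the broker address, topic, and Google Cloud settings are placeholders.

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

# ReadFromKafka is implemented in Java and reached through Beam's
# cross-language expansion, which is why the job needs Runner v2.
# All values below are placeholders.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--streaming',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
])

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | 'ReadKafka' >> ReadFromKafka(
           consumer_config={'bootstrap.servers': 'broker:9092'},
           topics=['my-topic'])
     | 'Print' >> beam.Map(print))  # each element is a (key, value) tuple
```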
Limitations and restrictions
Dataflow Runner v2 has the following requirements:
- Dataflow Runner v2 requires Streaming Engine for streaming jobs.
- Because Dataflow Runner v2 requires Streaming Engine for streaming jobs, any Apache Beam transform that requires Dataflow Runner v2 also requires Streaming Engine for streaming jobs. For example, the Pub/Sub Lite I/O connector for the Apache Beam SDK for Python is a cross-language transform that requires Dataflow Runner v2. If you try to disable Streaming Engine for a job or template that uses this transform, the job fails. A sketch of the matching flags follows this list.
- For streaming pipelines that use the Apache Beam Java SDK, the `MapState` and `SetState` classes are not supported with Runner v2. To use the `MapState` and `SetState` classes with Java pipelines, enable Streaming Engine, disable Runner v2, and use the Apache Beam SDK version 2.58.0 or later.
- For batch and streaming pipelines that use the Apache Beam Java SDK, the `OrderedListState` and `AfterSynchronizedProcessingTime` classes are not supported.
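As referenced in the list above, here is a minimal sketch of pipeline options that pair Runner v2 with Streaming Engine for a streaming job; the project, region, and bucket values are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Runner v2 streaming jobs must also run on Streaming Engine, so the
# two flags are passed together. All values below are placeholders.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--streaming',
    '--enable_streaming_engine',
    '--experiments=use_runner_v2',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
])
```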
Enable Dataflow Runner v2
To enable Dataflow Runner v2, follow the configuration instructions for your Apache Beam SDK.
Java
Dataflow Runner v2 requires the Apache Beam Java SDK version 2.30.0 or later; version 2.44.0 or later is recommended.
For batch pipelines that use the Apache Beam Java SDK versions 2.54.0 or later, Runner v2 is enabled by default.
To enable Runner v2, run your job with the `--experiments=use_runner_v2` flag. To disable Runner v2, including for pipelines that are automatically opted in to it, use the `--experiments=disable_runner_v2` pipeline option.
Python
For pipelines that use the Apache Beam Python SDK versions 2.21.0 or later, Runner v2 is enabled by default.
Dataflow Runner v2 isn't supported with the Apache Beam Python SDK versions 2.20.0 and earlier.
In some cases, your pipeline might not use Runner v2 even though it runs on a supported SDK version. In such cases, to run the job with Runner v2, use the `--experiments=use_runner_v2` flag. If you want to disable Runner v2 and your job is identified as using the `auto_runner_v2` experiment, use the `--experiments=disable_runner_v2` flag. Disabling Runner v2 is not supported with Apache Beam Python SDK versions 2.45.0 and later.
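As a minimal sketch, the experiment flag can be passed through `PipelineOptions`; the project, region, and bucket values are placeholders.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Explicitly opt the job in to Runner v2. On SDKs earlier than 2.45.0,
# swap in disable_runner_v2 to opt out. All values are placeholders.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
    '--experiments=use_runner_v2',
])

with beam.Pipeline(options=options) as pipeline:
    pipeline | beam.Create(['hello', 'runner v2']) | beam.Map(print)
```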
Go
Dataflow Runner v2 is the only Dataflow runner available for the Apache Beam SDK for Go. Runner v2 is enabled by default.
Monitor your job
Use the monitoring interface to view Dataflow job metrics, such as memory utilization, CPU utilization, and more.
Worker VM logs are available through the Logs Explorer and the Dataflow monitoring interface. Worker VM logs include logs from the runner harness process and logs from the SDK processes. You can use the VM logs to troubleshoot your job.
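If you prefer to pull a job's worker VM logs programmatically, a minimal sketch using the Cloud Logging client library follows; the project ID and job ID are placeholders.

```python
from google.cloud import logging  # google-cloud-logging client library

# Fetch worker VM log entries (runner harness and SDK process logs)
# for a single Dataflow job. Project ID and job ID are placeholders.
client = logging.Client(project='my-project')
log_filter = (
    'resource.type="dataflow_step" '
    'resource.labels.job_id="2024-01-01_00_00_00-1234567890" '
    'logName="projects/my-project/logs/dataflow.googleapis.com%2Fworker"'
)
for entry in client.list_entries(filter_=log_filter):
    print(entry.timestamp, entry.severity, entry.payload)
```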
Troubleshoot Runner v2
To troubleshoot jobs that use Dataflow Runner v2, follow standard pipeline troubleshooting steps. The following list provides additional information about how Dataflow Runner v2 works:
- Dataflow Runner v2 jobs run two types of processes on the worker VM: the SDK process and the runner harness process. Depending on the pipeline and VM type, there might be one or more SDK processes, but there is only one runner harness process per VM.
- SDK processes run user code and other language-specific functions. The runner harness process manages everything else.
- The runner harness process waits for all SDK processes to connect to it before starting to request work from Dataflow.
- Jobs might be delayed if the worker VM downloads and installs dependencies during SDK process startup. If issues occur during an SDK process, such as when starting up or installing libraries, the worker reports its status as unhealthy. If startup times increase, enable the Cloud Build API on your project and submit your pipeline with the `--prebuild_sdk_container_engine=cloud_build` parameter, as shown in the sketch after this list.
- Because Dataflow Runner v2 uses checkpointing, each worker might wait for up to five seconds while buffering changes before sending them for further processing. As a result, latency of approximately six seconds is expected.
- To diagnose problems in your user code, examine the worker logs from the SDK processes. If you find any errors in the runner harness logs, contact Support to file a bug.
- To debug common errors related to Dataflow multi-language pipelines, see the Multi-language Pipelines Tips guide.
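As referenced in the list above, here is a minimal sketch of pre-building the Python SDK container with Cloud Build; the registry URL, project, and bucket are placeholders, and the Cloud Build API must be enabled on the project first.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Pre-build the SDK container once at submission time so dependencies
# are not installed on every worker VM. All values are placeholders.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
    '--requirements_file=requirements.txt',
    '--prebuild_sdk_container_engine=cloud_build',
    '--docker_registry_push_url=us-central1-docker.pkg.dev/my-project/my-repo',
])
```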