Apache Beam Go SDK quickstart
This quickstart shows you how to run an example pipeline written with the Apache Beam Go SDK, using the Direct Runner. The Direct Runner executes pipelines locally on your machine.
If you’re interested in contributing to the Apache Beam Go codebase, see the Contribution Guide.
On this page:
Set up your development environment
Make sure you have a Go development environment ready. If not, follow the instructions in the Download and install page.
Clone the GitHub repository
Clone or download the
apache/beam-starter-go GitHub
repository and change into the beam-starter-go
directory.
Run the quickstart
Run the following command:
The output is similar to the following:
The lines might appear in a different order.
Explore the code
The main code file for this quickstart is main.go (GitHub). The code performs the following steps:
- Create a Beam pipeline.
- Create an initial
PCollection
. - Apply transforms.
- Run the pipeline, using the Direct Runner.
Create a pipeline
Before creating a pipeline, call the Init
function:
beam.Init()
Then create the pipeline:
pipeline, scope := beam.NewPipelineWithRoot()
The NewPipelineWithRoot
function returns a new
Pipeline
object, along with the pipeline’s root scope. A scope is a
hierarchical grouping for composite transforms.
Create an initial PCollection
The PCollection
abstraction represents a potentially distributed,
multi-element data set. A Beam pipeline needs a source of data to populate an
initial PCollection
. The source can be bounded (with a known, fixed size) or
unbounded (with unlimited size).
This example uses the Create
function to create a PCollection
from an in-memory array of strings. The resulting PCollection
contains the
strings “hello”, “world!”, and a user-provided input string.
elements := beam.Create(scope, "hello", "world!", input_text)
Apply transforms to the PCollection
Transforms can change, filter, group, analyze, or otherwise process the
elements in a PCollection
.
This example adds a ParDo transform to convert the input strings to title case:
elements = beam.ParDo(scope, strings.Title, elements)
The ParDo
function takes the parent scope, a transform function that
will be applied to the data, and the input PCollection. It returns the output
PCollection.
The previous example uses the built-in strings.Title
function for
the transform. You can also provide an application-defined function to a ParDo.
For example:
func logAndEmit(ctx context.Context, element string, emit func(string)) {
beamLog.Infoln(ctx, element)
emit(element)
}
This function logs the input element and returns the same element unmodified. Create a ParDo for this function as follows:
beam.ParDo(scope, logAndEmit, elements)
At runtime, the ParDo will call the logAndEmit
function on each element in
the input collection.
Run the pipeline
The code shown in the previous sections defines a pipeline, but does not process any data yet. To process data, you run the pipeline:
beamx.Run(ctx, pipeline)
A Beam runner runs a Beam pipeline on a specific platform. This example uses the Direct Runner, which is the default runner if you don’t specify one. The Direct Runner runs the pipeline locally on your machine. It is meant for testing and development, rather than being optimized for efficiency. For more information, see Using the Direct Runner.
For production workloads, you typically use a distributed runner that runs the pipeline on a big data processing system such as Apache Flink, Apache Spark, or Google Cloud Dataflow. These systems support massively parallel processing.
Next Steps
- Learn more about the Beam SDK for Go and look through the Go SDK API reference.
- Take a self-paced tour through our Learning Resources.
- Dive in to some of our favorite Videos and Podcasts.
- Join the Beam users@ mailing list.
Please don’t hesitate to reach out if you encounter any issues!
Last updated on 2024/11/25
Have you found everything you were looking for?
Was it all useful and clear? Is there anything that you would like to change? Let us know!