Learning Resources
Welcome to our learning resources. This page collects resources that will help you get started with and use Apache Beam. If you’re just starting, you can treat this page as a guided tour; otherwise, you can jump straight to any section that interests you.
If you have additional material that you would like to see here, please let us know at [email protected]!
Getting Started
Quickstart
- Java Quickstart - How to set up and run a WordCount pipeline on the Java SDK.
- Python Quickstart - How to set up and run a WordCount pipeline on the Python SDK.
- Go Quickstart - How to set up and run a WordCount pipeline on the Go SDK.
- Java Development Environment - Setting up a Java development environment for Apache Beam using IntelliJ and Maven.
- Python Development Environment - Setting up a Python development environment for Apache Beam using PyCharm.
Learning the Basics
- WordCount - Walks you through the code of a simple WordCount pipeline, which demonstrates the most fundamental concepts of data processing. WordCount is the “Hello World” of data processing; a minimal sketch of such a pipeline follows this list.
- Mobile Gaming - Introduces how to reason about time while processing data, user-defined transforms, windowing, filtering data, streaming pipelines, triggers, and session analysis. This is a great place to continue once you get the hang of WordCount.
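To give a feel for what such a pipeline looks like, here is a minimal WordCount sketch using the Python SDK. The input path and output prefix are placeholder assumptions, not part of the walkthroughs above.

```python
# A minimal WordCount sketch with the Beam Python SDK.
# "input.txt" and the "counts" output prefix are placeholder paths.
import re

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("input.txt")
        # Split each line into words.
        | "Split" >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
        # Count occurrences of each distinct word.
        | "Count" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.MapTuple(lambda word, count: f"{word}: {count}")
        | "Write" >> beam.io.WriteToText("counts")
    )
```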
Fundamentals
- Programming Guide - The Programming Guide contains in-depth information on most topics in the Apache Beam SDK, describing how each part works alongside code snippets that show how to use it. It can be used as a reference guidebook.
- The world beyond batch: Streaming 101 - Covers some basic background information, terminology, time domains, batch processing, and streaming.
- The world beyond batch: Streaming 102 - A tour of the unified batch and streaming programming model in Beam, along with an example that explains many of the concepts (see the windowing sketch after this list).
- Apache Beam Execution Model - Explains how runners execute an Apache Beam pipeline, including why serialization is important and how a runner might distribute the work in parallel across multiple machines.
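To make the event-time and windowing vocabulary from these resources concrete, here is a minimal sketch that counts elements per key and per window. The elements, timestamps, and one-minute window size are illustrative assumptions.

```python
# Minimal event-time windowing sketch. The elements, timestamps, and
# one-minute window size are illustrative assumptions.
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([("user1", 10), ("user1", 50), ("user2", 70), ("user1", 130)])
        # Attach event-time timestamps (seconds since epoch) to each element.
        | beam.Map(lambda kv: TimestampedValue(kv[0], kv[1]))
        # Assign elements to fixed one-minute event-time windows.
        | beam.WindowInto(FixedWindows(60))
        # Counting now happens per key *and* per window.
        | beam.combiners.Count.PerElement()
        | beam.Map(print)
    )
```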
Common Patterns
- Common Use Case Patterns Part 1 - Common patterns such as writing data to multiple storage locations, slowly-changing lookup cache, calling external services, dealing with bad data, and starting jobs through a REST endpoint.
- Common Use Case Patterns Part 2 - Common patterns such as GroupBy using multiple data properties, joining two PCollections on a common key, streaming large lookup tables, merging two streams with different window lengths, and threshold detection with time-series data.
- Retry Policy - Adding a retry policy to a DoFn (a hedged sketch of the pattern follows this list).
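As a taste of the retry pattern, here is a minimal sketch of a DoFn that retries a flaky call with exponential backoff. The call_external_service function, retry count, and base delay are hypothetical placeholders, not the article’s code.

```python
# Minimal sketch of retrying inside a DoFn with exponential backoff.
# call_external_service, MAX_RETRIES, and BASE_DELAY_SECS are
# hypothetical placeholders.
import time

import apache_beam as beam


def call_external_service(element):
    # Stand-in for a real RPC that may raise a transient error.
    return element


class RetryingDoFn(beam.DoFn):
    MAX_RETRIES = 3
    BASE_DELAY_SECS = 1.0

    def process(self, element):
        for attempt in range(self.MAX_RETRIES):
            try:
                result = call_external_service(element)
                break
            except Exception:
                if attempt == self.MAX_RETRIES - 1:
                    raise  # give up; let the runner handle the failure
                # Back off exponentially: 1s, 2s, 4s, ...
                time.sleep(self.BASE_DELAY_SECS * 2 ** attempt)
        yield result
```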
Articles
Data Analysis
- Predicting news social engagement - Using multiple data sources, many common design patterns, and sentiment analysis to gain insights into news articles, built with TensorFlow and Dataflow.
- Processing IoT Data - IoT sensors continuously stream data to the cloud. Learn how to handle the sensor data, which is useful for real-time monitoring, alerts, long-term storage for analysis, performance improvement, and model training.
Data Migration
- Oracle Database to Google BigQuery - Migrate data from an Oracle Database into BigQuery using Dataprep.
- Google BigQuery to Google Datastore - Migrate data from a BigQuery table into Datastore without worrying about its schema.
- SAP HANA to Google BigQuery - Migrate data from an SAP HANA in-memory database into BigQuery.
Machine Learning
- Machine Learning using the RunInference API - Use the RunInference API with Apache Beam to run local and remote inference with machine learning (ML) models in both batch and streaming pipelines. Follow the RunInference API pipeline examples to do image classification, image segmentation, language modeling, and MNIST digit classification, and see examples of RunInference transforms. A minimal sketch follows this list.
- Machine Learning Preprocessing and Prediction - Predict the molecular energy from data stored in the Spatial Data File (SDF) format. Train a TensorFlow model with tf.Transform for preprocessing in Python. This also shows how to create batch and streaming prediction pipelines in Apache Beam.
- Machine Learning Preprocessing - Find the optimal parameter settings for simulated physical machines such as a bottle filler or a cookie machine. The goal of each simulated machine is to have the same input/output behavior as the actual machine, making it a “digital twin”. This example uses tf.Transform for preprocessing.
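For a flavor of what RunInference looks like in code, here is a minimal sketch using the scikit-learn model handler from the Python SDK. The model file path and input vectors are placeholder assumptions.

```python
# Minimal RunInference sketch using a scikit-learn model handler.
# "model.pkl" and the input vectors are placeholder assumptions.
import numpy

import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.sklearn_inference import SklearnModelHandlerNumpy

model_handler = SklearnModelHandlerNumpy(model_uri="model.pkl")

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([numpy.array([1.0, 2.0]), numpy.array([3.0, 4.0])])
        # Each output is a PredictionResult holding the example and inference.
        | RunInference(model_handler)
        | beam.Map(print)
    )
```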
Advanced Concepts
- Running on AppEngine - Use a Dataflow template to launch a pipeline from Google App Engine, and run the pipeline periodically via a cron job.
- Stateful Processing - Learn how to access persistent mutable state while processing input elements; this allows for side effects in a DoFn. It can be used for arbitrary-but-consistent index assignment, for example assigning a unique, incrementing index to each incoming element where order doesn’t matter (a minimal sketch follows this list).
- Timely and Stateful Processing - An example of how to make batched RPC calls. The call requests are stored in mutable state as they are received; once there are enough requests, or a certain amount of time has passed, the batch of requests is sent.
- Running External Libraries - Call an external library written in a language that does not have a native Apache Beam SDK, such as C++.
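To illustrate the index-assignment use of state mentioned above, here is a minimal stateful DoFn sketch in the Python SDK. Note that stateful processing requires keyed input; the keys and values here are illustrative assumptions.

```python
# Minimal stateful DoFn sketch: assign a unique, incrementing index to each
# element, per key. Keys and values are illustrative assumptions.
import apache_beam as beam
from apache_beam.transforms.userstate import CombiningValueStateSpec

class IndexAssigningDoFn(beam.DoFn):
    # A per-key running count that the runner persists between elements.
    INDEX_STATE = CombiningValueStateSpec("index", combine_fn=sum)

    def process(self, element, index=beam.DoFn.StateParam(INDEX_STATE)):
        key, value = element
        current_index = index.read()  # index assigned to this element
        index.add(1)
        yield key, value, current_index

with beam.Pipeline() as pipeline:
    (
        pipeline
        | beam.Create([("a", "x"), ("a", "y"), ("b", "z")])
        | beam.ParDo(IndexAssigningDoFn())
        | beam.Map(print)
    )
```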
Videos
- Getting Started with Apache Beam - A five-part video series covering concepts from basic to advanced.
- See more Videos and Podcasts
Courses
- Beam College - Free live and recorded lessons for learning Beam and data processing.
- Serverless Data Processing - A course specialized for the Dataflow runner.
Books
Building Big Data Pipelines with Apache Beam
Building Big Data Pipelines with Apache Beam by Jan Lukavský, Packt. (January 2022). A general description of the Apache Beam model, including gradually built examples that help create a solid understanding of the subject. The first part of the book explains concepts using the Java SDK, then the SQL DSL and the portability layer with a focus on the Python SDK. The last part is dedicated to more advanced topics, such as I/O connectors using Splittable DoFn and a description of how a typical runner executes a pipeline.
Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing
Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing by Tyler Akidau, Slava Chernyak, Reuven Lax. (August 2018). Expanded from Tyler Akidau’s popular blog posts “Streaming 101” and “Streaming 102”, this book takes you from an introductory level to a nuanced understanding of the what, where, when, and how of processing real-time data streams.
Certifications
Getting Started with Apache Beam Quest
Get Started with Apache Beam - This quest includes four labs that teach you how to write and test Apache Beam pipelines. Three of the labs use Java and one uses Python. Each lab takes about 1.5 hours to complete. When you complete the quest, you’re granted a badge that you can use to show your Beam expertise.
Interactive Labs
Java
- Big Data Text Processing Pipeline (40m) - Run a word count pipeline on the Dataflow runner.
- Real Time Machine Learning (45m) - Create a real-time flight delay prediction service using historical data on internal flights in the United States.
- Visualize Real-Time Geospatial Data (60m) - Process streaming data simulated in real time from a real-world historical data set, store the results in BigQuery, and visualize the geospatial data in Data Studio.
- Processing Time Windowed Data (90m) - Implement time-windowed aggregation to augment the raw data in order to produce consistent training and test datasets for a machine learning model.
Python
- Python Qwik Start (30m) - Run a word count pipeline on the Dataflow runner.
- Simulate historic flights (60m) - Simulate historical internal United States flights in real time and store the resulting simulated data in BigQuery.
Beam Katas
Beam Katas are interactive Beam coding exercises (code katas) that help you learn Apache Beam concepts and its programming model hands-on. Built on JetBrains Educational Products, Beam Katas aim to provide a series of structured hands-on learning experiences that help learners understand Apache Beam and its SDKs by solving exercises of gradually increasing complexity. Beam Katas are available for both the Java and Python SDKs.
Java
- Download IntelliJ Edu
- Upon opening the IDE, expand the “Learn and Teach” menu, then select “Browse Courses”
- Search for “Beam Katas - Java”
- Expand the “Advanced Settings” and modify the “Location” and “Jdk” appropriately
- Click “Join”
- Learn more about how to use the Education product
Python
- Download PyCharm Edu
- Upon opening the IDE, expand the “Learn and Teach” menu, then select “Browse Courses”
- Search for “Beam Katas - Python”
- Expand the “Advanced Settings” and modify the “Location” and “Interpreter” appropriately
- Click “Join”
- Learn more about how to use the Education product
Code Examples
Dataflow Cookbook
The cookbook includes examples in Java, Python, and Scala (via Scio), and provides ready-to-launch, self-contained Beam pipelines.
Java
- Snippets 1 - Commonly used data analysis patterns, such as using BigQuery, applying a CombinePerKey transform, removing duplicate lines from files, filtering, joining PCollections, and getting the maximum value of a PCollection.
- Snippets 2 - Additional examples of common tasks, such as configuring BigQuery and Pub/Sub, and writing one file per window.
- Complete Examples - End-to-end example pipelines such as autocomplete, streaming word extraction, calculating Term Frequency-Inverse Document Frequency (TF-IDF), finding top Wikipedia sessions, traffic max lane flow, and traffic routes.
- Pub/Sub to BigQuery - A complete example that demonstrates how to use Apache Beam on Dataflow to convert JSON-encoded Pub/Sub subscription messages into structured data and write that data to a BigQuery table.
Python
- Snippets - Commonly used data analysis patterns, such as using BigQuery, Datastore, coders, combiners, filters, and custom PTransforms.
- Complete Examples - End-to-end example pipelines such as autocomplete, computing mobile gaming statistics, calculating the Julia set, solving distributed optimization tasks, estimating pi, calculating Term Frequency-Inverse Document Frequency (TF-IDF), and finding top Wikipedia sessions.
Beam Playground
- Beam Playground is an interactive environment to try out Beam transforms and examples without having to install Apache Beam in your environment. You can try the available Apache Beam examples at Beam Playground.
- Learn more about how to add an Apache Beam example, test, or kata to the Beam Playground catalog here.
API Reference
- Java API Reference - Official API Reference for the Java SDK.
- Python API Reference - Official API Reference for the Python SDK.
- Go API Reference - Official API Reference for the Go SDK.
Feedback and Suggestions
We are open to feedback and suggestions; you can find different ways to reach out to the community on the Contact Us page.
If you have a bug report or want to suggest a new feature, you can let us know by submitting a new issue.
How to Contribute
We welcome contributions from everyone! To learn more on how to contribute, check our Contribution Guide.